Sequences from Ancestral Single-Stranded DNA Viruses in Vertebrate Genomes: the Parvoviridae and Circoviridae Are More than 40 to 50 Million Years Old |
| |
Authors: | Vladimir A. Belyi, Arnold J. Levine, Anna Marie Skalka |
| |
Affiliation: | 1. Simons Center for Systems Biology, Institute for Advanced Study, Einstein Drive, Princeton, New Jersey 08540. 2. Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, Pennsylvania 19111. |
| |
Abstract: | Vertebrate genomic assemblies were analyzed for endogenous sequences related to any known viruses with single-stranded DNA genomes. Numerous high-confidence examples related to the Circoviridae and two genera in the family Parvoviridae, the parvoviruses and dependoviruses, were found and were broadly distributed among 31 of the 49 vertebrate species tested. Our analyses indicate that the ages of both virus families may exceed 40 to 50 million years. Shared features of the replication strategies of these viruses may explain the high incidence of the integrations.

It has long been appreciated that retroviruses can contribute significantly to the genetic makeup of host organisms. Genes related to certain other viruses with single-stranded RNA genomes, formerly considered to be most unlikely candidates for such contribution, have recently been detected throughout the vertebrate phylogenetic tree (1, 6, 13). Here, we report that viruses with single-stranded DNA (ssDNA) genomes have also contributed to the genetic makeup of many organisms, stretching back as far as the Paleocene period and possibly the late Cretaceous period of evolution.

Determining the evolutionary ages of viruses can be problematic, as their mutation rates may be high and their replication may be rapid but also sporadic. To establish a lower age limit for currently circulating ssDNA viruses, we analyzed 49 published vertebrate genomic assemblies for the presence of sequences derived from the NCBI RefSeq database of 2,382 proteins from known viruses in this category, representing a total of 23 classified genera from 7 virus families. Our survey uncovered numerous high-confidence examples of endogenous sequences related to the Circoviridae and to two genera in the family Parvoviridae: the parvoviruses and dependoviruses (Fig. ).

Phylogenetic tree of vertebrate organisms and history of ssDNA virus integrations.
Times of integration of ancestral dependoviruses (yellow icosahedrons), parvoviruses (blue icosahedrons), and circoviruses (triangles) are approximate.

The Dependovirus and Parvovirus genomes are typically 4 to 6 kb in length, include 2 major open reading frames (encoding replicase proteins [Rep and NS1, respectively] and capsid proteins [Cap and VP1, respectively]), and have characteristic hairpin structures at both ends (Fig. ). For replication, these viruses depend on host enzymes that are recruited by the viral replicase proteins to the hairpin regions, where self-primed viral DNA synthesis is initiated (2). Circovirus genomes are typically ∼2-kb circles. DNA of the type species, porcine circovirus 1 (PCV-1), contains a stem-loop structure within the origin of replication (Fig. ), and the largest open reading frame includes sequences that are homologous to the Parvovirus replicase open reading frame (9, 11). The circoviruses also depend on host enzymes for replication, and DNA synthesis is self-primed from a 3′-OH end formed by endonucleolytic cleavage of the stem-loop structure (4). The frequency of Dependovirus infection is estimated to be as high as 90% within an individual's lifetime. None of the dependoviruses have been associated with human disease, but related viruses in the family Parvoviridae (e.g., erythrovirus B19 and possibly human bocavirus) are pathogenic for humans, and members of both the Parvoviridae and the Circoviridae can cause a variety of animal diseases (2, 4).

Schematics illustrating the structure and organization of Parvoviridae and Circoviridae genomes and origins of several of the longest-integrated ancestral viral sequences found in vertebrates. Integrations were aligned to the Dependovirus adeno-associated virus 2 (AAV2), the Parvovirus minute virus of mice (MVM), and the Circovirus porcine circovirus 1 (PCV-1).
The inverted terminal repeat (ITR) sequences in the Dependovirus and Parvovirus genomes are depicted on an expanded scale. A linear representation of the circular genome of PCV-1 is shown with the 10-bp stem-loop structure on an expanded scale. Horizontal lines beneath the maps indicate the lengths of similar sequences that could be identified by BLAST. The numbers indicate the locations of amino acids in the viral proteins where the sequence similarities in the endogenous insertions start and end. The actual ancestral virus-derived integrated sequences may extend beyond the indicated regions.

For some of the ancestral endogenous sequences that we identified, phylogenetic comparisons can be used to estimate age. For example, as a Dependovirus-like sequence is present at the same location in the genomes of mice and rats, the ancestral virus must have existed before their divergence, more than 20 million years ago. Some Circovirus- and Dependovirus-related integrations also predate the split between dog and panda, about 42 million years ago. However, in most other cases, we rely on an indirect method for estimating age (1). As genomic sequences evolve, they accumulate new stop codons and insertion/deletion-induced frameshifts. The rates of these events can be tied directly to the rates of neutral sequence drift and, therefore, the time of evolution. To apply this method, we first performed a BLAST search of vertebrate genomes for all known ssDNA virus proteins (BLAST options, -p tblastn -M BLOSUM62 -e 1e-4). Candidate sequences were then recorded, along with 5 kb of flanking regions, and then again aligned against the database of ssDNA viruses to find the most complete alignment (BLAST options, -p blastx -F F -w 15 -t 1500 -Z 150 -G 13 -E 1 -e 1e-2).
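The two searches described above can be sketched as command lines for the legacy NCBI `blastall` program, whose option letters match those quoted in the text. This is only an illustration under stated assumptions: the query and database names are hypothetical placeholders, the second search is taken to be a `blastx` run selected with `-p`, and the `-i`/`-d` arguments (query file and database) are not given in the text.

```python
# Sketch only: file and database names are hypothetical placeholders.
# Only the option values (-M, -e, -F, -w, -t, -Z, -G, -E) come from the text.

def blastall_cmd(program, query, database, options):
    """Assemble an argument list for the legacy NCBI `blastall` binary."""
    return ["blastall", "-p", program, "-i", query, "-d", database, *options]

# Step 1: viral protein queries searched against translated genome assemblies.
initial_search = blastall_cmd(
    "tblastn", "ssdna_virus_proteins.faa", "vertebrate_genomes",
    ["-M", "BLOSUM62", "-e", "1e-4"],
)

# Step 2: each candidate hit plus 5 kb of flanking sequence, re-aligned
# against the viral protein database to find the most complete alignment.
refinement = blastall_cmd(
    "blastx", "candidates_with_flanks.fna", "ssdna_virus_proteins",
    ["-F", "F", "-w", "15", "-t", "1500", "-Z", "150",
     "-G", "13", "-E", "1", "-e", "1e-2"],
)

print(" ".join(initial_search))
print(" ".join(refinement))
```

In `blastall`, `-F F` turns off low-complexity filtering and `-w` sets a frameshift penalty for out-of-frame `blastx` alignments, which is presumably what allows a single alignment to span the frameshifts that the age estimate later counts.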
Detected alignments were then compared with a neutral model of genome evolution, as described in the supplemental material, and the numbers of stop codons and frameshifts were converted into the expected genomic drift undergone by the sequences. The age of integration was then estimated from the known phylogeny of vertebrates (7, 10). Using these methods, we discovered that as many as 110 ssDNA virus-related sequences have been integrated into the 49 vertebrate genomes considered, during a time period ranging from the present to over 40 to 60 million years ago (Table ).

Table footnotes:
a Some ambiguity in choosing the most similar virus is possible. We generally used the alignment with the lowest E value in the BLAST results. However, one or two points in the exponent of an E value were sometimes sacrificed to achieve a longer sequence alignment.
b aa, amino acids.
c These sequences have long insertions compared to the present-day viruses. In all cases tested, these insertions originated from short interspersed elements (SINEs). These insertions were excluded from the counts of stop codons and frameshifts and the estimation of integration age.
d Chr, chromosome.

It is important to recognize that there is an intrinsic limit on how far back in time we can reach to identify ancient endogenous viral sequences. First, the sequences must be identified with confidence by BLAST or similar programs. This requirement places a lower limit on sequence identity at about 20 to 30% of amino acids, or about 75% of nucleotides (nucleotides evolve nearly 2.5 times slower than the amino acid sequence they encode). Second, the related, present-day virus must have evolved at a rate that is not much higher than that of the endogenous sequences. The viruses for which ancestral endogenous sequences were identified in this study exhibit sequence drift similar to that associated with mammalian genomes.
Setting this rate at 0.14% per million years of evolution (8), we arrive at 90 million years as the theoretical limit for the oldest sequences that can be identified using our methods. This limit drops to less than 35 million years for endogenous viral sequences in rodents and even lower for sequences related to viruses that evolve faster than mammalian genomes.

The most widespread integrations found in our survey are derived from the dependoviruses. These include nearly complete genomes related to adeno-associated virus (AAV) in microbat, wallaby, dolphin, rabbit, mouse, and baboon (Fig. ). We did not detect inverted terminal repeats in several integrations tested, even though repeats are common in the present-day dependoviruses. This result could be explained by sequence decay or the absence of such structures in the ancestral viruses. However, we do see sequences that resemble degraded hairpin structures to which Dependovirus Rep proteins bind, with an example from microbat integration mlEDLG-1 shown in Fig. . The second most widespread endogenous sequences are related to the parvoviruses. They are found in 6 of 49 vertebrate species considered, with nearly complete genomes in rat, opossum, wallaby, and guinea pig (Fig. ).

Hairpin structure of the inverted terminal repeat of adeno-associated virus 2 (left) and a candidate degraded hairpin structure located close to the 5′ end of the mlEDLG-1 integration in microbats (right). Structures and mountain plots were generated using default parameters of the RNAfold program (5), with nucleotide coloring representing base-pairing probabilities: blue is below average, green is average, and red is above average. Mountain plots represent hairpin structures based on minimum free energy (mfe) calculations and partition function (pf) calculations, as well as the centroid structure (5).
Height is expressed in numbers of nucleotides; position represents nucleotide.

The Dependovirus AAV2 has a strong bias for integration into human chromosome 19 during infection, driven by a host sequence that is recognized by the viral Rep protein(s). Rep mediates the formation of a synapse between viral and cellular sequences, and the cellular sequences are nicked to serve as an origin of viral replication (14). The related integrations in mice and rats, located in the same chromosomal locations, might be explained by such a mechanism. However, the extent of endogenous sequence decay and the frequency of stop codons indicate that these integrations occurred some 30 to 35 million years ago, implying that they are derived from a single event in a rodent ancestor rather than two independent integration events at the same location. Similarly, integrations EDLG-1 in dog and panda lie in chromosomal regions that can be readily aligned (based on University of California, Santa Cruz [UCSC] genome assemblies) and show sequence decay consistent with the age of the common ancestor, about 42 million years. Endogenous sequences related to the family Parvoviridae can thus be traced to over 40 million years back in time, and viral proteins related to this family have remained over 40% conserved.

Sequences related to circoviruses were detected in five vertebrate species (Table ).

In summary, our results indicate that sequences derived from ancestral members of the families Parvoviridae and Circoviridae were integrated into their hosts' genomes over the past 50 million years of evolution. Features of their replication strategies suggest mechanisms by which such integrations may have occurred. It is possible that some of the endogenous viral sequences could offer a selective advantage to the virus or the host. We note that rep open reading frame-derived proteins from some members of these families kill tumor cells selectively (3, 12).
The genomic “fossils” we have discovered provide a unique glimpse into virus evolution but can give us only a lower estimate of the actual ages of these families. However, numerous recent integrations suggest that their germ line transfer has been continuing into present times. |
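
The detection limit quoted earlier (about 90 million years at 0.14% drift per million years and a floor of about 75% nucleotide identity) can be reproduced approximately with a simple linear calculation. The sketch below is one possible reading, not the authors' stated derivation: it assumes that divergence accumulates linearly and on two branches (the endogenous copy and the lineage leading to the present-day virus), with no correction for multiple substitutions. The function names are hypothetical.

```python
def detection_limit_myr(min_nt_identity, rate_per_myr, branches=2):
    """Oldest detectable integration, in million years (linear approximation).

    min_nt_identity: lowest nucleotide identity still detectable (e.g. 0.75)
    rate_per_myr:    neutral drift per million years as a fraction (e.g. 0.0014)
    branches:        lineages accumulating changes independently (assumed: 2)
    """
    divergence = 1.0 - min_nt_identity
    return divergence / (branches * rate_per_myr)


def implied_rate_per_myr(min_nt_identity, limit_myr, branches=2):
    """Drift rate that would yield a given detection limit under the same assumptions."""
    return (1.0 - min_nt_identity) / (branches * limit_myr)


# Mammalian rate from the text: 0.14% per million years.
mammal_limit = detection_limit_myr(0.75, 0.0014)
print(f"mammalian limit: about {mammal_limit:.0f} million years")

# Rate implied by the 35-million-year rodent limit (derived here, not a source value).
rodent_rate = implied_rate_per_myr(0.75, 35)
print(f"implied rodent rate: about {100 * rodent_rate:.2f}% per million years")
```

Under these assumptions the mammalian case gives roughly 89 million years, in line with the 90-million-year figure, and the rodent limit of less than 35 million years would correspond to a drift rate above roughly 0.36% per million years.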