     


Sequences from Ancestral Single-Stranded DNA Viruses in Vertebrate Genomes: the Parvoviridae and Circoviridae Are More than 40 to 50 Million Years Old
Authors: Vladimir A. Belyi, Arnold J. Levine, Anna Marie Skalka
Affiliation: Simons Center for Systems Biology, Institute for Advanced Study, Einstein Drive, Princeton, New Jersey 08540; Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, Pennsylvania 19111
Abstract:
Vertebrate genomic assemblies were analyzed for endogenous sequences related to any known viruses with single-stranded DNA genomes. Numerous high-confidence examples related to the Circoviridae and two genera in the family Parvoviridae, the parvoviruses and dependoviruses, were found and were broadly distributed among 31 of the 49 vertebrate species tested. Our analyses indicate that the ages of both virus families may exceed 40 to 50 million years. Shared features of the replication strategies of these viruses may explain the high incidence of the integrations.

It has long been appreciated that retroviruses can contribute significantly to the genetic makeup of host organisms. Genes related to certain other viruses with single-stranded RNA genomes, formerly considered to be most unlikely candidates for such contribution, have recently been detected throughout the vertebrate phylogenetic tree (1, 6, 13). Here, we report that viruses with single-stranded DNA (ssDNA) genomes have also contributed to the genetic makeup of many organisms, stretching back as far as the Paleocene period and possibly the late Cretaceous period of evolution.

Determining the evolutionary ages of viruses can be problematic, as their mutation rates may be high and their replication may be rapid but also sporadic. To establish a lower age limit for currently circulating ssDNA viruses, we analyzed 49 published vertebrate genomic assemblies for the presence of sequences derived from the NCBI RefSeq database of 2,382 proteins from known viruses in this category, representing a total of 23 classified genera from 7 virus families. Our survey uncovered numerous high-confidence examples of endogenous sequences related to the Circoviridae and to two genera in the family Parvoviridae: the parvoviruses and dependoviruses (Fig. 1).

FIG. 1. Phylogenetic tree of vertebrate organisms and history of ssDNA virus integrations. Times of integration of ancestral dependoviruses (yellow icosahedrons), parvoviruses (blue icosahedrons), and circoviruses (triangles) are approximate.

The Dependovirus and Parvovirus genomes are typically 4 to 6 kb in length, include 2 major open reading frames (encoding replicase proteins [Rep and NS1, respectively] and capsid proteins [Cap and VP1, respectively]), and have characteristic hairpin structures at both ends (Fig. 2). For replication, these viruses depend on host enzymes that are recruited by the viral replicase proteins to the hairpin regions, where self-primed viral DNA synthesis is initiated (2). Circovirus genomes are typically ~2-kb circles. DNA of the type species, porcine circovirus 1 (PCV-1), contains a stem-loop structure within the origin of replication (Fig. 2), and the largest open reading frame includes sequences that are homologous to the Parvovirus replicase open reading frame (9, 11). The circoviruses also depend on host enzymes for replication, and DNA synthesis is self-primed from a 3′-OH end formed by endonucleolytic cleavage of the stem-loop structure (4). The frequency of Dependovirus infection is estimated to be as high as 90% within an individual's lifetime. None of the dependoviruses have been associated with human disease, but related viruses in the family Parvoviridae (e.g., erythrovirus B19 and possibly human bocavirus) are pathogenic for humans, and members of both the Parvoviridae and the Circoviridae can cause a variety of animal diseases (2, 4).

FIG. 2. Schematics illustrating the structure and organization of Parvoviridae and Circoviridae genomes and origins of several of the longest-integrated ancestral viral sequences found in vertebrates. Integrations were aligned to the Dependovirus adeno-associated virus 2 (AAV2), the Parvovirus minute virus of mice (MVM), and the Circovirus porcine circovirus 1 (PCV-1). The inverted terminal repeat (ITR) sequences in the Dependovirus and Parvovirus genomes are depicted on an expanded scale. A linear representation of the circular genome of PCV-1 is shown with the 10-bp stem-loop structure on an expanded scale. Horizontal lines beneath the maps indicate the lengths of similar sequences that could be identified by BLAST. The numbers indicate the locations of amino acids in the viral proteins where the sequence similarities in the endogenous insertions start and end. The actual ancestral virus-derived integrated sequences may extend beyond the indicated regions.

With some ancestral endogenous sequences that we identified, phylogenetic comparisons can be used to estimate age. For example, as a Dependovirus-like sequence is present at the same location in the genomes of mice and rats, the ancestral virus must have existed before their divergence, more than 20 million years ago. Some Circovirus- and Dependovirus-related integrations also predate the split between dog and panda, about 42 million years ago. However, in most other cases, we rely on an indirect method for estimating age (1). As genomic sequences evolve, they accumulate new stop codons and insertion/deletion-induced frameshifts. The rates of these events can be tied directly to the rates of neutral sequence drift and, therefore, the time of evolution. To apply this method, we first performed a BLAST search of vertebrate genomes for all known ssDNA virus proteins (BLAST options, -p tblastn -M BLOSUM62 -e 1e-4). Candidate sequences were then recorded, along with 5 kb of flanking regions, and then again aligned against the database of ssDNA viruses to find the most complete alignment (BLAST options, -p blastx -F F -w 15 -t 1500 -Z 150 -G 13 -E 1 -e 1e-2). Detected alignments were then compared with a neutral model of genome evolution, as described in the supplemental material, and the numbers of stop codons and frameshifts were converted into the expected genomic drift undergone by the sequences. The age of integration was then estimated from the known phylogeny of vertebrates (7, 10). Using these methods, we discovered that as many as 110 ssDNA virus-related sequences have been integrated into the 49 vertebrate genomes considered, during a time period ranging from the present to over 40 to 60 million years ago (see the table below).
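The two-pass search described above can be sketched as a pair of command lines. This is only an illustration, assuming the legacy NCBI `blastall` executable (where `-p` names the program, `-i` the query file, and `-d` the database); the helper `blastall_command` and all file and database names are hypothetical placeholders, and only the option values come from the text.

```python
def blastall_command(program, query, database, options):
    """Assemble an argument list for the legacy NCBI `blastall` executable."""
    cmd = ["blastall", "-p", program, "-i", query, "-d", database]
    for flag, value in options:
        cmd += [flag, str(value)]
    return cmd


# Pass 1: viral proteins searched against a vertebrate genome (tblastn).
SEARCH_OPTIONS = [("-M", "BLOSUM62"), ("-e", "1e-4")]

# Pass 2: each candidate region plus 5 kb of flanking sequence aligned
# back against the viral protein set (blastx).
REFINE_OPTIONS = [
    ("-F", "F"), ("-w", 15), ("-t", 1500), ("-Z", 150),
    ("-G", 13), ("-E", 1), ("-e", "1e-2"),
]

# File and database names below are hypothetical placeholders.
search_cmd = blastall_command(
    "tblastn", "ssdna_virus_proteins.faa", "vertebrate_genome", SEARCH_OPTIONS
)
refine_cmd = blastall_command(
    "blastx", "candidate_with_5kb_flanks.fna", "ssdna_virus_proteins", REFINE_OPTIONS
)

print(" ".join(search_cmd))
print(" ".join(refine_cmd))
```

The argument lists could be handed to `subprocess.run` on a machine where legacy BLAST is installed; the sketch only builds them, so it runs without BLAST.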
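TBLASTN columns report the initial genomic search; BLASTX columns report the best sequence homology identified.

| Virus group and vertebrate species | TBLASTN: chromosomal or scaffold location | TBLASTN: protein | TBLASTN: BLAST E value/% sequence identity | BLASTX: most similar virus (a) | BLASTX: protein | BLASTX: coordinates | BLASTX: no. of stop codons/frameshifts | Predicted nucleotide drift (%) | Integration label | Age (million yr) or timing of integration based on sequence aging |
Circoviruses
| Cat | Scaffold_62068 | Rep | 6E-05/37 | Canary circovirus | Rep | 4-283 | 3/7 in 268 aa (b) | 14.2 | fcECLG-1 | 82 |
| | Scaffold_24038 | Rep | 6E-06/51 | Columbid circovirus | Rep | 44-317 | 4/5 in 231 aa (c) | 15.2 | fcECLG-2 | 87 |
| Dog | Chr5 (d) | Rep | 7E-16/46 | Raven circovirus | Rep | 16-263 | 6/5 in 250 aa | 17.6 | cfECLG-1 | 98 |
| | Chr22 | Rep | 1E-14/43 | Beak and feather disease virus | Rep | 7-264 | 2/1 in 261 aa (c) | 4.5 | cfECLG-2 | 54 |
| Opossum | Chr3 | Rep | 4E-46/44 | Finch circovirus | Rep | 2-291 | 0/2 in 282 aa | 2.3 | mdECLG | 12 |
| | | | | | Cap | 6-36 | 0/0 in 30 aa | | | |
Dependoviruses
| Dog | ChrX | Rep | 6E-05/55 | AAV5 | Rep | 239-445 | 3/4 in 200 aa | 14.0 | cfEDLG-1 | 78 |
| Dolphin | GeneScaffold1475 | Rep | 8E-39/39 | Avian AAV DA1 | Rep | 79-486 | 3/4 in 379 aa (c) | 6.6 | ttEDLG-2 | 55 |
| | | Cap | 4E-61/47 | | Cap | 1-738 | 4/7 in 678 aa (c) | | | |
| Elephant | Scaffold_4 | Rep | 0/55 | AAV5 | Rep | 3-589 | 0/0 in 579 aa | 0.0 | laEDLG | Recent |
| Hyrax | GeneScaffold5020 | Cap | 3E-34/53 | AAV3 | Cap | 485-735 | 0/5 in 256 aa | 7.0 | pcEDLG-1 | 29 |
| | Scaffold_19252 | Rep | 9E-72/47 | Bovine AAV | Rep | 2-348 | 8/4 in 348 aa | 14.3 | pcEDLG-2 | 60 |
| Megabat | Scaffold_5601 | Rep | 2E-13/31 | AAV2 | Rep | 315-479 | 1/5 in 175 aa | 13.1 | pvEDLG-3 | 76 |
| Microbat | GeneScaffold2026 | Rep | 1E-117/50 | AAV2 | Rep | 1-617 | 2/5 in 612 aa | 5.8 | mlEDLG-1 | 27 |
| | | Cap | 9E-33/51 | | Cap | 1-731 | 2/9 in 509 aa (c) | | | |
| | Scaffold_146492 | Cap | 6E-32/42 | AAV2 | Cap | 479-732 | 0/3 in 252 aa | 4.2 | mlEDLG-2 | 19 |
| Mouse | Chr1 | Rep | 2E-06/34 | AAV2 | Rep | 4-206 | 3/5 in 191 aa | 17.1 | mmEDLG-1 | 39 |
| | Chr3 | Rep | 2E-24/31 | AAV5 | Rep | 71-478 | 12/7 in 389 aa | 16.5 | mmEDLG-2 | 37 |
| | | Cap | 2E-22/45 | | Cap | 22-724 | 12/10 in 649 aa (c) | | | |
| | Chr8 | Rep | 1E-08/46 | AAV2 | Rep | 314-473 | 3/3 in 147 aa | 13.8 | mmEDLG-3 | 31 |
| | | | | | Cap | 1-137 | 1/2 in 114 aa | | | |
| Panda | Scaffold2359 | Rep | 2E-06/37 | Bovine AAV | Rep | 238-426 | 2/3 in 186 aa | 10.4 | amEDLG-1 | 59 |
| Pika | Scaffold_9941 | Rep | 4E-14/28 | AAV5 | Rep | 126-415 | 2/2 in 282 aa | 5.4 | opEDLG | 14 |
| Platypus | Chr2 | Rep | 9E-10/35 | Bovine AAV | Rep | 297-437 | 4/3 in 138 aa | 17.1 | oaEDLG-1 | 79 |
| | | | | | Cap | 272-419 | 1/2 in 150 aa (c) | | | |
| | Contig12430 | Rep | 2E-09/47 | Bovine AAV | Rep | 353-450 | 3/1 in 123 aa | 12.0 | oaEDLG-2 | 55 |
| | | Cap | 2E-05/32 | | Cap | 253-367 | 2/1 in 116 aa | | | |
| Rabbit | Chr10 | Rep | 3E-97/39 | AAV2 | Rep | 1-619 | 3/9 in 613 aa | 9.3 | ocEDLG | 43 |
| | | Cap | 5E-50/45 | | Cap | 1-723 | 10/9 in 675 aa | | | |
| Rat | Chr13 | Rep | 2E-09/33 | AAV2 | Rep | 4-175 | 2/4 in 177 aa | 13.3 | rnEDLG-1 | 28 |
| | Chr2 | Rep | 4E-18/40 | AAV5 | Rep | 1-461 | 12/12 in 454 aa | 22.7 | rnEDLG-2 | 51 |
| | Chr19 | Rep | 2E-07/33 | AAV5 | Rep | 329-464 | 2/4 in 136 aa | 16.1 | rnEDLG-3 | 35 |
| | | | | | Cap | 31-133 | 2/1 in 93 aa | | | |
| Tarsier | Scaffold_178326 | Rep | 4E-14/23 | AAV5 | Rep | 96-465 | 2/3 in 356 aa | 5.3 | tsEDLG | 23 |
Parvoviruses
| Guinea pig | Scaffold_188 | Rep | 3E-24/46 | Porcine parvovirus | Rep | 313-567 | 5/3 in 250 aa | 12.3 | cpEPLG-1 | 40 |
| | | Cap | 1E-16/36 | | Cap | 10-689 | 11/12 in 672 aa | | | |
| | Scaffold_27 | Rep | 1E-50/39 | Canine parvovirus | Rep | 11-640 | 1/4 in 616 aa | 5.3 | cpEPLG-2 | 17 |
| | | Cap | 1E-38/39 | Porcine parvovirus | Cap | 3-719 | 2/14 in 700 aa | | | |
| Tenrec | Scaffold_260946 | Rep | 2E-20/38 | LuIII virus | Rep | 406-598 | 4/4 in 190 aa | 19.0 | etEPLG-2 | 60 |
| | | | | | Cap | 11-639 | 16/15 in 595 aa | | | |
| Rat | Chr5 | Rep | 6E-10/56 | Canine parvovirus | Rep | 1-282 | 0/0 in 312 aa | 0.6 | rnEPLG | Recent |
| | | Cap | 0/62 | | Cap | 637-667 | 0/2 in 760 aa | | | |
| | | Rep | 0/63 | | | 1-751 | | | | |
| Opossum | Chr3 | Rep | 2E-39/33 | LuIII virus | Rep | 7-570 | 11/3 in 502 aa | 10.9 | mdEPLG-2 | 56 |
| | | Cap | 7E-8/33 | | Cap | 11-729 | 14/7 in 704 aa | | | |
| | Chr6 | Rep | 6E-58/44 | Porcine parvovirus | Rep | 16-563 | 3/7 in 534 aa (c) | 4.6 | mdEPLG-3 | 24 |
| | | Cap | 6E-60/38 | | Cap | 10-715 | 2/5 in 707 aa (c) | | | |
| Wallaby | Scaffold_108040 | Rep | 4E-74/62 | Canine parvovirus | Rep | 341-645 | 0/0 in 287 aa | 1.3 | meEPLG-3 | 7 |
| | | Cap | 8E-37/32 | | Cap | 35-738 | 0/4 in 687 aa | | | |
| | Scaffold_72496 | Rep | 2E-61/42 | Porcine parvovirus | Rep | 23-567 | 4/3 in 531 aa | 5.7 | meEPLG-6 | 30 |
| | | Cap | 2E-31/38 | | Cap | 10-532 | 6/4 in 514 aa | | | |
| | Scaffold_88340 | Rep | 7E-37/55 | Mouse parvovirus 1 | Rep | 344-566 | 0/3 in 223 aa | 6.7 | meEPLG-16 | 36 |
| | | Cap | 7E-22/33 | | Cap | 11-713 | 6/9 in 700 aa | | | |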
(a) Some ambiguity in choosing the most similar virus is possible. We generally used the alignment with the lowest E value in the BLAST results. However, one or two points in the exponent of an E value were sometimes sacrificed to achieve a longer sequence alignment.
(b) aa, amino acids.
(c) These sequences have long insertions compared to the present-day viruses. In all cases tested, these insertions originated from short interspersed elements (SINEs). These insertions were excluded from the counts of stop codons and frameshifts and the estimation of integration age.
(d) Chr, chromosome.

It is important to recognize that there is an intrinsic limit on how far back in time we can reach to identify ancient endogenous viral sequences. First, the sequences must be identified with confidence by BLAST or similar programs. This requirement places a lower limit on sequence identity at about 20 to 30% of amino acids, or about 75% of nucleotides (nucleotide sequences evolve nearly 2.5 times more slowly than the amino acid sequences they encode). Second, the related, present-day virus must have evolved at a rate that is not much higher than that of the endogenous sequences. The viruses for which ancestral endogenous sequences were identified in this study exhibit sequence drift similar to that associated with mammalian genomes. Setting this rate at 0.14% per million years of evolution (8), we arrive at 90 million years as the theoretical limit for the oldest sequences that can be identified using our methods. This limit drops to less than 35 million years for endogenous viral sequences in rodents and even lower for sequences related to viruses that evolve faster than mammalian genomes.

The most widespread integrations found in our survey are derived from the dependoviruses. These include nearly complete genomes related to adeno-associated virus (AAV) in microbat, wallaby, dolphin, rabbit, mouse, and baboon (Fig. 2). We did not detect inverted terminal repeats in several integrations tested, even though repeats are common in the present-day dependoviruses. This result could be explained by sequence decay or the absence of such structures in the ancestral viruses. However, we do see sequences that resemble degraded hairpin structures to which Dependovirus Rep proteins bind, with an example from microbat integration mlEDLG-1 shown in Fig. 3. The second most widespread endogenous sequences are related to the parvoviruses. They are found in 6 of 49 vertebrate species considered, with nearly complete genomes in rat, opossum, wallaby, and guinea pig (Fig. 2).

FIG. 3. Hairpin structure of the inverted terminal repeat of adeno-associated virus 2 (left) and a candidate degraded hairpin structure located close to the 5′ end of the mlEDLG-1 integration in microbats (right). Structures and mountain plots were generated using default parameters of the RNAfold program (5), with nucleotide coloring representing base-pairing probabilities: blue is below average, green is average, and red is above average. Mountain plots represent hairpin structures based on minimum free energy (mfe) calculations and partition function (pf) calculations, as well as the centroid structure (5). Height is expressed in numbers of nucleotides; position represents nucleotide.

The Dependovirus AAV2 has a strong bias for integration into human chromosome 19 during infection, driven by a host sequence that is recognized by the viral Rep protein(s). Rep mediates the formation of a synapse between viral and cellular sequences, and the cellular sequences are nicked to serve as an origin of viral replication (14). The related integrations in mice and rats, located in the same chromosomal locations, might be explained by such a mechanism. However, the extent of endogenous sequence decay and the frequency of stop codons indicate that these integrations occurred some 30 to 35 million years ago, implying that they are derived from a single event in a rodent ancestor rather than two independent integration events at the same location. Similarly, integrations EDLG-1 in dog and panda lie in chromosomal regions that can be readily aligned (based on University of California, Santa Cruz [UCSC] genome assemblies) and show sequence decay consistent with the age of the common ancestor, about 42 million years. Endogenous sequences related to the family Parvoviridae can thus be traced to over 40 million years back in time, and viral proteins related to this family have remained over 40% conserved.

Sequences related to circoviruses were detected in five vertebrate species (see the table above and Fig. 2).

In summary, our results indicate that sequences derived from ancestral members of the families Parvoviridae and Circoviridae were integrated into their hosts' genomes over the past 50 million years of evolution. Features of their replication strategies suggest mechanisms by which such integrations may have occurred. It is possible that some of the endogenous viral sequences could offer a selective advantage to the virus or the host. We note that rep open reading frame-derived proteins from some members of these families kill tumor cells selectively (3, 12). The genomic "fossils" we have discovered provide a unique glimpse into virus evolution but can give us only a lower estimate of the actual ages of these families. However, numerous recent integrations suggest that their germ line transfer has been continuing into present times.
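The roughly 90-million-year detection limit quoted earlier is stated without its arithmetic. One reading that reproduces it, offered here as an assumption rather than as the authors' stated calculation, is that the endogenous copy and the circulating virus each drift at about 0.14% per million years, so the two sequences diverge at twice that rate until they reach the 75% nucleotide identity floor. The function name and the linear model (which ignores repeated substitutions at the same site) are illustrative.

```python
def detection_limit_my(min_nt_identity, host_rate, virus_rate):
    """Time (million years) until two diverging lineages reach the identity floor.

    Assumes simple linear divergence: both lineages accumulate changes
    independently, so the pairwise difference grows at host_rate + virus_rate.
    """
    tolerated_divergence = 1.0 - min_nt_identity
    return tolerated_divergence / (host_rate + virus_rate)


MAMMALIAN_RATE = 0.0014  # 0.14% per million years, the rate given in the text
IDENTITY_FLOOR = 0.75    # about 75% nucleotide identity, from the text

# Virus and endogenous copy drifting at the same mammalian rate.
limit = detection_limit_my(IDENTITY_FLOOR, MAMMALIAN_RATE, MAMMALIAN_RATE)

# A hypothetical virus drifting twice as fast shortens the reachable window.
faster_virus = detection_limit_my(IDENTITY_FLOOR, MAMMALIAN_RATE, 2 * MAMMALIAN_RATE)

print(f"{limit:.1f} vs {faster_virus:.1f} million years")
```

Under these assumptions the first value comes out near 89 million years, close to the stated limit, and the second illustrates why the limit is lower for viruses that evolve faster than mammalian genomes.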