首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
真核生物转座子鉴定和分类计算方法   总被引:3,自引:0,他引:3  
Xu HE  Zhang HH  Han MJ  Shen YH  Huang XZ  Xiang ZH  Zhang Z 《遗传》2012,34(8):1009-1019
重复序列是真核生物基因组的重要组成成分,根据其序列特征及在基因组中的存在形式,可以进一步分为串联重复、片段重复和散在重复。其中,散在重复大多起源于转座子。根据转座介质的不同,转座子又可分为DNA和逆转录转座子。转座子的转座和扩增对基因的进化和基因组的稳定具有显著的影响;同时与其他类型的重复序列相比,转座子的结构和分类更为复杂多样,使得对转座子的鉴定和分类更为复杂和困难。鉴于此,文章简要概括了转座子的功能及分类,总结了真核生物转座子鉴定、分类和注释的3个步骤:(1)重复序列库的构建;(2)重复序列的校正和分类;(3)基因组注释。着重介绍了每一步骤所采用的不同计算方法,比较了不同方法的优缺点。只有把多种方法结合起来使用才能实现全基因组转座子的精确鉴定、分类和注释,这将为转座子的全基因组鉴定和分类提供借鉴意义。  相似文献   

2.
Repetitive sequences are a major constituent of many eukaryote genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth, manual curation and computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequence has made identification of repeats by DNA identity difficult even in closely related species. Hence, the presence of unidentified repeats in genome sequences affects the quality of gene annotations and annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and un-annotated genomes.  相似文献   

3.
Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeats sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host derived repeat sequences. REPdenovo is a new powerful computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo.  相似文献   

4.
MOTIVATION: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain. RESULTS: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families. AVAILABILITY: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.  相似文献   

5.
The interspersed repeat content of mammalian genomes has been best characterized in human, mouse and cow. In this study, we carried out de novo identification of repeated elements in the equine genome and identified previously unknown elements present at low copy number. The equine genome contains typical eutherian mammal repeats, but also has a significant number of hybrid repeats in addition to clade-specific Long Interspersed Nuclear Elements (LINE). Equus caballus clade specific LINE 1 (L1) repeats can be classified into approximately five subfamilies, three of which have undergone significant expansion. There are 1115 full-length copies of these equine L1, but of the 103 presumptive active copies, 93 fall within a single subfamily, indicating a rapid recent expansion of this subfamily. We also analysed both interspersed and simple sequence repeats (SSR) genome-wide, finding that some repeat classes are spatially correlated with each other as well as with G+C content and gene density. Based on these spatial correlations, we have confirmed that recently-described ancestral vs. clade-specific genome territories can be defined by their repeat content. The clade-specific Short Interspersed Nuclear Element correlations were scattered over the genome and appear to have been extensively remodelled. In contrast, territories enriched for ancestral repeats tended to be contiguous domains. To determine if the latter territories were evolutionarily conserved, we compared these results with a similar analysis of the human genome, and observed similar ancestral repeat enriched domains. These results indicate that ancestral, evolutionarily conserved mammalian genome territories can be identified on the basis of repeat content alone. Interspersed repeats of different ages appear to be analogous to geologic strata, allowing identification of ancient vs. newly remodelled regions of mammalian genomes.  相似文献   

6.
The mechanisms underlying cleavage of herpesvirus genomes from replicative concatemers are unknown. Evidence from herpes simplex virus type 1 suggests that cleavage occurs by a nonduplicative process; however, additional evidence suggests that terminal repeats may also be duplicated during the cleavage process. This issue has been difficult to resolve due to the variable numbers of reiterated terminal repeats that the herpes simplex virus type 1 genome can contain. Guinea pig cytomegalovirus is a herpesvirus with a simple terminal repeat arrangement that defines two genome types. Type II genomes have a single copy of a 1-kb terminal repeat at both their left and right termini, whereas type I genomes have only one copy at their left termini and lack the repeat at their right termini. In a previous study, we constructed a recombinant guinea pig cytomegalovirus in which certain cis elements were disrupted such that only type II genomes were produced. Here we show that double repeats that are formed by circularization of infecting genomes are rapidly converted to single repeats, such that the junctions between genomes within replicative concatemers formed late in infection almost exclusively contain single copies of the terminal repeat. Therefore, for the recombinant virus, each cleavage event begins with a single repeat within a concatemer yet produces two repeats, one at each of the resulting termini, demonstrating that terminal repeat duplication occurs in conjunction with cleavage. For wild-type guinea pig cytomegalovirus, the formation of type I genomes further suggests that cleavage can also occur by a nonduplicative process and that duplicative and nonduplicative cleavage can occur concurrently. Other herpesviruses having terminal repeats, such as the herpes simplex viruses and human cytomegalovirus, may also utilize repeat duplication and deletion; however, the biological importance of these events remains unknown.  相似文献   

7.
Organisms with a high density of transposable elements (TEs) exhibit nesting, with subsequent repeats found inside previously inserted elements. Nesting splits the sequence structure of TEs and makes annotation of repetitive areas challenging. We present TEnest, a repeat identification and display tool made specifically for highly repetitive genomes. TEnest identifies repetitive sequences and reconstructs separated sections to provide full-length repeats and, for long-terminal repeat (LTR) retrotransposons, calculates age since insertion based on LTR divergence. TEnest provides a chronological insertion display to give an accurate visual representation of TE integration history showing timeline, location, and families of each TE identified, thus creating a framework from which evolutionary comparisons can be made among various regions of the genome. A database of repeats has been developed for maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare) to illustrate the potential of TEnest software. All currently finished maize bacterial artificial chromosomes totaling 29.3 Mb were analyzed with TEnest to provide a characterization of the repeat insertions. Sixty-seven percent of the maize genome was found to be made up of TEs; of these, 95% are LTR retrotransposons. The rate of solo LTR formation is shown to be dissimilar across retrotransposon families. Phylogenetic analysis of TE families reveals specific events of extreme TE proliferation, which may explain the high quantities of certain TE families found throughout the maize genome. The TEnest software package is available for use on PlantGDB under the tools section (http://www.plantgdb.org/prj/TE_nest/TE_nest.html); the source code is available from (http://wiselab.org).  相似文献   

8.
The genome of the parasitic platyhelminth Schistosoma mansoni is composed of approximately 40% of repetitive sequences of which roughly 20% correspond to transposable elements. When the genome sequence became available, conventional repeat prediction programs were used to find these repeats, but only a fraction could be identified. To exhaustively characterize the repeats we applied a new massive sequencing based strategy: we re-sequenced the genome by next generation sequencing, aligned the sequencing reads to the genome and assembled all multiple-hit reads into contigs corresponding to the repetitive part of the genome. We present here, for the first time, this de novo repeat assembly strategy and we confirm that such assembly is feasible. We identified and annotated 4,143 new repeats in the S. mansoni genome. At least one third of the repeats are transcribed. This strategy allowed us also to identify 14 new microsatellite markers, which can be used for pedigree studies. Annotations and the combined (previously known and new) 5,420 repeat sequences (corresponding to 47% of the genome) are available for download (http://methdb.univ-perp.fr/downloads/).  相似文献   

9.
MOTIVATION: Repeats are ubiquitous in genomes and play important roles in evolution. Transposable elements are a common kind of repeat. Transposon insertions can be nested and make the task of identifying repeats difficult. RESULTS: We develop a novel iterative algorithm, called Greedier, to find repeats in a target genome given a repeat library. Greedier distinguishes itself from existing methods by taking into account the fragmentation of repeats. Each iteration consists of two passes. In the first pass, it identifies the local similarities between the repeat library and the target genome. Greedier then builds graphs from this comparison output. In each graph, a vertex denotes a similar subsequence pair. Edges denote pairs of subsequences that can be connected to form higher similarities. In the second pass, Greedier traverses these graphs greedily to find matches to individual repeat units in the repeat library. It computes a fitness value for each such match denoting the similarity of that match. Matches with fitness values greater than a cutoff are removed, and the rest of the genome is stitched together. The similarity cutoff is then gradually reduced, and the iteration is repeated until no hits are returned from the comparison. Our experiments on the Arabidopsis and rice genomes show that Greedier identifies approximately twice as many transposon bases as those found by cross_match and WindowMasker. Moreover, Greedier masks far fewer false positive bases than either cross_match or WindowMasker. In addition to masking repeats, Greedier also reports potential nested transposon structures.  相似文献   

10.
The current pace of the generation of sequence data requires the development of software tools that can rapidly provide full annotation of the data. We have developed a new method for rapid sequence comparison using the exact match algorithm without repeat masking. As a demonstration, we have identified all perfect simple tandem repeats (STR) within the draft sequence of the human genome. The STR elements (chromosome, position, length and repeat subunit) have been placed into a relational database. Repeat flanking sequence is also publicly accessible at http://grid.abcc.ncifcrf.gov. To illustrate the utility of this complete set of STR elements, we documented the increased density of potentially polymorphic markers throughout the genome. The new STR markers may be useful in disease association studies because so many STR elements manifest multiallelic polymorphism. Also, because triplet repeat expansions are important for human disease etiology, we identified trinucleotide repeats that exist within exons of known genes. This resulted in a list that includes all 14 genes known to undergo polynucleotide expansion, and 48 additional candidates. Several of these are non-polyglutamine triplet repeats. Other examinations of the STR database demonstrated repeats spanning splice junctions and identified SNPs within repeat elements.  相似文献   

11.
12.
Recent advances in gene structure prediction   总被引:9,自引:0,他引:9  
De novo gene predictors are programs that predict the exon-intron structures of genes using the sequences of one or more genomes as their only input. In the past two years, dual-genome de novo predictors, which exploit local rates and patterns of mutation inferred from alignments between two genomes, have led to significant improvements in accuracy. Systems that exploit more than two genomes simultaneously have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements. Dual-genome de novo prediction for compact eukaryotic genomes such as those of Arabidopsis thaliana and Caenorhabditis elegans is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. Coupled with significant improvements in pseudogene detection methods, which have eliminated many false positives, we have reached the point where de novo gene predictions are being used as hypotheses to drive experimental annotation via systematic RT-PCR and sequencing.  相似文献   

13.
Protein sequences are normally the most conserved elements of genomes owing to purifying selection to maintain their functions. We document an extraordinary amount of within-species protein sequence variation in the model eukaryote Dictyostelium discoideum stemming from triplet DNA repeats coding for long strings of single amino acids. D. discoideum has a very large number of such strings, many of which are polyglutamine repeats, the same sequence that causes various human neurological disorders in humans, like Huntington’s disease. We show here that D. discoideum coding repeat loci are highly variable among individuals, making D. discoideum a candidate for the most variable proteome. The coding repeat loci are not significantly less variable than similar non-coding triplet repeats. This pattern is consistent with these amino-acid repeats being largely non-functional sequences evolving primarily by mutation and drift.  相似文献   

14.
Repetitive DNA elements account for a substantial fraction of the mammalian genome. Many are subject to DNA methylation, which is known to undergo dynamic change during mouse germ cell development. We found that repeat sequences of three different classes retain high levels of methylation at E12.5, when methylation is erased from many single-copy genes. Maximal demethylation of repeats was seen later in development and at different times in male and female germ cells. At none of the time points examined (E12.5, E15.5, and E17.5) did we see complete demethylation, suggesting that methylation patterns on repeats may be passed on from one generation to the next. In male germ cells, we observed a de novo methylation event resulting in complete methylation of all the repeats in the interval between E15.5 and E17.5, which was not seen in females. These results suggest that repeat sequences undergo coordinate changes in methylation during germ cell development and give further insights into germ cell reprogramming in mice.  相似文献   

15.
Minisatellites are DNA tandem repeats that are found in all sequenced genomes. In the yeast Saccharomyces cerevisiae, they are frequently encountered in genes encoding cell wall proteins. Minisatellites present in the completely sequenced genome of the pathogenic yeast Candida glabrata were similarly analyzed, and two new types of minisatellites were discovered: minisatellites that are composed of two different intermingled repeats (called compound minisatellites), and minisatellites containing unusually long repeated motifs (126-429 bp). These long repeat minisatellites may reach unusual length for such elements (up to 10 kb). Due to these peculiar properties, they have been named 'megasatellites'. They are found essentially in genes involved in cell-cell adhesion, and could therefore be involved in the ability of this opportunistic pathogen to colonize the human host. In addition to megasatellites, found in large paralogous gene families, there are 93 minisatellites with simple shorter motifs, comparable to those found in S. cerevisiae. Most of the time, these minisatellites are not conserved between C. glabrata and S. cerevisiae, although their host genes are well conserved, raising the question of an active mechanism creating minisatellites de novo in hemiascomycetes.  相似文献   

16.
A survey of bacterial insertion sequences using IScan   总被引:4,自引:0,他引:4  
Bacterial insertion sequences (ISs) are the simplest kinds of bacterial mobile DNA. Evolutionary studies need consistent IS annotation across many different genomes. We have developed an open-source software package, IScan, to identify bacterial ISs and their sequence elements—inverted and target direct repeats—in multiple genomes using multiple flexible search parameters. We applied IScan to 438 completely sequenced bacterial genomes and 20 IS families. The resulting data show that ISs within a genome are extremely similar, with a mean synonymous divergence of Ks = 0.033. Our analysis substantially extends previously available information, and suggests that most ISs have entered bacterial genomes recently. By implication, their population persistence may depend on horizontal transfer. We also used IScan's ability to analyze the statistical significance of sequence similarity among many IS inverted repeats. Although the inverted repeats of insertion sequences are evolutionarily highly flexible parts of ISs, we show that this ability can be used to enrich a dataset for ISs that are likely to be functional. Applied to the thousands of genomes that will soon be available, IScan could be used for many purposes, such as mapping the evolutionary history and horizontal transfer patterns of different ISs.  相似文献   

17.
Repetitive elements may comprise over two-thirds of the human genome   总被引:1,自引:0,他引:1  
Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo "clouds"). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%-69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (~25 bp). Although short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed "element-specific" P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using it we identified ~100 Mb of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.  相似文献   

18.
X Zhao  Y Tian  R Yang  H Feng  Q Ouyang  Y Tian  Z Tan  M Li  Y Niu  J Jiang  G Shen  R Yu 《BMC genomics》2012,13(1):435
ABSTRACT: BACKGROUND: Relationship between the level of repetitiveness in genomic sequence and genome size has been investigated by making use of complete prokaryotic and eukaryotic genomes, but relevant studies have been rarely made in virus genomes. RESULTS: In this study, a total of 257 viruses were examined, which cover 90% of genera. The results showed that simple sequence repeats (SSRs) is strongly, positively and significantly correlated with genome size. Certain repeat class is distributed in a certain range of genome sequence length. Mono-, di- and tri- repeats are widely distributed in all virus genomes, tetra- SSRs as a common component consist in genomes which more than 100 kb in size; in the range of genome < 100 kb, genomes containing penta- and hexa- SSRs are not more than 50%. Principal components analysis (PCA) indicated that dinucleotide repeat affects the differences of SSRs most strongly among virus genomes. Results showed that SSRs tend to accumulate in larger virus genomes; and the longer genome sequence, the longer repeat units. CONCLUSIONS: We conducted this research standing on the height of the whole virus. We concluded that genome size is an important factor in affecting the occurrence of SSRs; hosts are also responsible for the variances of SSRs content to a certain degree.  相似文献   

19.
The problem of predicting non-long terminal repeats (LTR) like long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) from the DNA sequence is still an open problem in bioinformatics. To elevate the quality of annotations of LINES and SINEs an automated tool "RetroPred" was developed. The pipeline allowed rapid and thorough annotation of non-LTR retrotransposons. The non-LTR retrotransposable elements were initially predicted by Pairwise Aligner for Long Sequences (PALS) and Parsimonious Inference of a Library of Elementary Repeats (PILER). Predicted non-LTR elements were automatically classified into LINEs and SINEs using ANN based on the position specific probability matrix (PSPM) generated by Multiple EM for Motif Elicitation (MEME). The ANN model revealed a superior model (accuracy = 78.79 +/- 6.86 %, Q(pred) = 74.734 +/- 17.08 %, sensitivity = 84.48 +/- 6.73 %, specificity = 77.13 +/- 13.39 %) using four-fold cross validation. As proof of principle, we have thoroughly annotated the location of LINEs and SINEs in rice and Arabidopsis genome using the tool and is proved to be very useful with good accuracy. Our tool is accessible at http://www.juit.ac.in/RepeatPred/home.html.  相似文献   

20.
Simultaneous identification and comparison of perfect and imperfect microsatellites within a genome is a valuable tool both to overcome the lack of a consensus definition of SSRs and to assess repeat history. Detailed analysis of the overall distribution of perfect and imperfect microsatellites in closely related bacterial taxa is expected to give new insight into the evolution of prokaryotic genomes. We have performed a genome-wide analysis of microsatellite distribution in four Escherichia coli and seven Chlamydial strains. Chlamydial strains generally have a higher density of SSRs and show greater intra-group differences of SSR distribution patterns than E. coli genomes. In most investigated genomes the distribution of the total lengths of matching perfect and imperfect trinucleotide repeats are highly similar, with the notable exception of C. muridarum. Closely related strains show more similar repeat distribution patterns than strains separated by a longer divergence time. The discrepancy between the preferred classes of perfect and imperfect repeats in C. muridarum implies accelerated evolution of SSRs in this particular strain. Our results suggest that microsatellites, although considerably less abundant than in eukaryotic genomes, may nevertheless play an important role in the evolution of prokaryotic genomes and several gene families.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号