Similar articles
 20 similar articles found (search time: 46 ms)
1.
Repetitive sequences are a major constituent of many eukaryote genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth, manual curation and computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequence has made identification of repeats by DNA identity difficult even in closely related species. Hence, the presence of unidentified repeats in genome sequences affects the quality of gene annotations and annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and un-annotated genomes.

2.
3.
4.
We isolated clones and determined the sequence of portions of mouse and human cellular DNA which cross-hybridize strongly with the IR3 repetitive region of Epstein-Barr virus. The sequences were found to be tandem arrays of a simple sequence based on the triplet GGA, very similar to the IR3 repeat. The cellular repeats have distinct differences from the viral repeat region, however, and their sequences do not appear capable of being translated into a purely glycine-plus-alanine protein domain like the portion of the Epstein-Barr nuclear antigen coded by IR3. Although the relationship between IR3 and the cellular repeats is left unclear, the cellular repeats have many interesting features. The tandem arrays are about 1 to several kilobases long, much shorter than satellite tandem repeats and larger than other interspersed, tandem repeats. Each of the repeats is a distinct variation, perhaps diverged from a common sequence, (GGA)n. This family is present in the genomes of all species tested and appears to be a ubiquitous feature of all higher eucaryotic genomes.

5.
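Computational methods for the identification and classification of eukaryotic transposons   Cited: 3 in total (self-citations: 0, by others: 3)
Xu HE  Zhang HH  Han MJ  Shen YH  Huang XZ  Xiang ZH  Zhang Z  Hereditas (Beijing) (《遗传》), 2012, 34(8): 1009-1019
Repetitive sequences are a major component of eukaryotic genomes. Based on their sequence features and the form in which they occur in the genome, they can be further divided into tandem repeats, segmental duplications, and interspersed repeats. Most interspersed repeats originate from transposons. Depending on the transposition intermediate, transposons are classified as DNA transposons or retrotransposons. The transposition and amplification of transposons have a marked impact on gene evolution and genome stability. Compared with other types of repeats, transposons are also more complex and diverse in structure and classification, which makes their identification and classification more complicated and difficult. In view of this, the article briefly outlines the functions and classification of transposons and summarizes the three steps of eukaryotic transposon identification, classification, and annotation: (1) construction of a repeat library; (2) correction and classification of the repeats; (3) genome annotation. It focuses on the different computational methods used at each step and compares their strengths and weaknesses. Only by combining multiple methods can accurate genome-wide identification, classification, and annotation of transposons be achieved; the review is intended as a reference for such genome-wide work.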

6.
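A suffix-array-based technique for finding maximal tandem repeats in gene sequences   Cited: 1 in total (self-citations: 0, by others: 1)
Repeat analysis plays an important role in whole-genome research, and its primary task is to identify and locate all repeat structures in a DNA sequence. This paper proposes a new algorithm, based on a simple data structure, the suffix array, for finding all maximal tandem repeats in a given DNA sequence. Building on this algorithm, an effective and practical software tool, RepLocate, was written, and examples of its application to known DNA sequences are given.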
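To make the term "maximal tandem repeat" concrete, here is a minimal brute-force sketch in Python. It is not the suffix-based algorithm of the paper and not RepLocate; the function name `maximal_tandem_repeats` and its parameters are hypothetical, chosen only for illustration. It handles exact repeats only and is meant for short sequences.

```python
def maximal_tandem_repeats(seq, min_copies=2):
    """Brute-force search for exact maximal tandem repeats.

    Returns a list of (start, period, copies) tuples, where `copies` counts
    the whole copies of the repeat unit seq[start:start + period].
    Illustrative only: quadratic time, no suffix structure involved.
    """
    n = len(seq)
    found = []
    for period in range(1, n // 2 + 1):
        i = 0
        while i + period < n:
            # Extend the run of positions k with seq[k] == seq[k + period].
            k = i
            while k + period < n and seq[k] == seq[k + period]:
                k += 1
            run = k - i
            if run >= period * (min_copies - 1):
                unit = seq[i:i + period]
                # Skip units that are themselves repetitions of a shorter
                # unit (e.g. "GTGT"), so each repeat is reported once.
                if unit not in (unit + unit)[1:-1]:
                    found.append((i, period, (run + period) // period))
            # Position k is a mismatch, so the next run starts after it.
            i = k + 1
    return found


if __name__ == "__main__":
    print(maximal_tandem_repeats("ACGTGTGTGA"))
```

Each reported run cannot be extended to the left or right without breaking the period, which is what "maximal" means here. A suffix-based index would replace the inner character-by-character comparison with precomputed longest-common-prefix lookups.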

7.
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.

8.
9.
Summary: We report a collection of 53 prototypic sequences representing known families of repetitive elements from the human genome. The prototypic sequences are either consensus sequences or selected examples of repetitive sequences. The collection includes: prototypes for high and medium reiteration frequency interspersed repeats, long terminal repeats of endogenous retroviruses, alphoid repeats, telomere-associated repeats, and some miscellaneous repeats. The collection is annotated and available electronically. Offprint requests to: J. Jurka

10.
The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes; it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce the main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.

11.
Several complementary procedures were used to identify and characterize DNA sequences which are repeated within a 44 kilobase (kb) segment of rabbit chromosomal DNA containing four different rabbit β-like globin genes (β1–β4). Cross-hybridization between cloned DNAs from different regions of the gene cluster indicates the presence of a complex array of repeat sequences interspersed with the globin genes. We classified 20 different repeat sequences into five families whose members cross-hybridize. Electron microscopy was used to determine the location, size and relative orientations of many of the repeat sequences. Both direct and inverted repeats were identified, with sizes ranging from 140 to 1400 base pairs (bp). Each of the four closely linked globin genes is flanked by at least one pair of inverted repeats of 140–400 bp, and the entire set of four genes is flanked by an inverted repeat of 1400 bp. Two of the five repeat families contain repeat sequences of different sizes. We found that the smaller sequence elements can occur individually or in association with the larger repeat sequences, suggesting that the larger repeats may be composed of more than one smaller repeat sequence. The restriction fragments containing the intracluster repeats also contain sequences which are repeated many times in total rabbit genomic DNA, but it is not known whether the genomic and intracluster repeats are the same sequences. The results provide the first demonstration of the relationship between single-copy and repetitive DNA sequences in a large segment of chromosomal DNA containing a well characterized set of developmentally regulated genes.

12.
Transposable elements (TEs) are mobile, repetitive DNA sequences that are almost ubiquitous in prokaryotic and eukaryotic genomes. They have a large impact on genome structure, function and evolution. With the recent development of high-throughput sequencing methods, many genome sequences have become available, making possible comparative studies of TE dynamics at an unprecedented scale. Several methods have been proposed for the de novo identification of TEs in sequenced genomes. Most begin with the detection of genomic repeats, but the subsequent steps for defining TE families differ. High-quality TE annotations are available for the Drosophila melanogaster and Arabidopsis thaliana genome sequences, providing a solid basis for the benchmarking of such methods. We compared the performance of specific algorithms for the clustering of interspersed repeats and found that only a particular combination of algorithms detected TE families with good recovery of the reference sequences. We then applied a new procedure for reconciling the different clustering results and classifying TE sequences. The whole approach was implemented in a pipeline using the REPET package. Finally, we show that our combined approach highlights the dynamics of well defined TE families by making it possible to identify structural variations among their copies. This approach makes it possible to annotate TE families and to study their diversification in a single analysis, improving our understanding of TE dynamics at the whole-genome scale and for diverse species.

13.
The non-coding fraction of the human genome, which is approximately 98%, is mainly constituted by repeats. Transpositions, expansions and deletions of these repeat elements contribute to a number of diseases. None of the available databases consolidates information on both tandem and interspersed repeats with the flexibility of FASTA based homology search with reference to disease genes. Repeats in diseases database (RiDs db) is a web accessible relational database, which aids analysis of repeats associated with Mendelian disorders. It is a repository of disease genes, which can be searched by FASTA program or by limited or free-text keywords. Unlike other databases, RiDs db contains the sequences of these genes with access to corresponding information on both interspersed and tandem repeats contained within them, on a unified platform. Comparative analysis of novel or patient sequences with the reference sequences in RiDs db using FASTA search will indicate change in structure of repeats, if any, with a particular disorder. This database also provides links to orthologs in model organisms such as zebrafish, mouse and Drosophila. AVAILABILITY: The database is available for free at http://115.111.90.196/ridsdb/index.php.

14.
4.5SH RNA is a 94-nt small RNA with unknown function. This RNA is known to be present in the mouse, rat, and hamster cells; however, it is not found in human, rabbit, and chicken. In the mouse genome, the 4.5SH RNA gene is a part of a long (4.2 kb) tandem repeat (approximately 800 copies) unit. Here, we found that 4.5SH RNA genes are present only in rodents of six families that comprise the Myodonta clade: Muridae, Cricetidae, Spalacidae, Rhizomyidae, Zapodidae, and Dipodidae. The analysis of complementary DNA derived from the rodents of these families showed general evolutionary conservation of 4.5SH RNA and some intraspecific heterogeneity of these RNA molecules. 4.5SH RNA genes in the Norway rat, mole rat, hamster and jerboa genomes are included in the repeated sequences. In the jerboa genome these repeats are 4.0-kb long and arranged tandemly, similar to the corresponding arrangements in the mouse and rat genomic DNA. Sequencing of the rat and jerboa DNA repeats containing 4.5SH RNA genes showed fast evolution of the gene-flanking sequences. The repeat sequences of the distantly related rodents (mouse and rat vs. jerboa) have no apparent similarity except for the 4.5SH RNA gene itself. Conservation of the 4.5SH RNA gene nucleotide sequence indicates that this RNA is likely to be under selection pressure and, thus, may have a function. The repeats from the different rodents have similar lengths and contain many simple short repeats. The data obtained suggest that long insertions, deletions, and simple sequence amplifications significantly contribute to the evolution of the repeats containing 4.5SH RNA genes. The 4.5SH RNA gene seems to have originated 50-85 MYA in a Myodonta ancestor from a copy of the B1 short interspersed element. The amplification of the gene with the flanking sequences could result from the supposed cellular requirement of the intensive synthesis of 4.5SH RNA. Further Myodonta evolution led to dramatic changes of the repeat sequences in every lineage with the conservation of the 4.5SH RNA genes only. This gene, like some other relatively recently originated genes, could be a useful model for studying generation and evolution of non-protein-coding genes.

15.
We studied the occurrence of mammalian interspersed repeats (MIRs) in DNA and RNA of vertebrates, invertebrates, and bacteria using the data from GenBank. A special algorithm based on a weight position matrix with optimal alignment using dynamic programming was developed to search for the traces of MIR dissemination. This allowed us to search for highly divergent MIRs carrying deletions and insertions. MIRs were detected in genomes of various fishes, including Latimeria. This suggests that the origin of MIRs dates back more than 400 million years. The method to search for similarity between highly divergent sequences may be used to find the genome fragments from various ancient repeat families and from various gene families.

16.
Stupar RM  Song J  Tek AL  Cheng Z  Dong F  Jiang J 《Genetics》2002,162(3):1435-1444
The heterochromatin in eukaryotic genomes represents gene-poor regions and contains highly repetitive DNA sequences. The origin and evolution of DNA sequences in the heterochromatic regions are poorly understood. Here we report a unique class of pericentromeric heterochromatin consisting of DNA sequences highly homologous to the intergenic spacer (IGS) of the 18S.25S ribosomal RNA genes in potato. A 5.9-kb tandem repeat, named 2D8, was isolated from a diploid potato species Solanum bulbocastanum. Sequence analysis indicates that the 2D8 repeat is related to the IGS of potato rDNA. This repeat is associated with highly condensed pericentromeric heterochromatin at several hemizygous loci. The 2D8 repeat is highly variable in structure and copy number throughout the Solanum genus, suggesting that it is evolutionarily dynamic. Additional IGS-related repetitive DNA elements were also identified in the potato genome. The possible mechanism of the origin and evolution of the IGS-related repeats is discussed. We demonstrate that potato serves as an interesting model for studying repetitive DNA families because it is propagated vegetatively, thus minimizing the meiotic mechanisms that can remove novel DNA repeats.

17.
We designed a simple but sensitive program, IntraCompare, for identifying internal repeats in families of homologous proteins. The protein sequences are aligned (Clustal X), the regions to be compared are selected, and all potential repeat sequences are compared with all others. The output provides comparison scores (GAP program) expressed in standard deviations.

18.
FORRepeats: detects repeats on entire chromosomes and between genomes   Cited: 1 in total (self-citations: 0, by others: 1)
MOTIVATION: As more and more whole genomes are available, there is a need for new methods to compare large sequences and transfer biological knowledge from annotated genomes to related new ones. BLAST is not suitable to compare multimegabase DNA sequences. MegaBLAST is designed to compare closely related large sequences. Some tools to detect repeats in large sequences have already been developed such as MUMmer or REPuter. They also have time or space restrictions. Moreover, in terms of applications, REPuter only computes repeats and MUMmer works better with related genomes. RESULTS: We present a heuristic method, named FORRepeats, which is based on a novel data structure called factor oracle. In the first step it detects exact repeats in large sequences. Then, in the second step, it computes approximate repeats and performs pairwise comparison. We compared its computational characteristics with BLAST and REPuter. Results demonstrate that it is fast and space economical. We show FORRepeats' ability to perform intra-genomic comparison and to detect repeated DNA sequences in the complete genome of the model plant Arabidopsis thaliana.

19.
A new class of human interspersed repeated sequences distinct from the AluI family was found by screening a human gene library with a mouse ribosomal gene non-transcribed spacer probe (rDNA NTS). A member of this sequence family was localized to a 251 bp segment between the human delta and beta globin genes: a region previously judged to be devoid of repeated DNA. The complete nucleotide sequence of this segment revealed a tandem block of 17 TG dinucleotides, a feature hypothesized by others to be a recombination hot spot responsible for gene conversion in the gamma globin locus region. When the genomes of Xenopus, pigeon, slime mold and yeast were examined, reiterated sequences homologous to both the mouse rDNA NTS and human globin repeat were found in every case. The discovery of this extraordinarily conserved repeated sequence family appears to have depended upon not using salmon sperm DNA during hybridization. The use of eucaryotic carrier DNA may bias the search for repeated sequences against any which may be highly conserved during eucaryotic evolution.

20.
MOTIVATION: Tandemly organized repetitive sequences (satellite DNA) are widespread in complex eukaryotic genomes. In plants, satellite repeats often represent a substantial part of nuclear DNA but only a little is known about the molecular mechanisms of their amplification and their possible role(s) in genome evolution and function. Unfortunately, addressing these questions via characterization of general sequence properties of known satellite repeats has been hindered by a difficulty in obtaining a complete and unbiased set of sequence data for this analysis. This is mainly due to the presence of multiple entries of homologous sequences and of single entries that contain more than one repeated unit (monomer) in the public databases. RESULTS: We have established a computer database specialized for plant satellite repeats (PlantSat) that integrates sequence data available from various resources with supplementary information including repeat consensus sequences, abundances, and chromosomal localizations. The sequences are stored as individual repeat monomers grouped into families, which simplifies their computer analysis and makes it more accurate. Using this feature, we have performed a basic sequence analysis of the whole set of plant satellite repeats with respect to their monomer length and nucleotide composition. The analysis revealed several preferred length ranges of the monomers (approximately 165 bp and its multiples) and an over-representation of the AA/TT dinucleotide in the repeats. We have also detected an enrichment of satellite DNA sequences for the motif CAAAA that is supposed to be involved in breakage-reunion of repeated sequences.
