Similar literature
 Found 20 similar articles (search time: 867 ms)
1.
2.
X Zhao  Y Tian  R Yang  H Feng  Q Ouyang  Y Tian  Z Tan  M Li  Y Niu  J Jiang  G Shen  R Yu 《BMC genomics》2012,13(1):435
ABSTRACT: BACKGROUND: The relationship between the level of repetitiveness in genomic sequence and genome size has been investigated using complete prokaryotic and eukaryotic genomes, but relevant studies have rarely been made in virus genomes. RESULTS: In this study, a total of 257 viruses were examined, covering 90% of genera. The results showed that simple sequence repeats (SSRs) are strongly, positively and significantly correlated with genome size. Each repeat class is distributed within a certain range of genome sequence length. Mono-, di- and tri- repeats are widely distributed in all virus genomes; tetra- SSRs are a common component of genomes more than 100 kb in size; among genomes < 100 kb, no more than 50% contain penta- and hexa- SSRs. Principal components analysis (PCA) indicated that dinucleotide repeats affect the differences in SSRs among virus genomes most strongly. Results showed that SSRs tend to accumulate in larger virus genomes, and the longer the genome sequence, the longer the repeat units. CONCLUSIONS: We conducted this research at the level of viruses as a whole. We concluded that genome size is an important factor affecting the occurrence of SSRs; hosts are also responsible, to a certain degree, for the variance in SSR content.

3.
MOTIVATION: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain. RESULTS: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families. AVAILABILITY: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.

4.
The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes; it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce the main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.

5.
The abundance and genomic organization of six simple sequence repeats, consisting of di-, tri-, and tetranucleotide sequence motifs, and a minisatellite repeat have been analyzed in different gymnosperms by Southern hybridization. Within the gymnosperm genomes investigated, the abundance and genomic organization of micro- and minisatellite repeats largely follow taxonomic groupings. We found that only particular simple sequence repeat motifs are amplified in gymnosperm genomes, while others such as (CAC)5 and (GACA)4 are present in only low copy numbers. The variation in abundance of simple sequence motifs reflects a similar situation to that found in angiosperms. Species of the two- and three-needle pine section Pinus are relatively conserved and can be distinguished from Pinus strobus, which belongs to the five-needle pine section Strobus. The hybridization patterns of Picea species, bald cypress and ginkgo were different from the patterns detected in the Pinus species. Furthermore, sequences with homology to the plant telomeric repeat (TTTAGGG)n have been analyzed in the same set of gymnosperms. Telomere-like repeats are highly amplified within two- and three-needle pine genomes, such as slash pine (Pinus elliottii Engelm. var. elliottii), compared to P. strobus, Picea species, bald cypress and ginkgo. P. elliottii var. elliottii was used as a representative species to investigate the chromosomal organization of telomere-like sequences by fluorescence in situ hybridization (FISH). The telomere-like sequences are not restricted to the ends of chromosomes; they form large intercalary and pericentric blocks, showing that they are a repeated component of the slash pine genome. Conifers have genomes larger than 20000 Mbp, and our results clearly demonstrate that repeats of low sequence complexity, such as (CA)8, (GA)8, (GGAT)4 and (GATA)4, and minisatellite- and telomere-like sequences represent a large fraction of the repetitive DNA of these species.
The striking differences in abundance and genome organization of the various repeat motifs suggest that these repetitive sequences evolved differently in the gymnosperm genomes investigated. Received: 1 October 1999 / Accepted: 3 November 1999

6.
R L Neve  D M Kurnit 《Gene》1983,23(3):355-367
We studied the sequence repetitiveness of human cDNA and genomic DNA fragments inserted in the miniplasmid piVX. Sequence repetitiveness was assayed by the frequency with which a given insert mediated recombination between the chimeric miniplasmid and a recombinant bacteriophage library constructed from large random human genomic fragments. The methodology allows rapid analysis and isolation of sequences of a given copy number in the genome: few (1 to 10 copies), low-order repeated (10 to 100 copies) and more highly repeated (over 100 copies). In a model application of the method, the distribution of these classes of sequences was compared in cDNA and genomic DNA libraries constructed in piVX. The major difference observed between cDNA and genomic DNA repeat structure was the paucity of highly repeated elements in cDNA copies from high-molecular-weight cytoplasmic poly(A)+ RNA.

7.
8.
We have sequenced two complete chloroplast genomes in the Asteraceae, Helianthus annuus (sunflower), and Lactuca sativa (lettuce), which belong to the distantly related subfamilies, Asteroideae and Cichorioideae, respectively. The Helianthus chloroplast genome is 151,104 bp and the Lactuca genome is 152,772 bp long, which is within the usual size range for chloroplast genomes in flowering plants. When compared to tobacco, both genomes have two inversions: a large 22.8-kb inversion and a smaller 3.3-kb inversion nested within it. Pairwise sequence divergence across all genes, introns, and spacers in Helianthus and Lactuca has resulted in the discovery of new, fast-evolving DNA sequences for use in species-level phylogenetics, such as the trnY-rpoB, trnL-rpl32, and ndhC-trnV spacers. Analysis and categorization of shared repeats resulted in seven classes useful for future repeat studies: double tandem repeats, three or more tandem repeats, direct repeats dispersed in the genome, repeats found in reverse complement orientation, hairpin loops, runs of A's or T's in excess of 12 bp, and gene or tRNA similarity. Results from BLAST searches of our genomic sequence against expressed sequence tag (EST) databases for both genomes produced eight likely RNA edited sites (C → U changes). These detailed analyses in Asteraceae contribute to a broader understanding of plastid evolution across flowering plants.

9.
The Restriction On Computer (ROC) program (freely available at http://www.mcb.harvard.edu/gilbert/ROC) was developed and used to analyze the restriction fragment length distribution in the human genome. In contrast to other programs searching for restriction sites, ROC simultaneously analyzes several long nucleotide sequences, such as entire genomes, and in essence simulates electrophoretic analysis of DNA restriction fragments. In addition, this program extracts and analyzes DNA repeats that account for peaks in the restriction fragment length distribution. The ROC analysis data are consistent with the experimental data obtained via in vitro restriction enzyme analysis (taxonomic printing). A difference between the in vitro and in silico results is explained by underrepresentation of tandem DNA repeats in genomic databases. The ROC analysis of individual genome fragments elucidated the nature of several DNA markers, which were earlier revealed by taxonomic printing, and showed that L1 and Alu repeats are nonrandomly distributed in various chromosomes. Another advantage is that the ROC procedure makes it possible to analyze the nonrandom character of a genomic distribution of short DNA sequences. The ROC analysis showed that a low poly(G) frequency is characteristic of the entire human genome, rather than of only coding sequences. The method was proposed for a more complex in silico analysis of the genome. For instance, it is possible to simulate DNA restriction together with blot hybridization and then to analyze the nature of markers revealed.
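
The in silico digestion idea described above can be illustrated with a minimal Python sketch. This is not the ROC program's code: the function names `digest` and `fragment_length_distribution` and the `cut_offset` parameter are hypothetical, the sequences are treated as linear, and only exact matches of a single recognition site are considered.

```python
from collections import Counter

def digest(seq, site, cut_offset=0):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the position of the cut inside the site
    (e.g. 1 for EcoRI, which cuts G^AATTC). Returns fragment lengths.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)  # allow overlapping sites
    bounds = [0] + cuts + [len(seq)]
    # Skip zero-length fragments (cut exactly at a sequence end).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

def fragment_length_distribution(sequences, site, cut_offset=0):
    """Pool fragment lengths over several sequences into one histogram."""
    counts = Counter()
    for seq in sequences:
        counts.update(digest(seq, site, cut_offset))
    return counts

# Toy example with two EcoRI sites.
print(digest("AAGAATTCTTGAATTCGG", "GAATTC", cut_offset=1))
```

Peaks in such a histogram would correspond to fragment lengths that recur many times, which is how repeated elements can show up in a simulated digest.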

10.
The Restriction On Computer (ROC) program (freely available at http://www.mcb.harvard.edu/gilbert/ROC) was developed and used to analyze the restriction fragment length distribution in the human genome. In contrast to other programs searching for restriction sites, ROC simultaneously analyzes several long nucleotide sequences, such as entire genomes, and in essence simulates electrophoretic analysis of DNA restriction fragments. In addition, this program extracts and analyzes DNA repeats that account for peaks in the restriction fragment length distribution. The ROC analysis data are consistent with the experimental data obtained via in vitro restriction enzyme analysis (DNA taxonoprint). A difference between the in vitro and in silico results is explained by underrepresentation of tandem DNA repeats in genomic databases. The ROC analysis of individual genome fragments elucidated the nature of several DNA markers, which were earlier revealed by DNA taxonoprint, and showed that L1 and Alu repeats are nonrandomly distributed in various chromosomes. Another advantage is that the ROC procedure makes it possible to analyze the nonrandom character of a genomic distribution of short DNA sequences. The ROC analysis showed that a low poly(G) frequency is characteristic of the entire human genome, rather than of only coding sequences. The method was proposed for a more complex in silico analysis of the genome. For instance, it is possible to simulate DNA restriction together with blot hybridization and then to analyze the nature of markers revealed.

11.
Linguistic complexity is a simple and elegant way of calculating the complexity of strings of data. It is based on the concept that the greater the vocabulary one uses, the more complex the data. Until now, it has been used only on one-dimensional data, such as DNA and protein sequences and various human language texts. The basic definition can be extended to higher dimensions, thus allowing a practical and simple calculation of the linguistic complexity of images, 3D objects and other multi-dimensional data. A simple extension of linguistic complexity is introduced, followed by 2D presentations and a discussion of parametric considerations. An example of linguistic complexity calculations, demonstrating its image processing and medical diagnostic power, is presented. The subject matter of this paper is covered by a pending patent application.
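
For the one-dimensional case, the "vocabulary" idea can be sketched as follows. The abstract does not give a formula, so this assumes one common formulation: for each word length k, vocabulary usage is the number of distinct k-letter words observed divided by the maximum number possible, and the complexity is the product of these ratios. The function names and the `max_k` and `alphabet_size` parameters are illustrative assumptions, not the paper's definitions.

```python
def vocabulary_usage(text, k, alphabet_size):
    """Fraction of the possible k-letter vocabulary actually used in `text`."""
    n_windows = len(text) - k + 1
    if n_windows <= 0:
        return 1.0  # text too short for words of this length; treat as neutral
    observed = len({text[i:i + k] for i in range(n_windows)})
    # The vocabulary is capped by both the alphabet and the text length.
    possible = min(alphabet_size ** k, n_windows)
    return observed / possible

def linguistic_complexity(text, max_k, alphabet_size=4):
    """Product of vocabulary usage over word lengths 1..max_k (assumed form)."""
    lc = 1.0
    for k in range(1, max_k + 1):
        lc *= vocabulary_usage(text, k, alphabet_size)
    return lc

print(linguistic_complexity("ACGT", 2))  # every possible word is used
print(linguistic_complexity("AAAA", 2))  # a single repeated letter scores low
```

Under this assumed form, a highly repetitive string uses few distinct words and therefore scores close to 0, while a string that exhausts its possible vocabulary scores 1.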

12.
13.
MOTIVATION: Microsatellites, also known as simple sequence repeats, are tandem repeats of nucleotide motifs 1-6 bp in size, found in every genome known so far. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in genome evolution. Therefore, it is important to study the distribution, enrichment and polymorphism of microsatellites in the genomes of interest. The prerequisite for this is a computational tool for extracting microsatellites (perfect as well as imperfect) and their related information from whole genome sequences. Examination of available tools revealed certain lacunae in them and prompted us to develop a new tool. RESULTS: In order to efficiently screen genome sequences for microsatellites (perfect as well as imperfect), we developed a new tool called IMEx (Imperfect Microsatellite Extractor). IMEx uses a simple string-matching algorithm with a sliding-window approach to screen DNA sequences for microsatellites and reports the motif, copy number, genomic location, nearby genes, mutational events and many other features useful for in-depth studies. IMEx is more sensitive, efficient and useful than the widely used tools currently available. IMEx is available as a stand-alone program as well as a web server. AVAILABILITY: A World Wide Web server and the stand-alone program are available for free access at http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex.
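
To make the string-matching idea concrete, here is a minimal Python sketch that scans a sequence left to right and reports perfect microsatellites with 1-6 bp motifs. It is not IMEx's algorithm: imperfect repeats are not handled, and the function name `find_perfect_ssrs` and the `min_copies` threshold are assumptions made for illustration.

```python
def find_perfect_ssrs(seq, max_motif=6, min_copies=3):
    """Return (start, motif, copies) for perfect tandem repeats.

    At each position, every motif length 1..max_motif is tried and the
    tract covering the most bases is kept; on ties the shorter motif wins.
    """
    seq = seq.upper()
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None
        for k in range(1, max_motif + 1):
            motif = seq[i:i + k]
            if len(motif) < k:
                break  # not enough sequence left for this motif length
            copies = 1
            # Count how many times the motif repeats back to back.
            while seq[i + copies * k:i + (copies + 1) * k] == motif:
                copies += 1
            span = copies * k
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, motif, copies)
        if best:
            hits.append(best)
            i += best[2] * len(best[1])  # jump past the reported tract
        else:
            i += 1
    return hits

print(find_perfect_ssrs("GGCACACACACTT"))
```

Because the scan is greedy, a dinucleotide tract is reported in the phase in which it is first encountered (here as CA rather than AC); a production tool would typically normalize motifs and tolerate mismatches.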

14.
Identifying and predicting the structural characteristics of novel repeats throughout the genome can lend insight into biological function. Specific repeats are believed to have biological significance as a function of their distribution patterns. We have developed 'GenomeMark,' a computer program that detects and statistically analyzes candidate repeats. Specifically, 'GenomeMark' identifies the periodic distribution of unique words, calculating their χ² and Z-score values. Using 'GenomeMark,' we identified novel sequence words present in tandem throughout genomes. We found that these sequences have remarkable spacer sequence distributions and many were genome specific, validating the genome signature theory. Further analysis confirmed that many of these sequences have a specific biological function. The program is available from the authors upon request and is freely available for non-commercial and academic entities.

15.
On the complexity measures of genetic sequences   Cited by: 7 (self-citations: 0; citations by others: 7)
MOTIVATION: It is well known that the regulatory regions of genomes are highly repetitive. They are rich in direct, symmetric and complemented repeats, and there is no doubt about the functional significance of these repeats. Among known measures of complexity, the Ziv-Lempel complexity measure reflects most adequately repeats occurring in the text. But this measure does not take into account isomorphic repeats. By isomorphic repeats we mean fragments that are identical (or symmetric) modulo some permutation of the alphabet letters. RESULTS: In this paper, two complexity measures of symbolic sequences are proposed that generalize the Ziv-Lempel complexity measure by taking into account any isomorphic repeats in the text (rather than just direct repeats as in Ziv-Lempel). The first of them, the complexity vector, is designed for small alphabets such as the alphabet of nucleotides. The second is based on a search for the longest isomorphic fragment in the history of sequence synthesis and can be used for alphabets of arbitrary cardinality. These measures have been used for recognition of structural regularities in DNA sequences. Some interesting structures related to the regulatory region of the human growth hormone are reported.
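
As background for the measure being generalized, the following Python sketch computes the classical Lempel-Ziv (1976) complexity, i.e. the number of components when a sequence is built by copying the longest fragment already seen and appending one new symbol. It covers direct repeats only; the isomorphic-repeat extensions proposed in the paper are not implemented, and the function name `lz_complexity` is an assumption.

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) parsing of `s`.

    Each component is the longest fragment that can be copied from a
    start position earlier in the text (overlap with the component
    itself is allowed), extended by one new symbol.
    """
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # s.find(..., 0, i + length - 1) only accepts copies that start before i.
        while i + length <= n and s.find(s[i:i + length], 0, i + length - 1) != -1:
            length += 1
        components += 1
        i += length
    return components

print(lz_complexity("0001101001000101"))  # parses as 0|001|10|100|1000|101
print(lz_complexity("AAAA"))              # parses as A|AAA
```

Repetitive text yields few, long components, so a low count relative to sequence length indicates a high density of direct repeats. This quadratic-time version is meant for short sequences only.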

16.
Repetitive sequences are a major constituent of many eukaryote genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth, manual curation and computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequence has made identification of repeats by DNA identity difficult even in closely related species. Hence, the presence of unidentified repeats in genome sequences affects the quality of gene annotations and annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and un-annotated genomes.

17.
Complete archaeal genomes were probed for the presence of long (≥ 25 bp) oligonucleotide repeats (words). We detected the presence of many words distributed in tandem with narrow ranges of periodicity (i.e., spacer length between repeats). Similar words were not identified in genomes of non-archaeal species, namely Escherichia coli, Bacillus subtilis, Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae. BLAST similarity searches against the GenBank nucleotide sequence database revealed that these words were archaeal species-specific, indicating that they are of a signature character. Sequence analysis and genome viewing tools showed these repeats to be restricted to non-coding regions. Thus, archaea appear to possess a non-coding genomic signature that is absent in bacterial species. The identification of a species-specific genomic signature would be of great value to archaeal genome mapping, evolutionary studies and analyses of genome complexity.
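
The notion of repeated words separated by spacers of similar length can be sketched in Python as below. This is only an illustration under stated assumptions, not the authors' method: words are exact matches of a fixed length, every window is indexed in memory (practical for small sequences only), and the function names `repeated_words` and `spacer_lengths` are hypothetical.

```python
from collections import defaultdict

def repeated_words(seq, word_len=25):
    """Map each word of length `word_len` that occurs more than once
    to the sorted list of its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - word_len + 1):
        positions[seq[i:i + word_len]].append(i)
    return {w: p for w, p in positions.items() if len(p) > 1}

def spacer_lengths(positions, word_len):
    """Gap between the end of one occurrence and the start of the next.
    Negative values mean that consecutive occurrences overlap."""
    return [b - (a + word_len) for a, b in zip(positions, positions[1:])]

# Toy example with a short word length so the result is easy to inspect.
toy = "ACGTACGG" + "TTTTT" + "ACGTACGG" + "CCCCC" + "ACGTACGG"
for word, pos in repeated_words(toy, word_len=8).items():
    print(word, pos, spacer_lengths(pos, 8))
```

A narrow spread of spacer lengths for a given word, as in the toy example, is the kind of periodicity the abstract refers to.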

18.
Repetitive DNA sequences comprise a large percentage of plant genomes, and their characterization provides information about both species and genome evolution. We have isolated a recombinant clone containing a highly repeated DNA element (SB92) that is homologous to ca. 0.9% of the soybean genome, or about 10^5 copies. This repeated sequence is tandemly arranged and is found in four or five major genomic locations. FISH analysis of metaphase chromosomes suggests that two of these locations are centromeric. We have determined the sequence of two cloned repeats and performed genomic sequencing to obtain a consensus sequence. The consensus repeat size was 92 bp and exhibited an average of 10% nucleotide substitution relative to the two cloned repeats. This high level of sequence diversity suggests an ancient origin but is inconsistent with the limited phylogenetic distribution of SB92, which is found in high copy number only in the annual soybeans. It therefore seems likely that this sequence is undergoing very rapid evolution.

19.
All organisms that have been studied until now have been found to have differential distribution of simple sequence repeats (SSRs), with more SSRs in intergenic than in coding sequences. SSR distribution was investigated in Archaea genomes where complete chromosome sequences of 19 Archaea were analyzed with the program SPUTNIK to find di- to penta-nucleotide repeats. The number of repeats was determined for the complete chromosome sequences and for the coding and non-coding sequences. Different from what has been found for other groups of organisms, there is an abundance of SSRs in coding regions of the genome of some Archaea. Dinucleotide repeats were rare and CG repeats were found in only two Archaea. In general, trinucleotide repeats are the most abundant SSR motifs; however, pentanucleotide repeats are abundant in some Archaea. Some of the tetranucleotide and pentanucleotide repeat motifs are organism specific. In general, repeats are short and CG-rich repeats are present in Archaea having a CG-rich genome. Among the 19 Archaea, SSR density was not correlated with genome size or with optimum growth temperature. Pentanucleotide density had an inverse correlation with the CG content of the genome.

20.
Hancock JM 《Genetica》2002,115(1):93-103
The relationship between the level of repetitiveness in genomic sequences and genome size has been re-investigated making use of the rapidly growing database of complete eubacterial and archaeal genome sequences combined with the fragmentary but now large amount of data from eukaryotic genomes. Relative simplicity factors (RSFs), which measure the repetitiveness of sequences, were calculated and significantly simple motifs (SSMs), which identify the kinds of sequences that are repeated, were identified. A previously reported correlation between genome size and repetitiveness was confirmed, but it was shown that the higher RSFs seen in eukaryotic genomes also reflect a generally higher level of repetitiveness independent of genome size differences. Differences in genome size are responsible for about 10% of the variance in RSF seen between species. The spectrum of SSMs seen within a genome differed markedly within the eubacteria but less so in eukaryotes and, particularly, in archaea. Species with SSM spectra that differ from the norm tend also to have high RSFs for their genome size and to be pathogens that make use of repetitive sequences to avoid host defence responses. Some of the variance in repetitiveness seen in other species may therefore also reflect the action of selection, although other forces such as variation in the effectiveness of mechanisms for regulating slippage errors of replication, may also be important.
