Similar articles
1.
FORRepeats: detects repeats on entire chromosomes and between genomes   Cited by: 1 (self-citations: 0; citations by others: 1)
MOTIVATION: As more whole genomes become available, new methods are needed to compare large sequences and to transfer biological knowledge from annotated genomes to related new ones. BLAST is not suitable for comparing multimegabase DNA sequences, and MegaBLAST is designed to compare closely related large sequences. Some tools for detecting repeats in large sequences, such as MUMmer and REPuter, have already been developed, but they also have time or space restrictions. Moreover, in terms of applications, REPuter only computes repeats and MUMmer works better with related genomes. RESULTS: We present a heuristic method, named FORRepeats, which is based on a novel data structure called the factor oracle. In the first step, it detects exact repeats in large sequences. In the second step, it computes approximate repeats and performs pairwise comparison. We compared its computational characteristics with those of BLAST and REPuter. The results demonstrate that it is fast and space-economical. We show FORRepeats' ability to perform intra-genomic comparison and to detect repeated DNA sequences in the complete genome of the model plant Arabidopsis thaliana.
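
The factor oracle named in the abstract above can be illustrated with a short sketch. This is the standard online construction of a factor oracle as commonly described in the string-algorithms literature, not the FORRepeats implementation; the function names build_factor_oracle and accepts, and the toy input, are assumptions made for illustration only.

```python
def build_factor_oracle(s):
    """Build a factor oracle for string s (hypothetical helper, illustration only).

    Returns (trans, supply): trans[i] maps a character to the target state,
    supply[i] is the supply (failure-like) link of state i.
    """
    n = len(s)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i, c in enumerate(s):
        # Internal transition from state i to the new state i + 1.
        trans[i][c] = i + 1
        # Follow supply links, adding external transitions where c is missing.
        k = supply[i]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i + 1
            k = supply[k]
        supply[i + 1] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, word):
    """Return True if the oracle can read word starting from state 0."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


if __name__ == "__main__":
    text = "abbbaab"
    trans, supply = build_factor_oracle(text)
    # Every true factor (substring) of the text is accepted.
    print(accepts(trans, "bbaa"), accepts(trans, "c"))
```

The oracle has one state per prefix of the input and accepts every substring of it, but it may also accept some strings that are not substrings. A repeat detector built on it would therefore need to verify candidate matches against the sequence, which is consistent with the abstract calling the method heuristic.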

2.
Complete chromosome/genome sequences available from humans, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae were analyzed for the occurrence of mono-, di-, tri-, and tetranucleotide repeats. In all of the genomes studied, dinucleotide repeat stretches tended to be longer than other repeats. Additionally, tetranucleotide repeats in humans and trinucleotide repeats in Drosophila also seemed to be longer. Although the trends for different repeats are similar between different chromosomes within a genome, the density of repeats may vary between different chromosomes of the same species. The abundance or rarity of various di- and trinucleotide repeats in different genomes cannot be explained by the nucleotide composition of a sequence or by the potential of repeated motifs to form alternative DNA structures. This suggests that, in addition to the nucleotide composition of repeat motifs, characteristic DNA replication/repair/recombination machinery might play an important role in the genesis of repeats. Moreover, analysis of the complete genome coding DNA sequences of Drosophila, C. elegans, and yeast indicated that expansions of codon repeats corresponding to small hydrophilic amino acids are tolerated more, while strong selection pressures probably eliminate codon repeats encoding hydrophobic and basic amino acids. The locations and sequences of all of the repeat loci detected in genome sequences and coding DNA sequences are available at http://www.ncl-india.org/ssr and could be useful for further studies.

3.
MRD is a database system for accessing microsatellite repeat information for archaeal, eubacterial, and eukaryotic genomes whose sequence information is available in the public domain. MRD stores information about simple tandemly repeated k-mer sequences where k = 1 to 6, i.e. monomer to hexamer. The web interface allows users to search for a repeat of interest and to learn about the association of the repeat with genes and genomic regions in the specific organism. The data contain the abundance and distribution of microsatellites in the coding and non-coding regions of the genome. The exact location of repeats with respect to genomic regions of interest (such as UTR, exon, intron or intergenic regions), whichever is applicable to the organism, is highlighted. MRD is available on the World Wide Web. The database is designed as an open-ended system to accommodate the microsatellite repeat information of other genomes whose complete sequences become available in the public domain in the future.
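
The kind of repeat described above (simple tandem repeats of a k-mer, k = 1 to 6) can be located with a short regular-expression sketch. This is not MRD's method, which the abstract does not describe; the function names find_microsatellites and is_primitive, the min_repeats threshold, and the toy sequence are all assumptions made for illustration.

```python
import re


def is_primitive(motif):
    """True if motif is not itself a repetition of a shorter unit (e.g. 'AA' is not)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_microsatellites(seq, min_repeats=3, max_k=6):
    """Return sorted (start, motif, copies) tuples for tandem k-mer repeats, k = 1..max_k.

    min_repeats is an illustrative threshold, not a value taken from the abstract.
    """
    hits = []
    for k in range(1, max_k + 1):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs such as 'AA' or 'CACA' that duplicate a shorter unit.
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    for start, motif, copies in find_microsatellites("GGCACACACATTAGAGAGTTTTT"):
        print(start, motif, copies)
```

Filtering non-primitive motifs keeps a mononucleotide run from being reported again as a dinucleotide or trinucleotide repeat. A real database would additionally map each hit onto annotated regions (UTR, exon, intron, intergenic), which this sketch does not attempt.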

4.
Koressaar T, Remm M. DNA Research, 2012, 19(3): 219-230
Prokaryotes are generally believed to possess small, compactly organized genomes, with repetitive sequences forming only a small part of them. Nonetheless, many prokaryotic genomes in fact contain species-specific repeats (>85 bp long genomic sequences with less than 60% identity to other species), as we have previously demonstrated. However, it is not known at present how frequent such species-specific repeats are and what their functional roles in bacterial genomes may be. Therefore, we have conducted a comprehensive survey of prokaryotic species-specific repeats and characterized them to examine whether there are functional classes among different repeats and how they are related to each other. Of the 613 distinct prokaryotic species analyzed, 97% were found to contain at least one species-specific repeat. Interestingly, the species-specific repeats thus identified appear to be functionally variable in different genomes: in some genomes they are mostly associated with duplicated protein-coding genes, whereas in other genomes they are associated with rRNA and tRNA genes. Contrary to what may be expected, only one-fourth of the species-specific repeats were found to be associated with mobile genetic elements.

5.
Viruses are the most numerous biological entity, existing in all environments and infecting all cellular organisms. Compared with cellular life, the evolution and origin of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores the memory of previous exposure. Our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts using unassembled short-read metagenomic sequencing data. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized circular small genomes that have no reliably predicted protein-coding gene. We predicted the host(s) of approximately 70% of the discovered genomes at the taxonomic level of phylum by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and their host prediction.
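
The abstract above treats terminally redundant sequences as likely complete circular genomes. One simple way to test a contig for terminal redundancy is to look for a prefix that is repeated exactly at the end of the sequence. The sketch below shows only that check; it is not the authors' pipeline, and the function name terminal_redundancy, the min_overlap parameter, and the toy contig are assumptions.

```python
def terminal_redundancy(seq, min_overlap=10):
    """Return the length of the longest prefix of seq that also ends seq.

    Only overlaps between min_overlap and half the sequence length are
    considered. Returns 0 if no such overlap exists. Exact matching only;
    min_overlap is an illustrative threshold.
    """
    max_len = len(seq) // 2
    for length in range(max_len, min_overlap - 1, -1):
        if seq[:length] == seq[-length:]:
            return length
    return 0


if __name__ == "__main__":
    contig = "ACGTACGGTT" + "CCCCAAAAGG" + "ACGTACGGTT"
    overlap = terminal_redundancy(contig, min_overlap=8)
    # Trim the duplicated end to obtain one copy of the putative circular genome.
    circular = contig[:-overlap] if overlap else contig
    print(overlap, len(circular))
```

Real assemblies contain sequencing errors, so a practical check would probably tolerate mismatches in the overlap; exact matching is used here only to keep the example short.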

6.
MOTIVATION: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain. RESULTS: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families. AVAILABILITY: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.

7.
The structure of plant mitochondrial genomes has proven to be complex and difficult to study. Recombination across large and small repeated sequences can result in genome diversity within individual plants, as well as rapid evolutionary change in genome structure. The role of these repeats is becoming more obvious as mitochondrial genomes are examined in detail.

8.
Seven barley species have been compared for the organization of repeated sequences. Quantitative variation of repeated DNA fractions is demonstrated, though the total amount of repeated sequences (reassociation up to Cot=10) in most cases does not vary. The repeats are divided into four groups by their mode of interspecific variability, with the help of dot and blot hybridization of the genomes under study with cloned highly repeated sequences of Hordeum vulgare. The first group contains the pHv7161 family of the most conserved sequences. The second group comprises moderately changing repeats. The third group includes highly variable Hind III repeats of Hordeum genomes, and the fourth group is represented by the pHv7191 family of repeats, which are highly amplified in the H. vulgare genome. Comparative analysis of the content and organization of highly repeated sequences in the genome helps to clarify phylogenetic relationships in the genus and can be used to predict the success of interspecific hybridization.

9.
Exact Tandem Repeats Analyzer 1.0 (E-TRA) combines sequence motif searches with keywords such as 'organs', 'tissues', 'cell lines' and 'development stages' for finding simple exact tandem repeats as well as non-simple repeats. E-TRA has several advanced repeat search parameters/options compared to other repeat finder programs: it not only accepts GenBank, FASTA and expressed sequence tag (EST) sequence files, but also analyzes multiple files with multiple sequences. The minimum and maximum tandem repeat motif lengths that E-TRA finds vary from one to one thousand. Advanced user-defined parameters/options let researchers use different minimum motif repeat search criteria for varying motif lengths simultaneously. One of the most interesting features of genomes is the presence of relatively short tandem repeats (TRs). These repeated DNA sequences are found in both prokaryotes and eukaryotes, distributed almost at random throughout the genome. Some of the tandem repeats play important roles in the regulation of gene expression, whereas others do not have any known biological function as yet. Nevertheless, they have proven to be very beneficial in DNA profiling and genetic linkage analysis studies. To demonstrate the use of E-TRA, we used 5,465,605 human EST sequences derived from 18,814,550 GenBank EST sequences. Our results indicated that 12.44% (679,800) of the human EST sequences contained simple and non-simple repeat string patterns varying from one to 126 nucleotides in length. The results also revealed that human organs, tissues, cell lines and different developmental stages differed in the number of repeats as well as in repeat composition, indicating that the distribution of expressed tandem repeats among tissues or organs is not random, thus differing from the untranscribed repeats found in genomes.

10.
Despite the agricultural importance of both potato and tomato, very little is known about their chloroplast genomes. Analysis of the complete sequences of the tomato, potato, tobacco, and Atropa chloroplast genomes reveals significant insertions and deletions within certain coding regions or regulatory sequences (e.g., deletion of repeated sequences within 16S rRNA, ycf2 or ribosomal binding sites in ycf2). RNA, photosynthesis, and ATP synthase genes are the least divergent, and the most divergent genes are clpP, cemA, ccsA, and matK. Repeat analyses identified 33–45 direct and inverted repeats ≥30 bp with a sequence identity of at least 90%; all but five of the repeats shared by all four Solanaceae genomes are located in the same genes or intergenic regions, suggesting a functional role. A comprehensive genome-wide analysis of all coding sequences and intergenic spacer regions was done for the first time in chloroplast genomes. Only four spacer regions are fully conserved (100% sequence identity) among all genomes; deletions or insertions within some intergenic spacer regions result in less than 25% sequence identity, underscoring the importance of choosing appropriate intergenic spacers for plastid transformation and providing valuable new information on the phylogenetic utility of the chloroplast intergenic spacer regions. Comparison of coding sequences with expressed sequence tags showed a considerable amount of variation, resulting in amino acid changes; none of the C-to-U conversions observed in potato and tomato were conserved in tobacco and Atropa. It is possible that there has been a loss of conserved editing sites in potato and tomato. Electronic Supplementary Material: Supplementary material is available for this article and is accessible to authorized users.

11.
A clustering method for repeat analysis in DNA sequences   Cited by: 1 (self-citations: 0; citations by others: 1)
Volfovsky N, Haas BJ, Salzberg SL. Genome Biology, 2001, 2(8): research0027.1-research0027.11

Background

A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats.

Results

The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences.

Conclusions

We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
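
The abstract above relies on suffix trees to find exact repeats. The same idea can be illustrated with a simpler, swapped-in structure: a naively sorted list of suffixes, in which the longest exact repeat appears as the longest common prefix of two neighbouring suffixes. This sketch is not RepeatFinder and is far less efficient than a suffix tree; the function name longest_exact_repeat and the toy sequence are assumptions.

```python
def longest_exact_repeat(seq):
    """Return (length, first_pos, second_pos) of the longest exact repeat in seq.

    Uses a naively sorted suffix list rather than a suffix tree: fine for
    short examples, too slow and memory-hungry for whole genomes.
    Returns (0, -1, -1) if no character occurs twice.
    """
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    best = (0, -1, -1)
    for a, b in zip(order, order[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        length = 0
        while (
            a + length < len(seq)
            and b + length < len(seq)
            and seq[a + length] == seq[b + length]
        ):
            length += 1
        if length > best[0]:
            best = (length, min(a, b), max(a, b))
    return best


if __name__ == "__main__":
    length, first, second = longest_exact_repeat("ATGCGTACCATGCGTTT")
    print(length, first, second)
```

Collecting every neighbouring pair above a length threshold, instead of only the best one, would give a set of exact repeats that could then be grouped into classes; how RepeatFinder performs that clustering is not specified in the abstract.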

12.
The genomes of many species are dominated by short sequences repeated consecutively. It is estimated that over 10% of the human genome consists of tandemly repeated sequences. Finding repeated regions in long sequences is important in sequence analysis. We developed a software tool, LocRepeat, that finds regions of pseudo-periodic repeats in a long sequence. We use the definition of Li et al. [1] for the pseudo-periodic partition of a region and extend the algorithm so that it can select the repeated region from a given long sequence and give the pseudo-periodic partition of that region.

13.
We have examined the organization of the repeated and single copy DNA sequences in the genomes of two insects, the honeybee (Apis mellifera) and the housefly (Musca domestica). Analysis of the reassociation kinetics of honeybee DNA fragments 330 and 2,200 nucleotides long shows that approximately 90% of both size fragments is composed entirely of non-repeated sequences. Thus honeybee DNA contains few or no repeated sequences interspersed with nonrepeated sequences at a distance of less than a few thousand nucleotides. On the other hand, the reassociation kinetics of housefly DNA fragments 250 and 2,000 nucleotides long indicates that less than 15% of the longer fragments are composed entirely of single copy sequences. A large fraction of the housefly DNA therefore contains repeated sequences spaced less than a few thousand nucleotides apart. Reassociated repetitive DNA from the housefly was treated with S1 nuclease and sized on agarose A-50. The S1 resistant sequences have a bimodal distribution of lengths. Thirty-three percent is greater than 1,500 nucleotide pairs, and 67% has an average size about 300 nucleotide pairs. The genome of the housefly appears to have at least 70% of its DNA arranged as short repeats interspersed with single copy sequences in a pattern qualitatively similar to that of most eukaryotic genomes.

14.
The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes; it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce the main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.

15.
Comparative study of papovavirus DNA: BKV(MM), BKV(WT) and SV40.   Cited by: 8 (self-citations: 2; citations by others: 6)
Extensive physical mapping revealed that approximately 90% of the genomes of BKV (prototype, WT) and BKV (MM strain) are identical or closely related. Nucleotide sequences of the non-homologous regions and a large portion of the homologous regions have been determined for both genomes. The coding sequence of small t antigen of BKV(MM) is 216 nucleotides shorter than that of BKV(WT), even though no difference in biological function of the t antigen was observed. Both genomes contain three similar sets of 44-61 base-pair repeated sequences. However, the DNA sequence of the tandem repeats is totally different between BKV (human cell as host) and SV40 (monkey cell as host). On the other hand, the region between the N-terminus of the T antigen genes and the origin of replication is dominated by a similar set of palindromic sequences in BKV and SV40 DNA. There is also extensive homology between the regions which code for proteins in BKV and SV40, suggesting a close evolutionary relationship.

16.
Integrated retroviral genomes are flanked by direct repeats of sequences derived from the termini of the viral RNA genome. These sequences are designated long terminal repeats (LTRs). We have determined and analyzed the nucleotide sequence of the LTRs from several exogenous and endogenous avian retroviruses. These LTRs possess several structural similarities with eukaryotic and prokaryotic transposable elements: 1) inverted complementary repeats at the termini, 2) deletions of sequences adjacent to the LTR, 3) small duplications of host sequences flanking the integrated provirus, and 4) sequence homologies with transposable and other genetic elements. These observations suggest that LTRs function in the integration and perhaps transposition of retrovirus genomes. Evidence exists for the presence of a strong promoter sequence within the LTR. The retroviral LTR also contains a "Hogness box" upstream of the capping site and a poly(A) signal. These features suggest an additional role for the LTR in the regulation of gene expression.

17.
Direct or inverse repeated sequences are important functional features of prokaryotic and eukaryotic genomes. Considering the unique mechanism, involving single-stranded genomic intermediates, by which adenovirus (Ad) replicates its genome, we investigated whether repetitive homologous sequences inserted into E1-deleted adenoviral vectors would affect replication of viral DNA. In these studies we found that inverted repeats (IRs) inserted into the E1 region could mediate predictable genomic rearrangements, resulting in vector genomes devoid of all viral genes. These genomes (termed DeltaAd.IR) contained only the transgene cassette flanked on both sides by precisely duplicated IRs, Ad packaging signals, and Ad inverted terminal repeat sequences. Generation of DeltaAd.IR genomes could also be achieved by coinfecting two viruses, each providing one inverse homology element. The formation of DeltaAd.IR genomes required Ad DNA replication and appeared to involve recombination between the homologous inverted sequences. The formation of DeltaAd.IR genomes did not depend on the sequence within or adjacent to the inverted repeat elements. The small DeltaAd.IR vector genomes were efficiently packaged into functional Ad particles. All functions for DeltaAd.IR replication and packaging were provided by the full-length genome amplified in the same cell. DeltaAd.IR vectors were produced at a yield of approximately 10^4 particles per cell and could be separated from virions with full-length genomes based on their lighter buoyant density. DeltaAd.IR vectors infected cultured cells with the same efficiency as first-generation vectors; however, transgene expression was only transient due to the instability of deleted genomes within transduced cells. The finding that IRs present within Ad vector genomes can mediate precise genetic rearrangements has important implications for the development of new vectors for gene therapy approaches.

18.
Long sequences from the pigeon genome containing clusters of moderately repeated elements have been cloned. Molecular analysis has shown a dispersed distribution of the repeats in both pigeon and chicken genomes. Within a single cluster, a scrambled distribution of elements belonging to different families of repeats has been shown. Similar repeated sequences have been revealed within clusters. The analysed clusters of repeats are characterized by limited structural variability in the genomes. In situ hybridization revealed the localization of sequences complementary to the cloned clusters in pigeon and chicken macrochromosomes. Preferential localization has been demonstrated in telomeric and centromeric chromosome regions as well as in the region of R-bands.

19.
Several plant mitochondrial genomes contain repeated sequences that are postulated to be sites of homologous intragenomic recombination (1-3). In this report, we have used filter hybridizations to investigate sequence relationships between the cloned mitochondrial DNA (mtDNA) recombination repeats from turnip, spinach and maize and total mtDNA isolated from thirteen species of angiosperms. We find that strong sequence homologies exist between the spinach and turnip recombination repeats and essentially all other mitochondrial genomes tested, whereas a major maize recombination repeat does not hybridize to any other mtDNA. The sequences homologous to the turnip repeat do not appear to function in recombination in any other genome, whereas the spinach repeat hybridizes to reiterated sequences within the mitochondrial genomes of wheat and two species of pokeweed that do appear to be sites of recombination. Thus, although intragenomic recombination is a widespread phenomenon in plant mitochondria, it appears that different sequences either serve as substrates for this function in different species, or else surround a relatively short common recombination site which does not cross-hybridize under our experimental conditions. Identified gene sequences from maize mtDNA were used in heterologous hybridizations to show that the repeated sequences implicated in recombination in turnip and spinach/pokeweed/wheat mitochondria include, or are closely linked to, genes for subunit II of cytochrome c oxidase and 26S rRNA, respectively. Together with previous studies indicating that the 18S rRNA gene in wheat mtDNA is contained within a recombination repeat (3), these results imply an unexpectedly frequent association between recombination repeats and plant mitochondrial genes.

20.
ABSTRACT. Dinoflagellates have among the largest nuclear genomes known, but we know little about their contents or organisation. Given the interest in dinoflagellate ecology, cell biology, and evolutionary biology, there are many reasons to thoroughly investigate the contents of dinoflagellate genomes, but because of their large size the only thorough samples to date have relied on expressed sequence tag surveys to analyse cDNAs. To complement this, there are some studies of the physical properties of dinoflagellate chromosomes, but no direct survey of the nature of the sequences contained within them. To start to build a picture of the contents of these genomes, we have sequenced over 230,000 bp from the nuclear genome of Heterocapsa triquetra, which has been estimated to be 18–23 billion base pairs in total. The survey includes one putative gene with two relict spliced leaders, one putative pseudogene, and a small number of low-complexity repeats, transposons, and other putative selfish elements, all of which account for about 5% of the survey. Another 5% of the survey was long, complex repeats, some highly represented. By far the greatest fraction of the survey (89.5%) is made up of non-repeated sequence with no similarity to any other known sequence.
