Similar literature
 20 similar documents found
1.
Tandem repeats finder: a program to analyze DNA sequences.   Total citations: 66, self-citations: 3, citations by others: 63
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.
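
The definition above (two or more contiguous copies of a pattern) can be illustrated for the exact-copy case with a naive scan. This is only a sketch and is not the Tandem Repeats Finder algorithm, which handles approximate copies using statistical criteria; the function name and its parameters are hypothetical.

```python
def find_exact_tandem_repeats(seq, min_copies=2, max_period=6):
    """Return (start, period, copies) for maximal runs of exact tandem copies.

    Hypothetical helper for illustration only: exact matches, no indels.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + 2 * period <= n:
            # Extend while each character equals the one `period` positions back.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            copies = (j - i) // period  # number of complete copies in the run
            if copies >= min_copies:
                hits.append((i, period, copies))
                i = j - period + 1  # continue just past the end of this run
            else:
                i += 1
    return hits


print(find_exact_tandem_repeats("GATTACACACAGG", max_period=3))
```

Note that a run such as AAAA is reported both as period 1 and as period 2; a practical tool would collapse such redundant reports.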

2.
Tandem repeats occur frequently in biological sequences. They are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case that an unknown number of tandem repeat segments of the same pattern are dispersively distributed in a sequence. We construct a probabilistic generative model for the tandem repeats, where the sequence pattern is represented by a motif matrix. A Bayesian approach is adopted to compute this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution as an effort to infer both the motif matrix of tandem repeats and the location of repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic data and real data show that this new approach is powerful in detecting dispersed short tandem repeats. As far as we know, it is the first work to adopt RJMCMC algorithms in the detection of tandem repeats.

3.

Background  

Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been used extensively to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats.
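
As a rough illustration of DFT-based periodicity detection, the sketch below sums the DFT power of the four base-indicator sequences and picks the frequency with the largest power. It is a generic textbook construction, not the method of the paper; the function names are hypothetical, and the direct evaluation is quadratic in sequence length, so it suits short sequences only.

```python
import cmath


def dft_power(seq, k):
    """Total DFT power at frequency index k over the A, C, G, T indicator sequences."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total


def dominant_period(seq):
    """Period n/k for the frequency index k (1 <= k <= n/2) with the largest power."""
    n = len(seq)
    best_k = max(range(1, n // 2 + 1), key=lambda k: dft_power(seq, k))
    return n / best_k


print(dominant_period("ACG" * 10))
```

For a perfect repeat of a 3-base unit the power concentrates at k = n/3, so the reported period is 3; real sequences give broader, noisier peaks.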

4.
Genomic DNA contains a wide variety of repetitive sequences. In Escherichia coli, several classes of repetitive sequences have been reported, some of which cluster as tandem repeats. We propose a novel method for analyzing symbolic sequences by two-dimensional pattern formation with color-coding. We applied this method to search for tandem repeats in the E. coli genome and found approximately 50 repeats with periods longer than 30 bases. The longest repeat has a period of 1267 bases.

5.
Original spectral-statistical methods were developed to recognize a new type of latent periodicity in DNA, called latent profile periodicity, or latent profility. Searching for latent profility allows the detection of different levels of information coding in genes and local DNA segments.

6.
MOTIVATION: One of the main tasks of DNA sequence analysis is identification of repetitive patterns. DNA symbol repetitions play a key role in a number of applications, including prediction of gene and exon locations, identification of diseases, reconstruction of human evolutionary history and DNA forensics. RESULTS: A new approach towards identification of tandem repeats in DNA sequences is proposed. The approach is a refinement of a previously considered method, based on the complex periodicity transform. The refinement is obtained, among other things, by mapping DNA symbols to pure quaternions. This mapping results in an enhanced, symbol-balanced sensitivity of the transform to DNA patterns, and an unambiguous threshold selection criterion. Computational efficiency of the transform is further improved, and coupling of the computation with the period value is removed, thereby facilitating parallel implementation of the algorithm. Additionally, a post-processing stage is inserted into the algorithm, enabling unambiguous display of results in a convenient graphical format. Comparison of the quaternionic periodicity transform with two well-known pattern detection techniques shows that the new approach is competitive with these two techniques in detection of exact and approximate repeats.

7.
MOTIVATION: Repetitive DNA sequences are abundant in genomes, and efficient mining of significant repeats is important as the first step of repetitive sequence research. Although many computational tools for this purpose, either automatic or visualization-based, have been developed, detection and analysis of approximate repeats is still a non-trivial task. RESULTS: Auto Dot PLOT (Adplot), a dotplot-like repetitive pattern visualization program with a window filtering based on iid Bernoulli trials, is developed and applied to yeast chromosomes and the human T cell receptor locus sequence. Typical examples found in yeast chromosomes 1 and 10 and a tandem repeat of periods longer than 10,000 bp in the human T cell receptor locus are presented. A complex structure composed of both direct and palindromic repeats found in yeast chromosome 10 is also visualized as a specific dot pattern. Computational time measured on a Pentium 3 PC for each yeast auto chromosome with a standard parameter setting scales linearly and is below 10 s per chromosome, indicating the efficiency of the program. From the examples, it is shown that Adplot can visualize approximate local repeat structures and provide diagnostic power for inferring the duplicational history of repeats. AVAILABILITY: Adplot can be obtained by an e-mail request.

8.
In this study we have identified and characterized dopamine receptor D4 (DRD4) exon III tandem repeats in 33 publicly available nucleotide sequences from different mammalian species. We found that the tandem repeat in canids could be described in a novel and simple way, namely, as a structure composed of 15- and 12-bp modules. Tandem repeats composed of 18-bp modules were found in sequences from the horse, zebra, onager, and donkey, Asiatic bear, polar bear, common raccoon, dolphin, harbor porpoise, and domestic cat. Several of these sequences have been analyzed previously without a tandem repeat being found. In the domestic cow and gray seal we identified tandem repeats composed of 36-bp modules, each consisting of two closely related 18-bp basic units. A tandem repeat consisting of 9-bp modules was identified in sequences from mink and ferret. In the European otter we detected an 18-bp tandem repeat, while a tandem repeat consisting of 27-bp modules was identified in a sequence from European badger. Both these tandem repeats were composed of 9-bp basic units, which were closely related to the 9-bp repeat modules identified in the mink and ferret. Tandem repeats could not be identified in sequences from rodents. All tandem repeats possessed a high GC content with a strong bias for C. On phylogenetic analysis of the tandem repeats, evolutionarily related species were clustered into the same groups. The degree of conservation of the tandem repeats varied significantly between species. The deduced amino acid sequences of most of the tandem repeats exhibited a high propensity for disorder. This was also the case with an amino acid sequence of the human DRD4 exon III tandem repeat, which was included in the study for comparative purposes. We identified proline-containing motifs for SH3 and WW domain binding proteins, potential phosphorylation sites, PDZ domain binding motifs, and FHA domain binding motifs in the amino acid sequences of the tandem repeats. The numbers of potential functional sites varied pronouncedly between species. Our observations provide a platform for future studies of the architecture and evolution of the DRD4 exon III tandem repeat, and they suggest that differences in the structure of this tandem repeat contribute to specialization and generation of diversity in receptor function.

9.
MOTIVATION: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. RESULTS: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. AVAILABILITY: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

10.
A variable-number tandem repeat from a porcine glucosephosphate isomerase intron has been isolated and sequenced. The repeat has a unit size of 39 bp, is highly conserved and is present in at least 14 copies. Flanking sequences show a sequence periodicity of 53-54 bp and some sequence homology to the 39 bp repeat. A considerable part of the genomic DNA has been lost during subcloning and is considered to be deletion prone or refractory to propagation in E. coli. The tandem repeat is locus specific and detects at least six alleles in BamHI-digested porcine DNA. No homology to other tandem repeat sequences has been found.

11.
A method of computer analysis of DNA sequences has been proposed. It is based on the information similarity of the compared sequences, and it significantly increases the usefulness of the computer analysis. This approach has been applied to the search for interconnected areas of Alu-repeats and replication origins of the p15A and R6K plasmids. An Alu-like region located in the first stem of the secondary structure of RNA-1 and the E. coli RNA-polymerase binding site has been found in p15A. In the R6K replication origin, Alu-like repeats have been found in the area of tandem 22 bp repeats. This comparison also revealed a hidden periodicity in the sequence of the human Alu-repeat. A hypothesis explaining the data obtained has been proposed. The proposed approach may be used as a method for revealing DNA sequences that have similar genetic functions.

12.
An algorithm for approximate tandem repeats.   Total citations: 4, self-citations: 0, citations by others: 4
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criteria of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = ūû, for which the Hamming distance of ū and û is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of ū and û is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u^a u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.
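
The Hamming-distance definition of an approximate single tandem repeat can be checked directly for a given string. The sketch below only tests the definition on one candidate string; it is not the paper's O(nk log (n/k)) algorithm for reporting all locally optimal repeats, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_approx_single_repeat(s, k):
    """True if s splits into two equal-length halves with at most k mismatches."""
    if len(s) == 0 or len(s) % 2 != 0:
        return False
    half = len(s) // 2
    return hamming(s[:half], s[half:]) <= k


print(is_approx_single_repeat("abcabc", 0))    # perfect repeat
print(is_approx_single_repeat("abcdaacd", 1))  # one mismatch between the halves
```

With k = 0 the check reduces to the perfect single tandem repeat; the abstract's example abcdaacd needs k of at least 1, because its halves abcd and aacd differ in one position.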

13.
Internal repeats in protein sequences have wide-ranging implications for the structure and function of proteins. A keen analysis of the repeats in protein sequences may help us to better understand the structural organization of proteins and their evolutionary relations. In this paper, a mathematical method for searching for latent periodicity in protein sequences is developed. Using this method, we identified simple sequence repeats in the alkaline proteases and found that the sequences could show the same periodicity as their tertiary structures. This result may help us to reduce difficulties in the study of the relationship between sequences and their structures.

14.
Finding approximate tandem repeats in genomic sequences.   Total citations: 1, self-citations: 0, citations by others: 1
An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences. The algorithm is based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeats. The ideas and methods underlying the algorithm are described and its effectiveness on genomic data is demonstrated.

15.
Latent amino acid repeats seem to be widespread in genetic sequences and to reflect their structure, function, and evolution. We have recently identified latent periodicity in more than 150 protein families including protein kinases and various nucleotide-binding proteins. The latent repeats in these families were correlated to their structure and evolution. However, a majority of known protein families were not identified with our latent periodicity search algorithm. The presumed main reason for this was the inability of our techniques to identify periodicities interspersed with insertions and deletions. We designed a new latent periodicity search algorithm, which is capable of taking into account insertions and deletions. As a result, we identified many novel cases of latent periodicity peculiar to protein families. Possible origins of the periodic structure of these families are discussed. In summary, we presume that latent periodicity is present in a substantial portion of known protein families. The latent periodicity matrices and the results of Swiss-Prot scans are available from http://bioinf.narod.ru/del/.

16.
An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.

17.
MOTIVATION: While the mechanism for regulating alternative splicing is poorly understood, secondary structure has been shown to be integral to this process. Due to their propensity for forming complementary hairpin loops and their elevated mutation rates, tandem repeated sequences have the potential to influence splicing regulation. RESULTS: An analysis of human intronic sequences reveals a strong correlation between alternative splicing and the prevalence of mono- through hexanucleotide tandem repeats that may engage in complementary pairing in introns that flank alternatively spliced exons. While only 44% of the 18 173 genes in the Human Alternative Splicing Database are known to be alternatively spliced, they contain 84% of the 694 237 intronic complementary repeat pairs. Significantly, the normalized frequency and distribution of repeat sequences, independent of their potential for pairing, are indistinguishable between alternatively spliced and non-alternatively spliced genes. Thus, the increased prevalence of repeats with pairing potential in alternatively spliced genes is not merely a consequence of more repeats or repeat composition bias. These results suggest that complementary repeats may play a role in the regulation of alternative splicing. CONTACT: harold.garner@utsouthwestern.edu.

18.
Exact Tandem Repeats Analyzer 1.0 (E-TRA) combines sequence motif searches with keywords such as ‘organs’, ‘tissues’, ‘cell lines’ and ‘development stages’ for finding simple exact tandem repeats as well as non-simple repeats. E-TRA has several advanced repeat search parameters/options compared to other repeat finder programs, as it not only accepts GenBank, FASTA and expressed sequence tag (EST) sequence files, but also analyzes multiple files with multiple sequences. The minimum and maximum tandem repeat motif lengths that E-TRA finds vary from one to one thousand. Advanced user-defined parameters/options let researchers use different minimum motif repeat search criteria for varying motif lengths simultaneously. One of the most interesting features of genomes is the presence of relatively short tandem repeats (TRs). These repeated DNA sequences are found in both prokaryotes and eukaryotes, distributed almost at random throughout the genome. Some of the tandem repeats play important roles in the regulation of gene expression whereas others do not have any known biological function as yet. Nevertheless, they have proven to be very beneficial in DNA profiling and genetic linkage analysis studies. To demonstrate the use of E-TRA, we used 5,465,605 human EST sequences derived from 18,814,550 GenBank EST sequences. Our results indicated that 12.44% (679,800) of the human EST sequences contained simple and non-simple repeat string patterns varying from one to 126 nucleotides in length. The results also revealed that human organs, tissues, cell lines and different developmental stages differed in number of repeats as well as repeat composition, indicating that the distribution of expressed tandem repeats among tissues or organs is not random, thus differing from the un-transcribed repeats found in genomes.

19.
All bacterial genomes contain multiple loci of repetitive DNA. Repeat unit sizes and repeat sequences may vary when multiple loci are considered for different isolates of an individual microbial species. Moreover, it has been documented on many occasions that the number of repeat units per locus is a strain-defining parameter. Consequently, there is isolate-specificity in the number of repeats per locus when different strains of a given bacterial species are compared. The experimental assessment of this variability for a number of different loci has been called 'multilocus variable number of tandem repeat analysis' (MLVA). The approach can be supported or extended by locus-specific DNA sequencing for establishing mutations in the individual repeat units, which usually enhances the resolution of the approach considerably. Essentially, MLVA with or without supportive sequencing has been developed for all of the medically relevant bacterial species and can be used effectively for tracing outbreaks or other forms of bacterial dissemination. MLVA is a modern, timely and versatile bacterial typing methodology.

20.
Tandemly repeated sequences are a major component of the eukaryotic genome. Although the general characteristics of tandem repeats have been well documented, the processes involved in their origin and maintenance remain unknown. In this study, a region on the paternal sex ratio (PSR) chromosome was analyzed to investigate the mechanisms of tandem repeat evolution. The region contains a junction between a tandem array of PSR2 repeats and a copy of the retrotransposon NATE, with other dispersed repeats (putative mobile elements) on the other side of the element. Little similarity was detected between the sequence of PSR2 and the region of NATE flanking the array, indicating that the PSR2 repeat did not originate from the underlying NATE sequence. However, a short region of sequence similarity (11/15 bp) and an inverted region of sequence identity (8 bp) are present on either side of the junction. These short sequences may have facilitated nonhomologous recombination between NATE and PSR2, resulting in the formation of the junction. Adjacent to the junction, the three most terminal repeats in the PSR2 array exhibited a higher sequence divergence relative to internal repeats, which is consistent with a theoretical prediction of the unequal exchange model for tandem repeat evolution. Other NATE insertion sites were characterized which show proximity to both tandem repeats and complex DNAs containing additional dispersed repeats. An "accretion model" is proposed to account for this association by the accumulation of mobile elements at the ends of tandem arrays and into "islands" within arrays. Mobile elements inserting into arrays will tend to migrate into islands and to array ends, due to the turnover in the number of intervening repeats. Received: 18 August 1997 / Accepted: 18 September 1998
