首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 500 毫秒
1.
An algorithm for approximate tandem repeats.   总被引:4,自引:0,他引:4  
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = umacro ?, for which the Hamming distance of umacro and ? is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of umacro and ? is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.  相似文献   

2.
3.
Distant repeats in protein sequence play an important role in various aspects of protein analysis. A keen analysis of the distant repeats would enable to establish a firm relation of the repeats with respect to their function and three-dimensional structure during the evolutionary process. Further, it enlightens the diversity of duplication during the evolution. To this end, an algorithm has been developed to find all distant repeats in a protein sequence. The scores from Point Accepted Mutation (PAM) matrix has been deployed for the identification of amino acid substitutions while detecting the distant repeats. Due to the biological importance of distant repeats, the proposed algorithm will be of importance to structural biologists, molecular biologists, biochemists and researchers involved in phylogenetic and evolutionary studies.  相似文献   

4.
Tandem repeats finder: a program to analyze DNA sequences.   总被引:66,自引:3,他引:63       下载免费PDF全文
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cellreceptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface atc3.biomath.mssm.edu/trf.html has been established for automated use of the program.  相似文献   

5.

Background

Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data.

Results

Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. We assumed that the most abundant tandem repeat is the centromere DNA, which was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond approximately 50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution.

Conclusions

While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animal and plant genomes. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.  相似文献   

6.
MOTIVATION: Microsatellites, also known as simple sequence repeats, are the tandem repeats of nucleotide motifs of the size 1-6 bp found in every genome known so far. Their importance in genomes is well known. Microsatellites are associated with various disease genes, have been used as molecular markers in linkage analysis and DNA fingerprinting studies, and also seem to play an important role in the genome evolution. Therefore, it is of importance to study distribution, enrichment and polymorphism of microsatellites in the genomes of interest. For this, the prerequisite is the availability of a computational tool for extraction of microsatellites (perfect as well as imperfect) and their related information from whole genome sequences. Examination of available tools revealed certain lacunae in them and prompted us to develop a new tool. RESULTS: In order to efficiently screen genome sequences for microsatellites (perfect as well as imperfect), we developed a new tool called IMEx (Imperfect Microsatellite Extractor). IMEx uses simple string-matching algorithm with sliding window approach to screen DNA sequences for microsatellites and reports the motif, copy number, genomic location, nearby genes, mutational events and many other features useful for in-depth studies. IMEx is more sensitive, efficient and useful than the available widely used tools. IMEx is available in the form of a stand-alone program as well as in the form of a web-server. AVAILABILITY: A World Wide Web server and the stand-alone program are available for free access at http://203.197.254.154/IMEX/ or http://www.cdfd.org.in/imex.  相似文献   

7.
Uniformly repeated DNA sequences in genomes known as tandem repeats are one of the most interesting features of many organisms analyzed so far. Among the tandem repeats, microsatellites have attracted many researchers since their associations in several human diseases. The discovery of tandem repeats in the expressed sequence tags (ESTs) or in the cDNA libraries contributed to new ideas and tools for evolutionary studies. With the advent of new biotechnological tools the number of ESTs deposited in databases is rapidly increasing. Therefore, new informative bioinformatics tools are needed to assist the analysis and interpretation of these tandem repeats in ESTs and in other type of DNAs. In the present study we report two new utility tools; Organism Miner and Keyword Finder. Organism Miner utility collects, sorts, splice and provides statistical overview on DNA data files. Keyword Finder analyses all the sequences in the input folder and extracts and collects keywords for each specific organism or the all the organisms, which have the DNA sequence and generates statistical overview. We are currently generating cotton and pepper cDNA libraries and often using the GenBank DNA sequences. Therefore, in this study we used cDNAs and ESTs of cotton and pepper for the demonstrating the use of these two tools. With help of these two utilities we observed that most of ESTs are useful for downstream applications such as mining microsatellites specific to an organ, tissue or development stage. The analyses of ESTs indicated that not only tandem repeats existed in ESTs but also tandem repeats differentially presented in different organ or tissue specific ESTs within and between the species. Utilities and the sample data sets are self-extracting files and freely available from or can be obtained upon request from the corresponding author.  相似文献   

8.
Sequence alignment by cross-correlation.   总被引:1,自引:0,他引:1  
Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches for the matching of DNA sequences are based either on global alignment schemes or heuristic schemes that seek to approximate global alignment algorithms while providing higher computational efficiency. This report describes an approach using the mathematical operation of cross-correlation to compare sequences. It can be implemented using the fast fourier transform for computational efficiency. The algorithm is summarized and sample applications are given. These include gene sequence alignment in long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, being able to find sequence similarities where other alignment algorithms may perform poorly.  相似文献   

9.
10.
The main feature of global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of complete set of a K-string ensemble which enables a new method of direct mapping of symbolic DNA sequence into frequency domain, with straightforward identification of repeats as peaks in GRM diagram. In this way, we obtain very fast, efficient and highly automatized repeat finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use, in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of Period 3 pattern in exons, implementation of ‘magnifying glass’ effect, identification of complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).  相似文献   

11.
An efficient algorithm is described for finding matches, repeats and other word relations, allowing for errors, in large data sets of long molecular sequences. The algorithm entails hashing on fixed-size words in conjunction with the use of a linked list connecting all occurrences of the same word. The average memory and run time requirement both increase almost linearly with the total sequence length. Some results of the program's performance on a database of Escherichia coli DNA sequences are presented.  相似文献   

12.
Conserved features of imprinted differentially methylated domains   总被引:1,自引:0,他引:1  
Genomic imprinting is a conserved epigenetic phenomenon in eutherian mammals, with regards both to the genes that are imprinted and the mechanism underlying the expression of just one of the parental alleles. Epigenetic modifications of alleles of imprinted genes are established during oogenesis and spermatogenesis, and these modifications are then inherited. Differentially methylated domains (DMDs) of imprinted genes are the genomic sites of these inherited epigenetic imprints. We previously showed that CpG-rich imperfect tandem direct repeats within three different mouse DMDs (Snurf/Snrpn, Kcnq1 and Igf2r), each with a unique sequence, play a central role in maintaining the differential methylation. This finding implicates repeat-related DNA structure, not sequence, in the imprinting mechanism. To better define the important features of this signal, we compared sequences of these three DMD tandem repeats among mammalian species. All DMD repeats contain short indirect repeats, many of which are organized into larger unit repeats. Even though the larger repeat units undergo deletion and addition during evolution (most likely through unequal crossovers during meiosis), the size of DMD tandem repeated regions has remained remarkably stable during mammalian evolution. Moreover, all three DMD tandem repeats have a high-CpG content, an ordered arrangement of CpG dinucleotides, and similar predicted secondary structures. These observations suggest that a structural feature or features of these DMD tandem repeats is the conserved DMD imprinting signal.  相似文献   

13.
Patterns of sequence variation in the mitochondrial D-loop region of shrews   总被引:8,自引:2,他引:6  
Direct sequencing of the mitochondrial displacement loop (D-loop) of shrews (genus Sorex) for the region between the tRNA(Pro) and the conserved sequence block-F revealed variable numbers of 79-bp tandem repeats. These repeats were found in all 19 individuals sequenced, representing three subspecies and one closely related species of the masked shrew group (Sorex cinereus cinereus, S. c. miscix, S. c. acadicus, and S. haydeni) and an outgroup, the pygmy shrew (S. hoyi). Each specimen also possessed an adjacent 76-bp imperfect copy of the tandem repeats. One individual was heteroplasmic for length variants consisting of five and seven copies of the 79-bp tandem repeat. The sequence of the repeats is conducive to the formation of secondary structure. A termination-associated sequence is present in each of the repeats and in a unique sequence region 5' to the tandem array as well. Mean genetic distance between the masked shrew taxa and the pygmy shrew was calculated separately for the unique sequence region, one of the tandem repeats, the imperfect repeat, and these three regions combined. The unique sequence region evolved more rapidly than the tandem repeats or the imperfect repeat. The small genetic distance between pairs of tandem repeats within an individual is consistent with a model of concerted evolution. Repeats are apparently duplicated and lost at a high rate, which tends to homogenize the tandem array. The rate of D- loop sequence divergence between the masked and pygmy shrews is estimated to be 15%-20%/Myr, the highest rate observed in D-loops of mammals. Rapid sequence evolution in shrews may be due either to their high metabolic rate and short generation time or to the presence of variable numbers of tandem repeats.   相似文献   

14.
An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.  相似文献   

15.
16.
Polymers of random 14 mer oligonucleotides are shown to detect discrete loci in the human genome. Eighteen different synthetic tandem repeats of random 14 base-pair units (STRs) have been generated and all of them turn out to detect polymorphic loci on southern blots of human DNA samples, presumably corresponding to a variable number of tandem repeats (VNTR). This finding suggests that minisatellites are a major component of the human genome and are strongly associated with the generation of genetic variability. In addition, it should open new strategies to make new polymorphic probes available.  相似文献   

17.
Computer-based sequence analysis, notation, and manipulation are a necessity for all molecular biologists working with any but the most simple DNA sequences. As sequence data become increasingly available, tools that can be used to manipulate and annotate individual sequences and sequence elements will become an even more vital implement in the molecular biologist's arsenal. The Omiga DNA and Protein Sequence Analysis Software tool, version 2.0 provides an effective and comprehensive tool for the analysis of both nucleic acid and protein sequences that runs on a standard PC available in every molecular biology laboratory. Omiga allows the import of sequences in several common formats. Upon importing sequences and assigning them to various projects, Omiga allows the user to produce, analyze, and edit sequence alignments. Sequences may also be queried for the presence of restriction sites, sequence motifs, and other sequence features, all of which can be added into the notations accompanying each sequence. This newest version of Omiga also allows for sequencing and polymerase chain reaction (PCR) primer prediction, a functionality missing in earlier versions. Finally, Omiga allows rapid searches for putative coding regions, and Basic Local Alignment Search Tool (BLAST) queries against public databases at the National Center for Biotechnology Information (NCBI).  相似文献   

18.
BC Faircloth  TC Glenn 《PloS one》2012,7(8):e42543
Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag sequences should be numerous and sufficiently different to ensure sequencing, replication, and oligonucleotide synthesis errors do not cause tags to be unrecoverable or confused. However, many design approaches only protect against substitution errors during sequencing and extant tag sets contain too few tag sequences. We developed an open-source software package to validate sequence tags for conformance to two distance metrics and design sequence tags robust to indel and substitution errors. We use this software package to evaluate several commercial and non-commercial sequence tag sets, design several large sets (maxcount = 7,198) of edit metric sequence tags having different lengths and degrees of error correction, and integrate a subset of these edit metric tags to polymerase chain reaction (PCR) primers and sequencing adapters. We validate a subset of these edit metric tagged PCR primers and sequencing adapters by sequencing on several platforms and subsequent comparison to commercially available alternatives. We find that several commonly used sets of sequence tags or design methodologies used to produce sequence tags do not meet the minimum expectations of their underlying distance metric, and we find that PCR primers and sequencing adapters incorporating edit metric sequence tags designed by our software package perform as well as their commercial counterparts. We suggest that researchers evaluate sequence tags prior to use or evaluate tags that they have been using. The sequence tag sets we design improve on extant sets because they are large, valid across the set, and robust to the suite of substitution, insertion, and deletion errors affecting massively parallel sequencing workflows on all currently used platforms.  相似文献   

19.
基于后缀列的基因序列最大串联重复查找技术   总被引:1,自引:0,他引:1  
重复序列分析在全基因组研究中起着重要作用,其首要任务就是在DNA序列中识别并定位所有的重复结构。本文提出了一种新的算法,此算法基于一种简单的数据结构——后缀数,用于查找给定的DNA序列中所有的最大串联重复。并且在该算法的基础上编写了一个有效实用的软件——RepLocate,同时给出了它应用到已知的DNA序列的实例。  相似文献   

20.
The wcd system is an open source tool for clustering expressed sequence tags (EST) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d(2) distance function or edit distance, improving existing implementations of d(2). It supports merging, refinement and reclustering of clusters. It is 'drop in' compatible with the StackPack clustering package. wcd supports parallelization under both shared memory and cluster architectures. It is distributed with an EMBOSS wrapper allowing wcd to be installed as part of an EMBOSS installation (and so provided by a web server). AVAILABILITY: wcd is distributed under a GPL licence and is available from http://code.google.com/p/wcdest. SUPPLEMENTARY INFORMATION: Additional experimental results. The wcd manual, a companion paper describing underlying algorithms, and all datasets used for experimentation can also be found at www.bioinf.wits.ac.za/~scott/wcdsupp.html.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号