Similar articles
 Found 20 similar articles (search time: 561 ms)
1.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected number. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(epsilon) - 1) log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.

2.
An approximate nested tandem repeat (NTR) in a string T is a complex repetitive structure consisting of many approximate copies of two substrings x and X ("motifs") interspersed with one another. NTRs fall into a class of repetitive structures broadly known as subrepeats. NTRs have been found in real DNA sequences and are expected to be important in evolutionary biology, both in understanding evolution of the ribosomal DNA (where NTRs can occur), and as a potential marker in population genetic and phylogenetic studies. This article describes an alignment algorithm for the verification phase of the software tool NTRFinder developed for database searches for NTRs. When the search algorithm has located a subsequence containing a possible NTR, with motifs X and x, a verification step aligns this subsequence against an exact NTR built from the templates X and x, to determine whether the subsequence contains an approximate NTR and its extent. This article describes an algorithm to solve this alignment problem in O(|T|(|X| + |x|)) space and time. The algorithm is based on Fischetti et al.'s wrap-around dynamic programming.
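
The wrap-around dynamic programming idea mentioned above can be illustrated on the simpler single-motif case. The sketch below is not the NTRFinder verification algorithm (which handles two interspersed motifs X and x); it only assumes unit costs for substitutions, insertions, and deletions, and the function name `wraparound_distance` is hypothetical. It computes the smallest edit distance between a text and any substring of the motif repeated indefinitely, in O(|T| |motif|) time.

```python
def wraparound_distance(text: str, motif: str) -> int:
    """Minimum unit-cost edit distance between `text` and any substring
    of `motif` repeated indefinitely (single-motif wrap-around DP)."""
    m = len(motif)
    if m == 0:
        raise ValueError("motif must be non-empty")
    # prev[j]: best cost of aligning the text read so far to a periodic
    # substring whose last consumed motif position is j.
    # The empty text aligns to an empty substring at any phase for free.
    prev = [0] * m
    for ch in text:
        cur = [0] * m
        # Pass 1: diagonal and vertical moves, plus horizontal moves
        # that stay inside the row (no wrap yet).
        for j in range(m):
            diag = prev[j - 1] + (ch != motif[j])  # prev[-1] wraps to the last column
            up = prev[j] + 1                       # text character left unmatched
            cur[j] = min(diag, up)
            if j > 0:
                cur[j] = min(cur[j], cur[j - 1] + 1)  # motif character skipped
        # Pass 2: let horizontal moves wrap from the last column back to the first.
        for j in range(m):
            cur[j] = min(cur[j], cur[j - 1] + 1)
        prev = cur
    return min(prev)
```

For example, `wraparound_distance("ACGACTACG", "ACG")` evaluates to 1, because a single substitution turns the text into three exact copies of the motif. Two passes per row are enough because wrapping all the way around a row can never lower a cell's cost.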

3.
MOTIVATION: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. RESULTS: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. AVAILABILITY: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

4.
Patterns of sequence variation in the mitochondrial D-loop region of shrews   Cited by: 8 (self-citations: 2, citations by others: 6)
Direct sequencing of the mitochondrial displacement loop (D-loop) of shrews (genus Sorex) for the region between the tRNA(Pro) and the conserved sequence block-F revealed variable numbers of 79-bp tandem repeats. These repeats were found in all 19 individuals sequenced, representing three subspecies and one closely related species of the masked shrew group (Sorex cinereus cinereus, S. c. miscix, S. c. acadicus, and S. haydeni) and an outgroup, the pygmy shrew (S. hoyi). Each specimen also possessed an adjacent 76-bp imperfect copy of the tandem repeats. One individual was heteroplasmic for length variants consisting of five and seven copies of the 79-bp tandem repeat. The sequence of the repeats is conducive to the formation of secondary structure. A termination-associated sequence is present in each of the repeats and in a unique sequence region 5' to the tandem array as well. Mean genetic distance between the masked shrew taxa and the pygmy shrew was calculated separately for the unique sequence region, one of the tandem repeats, the imperfect repeat, and these three regions combined. The unique sequence region evolved more rapidly than the tandem repeats or the imperfect repeat. The small genetic distance between pairs of tandem repeats within an individual is consistent with a model of concerted evolution. Repeats are apparently duplicated and lost at a high rate, which tends to homogenize the tandem array. The rate of D-loop sequence divergence between the masked and pygmy shrews is estimated to be 15%-20%/Myr, the highest rate observed in D-loops of mammals. Rapid sequence evolution in shrews may be due either to their high metabolic rate and short generation time or to the presence of variable numbers of tandem repeats.

5.
Given a text of length n, a pattern of length m and an integer k, we present an algorithm for finding all occurrences of the pattern in the text, each with at most k substitutions. The algorithm runs in O(k(m log m + n)) time, and requires O(nk) space. This algorithm has direct implications for nucleotide and amino acid sequence comparisons.
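
To make the problem statement above concrete, here is a minimal brute-force baseline. It is not the O(k(m log m + n)) algorithm of the abstract; it is a plain O(nm) scan with early exit, and the function name `k_mismatch_occurrences` is a hypothetical choice for illustration.

```python
def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Return every start index where `pattern` matches `text`
    with at most `k` substitutions (naive O(n*m) baseline)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        else:
            # The inner loop finished without breaking: at most k mismatches.
            hits.append(start)
    return hits
```

With `text = "ACGTACGA"`, `pattern = "ACGT"` and `k = 1`, the scan reports positions 0 (exact match) and 4 (one substitution). A baseline like this is mainly useful as a correctness reference when testing a faster implementation.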

6.
Exact Tandem Repeats Analyzer 1.0 (E-TRA) combines sequence motif searches with keywords such as ‘organs’, ‘tissues’, ‘cell lines’ and ‘development stages’ for finding simple exact tandem repeats as well as non-simple repeats. E-TRA has several advanced repeat search parameters/options compared to other repeat finder programs as it not only accepts GenBank, FASTA and expressed sequence tags (EST) sequence files, but also does analysis of multiple files with multiple sequences. The minimum and maximum tandem repeat motif lengths that E-TRA finds vary from one to one thousand. Advanced user defined parameters/options let the researchers use different minimum motif repeats search criteria for varying motif lengths simultaneously. One of the most interesting features of genomes is the presence of relatively short tandem repeats (TRs). These repeated DNA sequences are found in both prokaryotes and eukaryotes, distributed almost at random throughout the genome. Some of the tandem repeats play important roles in the regulation of gene expression whereas others do not have any known biological function as yet. Nevertheless, they have proven to be very beneficial in DNA profiling and genetic linkage analysis studies. To demonstrate the use of E-TRA, we used 5,465,605 human EST sequences derived from 18,814,550 GenBank EST sequences. Our results indicated that 12.44% (679,800) of the human EST sequences contained simple and non-simple repeat string patterns varying from one to 126 nucleotides in length. The results also revealed that human organs, tissues, cell lines and different developmental stages differed in number of repeats as well as repeat composition, indicating that the distribution of expressed tandem repeats among tissues or organs is not random, thus differing from the un-transcribed repeats found in genomes.
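
As an illustration of what searching for simple exact tandem repeats with a motif-length range and a minimum copy number can look like, the sketch below uses a regular-expression backreference. It does not reproduce E-TRA's implementation; the function name and the parameters `min_motif`, `max_motif`, and `min_copies` are assumptions chosen for this example.

```python
import re

def find_exact_tandem_repeats(seq: str, min_motif: int = 1,
                              max_motif: int = 6, min_copies: int = 3):
    """Return (start, motif, copies) for each exact tandem repeat in `seq`.

    The lazy quantifier prefers the shortest motif at each position, and
    the backreference requires at least `min_copies` contiguous copies.
    """
    pattern = re.compile(
        r"(.{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]
```

For instance, `find_exact_tandem_repeats("GGCACACACATT")` returns `[(2, "CA", 4)]`: four contiguous copies of the motif CA starting at index 2. Backtracking regular expressions are convenient for short motifs but can become slow on long sequences with large motif ranges.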

7.
The S antigens from different isolates of Plasmodium falciparum exhibit extensive size, charge, and serological diversity. We show here that the S-antigen genes behave as multiple alleles of a single locus. The size heterogeneity results from different numbers, lengths, and/or sequences of tandem repeat units encoded within the S-antigen genes. Two genes studied here encode antigenically different S antigens but nevertheless have closely related tandem repeat sequences. We show that antigenic differences can arise because repeats are translated in different reading frames.

8.

Background

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^-L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

Results

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

Conclusions

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
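
The suffix property stated in the Background of this abstract (sequences sharing an L-long suffix map to nearby CGR points) can be checked with a small sketch. The corner assignment in `CORNERS` and the starting point (0.5, 0.5) follow a common convention for DNA and are assumptions here, not details taken from the abstract; `cgr_point` is a hypothetical name.

```python
# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_point(seq: str, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading all of `seq`.

    Each symbol moves the current point halfway towards that symbol's corner.
    """
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y

# Two sequences that differ in their first three symbols but share
# the 5-long suffix "TTGCA".
a = cgr_point("ACGTTGCA")
b = cgr_point("TTTTTGCA")
gap = max(abs(a[0] - b[0]), abs(a[1] - b[1]))
```

Because every shared symbol halves the difference between the two points, `gap` here is at most 2**-5 in each coordinate. This halving is what lets suffix comparisons be answered from numerical coordinates rather than from a tree traversal.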

9.
A model of a perfect tandem repeat with a random pattern is considered. It extends the notion of approximate tandem repeat and describes a new kind of latent periodicity in biological sequences, which has been named profile periodicity. Based on this model, an original spectral-statistical approach is proposed for estimating the size of the periodicity pattern in sequences of approximate tandem repeats. In contrast to existing approaches, it is applicable in practical conditions of unrepresentative samples (for standard criteria). This approach is suitable and effective for preliminary automated detection of latent periodicity. The advantages of the spectral-statistical approach to estimating the periodicity pattern for approximate tandem repeats are demonstrated in comparison with other methods.

10.
Finding approximate tandem repeats in genomic sequences.   Cited by: 1 (self-citations: 0, citations by others: 1)
An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences. The algorithm is based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeats. The ideas and methods underlying the algorithm are described and its effectiveness on genomic data is demonstrated.

11.

Background

Tandem repeat variation in protein-coding regions will alter protein length and may introduce frameshifts. Tandem repeat variants are associated with variation in pathogenicity in bacteria and with human disease. We characterized tandem repeat polymorphism in human proteins, using the UniGene database, and tested whether these were associated with host defense roles.

Results

Protein-coding tandem repeat copy-number polymorphisms were detected in 249 tandem repeats found in 218 UniGene clusters; observed length differences ranged from 2 to 144 nucleotides, with unit copy lengths ranging from 2 to 57. This corresponded to 1.59% (218/13,749) of proteins investigated carrying detectable polymorphisms in the copy-number of protein-coding tandem repeats. We found no evidence that tandem repeat copy-number polymorphism was significantly elevated in defense-response proteins (p = 0.882). An association with the Gene Ontology term 'protein-binding' remained significant after covariate adjustment and correction for multiple testing. Combining this analysis with previous experimental evaluations of tandem repeat polymorphism, we estimate the approximate mean frequency of tandem repeat polymorphisms in human proteins to be 6%. Because 13.9% of the polymorphisms were not a multiple of three nucleotides, up to 1% of proteins may contain frameshifting tandem repeat polymorphisms.

Conclusion

Around 1 in 20 human proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions. Such polymorphisms are not more frequent among defense-response proteins; their prevalence among protein-binding proteins may reflect lower selective constraints on their structural modification. The impact of frameshifting and longer copy-number variants on protein function and disease merits further investigation.
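
The percentages quoted in the Results of this abstract can be reproduced from the reported counts. The snippet below is only an arithmetic check under the assumption that the "up to 1%" frameshift figure is obtained by multiplying the estimated 6% polymorphism frequency by the 13.9% share of polymorphisms that are not a multiple of three nucleotides; the variable names are illustrative.

```python
# Counts and rates as reported in the abstract.
clusters_with_polymorphism = 218
proteins_investigated = 13_749
estimated_polymorphism_rate = 0.06   # "approximate mean frequency ... 6%"
non_multiple_of_three_share = 0.139  # "13.9% of the polymorphisms"

# Share of investigated proteins carrying a detectable polymorphism.
polymorphic_share = clusters_with_polymorphism / proteins_investigated

# Assumed derivation of the frameshift estimate (product of the two rates).
frameshift_share = estimated_polymorphism_rate * non_multiple_of_three_share

print(f"{polymorphic_share:.2%} of proteins with detectable polymorphisms")
print(f"{frameshift_share:.2%} of proteins with potential frameshifts")
```

The first value rounds to the 1.59% stated in the abstract, and the second stays below the quoted upper bound of 1%.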

12.
Key string algorithm (KSA) can be viewed as a robust computational generalization of the restriction enzyme method. KSA enables robust and effective identification and structural analysis of any given genomic sequence, as in the case of the NCBI assembly of the human genome. We have developed a method, using the total frequency distribution of all r-bp key strings in dependence on the fragment length l, to determine the exact size of all repeats within the given genomic sequence, both of monomeric and HOR type. Subsequently, for particular fragment lengths equal to each of these repeat sizes we compute the partial frequency distribution of r-bp key strings; the key string with the highest frequency is the dominant key string, optimal for segmentation of a given genomic sequence into repeat units. We illustrate how a wide class of 3-bp key strings leads to a key-string-dependent periodic cell which enables simple identification and consensus length determination of HORs, or any other highly convergent repeat of monomeric or HOR type, whether tandem or dispersed. We illustrate the application of KSA to HORs in the human genome and determine consensus HORs in the Build 35.1 assembly. In the next step we compute the suprachromosomal family classification and CENP-B box / pJalpha distributions for HORs. In the case of less convergent repeats, for example the monomeric alpha satellite (20-40% divergence), we searched for an optimal compact key string using the frequency method and developed the concept of a composite key string (GAAAC--CTTTG) or flexible relaxation (28 bp key string), which provides both monomeric alpha satellites and alpha monomer segmentation of the internal HOR structure. This method is also convenient for the study of R-strand (direct) / S-strand (reverse complement) alpha monomer alternations. Using KSA we identified 16 alternating regions of R-strand and S-strand monomers in one contig in chromosome 7.
Use of the CENP-B box and/or pJalpha motif as key string is suitable both for identification of HORs and monomeric patterns and for studies of the CENP-B box / pJalpha distribution. As an example of the application of KSA to sequences outside of HOR regions we present our finding of a tandem repeat with a highly convergent 3434-bp long monomer in chromosome 5 (divergence less than 0.3%).

13.
Tandem repeats finder: a program to analyze DNA sequences.   Cited by: 66 (self-citations: 3, citations by others: 63)
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

14.
The individual haplotyping problem is the computational problem of reconstructing two haplotypes for an individual, according to some optimality criterion, from the individual's fragment sequencing data. This paper builds on the fact that the length of a fragment and the number of fragments covering a SNP (single nucleotide polymorphism) site are both very small compared with the length of the sequenced region and the total number of fragments, and introduces the parameterized haplotyping problems. With m fragments whose maximum length is k_1, n SNP sites, and no more than k_2 fragments covering any SNP site, our algorithms solve the gapless MSR (Minimum SNP Removal) and MFR (Minimum Fragment Removal) problems in time O(n k_1 k_2 + m log m + n k_2 + m k_1) and O(m k_2^2 + m k_1 k_2 + m log m + n k_2 + m k_1), respectively. Since the values of k_1 and k_2 are both small (about 10) in practice, our algorithms are more efficient and applicable than the algorithms of V. Bafna et al., which have time complexity O(m n^2) and O(m^2 n + m^3), respectively.

15.
An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.

16.
In higher eukaryotes, the 5S ribosomal DNA (5S rDNA) is organized in tandem arrays with repeat units composed of a coding region and a non-transcribed spacer sequence (NTS). These tandem arrays can be found on either one or more chromosome pairs. 5S rDNA copies from the tilapia fish, Oreochromis niloticus, were cloned and the nucleotide sequences of the coding region and of the non-transcribed spacer were determined. Moreover, the genomic organization of the 5S rDNA tandem repeats was investigated by fluorescence in situ hybridization (FISH) and Southern blot hybridization. Two 5S rDNA classes, one consisting of 1.4-kb repeats and another one with 0.5-kb repeats, were identified and designated 5S rDNA type I and type II, respectively. An inverted 5S rRNA gene and a 5S rRNA putative pseudogene were also identified inside the tandem repeats of 5S rDNA type I. FISH permitted the visualization of the 5S rRNA genes at three chromosome loci, one of them consisting of arrays of the 5S rDNA type I, and the two others corresponding to arrays of the 5S rDNA type II. The two classes of the 5S rDNA, the presence of pseudogenes, and the inverted genes observed in the O. niloticus genome might be a consequence of the intense dynamics of the evolution of these tandem repeat elements.

17.
The ends of eukaryotic chromosomes have special properties and roles in chromosome behavior. Selection for telomere function in yeast, using a Chinese hamster hybrid cell line as the source DNA, generated a stable yeast artificial chromosome clone containing 23 kb of DNA adjacent to (TTAGGG)n, the vertebrate telomeric repeat. The common repetitive element d(GT)n appeared to be responsible for most of the other stable clones. Circular derivatives of the TTAGGG-positive clone that could be propagated in E. coli were constructed. These derivatives identify a single pair of hamster telomeres by fluorescence in situ hybridization. The telomeric repeat tract consists of (TTAGGG)n repeats with minor variations, some of which can be cleaved with the restriction enzyme MnlI. Blot hybridization with genomic hamster DNA under stringent conditions confirms that the TTAGGG tracts are cleaved into small fragments due to the presence of this restriction enzyme site, in contrast to mouse telomeres. Additional blocks of (TTAGGG)n repeats are found 4–5 kb internally on the clone. The terminal region of the clone is dominated by a novel A-T rich 78 bp tandemly repeating sequence; the repeat monomer can be subdivided into halves distinguished by more or less adherence to the consensus sequence. The sequence in genomic DNA has the same tandem organization in probably a single primary locus of >20–30 kb and is thus termed a minisatellite.

18.
The current pace of the generation of sequence data requires the development of software tools that can rapidly provide full annotation of the data. We have developed a new method for rapid sequence comparison using the exact match algorithm without repeat masking. As a demonstration, we have identified all perfect simple tandem repeats (STR) within the draft sequence of the human genome. The STR elements (chromosome, position, length and repeat subunit) have been placed into a relational database. Repeat flanking sequence is also publicly accessible at http://grid.abcc.ncifcrf.gov. To illustrate the utility of this complete set of STR elements, we documented the increased density of potentially polymorphic markers throughout the genome. The new STR markers may be useful in disease association studies because so many STR elements manifest multiallelic polymorphism. Also, because triplet repeat expansions are important for human disease etiology, we identified trinucleotide repeats that exist within exons of known genes. This resulted in a list that includes all 14 genes known to undergo polynucleotide expansion, and 48 additional candidates. Several of these are non-polyglutamine triplet repeats. Other examinations of the STR database demonstrated repeats spanning splice junctions and identified SNPs within repeat elements.

19.
DNA of the oncogenic strain BC-1 of Marek's disease virus contains three units of tandem direct repeats with 132 base pairs in the terminal repeat and internal repeat, respectively, of the long region of the Marek's disease virus genome, whereas the attenuated, nononcogenic viral DNA contains multiple units of the tandem direct repeats.

20.
M Simon, M Phillips, H Green. Genomics, 1991, 9(4):576-580
The coding region of the involucrin gene in higher primates contains a segment consisting of numerous tandem repeats of a 10-codon sequence. The process of repeat addition began in a common ancestor of all higher primates and subsequent repeats were added vectorially. As a result, the principal site of repeat addition has moved in the 3' to 5' direction and the most recently generated repeats (the late region) are close to the 5' end of the segment of repeats. In the human, most of the late region is made up of two different blocks, each consisting of nearly identical repeats. We describe here five polymorphic forms resulting from the addition of differing numbers of repeats to each block. As the variety and nature of the polymorphic alleles are different in different human populations, we postulate that the process of repeat addition is genetically determined.
