首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The paper investigates the computational problem of predicting RNA secondary structures. The general belief is that allowing pseudoknots makes the problem hard. Existing polynomial-time algorithms are heuristic algorithms with no performance guarantee and can handle only limited types of pseudoknots. In this paper, we initiate the study of predicting RNA secondary structures with a maximum number of stacking pairs while allowing arbitrary pseudoknots. We obtain two approximation algorithms with worst-case approximation ratios of 1/2 and 1/3 for planar and general secondary structures, respectively. For an RNA sequence of n bases, the approximation algorithm for planar secondary structures runs in O(n(3)) time while that for the general case runs in linear time. Furthermore, we prove that allowing pseudoknots makes it NP-hard to maximize the number of stacking pairs in a planar secondary structure. This result is in contrast with the recent NP-hard results on psuedoknots which are based on optimizing some general and complicated energy functions.  相似文献   

2.
In this paper, we study the problem of computing the similarity of two protein structures by measuring their contact-map overlap. Contact-map overlap abstracts the problem of computing the similarity of two polygonal chains as a graph-theoretic problem. In R3, we present the first polynomial time algorithm with any guarantee on the approximation ratio for the 3-dimensional problem. More precisely, we give an algorithm for the contact-map overlap problem with an approximation ratio of sigma where sigma = min{sigma(P1), sigma(P2)} 0, is hard.  相似文献   

3.
An algorithm for approximate tandem repeats.   总被引:4,自引:0,他引:4  
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g., abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g., abcdaacd. In this paper we consider two criterions of similarity: the Hamming distance (k mismatches) and the edit distance (k differences). For a string S of length n and an integer k our algorithm reports all locally optimal approximate repeats, r = umacro ?, for which the Hamming distance of umacro and ? is at most k, in O(nk log (n/k)) time, or all those for which the edit distance of umacro and ? is at most k, in O(nk log k log (n/k)) time. This paper concentrates on a more general type of repeat called multiple tandem repeats. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u(a)u', where u is a prefix of r and u' is a prefix of u. An approximate multiple tandem repeat is a multiple repeat with errors; the repeated subsequences are similar but not identical. We precisely define approximate multiple repeats, and present an algorithm that finds all repeats that concur with our definition. The time complexity of the algorithm, when searching for repeats with up to k errors in a string S of length n, is O(nka log (n/k)) where a is the maximum number of periods in any reported repeat. We present some experimental results concerning the performance and sensitivity of our algorithm. The problem of finding repeats within a string is a computational problem with important applications in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats occurring in the genome are known to be related to diseases in the human.  相似文献   

4.
We explored the possibilities of whole-genome duplication (WGD) in prokaryotic species,where we performed statistical analyses of the configurations of the central angles between homologous tandem repeats (TRs) on the circular chromosomes.At first,we detected TRs on their chromosomes and identified equivalent tandem repeat pairs (ETRPs); here,an ETRP is defined as a pair of tandem repeats sequentially similar to each other.Then we carried out statistical analyses of the central angle distributions of the de...  相似文献   

5.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(DN(3pow(epsilon)-1)log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.  相似文献   

6.
Many proteins, especially in eukaryotes, contain tandem repeats of several domains from the same family. These repeats have a variety of binding properties and are involved in protein–protein interactions as well as binding to other ligands such as DNA and RNA. The rapid expansion of protein domain repeats is assumed to have evolved through internal tandem duplications. However, the exact mechanisms behind these tandem duplications are not well-understood. Here, we have studied the evolution, function, protein structure, gene structure, and phylogenetic distribution of domain repeats. For this purpose we have assigned Pfam-A domain families to 24 proteomes with more sensitive domain assignments in the repeat regions. These assignments confirmed previous findings that eukaryotes, and in particular vertebrates, contain a much higher fraction of proteins with repeats compared with prokaryotes. The internal sequence similarity in each protein revealed that the domain repeats are often expanded through duplications of several domains at a time, while the duplication of one domain is less common. Many of the repeats appear to have been duplicated in the middle of the repeat region. This is in strong contrast to the evolution of other proteins that mainly works through additions of single domains at either terminus. Further, we found that some domain families show distinct duplication patterns, e.g., nebulin domains have mainly been expanded with a unit of seven domains at a time, while duplications of other domain families involve varying numbers of domains. Finally, no common mechanism for the expansion of all repeats could be detected. We found that the duplication patterns show no dependence on the size of the domains. Further, repeat expansion in some families can possibly be explained by shuffling of exons. However, exon shuffling could not have created all repeats.  相似文献   

7.
James Bruce Walsh 《Genetics》1987,115(3):553-567
Recombination processes acting on tandem arrays are suggested here to have probable intrinsic biases, producing an expected net decrease in array size following each event, in contrast to previous models which assume no net change in array size. We examine the implications of this by modeling copy number dynamics in a tandem array under the joint interactions of sister-strand unequal crossing over (rate gamma per generation per copy) and intrastrand recombination resulting in deletion (rate epsilon per generation per copy). Assuming no gene amplification or selection, the expected mean persistence time of an array starting with z excess copies (i.e., array size z + 1) is z(1 + gamma/epsilon) recombinational events. Nontrivial equilibrium distributions of array sizes exist when gene amplification or certain forms of selection are considered. We characterize the equilibrium distribution for both a simple model of gene amplification and under the assumption that selection imposes a minimal array size, n. For the latter case, n + 1/alpha is an upper bound for mean array size under fairly general conditions, where alpha(= 2 epsilon/gamma) is the scaled deletion rate. Further, the distribution of excess copies over n is bounded above by a geometric distribution with parameter alpha/(1 + alpha). Tandem arrays are unlikely to be greatly expanded by unequal crossing over unless alpha much less than 1, implying that other mechanisms, such as gene amplification, are likely important in the evolution of large arrays. Thus unequal crossing over, by itself, is likely insufficient to account for satellite DNA.  相似文献   

8.
G. S. Wilkinson  F. Mayer  G. Kerth    B. Petri 《Genetics》1997,146(3):1035-1048
Analysis of mitochondrial DNA control region sequences from 41 species of bats representing 11 families revealed that repeated sequence arrays near the tRNA-Pro gene are present in all vespertilionine bats. Across 18 species tandem repeats varied in size from 78 to 85 bp and contained two to nine repeats. Heteroplasmy ranged from 15% to 63%. Fewer repeats among heteroplasmic than homoplasmic individuals in a species with up to nine repeats indicates selection may act against long arrays. A lower limit of two repeats and more repeats among heteroplasmic than homoplasmic individuals in two species with few repeats suggests length mutations are biased. Significant regressions of heteroplasmy, θ and π, on repeat number further suggest that repeat duplication rate increases with repeat number. Comparison of vespertilionine bat consensus repeats to mammal control region sequences revealed that tandem repeats of similar size, sequence and number also occur in shrews, cats and bighorn sheep. The presence of two conserved protein-binding sequences in all repeat units indicates that convergent evolution has occurred by duplication of functional units. We speculate that D-loop region tandem repeats may provide signal redundancy and a primitive repair mechanism in the event of somatic mutations to these binding sites.  相似文献   

9.
Tandem repeats finder: a program to analyze DNA sequences.   总被引:66,自引:3,他引:63       下载免费PDF全文
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cellreceptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface atc3.biomath.mssm.edu/trf.html has been established for automated use of the program.  相似文献   

10.
Complete eukaryote chromosomes were investigated for intrachromosomal duplications of nucleotide sequences. The analysis was performed by looking for nonexact repeats on two complete genomes, Saccharomyces cerevisiae and Caenorhabditis elegans, and four partial ones, Drosophila melanogaster, Plasmodium falciparum, Arabidopsis thaliana, and Homo sapiens. Through this analysis, we show that all eukaryote chromosomes exhibit similar characteristics for their intrachromosomal repeats, suggesting similar dynamics: many direct repeats have their two copies physically close together, and these close direct repeats are more similar and shorter than the other repeats. On the contrary, there are almost no close inverted repeats. These results support a model for the dynamics of duplication. This model is based on a continuous genesis of tandem repeats and implies that most of the distant and inverted repeats originate from these tandem repeats by further chromosomal rearrangements (insertions, inversions, and deletions). Remnants of these predicted rearrangements have been brought out through fine analysis of the chromosome sequence. Despite these dynamics, shared by all eukaryotes, each genome exhibits its own style of intrachromosomal duplication: the density of repeated elements is similar in all chromosomes issued from the same genome, but is different between species. This density was further related to the relative rates of duplication, deletion, and mutation proper to each species. One should notice that the density of repeats in the X chromosome of C. elegans is much lower than in the autosomes of that organism, suggesting that the exchange between homologous chromosomes is important in the duplication process.  相似文献   

11.
An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.  相似文献   

12.
A model developed for the evolving size of the repetitive part of the eukaryote genome during speciation was subjected to analytical and computer treatment. The basic assumption of the model was that two classes of repetitive DNA contribute mainly to macroevolutionary changes in genome size: arrays of tandem repeats (ATR) changing through unequal crossover and mobile genetic elements (MGE) changing presumably through an integration mechanism of the Tn- and Is-kind operating in bacteria. Within the framework of this model, the macroevolution of the MGE size is formally equivalent to that of the ATR in the particular case when shifts of chromatids have only one repeat out of register. This allowed us to consider genome size as a large set of various ATRs. The results obtained are as follows. If the duplication and deletion of repeats have unequal fixation probabilities during each speciation act, the predicted species distributions of genome size significantly deviate from the real ones; if they have equal fixation probabilities, there is a conformance between calculated and real distributions. In the latter case, the model reproduces the salient features of real distributions upon acceptance of 1) upper selective boundary nonspecifically limiting increase in genome size within the evolving taxonomic group and 2) non-neutrality of variability in genome size with respect to speciation.  相似文献   

13.
Maltase-glucoamylase and sucrase-isomaltase are two glycosydases responsible for starch digestion in human. Their evolutionary history was studied by comparing the amino acid sequences of these enzymes from several mammals and their orthologs from other chordates. The two glycosydases are paralogs and contain catalytic domains of the GH31 family. A common evolutionary precursor of their genes arose via a tandem duplication. As a consequence, sucrase-isomaltase consists of two homologous parts. The maltase-glucoamylase gene experienced several additional duplications, whose number varies among mammals. Its locus harbors four to seven tandem repeats, each coding for an amino acid sequence similar to the two parts of sucrase-isomaltase.  相似文献   

14.
J. Mathern  S. Hake 《Genetics》1997,147(1):305-314
The knotted1 gene was first defined by dominant mutations that affect leaf morphology. The original allele, Kn1-O, results from a 17-kb tandem duplication. Mutator (Mu) insertions near the junction of the two repeats suppress the leaf phenotype to different degrees depending on the position of the insertion. The Mu insertions also increase the frequency of recombination at Kn1-O to create derivative alleles in which the Mu element and one copy of the repeat are lost. These derivatives are normal in appearance. Here we describe two derivatives that retained the tandem duplication but gained insertions of 1.7 and 3 kb in length in place of the Mu element. In each case, the inserted DNA is a sequence that normally flanks the distal repeat unit. Thus, each derivative consists of a tandem duplication in which the repeat unit has been extended at its distal end by the length of the new insertion. The 1.7-kb insertion dampens the phenotype, as did the original Mu insertion, whereas the 3-kb insertion completely suppresses the knotted phenotype. We propose that gene conversion, stimulated by the double-strand break of the Mu excision, gave rise to these derivatives.  相似文献   

15.
The problem of reconstructing the duplication tree of a set of tandemly repeated sequences which are supposed to have arisen through unequal recombination, was first introduced by Fitch (1977, Genetics, 86, 93-104), and has recently received a lot of attention. In this paper, we describe DTSCORE, a fast distance based algorithm to reconstruct tandem duplication trees, which is statistically consistent. As a cousin of the ADDTREE algorithm (Sattath and Tversky, 1977, Psychometrika, 42, 319-345), the raw DTSCORE has a time complexity in O(n(5)), where n is the number of observed repeated sequences. Through a series of algorithmic refinements, we improve its complexity to O(n(4)) in the worst case, but stress that the refined DTSCORE algorithm should perform faster with real data. We assess the topological accuracy of DTSCORE using simulated data sets, and compare it to existing reconstruction methods. The results clearly show that DTSCORE is more accurate than all the other methods we studied. Finally, we report the results of DTSCORE on a real dataset. Supplementary information: http://www.lirmm.fr/w3ifa/MAAS/  相似文献   

16.
Maltase-glucoamylase and sucrase-isomaltase are two human glycosidases responsible for starch digestion. We have performed a comparative analysis of their amino acid sequences from several species of mammals and their orthologues from other chordates. This allowed us to determine the evolutionary history of the enzymes. Both glycosidases are paralogues and contain GH31 family catalytic domains. The common evolutionary precursor of these genes has arisen by a tandem duplication. As a consequence, sucrase-isomaltase consists of two homologous parts. The maltase-glucoamylase gene was a subject of several additional duplications, which number was not the same in different mammals. The locus, containing this gene, consists of 4-7 tandem repeats. The amino acid sequence, encoded by each of them, is similar to both parts of sucrase-isomaltase.  相似文献   

17.
DNase I preferentially cleaves polyomavirus minichromosomes at two sites in the enhancer, each of which comprises the sequence AAGCAPuPuAAG flanked by short inverted repeats. A tandem duplication of this sequence generates an additional hypersensitive locus. Mutations which alter either the AAGCAPuPuAAG or flanking repeats diminish hypersensitivity. This region must determine the chromatin conformation recognized by DNase I.  相似文献   

18.
We describe an optimized algorithm, which is faster and more accurate compared to previously described algorithms, for computing the statistical mechanics of denaturation of nucleic acid sequences according to the classical Poland-Scheraga type of model. Nearest neighbor thermodynamics has been included in a complete and general way, by rigorously treating nearest neighbor interactions, helix end interactions, and isolated base-pairs. This avoids the simplifications of previous approaches and achieves full generality and controllability with respect to thermodynamic modeling. The algorithm computes subchain partition functions by recursion, from which various quantitative aspects of the melting process are easily derived, for example the base-pairing probability profiles. The algorithm represents an optimization with respect to algorithmic complexity of the partition function algorithm of Yeramian et al. (Biopolymers 1990, 30, 481-497): we reduce the computation time for a base-pairing probability profile from O(N2) to O(N), where N is the sequence length. This speed-up comes in addition to the speed-up due to a multiexponential approximation of the loop entropy factor as introduced by Fixman and Freire22 and applied by Yeramian et al. The speed-up, however, is independent of the multiexponential approximation and reduces time from O(N3) to O(N2) in the exact case. A method for representing very large numbers is described, which avoids numerical overflow in the partition functions for genomic length sequences. In addition to calculating the standard base-pairing probability profiles, we propose to use the algorithm to calculate various other probabilities (loops, helices, tails) for a more direct view of the melting regions and their positions and sizes. This can provide a better understanding of the physics of denaturation and the biology of genomes.  相似文献   

19.
Many computational problems and methods have been proposed for analysis of biological pathways. Among them, this paper focuses on extraction of mapping rules of atoms from enzymatic reaction data, which is useful for drug design, simulation of tracer experiments, and consistency checking of pathway databases. Most of existing methods for this problem are based on maximal common subgraph algorithms. In this paper, we propose a novel approach based on graph partition and graph isomorphism. We show that this problem is NP-hard in general, but can be solved in polynomial time for wide classes of enzymatic reactions. We also present an O(n(1.5)) time algorithm for a special but fundamental class of reactions, where n is the maximum size of compounds appearing in a reaction. We develop practical polynomial-time algorithms in which the Morgan algorithm is used for computing the normal form of a graph, where it is known that the Morgan algorithm works correctly for most chemical structures. Computational experiments are performed for these practical algorithms using the chemical reaction data stored in the KEGG/LIGAND database. The results of computational experiments suggest that practical algorithms are useful in many cases.  相似文献   

20.
Particularly in higher eukaryotes, some protein domains are found in tandem repeats, performing broad functions often related to cellular organization. For instance, the eukaryotic protein filamin interacts with many proteins and is crucial for the cytoskeleton. The functional properties of long repeat domains are governed by the specific properties of each individual domain as well as by the repeat copy number. To provide better understanding of the evolutionary and functional history of repeating domains, we investigated the mode of evolution of the filamin domain in some detail. Among the domains that are common in long repeat proteins, sushi and spectrin domains evolve primarily through cassette tandem duplications while scavenger and immunoglobulin repeats appear to evolve through clustered tandem duplications. Additionally, immunoglobulin and filamin repeats exhibit a unique pattern where every other domain shows high sequence similarity. This pattern may be the result of tandem duplications, serve to avert aggregation between adjacent domains or it is the result of functional constraints. In filamin, our studies confirm the presence of interspersed integrin binding domains in vertebrates, while invertebrates exhibit more varied patterns, including more clustered integrin binding domains. The most notable case is leech filamin, which contains a 20 repeat expansion and exhibits unique dimerization topology. Clearly, invertebrate filamins are varied and contain examples of similar adjacent integrin-binding domains. Given that invertebrate integrin shows more similarity to the weaker filamin binder, integrin β3, it is possible that the distance between integrin-binding domains is not as crucial for invertebrate filamins as for vertebrates.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号