共查询到10条相似文献,搜索用时 140 毫秒
1.
Optimal sequence alignment allowing for long gaps 总被引:7,自引:0,他引:7
Osamu Gotoh 《Bulletin of mathematical biology》1990,52(3):359-373
A new algorithm for optimal sequence alignment allowing for long insertions and deletions is developed. The algorithm requires
O((L+C)MN) computational steps, O(LN) primary memory and O(MN) secondary memory storage, whereM andN(M≥N) are sequence lengths,L (typicallyL≤3) is the number of segment specifying the gap weighting function, andC is a constant. We have also modified our earlier traceback algorithm so that it finds all and only the optimal alignments
in a compact form of a directed graph. The current versions accept a set of aligned sequences as input, which facilitates
multiple sequence alignment by some iterative procedures.
Dedicated to Professor Akiyoshi Wada on the occasion of his 60th birthday. 相似文献
2.
Given a sequenceA and regular expressionR, theapproximate regular expression matching problem is to find a sequence matchingR whose optimal alignment withA is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in timeO(MN), whereM andN are the lengths ofA andR. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our
method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in
addition to integer costs, with no loss of asymptotic efficiency. Second, it requires onlyO(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make
it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in
molecular biology, and further refine it to search for substrings ofA that strongly align with a sequence inR, as required for typical data base searches. We also show how to deliver an optimal alignment betweenA andR in onlyO(N+logM) space usingO(MN logM) time. Finally, anO(MN(M+N)+N
2logN) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of
its length. 相似文献
3.
Reed A Cartwright 《BMC bioinformatics》2006,7(1):527
Background
Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has lead several scientists to propose that logarithmic gap costs, G (k) = a + c ln k, are more biologically realistic than affine gap costs, G (k) = a + bk, for sequence alignment. Since quick and efficient affine costs are currently the most popular way to globally align sequences, the goal of this paper is to determine whether logarithmic gap costs improve alignment accuracy significantly enough the merit their use over the faster affine gap costs. 相似文献4.
Sequence alignment underpins common tasks in molecular biology, including genome annotation, molecular phylogenetics, and homology modeling. Fundamental to sequence alignment is the placement of gaps, which represent character insertions or deletions. We assessed the ability of a generalized affine gap cost model to reliably detect remote protein homology and to produce high-quality alignments. Generalized affine gap alignment with optimal gap parameters performed as well as the traditional affine gap model in remote homology detection. Evaluation of alignment quality showed that the generalized affine model aligns fewer residue pairs than the traditional affine model but achieves significantly higher per-residue accuracy. We conclude that generalized affine gap costs should be used when alignment accuracy carries more importance than aligned sequence length. 相似文献
5.
R Mott 《Bioinformatics (Oxford, England)》1999,15(6):455-462
MOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties. RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) >/= g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run-time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap. AVAILABILITY: The source code is available to academic collaborators under licence. 相似文献
6.
Gap costs for multiple sequence alignment 总被引:6,自引:0,他引:6
S F Altschul 《Journal of theoretical biology》1989,138(3):297-309
Standard methods for aligning pairs of biological sequences charge for the most common mutations, which are substitutions, deletions and insertions. Because a single mutation may insert or delete several nucleotides, gap costs that are not directly proportional to gap length are usually the most effective. How to extend such gap costs to alignments of three or more sequences is not immediately obvious, and a variety of approaches have been taken. This paper argues that, since gap and substitution costs together specify optimal alignments, they should be defined using a common rationale. Specifically, a new definition of gap costs for multiple alignments is proposed and compared with previous ones. Since the new definition links a multiple alignment's cost to that of its pairwise projections, it allows knowledge gained about two-sequence alignments to bear on the multiple alignment problem. Also, such linkage is a key element of recent algorithms that have rendered practical the simultaneous alignment of as many as six sequences. 相似文献
7.
Alignment algorithms can be used to infer a relationship between sequences when the true relationship is unknown. Simple alignment algorithms use a cost function that gives a fixed cost to each possible point mutation-mismatch, deletion, insertion. These algorithms tend to find optimal alignments that have many small gaps. It is more biologically plausible to have fewer longer gaps rather than many small gaps in an alignment. To address this issue, linear gap cost algorithms are in common use for aligning biological sequence data. More reliable inferences are obtained by aligning more than two sequences at a time. The obvious dynamic programming algorithm for optimally aligning k sequences of length n runs in O(n(k)) time. This is impractical if k>/=3 and n is of any reasonable length. Thus, for this problem there are many heuristics for aligning k sequences, however, they are not guaranteed to find an optimal alignment. In this paper, we present a new algorithm guaranteed to find the optimal alignment for three sequences using linear gap costs. This gives the same results as the dynamic programming algorithm for three sequences, but typically does so much more quickly. It is particularly fast when the (three-way) edit distance is small. Our algorithm uses a speed-up technique based on Ukkonen's greedy algorithm (Ukkonen, 1983) which he presented for two sequences and simple costs. 相似文献
8.
Summary The problem of choosing an alignment of two or more nucleotide sequences is particularly difficult for nucleic acids, such as 5S ribosomal RNA, which do not code for protein and for which secondary structure is unknown. Given a set of costs for the various types of replacement mutations and for base insertion or deletion, we present a dynamic programming algorithm which finds the optimal (least costly) alignment for a set of N sequences simultaneously, where each sequence is associated with one of the N tips of a given evolutionary tree. Concurrently, protosequences are constructed corresponding to the ancestral nodes of the tree. A version of this algorithm, modified to be computationally feasible, is implemented to align the sequences of 5S RNA from nine organisms. Complete sets of alignments and proto-sequence reconstructions are done for a large number of different con-figurations of mutation costs. Examination of the family of curves of total replacements inferred versus the ratio of transitions/trans-versions inferred, each curve corresponding to a given number of in-sertions-deletions inferred, provides a method for estimating relative costs and relative frequencies for these different types of mutation. 相似文献
9.
A major problem in sequence alignments based on the standard dynamic programming method is that the optimal path does not necessarily yield the best equivalencing of residues assessed by structural or functional criteria. An algorithm is presented that finds suboptimal alignments of protein sequences by a simple modification to the standard dynamic programming method. The standard pairwise weight matrix elements are modified in order to penalize, but not eliminate, the equivalencing of residues obtained from previous alignments. The algorithm thereby yields a limited set of alternate alignments that can differ considerably from the optimal. The approach is benchmarked on the alignments of immunoglobulin domains. Without a prior knowledge of the optimal choice of gap penalty, one of the suboptimal alignments is shown to be more accurate than the optimal. 相似文献
10.
Nonlinear similarity functions are often better than linear functions at distinguishing interesting subalignments from those
due to chance. Nonlinear similarity functions useful for comparing biological sequences are developed. Several new algorithms
are presented for finding locally optimal subalignments of two sequences. Unlike previous algorithms, they may use any reasonable
similarity function as a selection criterion. Among these algorithms are VV-1, which finds all and only the locally optimal
subalignments of two sequences, and CC-1, which finds all and only the weakly locally optimal subalignments of two sequences.
The VV-1 algorithm is slow and interesting only for theoretical reasons. In contrast, the CC-1 algorithm has average time
complexityO(MN) when used to find only very good subalignments. 相似文献