Similar articles
Found 20 similar articles (search time: 39 ms)
1.
In the growing field of genomics, multiple alignment programs are confronted with ever increasing amounts of data. To address this growing issue we have dramatically improved the running time and memory requirement of Kalign, while maintaining its high alignment accuracy. Kalign version 2 also supports nucleotide alignment, and a newly introduced extension allows for external sequence annotation to be included into the alignment procedure. We demonstrate that Kalign2 is exceptionally fast and memory-efficient, permitting accurate alignment of very large numbers of sequences. The accuracy of Kalign2 compares well to the best methods in the case of protein alignments while its accuracy on nucleotide alignments is generally superior. In addition, we demonstrate the potential of using known or predicted sequence annotation to improve the alignment accuracy. Kalign2 is freely available for download from the Kalign web site (http://msa.sbc.su.se/).

2.
MOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties. RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) ≥ g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run-time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap. AVAILABILITY: The source code is available to academic collaborators under licence.
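
To make the role of the gap function g(k) concrete, here is a minimal sketch of local alignment under an arbitrary gap cost. It is the plain cubic-time recurrence, corresponding to the O(mn(m+n)) bound quoted above, and is not the optimised algorithm of the paper. The function name, the match/mismatch scores, and the example gap functions are illustrative assumptions.

```python
from typing import Callable


def local_align_score(a: str, b: str,
                      gap: Callable[[int], float],
                      match: float = 2.0,
                      mismatch: float = -1.0) -> float:
    """Best local alignment score when a gap of length k costs gap(k).

    Hypothetical helper for illustration; scoring values are assumptions.
    """
    m, n = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[:i], b[:j]
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Either start fresh (0) or extend the diagonal with a (mis)match
            score = max(0.0, H[i - 1][j - 1] + s)
            # Close a gap of length k that skips k symbols of a
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] - gap(k))
            # Close a gap of length k that skips k symbols of b
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] - gap(k))
            H[i][j] = score
            best = max(best, score)
    return best
```

With `a = "AAAACCCCCCCCGGGG"` and `b = "AAAAGGGG"`, an affine cost such as `lambda k: 2 + k` makes the eight-symbol gap too expensive to bridge, whereas a length-independent cost such as `lambda k: 3` lets the two matching blocks join into one alignment.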

3.
Qiu J, Elber R. Proteins 2006, 62(4):881-891
In template-based modeling of protein structures, the generation of the alignment between the target and the template is a critical step that significantly affects the accuracy of the final model. This paper proposes an alignment algorithm SSALN that learns substitution matrices and position-specific gap penalties from a database of structurally aligned protein pairs. In addition to the amino acid sequence information, secondary structure and solvent accessibility information of a position are used to derive substitution scores and position-specific gap penalties. In a test set of CASP5 targets, SSALN outperforms sequence alignment methods such as a Smith-Waterman algorithm with BLOSUM50 and PSI_BLAST. SSALN also generates better alignments than PSI_BLAST in the CASP6 test set. LOOPP server prediction based on an SSALN alignment is ranked the best for target T0280_1 in CASP6. SSALN is also compared with several threading methods and sequence alignment methods on the ProSup benchmark. SSALN has the highest alignment accuracy among the methods compared. On Fischer's benchmark, SSALN performs better than CLUSTALW and GenTHREADER, and generates more alignments with accuracy >50%, >60% or >70% than FUGUE, but fewer alignments with accuracy >80% than FUGUE. All the supplemental materials can be found at http://www.cs.cornell.edu/~jianq/research.htm.

4.
Few sequence alignment methods have been designed specifically for integral membrane proteins, even though these important proteins have distinct evolutionary and structural properties that might affect their alignments. Existing approaches typically consider membrane-related information either by using membrane-specific substitution matrices or by assigning distinct penalties for gap creation in transmembrane and non-transmembrane regions. Here, we ask whether favoring matching of predicted transmembrane segments within a standard dynamic programming algorithm can improve the accuracy of pairwise membrane protein sequence alignments. We tested various strategies using a specifically designed program called AlignMe. An updated set of homologous membrane protein structures, called HOMEP2, was used as a reference for optimizing the gap penalties. The best of the membrane-protein optimized approaches were then tested on an independent reference set of membrane protein sequence alignments from the BAliBASE collection. When secondary structure (S) matching was combined with evolutionary information (using a position-specific substitution matrix (P)), in an approach we called AlignMePS, the resultant pairwise alignments were typically among the most accurate over a broad range of sequence similarities when compared to available methods. Matching transmembrane predictions (T), in addition to evolutionary information, and secondary-structure predictions, in an approach called AlignMePST, generally reduces the accuracy of the alignments of closely-related proteins in the BAliBASE set relative to AlignMePS, but may be useful in cases of extremely distantly related proteins for which sequence information is less informative. The open source AlignMe code is available at https://sourceforge.net/projects/alignme/, and at http://www.forrestlab.org, along with an online server and the HOMEP2 data set.

5.
MOTIVATION: Although pairwise sequence alignment is essential in comparative genomic sequence analysis, it has proven difficult to precisely determine the gap penalties for a given pair of sequences. A common practice is to employ default penalty values. However, there are a number of problems associated with using gap penalties. First, alignment results can vary depending on the gap penalties, making it difficult to explore appropriate parameters. Second, the statistical significance of an alignment score is typically based on a theoretical model of non-gapped alignments, which may be misleading. Finally, there is no way to control the number of gaps for a given pair of sequences, even if the number of gaps is known in advance. RESULTS: In this paper, we develop and evaluate the performance of an alignment technique that allows the researcher to set the number of allowable gaps a priori, rather than using gap penalties. We compare this approach with the Smith-Waterman and Needleman-Wunsch techniques on a set of structurally aligned protein sequences. We demonstrate that this approach outperforms the other techniques, especially for short sequences (56-133 residues) with low similarity (<25%). Further, by employing a statistical measure, we show that it can be used to assess the quality of the alignment in relation to the true alignment with the associated optimal number of gaps. AVAILABILITY: The implementation of the described methods, SANK_AL, is available at http://cbbc.murdoch.edu.au/ CONTACT: matthew@cbbc.murdoch.edu.au.

6.
Qian B, Goldstein RA. Proteins 2002, 48(4):605-610
The accuracy of the alignments of protein sequences depends on the score matrix and gap penalties used in performing the alignment. Most score functions are designed to find homologs in the various databases rather than to generate accurate alignments between known homologs. We describe the optimization of a score function for the purpose of generating accurate alignments, as evaluated by using a coordinate root-mean-square deviation (RMSD)-based merit function. We show that the resulting score matrix, which we call STROMA, generates more accurate alignments than other commonly used score matrices, and this difference is not due to differences in the gap penalties. In fact, in contrast to most of the other matrices, the alignment accuracies with STROMA are relatively insensitive to the choice of gap penalty parameters.

7.
MOTIVATION: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment. RESULTS: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties. AVAILABILITY: Opal, a multiple alignment tool that implements the best methods in our study, is freely available at http://opal.cs.arizona.edu.

8.
Qian B, Goldstein RA. Proteins 2001, 45(1):102-104
Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. Although affine gap penalties represent the relative ease of extending a gap compared with initializing a gap, they are still an obvious oversimplification of the real processes that occur during sequence evolution. To improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wanted to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity < 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multiexponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.
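
The contrast between an affine penalty and a penalty derived from a multiexponential gap-length distribution can be sketched as below. This is only an illustration: the weights and decay constants are made-up placeholders rather than the fitted values from the paper, and turning the distribution into a penalty by taking a negative logarithm is an assumption made here for comparison.

```python
import math
from typing import Sequence


def affine_penalty(k: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Affine cost: an opening charge plus a constant charge per extra residue."""
    return gap_open + gap_extend * (k - 1)


def multiexp_weight(k: int, weights: Sequence[float], decays: Sequence[float]) -> float:
    """Unnormalised gap-length weight: sum_i w_i * exp(-k / tau_i)."""
    return sum(w * math.exp(-k / tau) for w, tau in zip(weights, decays))


def log_penalty(k: int, weights: Sequence[float], decays: Sequence[float]) -> float:
    """Penalty taken as the negative log of the multiexponential weight."""
    return -math.log(multiexp_weight(k, weights, decays))


# Placeholder parameters for four components (assumed, not from the paper)
WEIGHTS = [0.5, 0.3, 0.15, 0.05]
DECAYS = [1.0, 3.0, 10.0, 30.0]
```

Under the affine model every additional gap residue costs the same amount, while under the mixture the marginal cost of extending a gap shrinks as the gap grows, because the slowly decaying components dominate at long lengths.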

9.
Several methods have been developed for simultaneous estimation of alignment and tree, of which POY is the most popular. In a 2007 paper published in Systematic Biology, Ogden and Rosenberg reported on a simulation study in which they compared POY to estimating the alignment using ClustalW and then analyzing the resultant alignment using maximum parsimony. They found that ClustalW+MP outperformed POY with respect to alignment and phylogenetic tree accuracy, and they concluded that simultaneous estimation techniques are not competitive with two-phase techniques. Our paper presents a simulation study in which we focus on the NP-hard optimization problem that POY addresses: minimizing treelength. Our study considers the impact of the gap penalty and suggests that the poor performance observed for POY by Ogden and Rosenberg is due to the simple gap penalties they used to score alignment/tree pairs. Our study suggests that optimizing under an affine gap penalty might produce alignments that are better than ClustalW alignments, and competitive with those produced by the best current alignment methods. We also show that optimizing under this affine gap penalty produces trees whose topological accuracy is better than ClustalW+MP, and competitive with the current best two-phase methods.

10.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.

11.
Wrabl JO, Grishin NV. Proteins 2004, 54(1):71-87
An algorithm was developed to locally optimize gaps from the FSSP database. Over 2 million gaps were identified from all versus all FSSP structure comparisons, and datasets of non-identical gaps and flanking regions comprising between 90,000 and 135,000 sequence fragments were extracted for statistical analysis. Relative to background frequencies, gaps were enriched in residue types with small side chains and high turn propensity (D, G, N, P, S), and were depleted in residue types with hydrophobic side chains (C, F, I, L, V, W, Y). In contrast, regions flanking a gap exhibited opposite trends in amino acid frequencies, i.e., enrichment in hydrophobic residues and a high degree of secondary structure. Log-odds scores of residue type as a function of position in or around a gap were derived from the statistics. Three simple experiments demonstrated that these scores contained significant predictive information. First, regions where gaps were observed in single sequences taken from HOMSTRAD structure-based multiple sequence alignments generally scored higher than regions where gaps were not observed. Second, given the correct pairwise-aligned cores, the actual positions of gaps could be reproduced from sequence more accurately using the structurally-derived statistics than by using random pairwise alignments. Finally, revision of the Clustal-W residue-specific gap opening parameters with this new information improved the agreement of Clustal-W alignments with the structure-based alignments. At least three applications for these results are envisioned: improvement of gap penalties in pairwise (or multiple) sequence alignment, prediction of regions of single sequences likely (or unlikely) to contain indels, and more accurate placement of gaps in automated pairwise structure alignment.
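
The log-odds idea mentioned above (residue frequency inside gaps relative to background) can be sketched roughly as follows. The function name, the pseudocount, the base-2 logarithm, and the toy residue strings are assumptions for illustration; the paper's scores are additionally position-dependent, which this sketch ignores.

```python
import math
from collections import Counter
from typing import Dict


def log_odds_scores(gap_residues: str, background_residues: str,
                    pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(frequency in gaps / background frequency) for each residue type."""
    alphabet = sorted(set(gap_residues) | set(background_residues))
    gap_counts = Counter(gap_residues)
    bg_counts = Counter(background_residues)
    # Pseudocounts keep residues unseen in one sample from producing log(0)
    gap_total = len(gap_residues) + pseudocount * len(alphabet)
    bg_total = len(background_residues) + pseudocount * len(alphabet)
    scores = {}
    for residue in alphabet:
        f_gap = (gap_counts[residue] + pseudocount) / gap_total
        f_bg = (bg_counts[residue] + pseudocount) / bg_total
        scores[residue] = math.log2(f_gap / f_bg)
    return scores


# Toy data (hypothetical): glycine-rich gaps against a leucine/valine-rich background
toy_scores = log_odds_scores("GGGGPPSL", "GPSLLLLVVV")
```

A positive score marks a residue type that is enriched in gaps, and a negative score marks one that is depleted; in the toy data G scores positive while L and V score negative.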

12.
Carrying out simultaneous tree-building and alignment of sequence data is a difficult computational task, and the methods currently available are either limited to a few sequences or restricted to highly simplified models of alignment and phylogeny. A method is given here for overcoming these limitations by Bayesian sampling of trees and alignments simultaneously. The method uses a standard substitution matrix model for residues together with a hidden Markov model structure that allows affine gap penalties. It escapes the heavy computational burdens of other models by using an approximation called the "*" rule, which replaces missing data by a sum over all possible values of variables. The behavior of the model is demonstrated on test sets of globins. Received: 25 May 1998 / Accepted: 8 December 1998

13.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.

14.

Background  

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.

15.
Liu K, Warnow T. PLoS ONE 2012, 7(3):e33104
The standard approach to phylogeny estimation uses two phases, in which the first phase produces an alignment on a set of homologous sequences, and the second phase estimates a tree on the multiple sequence alignment. POY, a method which seeks a tree/alignment pair minimizing the total treelength, is the most widely used alternative to this two-phase approach. The topological accuracy of trees computed under treelength optimization is, however, controversial. In particular, one study showed that treelength optimization using simple gap penalties produced poor trees and alignments, and suggested the possibility that if POY were used with an affine gap penalty, it might be able to be competitive with the best two-phase methods. In this paper we report on a study addressing this possibility. We present a new heuristic for treelength, called BeeTLe (Better Treelength), that is guaranteed to produce trees at least as short as POY. We then use this heuristic to analyze a large number of simulated and biological datasets, and compare the resultant trees and alignments to those produced using POY and also maximum likelihood (ML) and maximum parsimony (MP) trees computed on a number of alignments. In general, we find that trees produced by BeeTLe are shorter and more topologically accurate than POY trees, but that neither POY nor BeeTLe produces trees as topologically accurate as ML trees produced on standard alignments. These findings, taken as a whole, suggest that treelength optimization is not as good an approach to phylogenetic tree estimation as maximum likelihood based upon good alignment methods.

16.
H Tyson. Génome 1992, 35(2):360-371
Optimum alignment in all pairwise combinations among a group of amino acid sequences generated a distance matrix. These distances were clustered to evaluate relationships among the sequences. The degree of relationship among sequences was also evaluated by calculating specific distances from the distance matrix and examining correlations between patterns of specific distances for pairs of sequences. The sequences examined were a group of 20 amino acid sequences of scorpion toxins originally published and analyzed by M.J. Dufton and H. Rochat in 1984. Alignment gap penalties were constant for all 190 pairwise sequence alignments and were chosen after assessing the impact of changing penalties on resultant distances. The total distances generated by the 190 pairwise sequence alignments were clustered using complete (farthest neighbour) linkage. The square, symmetrical input distance matrix is analogous to diallel cross data where reciprocal and parental values are absent. Diallel analysis methods provided analogues for the distance matrix to genetical specific combining abilities, namely specific distances between all sequence pairs that are independent of the average distances shown by individual sequences. Correlation of specific distance patterns, with transformation to modified z values and a stringent probability level, were used to delineate subgroups of related sequences. These were compared with complete linkage clustering results. Excellent agreement between the two approaches was found. Three originally outlying sequences were placed within the four new subgroups.
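
Complete (farthest neighbour) linkage on a symmetric distance matrix, as used above, can be sketched as follows. This is a plain textbook-style implementation written for illustration, not the software used in the study; the function name, the stopping rule (a target number of clusters), and the toy matrix are assumptions.

```python
from typing import List, Sequence


def complete_linkage(dist: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering where the distance between two clusters
    is the largest pairwise distance between their members."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Farthest-neighbour distance between clusters x and y
                d = max(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair of clusters and drop the absorbed one
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy 4-sequence distance matrix (hypothetical values)
TOY_DIST = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
toy_groups = complete_linkage(TOY_DIST, 2)
```

In the study the matrix would hold the 190 pairwise alignment distances among the 20 sequences; the toy matrix here only shows two tight pairs being recovered as two groups.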

17.
Alignments grow, secondary structure prediction improves.   (Total citations: 12; self-citations: 0; citations by others: 12)
Using information from sequence alignments significantly improves protein secondary structure prediction. Typically, more divergent profiles yield better predictions. Recently, various groups have shown that accuracy can be improved significantly by using PSI-BLAST profiles to develop new prediction methods. Here, we focused on the influences of various alignment strategies on two 8-year-old PHD methods. The following results stood out. (i) PHD using pairwise alignments predicts about 72% of all residues correctly in one of the three states: helix, strand, and other. Using larger databases and PSI-BLAST raised accuracy to 75%. (ii) More than 60% of the improvement originated from the growth of current sequence databases; about 20% resulted from detailed changes in the alignment procedure (substitution matrix, thresholds, and gap penalties). Another 20% of the improvement resulted from carefully using iterated PSI-BLAST searches. (iii) It is of interest that we failed to improve prediction accuracy further when attempting to refine the alignment by dynamic programming (MaxHom and ClustalW). (iv) Improvement through family growth appears to saturate at some point. However, most families have not reached this saturation. Hence, we anticipate that prediction accuracy will continue to rise with database growth.

18.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.

19.
A "Long Indel" model for evolutionary sequence alignment   (Total citations: 7; self-citations: 0; citations by others: 7)
We present a new probabilistic model of sequence evolution, allowing indels of arbitrary length, and give sequence alignment algorithms for our model. Previously implemented evolutionary models have allowed (at most) single-residue indels or have introduced artifacts such as the existence of indivisible "fragments." We compare our algorithm to these previous methods by applying it to the structural homology dataset HOMSTRAD, evaluating the accuracy of (1) alignments and (2) evolutionary time estimates. With our method, it is possible (for the first time) to integrate probabilistic sequence alignment, with reliability indicators and arbitrary gap penalties, in the same framework as phylogenetic reconstruction. Our alignment algorithm requires that we evaluate the likelihood of any specific path of mutation events in a continuous-time Markov model, with the event times integrated out. To this effect, we introduce a "trajectory likelihood" algorithm (Appendix A). We anticipate that this algorithm will be useful in more general contexts, such as Markov Chain Monte Carlo simulations.

20.
BALSA: Bayesian algorithm for local sequence alignment   (Total citations: 3; self-citations: 1; citations by others: 2)
The Smith–Waterman algorithm yields a single alignment, which, albeit optimal, can be strongly affected by the choice of the scoring matrix and the gap penalties. Additionally, the scores obtained are dependent upon the lengths of the aligned sequences, requiring a post-analysis conversion. To overcome some of these shortcomings, we developed a Bayesian algorithm for local sequence alignment (BALSA), that takes into account the uncertainty associated with all unknown variables by incorporating in its forward sums a series of scoring matrices, gap parameters and all possible alignments. The algorithm can return both the joint and the marginal optimal alignments, samples of alignments drawn from the posterior distribution and the posterior probabilities of gap penalties and scoring matrices. Furthermore, it automatically adjusts for variations in sequence lengths. BALSA was compared with SSEARCH, to date the best performing dynamic programming algorithm in the detection of structural neighbors. Using the SCOP databases PDB40D-B and PDB90D-B, BALSA detected 19.8% and 41.3% of remote homologs, respectively, whereas SSEARCH detected 18.4% and 38%, at an error rate of 1% errors per query over the databases.
