Similar documents
Found 20 similar documents; search took 31 ms
1.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
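The selection rule in this abstract (score the pair under matrices at several evolutionary distances, keep the one with the highest similarity score, and report its distance) can be sketched as follows. This is a minimal illustration and not SimDist itself: the ungapped scorer, the `toy_matrix` helper, and the distance labels are all hypothetical stand-ins chosen to keep the example short.

```python
def toy_matrix(match, mismatch, alphabet="ACGT"):
    """Hypothetical substitution matrix: one score for matches, one for mismatches."""
    return {(x, y): (match if x == y else mismatch)
            for x in alphabet for y in alphabet}


def gapless_local_score(a, b, matrix):
    """Best ungapped local alignment score (a simplified stand-in scorer)."""
    best = 0
    m, n = len(a), len(b)
    for offset in range(-(m - 1), n):      # every diagonal of the comparison
        running = 0
        for i in range(m):
            j = i + offset
            if 0 <= j < n:
                # Restart the segment whenever the running score drops below 0.
                running = max(0, running + matrix[a[i], b[j]])
                best = max(best, running)
    return best


def select_matrix(a, b, matrices, score=gapless_local_score):
    """Return (distance, score) for the matrix giving the highest score."""
    scored = {dist: score(a, b, mat) for dist, mat in matrices.items()}
    best_dist = max(scored, key=scored.get)
    return best_dist, scored[best_dist]
```

With one strict matrix labelled distance 10 (`toy_matrix(5, -8)`) and one permissive matrix labelled distance 200 (`toy_matrix(3, -1)`), identical sequences select the strict matrix, while a pair with scattered mismatches selects the permissive one.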

2.
MOTIVATION: Sequence alignments obtained using affine gap penalties are not always biologically correct, because the insertion of long gaps is over-penalised. There is a need for an efficient algorithm which can find local alignments using non-linear gap penalties. RESULTS: A dynamic programming algorithm is described which computes optimal local sequence alignments for arbitrary, monotonically increasing gap penalties, i.e. where the cost g(k) of inserting a gap of k symbols is such that g(k) ≥ g(k-1). The running time of the algorithm is dependent on the scoring scheme; if the expected score of an alignment between random, unrelated sequences of lengths m, n is proportional to log mn, then with one exception, the algorithm has expected running time O(mn). Elsewhere, the running time is no greater than O(mn(m+n)). Optimisations are described which appear to reduce the worst-case run-time to O(mn) in many cases. We show how using a non-affine gap penalty can dramatically increase the probability of detecting a similarity containing a long gap. AVAILABILITY: The source code is available to academic collaborators under licence.
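For orientation, the textbook recurrence for local alignment under an arbitrary gap cost g(k) is sketched below. It is the straightforward O(mn(m+n)) formulation, not the optimised algorithm this abstract describes; the function names and the example scoring values are assumptions made for illustration.

```python
import math


def local_alignment_score(a, b, sub, gap):
    """Best local alignment score when a gap of length k costs gap(k)."""
    m, n = len(a), len(b)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Align a[i-1] with b[j-1].
            score = H[i - 1][j - 1] + sub(a[i - 1], b[j - 1])
            # End with a gap of length k that skips k symbols of a.
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] - gap(k))
            # End with a gap of length k that skips k symbols of b.
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] - gap(k))
            H[i][j] = max(0.0, score)   # local alignment: never go below 0
            best = max(best, H[i][j])
    return best


def simple_sub(x, y):
    """Illustrative substitution score: +2 for a match, -2 otherwise."""
    return 2 if x == y else -2


def affine_gap(k):
    """Illustrative affine cost: grows linearly with gap length."""
    return 3 + k


def log_gap(k):
    """Illustrative non-affine, monotonically increasing cost."""
    return 3 + math.log2(k)
```

For `"AAAATTTT"` against `"AAAACCCCTTTT"`, the four-symbol gap costs 7 under `affine_gap` and 5 under `log_gap`, so the logarithmic penalty leaves the gapped alignment with a larger margin over the best ungapped segment.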

3.

Background  

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.

4.
Qian B, Goldstein RA. Proteins, 2002, 48(4): 605-610
The accuracy of the alignments of protein sequences depends on the score matrix and gap penalties used in performing the alignment. Most score functions are designed to find homologs in the various databases rather than to generate accurate alignments between known homologs. We describe the optimization of a score function for the purpose of generating accurate alignments, as evaluated by using a coordinate root-mean-square deviation (RMSD)-based merit function. We show that the resulting score matrix, which we call STROMA, generates more accurate alignments than other commonly used score matrices, and this difference is not due to differences in the gap penalties. In fact, in contrast to most of the other matrices, the alignment accuracies with STROMA are relatively insensitive to the choice of gap penalty parameters.

5.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.

6.
Locality is an important and well-studied notion in comparative analysis of biological sequences. Similarly, taking into account affine gap penalties when calculating biological sequence alignments is a well-accepted technique for obtaining better alignments. When dealing with RNA, one has to take into consideration not only sequential features, but also structural features of the inspected molecule. This makes the computation more challenging, and usually restricts the comparison to small RNAs. In this paper we introduce two local metrics for comparing RNAs that extend the Smith-Waterman metric and its normalized version used for string comparison. We also present a global RNA alignment algorithm which handles affine gap penalties. Our global algorithm runs in O(m²n(1 + lg n/m)) time, while our local algorithms run in O(m²n(1 + lg n/m)) and O(n²m) time, respectively, where m …

7.
DbClustal addresses the important problem of the automatic multiple alignment of the top scoring full-length sequences detected by a database homology search. By combining the advantages of both local and global alignment algorithms into a single system, DbClustal is able to provide accurate global alignments of highly divergent, complex sequence sets. Local alignment information is incorporated into a ClustalW global alignment in the form of a list of anchor points between pairs of sequences. The method is demonstrated using anchors supplied by the Blast post-processing program, Ballast. The rapidity and reliability of DbClustal have been demonstrated using the recently annotated Pyrococcus abyssi proteome where the number of alignments with totally misaligned sequences was reduced from 20% to <2%. A web site has been implemented proposing BlastP database searches with automatic alignment of the top hits by DbClustal.

8.
Comparison of methods for searching protein sequence databases.   Cited by: 12 (self-citations: 2, citations by others: 10)
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith-Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45-55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()-scaling). With the best modern scoring matrix (BLOSUM55 or JO93) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith-Waterman and FASTA performed significantly better than BLASTP. With ln()-scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith-Waterman algorithm performs better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference with FASTA was not significant. Ln()-scaling performed better than normalization based on other simple functions of library sequence length. Ln()-scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith-Waterman and FASTA, using conventional or ln()-scaled similarity scores. Searches with no penalty for gap extension, or no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith-Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith-Waterman algorithm and ln()-scaling.

9.
Bayesian adaptive sequence alignment algorithms   Cited by: 3 (self-citations: 1, citations by others: 2)
The selection of a scoring matrix and gap penalty parameters continues to be an important problem in sequence alignment. We describe here an algorithm, the 'Bayes block aligner', which bypasses this requirement. Instead of requiring a fixed set of parameter settings, this algorithm returns the Bayesian posterior probability for the number of gaps and for the scoring matrices in any series of interest. Furthermore, instead of returning the single best alignment for the chosen parameter settings, this algorithm returns the posterior distribution of all alignments considering the full range of gapping and scoring matrices selected, weighing each in proportion to its probability based on the data. We compared the Bayes aligner with the popular Smith-Waterman algorithm with parameter settings from the literature which had been optimized for the identification of structural neighbors, and found that the Bayes aligner correctly identified more structural neighbors. In a detailed examination of the alignment of a pair of kinase and a pair of GTPase sequences, we illustrate the algorithm's potential to identify subsequences that are conserved to different degrees. In addition, this example shows that the Bayes aligner returns an alignment-free assessment of the distance between a pair of sequences.

10.
Qiu J, Elber R. Proteins, 2006, 62(4): 881-891
In template-based modeling of protein structures, the generation of the alignment between the target and the template is a critical step that significantly affects the accuracy of the final model. This paper proposes an alignment algorithm SSALN that learns substitution matrices and position-specific gap penalties from a database of structurally aligned protein pairs. In addition to the amino acid sequence information, secondary structure and solvent accessibility information of a position are used to derive substitution scores and position-specific gap penalties. In a test set of CASP5 targets, SSALN outperforms sequence alignment methods such as a Smith-Waterman algorithm with BLOSUM50 and PSI_BLAST. SSALN also generates better alignments than PSI_BLAST in the CASP6 test set. LOOPP server prediction based on an SSALN alignment is ranked the best for target T0280_1 in CASP6. SSALN is also compared with several threading methods and sequence alignment methods on the ProSup benchmark. SSALN has the highest alignment accuracy among the methods compared. On Fischer's benchmark, SSALN performs better than CLUSTALW and GenTHREADER, and generates more alignments with accuracy >50%, >60% or >70% than FUGUE, but fewer alignments with accuracy >80% than FUGUE. All the supplemental materials can be found at http://www.cs.cornell.edu/~jianq/research.htm.

11.
Contact-based sequence alignment   Cited by: 2 (self-citations: 1, citations by others: 1)
This paper introduces the novel method of contact-based protein sequence alignment, where structural information in the form of contact mutation probabilities is incorporated into an alignment routine using contact-mutation matrices (CAO: Contact Accepted mutatiOn). The contact-based alignment routine optimizes the score of matched contacts, which involves four (two per contact) instead of two residues per match in pairwise alignments. The first contact refers to a real side-chain contact in a template sequence with known structure, and the second contact is the equivalent putative contact of a homologous query sequence with unknown structure. An algorithm has been devised to perform a pairwise sequence alignment based on contact information. The contact scores were combined with PAM-type (Point Accepted Mutation) substitution scores after parameterization of gap penalties and score weights by means of a genetic algorithm. We show that owing to the structural information contained in the CAO matrices, significantly improved alignments of distantly related sequences can be obtained. This has allowed us to annotate eight putative Drosophila IGF sequences. Contact-based sequence alignment should therefore prove useful in comparative modelling and fold recognition.

12.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.

13.
Sequence alignment profiles have been shown to be very powerful in creating accurate sequence alignments. Profiles are often used to search a sequence database with a local alignment algorithm. More accurate and longer alignments have been obtained with profile-to-profile comparison. There are several steps that must be performed in creating profile-profile alignments, and each involves choices in parameters and algorithms. These steps include (1) what sequences to include in a multiple alignment used to build each profile, (2) how to weight similar sequences in the multiple alignment and how to determine amino acid frequencies from the weighted alignment, (3) how to score a column from one profile aligned to a column of the other profile, (4) how to score gaps in the profile-profile alignment, and (5) how to include structural information. Large-scale benchmarks consisting of pairs of homologous proteins with structurally determined sequence alignments are necessary for evaluating the efficacy of each scoring scheme. With such a benchmark, we have investigated the properties of profile-profile alignments and found that (1) with optimized gap penalties, most column-column scoring functions behave similarly to one another in alignment accuracy; (2) some functions, however, have much higher search sensitivity and specificity; (3) position-specific weighting schemes in determining amino acid counts in columns of multiple sequence alignments are better than sequence-specific schemes; (4) removing positions in the profile with gaps in the query sequence results in better alignments; and (5) adding predicted and known secondary structure information improves alignments.

14.
In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and as a result, the growth of publicly maintained sequence databases. The increase of data present all around has put high requirements on protein similarity search algorithms with two ever-opposite goals: how to keep the running times acceptable while maintaining a high-enough level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, the majority of the protein similarity search methods prior to doing the exact local alignment apply heuristics to reduce the number of possible candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.

15.
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
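The credibility limit defined in this abstract can be estimated from a sample of the posterior. The sketch below assumes, purely for illustration, that each alignment is encoded as an equal-length string or sequence of indicators, and that `samples` are equally weighted posterior draws. Neither the encoding nor the function names come from the paper.

```python
import math


def hamming(x, y):
    """Number of positions at which two equal-length encodings differ."""
    if len(x) != len(y):
        raise ValueError("encodings must have equal length")
    return sum(u != v for u, v in zip(x, y))


def credibility_limit(estimate, samples, alpha=0.05):
    """Smallest Hamming radius around `estimate` that contains at least
    a (1 - alpha) fraction of the sampled alignments."""
    if not samples:
        raise ValueError("need at least one posterior sample")
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    dists = sorted(hamming(estimate, s) for s in samples)
    # Number of samples the ball must cover; the small tolerance guards
    # against floating-point round-up in (1 - alpha) * len(dists).
    needed = max(1, math.ceil((1 - alpha) * len(dists) - 1e-9))
    return dists[needed - 1]
```

A tighter limit for one estimator than for another, computed against the same samples, corresponds to the comparison the abstract draws between centroid and maximum similarity alignments.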

16.
PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. This database has been created to fall in correspondence with SCOP database (1.53 release) and currently consists of 110 multi-member superfamilies and 613 superfamilies corresponding to single members. In multi-member superfamilies, protein chains with no more than 25% sequence identity have been considered for the alignment and hence the database aims to address sequence alignments which represent 26 219 protein domains under the SCOP 1.53 release. Structure-based sequence alignments have been obtained by COMPARER and the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for the structural features using JOY4.0. Several interesting links are provided to other related databases and genome sequence relatives. Availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity and single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure–function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced by PSI-BLAST methods. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.

17.
BCL::Align is a multiple sequence alignment tool that utilizes the dynamic programming method in combination with a customizable scoring function for sequence alignment and fold recognition. The scoring function is a weighted sum of the traditional PAM and BLOSUM scoring matrices, position-specific scoring matrices output by PSI-BLAST, secondary structure predicted by a variety of methods, chemical properties, and gap penalties. By adjusting the weights, the method can be tailored for fold recognition or sequence alignment tasks at different levels of sequence identity. A Monte Carlo algorithm was used to determine optimized weight sets for sequence alignment and fold recognition that most accurately reproduced the SABmark reference alignment test set. In an evaluation of sequence alignment performance, BCL::Align ranked best in alignment accuracy (Cline score of 22.90 for sequences in the Twilight Zone) when compared with Align-m, ClustalW, T-Coffee, and MUSCLE. ROC curve analysis indicates BCL::Align's ability to correctly recognize protein folds with over 80% accuracy. The flexibility of the program allows it to be optimized for specific classes of proteins (e.g. membrane proteins) or fold families (e.g. TIM-barrel proteins). BCL::Align is free for academic use and available online at http://www.meilerlab.org/.

18.
Dong E, Smith J, Heinze S, Alexander N, Meiler J. Gene, 2008, 422(1-2): 41-46

19.
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.

20.
John B, Sali A. Nucleic Acids Research, 2003, 31(14): 3982-3992
Comparative or homology protein structure modeling is severely limited by errors in the alignment of a modeled sequence with related proteins of known three-dimensional structure. To ameliorate this problem, we have developed an automated method that optimizes both the alignment and the model implied by it. This task is achieved by a genetic algorithm protocol that starts with a set of initial alignments and then iterates through re-alignment, model building and model assessment to optimize a model assessment score. During this iterative process: (i) new alignments are constructed by application of a number of operators, such as alignment mutations and cross-overs; (ii) comparative models corresponding to these alignments are built by satisfaction of spatial restraints, as implemented in our program MODELLER; (iii) the models are assessed by a variety of criteria, partly depending on an atomic statistical potential. When testing the procedure on a very difficult set of 19 modeling targets sharing only 4–27% sequence identity with their template structures, the average final alignment accuracy increased from 37 to 45% relative to the initial alignment (the alignment accuracy was measured as the percentage of positions in the tested alignment that were identical to the reference structure-based alignment). Correspondingly, the average model accuracy increased from 43 to 54% (the model accuracy was measured as the percentage of the Cα atoms of the model that were within 5 Å of the corresponding Cα atoms in the superposed native structure). The present method also compares favorably with two of the most successful previously described methods, PSI-BLAST and SAM. The accuracy of the final models would be increased further if a better method for ranking of the models were available.
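The alignment accuracy measure quoted in this abstract (the percentage of positions in the tested alignment that agree with the reference alignment) might be computed along the following lines. This is a sketch under stated assumptions: alignments are given as two gapped rows of equal length, a "position" is taken to be an aligned residue pair, and the denominator is the number of such pairs in the tested alignment. The abstract does not spell out these details, and the function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Set of (i, j) residue-index pairs that a gapped alignment matches up."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.add((i, j))
        if x != gap:
            i += 1          # advance the residue index in the first sequence
        if y != gap:
            j += 1          # advance the residue index in the second sequence
    return pairs


def alignment_accuracy(tested, reference):
    """Percentage of aligned pairs in `tested` that also occur in `reference`.

    Each argument is a (row_a, row_b) tuple of gapped strings.
    """
    tested_pairs = aligned_pairs(*tested)
    reference_pairs = aligned_pairs(*reference)
    if not tested_pairs:
        return 0.0
    return 100.0 * len(tested_pairs & reference_pairs) / len(tested_pairs)
```

For example, with reference rows `("ACGT", "A-GT")` and tested rows `("ACGT", "AG-T")`, two of the three tested pairs agree with the reference, giving an accuracy of about 66.7%.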
