Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D) structure. We investigated the correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pairwise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, "islands" between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive scores, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high-scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas the resulting alignments are on average of the same quality with respect to the gold standard. This finding shows that a decrease in alignment accuracy is not necessarily the price of computational efficiency.
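
To make the two objects in this abstract concrete, the following is a minimal sketch of a Smith-Waterman local score and of splitting a gapped pairwise alignment into ungapped "islands". It is not the authors' hierarchical algorithm. The function names, the match/mismatch values, and the linear gap penalty are illustrative assumptions; a real protein aligner would use a substitution matrix and affine gaps.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (assumed toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def island_scores(row_a, row_b, match=2, mismatch=-1):
    """Scores of the ungapped segments ("islands") of a gapped alignment.

    row_a and row_b are equal-length aligned strings that use '-' for gaps.
    """
    scores, current, in_island = [], 0, False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            if in_island:
                scores.append(current)
            current, in_island = 0, False
        else:
            current += match if x == y else mismatch
            in_island = True
    if in_island:
        scores.append(current)
    return scores
```

An island with a negative or small positive score under such a scheme is the kind of segment the abstract describes as falling below the sensitivity limit of local alignment.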

2.
MOTIVATION: Homology search for RNAs can use secondary structure information to increase power by modeling base pairs, as in covariance models, but the resulting computational costs are high. Typical acceleration strategies rely on at least one filtering stage using sequence-only search. RESULTS: Here we present the multi-segment CYK (MSCYK) filter, which implements a heuristic of ungapped structural alignment for RNA homology search. Compared to gapped alignment, this approximation has lower computation time requirements (O(N?) reduced to O(N^3)) and space requirements (O(N^3) reduced to O(N^2)). A vector-parallel implementation of this method gives up to 100-fold speed-up; vector-parallel implementations of standard gapped alignment at two levels of precision give 3- and 6-fold speed-ups. These approaches are combined to create a filtering pipeline that scores RNA secondary structure at all stages, with results that are synergistic with existing methods.

3.
MOTIVATION: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons. RESULTS: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs.
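
For readers unfamiliar with Fisher's combining method, the sketch below shows the textbook form: k independent p-values are combined through the statistic -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. This is not the LAMA significance measure itself, which the abstract only describes as "based on" the method; the function name is hypothetical, and independence of the p-values is an assumption.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in pvalues)  # statistic / 2
    # Chi-square survival function with 2k degrees of freedom has a closed
    # form for even degrees of freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.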

4.
MOTIVATION: Membrane-bound proteins are a special class of proteins. The regions that insert into the cell-membrane have a profoundly different hydrophobicity pattern compared with soluble proteins. Multiple alignment techniques use scoring schemes tailored for sequences of soluble proteins and are therefore in principle not optimal to align membrane-bound proteins. RESULTS: Transmembrane (TM) regions in protein sequences can be reliably recognized using state-of-the-art sequence prediction techniques. Furthermore, membrane-specific scoring matrices are available. We have developed a new alignment method, called PRALINETM, which integrates these two features to enhance multiple sequence alignment. We tested our algorithm on the TM alignment benchmark set by Bahr et al. (2001), and showed that the quality of TM alignments can be significantly improved compared with the quality produced by a standard multiple alignment technique. The results clearly indicate that the incorporation of these new elements into current state-of-the-art alignment methods is crucial for optimizing the alignment of TM proteins. AVAILABILITY: A webserver is available at http://www.ibi.vu.nl/programs/pralinewww.

5.
Multiple sequence alignment using partial order graphs   Total citations: 14, self-citations: 0, citations by others: 14
MOTIVATION: Progressive Multiple Sequence Alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts. RESULTS: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (Partial Order Alignment (POA)) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g. an alignment of 5000 sequences in 4 h on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism. AVAILABILITY: The partial order alignment program POA is available at http://www.bioinformatics.ucla.edu/poa.

6.
MOTIVATION: To construct a multiple sequence alignment (MSA) of a large number (more than approximately 10,000) of sequences, the calculation of a guide tree with a complexity of O(N^2) to O(N^3), where N is the number of sequences, is the most time-consuming process. RESULTS: To overcome this limitation, we have developed an approximate algorithm, PartTree, to construct a guide tree with an average time complexity of O(N log N). The new MSA method with the PartTree algorithm can align approximately 60,000 sequences in several minutes on a standard desktop computer. The loss of accuracy in MSA caused by this approximation was estimated to be several percent in benchmark tests using Pfam. AVAILABILITY: The present algorithm has been implemented in the MAFFT sequence alignment package (http://align.bmr.kyushu-u.ac.jp/mafft/software/). SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.

7.
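A new nucleic acid sequence alignment algorithm and its application to global sequence alignment   Total citations: 1, self-citations: 0, citations by others: 1
The dynamic programming algorithm now widely used for sequence alignment achieves optimal alignments, but its high computational complexity, O(N^2), greatly reduces computational efficiency. We applied a multistage dynamic programming decision algorithm to pairwise sequence alignment and implemented it in Visual BASIC. The new algorithm reduces the computational complexity to O(N) while still achieving reasonably good accuracy, and it is expected to play an important role in global sequence alignment.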

9.
Dong E, Smith J, Heinze S, Alexander N, Meiler J. Gene, 2008, 422(1-2): 41-46
BCL::Align is a multiple sequence alignment tool that utilizes the dynamic programming method in combination with a customizable scoring function for sequence alignment and fold recognition. The scoring function is a weighted sum of the traditional PAM and BLOSUM scoring matrices, position-specific scoring matrices output by PSI-BLAST, secondary structure predicted by a variety of methods, chemical properties, and gap penalties. By adjusting the weights, the method can be tailored for fold recognition or sequence alignment tasks at different levels of sequence identity. A Monte Carlo algorithm was used to determine optimized weight sets for sequence alignment and fold recognition that most accurately reproduced the SABmark reference alignment test set. In an evaluation of sequence alignment performance, BCL::Align ranked best in alignment accuracy (Cline score of 22.90 for sequences in the Twilight Zone) when compared with Align-m, ClustalW, T-Coffee, and MUSCLE. ROC curve analysis indicates BCL::Align's ability to correctly recognize protein folds with over 80% accuracy. The flexibility of the program allows it to be optimized for specific classes of proteins (e.g. membrane proteins) or fold families (e.g. TIM-barrel proteins). BCL::Align is free for academic use and available online at http://www.meilerlab.org/.
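
The idea of a weighted-sum scoring function can be illustrated with a toy two-term version. This is only a sketch: the term names, the identity score standing in for a PAM or BLOSUM lookup, and the weight values are all hypothetical, and none of it reflects the actual BCL::Align implementation or its optimized weight sets.

```python
def weighted_pair_score(res_a, res_b, ss_a, ss_b, weights):
    """Score one aligned residue pair as a weighted sum of two toy terms.

    res_a/res_b are residue letters; ss_a/ss_b are predicted secondary
    structure states such as 'H', 'E' or 'C'. weights maps term name to weight.
    """
    # Stand-in for a substitution matrix lookup (assumption, not PAM/BLOSUM).
    substitution = 1.0 if res_a == res_b else -1.0
    # Reward agreement of predicted secondary structure.
    secondary_structure = 1.0 if ss_a == ss_b else 0.0
    return (weights["substitution"] * substitution
            + weights["secondary_structure"] * secondary_structure)
```

Shifting weight from the substitution term toward the structure term is the kind of adjustment the abstract describes for moving between sequence alignment and fold recognition at low sequence identity.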

10.
Sequence analysis is the basis of bioinformatics, and sequence alignment is a fundamental task in sequence analysis. The widely used alignment algorithm, dynamic programming, generates optimal alignments but takes too much time because of its high computational complexity, O(N^2). To reduce computational complexity without sacrificing too much accuracy, we have developed a new approach to aligning two homologous sequences. The approach presented here adopts our novel algorithm, which combines probabilistic and combinatorial analysis, and reduces the computational complexity to as low as O(N). Our program is at least 15 times faster than traditional pairwise alignment algorithms without much loss of accuracy. We hence named the algorithm Super Pairwise Alignment (SPA). The pairwise alignment program based on SPA and the detailed results of the aligned sequences discussed in this article are available upon request.
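
The O(N^2) cost that this abstract refers to comes from filling a table with one cell per pair of positions. The sketch below is the classical Needleman-Wunsch global alignment score, shown as the baseline being compared against; it is not the SPA algorithm. The scoring values are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty.

    Time is O(len(a) * len(b)); only two rows are kept, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]  # first row: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)          # first column: all gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
        prev = curr
    return prev[-1]
```

Every cell is visited exactly once, so doubling both sequence lengths quadruples the work, which is the scaling the O(N) methods aim to avoid.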

11.
12.
MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
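
The simple Bonferroni correction mentioned above can be sketched as follows: when the best of m p-values is reported, that p-value is multiplied by m and capped at 1. The function name and the example p-values are hypothetical, and this is the generic correction only, not the extreme-value-based correction factors proposed in the paper.

```python
def bonferroni_best_hit(pvalues_by_matrix):
    """Return the matrix with the smallest p-value and its corrected p-value.

    pvalues_by_matrix maps a scoring matrix name to the p-value of the
    alignment obtained with that matrix.
    """
    m = len(pvalues_by_matrix)
    if m == 0:
        raise ValueError("need at least one p-value")
    best = min(pvalues_by_matrix, key=pvalues_by_matrix.get)
    corrected = min(1.0, m * pvalues_by_matrix[best])
    return best, corrected
```

Because scores from related matrices are strongly correlated, Bonferroni tends to be conservative here, which is consistent with the abstract's motivation for a more accurate correction.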

13.
MOTIVATION: Even the best sequence alignment methods frequently fail to correctly identify the framework regions for which backbones can be copied from the template into the target structure. Since the underprediction and, more significantly, the overprediction of these regions reduces the quality of the final model, it is of prime importance to attain as much as possible of the true structural alignment between target and template. RESULTS: We have developed an algorithm called Consensus that consistently provides a high quality alignment for comparative modeling. The method follows from a benchmark analysis of the 3D models generated by ten alignment techniques for a set of 79 homologous protein structure pairs. For 20 to 40% of the targets, these methods yield models with at least 6 Å root mean square deviation (RMSD) from the native structure. We have selected the top five performing methods, and developed a consensus algorithm to generate an improved alignment. By building on the individual strength of each method, a set of criteria was implemented to remove the alignment segments that are likely to correspond to structurally dissimilar regions. The automated algorithm was validated on a different set of 48 protein pairs, resulting in 2.2 Å average RMSD for the predicted models, and only four cases in which the RMSD exceeded 3 Å. The average length of the alignments was about 75% of that found by standard structural superposition methods. The performance of Consensus was consistent from 2 to 32% target-template sequence identity, and hence it can be used for accurate prediction of framework regions in homology modeling.
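
The RMSD figures quoted above are root mean square distances between matched atoms. A minimal sketch of that calculation follows, assuming the two coordinate sets are already optimally superimposed and listed in the same atom order; the superposition step itself (for example the Kabsch algorithm) is deliberately left out, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

If the input coordinates are in ångströms, the result is in ångströms as well, matching the units used in the abstract.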

14.
Shape information about macromolecules is increasingly available but is difficult to use in modeling efforts. We demonstrate that shape information alone can often distinguish structural models of biological macromolecules. By using a data structure called a surface envelope (SE) to represent the shape of the molecule, we propose a method that generates a fitness score for the shape of a particular molecular model. This score correlates well with root mean squared deviation (RMSD) of the model to the known test structures and can be used to filter models in decoy sets. The scoring method requires both alignment of the model to the SE in three-dimensional space and assessment of the degree to which atoms in the model fill the SE. Alignment combines a hybrid algorithm using principal components and a previously published iterated closest point algorithm. We test our method against models generated from random atom perturbation from crystal structures, published decoy sets used in structure prediction, and models created from the trajectories of atoms in molecular modeling runs. We also test our alignment algorithm against experimental electron microscopic data from rice dwarf virus. The alignment performance is reliable, and we show a high correlation between model RMSD and score function. This correlation is stronger for molecular models with greater oblong character (as measured by the ratio of largest to smallest principal component).

15.
MOTIVATION: Protein sequence alignment plays a critical role in computational biology as it is an integral part of many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS: We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS: We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also show that these algorithms can be used to provide better information as to which of the aligned positions are more reliable, a critical piece of information for comparative modeling applications.

16.
Homology modeling is the most commonly used technique to build a three-dimensional model for a protein sequence. It relies heavily on the quality of the sequence alignment between the protein to be modeled and related proteins with a known three-dimensional structure. Alignment quality can be assessed according to the physico-chemical properties of the three-dimensional models it produces. In this work, we introduce fifteen predictors designed to evaluate the properties of the models obtained for various alignments. They consist of an energy value obtained from different force fields (CHARMM, ProsaII or ANOLEA) computed on residues selected around misaligned regions. These predictors were evaluated on ten challenging test cases. For each target, all possible ungapped alignments are generated and their corresponding models are computed and evaluated. The best predictor, retrieving the structural alignment for 9 out of 10 test cases, is based on the ANOLEA atomistic mean force potential and takes into account residues around misaligned secondary structure elements. The performance of the other predictors is significantly lower. This work shows that substantial improvement in local alignments can be obtained by careful assessment of the local structure of the resulting models.

17.
As a basic tool of modern biology, sequence alignment can provide useful information about the fold, function, and active site of a protein. In many cases, higher-quality sequence alignment means better performance. The motivation of the present work is to increase the ability of existing scoring schemes and algorithms by taking better account of residue-residue correlations. Based on a coarse-grained approach, the hydrophobic force between each pair of residues is derived from the protein sequence. This results in the construction of an intramolecular hydrophobic force network that describes all residue-residue interactions of each protein molecule and characterizes the protein's biological properties in the hydrophobic aspect. Earlier work has suggested that such a network can characterize the top weighted feature regarding hydrophobicity. Moreover, for each homologous protein of a family, the corresponding network shares some common and representative family characters that eventually govern the conservation of biological properties during protein evolution. In the present work, we score such family-representative characters of a protein by the deviation of its intramolecular hydrophobic force network from that of the background. Such a score can assist existing scoring schemes and algorithms and boost the ability of multiple sequence alignment, e.g. achieving a prominent increase (∼50%) in finding structurally alike residue segments at a low identity level. As the theoretical basis is different, the present scheme can assist most existing algorithms and improve their efficiency remarkably.

18.
The question of multiple sequence alignment quality has received much attention from developers of alignment methods. Less forthcoming, however, are practical measures for addressing alignment quality issues in real life settings. Here, we present a simple methodology to help identify and quantify the uncertainties in multiple sequence alignments and their effects on subsequent analyses. The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such "ideal" alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol. We demonstrate the utility of HoT for phylogenetic reconstruction for the case of 130 sequences belonging to the chemoreceptor superfamily in Drosophila melanogaster, and by analysis of the BaliBASE alignment database. Surprisingly, Neighbor-Joining methods of phylogenetic reconstruction turned out to be less affected by alignment errors than maximum likelihood and Bayesian methods.
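
The heads-or-tails comparison described above can be sketched as a column agreement score. The code assumes the two alignments already exist: `heads` is the alignment of the original sequences, and `tails_raw` is the alignment an aligner produced for the reversed sequences. The aligner itself is outside the sketch. The function names and the choice of a column-based measure are assumptions for illustration; the paper may use different agreement statistics.

```python
def columns(alignment):
    """Describe each column by the residue index it holds in every row.

    alignment is a list of equal-length gapped strings; gaps ('-') map to None.
    """
    counters = [0] * len(alignment)
    cols = []
    for col in zip(*alignment):
        entry = []
        for k, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[k])
                counters[k] += 1
        cols.append(tuple(entry))
    return cols


def hot_column_agreement(heads, tails_raw):
    """Fraction of heads columns that also appear in the tails alignment."""
    # Reverse each row so the tails alignment reads in the original direction.
    tails = [row[::-1] for row in tails_raw]
    heads_cols = set(columns(heads))
    tails_cols = set(columns(tails))
    return len(heads_cols & tails_cols) / len(heads_cols)
```

A value of 1.0 corresponds to the "ideal" case in the abstract, where the reversed input yields exactly the reversed alignment; lower values flag columns whose placement depends on sequence orientation.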

19.
A comparison of scoring functions for protein sequence profile alignment   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: In recent years, several methods have been proposed for aligning two protein sequence profiles, with reported improvements in alignment accuracy and homolog discrimination versus sequence-sequence methods (e.g. BLAST) and profile-sequence methods (e.g. PSI-BLAST). Profile-profile alignment is also the iterated step in progressive multiple sequence alignment algorithms such as CLUSTALW. However, little is known about the relative performance of different profile-profile scoring functions. In this work, we evaluate the alignment accuracy of 23 different profile-profile scoring functions by comparing alignments of 488 pairs of sequences with identity ≤30% against structural alignments. We optimize parameters for all scoring functions on the same training set and use profiles of alignments from both PSI-BLAST and SAM-T99. Structural alignments are constructed from a consensus between the FSSP database and CE structural aligner. We compare the results with sequence-sequence and sequence-profile methods, including BLAST and PSI-BLAST. RESULTS: We find that profile-profile alignment gives an average improvement over our test set of typically 2-3% over profile-sequence alignment and approximately 40% over sequence-sequence alignment. No statistically significant difference is seen in the relative performance of most of the scoring functions tested. Significantly better results are obtained with profiles constructed from SAM-T99 alignments than from PSI-BLAST alignments. AVAILABILITY: Source code, reference alignments and more detailed results are freely available at http://phylogenomics.berkeley.edu/profilealignment/

20.
Optimal spliced alignment of homologous cDNA to a genomic DNA template   Total citations: 17, self-citations: 0, citations by others: 17
MOTIVATION: Supplementary cDNA or EST evidence is often decisive for discriminating between alternative gene predictions derived from computational sequence inspection by any of a number of requisite programs. Without additional experimental effort, this approach must rely on the occurrence of cognate ESTs for the gene under consideration in available, generally incomplete, EST collections for the given species. In some cases, particular exon assignments can be supported by sequence matching even if the cDNA or EST is produced from non-cognate genomic DNA, including different loci of a gene family or homologous loci from different species. However, marginally significant sequence matching alone can also be misleading. We sought to develop an algorithm that would simultaneously score for predicted intrinsic splice site strength and sequence matching between the genomic DNA template and a related cDNA or EST. In this case, weakly predicted splice sites may be chosen for the optimal scoring spliced alignment on the basis of surrounding sequence matching. Strongly predicted splice sites will enter the optimal spliced alignment even without strong sequence matching. RESULTS: We designed a novel algorithm that produces the optimal spliced alignment of a genomic DNA with a cDNA or EST based on scoring for both sequence matching and intrinsic splice site strength. By example, we demonstrate that this combined approach appears to improve gene prediction accuracy compared with current methods that rely only on either search by content and signal or on sequence similarity. AVAILABILITY: The algorithm is available as a C subroutine and is implemented in the SplicePredictor and GeneSeqer programs. The source code is available via anonymous ftp from ftp.zmdb.iastate.edu. Both programs are also implemented as a Web service at http://gremlin1.zool.iastate.edu/cgi-bin/sp.cgi and http://gremlin1.zool.iastate.edu/cgi-bin/gs.cgi, respectively. CONTACT: vbrendel@iastate.edu
