Similar literature
 Found 20 similar documents; search took 31 ms
1.
BCL::Align is a multiple sequence alignment tool that utilizes the dynamic programming method in combination with a customizable scoring function for sequence alignment and fold recognition. The scoring function is a weighted sum of the traditional PAM and BLOSUM scoring matrices, position-specific scoring matrices output by PSI-BLAST, secondary structure predicted by a variety of methods, chemical properties, and gap penalties. By adjusting the weights, the method can be tailored for fold recognition or sequence alignment tasks at different levels of sequence identity. A Monte Carlo algorithm was used to determine optimized weight sets for sequence alignment and fold recognition that most accurately reproduced the SABmark reference alignment test set. In an evaluation of sequence alignment performance, BCL::Align ranked best in alignment accuracy (Cline score of 22.90 for sequences in the Twilight Zone) when compared with Align-m, ClustalW, T-Coffee, and MUSCLE. ROC curve analysis indicates BCL::Align's ability to correctly recognize protein folds with over 80% accuracy. The flexibility of the program allows it to be optimized for specific classes of proteins (e.g. membrane proteins) or fold families (e.g. TIM-barrel proteins). BCL::Align is free for academic use and available online at http://www.meilerlab.org/.
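The weighted-sum scoring idea in this abstract can be sketched in a few lines. The example below is a minimal illustration and not BCL::Align's implementation: the two scoring terms (residue identity and agreement of hypothetical secondary-structure strings), the weights, and the linear gap penalty are all assumptions chosen for readability.

```python
def weighted_score(terms, weights):
    """Combine per-position scoring terms into one function via a weighted sum."""
    def score(i, j):
        return sum(w * term(i, j) for w, term in zip(weights, terms))
    return score


def global_alignment_score(n, m, pair_score, gap=-1.0):
    """Needleman-Wunsch score for sequences of length n and m (linear gap penalty)."""
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + pair_score(i - 1, j - 1),  # align i with j
                         prev[j] + gap,                           # gap in sequence B
                         cur[j - 1] + gap)                        # gap in sequence A
        prev = cur
    return prev[m]


# Toy inputs; the secondary-structure strings are hypothetical predictions.
seq_a, seq_b = "HEAGAWGHEE", "PAWHEAE"
ss_a, ss_b = "HHHHCCCEEE", "CCHHHEE"

identity = lambda i, j: 1.0 if seq_a[i] == seq_b[j] else -1.0
ss_match = lambda i, j: 1.0 if ss_a[i] == ss_b[j] else 0.0

# Assumed weights: sequence term counts fully, structure term counts half.
score_fn = weighted_score([identity, ss_match], [1.0, 0.5])
total = global_alignment_score(len(seq_a), len(seq_b), score_fn, gap=-1.0)
print(total)
```

Changing the weight list is the only step needed to shift emphasis between terms, which is the kind of tuning the abstract attributes to its Monte Carlo weight optimization.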

2.
Protein threading using PROSPECT: design and evaluation   Cited by: 14 (self-citations: 0; citations by others: 14)
Xu Y, Xu D. Proteins 2000, 40(3): 343-354
The computer system PROSPECT for protein fold recognition using the threading method is described and evaluated in this article. For a given target protein sequence and a template structure, PROSPECT guarantees to find a globally optimal threading alignment between the two. The scoring function for a threading alignment employed in PROSPECT consists of four additive terms: i) a mutation term, ii) a singleton fitness term, iii) a pairwise-contact potential term, and iv) alignment gap penalties. The current version of PROSPECT considers pair contacts only between core (alpha-helix or beta-strand) residues and alignment gaps only in loop regions. PROSPECT finds a globally optimal threading efficiently when pairwise contacts are considered only between residues that are spatially close (7 Å or less between the Cβ atoms in the current implementation). On a test set consisting of 137 pairs of target-template proteins, each pair being from the same superfamily and having sequence identity […]

3.
Improved sequence alignment at low pairwise identity is important for identifying potential remote homologues in database searches and for obtaining accurate alignments as a prelude to modeling structures by homology. Our work is motivated by two observations: structural data provide superior training examples for developing techniques to improve the alignment of remote homologues; and general substitution patterns for remote homologues differ from those of closely related proteins. We introduce a new set of amino acid residue interchange matrices built from structural superposition data. These matrices exploit known structural homology as a means of characterizing the effect evolution has on residue-substitution profiles. Given their origin, it is not surprising that the individual residue-residue interchange frequencies are chemically sensible. The structural interchange matrices show a significant increase both in pairwise alignment accuracy and in functional annotation/fold recognition accuracy across distantly related sequences. We demonstrate improved pairwise alignment by using superpositions of homologous domains extracted from a structural database as a gold standard and go on to show an increase in fold recognition accuracy using a database of homologous fold families. This was applied to the unassigned open reading frames from the genome of Helicobacter pylori to identify five matches, two of which are not represented by new annotations in the sequence databases. In addition, we describe a new cyclic permutation strategy to identify distant homologues that experienced gene duplication and subsequent deletions. Using this method, we have identified a potential homologue to one additional previously unassigned open reading frame from the H. pylori genome.

4.
MOTIVATION: Membrane-bound proteins are a special class of proteins. The regions that insert into the cell-membrane have a profoundly different hydrophobicity pattern compared with soluble proteins. Multiple alignment techniques use scoring schemes tailored for sequences of soluble proteins and are therefore in principle not optimal to align membrane-bound proteins. RESULTS: Transmembrane (TM) regions in protein sequences can be reliably recognized using state-of-the-art sequence prediction techniques. Furthermore, membrane-specific scoring matrices are available. We have developed a new alignment method, called PRALINETM, which integrates these two features to enhance multiple sequence alignment. We tested our algorithm on the TM alignment benchmark set by Bahr et al. (2001), and showed that the quality of TM alignments can be significantly improved compared with the quality produced by a standard multiple alignment technique. The results clearly indicate that the incorporation of these new elements into current state-of-the-art alignment methods is crucial for optimizing the alignment of TM proteins. AVAILABILITY: A webserver is available at http://www.ibi.vu.nl/programs/pralinewww.

5.
Zhou H, Zhou Y. Proteins 2004, 55(4): 1005-1013
An elaborate knowledge-based energy function is designed for fold recognition. It is a residue-level single-body potential, so that a highly efficient dynamic programming method can be used for alignment optimization. It contains a backbone torsion term, a buried surface term, and a contact-energy term. The energy score combined with sequence profile and secondary structure information leads to an algorithm called SPARKS (Sequence, secondary structure Profiles and Residue-level Knowledge-based energy Score) for fold recognition. Compared with the popular PSI-BLAST, SPARKS is 21% more accurate in sequence-sequence alignment in the ProSup benchmark and 10%, 25%, and 20% more sensitive in detecting the family, superfamily, and fold similarities in the Lindahl benchmark, respectively. Moreover, it is one of the best methods for sensitivity (the number of correctly recognized proteins), alignment accuracy (based on the MaxSub score), and specificity (the average number of correctly recognized proteins whose scores are higher than the first false positives) in LiveBench 7 among more than twenty servers of non-consensus methods. The simple algorithm used in SPARKS has the potential for further improvement. This highly efficient method can be used for fold recognition on genomic scales. A web server is established for academic users at http://theory.med.buffalo.edu.

6.
We present a protein fold-recognition method that uses a comprehensive statistical interpretation of structural Hidden Markov Models (HMMs). The structure/fold recognition is done by summing the probabilities of all sequence-to-structure alignments. The optimal alignment can be defined as the most probable, but suboptimal alignments may have comparable probabilities. These suboptimal alignments can be interpreted as optimal alignments to the "other" structures from the ensemble or optimal alignments under minor fluctuations in the scoring function. Summing probabilities for all alignments gives a complete estimate of sequence-model compatibility. In the case of HMMs that produce a sequence, this reflects the fact that due to our indifference to exactly how the HMM produced the sequence, we should sum over all possibilities. We have built a set of structural HMMs for 188 protein structures and have compared two methods for identifying the structure compatible with a sequence: by the optimal alignment probability and by the total probability. Fold recognition by total probability was 40% more accurate than fold recognition by the optimal alignment probability. Proteins 2000;40:451-462.
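The difference between the optimal-alignment probability and the total probability can be illustrated on a generic HMM. The sketch below is not the paper's structural HMM: the two states, the hydrophobic/polar alphabet, and all probabilities are made-up assumptions. It only shows that the best single path (Viterbi) and the sum over all paths (forward algorithm) share the same recursion, with `max` replaced by `sum`.

```python
def best_and_total_probability(obs, states, start, trans, emit):
    """Return (probability of the best state path, probability summed over all paths)."""
    best = {s: start[s] * emit[s][obs[0]] for s in states}
    total = dict(best)
    for symbol in obs[1:]:
        # Viterbi step: keep only the most probable predecessor.
        best = {s: max(best[r] * trans[r][s] for r in states) * emit[s][symbol]
                for s in states}
        # Forward step: sum over every predecessor.
        total = {s: sum(total[r] * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return max(best.values()), sum(total.values())


# Hypothetical two-state model emitting hydrophobic ("h") or polar ("p") residues.
states = ("core", "loop")
start = {"core": 0.5, "loop": 0.5}
trans = {"core": {"core": 0.8, "loop": 0.2},
         "loop": {"core": 0.3, "loop": 0.7}}
emit = {"core": {"h": 0.7, "p": 0.3},
        "loop": {"h": 0.2, "p": 0.8}}

p_best, p_total = best_and_total_probability("hhpph", states, start, trans, emit)
print(p_best, p_total)
```

The total probability is never smaller than the best-path probability, and the gap grows when many suboptimal paths have comparable probability, which is the situation the abstract describes.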

7.
MOTIVATION: Sequences for new proteins are being determined at a rapid rate, as a result of the Human Genome Project, and related genome research. The ability to predict the three-dimensional structure of proteins from sequence alone would be useful in discovering and understanding their function. Threading, or fold recognition, aims to predict the tertiary structure of a protein by aligning its amino acid sequence with a large number of structures, and finding the best fit. This approach depends on obtaining good performance from both the scoring function, which simulates the free energy for given trial alignments, and the threading algorithm, which searches for the lowest-score alignment. It appears that current scoring functions and threading algorithms need improvement. RESULTS: This paper presents a new threading algorithm. Numerical tests demonstrate that it is more powerful than two popular approximate algorithms, and much faster than exact methods.

8.
SUMMARY: Sequence-structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence-structure alignment in a pool of alignments remains a difficult problem. QUASAR (quality of sequence-structure alignments ranking) provides a unifying framework for scoring sequence-structure alignments that helps find well-performing combinations of well-known and custom-made scoring schemes. Those scoring functions can be benchmarked against widely accepted quality scores like MaxSub, TMScore, Touch and APDB, thus enabling users to test their own alignment scores against 'standard-of-truth' structure-based scores. Furthermore, individual score combinations can be optimized with respect to benchmark sets based on known structural relationships using QUASAR's in-built optimization routines.

9.
Shan Y, Wang G, Zhou HX. Proteins 2001, 42(1): 23-37
A homology-based structure prediction method ideally gives both a correct fold assignment and an accurate query-template alignment. In this article we show that the combination of two existing methods, PSI-BLAST and threading, leads to significant enhancement in the success rate of fold recognition. The combined approach, termed COBLATH, also yields much higher alignment accuracy than found in previous studies. It consists of two-way searches both by PSI-BLAST and by threading. In the PSI-BLAST portion, a query is used to search for hits in a library of potential templates and, conversely, each potential template is used to search for hits in a library of queries. In the threading portion, the scoring function is the sum of a sequence profile and a 6x6 substitution matrix between predicted query and known template secondary structure and solvent exposure. "Two-way" in threading means that the query's sequence profile is used to match the sequences of all potential templates and the sequence profiles of all potential templates are used to match the query's sequence. When tested on a set of 533 nonhomologous proteins, COBLATH was able to assign folds for 390 (73%). Among these 390 queries, 265 (68%) had root-mean-square deviations (RMSDs) of less than 8 Å between predicted and actual structures. Such a high success rate and accuracy make COBLATH an ideal tool for structural genomics.

10.
Standley DM, Toh H, Nakamura H. Proteins 2008, 72(4): 1333-1351
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized.

11.
A neural network-based method has been developed for the prediction of beta-turns in proteins by using multiple sequence alignment. Two feed-forward back-propagation networks with a single hidden layer are used, where the first (sequence-structure) network is trained with the multiple sequence alignment in the form of PSI-BLAST-generated position-specific scoring matrices. The initial predictions from the first network and PSIPRED-predicted secondary structure are used as input to the second (structure-structure) network to refine the predictions obtained from the first net. A significant improvement in prediction accuracy has been achieved by using evolutionary information contained in the multiple sequence alignment. The final network yields an overall prediction accuracy of 75.5% when tested by sevenfold cross-validation on a set of 426 nonhomologous protein chains. The corresponding Q(pred), Q(obs), and Matthews correlation coefficient values are 49.8%, 72.3%, and 0.43, respectively, and are the best among all the previously published beta-turn prediction methods. The Web server BetaTPred2 (http://www.imtech.res.in/raghava/betatpred2/) has been developed based on this approach.
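The four measures quoted in this abstract can be computed from a residue-level confusion matrix. The sketch below assumes the definitions commonly used in the beta-turn prediction literature (Q(pred) as the percentage of predicted turn residues that are correct, Q(obs) as the percentage of observed turn residues that are predicted); the abstract itself does not spell them out, and the counts are invented for illustration.

```python
from math import sqrt


def turn_metrics(tp, tn, fp, fn):
    """Accuracy measures for a two-class (turn / non-turn) residue prediction.

    tp: turn residues predicted as turn      tn: non-turn predicted as non-turn
    fp: non-turn predicted as turn           fn: turn predicted as non-turn
    """
    total = tp + tn + fp + fn
    q_total = 100.0 * (tp + tn) / total   # overall accuracy, in percent
    q_pred = 100.0 * tp / (tp + fp)       # share of predictions that are right
    q_obs = 100.0 * tp / (tp + fn)        # share of real turns that are found
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return q_total, q_pred, q_obs, mcc


# Invented counts, not taken from the paper.
q_total, q_pred, q_obs, mcc = turn_metrics(tp=50, tn=30, fp=10, fn=10)
print(q_total, q_pred, q_obs, mcc)
```

Because beta-turn residues are the minority class, overall accuracy alone can look high even for a weak predictor, which is presumably why the abstract reports the Matthews correlation coefficient alongside it.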

12.
A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance-dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar to (but not identical with) the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10% more structures identified correctly as the most likely structural match in a fold library, and 20% more structures correctly narrowed down to a set of five possible candidates. JThread also improves the average sequence alignment accuracy significantly, from 53% to 62% of residues aligned correctly. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.

13.
MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
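The simple Bonferroni correction mentioned in this abstract is easy to state in code. The sketch below shows only that baseline, not the extreme-value correction the authors propose, and the p-values are invented: when the best of k matrices is reported, its p-value is multiplied by k and capped at 1.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-value for the most significant of several tests."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    return min(1.0, len(p_values) * min(p_values))


# Invented p-values for one query/hit pair scored with three matrices.
p_by_matrix = {"BLOSUM62": 0.004, "BLOSUM45": 0.02, "PAM250": 0.015}

raw_best = min(p_by_matrix.values())
adjusted = bonferroni_adjust(list(p_by_matrix.values()))
print(raw_best, adjusted)
```

Scores from different matrices on the same sequence pair are strongly correlated, so this correction tends to be conservative; the abstract's proposed correction factors are meant to be more accurate than this baseline.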

14.
Liu S, Zhang C, Liang S, Zhou Y. Proteins 2007, 68(3): 636-645
Recognizing the structural similarity without significant sequence identity (called fold recognition) is the key for bridging the gap between the number of known protein sequences and the number of structures solved. Previously, we developed a fold-recognition method called SP(3) which combines sequence-derived sequence profiles, secondary-structure profiles and residue-depth dependent, structure-derived sequence profiles. The use of residue-depth-dependent profiles makes SP(3) one of the best automatic predictors in CASP 6. Because residue depth (RD) and solvent accessible surface area (solvent accessibility) are complementary in describing the exposure of a residue to solvent, we test whether or not incorporation of solvent-accessibility profiles into SP(3) could further increase the accuracy of fold recognition. The resulting method, called SP(4), was tested in the SALIGN benchmark for alignment accuracy and in Lindahl, LiveBench 8 and CASP7 blind prediction for fold recognition sensitivity and model-structure accuracy. For remote homologs, SP(4) is found to consistently improve over SP(3) in the accuracy of sequence alignment and predicted structural models as well as in the sensitivity of fold recognition. Our result suggests that RD and solvent accessibility can be used concurrently for improving the accuracy and sensitivity of fold recognition. The SP(4) server and its local usage package are available at http://sparks.informatics.iupui.edu/SP4.

15.
Wu S, Zhang Y. Proteins 2008, 72(2): 547-556
We develop a new threading algorithm MUSTER by extending the previous sequence profile-profile alignment method, PPA. It combines various sequence and structure information into single-body terms which can be conveniently used in dynamic programming search: (1) sequence profiles; (2) secondary structures; (3) structure fragment profiles; (4) solvent accessibility; (5) dihedral torsion angles; (6) hydrophobic scoring matrix. The balance of the weighting parameters is optimized by a grading search based on the average TM-score of 111 training proteins which shows a better performance than using the conventional optimization methods based on the PROSUP database. The algorithm is tested on 500 nonhomologous proteins independent of the training sets. After removing the homologous templates with a sequence identity to the target >30%, in 224 cases, the first template alignment has the correct topology with a TM-score >0.5. Even with a more stringent cutoff by removing the templates with a sequence identity >20% or detectable by PSI-BLAST with an E-value <0.05, MUSTER is able to identify correct folds in 137 cases with the first model of TM-score >0.5. Dependent on the homology cutoffs, the average TM-score of the first threading alignments by MUSTER is 5.1-6.3% higher than that by PPA. This improvement is statistically significant by the Wilcoxon signed rank test with a P-value < 1.0 × 10^-13, which demonstrates the effect of additional structural information on the protein fold recognition. The MUSTER server is freely available to the academic community at http://zhang.bioinformatics.ku.edu/MUSTER.
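The TM-score threshold of 0.5 used in this abstract refers to a length-normalized structural similarity measure. The sketch below evaluates the standard TM-score formula for one fixed superposition, given the distances between aligned residue pairs. It is an illustration only: a real TM-score calculation also searches for the superposition that maximizes the sum, and the lower clamp on d0 for short chains is an assumption borrowed from common practice.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distance in angstroms for each aligned residue pair
    target_length: number of residues in the target protein
    """
    if target_length <= 15:
        raise ValueError("this sketch assumes a target longer than 15 residues")
    # Length-dependent distance scale; clamped below at 0.5 (assumed convention).
    d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Invented example: 100-residue target, 80 aligned pairs at 2 angstroms each.
score = tm_score([2.0] * 80, target_length=100)
print(score)
```

Unaligned target residues contribute nothing to the sum but still count in the denominator, so partial alignments are penalized in proportion to their coverage.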

16.
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10% and 20% sequence identity and 60% to 80% superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.

17.
Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.

18.
Lin HN, Notredame C, Chang JM, Sung TY, Hsu WL. PLoS ONE 2011, 6(12): e27872
Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches. In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset, and supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.

19.
Two new sets of scoring matrices are introduced: H2 for the protein sequence comparison and T2 for the protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences were aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested: the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.

20.
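Construction of a folding-core-based substitution matrix for all-α proteins   Cited by: 1 (self-citations: 1; citations by others: 0)
The amino acid residue substitution matrix is an important factor in the quality of multiple sequence alignment, and existing substitution matrices perform poorly when aligning low-similarity sequences. Building on the existing BLOSUM substitution matrix algorithm, sequence-structure data blocks based on protein folding-core structures are defined, and a new amino acid residue substitution matrix based on the folding-core structures of all-α proteins, TOPSSUM25, is proposed to improve the alignment of low-similarity sequences. TOPSSUM25 was imported into a multiple sequence alignment program. Tests on a set of four-helix-bundle sequence-structure data blocks with similarity below 25% show that multiple sequence alignment based on TOPSSUM25 clearly outperforms alignment based on the BLOSUM30 matrix. An alignment test on a BAliBASE subset further shows that TOPSSUM25 is superior to BLOSUM30 for pairwise sequence alignment of all-α proteins. The results may help to further elucidate the sequence-structure-function relationships of proteins with low homology.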
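The BLOSUM algorithm that this abstract builds on turns counts of aligned residue pairs into log-odds scores. The sketch below shows that general calculation on a two-letter toy alphabet with invented counts. It is not the TOPSSUM25 procedure: the folding-core block definition and any sequence clustering step are omitted, and every observed pair is assumed to have a nonzero count.

```python
from math import log2


def log_odds_matrix(pair_counts):
    """BLOSUM-style half-bit scores from unordered aligned-pair counts.

    pair_counts maps (a, b) with a <= b to the number of times residues
    a and b were observed aligned in the blocks; all counts must be > 0.
    """
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))  # half-bit units
    return scores


# Invented counts for a two-residue alphabet.
scores = log_odds_matrix({("A", "A"): 6, ("A", "L"): 2, ("L", "L"): 2})
print(scores)
```

A positive score means the pair is aligned more often than chance predicts, so restricting the input blocks to a particular structural class changes which substitutions are rewarded.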
