Similar literature
20 similar articles found (search time: 15 ms)
1.
2.
Koike R  Kinoshita K  Kidera A 《Proteins》2007,66(3):655-663
Dynamic programming (DP) and its heuristic algorithms are the most fundamental methods for similarity searches of amino acid sequences. Their detection power has been improved by including supplemental information, such as homologous sequences in the profile method. Here, we describe a method, probabilistic alignment (PA), that gives improved detection power but, like the original DP, uses only a pair of amino acid sequences. Receiver operating characteristic (ROC) analysis demonstrated that the PA method is far superior to BLAST, and that its sensitivity and selectivity approach those of PSI-BLAST. Particularly for orphan proteins having few homologues in the database, PA exhibits much better performance than PSI-BLAST. On the basis of this observation, we applied the PA method to a homology search of two orphan proteins, Latexin and Resuscitation-promoting factor domain. Their molecular functions have been described based on structural similarities, but sequence homologues have not been identified by PSI-BLAST. PA successfully detected sequence homologues for the two proteins and confirmed that the observed structural similarities are the result of an evolutionary relationship.
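For orientation, the pairwise dynamic programming baseline mentioned in this abstract can be sketched as a Smith-Waterman-style local alignment score. This is not the PA method itself, whose details the abstract does not give. The function name and the scoring values (match +2, mismatch -1, linear gap -2) are illustrative assumptions, not a real substitution matrix.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (Smith-Waterman, linear gaps).

    The scoring values are placeholders chosen for illustration only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the matrix are kept, which is enough when the score alone is needed; recovering the alignment itself would require the full matrix or a traceback structure.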

3.
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that the success of including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST and is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
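The 'roulette-wheel' step described in this abstract can be illustrated with a minimal sketch. It assumes each profile position is given as a mapping from residue to a non-negative propensity; the function names and the toy profile layout are hypothetical, and the authors' actual implementation may differ.

```python
import random
from typing import Optional


def roulette_pick(propensities: dict[str, float], rng: random.Random) -> str:
    """Pick one residue with probability proportional to its propensity."""
    total = sum(propensities.values())
    if total <= 0:
        raise ValueError("at least one propensity must be positive")
    threshold = rng.random() * total  # where the wheel stops
    cumulative = 0.0
    chosen = None
    for residue, weight in propensities.items():
        if weight <= 0:
            continue  # zero-weight residues occupy no slice of the wheel
        cumulative += weight
        chosen = residue
        if threshold < cumulative:
            break
    return chosen  # falls back to the last positive slice on round-off


def generate_sequence(profile: list[dict[str, float]],
                      seed: Optional[int] = None) -> str:
    """Draw one 'family-like' sequence, one residue per profile position."""
    rng = random.Random(seed)
    return "".join(roulette_pick(column, rng) for column in profile)
```

Calling `generate_sequence` repeatedly with different seeds would yield a set of artificial sequences scattered around the family profile, which is the enlargement of sequence space the abstract describes.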

4.
A database search will often find a seemingly strong sequence similarity between two fragments of proteins that are not expected to have an evolutionary or functional relationship. It is tempting to suggest that the two fragments will adopt a similar conformation due to a common pattern of residues that dictate a particular substructure. To investigate the likelihood of such a structural similarity, local sequence similarities between proteins of known conformation were identified by a standard database search algorithm. A sequence similarity was considered significant when the chance probability of obtaining the relatedness score from a scan of the entire database was less than 1%. In this region both true homologies and false homologies are detected. A total of 69 false homologies were located, ranging in length from 20 to 262 aligned positions. Many of these alignments had approximately 25% sequence identity and a further 25% of conservative changes. However, the results show that, in general, these aligned fragments did not have a significant similarity in secondary or tertiary structure. Thus local sequence similarity does not indicate a structural similarity when there is neither an evolutionary nor a functional explanation to support this. Accordingly, structure predictions based on finding a local sequence similarity with an evolutionarily unrelated protein of known conformation are unlikely to be valid.

5.
Fold assignments for proteins from the Escherichia coli genome are carried out using BASIC, a profile-profile alignment algorithm recently tested on fold recognition benchmarks and on the Mycoplasma genitalium genome, and PSI-BLAST, the newest generation of the de facto standard in homology search algorithms. The fold assignments are followed by automated modeling, and the resulting three-dimensional models are analyzed for possible function prediction. Close to 30% of the proteins encoded in the E. coli genome can be recognized as homologous to a protein family with known structure. Most of these homologies (23% of the entire genome) can be recognized by both the PSI-BLAST and BASIC algorithms, but the latter recognizes an additional 260 homologies. Previous estimates suggested that only 10-15% of E. coli proteins can be characterized this way. This dramatic increase in the number of recognized homologies between E. coli proteins and structurally characterized protein families is partly due to the rapid growth of the database of known protein structures, but mostly due to the significant improvement in prediction algorithms. Knowing a protein's structure adds a new dimension to our understanding of its function, and the predictions presented here can be used to predict function for uncharacterized proteins. Several examples, analyzed in more detail in this paper, include the DPS protein protecting DNA from oxidative damage (predicted to be homologous to ferritin, with an iron ion acting as a reducing agent) and the ahpC/tsa family of proteins, which provides resistance to various oxidizing agents (predicted to be homologous to glutathione peroxidase).

6.
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, PSI-BLAST, which involves multiple rounds of iteration, is first used to detect an initial set of homologues for a protein in a ‘first generation’ search by querying a database. We then propagate a ‘second generation’ search of the database, running PSI-BLAST with each of the homologues identified in the previous generation as a query, to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this ‘cascaded’ intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and that non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein “fold space”.
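The generation-by-generation propagation described in this abstract amounts to a breadth-first expansion over search hits. The sketch below shows only that control flow. The `search` callable is a hypothetical stand-in for one complete PSI-BLAST run that returns hit identifiers; it is not a real PSI-BLAST binding, and the default of three generations simply mirrors the figure quoted in the abstract.

```python
from typing import Callable, Iterable


def cascade_search(query: str,
                   search: Callable[[str], Iterable[str]],
                   max_generations: int = 3) -> dict[str, int]:
    """Map each detected homologue to the generation in which it first appeared.

    `search` is an assumed placeholder for a full PSI-BLAST run on one query.
    """
    found: dict[str, int] = {}
    seen = {query}
    frontier = [query]
    for generation in range(1, max_generations + 1):
        next_frontier = []
        for seed in frontier:
            for hit in search(seed):
                if hit not in seen:  # only hits not detected earlier are kept
                    seen.add(hit)
                    found[hit] = generation
                    next_frontier.append(hit)
        if not next_frontier:  # no new hits: the cascade has converged
            break
        frontier = next_frontier
    return found
```

Recording the generation of first detection makes it easy to see how many homologues are reached only through intermediate sequences rather than directly from the original query.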

7.
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is of the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity.

8.
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, PSI-BLAST, which involves multiple rounds of iteration, is first used to detect an initial set of homologues for a protein in a 'first generation' search by querying a database. We then propagate a 'second generation' search of the database, running PSI-BLAST with each of the homologues identified in the previous generation as a query, to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this 'cascaded' intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and that non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein "fold space".

9.
We present a protein fold recognition method, MANIFOLD, which uses the similarity between target and template proteins in predicted secondary structure, sequence and enzyme code to predict the fold of the target protein. We developed a non-linear ranking scheme in order to combine the scores of the three different similarity measures used. For a difficult test set of proteins with very little sequence similarity, the program predicts the fold class correctly in 34% of cases. This is more than a twofold increase in accuracy compared with sequence-based methods such as PSI-BLAST or GenTHREADER, which score 13-14% correct first hits for the same test set. The functional similarity term increases the prediction accuracy by up to 3% compared with using the combination of secondary structure similarity and PSI-BLAST alone. We argue that using functional and secondary structure information can increase fold recognition beyond what sequence similarity alone achieves.

10.
Searches using position-specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against a database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and the divergence of the sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also find that a simple parameter, defined as the number of PSSMs of a family that are hit by a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple-profiles search approach.
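The discriminating parameter at the end of this abstract is a simple ratio, sketched below. The function name and the use of sets of profile identifiers are assumptions made for illustration; the abstract does not state a cut-off value, so none is applied here.

```python
def family_hit_fraction(hit_profiles: set[str],
                        family_profiles: set[str]) -> float:
    """Fraction of a family's PSSMs that a query hits.

    hit_profiles:    identifiers of all PSSMs hit by the query (any family)
    family_profiles: identifiers of every PSSM belonging to one family
    """
    if not family_profiles:
        raise ValueError("family must contain at least one PSSM")
    # Only hits that belong to this family count towards the numerator.
    return len(hit_profiles & family_profiles) / len(family_profiles)
```

A query that hits most profiles of a family gets a value near 1, whereas a spurious match to a single profile of a large family gets a value near 0, which is the intuition behind using the ratio to separate true from false positives.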

11.
Proteins might have considerable structural similarities even when no evolutionary relationship of their sequences can be detected. This property is often referred to as the proteins sharing only a "fold". Of course, there are also sequences of common origin in each fold, called a "superfamily", and within them groups of sequences with clear similarities, designated a "family". Developing algorithms to reliably identify proteins related at any level is one of the most important challenges in the fast-growing field of bioinformatics today. However, it is not at all certain that a method proficient at finding sequence similarities performs well at the other levels, or vice versa. Here, we have compared the performance of various search methods on these different levels of similarity. As expected, we show that it becomes much harder to detect proteins as their sequences diverge. For family-related sequences the best method gets 75% of the top hits correct. When the sequences differ but the proteins belong to the same superfamily this drops to 29%, and in the case of proteins with only fold similarity it is as low as 15%. We have made a more complete analysis of the performance of different algorithms than earlier studies, also including threading methods in the comparison. With this analysis a more detailed picture emerges, showing that multiple sequence information improves detection on the two closer levels of relationship. We have also compared the different methods of including this information in prediction algorithms. For lower specificities, the best scheme to use is a linking method connecting proteins through an intermediate hit. For higher specificities, better performance is obtained by PSI-BLAST and some procedures using hidden Markov models. We also show that a threading method, THREADER, performs significantly better than any other method at fold recognition.

12.
Several recent publications illustrated the advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporating the evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing the alignments closer to those obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ between methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. Examples of erroneous alignments are examined, and it is shown that they can be identified based on the model quality.

13.
Protein homology detection by HMM-HMM comparison   (total citations: 22; self-citations: 4; citations by others: 18)
MOTIVATION: Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction and evolution. RESULTS: We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER and the profile-profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When the predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile-profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7 and 3.3 times more good alignments ('balanced' score >0.3) than the next best method (COMPASS), and 1.6, 2.9 and 9.4 times more than PSI-BLAST, at the family, superfamily and fold level, respectively. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 2 GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than COMPASS.

14.
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10% and 20% sequence identity and 60% to 80% superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.
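As a rough illustration of the two ingredients named in this abstract, the sketch below computes a log-odds value for one amino acid pair from contact counts and then mixes a contact term with a profile term. The pseudocount, the sign convention (positive means enriched in native structures) and the linear weighting are all assumptions; the abstract does not specify how the authors normalise or combine the terms.

```python
import math


def contact_log_odds(native_count: int, native_total: int,
                     random_count: int, random_total: int,
                     pseudocount: float = 1.0) -> float:
    """Log-odds of a residue-pair contact in native vs randomly compacted structures.

    The pseudocount is an assumed smoothing choice to avoid log(0).
    """
    native_freq = (native_count + pseudocount) / (native_total + pseudocount)
    random_freq = (random_count + pseudocount) / (random_total + pseudocount)
    return math.log(native_freq / random_freq)


def combined_score(contact_term: float, profile_term: float,
                   weight: float = 0.5) -> float:
    """Hypothetical linear mix of the two terms; `weight` is the contact share."""
    return weight * contact_term + (1.0 - weight) * profile_term
```

Scanning `weight` over a benchmark and keeping the value with the best recognition rate would be one simple way to look for the "optimal combination" the abstract refers to.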

15.
Zhou H  Zhou Y 《Proteins》2004,55(4):1005-1013
An elaborate knowledge-based energy function is designed for fold recognition. It is a residue-level single-body potential, so that a highly efficient dynamic programming method can be used for alignment optimization. It contains a backbone torsion term, a buried surface term, and a contact-energy term. The energy score combined with sequence profile and secondary structure information leads to an algorithm called SPARKS (Sequence, secondary structure Profiles and Residue-level Knowledge-based energy Score) for fold recognition. Compared with the popular PSI-BLAST, SPARKS is 21% more accurate in sequence-sequence alignment in the ProSup benchmark and 10%, 25%, and 20% more sensitive in detecting family, superfamily, and fold similarities in the Lindahl benchmark, respectively. Moreover, it is one of the best methods for sensitivity (the number of correctly recognized proteins), alignment accuracy (based on the MaxSub score), and specificity (the average number of correctly recognized proteins whose scores are higher than the first false positives) in LiveBench 7 among more than twenty servers of non-consensus methods. The simple algorithm used in SPARKS has the potential for further improvement. This highly efficient method can be used for fold recognition on genomic scales. A web server is established for academic users at http://theory.med.buffalo.edu.

16.
Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems, and at the same time speed up the searches for homologous proteins and improve the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequences, and a two-step PSI-BLAST (PDB-BLAST), which separates the sequence space of profile composition from the space of homology searching. The combination of these strategies improves distant homology recognition by more than 100%, while using only 10% of the CPU time of the standard PSI-BLAST search. Another method, intermediate profile searches, allows for the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.

17.
Silva PJ 《Proteins》2008,70(4):1588-1594
Hydrophobic cluster analysis (HCA) has long been used as a tool to detect distant homologies between protein sequences, and to classify them into different folds. However, it relies on expert human intervention, and is sensitive to subjective interpretations of pattern similarities. In this study, we describe a novel algorithm to assess the similarity of hydrophobic amino acid distributions between two sequences. Our algorithm correctly identifies as misattributions several HCA-based proposals of structural similarity between unrelated proteins present in the literature. We have also used this method to identify the proper fold of a large variety of sequences, and to automatically select the most appropriate structure for homology modeling of several proteins with low sequence identity to any other member of the Protein Data Bank. Automatic modeling of the target proteins based on these templates yielded structures with TM-scores (vs. experimental structures) above 0.60, even without further refinement. Besides enabling a reliable identification of the correct fold of an unknown sequence and the choice of suitable templates, our algorithm also shows that whereas most structural classes of proteins are very homogeneous in hydrophobic cluster composition, a tenth of the described families are compatible with a large variety of hydrophobic patterns. We have built a browsable database of every major representative hydrophobic cluster pattern present in each structural class of proteins, freely available at http://www2.ufp.pt/ pedros/HCA_db/index.htm.

18.
The identification of the enzymes involved in the metabolism of simple and complex carbohydrates presents one bioinformatic challenge in the post-genomic era. Here, we present the PFIT and PFRIT algorithms for identifying those proteins adopting the alpha/beta barrel fold that function as glycosidases. These algorithms are based on the observation that proteins adopting the alpha/beta barrel fold share positions in their tertiary structures having equivalent sets of atomic interactions. These are conserved tertiary interaction positions, which have been implicated in both structure and function. Glycosidases adopting the alpha/beta barrel fold share more conserved tertiary interactions than alpha/beta barrel proteins having other functions. The enrichment pattern of conserved tertiary interactions in the glycosidases is the information that PFIT and PFRIT use to predict whether any given alpha/beta barrel will function as a glycosidase or not. Using as a test set a database of 19 glycosidase and 45 nonglycosidase alpha/beta barrel proteins with low sequence similarity, PFIT and PFRIT can correctly predict glycosidase function for 84% of the proteins known to function as glycosidases. PFIT and PFRIT incorrectly predict glycosidase function for 25% of the nonglycosidases. The program PSI-BLAST can also correctly identify 84% of the 19 glycosidases; however, it incorrectly predicts glycosidase function for 50% of the nonglycosidases (twofold greater than PFIT and PFRIT). Overall, we demonstrate that the structure-based PFIT and PFRIT algorithms are more selective than the sequence-based PSI-BLAST algorithm for predicting glycosidase function, while matching its sensitivity.

19.

Background

Developing sensitive sequence search procedures for detecting distant relationships between proteins at the superfamily/fold level remains a major challenge. The intermediate sequence search approach is the most frequently employed way of identifying remote homologues effectively. In this study, serine proteases of the prolyl oligopeptidase, rhomboid and subtilisin protein families were examined using plant serine proteases from two genomes, A. thaliana and O. sativa, as queries, together with 13 other families of unrelated folds, to identify distant homologues that could not be obtained using PSI-BLAST.

Methodology/Principal Findings

We propose starting with multiple queries of classical serine protease members to identify remote homologues in families, using a rigorous approach such as Cascade PSI-BLAST. We found that classical sequence-based approaches, such as PSI-BLAST, showed very low sequence coverage in identifying plant serine proteases. The algorithm was applied to an enriched sequence database of homologous domains, and we obtained an overall average coverage of 88% at the family level and 77% at the superfamily or fold level, along with a specificity of ∼100% and a Matthews correlation coefficient of 0.91. A similar approach was also implemented on 13 other protein families representing every structural class in the SCOP database. Further investigation with statistical tests, such as jackknifing, helped us to better understand the influence of neighbouring protein families.

Conclusions/Significance

Our study suggests that employing multiple queries of a family in Cascade PSI-BLAST searches is useful for predicting distant relationships effectively, even at the superfamily level. We have proposed a generalized strategy to cover all the distant members of a particular family using multiple query sequences. Our findings reveal that prior selection of sequences as queries and the presence of neighbouring families can be important for covering the search space effectively in minimal computational time. This study also provides an understanding of the ‘bridging’ role of related families.
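The coverage, specificity and Matthews correlation coefficient quoted for this study are all derived from a confusion matrix. The sketch below uses the standard definitions; the function names are illustrative, and the study's underlying counts are not given in the abstract, so no attempt is made to reproduce its figures.

```python
import math


def coverage(tp: int, fn: int) -> float:
    """Sensitivity: fraction of true homologues that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-homologues that were correctly rejected."""
    return tn / (tn + fp)


def matthews_cc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

Unlike coverage or specificity taken alone, the Matthews coefficient stays informative when homologues are far outnumbered by non-homologues, which is typically the case in database searches.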

20.
Several fold recognition algorithms are compared to each other in terms of prediction accuracy and significance. It is shown that on standard benchmarks, hybrid methods, which combine scoring based on sequence-sequence and sequence-structure matching, surpass both sequence and threading methods in the number of accurate predictions. However, sequence similarity contributes most to the prediction accuracy. This strongly argues that most examples of apparently nonhomologous proteins with similar folds are actually related by evolution. While disappointing from the perspective of the fundamental understanding of protein folding, this adds a new significance to fold recognition methods as a possible first step in function prediction. Despite hybrid methods being more accurate at fold prediction than either the sequence or threading methods, each of the methods is correct in some cases where others have failed. This partly reflects the different perspectives on the sequence/structure relationship embedded in the various methods. To combine predictions from different methods, estimates of the significance of predictions are made for all methods. With the help of such estimates, it is possible to develop a "jury" method, which has an accuracy higher than any of the single methods. Finally, building full three-dimensional models for all top predictions helps to eliminate possible false positives where alignments that are optimal for the one-dimensional sequences lead to unresolvable steric conflicts in the full three-dimensional models.
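One plausible way to build a "jury" from per-method significance estimates is to sum the significance behind each predicted fold and pick the best-supported one. This is only a sketch under that assumption: the abstract does not state the actual combination rule, and the function name, the input layout and the significance values in the usage example are hypothetical.

```python
from collections import defaultdict


def jury_prediction(predictions: dict[str, tuple[str, float]]) -> str:
    """Return the fold with the highest summed significance across methods.

    predictions maps a method name to (predicted fold, significance estimate),
    where a larger significance means a more trustworthy prediction.
    """
    if not predictions:
        raise ValueError("at least one method prediction is required")
    support: dict[str, float] = defaultdict(float)
    for _method, (fold, significance) in predictions.items():
        support[fold] += significance
    return max(support, key=support.get)
```

For example, `jury_prediction({"sequence": ("TIM", 0.6), "threading": ("globin", 0.5), "hybrid": ("globin", 0.4)})` selects "globin", since two moderately confident methods together outweigh one more confident dissenter.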
