Similar articles
15 similar articles found
1.
In this study, we show that it is possible to improve on the performance of PSI-BLAST by using evolutionary information for both query and target sequences. This information can be used in three different ways: by sequence linking, by profile-profile alignments, and by combining sequence-profile and profile-sequence searches. If only PSI-BLAST is used, 16% of superfamily-related protein domains can be detected at 90% specificity. If a sequence-profile and a profile-sequence search are combined, this increases to 20%; profile-profile searches detect 19%, whereas a linking procedure identifies 22% of these proteins. All three methods show equal performance, but the best combination of speed and accuracy seems to be obtained by the combined searches, because this method shows good performance even at high specificity and has the lowest computational cost. In addition, we show that the E-values reported by all these methods, including PSI-BLAST, underestimate the true rate of false positives. This behavior is seen even if a very strict E-value cutoff and a limited number of iterations are used. However, the difference is more pronounced with a looser E-value cutoff and more iterations.

2.
Ohlson T, Wallner B, Elofsson A. Proteins 2004, 57(1):188-197
To improve the detection of related proteins, it is often useful to include evolutionary information for both the query and target proteins. One method to include this information is by the use of profile-profile alignments, where a profile from the query protein is compared with the profiles from the target proteins. Profile-profile alignments can be implemented in several fundamentally different ways. The similarity between two positions can be calculated using a dot-product, a probabilistic model, or an information theoretical measure. Here, we present a large-scale comparison of different profile-profile alignment methods. We show that the profile-profile methods perform at least 30% better than standard sequence-profile methods both in their ability to recognize superfamily-related proteins and in the quality of the obtained alignments. Although the performance of all methods is quite similar, profile-profile methods that use a probabilistic scoring function have an advantage as they can create good alignments and show a good fold recognition capacity using the same gap-penalties, while the other methods need to use different parameters to obtain comparable performances.
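
As an illustration of the simplest of the three scoring schemes named in this abstract, the dot-product similarity between two profile positions can be sketched as follows. This is a minimal sketch, not the scoring function used in the paper: the function names, the plain frequency estimate without pseudocounts, and the 20-letter alphabet ordering are all assumptions made for the example.

```python
# Hypothetical illustration: dot-product similarity between two profile columns.
# Each column is given as a string of residues observed at one alignment position.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def column_frequencies(column: str) -> list[float]:
    """Relative frequency of each amino acid in one column (gaps and unknowns ignored)."""
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dot_product_score(p: list[float], q: list[float]) -> float:
    """Dot product of two frequency vectors: high when both columns favor the same residues."""
    return sum(a * b for a, b in zip(p, q))


if __name__ == "__main__":
    conserved = column_frequencies("AAAA")
    mixed = column_frequencies("AACC")
    print(dot_product_score(conserved, mixed))
```

A score computed this way would fill the substitution-score cell for a pair of positions in an otherwise ordinary alignment recursion; real implementations typically add pseudocounts and sequence weighting, which are omitted here.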

3.
Several recent publications illustrated advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ in various methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined and it is shown that they can be identified based on the model quality.

4.
Two new sets of scoring matrices are introduced: H2 for the protein sequence comparison and T2 for the protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k-residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences was aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested: the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.

5.
The detection of remote homolog pairs of proteins using computational methods is a pivotal problem in structural bioinformatics, aiming to compute protein folds on the basis of information in the database of known structures. In the last 25 years, several methods have been developed to tackle this problem, based on different approaches including sequence-sequence alignments and/or structure comparison. In this article, we will briefly discuss When, Why, Where and How (WWWH) to perform remote homology search, reviewing some of the most widely adopted computational approaches. The specific aim is to highlight the basic criteria implemented by different research groups and to comment on the state of the art as well as on still-open questions.

6.
The tryptophan rich basic protein/calcium signal‐modulating cyclophilin ligand (WRB/CAML) and Get1p/Get2p complexes, in vertebrates and yeast, respectively, mediate the final step of tail‐anchored protein insertion into the endoplasmic reticulum membrane via the Get pathway. While WRB appears to exist in all eukaryotes, CAML homologs were previously recognized only among chordates, raising the question as to how CAML's function is performed in other phyla. Furthermore, whereas WRB was recognized as the metazoan homolog of Get1, CAML and Get2, although functionally equivalent, were not considered to be homologous. CAML contains an N‐terminal basic, TRC40/Get3‐interacting, region, three transmembrane segments near the C‐terminus, and a poorly conserved region between these domains. Here, I searched the NCBI protein database for remote CAML homologs in all eukaryotes, using position‐specific iterated‐basic local alignment search tool, with the C‐terminal, the N‐terminal or the full‐length sequence of human CAML as query. The N‐terminal basic region and full‐length CAML retrieved homologs among metazoa, plants and fungi. In the latter group several hits were annotated as GET2. The C‐terminal query did not return entries outside of the animal kingdom, but did retrieve over one hundred invertebrate metazoan CAML‐like proteins, which all conserved the N‐terminal TRC40‐binding domain. The results indicate that CAML homologs exist throughout the eukaryotic domain of life, and suggest that metazoan CAML and yeast GET2 share a common evolutionary origin. They further reveal a tight link between the particular features of the metazoan membrane‐anchoring domain and the TRC40‐interacting region. The list of sequences presented here should provide a useful resource for future studies addressing structure‐function relationships in CAML proteins.

7.
Elofsson A. Proteins 2002, 46(3):330-339
One of the most central methods in bioinformatics is the alignment of two protein or DNA sequences. However, so far large-scale benchmarks examining the quality of these alignments are scarce. On the other hand, several recent large-scale studies of the capacity of different methods to identify related sequences have led to new insights about the performance of fold recognition methods. To increase our understanding of fold recognition methods, we present a large-scale benchmark of alignment quality. We compare alignments from several different alignment methods, including sequence alignments, hidden Markov models, PSI-BLAST, CLUSTALW, and threading methods. For most methods, the alignment quality increases significantly at about 20% sequence identity. The difference in alignment quality between different methods is quite small, and the main difference can be seen at the exact positioning of the sharp rise in alignment quality, that is, around 15-20% sequence identity. The alignments are improved by using structural information. In general, the best alignments are obtained by methods that use predicted secondary structure information and sequence profiles obtained from PSI-BLAST. One interesting observation is that for different pairs many different methods create the best alignments. This finding implies that if a method that could select the best alignment method for each pair existed, a significant improvement of the alignment quality could be gained.

8.
Chen H, Kihara D. Proteins 2011, 79(1):315-334
Computational protein structure prediction remains a challenging task in protein bioinformatics. In recent years, the importance of template-based structure prediction has been increasing because of the growing number of protein structures solved by the structural genomics projects. To capitalize on the significant efforts and investments made in the structural genomics projects, it is urgent to establish effective ways to use the solved structures as templates by developing methods for exploiting remotely related proteins that cannot be simply identified by homology. In this work, we examine the effect of using suboptimal alignments in template-based protein structure prediction. We show that suboptimal alignments are often more accurate than the optimal one, and such accurate suboptimal alignments can occur even at a very low rank of the alignment score. Suboptimal alignments contain a significant number of correct amino acid residue contacts. Moreover, suboptimal alignments can improve template-based models when used as input to Modeller. Finally, we use suboptimal alignments for handling a contact potential in a probabilistic way in a threading program, SUPRB. The probabilistic contacts strategy outperforms the partly thawed approach, which only uses the optimal alignment in defining residue contacts, and also the re-ranking strategy, which uses the contact potential in re-ranking alignments. The comparison with existing methods in the template-recognition test shows that SUPRB is very competitive and outperforms existing methods.

9.
Guo J, Lin Y, Liu X. Proteomics 2006, 6(19):5099-5105
This paper proposes a new integrative system, GNBSL (Gram-negative bacteria subcellular localization), for subcellular localization specialized for Gram-negative bacterial proteins. First, the system generates a position-specific frequency matrix (PSFM) and a position-specific scoring matrix (PSSM) for each protein sequence by searching the Swiss-Prot database. Then different features are extracted by four modules from the PSFM and the PSSM. The features include whole-sequence amino acid composition, N- and C-terminus amino acid composition, dipeptide composition, and segment composition. Four probabilistic neural network (PNN) classifiers are used to classify these modules. To further improve the performance, two modules trained by support vector machine (SVM) are added to the system. One module extracts the residue-couple distribution from the amino acid sequence and the other module applies a pairwise profile alignment kernel to measure the local similarity between every two sequences. Finally, an additional SVM is used to fuse the outputs from the six modules. Tests on a benchmark dataset show that the overall success rate of GNBSL is higher than those of PSORT-B, CELLO, and PSLpred. The GNBSL web server is available at http://166.111.24.5/webtools/GNBSL/index.htm.
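
Two of the feature types listed in this abstract, amino acid composition and dipeptide composition, can be sketched in a few lines. This is a simplified illustration only: the abstract states that GNBSL derives its features from the PSFM and PSSM, whereas the sketch below works directly on the raw sequence, and the function names are hypothetical.

```python
# Hypothetical illustration: composition features computed from a raw sequence.
# (GNBSL itself is described as extracting features from a PSFM and a PSSM.)
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> dict[str, float]:
    """Fraction of the sequence made up by each of the 20 amino acids."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq.upper())
    return {a: counts[a] / len(seq) for a in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> dict[str, float]:
    """Fraction of adjacent residue pairs for each of the 400 possible dipeptides."""
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {a + b: pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


if __name__ == "__main__":
    print(amino_acid_composition("AAC")["A"])
    print(dipeptide_composition("AAC")["AC"])
```

N- and C-terminus composition could be obtained by calling `amino_acid_composition` on a prefix or suffix slice of the sequence; the slice length would be a free parameter not given in the abstract.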

10.
Koike R, Kinoshita K, Kidera A. Proteins 2007, 66(3):655-663
Dynamic programming (DP) and its heuristic algorithms are the most fundamental methods for similarity searches of amino acid sequences. Their detection power has been improved by including supplemental information, such as homologous sequences in the profile method. Here, we describe a method, probabilistic alignment (PA), that gives improved detection power but, like the original DP, uses only a pair of amino acid sequences. Receiver operating characteristic (ROC) analysis demonstrated that the PA method is far superior to BLAST, and that its sensitivity and selectivity approach those of PSI-BLAST. Particularly for orphan proteins having few homologues in the database, PA exhibits much better performance than PSI-BLAST. On the basis of this observation, we applied the PA method to a homology search of two orphan proteins, Latexin and Resuscitation-promoting factor domain. Their molecular functions have been described based on structural similarities, but sequence homologues have not been identified by PSI-BLAST. PA successfully detected sequence homologues for the two proteins and confirmed that the observed structural similarities are the result of an evolutionary relationship.
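
For readers unfamiliar with the baseline that this abstract starts from, the pairwise dynamic programming recursion for local alignment (Smith-Waterman) can be sketched as below. This is the classical DP, not the probabilistic alignment (PA) method of the paper; the match, mismatch, and linear gap values are arbitrary assumptions chosen for the example, where a real search would use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
# Hypothetical illustration: Smith-Waterman local alignment score with a
# simple match/mismatch scheme and a linear gap penalty (assumed values).


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous residue of a
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

Keeping only two rows is enough when just the score is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.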

11.
We have modified and improved the GOR algorithm for protein secondary structure prediction by using the evolutionary information provided by multiple sequence alignments, adding triplet statistics, and optimizing various parameters. We have expanded the database used to include the 513 non-redundant domains collected recently by Cuff and Barton (Proteins 1999;34:508-519; Proteins 2000;40:502-511). We have introduced a variable size window that allowed us to include sequences as short as 20-30 residues. A significant improvement over the previous versions of the GOR algorithm was obtained by combining the PSI-BLAST multiple sequence alignments with the GOR method. The new algorithm will form the basis for the future GOR V release on an online prediction server. The average accuracy of the prediction of secondary structure with multiple sequence alignment and full jack-knife procedure was 73.5%. The accuracy of the prediction increases to 74.2% by limiting the prediction to 375 (of 513) sequences having at least 50 PSI-BLAST alignments. The average accuracy of the prediction of the new improved program without using multiple sequence alignments was 67.5%. This is approximately a 3% improvement over the preceding GOR IV algorithm (Garnier J, Gibrat JF, Robson B. Methods Enzymol 1996;266:540-553; Kloczkowski A, Ting K-L, Jernigan RL, Garnier J. Polymer 2002;43:441-449). We have discussed alternatives to the segment overlap (Sov) coefficient proposed by Zemla et al. (Proteins 1999;34:220-223).

12.
A novel method has been developed for acquiring the correct alignment of a query sequence against remotely homologous proteins by extracting structural information from profiles of multiple structure alignment. A systematic search algorithm combined with a group of score functions based on sequence information and structural information has been introduced in this procedure. A limited number of top solutions (15,000) with high scores were selected as candidates for further examination. On a test-set comprising 301 proteins from 75 protein families with sequence identity less than 30%, the proportion of proteins with completely correct alignment as first candidate was improved to 39.8% by our method, whereas the typical performance of existing sequence-based alignment methods was only between 16.1% and 22.7%. Furthermore, multiple candidates for possible alignment were provided in our approach, which dramatically increased the possibility of finding correct alignment, such that completely correct alignments were found amongst the top-ranked 1000 candidates in 88.3% of the proteins. With the assistance of a sequence database, completely correct alignment solutions were achieved amongst the top 1000 candidates in 94.3% of the proteins. From such a limited number of candidates, it would become possible to identify more correct alignments using a more time-consuming but more powerful method with more detailed structural information, such as side-chain packing and energy minimization. The results indicate that the novel alignment strategy could be helpful for extending the application of highly reliable methods for fold identification and homology modeling to a huge number of homologous proteins of low sequence similarity. Details of the methods, together with the results and implications for future development, are presented.

13.
Structural and functional annotation of the large and growing database of genomic sequences is a major problem in modern biology. Protein structure prediction by detecting remote homology to known structures is a well-established and successful annotation technique. However, the broad spectrum of evolutionary change that accompanies the divergence of close homologues to become remote homologues cannot easily be captured with a single algorithm. Recent advances to tackle this problem have involved the use of multiple predictive algorithms available on the Internet. Here we demonstrate how such ensembles of predictors can be designed in-house under controlled conditions and permit significant improvements in recognition by using a concept taken from protein loop energetics and applying it to the general problem of 3D clustering. We have developed a stringent test that simulates the situation where a protein sequence of interest is submitted to multiple different algorithms and not one of these algorithms can make a confident (95%) correct assignment. A method of meta-server prediction (Phyre) that exploits the benefits of a controlled environment for the component methods was implemented. At 95% precision or higher, Phyre identified 64.0% of all correct homologous query-template relationships, and 84.0% of the individual test query proteins could be accurately annotated. In comparison to the improvement that the single best fold recognition algorithm (according to training) has over PSI-BLAST, this represents a 29.6% increase in the number of correct homologous query-template relationships, and a 46.2% increase in the number of accurately annotated queries. It has been well recognised in fold prediction, other bioinformatics applications, and in many other areas, that ensemble predictions generally are superior in accuracy to any of the component individual methods. However, there is a paucity of information as to why the ensemble methods are superior, and indeed this has never been systematically addressed in fold recognition. Here we show that the source of ensemble power stems from noise reduction in filtering out false positive matches. The results indicate greater coverage of sequence space and improved model quality, which can consequently lead to a reduction in the experimental workload of structural genomics initiatives.

14.
R.MvaI is a Type II restriction enzyme (REase), which specifically recognizes the pentanucleotide DNA sequence 5'-CCWGG-3' (W indicates A or T). It belongs to a family of enzymes, which recognize related sequences, including 5'-CCSGG-3' (S indicates G or C) in the case of R.BcnI, or 5'-CCNGG-3' (where N indicates any nucleoside) in the case of R.ScrFI. REases from this family hydrolyze the phosphodiester bond in the DNA between the 2nd and 3rd base in both strands, thereby generating a double strand break with 5'-protruding single nucleotides. So far, no crystal structures of REases with similar cleavage patterns have been solved. Characterization of sequence-structure-function relationships in this family would facilitate understanding of evolution of sequence specificity among REases and could aid in engineering of enzymes with new specificities. However, sequences of R.MvaI or its homologs show no significant similarity to any proteins with known structures, thus precluding straightforward comparative modeling. We used a fold recognition approach to identify a remote relationship between R.MvaI and the structure of DNA repair enzyme MutH, which belongs to the PD-(D/E)XK superfamily together with many other REases. We constructed a homology model of R.MvaI and used it to predict functionally important amino acid residues and the mode of interaction with the DNA. In particular, we predict that only one active site of R.MvaI interacts with the DNA target at a time, and the cleavage of both strands (5'-CCAGG-3' and 5'-CCTGG-3') is achieved by two independent catalytic events. The model is in good agreement with the available experimental data and will serve as a template for further analyses of R.MvaI, R.BcnI, R.ScrFI and other related enzymes.
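
The recognition sequences quoted in this abstract use IUPAC degenerate codes (W = A or T, S = G or C, N = any base), and the cut is described as falling between the 2nd and 3rd base. A small sketch of how such sites could be located in a DNA string follows; the function names are hypothetical, only the codes mentioned in the abstract are supported, and only the top strand is scanned.

```python
# Hypothetical illustration: locating degenerate recognition sites such as
# CCWGG (R.MvaI), CCSGG (R.BcnI) and CCNGG (R.ScrFI) on the top strand.
import re

# Only the IUPAC codes that appear in the abstract are mapped here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "S": "[CG]", "N": "[ACGT]"}


def find_sites(dna: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of `site` in `dna`."""
    pattern = "".join(IUPAC[base] for base in site.upper())
    # A lookahead is used so that overlapping occurrences would also be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", dna.upper())]


def cut_positions(dna: str, site: str, offset: int = 2) -> list[int]:
    """Index of the first base after each cut; offset 2 = between the 2nd and 3rd base."""
    return [start + offset for start in find_sites(dna, site)]


if __name__ == "__main__":
    dna = "AACCAGGTTCCTGGA"
    print(find_sites(dna, "CCWGG"))
    print(cut_positions(dna, "CCWGG"))
```

Because CCAGG and CCTGG are reverse complements of each other, a site found on the top strand is also a CCWGG site on the bottom strand, which is consistent with the abstract's description of both strands being cut.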

15.