Found 20 similar articles (search time: 0 ms)
1.
In this study, we show that it is possible to increase the performance over PSI-BLAST by using evolutionary information for both query and target sequences. This information can be used in three different ways: by sequence linking, by profile-profile alignments, and by combining sequence-profile and profile-sequence searches. If only PSI-BLAST is used, 16% of superfamily-related protein domains can be detected at 90% specificity. If a sequence-profile and a profile-sequence search are combined, this increases to 20%; profile-profile searches detect 19%, whereas a linking procedure identifies 22% of these proteins. All three methods show equal performance, but the best combination of speed and accuracy seems to be obtained by the combined searches, because this method shows good performance even at high specificity and has the lowest computational cost. In addition, we show that the E-values reported by all these methods, including PSI-BLAST, underestimate the true rate of false positives. This behavior is seen even if a very strict E-value cutoff and a limited number of iterations are used. However, the difference is more pronounced with a looser E-value cutoff and more iterations.
2.
To improve the detection of related proteins, it is often useful to include evolutionary information for both the query and target proteins. One method to include this information is by the use of profile-profile alignments, where a profile from the query protein is compared with the profiles from the target proteins. Profile-profile alignments can be implemented in several fundamentally different ways. The similarity between two positions can be calculated using a dot-product, a probabilistic model, or an information theoretical measure. Here, we present a large-scale comparison of different profile-profile alignment methods. We show that the profile-profile methods perform at least 30% better than standard sequence-profile methods both in their ability to recognize superfamily-related proteins and in the quality of the obtained alignments. Although the performance of all methods is quite similar, profile-profile methods that use a probabilistic scoring function have an advantage as they can create good alignments and show a good fold recognition capacity using the same gap-penalties, while the other methods need to use different parameters to obtain comparable performances.
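The dot-product option mentioned in this abstract can be illustrated with a minimal sketch. It is not the scoring function of any method compared in the study: the function name `dot_product_score` and the toy four-letter alphabet are assumptions chosen for illustration, and a real profile column would hold 20 amino acid frequencies.

```python
from typing import Sequence


def dot_product_score(p: Sequence[float], q: Sequence[float]) -> float:
    """Score two profile columns as the dot product of their frequency vectors.

    Hypothetical helper for illustration; p and q are residue frequencies
    at one position of the query profile and one position of the target profile.
    """
    if len(p) != len(q):
        raise ValueError("profile columns must have the same length")
    return sum(a * b for a, b in zip(p, q))


# Toy columns over an assumed 4-letter alphabet (real profiles use 20 letters).
col_a = [0.7, 0.1, 0.1, 0.1]  # strongly prefers letter 0
col_b = [0.6, 0.2, 0.1, 0.1]  # similar preference
col_c = [0.1, 0.1, 0.1, 0.7]  # prefers a different letter

similar = dot_product_score(col_a, col_b)
dissimilar = dot_product_score(col_a, col_c)
```

Columns with similar residue preferences score higher than columns with different preferences, which is the property a profile-profile alignment uses when matching positions.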
3.
Here we present a simplified form of threading that uses only a 20 x 20 two-body residue-based potential and a restricted number of gaps. Despite its simplicity and transparency, the Monte Carlo-based threading algorithm performs very well in a rigorous test of fold recognition. The results suggest that by simplifying and constraining the decoy space, one can achieve better fold recognition. Fold recognition results are compared with and supplemented by a PSI-BLAST search. The statistical significance of threading results is rigorously evaluated from statistics of extremes by comparison with optimal alignments of a large set of randomly shuffled sequences. The statistical theory, based on the Random Energy Model, yields a cumulative statistical parameter, epsilon, that attests to the likelihood of correct fold recognition. A large epsilon indicates a significant energy gap between the optimal alignment and decoy alignments and, consequently, a high probability that the fold is correctly recognized. For a particular number of gaps, the epsilon parameter reaches its maximal value, and the fold is recognized. As the number of gaps further increases, the likelihood of correct fold recognition drops off. This is because the decoy space is small when gaps are restricted to a small number, but the native alignment is still well approximated, whereas an unrestricted increase in the number of gaps leads to rapid growth of the number of decoys and their statistical dominance over the correct alignment. It is shown that the best results are obtained when a combination of one-, two-, and three-gap threading is used. To this end, use of the epsilon parameter is crucial for rigorous comparison of results across the different decoy spaces belonging to different numbers of gaps.
4.
Rychlewski L Jaroszewski L Li W Godzik A 《Protein science : a publication of the Protein Society》2000,9(2):232-241
Distant homologies between proteins are often discovered only after three-dimensional structures of both proteins are solved. The sequence divergence for such proteins can be so large that simple comparison of their sequences fails to identify any similarity. A new generation of sensitive alignment tools uses averaged sequences of entire homologous families (profiles) to detect such homologies. Several algorithms, including the newest generation of BLAST algorithms and BASIC, an algorithm used in our group to assign fold predictions for proteins from several genomes, are compared to each other on a large set of structurally similar proteins with little sequence similarity. Proteins in the benchmark are classified according to the level of their similarity, which allows us to demonstrate that most of the improvement of the new algorithms is achieved for proteins with strong functional similarities, with almost no progress in recognizing distant fold similarities. It is also shown that details of profile calculation strongly influence its sensitivity in recognizing distant homologies. The most important choice is how to include information from diverging members of the family, avoiding generating false predictions, while accounting for entire sequence divergence within a family. PSI-BLAST takes a conservative approach, deriving a profile from core members of the family, providing a solid improvement with almost no false predictions. BASIC strives for better sensitivity by increasing the weight of divergent family members, paying the price in lower reliability. A new FFAS algorithm introduced here uses a new procedure for profile generation that takes into account all the relations within the family and matches BASIC sensitivity with PSI-BLAST-like reliability.
5.
GNBSL: a new integrative system to predict the subcellular location for Gram-negative bacteria proteins (total citations: 4; self-citations: 0; citations by others: 4)
This paper proposes a new integrative system (GNBSL, Gram-negative bacteria subcellular localization) for subcellular localization specialized for Gram-negative bacteria proteins. First, the system generates a position-specific frequency matrix (PSFM) and a position-specific scoring matrix (PSSM) for each protein sequence by searching the Swiss-Prot database. Then different features are extracted by four modules from the PSFM and the PSSM. The features include whole-sequence amino acid composition, N- and C-terminus amino acid composition, dipeptide composition, and segment composition. Four probabilistic neural network (PNN) classifiers are used to classify these modules. To further improve the performance, two modules trained by support vector machine (SVM) are added to this system. One module extracts the residue-couple distribution from the amino acid sequence and the other module applies a pairwise profile alignment kernel to measure the local similarity between every two sequences. Finally, an additional SVM is used to fuse the outputs from the six modules. Tests on a benchmark dataset show that the overall success rate of GNBSL is higher than those of PSORT-B, CELLO, and PSLpred. The GNBSL web server is available at http://166.111.24.5/webtools/GNBSL/index.htm.
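Two of the feature types named in this abstract, amino acid composition and dipeptide composition, can be sketched as follows. This is a simplified illustration under stated assumptions: GNBSL derives its features from the PSFM and PSSM, whereas this sketch counts residues in the raw sequence, and the function names `aa_composition` and `dipeptide_composition` are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq: str) -> dict:
    """Fraction of each standard amino acid in the sequence (20 features)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> dict:
    """Fraction of each ordered residue pair among adjacent pairs (400 features)."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1  # number of adjacent pairs
    return {a + b: pairs.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)}


# Toy sequence, not a real protein.
comp = aa_composition("ACAC")
dipep = dipeptide_composition("ACAC")
```

The N- and C-terminus compositions described in the abstract could be obtained the same way by passing only a terminal slice of the sequence; the slice length would be a further assumption.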
6.
Protein sequences containing more than one structural domain are problematic when used in homology searches, where they can either stop an iterative database search prematurely or cause an explosion of a search to common domains. We describe a method, DOMAINATION, that infers domains and their boundaries in a query sequence from local gapped alignments generated using PSI-BLAST. Through a new technique to recognize domain insertions and permutations, DOMAINATION submits delineated domains as successive database queries in further iterative steps. Assessed over a set of 452 multidomain proteins, the method predicts structural domain boundaries with an overall accuracy of 50% and improves finding distant homologies by 14% compared with PSI-BLAST. DOMAINATION is available as a web-based tool at http://mathbio.nimr.mrc.ac.uk, and the source code is available from the authors upon request.
7.
The detection of remote homolog pairs of proteins using computational methods is a pivotal problem in structural bioinformatics, aiming to compute protein folds on the basis of information in the database of known structures. In the last 25 years, several methods have been developed to tackle this problem, based on different approaches including sequence-sequence alignments and/or structure comparison. In this article, we briefly discuss When, Why, Where and How (WWWH) to perform remote homology search, reviewing some of the most widely adopted computational approaches. The specific aim is to highlight the basic criteria implemented by different research groups and to comment on the state of the art as well as on still-open questions.
8.
We have modified and improved the GOR algorithm for protein secondary structure prediction by using the evolutionary information provided by multiple sequence alignments, adding triplet statistics, and optimizing various parameters. We have expanded the database used to include the 513 non-redundant domains collected recently by Cuff and Barton (Proteins 1999;34:508-519; Proteins 2000;40:502-511). We have introduced a variable size window that allowed us to include sequences as short as 20-30 residues. A significant improvement over the previous versions of the GOR algorithm was obtained by combining the PSI-BLAST multiple sequence alignments with the GOR method. The new algorithm will form the basis for the future GOR V release on an online prediction server. The average accuracy of the prediction of secondary structure with multiple sequence alignment and full jack-knife procedure was 73.5%. The accuracy of the prediction increases to 74.2% by limiting the prediction to 375 (of 513) sequences having at least 50 PSI-BLAST alignments. The average accuracy of the prediction of the new improved program without using multiple sequence alignments was 67.5%. This is approximately a 3% improvement over the preceding GOR IV algorithm (Garnier J, Gibrat JF, Robson B. Methods Enzymol 1996;266:540-553; Kloczkowski A, Ting K-L, Jernigan RL, Garnier J. Polymer 2002;43:441-449). We have discussed alternatives to the segment overlap (Sov) coefficient proposed by Zemla et al. (Proteins 1999;34:220-223).
9.
The ultimate goal of structural genomics is to obtain the structure of each protein coded by each gene within a genome to determine gene function. Because of cost and time limitations, it remains impractical to solve the structure for every gene product experimentally. Up to a point, reasonably accurate three‐dimensional structures can be deduced for proteins with homologous sequences by using comparative modeling. Beyond this, fold recognition or threading methods can be used for proteins showing little homology to any known fold, although this is relatively time‐consuming and limited by the library of template folds currently available. Therefore, it is appropriate to develop methods that can increase our knowledge base, expanding our fold libraries by earmarking potentially “novel” folds for experimental structure determination. How can we sift through proteomic data rapidly and yet reliably identify novel folds as targets for structural genomics? We have analyzed a number of simple methods that discriminate between “novel” and “known” folds. We propose that simple alignments of secondary structure elements using predicted secondary structure could potentially be a more selective method than both a simple fold recognition method (GenTHREADER) and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures. Proteins 2002;48:44–52. © 2002 Wiley‐Liss, Inc.
10.
Two new sets of scoring matrices are introduced: H2 for protein sequence comparison and T2 for protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences was aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested: the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.
11.
Multiple sequence alignments are a routine tool in protein fold recognition, but multiple structure alignments are computationally less cooperative. This work describes a method for protein sequence threading and sequence-to-structure alignments that uses multiple aligned structures, the aim being to improve models from protein threading calculations. Sequences are aligned into a field due to corresponding sites in homologous proteins. On the basis of a test set of more than 570 protein pairs, the procedure does improve alignment quality, although no more than averaging over sequences. For the force field tested, the benefit of structure averaging is smaller than that of adding sequence similarity terms or a contribution from secondary structure predictions. Although there is a significant improvement in the quality of sequence-to-structure alignments, this does not directly translate to an immediate improvement in fold recognition capability.
12.
Structural and functional annotation of the large and growing database of genomic sequences is a major problem in modern biology. Protein structure prediction by detecting remote homology to known structures is a well-established and successful annotation technique. However, the broad spectrum of evolutionary change that accompanies the divergence of close homologues to become remote homologues cannot easily be captured with a single algorithm. Recent advances to tackle this problem have involved the use of multiple predictive algorithms available on the Internet. Here we demonstrate how such ensembles of predictors can be designed in-house under controlled conditions and permit significant improvements in recognition by using a concept taken from protein loop energetics and applying it to the general problem of 3D clustering. We have developed a stringent test that simulates the situation where a protein sequence of interest is submitted to multiple different algorithms and not one of these algorithms can make a confident (95%) correct assignment. A method of meta-server prediction (Phyre) that exploits the benefits of a controlled environment for the component methods was implemented. At 95% precision or higher, Phyre identified 64.0% of all correct homologous query-template relationships, and 84.0% of the individual test query proteins could be accurately annotated. In comparison to the improvement that the single best fold recognition algorithm (according to training) has over PSI-Blast, this represents a 29.6% increase in the number of correct homologous query-template relationships, and a 46.2% increase in the number of accurately annotated queries. It has been well recognised in fold prediction, other bioinformatics applications, and in many other areas, that ensemble predictions generally are superior in accuracy to any of the component individual methods. 
However, there is a paucity of information as to why the ensemble methods are superior, and indeed this has never been systematically addressed in fold recognition. Here we show that the source of ensemble power stems from noise reduction in filtering out false positive matches. The results indicate greater coverage of sequence space and improved model quality, which can consequently lead to a reduction in the experimental workload of structural genomics initiatives.
13.
Sequence- and structure-based searching strategies have proven useful in the identification of remote homologs and have facilitated both structural and functional predictions of many uncharacterized protein families. We implement these strategies to predict the structure of and to classify a previously uncharacterized cluster of orthologs (COG3019) in the thioredoxin-like fold superfamily. The results of each searching method indicate that thioltransferases are the closest structural family to COG3019. We substantiate this conclusion using the ab initio structure prediction method rosetta, which generates a thioredoxin-like fold similar to that of the glutaredoxin-like thioltransferase (NrdH) for a COG3019 target sequence. This structural model contains the thiol-redox functional motif CYS-X-X-CYS in close proximity to other absolutely conserved COG3019 residues, defining a novel thioredoxin-like active site that potentially binds metal ions. Finally, the rosetta-derived model structure assists us in assembling a global multiple-sequence alignment of COG3019 with two other thioredoxin-like fold families, the thioltransferases and the bacterial arsenate reductases (ArsC).
14.
In the past few years, a new generation of fold recognition methods has been developed, in which the classical sequence information is combined with information obtained from secondary structure and, sometimes, accessibility predictions. The results are promising, indicating that this approach may compete with potential-based methods (Rost B et al., 1997, J Mol Biol 270:471-480). Here we present a systematic study of the different factors contributing to the performance of these methods, in particular when applied to the problem of fold recognition of remote homologues. Our results indicate that secondary structure and accessibility prediction methods have reached an accuracy level where they are not the major factor limiting the accuracy of fold recognition. The pattern degeneracy problem is confirmed as the major source of error of these methods. On the basis of these results, we study three different options to overcome these limitations: normalization schemes, mapping of the coil state into the different zones of the Ramachandran plot, and post-threading graphical analysis.
15.
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets, and their performance was assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may help increase the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs.
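The idea of a reduced alphabet can be sketched as a simple residue-to-group mapping. The six-group partition below is a hypothetical physicochemical grouping chosen only for illustration; it is not one of the alphabets evaluated in the study, and the name `reduce_sequence` is likewise an assumption.

```python
# Hypothetical 6-group partition of the 20 standard residues (illustrative only).
GROUPS = [
    "AVLIMC",  # aliphatic / hydrophobic
    "FWYH",    # aromatic
    "STNQ",    # polar uncharged
    "KR",      # positively charged
    "DE",      # negatively charged
    "GP",      # conformationally special
]

# Map every residue to the first letter of its group, used as the representative.
RESIDUE_TO_GROUP = {res: group[0] for group in GROUPS for res in group}


def reduce_sequence(seq: str, unknown: str = "X") -> str:
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard letters are replaced by `unknown`.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, unknown) for res in seq.upper())


reduced = reduce_sequence("MKDGWSB")
```

Substitution matrices or statistical potentials built on such an alphabet would have one row and column per group rather than per residue, so a six-group alphabet shrinks a 20 x 20 matrix to 6 x 6.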
16.
Recognizing structural similarity without significant sequence identity has proved to be a challenging task. Sequence-based and structure-based methods as well as their combinations have been developed. Here, we propose a fold-recognition method that incorporates structural information without the need for sequence-to-structure threading. This is accomplished by generating sequence profiles from protein structural fragments. The structure-derived sequence profiles allow a simple integration with evolution-derived sequence profiles and secondary-structural information for an optimized alignment by efficient dynamic programming. The resulting method (called SP(3)) is found to make a statistically significant improvement in both sensitivity of fold recognition and accuracy of alignment over the method based on evolution-derived sequence profiles alone (SP) and the method based on evolution-derived sequence profile and secondary structure profile (SP(2)). SP(3) was tested in the SALIGN benchmark for alignment accuracy and in the Lindahl, PROSPECTOR 3.0, and LiveBench 8.0 benchmarks for remote-homology detection and model accuracy. SP(3) is found to be the most sensitive and accurate single-method server in all benchmarks tested where other methods are available for comparison (although its results are statistically indistinguishable from the next best in some cases, and the comparison is subject to the limitation of the time-dependent sequence and/or structural library used by different methods). In LiveBench 8.0, its accuracy rivals some of the consensus methods such as ShotGun-INBGU, Pmodeller3, Pcons4, and ROBETTA. The SP(3) fold-recognition server is available at http://theory.med.buffalo.edu.
17.
Nica Borgese 《Traffic (Copenhagen, Denmark)》2020,21(10):647-658
The tryptophan rich basic protein/calcium signal‐modulating cyclophilin ligand (WRB/CAML) and Get1p/Get2p complexes, in vertebrates and yeast, respectively, mediate the final step of tail‐anchored protein insertion into the endoplasmic reticulum membrane via the Get pathway. While WRB appears to exist in all eukaryotes, CAML homologs were previously recognized only among chordates, raising the question as to how CAML's function is performed in other phyla. Furthermore, whereas WRB was recognized as the metazoan homolog of Get1, CAML and Get2, although functionally equivalent, were not considered to be homologous. CAML contains an N‐terminal basic, TRC40/Get3‐interacting, region, three transmembrane segments near the C‐terminus, and a poorly conserved region between these domains. Here, I searched the NCBI protein database for remote CAML homologs in all eukaryotes, using the position‐specific iterated basic local alignment search tool (PSI‐BLAST), with the C‐terminal, the N‐terminal or the full‐length sequence of human CAML as query. The N‐terminal basic region and full‐length CAML retrieved homologs among metazoa, plants and fungi. In the latter group several hits were annotated as GET2. The C‐terminal query did not return entries outside of the animal kingdom, but did retrieve over one hundred invertebrate metazoan CAML‐like proteins, which all conserved the N‐terminal TRC40‐binding domain. The results indicate that CAML homologs exist throughout the eukaryotic domain of life, and suggest that metazoan CAML and yeast GET2 share a common evolutionary origin. They further reveal a tight link between the particular features of the metazoan membrane‐anchoring domain and the TRC40‐interacting region. The list of sequences presented here should provide a useful resource for future studies addressing structure‐function relationships in CAML proteins.
18.
《Saudi Journal of Biological Sciences》2020,27(9):2207-2214
Glyphosate is a commonly used organophosphate herbicide that has an adverse impact on humans, mammals and soil microbial ecosystems. The excessive use of glyphosate to control weed growth pollutes the soil environment with this chemical. The discharge of glyphosate in agricultural drainage can also cause serious environmental damage and water pollution problems. Therefore, it is important to develop methods for enhancing glyphosate degradation in the soil through bioremediation. In this study, thirty bacterial isolates were selected from an agro-industrial zone located in Sadat City of Monufia Governorate, Egypt. The isolates were able to grow in LB medium supplemented with 7.2 mg/ml glyphosate. Only ten isolates were able to grow in a medium containing different concentrations of glyphosate (50, 100, 150, 200 and 250 mg/ml). The FACU3 bacterial isolate showed the highest CFU at the different concentrations of glyphosate. The FACU3 isolate was a Gram-positive, spore-forming, rod-shaped bacterium. Based on the API 50 CHB/E medium kit, biochemical properties and 16S rRNA gene sequencing, the FACU3 isolate was identified as Bacillus aryabhattai. Different bioinformatics tools, including multiple sequence alignment (MSA), the basic local alignment search tool (BLAST) and primer alignment, were used to design specific primers for goxB gene amplification and isolation. The goxB gene encodes the FAD-dependent glyphosate oxidase enzyme that is responsible for the biodegradation process. The selected primers were successfully used to amplify the goxB gene from Bacillus aryabhattai FACU3. The results indicated that Bacillus aryabhattai FACU3 can be used in glyphosate-contaminated environments for bioremediation. To our knowledge, this is the first isolation of the FAD-dependent glyphosate oxidase (goxB) gene from Bacillus aryabhattai.
19.
Jaroszewski L Rychlewski L Godzik A 《Protein science : a publication of the Protein Society》2000,9(8):1487-1496
Several recent publications illustrated the advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ between methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined, and it is shown that they can be identified based on the model quality.
20.
Improving fold recognition without folds (total citations: 4; self-citations: 0; citations by others: 4)
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.
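The link between an extreme value distribution and alignment significance can be sketched with the Gumbel (type I extreme value) form. This is a generic illustration, not AGAPE's implementation: the location `mu` and scale `beta` are assumed to have been fitted beforehand to scores of unrelated pairs, and the values used below are made up.

```python
import math


def gumbel_p_value(score: float, mu: float, beta: float) -> float:
    """Probability that a random alignment scores at least `score`.

    Assumes scores of unrelated pairs follow a Gumbel distribution with
    location `mu` and scale `beta` (both hypothetical, fitted elsewhere):
        P(S >= x) = 1 - exp(-exp(-(x - mu) / beta))
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    z = (score - mu) / beta
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(-z))


# Made-up parameters for illustration only.
MU, BETA = 30.0, 5.0
p_at_mode = gumbel_p_value(30.0, MU, BETA)   # score equal to the location
p_high = gumbel_p_value(80.0, MU, BETA)      # score far in the upper tail
```

A score far above the location parameter gives a small P-value, which is how a fitted extreme value distribution turns a raw alignment score into a significance estimate.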