共查询到20条相似文献,搜索用时 15 毫秒
1.
A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% that splits the dataset evenly (an equal number of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for single sequence input and 73.9% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis. 相似文献
2.
A new optimization-based method is presented to predict the hydrophobic residue contacts in alpha-helical proteins. The proposed approach uses a high resolution distance dependent force field to calculate the interaction energy between different residues of a protein. The formulation predicts the hydrophobic contacts by minimizing the sum of these contact energies. These residue contacts are highly useful in narrowing down the conformational space searched by protein structure prediction algorithms. The proposed algorithm also offers the algorithmic advantage of producing a rank ordered list of the best contact sets. This model was tested on four independent alpha-helical protein test sets and was found to perform very well. The average accuracy of the predictions (separated by at least six residues) obtained using the presented method was approximately 66% for single domain proteins. The average true positive and false positive distances were also calculated for each protein test set and they are 8.87 and 14.67 A, respectively. 相似文献
3.
Knowing the quality of a protein structure model is important for its appropriate usage. We developed a model evaluation method to assess the absolute quality of a single protein model using only structural features with support vector machine regression. The method assigns an absolute quantitative score (i.e. GDT‐TS) to a model by comparing its secondary structure, relative solvent accessibility, contact map, and beta sheet structure with their counterparts predicted from its primary sequence. We trained and tested the method on the CASP6 dataset using cross‐validation. The correlation between predicted and true scores is 0.82. On the independent CASP7 dataset, the correlation averaged over 95 protein targets is 0.76; the average correlation for template‐based and ab initio targets is 0.82 and 0.50, respectively. Furthermore, the predicted absolute quality scores can be used to rank models effectively. The average difference (or loss) between the scores of the top‐ranked models and the best models is 5.70 on the CASP7 targets. This method performs favorably when compared with the other methods used on the same dataset. Moreover, the predicted absolute quality scores are comparable across models for different proteins. These features make the method a valuable tool for model quality assurance and ranking. Proteins 2009. © 2008 Wiley‐Liss, Inc. 相似文献
4.
A number of methods for predicting levels of solvent accessibility or accessible surface area (ASA) of amino acid residues in proteins have been developed. These methods either predict regularly spaced states of relative solvent accessibility or an analogue real value indicating relative solvent accessibility. While discrete states of exposure can be easily obtained by post prediction assignment of thresholds to the predicted or computed real values of ASA, the reverse, that is, obtaining a real value from quantized states of predicted ASA, is not straightforward as a two-state prediction in such cases would give a large real valued errors. However, prediction of ASA into larger number of ASA states and then finding a corresponding scheme for real value prediction may be helpful in integrating the two approaches of ASA prediction. We report a novel method of obtaining numerical real values of solvent accessibility, using accumulation cutoff set and support vector machine. This so-called SVM-Cabins method first predicts discrete states of ASA of amino acid residues from their evolutionary profile and then maps the predicted states onto a real valued linear space by simple algebraic methods. Resulting performance of such a rigorous approach using 13-state ASA prediction is at least comparable with the best methods of ASA prediction reported so far. The mean absolute error in this method reaches the best performance of 15.1% on the tested data set of 502 proteins with a coefficient of correlation equal to 0.66. Since, the method starts with the prediction of discrete states of ASA and leads to real value predictions, performance of prediction in binary states and real values are simultaneously optimized. 相似文献
5.
George A. Khoury Phanourios Tamamis Neesha Pinnaduwage James Smadbeck Chris A. Kieslich Christodoulos A. Floudas 《Proteins》2014,82(5):794-814
Protein structure refinement aims to perform a set of operations given a predicted structure to improve model quality and accuracy with respect to the native in a blind fashion. Despite the numerous computational approaches to the protein refinement problem reported in the previous three CASPs, an overwhelming majority of methods degrade models rather than improve them. We initially developed a method tested using blind predictions during CASP10 which was officially ranked in 5th place among all methods in the refinement category. Here, we present Princeton_TIGRESS, which when benchmarked on all CASP 7,8,9, and 10 refinement targets, simultaneously increased GDT_TS 76% of the time with an average improvement of 0.83 GDT_TS points per structure. The method was additionally benchmarked on models produced by top performing three‐dimensional structure prediction servers during CASP10. The robustness of the Princeton_TIGRESS protocol was also tested for different random seeds. We make the Princeton_TIGRESS refinement protocol freely available as a web server at http://atlas.princeton.edu/refinement . Using this protocol, one can consistently refine a prediction to help bridge the gap between a predicted structure and the actual native structure. Proteins 2014; 82:794–814. © 2013 Wiley Periodicals, Inc. 相似文献
6.
It is well established that protein structures are more conserved than protein sequences. One-third of all known protein structures can be classified into ten protein folds, which themselves are composed mainly of alpha-helical hairpin, beta hairpin, and betaalphabeta supersecondary structural elements. In this study, we explore the ability of a recent Monte Carlo-based procedure to generate the 3D structures of eight polypeptides that correspond to units of supersecondary structure and three-stranded antiparallel beta sheet. Starting from extended or misfolded compact conformations, all Monte Carlo simulations show significant success in predicting the native topology using a simplified chain representation and an energy model optimized on other structures. Preliminary results on model peptides from nucleotide binding proteins suggest that this simple protein folding model can help clarify the relation between sequence and topology. 相似文献
7.
Local quality assessment in homology models using statistical potentials and support vector machines
Fasnacht M Zhu J Honig B 《Protein science : a publication of the Protein Society》2007,16(8):1557-1568
In this study, we address the problem of local quality assessment in homology models. As a prerequisite for the evaluation of methods for predicting local model quality, we first examine the problem of measuring local structural similarities between a model and the corresponding native structure. Several local geometric similarity measures are evaluated. Two methods based on structural superposition are found to best reproduce local model quality assessments by human experts. We then examine the performance of state-of-the-art statistical potentials in predicting local model quality on three qualitatively distinct data sets. The best statistical potential, DFIRE, is shown to perform on par with the best current structure-based method in the literature, ProQres. A combination of different statistical potentials and structural features using support vector machines is shown to provide somewhat improved performance over published methods. 相似文献
8.
The three‐dimensional structure of a protein is organized around the packing of its secondary structure elements. Predicting the topology and constructing the geometry of structural motifs involving α‐helices and/or β‐strands are therefore key steps for accurate prediction of protein structure. While many efforts have focused on how to pack helices and on how to sample exhaustively the topologies and geometries of multiple strands forming a β‐sheet in a protein, there has been little progress on generating native‐like packings of helices on sheets. We describe a method that can generate the packing of multiple helices on a given β‐sheet for αβα sandwich type protein folds. This method mines the results of a statistical analysis of the conformations of αβ2 motifs in protein structures to provide input values for the geometric attributes of the packing of a helix on a sheet. It then proceeds with a geometric builder that generates multiple arrangements of the helices on the sheet of interest by sampling through these values and performing consistency checks that guarantee proper loop geometry between the helices and the strands, minimal number of collisions between the helices, and proper formation of a hydrophobic core. The method is implemented as a module of ProteinShop. Our results show that it produces structures that are within 4–6 Å RMSD of the native one, regardless of the number of helices that need to be packed, though this number may increase if the protein has several helices between two consecutive strands in the sequence that pack on the sheet formed by these two strands. Proteins 2011; Published 2011 Wiley‐Liss, Inc. 相似文献
9.
A combined transmembrane topology and signal peptide prediction method 总被引:31,自引:0,他引:31
An inherent problem in transmembrane protein topology prediction and signal peptide prediction is the high similarity between the hydrophobic regions of a transmembrane helix and that of a signal peptide, leading to cross-reaction between the two types of predictions. To improve predictions further, it is therefore important to make a predictor that aims to discriminate between the two classes. In addition, topology information can be gained when successfully predicting a signal peptide leading a transmembrane protein since it dictates that the N terminus of the mature protein must be on the non-cytoplasmic side of the membrane. Here, we present Phobius, a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Training was done on a newly assembled and curated dataset. Compared to TMHMM and SignalP, errors coming from cross-prediction between transmembrane segments and signal peptides were reduced substantially by Phobius. False classifications of signal peptides were reduced from 26.1% to 3.9% and false classifications of transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to the proteomes of Homo sapiens and Escherichia coli. Here we also noted a drastic reduction of false classifications compared to TMHMM/SignalP, suggesting that Phobius is well suited for whole-genome annotation of signal peptides and transmembrane regions. The method is available at as well as at 相似文献
10.
Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines
In the post-genome era, the prediction of protein function is one of the most demanding tasks in the study of bioinformatics. Machine learning methods, such as the support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we integrated SVMs, protein sequence amino acid composition, and associated physicochemical properties into the study of nucleic-acid-binding proteins prediction. We developed the binary classifications for rRNA-, RNA-, DNA-binding proteins that play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on the protein data sets in which the sequences identity was < 25%. Test results show that the accuracies of rRNA-, RNA-, DNA-binding SVMs predictions are approximately 84%, approximately 78%, approximately 72%, respectively. The predictions were also performed on the ambiguous and negative data set. The results demonstrate that the predicted scores of proteins in the ambiguous data set by RNA- and DNA-binding SVM models were distributed around zero, while most proteins in the negative data set were predicted as negative scores by all three SVMs. The score distributions agree well with the prior knowledge of those proteins and show the effectiveness of sequence associated physicochemical properties in the protein function prediction. The software is available from the author upon request. 相似文献
11.
A novel support vector regression (SVR) approach is proposed to predict protein accessible surface areas (ASAs) from their primary structures. In this work, we predict the real values of ASA in squared angstroms for residues instead of relative solvent accessibility. Based on protein residues, the mean and median absolute errors are 26.0 A(2) and 18.87 A(2), respectively. The correlation coefficient between the predicted and observed ASAs is 0.66. Cysteine is the best predicted amino acid (mean absolute error is 13.8 A(2) and median absolute error is 8.37 A(2)), while arginine is the least predicted amino acid (mean absolute error is 42.7 A(2) and median absolute error is 36.31 A(2)). Our work suggests that the SVR approach can be directly applied to the ASA prediction where data preclassification has been used. 相似文献
12.
Protein structure prediction is an important problem of both intellectual and practical interest. Most protein structure prediction approaches generate multiple candidate models first, and then use a scoring function to select the best model among these candidates. In this work, we develop a scoring function using support vector regression (SVR). Both consensus-based features and features from individual structures are extracted from a training data set containing native protein structures and predicted structural models submitted to CASP5 and CASP6. The SVR learns a scoring function that is a linear combination of these features. We test this scoring function on two data sets. First, when used to rank server models submitted to CASP7, the SVR score selects predictions that are comparable to the best performing server in CASP7, Zhang-Server, and significantly better than all the other servers. Even if the SVR score is not allowed to select Zhang-Server models, the SVR score still selects predictions that are significantly better than all the other servers. In addition, the SVR is able to select significantly better models and yield significantly better Pearson correlation coefficients than the two best Quality Assessment groups in CASP7, QA556 (LEE), and QA634 (Pcons). Second, this work aims to improve the ability of the Robetta server to select best models, and hence we evaluate the performance of the SVR score on ranking the Robetta server template-based models for the CASP7 targets. The SVR selects significantly better models than the Robetta K*Sync consensus alignment score. 相似文献
13.
Transmembrane proteins (TMPs) are important drug targets because they are essential for signaling, regulation, and transport. Despite important breakthroughs, experimental structure determination remains challenging for TMPs. Various methods have bridged the gap by predicting transmembrane helices (TMHs), but room for improvement remains. Here, we present TMSEG, a novel method identifying TMPs and accurately predicting their TMHs and their topology. The method combines machine learning with empirical filters. Testing it on a non‐redundant dataset of 41 TMPs and 285 soluble proteins, and applying strict performance measures, TMSEG outperformed the state‐of‐the‐art in our hands. TMSEG correctly distinguished helical TMPs from other proteins with a sensitivity of 98 ± 2% and a false positive rate as low as 3 ± 1%. Individual TMHs were predicted with a precision of 87 ± 3% and recall of 84 ± 3%. Furthermore, in 63 ± 6% of helical TMPs the placement of all TMHs and their inside/outside topology was correctly predicted. There are two main features that distinguish TMSEG from other methods. First, the errors in finding all helical TMPs in an organism are significantly reduced. For example, in human this leads to 200 and 1600 fewer misclassifications compared to the second and third best method available, and 4400 fewer mistakes than by a simple hydrophobicity‐based method. Second, TMSEG provides an add‐on improvement for any existing method to benefit from. Proteins 2016; 84:1706–1716. © 2016 Wiley Periodicals, Inc. 相似文献
14.
When a protein sequence does not share any significant sequence similarity with a protein of known structure, homology modeling cannot be applied. However, many novel and interesting methods, such as secondary structure prediction, fold recognition, and prediction of long-range interactions, are being developed and have been shown to be reasonably successful in predicting protein structures from sequence data and evolutionary information. The a priori evaluation of the correctness of a prediction obtained by one of these methods is however often problematic. Consequently, it is important to use all available information provided by as many different methods as possible and all the available experimental data about the protein of interest, since the consistency of the results is indicative of the reliability of the prediction. Hence the need has arisen for suitable tools able to compare results provided by different methods and evaluate their consistency. We have therefore constructed GLASS, a general platform to read, visualize, compare, and evaluate prediction results from many different sources and to project these prediction results into three dimensions. In addition, GLASS allows the comparison of selected parameters calculated for a model with the distribution observed in real protein structures, thus providing an easy way to test new methods for evaluating the likelihood of different structural models. GLASS can be considered as a “workbench” for structural predictions useful to both experimentalists and theoreticians. Proteins 30:339–351, 1998. © 1998 Wiley-Liss, Inc. 相似文献
15.
Julia Koehler Leman Ralf Mueller Mert Karakas Nils Woetzel Jens Meiler 《Proteins》2013,81(7):1127-1140
Prediction of transmembrane spans and secondary structure from the protein sequence is generally the first step in the structural characterization of (membrane) proteins. Preference of a stretch of amino acids in a protein to form secondary structure and being placed in the membrane are correlated. Nevertheless, current methods predict either secondary structure or individual transmembrane states. We introduce a method that simultaneously predicts the secondary structure and transmembrane spans from the protein sequence. This approach not only eliminates the necessity to create a consensus prediction from possibly contradicting outputs of several predictors but bears the potential to predict conformational switches, i.e., sequence regions that have a high probability to change for example from a coil conformation in solution to an α‐helical transmembrane state. An artificial neural network was trained on databases of 177 membrane proteins and 6048 soluble proteins. The output is a 3 × 3 dimensional probability matrix for each residue in the sequence that combines three secondary structure types (helix, strand, coil) and three environment types (membrane core, interface, solution). The prediction accuracies are 70.3% for nine possible states, 73.2% for three‐state secondary structure prediction, and 94.8% for three‐state transmembrane span prediction. These accuracies are comparable to state‐of‐the‐art predictors of secondary structure (e.g., Psipred) or transmembrane placement (e.g., OCTOPUS). The method is available as web server and for download at www.meilerlab.org . Proteins 2013; 81:1127–1140. © 2013 Wiley Periodicals, Inc. 相似文献
16.
Distance geometry generates native-like folds for small helical proteins using the consensus distances of predicted protein structures. 下载免费PDF全文
E. S. Huang R. Samudrala J. W. Ponder 《Protein science : a publication of the Protein Society》1998,7(9):1998-2003
For successful ab initio protein structure prediction, a method is needed to identify native-like structures from a set containing both native and non-native protein-like conformations. In this regard, the use of distance geometry has shown promise when accurate inter-residue distances are available. We describe a method by which distance geometry restraints are culled from sets of 500 protein-like conformations for four small helical proteins generated by the method of Simons et al. (1997). A consensus-based approach was applied in which every inter-Calpha distance was measured, and the most frequently occurring distances were used as input restraints for distance geometry. For each protein, a structure with lower coordinate root-mean-square (RMS) error than the mean of the original set was constructed; in three cases the topology of the fold resembled that of the native protein. When the fold sets were filtered for the best scoring conformations with respect to an all-atom knowledge-based scoring function, the remaining subset of 50 structures yielded restraints of higher accuracy. A second round of distance geometry using these restraints resulted in an average coordinate RMS error of 4.38 A. 相似文献
17.
We present a simulated annealing-based method for the prediction of the tertiary structures of proteins given knowledge of the secondary structure associated with each amino acid in the sequence. The backbone is represented in a detailed fashion whereas the sidechains and pairwise interactions are modeled in a simplified way, following the LINUS model of Srinivasan and Rose. A perceptron-based technique is used to optimize the interaction potentials for a training set of three proteins. For these proteins, the procedure is able to reproduce the tertiary structures to below 3 A in root mean square deviation (rmsd) from the PDB targets. We present the results of tests on twelve other proteins. For half of these, the lowest energy decoy has a rmsd from the native state below 6 A and, in 9 out of 12 cases, we obtain decoys whose rmsd from the native states are also well below 5 A. 相似文献
18.
A novel method for ab initio prediction of protein tertiary structures, PROFESY (PROFile Enumerating SYstem), is proposed. This method utilizes the secondary structure prediction information of a query sequence and the fragment assembly procedure based on global optimization. Fifteen-residue-long fragment libraries are constructed using the secondary structure prediction method PREDICT, and fragments in these libraries are assembled to generate full-length chains of a query protein. Tertiary structures of 50 to 100 conformations are obtained by minimizing an energy function for proteins, using the conformational space annealing method that enables one to sample diverse low-lying local minima of the energy. We apply PROFESY for benchmark tests to proteins with known structures to demonstrate its feasibility. In addition, we participated in CASP5 and applied PROFESY to four new-fold targets for blind prediction. The results are quite promising, despite the fact that PROFESY was in its early stages of development. In particular, PROFESY successfully provided us the best model-one structure for the target T0161. 相似文献
19.
Prediction and classification of human G-protein coupled receptors based on support vector machines 总被引:1,自引:0,他引:1
A computational system for the prediction and classification of human G-protein coupled receptors (GPCRs) has been developed based on the support vector machine (SVM) method and protein sequence information. The feature vectors used to develop the SVM prediction models consist of statistically significant features selected from single amino acid, dipeptide, and tripeptide compositions of protein sequences. Furthermore, the length distribution difference between GPCRs and non-GPCRs has also been exploited to improve the prediction performance. The testing results with annotated human protein sequences demonstrate that this system can get good performance for both prediction and classification of human GPCRs. 相似文献
20.
Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions 总被引:16,自引:0,他引:16
Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses the support vector machines trained by multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89%, which, to the best of our knowledge, is the highest prediction rate ever reported. Our prediction is 14% higher than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data. 相似文献