Similar literature
 20 similar articles found
1.
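Based on the properties of the amino acids in proteins, a computational method is proposed that uses a BP (back-propagation) neural network, with amino acid composition and the biased auto-covariance function as the feature vector, to predict the α-helix and β-sheet secondary structure content of non-homologous proteins. The method was tested on mutually independent databases of non-homologous proteins. With the Ponnuswamy values, the results for α-helix and β-sheet content were as follows: in the self-consistency test, the mean absolute errors were 0.069 and 0.065, with standard deviations of 0.044 and 0.047, respectively; in the independent test, the mean absolute errors were 0.077 and 0.070, with standard deviations of 0.051 and 0.049, respectively. Compared with the BP neural network method that uses amino acid composition alone as the feature vector, the independent-test mean absolute errors decreased by 0.024 and 0.016 and the standard deviations by 0.031 and 0.018; compared with the improved multiple linear regression method, the independent-test mean absolute errors decreased by 0.018 and 0.011 and the standard deviations by 0.020 and 0.012. These results show that a BP neural network using amino acid composition and the biased auto-covariance function as the feature vector can effectively improve the accuracy of protein secondary structure content prediction.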
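The feature vector described in this abstract (amino acid composition plus auto-covariance terms of a per-residue index) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Kyte-Doolittle hydropathy scale stands in for the Ponnuswamy values named in the abstract, the "biased" estimator is assumed to be the standard one that divides by the full sequence length N, and the function names and the `max_lag` parameter are hypothetical.

```python
# Sketch: composition + biased auto-covariance features for one sequence.
# Assumption: Kyte-Doolittle hydropathy is used only as a stand-in index;
# the abstract itself uses the Ponnuswamy values, which are not given there.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def biased_autocovariance(values, max_lag):
    """Biased estimator: sum over N-k products, divided by N (not N-k)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / n
        for k in range(1, max_lag + 1)
    ]


def feature_vector(seq, max_lag=5):
    """20 composition terms followed by max_lag auto-covariance terms."""
    profile = [KYTE_DOOLITTLE[a] for a in seq]
    return composition(seq) + biased_autocovariance(profile, max_lag)
```

A vector built this way would then be fed to whatever regressor is being trained; the network architecture is not described in the abstract and is not sketched here.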

2.
Accurate prediction of protein secondary structural content   Cited by: 2 (self-citations: 0, citations by others: 2)
An improved multiple linear regression (MLR) method is proposed to predict a protein's secondary structural content based on its primary sequence. The amino acid composition, the autocorrelation function, and the interaction function of side-chain mass derived from the primary sequence are taken into account. The average absolute errors of prediction over 704 unrelated proteins with the jackknife test are 0.088, 0.081, and 0.059 with standard deviations 0.073, 0.066, and 0.055 for α-helix, β-sheet, and coil, respectively. The requirement that the sum of the predicted secondary structure contents be close to 1.0 was introduced as a criterion for judging whether a prediction is acceptable. When only the predictions whose summed content lies between 0.99 and 1.01 are accepted (about 11% of all proteins), the absolute errors are 0.058 for α-helix, 0.054 for β-sheet, and 0.045 for coil.
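The acceptance criterion in this abstract is simple enough to state as code. The sketch below only illustrates the check itself; the function names are hypothetical, and the 0.99 and 1.01 bounds are the ones quoted in the abstract.

```python
# Sketch of the sum-to-one acceptance criterion described in the abstract.
def is_acceptable(helix, sheet, coil, low=0.99, high=1.01):
    """Accept a prediction only if the three fractions sum to about 1.0."""
    return low <= helix + sheet + coil <= high


def filter_acceptable(predictions):
    """Keep the (helix, sheet, coil) tuples that pass the criterion."""
    return [p for p in predictions if is_acceptable(*p)]
```

For example, `filter_acceptable([(0.35, 0.25, 0.40), (0.40, 0.30, 0.40)])` keeps only the first tuple, since the second sums to 1.10.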

3.
The predictive limits of the amino acid composition for the secondary structural content (percentage of residues in the secondary structural states helix, sheet, and coil) in proteins are assessed quantitatively. For the first time, techniques for prediction of secondary structural content are presented which rely on the amino acid composition as the only information on the query protein. In our first method, the amino acid composition of an unknown protein is represented by the best (in a least square sense) linear combination of the characteristic amino acid compositions of the three secondary structural types computed from a learning set of tertiary structures. The second technique is a generalization of the first one and takes into account also possible compositional couplings between any two sorts of amino acids. Its mathematical formulation results in an eigenvalue/eigenvector problem of the second moment matrix describing the amino acid compositional fluctuations of secondary structural types in various proteins of a learning set. Possible correlations of the principal directions of the eigenspaces with physical properties of the amino acids were also checked. For example, the first two eigenvectors of the helical eigenspace correlate with the size and hydrophobicity of the residue types respectively. As learning and test sets of tertiary structures, we utilized representative, automatically generated subsets of Protein Data Bank (PDB) consisting of non-homologous protein structures at the resolution thresholds ≤1.8Å, ≤2.0Å, ≤2.5Å, and ≤3.0Å. We show that the consideration of compositional couplings improves prediction accuracy, albeit not dramatically. 
Whereas in the self-consistency test (learning with the protein to be predicted) a clear decrease of prediction accuracy with worsening resolution is observed, the jackknife test (leaving the predicted protein out) yielded the best results for the largest dataset (≤3.0 Å, with almost no difference from the self-consistency test); i.e., only this set, with more than 400 proteins, is sufficient for stable computation of the parameters in the prediction function of the second method. The average absolute errors in predicting the fractions of helix, sheet, and coil from the amino acid composition of the query protein are 13.7, 12.6, and 11.4%, respectively, with r.m.s. deviations in the range of 8.6-11.8% for the 3.0 Å dataset in a jackknife test. The absolute precision of the average absolute errors is in the range of 1-3%, as measured for other representative subsets of the PDB. Secondary structural content prediction methods found in the literature have been clustered in accordance with their prediction accuracies. To our surprise, much more complex secondary structure prediction methods utilized for the same purpose of secondary structural content prediction achieve prediction accuracies very similar to those of the present analytic techniques, implying that all the information beyond the amino acid composition is, in fact, mainly utilized for positioning the secondary structural state in the sequence but not for determination of the overall number of residues in a secondary structural type. This result implies that higher prediction accuracies cannot be achieved relying solely on the amino acid composition of an unknown query protein as prediction input. Our prediction program SSCP has been made available as a World Wide Web and E-mail service. © 1996 Wiley-Liss, Inc.
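The first method in this abstract represents the query composition as the best least-squares linear combination of characteristic compositions of the secondary structural types. A minimal sketch of that idea, solving the normal equations in pure Python, might look like the following. It is not the SSCP program: the function names are hypothetical, no constraints (non-negativity, sum to one) are imposed, and the characteristic compositions are assumed to be linearly independent so that the system is not singular.

```python
# Sketch: unconstrained least-squares fit of a query composition q as
# q ~ f_helix * c_helix + f_sheet * c_sheet + f_coil * c_coil.
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def least_squares_content(query, class_compositions):
    """Coefficients minimizing ||query - sum_s f_s * c_s||^2."""
    k = len(class_compositions)
    gram = [[dot(class_compositions[i], class_compositions[j])
             for j in range(k)] for i in range(k)]
    rhs = [dot(class_compositions[i], query) for i in range(k)]
    return solve_linear(gram, rhs)
```

In practice the characteristic compositions would be 20-dimensional vectors computed from a learning set of known structures; the solver above works for any vector length.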

4.
A priori knowledge of secondary structure content can be of great use in theoretical and experimental determination of protein structure. We present a method that uses two computer-simulated neural networks placed in "tandem" to predict the secondary structure content of water-soluble, globular proteins. The first of the two networks, NET1, predicts a protein's helix and strand content given information about the protein's amino acid composition, molecular weight and heme presence. Because NET1 contained more adjustable parameters (network weights) than learning examples, this network experienced problems with memorization, which is the inability to generalize onto new, never-seen-before examples. To overcome this problem, we designed a second network, NET2, which learned to determine when NET1 was in a state of generalization. Together, these two networks produce prediction errors as low as 5.0% and 5.6% for helix and strand content, respectively, on a set of protein crystal structures bearing little homology to those used in network training. A comparison between three other methods including a multiple linear regression analysis, a non-hidden-node network analysis and a secondary structure assignment analysis reveals that our tandem neural network scheme is, indeed, the best method for predicting secondary structure content. The results of our analysis suggest that the knowledge of sequence information is not necessary for highly accurate predictions of protein secondary structure content.

5.
Homaeian L, Kurgan LA, Ruan J, Cios KJ, Chen K. Proteins 2007, 69(3): 486-498
Secondary protein structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. The significant majority of successful methods for secondary structure prediction are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. The PSSC-core is shown to provide statistically significantly better results when compared with the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors, frequency of tetra-peptides associated with helical and strand conformations, various property-based groups like exchange groups, chemical groups of the side chains and hydrophobic group, auto-correlations based on hydrophobicity, side-chain masses, hydropathy, and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting the secondary structure content that can be used to validate and constrain results of other structure prediction methods.
At the same time, it also provides useful insight into the design of successful protein sequence representations that can be used in developing new methods related to prediction of different aspects of the secondary protein structure.

6.
Prediction of protein (domain) structural classes based on amino-acid index.   Cited by: 10 (self-citations: 0, citations by others: 10)
A protein (domain) is usually classified into one of the following four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta. In this paper, a new formulation is proposed to predict the structural class of a protein (domain) from its primary sequence. Instead of the amino-acid composition used widely in the previous structural class prediction work, the auto-correlation functions based on the profile of amino-acid index along the primary sequence of the query protein (domain) are used for the structural class prediction. Consequently, the overall predictive accuracy is remarkably improved. For the same training database consisting of 359 proteins (domains) and the same component-coupled algorithm [Chou, K.C. & Maggiora, G.M. (1998) Protein Eng. 11, 523-538], the overall predictive accuracy of the new method for the jackknife test is 5-7% higher than the accuracy based only on the amino-acid composition. The overall predictive accuracy finally obtained for the jackknife test is as high as 90.5%, implying that a significant improvement has been achieved by making full use of the information contained in the primary sequence for the class prediction. This improvement depends on the size of the training database, the auto-correlation functions selected and the amino-acid index used. We have found that the amino-acid index proposed by Oobatake and Ooi, i.e. the average nonbonded energy per residue, leads to the optimal predictive result for the datasets studied in this paper. This study may be considered as an alternative step towards making the structural class prediction more practical.

7.
In this study we classified regions of random coil into four types: coil between alpha helix and beta strand, coil between beta strand and alpha helix, coil between two alpha helices and coil between two beta strands. This classification may be considered as natural. We used 610 3D structures of proteins collected from the Protein Data Bank from bacteria with low, average and high genomic GC-content. Relatively short regions of coil are not random: certain amino acid residues are more or less frequent in each of the types of coil. Namely, hydrophobic amino acids with branched side chains (Ile, Val and Leu) are rare in coil between two beta strands, unlike some acrophilic amino acids (Asp, Asn and Gly). In contrast, coil between two alpha helices is enriched in Leu. Regions of coil between alpha helix and beta strand are enriched in positively charged amino acids (Arg and Lys), while the usage of residues with side chains possessing a hydroxyl group (Ser and Thr) is low in them, in contrast to the regions of coil between beta strand and alpha helix. Regions of coil between beta strand and alpha helix are significantly enriched in Cys residues. The response to the symmetric mutational pressure (AT-pressure or GC-pressure) is also quite different for the four types of coil. The most conserved regions of coil are "connecting bridges" between beta strand and alpha helix, since their amino acid content shows a weaker dependence on the GC-content of genes than the amino acid contents of the other three types of coil. Possible causes and consequences of the described differences in amino acid content distribution between different types of random coil have been discussed.

8.
The primary structure of hamster vimentin has been derived from the nucleotide sequence of the corresponding cDNA. The elucidation of the amino acid sequence allowed the prediction of a model in which two helix domains occur. The N-terminal and C-terminal parts have been characterized as nonhelical domains. Moreover, the two helices are separated by a third nonhelical region.

9.
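Prediction of the structural class of non-homologous proteins from their primary sequences   Cited by: 8 (self-citations: 1, citations by others: 7)
Algorithms that predict protein structural class from features extracted from amino acid composition, the auto-correlation function and the auto-covariance function were analyzed and compared, and prediction methods that combine amino acid composition with the auto-correlation function, and amino acid composition with the auto-covariance function, were studied. The results for non-homologous proteins are as follows. For the method combining amino acid composition with the auto-correlation function, using the hydrophobicity values of Miyazawa and Jernigan, the overall accuracy was 95.34% in the self-consistency test on the training set, 81.92% in the jackknife test, and 86.61% in the independent test on the test set. For the method combining amino acid composition with the auto-covariance function, using the hydrophobicity values of Wold et al., the overall accuracy was 96.71% in the self-consistency test on the training set, 82.18% in the jackknife test, and 86.88% in the independent test on the test set. Both combined methods therefore improve the accuracy of structural class prediction effectively, which indicates that extracting more useful sequence information is the key to improving classification accuracy.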

10.
Characterizing and classifying regularities in protein structure is an important element in uncovering the mechanisms that regulate protein structure, function and evolution. Recent research concentrates on analysis of structural motifs that can be used to describe larger, fold-sized structures based on homologous primary sequences. At the same time, the accuracy of secondary protein structure prediction based on multiple sequence alignment drops significantly when low homology (twilight zone) sequences are considered. To this end, this paper addresses the problem of providing an alternative sequence representation that would improve the ability to distinguish secondary structures for the twilight zone sequences without using alignment. We consider a novel classification problem in which structural motifs, referred to as structural fragments (SFs), are defined as uniform strand, helix and coil fragments. Classification of SFs allows the design of novel sequence representations and the investigation of which other factors and prediction algorithms may result in improved discrimination. Comprehensive experimental results show that a statistically significant improvement in classification accuracy can be achieved by: (1) improving sequence representations, and (2) removing possible noise on the terminal residues in the SFs. Combining these two approaches reduces the error rate on average by 15% when compared to classification using standard representation and noisy information on the terminal residues, bringing the classification accuracy to over 70%. Finally, we show that certain prediction algorithms, such as neural networks and boosted decision trees, are superior to other algorithms. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).

11.
12.
Xue B, Dor O, Faraggi E, Zhou Y. Proteins 2008, 72(1): 427-433
The backbone structure of a protein is largely determined by the phi and psi torsion angles. Thus, knowing these angles, even if approximately, will be very useful for protein-structure prediction. However, in a previous work, a sequence-based, real-value prediction of psi angle could only achieve a mean absolute error of 54 degrees (83 degrees, 35 degrees, 33 degrees for coil, strand, and helix residues, respectively) between predicted and actual angles. Moreover, a real-value prediction of phi angle is not yet available. This article employs a neural-network based approach to improve psi prediction by taking advantage of angle periodicity and applies the new method to the prediction of phi angles. The 10-fold-cross-validated mean absolute error for the new method is 38 degrees (58 degrees, 33 degrees, 22 degrees for coil, strand, and helix, respectively) for psi and 25 degrees (35 degrees, 22 degrees, 16 degrees for coil, strand, and helix, respectively) for phi. The accuracy of real-value prediction is comparable to or better than that of predictions based on multistate classification of the phi-psi map. More accurate prediction of real-value angles will likely be useful for improving the accuracy of fold recognition and ab initio protein-structure prediction. The Real-SPINE 2.0 server is available on the website http://sparks.informatics.iupui.edu.
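The abstract does not say how angle periodicity is exploited inside the network, so the sketch below illustrates only one related point: when scoring torsion-angle predictions, the error should be measured around the circle, so that 179 degrees and -179 degrees count as 2 degrees apart rather than 358. The function names are hypothetical, and this is an assumption about how such errors are commonly computed, not a description of Real-SPINE.

```python
# Sketch: mean absolute error for angles in degrees, respecting the
# 360-degree periodicity of phi and psi.
def angular_error(pred, actual):
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = abs(pred - actual) % 360.0
    return min(d, 360.0 - d)


def mean_absolute_angular_error(preds, actuals):
    """Average of the periodic errors over paired predictions."""
    errors = [angular_error(p, a) for p, a in zip(preds, actuals)]
    return sum(errors) / len(errors)
```

Without the wrap-around step, residues whose angles sit near the ±180 degree boundary would contribute misleadingly large errors to the average.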

13.
Yuan Z, Huang B. Proteins 2004, 57(3): 558-564
A novel support vector regression (SVR) approach is proposed to predict protein accessible surface areas (ASAs) from their primary structures. In this work, we predict the real values of ASA in squared angstroms for residues instead of relative solvent accessibility. Based on protein residues, the mean and median absolute errors are 26.0 Å² and 18.87 Å², respectively. The correlation coefficient between the predicted and observed ASAs is 0.66. Cysteine is the best predicted amino acid (mean absolute error is 13.8 Å² and median absolute error is 8.37 Å²), while arginine is the least well predicted amino acid (mean absolute error is 42.7 Å² and median absolute error is 36.31 Å²). Our work suggests that the SVR approach can be directly applied to the ASA prediction where data preclassification has been used.

14.
Wu S, Zhang Y. PLoS ONE 2008, 3(10): e3400
We developed a composite machine-learning based algorithm, called ANGLOR, to predict real-value protein backbone torsion angles from amino acid sequences. The input features of ANGLOR include sequence profiles, predicted secondary structure and solvent accessibility. In a large-scale benchmarking test, the mean absolute error (MAE) of the phi/psi prediction is 28 degrees/46 degrees, which is approximately 10% lower than that generated by software in the literature. The prediction is statistically different from a random predictor (or a purely secondary-structure-based predictor) with p-value <1.0 × 10^-300 (or <1.0 × 10^-148) by Wilcoxon signed rank test. For some residues (ILE, LEU, PRO and VAL) and especially the residues in helix and buried regions, the MAE of phi angles is much smaller (10-20 degrees) than that in other environments. Thus, although the average accuracy of the ANGLOR prediction is still low, the portion of the accurately predicted dihedral angles may be useful in assisting protein fold recognition and ab initio 3D structure modeling.

15.
The amino acid sequence of the first domain (positions 1-175) of Panulirus interruptus hemocyanin subunit a has been determined. The sequence of residues 1-158 (18-kDa fragment obtained by limited proteolysis) was derived from peptides obtained by digestion of this fragment with CNBr and trypsin and by subdigestion of these peptides with other enzymes. The peptides were sequenced automatically or manually. The amino acid sequence has been fitted into the electron-density map at 0.32-nm resolution. The residues of domain 1 are folded into a large, mainly helical, globular part, containing one disulfide bridge, and a smaller part near the molecular twofold axis. The latter part consists of an alpha helix and a beta strand which contains a covalently attached carbohydrate moiety. The sites susceptible to limited proteolytic cleavage of the subunit are discussed. Comparison of the N-terminal sequence with those of other arthropod hemocyanins revealed, besides an N-terminal extension of five residues, the presence of a 21-residue loop (positions 22-42) in the crustacean sequences. This loop contains helix 1.2, a less defined region in the electron-density map. It is absent in chelicerate sequences. Strong evidence is presented that: (a) the structure of the first 21 residues (including helix 1.1) is the same in all arthropod hemocyanins with known amino acid sequence; (b) a stretch containing about 15 residues (including part of helix 1.3) following the 21-residue loop has a different structure in crustaceans and chelicerates; (c) the rest of domain 1 has the same structure again. It is shown that all conserved residues are in the contact region with the other two domains.

16.
A suite of FORTRAN programs, PREF, is described for calculating preference functions from the database of known protein structures and for comparing smoothed profiles of sequence-dependent preferences in proteins of unknown structure. Amino acid preferences for a secondary structure are considered as functions of a sequence environment. The sequence environment of an amino acid residue in a protein is defined as an average over some physical, chemical, or statistical property of its primary structure neighbors. The frequency distribution of sequence environments in the database of soluble protein structures is approximately normal for each amino acid type of known secondary conformation. An analytical expression for the dependence of preferences on sequence environment is obtained after each frequency distribution is replaced by the corresponding Gaussian function. The preference for the α-helical conformation increases for each amino acid type with the increase of sequence environment of buried solvent-accessible surface areas. We show that a set of preference functions based on buried surface area is useful for predicting folding motifs in α-class proteins and in integral membrane proteins. The prediction accuracy for helical residues is 79% for 5 integral membrane proteins and 74% for 11 α-class soluble proteins. Most residues found in transmembrane segments of membrane proteins with known α-helical structure are predicted to be indeed in the helical conformation because of very high middle helix preferences. Both extramembrane and transmembrane helices in the photosynthetic reaction center M and L subunits are correctly predicted. We point out in the discussion that our method of conformational preference functions can identify what physical properties of the amino acids are important in the formation of particular secondary structure elements. © 1993 John Wiley & Sons, Inc.

17.
MOTIVATION: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or profile of a homologous family member(s). RESULTS: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as a standard protein sequence alignment that allows for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences. AVAILABILITY: The source codes for the program 'aln' written in ANSI-C and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc. CONTACT: gotoh@cancer-c.pref.saitama.jp

18.
Lee S, Lee BC, Kim D. Proteins 2006, 62(4): 1107-1114
Knowing protein structure and inferring its function from the structure are among the main issues of computational structural biology, and often the first step is studying protein secondary structure. There have been many attempts to predict protein secondary structure content. Previous attempts assumed that the content of protein secondary structure can be predicted successfully using the information on the amino acid composition of a protein. Recent methods achieved remarkable prediction accuracy by using the expanded composition information. The overall average error of the most successful method is 3.4%. Here, we demonstrate that even if we use only the simple amino acid composition information, it is possible to improve the prediction accuracy significantly if the evolutionary information is included. The idea is motivated by the observation that evolutionarily related proteins share similar structures. After calculating the homolog-averaged amino acid composition of a protein, which can be easily obtained from the multiple sequence alignment by running PSI-BLAST, those 20 numbers are learned by a multiple linear regression, an artificial neural network and a support vector regression. The overall average error of the method using support vector regression is 3.3%. It is remarkable that we obtain comparable accuracy without utilizing the expanded composition information such as pair-coupled amino acid composition. This work again demonstrates that the amino acid composition is a fundamental characteristic of a protein. It is anticipated that our novel idea can be applied to many areas of protein bioinformatics where the amino acid composition information is utilized, such as subcellular localization prediction, enzyme subclass prediction, domain boundary prediction, signal sequence prediction, and prediction of unfolded segments in a protein sequence, to name a few.
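The homolog-averaged amino acid composition mentioned in this abstract can be illustrated with a short sketch. It assumes the aligned homologs are already available as plain strings (for example, rows taken from a multiple sequence alignment), treats every homolog with equal weight, and ignores gap characters and non-standard letters; the abstract does not specify these details, and the function names are hypothetical.

```python
# Sketch: average the 20-component composition over a set of aligned homologs.
# Assumption: equal weighting per sequence; gaps ('-') and any non-standard
# letters are skipped when counting residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each standard amino acid among the residues of seq."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def homolog_averaged_composition(aligned_seqs):
    """Component-wise mean of the compositions of all aligned sequences."""
    comps = [composition(s) for s in aligned_seqs]
    return [sum(col) / len(comps) for col in zip(*comps)]
```

The resulting 20 numbers would serve as the input features for the regression models named in the abstract; the alignment step itself (running PSI-BLAST) is outside this sketch.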

19.
PDC-109, a protein of unknown function, is a major component of bovine seminal plasma. Using a computer program designed to detect evolutionary relationships between proteins, I find that the PDC-109 protein is similar to the gelatin-binding domain of bovine fibronectin and part of a kringle domain of human tissue-type plasminogen activator (t-PA). The computer-based comparison of the amino acid sequence of PDC-109 with that of the gelatin-binding domain of fibronectin and part of the second kringle domain of t-PA yields scores that are 15.5 standard deviations and 7.8 standard deviations higher, respectively, than were obtained with a comparison of randomized sequences of these proteins. The probability (p) of getting these scores by chance is less than 10^-50 and 3 × 10^-15, respectively. The similarity between the amino acid sequences of PDC-109 and the gelatin-binding domain in fibronectin and the kringle of t-PA suggests some approaches for identifying the functions of PDC-109. Both t-PA and the gelatin-binding domain of fibronectin have adhesive functions, and the gelatin-binding domain promotes viral transformation of fibroblasts in culture. These functions may be associated with the PDC-109 protein.

20.
Feng ZP. In Silico Biology 2002, 2(3): 291-303
The present paper gives an overview of the problem of predicting the subcellular location of a protein. Five measures for extracting information from the global sequence, based on the Bayes discriminant algorithm, are reviewed: 1) the auto-correlation functions of amino acid indices along the sequence; 2) the quasi-sequence-order approach; 3) the pseudo-amino acid composition; 4) the unified attribute vector in Hilbert space; 5) Zp parameters extracted from the Zp curve. The actual predictive accuracy is closely related to the degree of similarity between the training and testing sets, or to the average degree of pairwise similarity in the dataset in a cross-validated study. Many researchers have argued that the currently high predictive accuracies cannot guarantee that the available algorithms are effective in practical prediction, because of the high pairwise sequence identity within the datasets; others have argued that the construction of datasets used for developing software should reflect the reality that some subcellular locations contain only a small number of proteins, some of which even have a high percentage of sequence similarity. Owing to the complexity of the problem itself, some very sophisticated and specialized programs are needed both for constructing datasets and for improving the prediction. In any case, finding the target information in the mature protein sequence and properly combining it with sorting signals in prediction may further improve the overall predictive accuracy and make the prediction practical.
