Similar literature
20 similar documents found.
1.
Prediction of protein classification is an important topic in molecular biology because it not only provides useful information from the viewpoint of structure itself, but also greatly stimulates the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, LogitBoost, one of the recently developed boosting algorithms, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and copes particularly well with noisy data. LogitBoost outperformed support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power to predict protein structural classes, as well as many other bio-macromolecular attributes, will be further strengthened if LogitBoost and other existing algorithms can effectively complement each other.

2.
Structural class characterizes the overall folding type of a protein or its domain, and the prediction of protein structural class has become both an important and a challenging topic in protein science. Moreover, the prediction itself can stimulate the development of novel predictors that may be straightforwardly applied to many other related areas. In this paper, 10 frequently used sequence-derived structural and physicochemical features, which can be easily computed by the PROFEAT (Protein Features) web server, were taken as inputs of support vector machines to develop statistical learning models for predicting the protein structural class. More importantly, a strategy for merging different features, called best-first search, was developed. The rigorous jackknife cross-validation test showed that the success rates of our method were significantly improved. We anticipate that the present method may also have an important impact on boosting the predictive accuracies for a series of other protein attributes, such as subcellular localization, membrane types, and enzyme family and subfamily classes, among many others.

3.
Wang ZX, Yuan Z. Proteins, 2000, 38(2): 165-175
Proteins of known structures are usually classified into four structural classes: all-alpha, all-beta, alpha+beta, and alpha/beta type of proteins. A number of methods for predicting the structural class of a protein based on its amino acid composition have been developed during the past few years. Recently, a component-coupled method was developed for predicting protein structural class according to amino acid composition. This method is based on the least Mahalanobis distance principle, and yields much better predicted results in comparison with the previous methods. However, the success rates reported for structural class prediction by different investigators are contradictory. The highest reported accuracies by this method are near 100%, but the lowest one is only about 60%. The goal of this study is to resolve this paradox and to determine the possible upper limit of prediction rate for structural classes. In this paper, based on the normality assumption and the Bayes decision rule for minimum error, a new method is proposed for predicting the structural class of a protein according to its amino acid composition. The detailed theoretical analysis indicates that if the four protein folding classes are governed by the normal distributions, the present method will yield the optimum predictive result in a statistical sense. A non-redundant data set of 1,189 protein domains is used to evaluate the performance of the new method. Our results demonstrate that 60% correctness is the upper limit for a 4-type class prediction from amino acid composition alone for an unknown query protein. The apparent relatively high accuracy level (more than 90%) attained in the previous studies was due to the preselection of test sets, which may not be adequately representative of all unrelated proteins.
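
The Bayes decision rule under the normality assumption described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ridge term that keeps each covariance matrix invertible, and the use of empirical class frequencies as priors are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_classes(X, y, ridge=1e-6):
    """Estimate mean, covariance and prior for each class label in y."""
    params = {}
    for label in np.unique(y):
        Xk = X[y == label]
        mean = Xk.mean(axis=0)
        # Assumption of this sketch: a small ridge keeps the covariance invertible.
        cov = np.cov(Xk, rowvar=False) + ridge * np.eye(X.shape[1])
        params[label] = (mean, cov, len(Xk) / len(X))
    return params

def predict_bayes(params, x):
    """Return the class with the largest Gaussian log-posterior."""
    best_label, best_score = None, -np.inf
    for label, (mean, cov, prior) in params.items():
        diff = x - mean
        _, logdet = np.linalg.slogdet(cov)
        # Log of prior times normal density, dropping the class-independent constant.
        score = (-0.5 * diff @ np.linalg.solve(cov, diff)
                 - 0.5 * logdet + np.log(prior))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With equal priors and equal covariance determinants, this rule reduces to choosing the class with the least Mahalanobis distance, which is the principle the abstract attributes to the earlier component-coupled method.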

4.
Prediction of protein domain with mRMR feature selection and analysis (total citations: 2; self-citations: 0; citations by others: 2)
Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC. PLoS ONE, 2012, 7(6): e39308
Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information still remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than those by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

5.
A protein is usually classified into one of the following four structural classes: all alpha, all beta, (alpha + beta) and alpha/beta. In this paper, based on the maximum correlation-coefficient principle, a new formulation is proposed for predicting the structural class of a protein according to its amino acid composition. Calculations have been made for a development set of proteins from which the amino acid compositions for the standard structural classes were derived, and for an independent set of proteins outside the development set. The former tests the self-consistency of a method and the latter tests its extrapolating effectiveness. In both cases, the results showed that the new method gave a considerably higher rate of correct prediction than any of the previous methods, implying that a significant improvement has been achieved by implementing the maximum correlation-coefficient principle in the new method.

6.
The extremely complicated nature of many biological problems makes them bear the features of fuzzy sets, such as vague, imprecise, noisy, ambiguous, or input-missing information. For instance, the current data used in classifying protein structural classes typically form a fuzzy set. To deal with this kind of problem, the AAPCA (Amino Acid Principal Component Analysis) approach was introduced. In the AAPCA approach, the 20-dimensional amino acid composition space is reduced to an orthogonal space with fewer dimensions, and the original base functions are converted into a set of orthogonal and normalized base functions. The advantage of such an approach is that it can minimize the random errors and redundant information in a protein dataset through principal component selection, remarkably improving the success rates in predicting protein structural classes. It is anticipated that the AAPCA approach can be used to deal with many other classification problems in proteins as well.
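
The dimensionality reduction described in this abstract can be sketched with an ordinary principal component analysis. This is a generic PCA via singular value decomposition, not the AAPCA procedure itself; the function name and the choice of how many components to keep are assumptions for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components.

    X holds one 20-dimensional amino acid composition vector per row.
    Returns the projected data and the orthonormal basis that was used.
    """
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components]
    return Xc @ basis.T, basis
```

Because the rows of the returned basis are orthogonal and have unit length, they play the role of the orthogonal and normalized base functions mentioned above; discarding the low-variance directions is what removes redundant information.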

7.
Protein domains are functional and structural units of proteins. Therefore, identification of domain–domain interactions (DDIs) can provide insight into the biological functions of proteins. In this article, we propose a novel discriminative approach for predicting DDIs based on both protein–protein interactions (PPIs) and the derived information of non-PPIs. We make a threefold contribution to the work in this area. First, we take into account non-PPIs explicitly and treat the domain combinations that can discriminate PPIs from non-PPIs as putative DDIs. Second, DDI identification is formalized as a feature selection problem, which seeks a minimum set of informative features (i.e., putative DDIs) that discriminate PPIs from non-PPIs; this is biologically plausible and able to predict DDIs in a systematic and accurate manner. Third, multidomain combinations, including two-domain combinations, are taken into account in the proposed method, since multidomain cooperation may help proteins to interact with each other. Numerical results on several DDI prediction benchmark data sets show that the proposed discriminative method performs comparably well with other top algorithms with respect to overall performance, and outperforms other methods in terms of precision. The PPI data sets used for prediction of DDIs and the prediction results can be found at http://csb.shu.edu.cn/dipd . Proteins 2010. © 2009 Wiley-Liss, Inc.

8.
Kaleel M, Torrisi M, Mooney C, Pollastri G. Amino Acids, 2019, 51(9): 1289-1296

Predicting the three-dimensional structure of proteins is a long-standing challenge of computational biology, as the structure (or lack of a rigid structure) is well known to determine a protein’s function. Predicting relative solvent accessibility (RSA) of amino acids within a protein is a significant step towards resolving the protein structure prediction challenge, especially in cases in which structural information about a protein is not available by homology transfer. Today, arguably the core of the most powerful methods for predicting RSA and other structural features of proteins is some form of deep learning, and all the state-of-the-art protein structure prediction tools rely on some machine learning algorithm. In this article we present a deep neural network architecture composed of stacks of bidirectional recurrent neural networks and convolutional layers, which is capable of mining information from long-range interactions within a protein sequence, and apply it to the prediction of protein RSA using a novel encoding method that we shall call “clipped”. The final system we present, PaleAle 5.0, which is available as a public server, predicts RSA into two, three and four classes at an accuracy exceeding 80% in two classes, surpassing the performances of all the other predictors we have benchmarked.

9.
Prediction of β-turns from amino acid sequences has long been recognized as an important problem in structural bioinformatics due to their frequent occurrence as well as their structural and functional significance. Because various structural features of proteins are intercorrelated, secondary structure information has often been employed as an additional input for machine learning algorithms when predicting β-turns. Here we present a novel bidirectional Elman-type recurrent neural network with multiple output layers (MOLEBRNN) capable of predicting multiple mutually dependent structural motifs and demonstrate its efficiency in recognizing three aspects of protein structure: β-turns, β-turn types, and secondary structure. The advantage of our method compared to other predictors is that it does not require any external input except for sequence profiles because interdependencies between different structural features are taken into account implicitly during the learning process. In a sevenfold cross-validation experiment on a standard test dataset our method exhibits a total prediction accuracy of 77.9% and a Matthews correlation coefficient of 0.45, the highest performance reported so far. It also outperforms other known methods in delineating individual turn types. We demonstrate how simultaneous prediction of multiple targets influences prediction performance on single targets. The MOLEBRNN presented here is a generic method applicable in a variety of research fields where multiple mutually dependent target classes need to be predicted. Availability: http://webclu.bio.wzw.tum.de/predator-web/.
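
For readers unfamiliar with the Matthews correlation coefficient reported in this abstract, the standard binary definition can be sketched as follows. The function name is illustrative, and returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0 when any marginal count is zero.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than raw accuracy when turn and non-turn residues are unevenly represented.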

10.
Proteins are generally classified into four structural classes: all-alpha proteins, all-beta proteins, alpha + beta proteins, and alpha/beta proteins. In this article, a protein is expressed as a vector of 20-dimensional space, in which its 20 components are defined by the composition of its 20 amino acids. Based on this, a new method, the so-called maximum component coefficient method, is proposed for predicting the structural class of a protein according to its amino acid composition. In comparison with the existing methods, the new method yields a higher general accuracy of prediction. Especially for the all-alpha proteins, the rate of correct prediction obtained by the new method is much higher than that by any of the existing methods. For instance, for the 19 all-alpha proteins investigated previously by P.Y. Chou, the rate of correct prediction by means of his method was 84.2%, but the correct rate when predicted with the new method would be 100%! Furthermore, the new method is characterized by an explicable physical picture. This is reflected by the process in which the vector representing a protein to be predicted is decomposed into four component vectors, each of which corresponds to one of the norms of the four protein structural classes.
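
The 20-dimensional representation described in this abstract can be sketched as a simple frequency count. The function name, the alphabetical ordering of the one-letter codes, and the decision to ignore non-standard residues are assumptions for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the 20-component amino acid composition of a protein sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Each component is the fraction of residues of that type.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

For example, `amino_acid_composition("AAC")` gives 2/3 for alanine, 1/3 for cysteine and 0 for the remaining 18 components, so the vector always sums to 1.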

11.
The membrane protein type is an important feature in characterizing the overall topological folding type of a protein or its domains. Many investigators have devoted effort to the prediction of membrane protein type. Here, we propose a new approach, the bootstrap aggregating method or bragging learner, to address this problem based on the protein amino acid composition. As a demonstration, the benchmark dataset constructed by K.C. Chou and D.W. Elrod was used to test the new method. The overall success rate thus obtained by jackknife cross-validation was over 84%, indicating that the bragging learner as presented in this paper holds high potential for predicting the attributes of proteins, or at least can play a complementary role to many existing algorithms in this area. It is anticipated that the prediction quality can be further enhanced if the pseudo amino acid composition can be effectively incorporated into the current predictor. An online membrane protein type prediction web server developed in our lab is available at http://chemdata.shu.edu.cn/protein/protein.jsp.

12.
It is a critical challenge to develop automated methods for quickly and accurately determining the structures of proteins because of the increasingly widening gap between the number of sequence-known proteins and that of structure-known proteins in the post-genomic age. Knowledge of the protein structural class can provide useful information towards the determination of protein structure. Thus, it is highly desirable to develop computational methods for identifying the structural classes of newly found proteins based on their primary sequence. In this study, according to the concept of Chou's pseudo amino acid composition (PseAA), eight PseAA vectors are used to represent protein samples. Each of the PseAA vectors is a 40-D (dimensional) vector, which is constructed from the conventional amino acid composition (AA) and a series of sequence-order correlation factors as originally introduced by Chou. The difference among the eight PseAA representations is that different physicochemical properties are used to incorporate the sequence-order effects for the protein samples. Based on such a framework, a dual-layer fuzzy support vector machine (FSVM) network is proposed to predict protein structural classes. In the first layer of the FSVM network, eight FSVM classifiers trained on different PseAA vectors are established. The second-layer FSVM classifier is applied to reclassify the outputs of the first layer. The results thus obtained are quite promising, indicating that the new method may become a useful tool for predicting not only the structural classification of proteins but also their other attributes.
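
A 40-D pseudo amino acid composition vector of the kind described in this abstract (20 composition terms plus 20 sequence-order correlation factors) can be sketched as below. This is a minimal illustration, not the authors' code: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale, chosen here as an assumption), a weight of 0.05, and hypothetical function names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as an example property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}

def standardize(scale):
    """Rescale a property to zero mean and unit deviation over the 20 residues."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / std for aa, v in scale.items()}

def pseudo_aa_composition(sequence, lam=20, weight=0.05, scale=HYDROPATHY):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    seq = [ch for ch in sequence.upper() if ch in AMINO_ACIDS]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(scale)
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    thetas = []
    for j in range(1, lam + 1):
        pairs = len(seq) - j
        # Mean squared property difference between residues j positions apart.
        thetas.append(
            sum((prop[seq[i + j]] - prop[seq[i]]) ** 2 for i in range(pairs)) / pairs
        )
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

Swapping in a different property scale for the `scale` argument would give a different PseAA representation of the same protein, which is the kind of variation the abstract describes across its eight vectors.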

13.
Juvenile X-linked retinoschisis (XLRS) is a retinal disease caused by mutations in the secretory protein retinoschisin (RS1). Most cases of the disease result from single point mutations in the RS1 discoidin domain, with cysteine mutations being related to some of the more severe cases of XLRS. Previous studies have indicated that two mutations (C110Y and C219G), which involve cysteines that form intramolecular disulfide bonds in the native discoidin domain, resulted in different oligomerization states of the proteins and did not correlate with the degree of protein stability as calculated by the change in folding free energy. Through homology modeling, bioinformatics predictions, molecular dynamics (MD) and docking simulations, we investigate the effects of these two mutations on the structure of the RS1 discoidin domain in relation to the discrepancy found between structural stability and aggregation propensity. Based on our findings, this discrepancy can be explained by the ability of the C110Y mutant to establish suitable modules for initiating amorphous aggregation and to expand the aggregating mass through predominantly hydrophobic interactions. The low capability of the C219G mutant to oligomerize, on the other hand, may be due to its greater structural instability and lesser hydrophobic tendency, two properties that may not support aggregation. The results, altogether, indicate that aggregation propensity in the RS1 C110Y mutant depends on the formation of suitable aggregating substrates for propagation of aggregation and is not directly related to or determined by overall structural instability. As for the wild-type protein, the binding specificity of the spikes for biological function and the formation of the octameric structure are contributed by important loop interactions, as well as by evolved structural and sequence-based properties that prevent aggregation.

14.
G-protein-coupled receptors (GPCRs) are the largest family of membrane-bound receptors and play a vital role in various biological processes; because they are amenable to drug intervention, they are in the spotlight of the pharmaceutical industry. Experimental methods are both time-consuming and expensive, so there is a need to develop a computational approach to classification to expedite the drug discovery process. In the present study, a domain-based classification model was developed by employing and evaluating various machine learning approaches such as Bagging, J48, Bayes net, and Naive Bayes. Various software tools are available for predicting domains, but their results and accuracy vary for the same input, which makes it difficult to choose any one of them. To address this problem, a simulation model was developed using five well-known domain prediction tools to find the best predicted result with maximum accuracy. The classifier performs classification up to three levels for class A. Accuracies of 98.59% by Naive Bayes for level I, 92.07% by J48 for level II, and 82.14% by Bagging for level III were achieved.

15.
Correlations of amino acids in proteins (total citations: 2; self-citations: 0; citations by others: 2)
Du Q, Wei D, Chou KC. Peptides, 2003, 24(12): 1863-1869
A correlation analysis among 20 amino acids is performed for four protein structural classes (α, β, α/β, and α+β) in a total of 204 proteins. The correlation relationships among amino acids can be classified into the following four types: (1) strong positive correlation, (2) strong negative correlation, (3) weak correlation, and (4) no correlation. The correlation relationships are different for different proteins and are correlated with the features of their structural classes. The amino acids with a weak correlation relationship can be treated as the independent basis functions for the space where proteins are defined. The amino acids with large correlation coefficients are linearly correlated with each other and are not independent. The strong correlation among amino acids reflects their mutually constrained relationship, as exhibited by their relevant structural features. The information obtained through the correlation analysis is used for predicting protein structural classes, and a better prediction quality is obtained than with simple geometric distance methods that do not take the correlation effects into account.

16.
Whole-genome or multiple gene phylogenetic analysis is of interest since single gene analysis often results in poorly resolved trees. Here, the use of spectral techniques for analyzing multigene data sets is explored. The protein sequences are treated as categorical time series, and a measure of similarity between a pair of sequences, the spectral covariance, is based on the common periodicity between these two sequences. Unlike the other methods, the spectral covariance method focuses on the relationship between the sites of genetic sequences. By properly scaling the dissimilarity measures derived from different genes between a pair of species, we can use the mean of these scaled dissimilarity measures as a summary statistic to measure the taxonomic distances across multiple genes. The methods are applied to three different data sets, one noncontroversial and two with some dispute over the correct placement of the taxa in the tree. Trees are constructed using two distance-based methods, BIONJ and FITCH. A variation of block bootstrap sampling method is used for inference. The methods are able to recover all major clades in the corresponding reference trees with moderate to high bootstrap support. Through simulations, we show that the covariance-based methods effectively capture phylogenetic signal even when structural information is not fully retained. Comparisons of simulation results with the bootstrap permutation results indicate that the covariance-based methods are fairly robust under perturbations in sequence similarity but more sensitive to perturbations in structural similarity.

17.
Protein tyrosine binding (PTB) and ‘post synaptic density disc-large zo-1’ (PDZ) domains bind to short peptidic ligands by augmentation of one of the domain's β sheets and by other recognition mechanisms. The two domain classes have a superficial resemblance to each other, even though no sequence homology exists. The structural bases of the interactions are well understood for the domains whose structures have now been experimentally determined, and ligand-target pairs can probably be identified in favorable cases by analogy with the known domains. For both PTB and PDZ classes, functional activities are still not fully defined: it is possible that these domain classes, along with pleckstrin homology domains, have multiple roles.

18.
Zhao XM, Wang Y, Chen L, Aihara K. Proteins, 2008, 72(1): 461-473
Domains are structural and functional units of proteins and play an important role in functional genomics. Theoretically, the functions of a protein can be directly inferred if the biological functions of its component domains are determined. Despite the important role that domains play, only a small number of domains have been annotated so far, and few works have been performed to predict the functions of domains. Hence, it is necessary to develop automatic methods for predicting domain functions based on various available data. In this article, two new methods, that is, the threshold-based classification method and the support vector machines method, are proposed for protein domain function prediction by integrating heterogeneous information sources, including protein-domain mapping features, domain-domain interactions, and domain coexisting features. We show that the integration of heterogeneous information sources improves not only prediction accuracy but also annotation reliability when compared with the methods using only individual information sources.

19.
Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has a sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments. © 1994 Wiley-Liss, Inc.

20.
The Streptomyces lividans DnaA protein (73 kDa) consists, like other bacterial DnaA proteins, of four domains; it binds to 19 DnaA boxes in the complex oriC region. The S. lividans DnaA protein differs from others in that it contains an additional stretch of 120 predominantly acidic amino acids within domain II. Interactions between the DnaA protein and the two DnaA boxes derived from the promoter region of the S. lividans dnaA gene were analysed in vitro using three independent methods: DNase I footprinting experiments, mobility-shift assay and surface plasmon resonance (SPR). The DNase I footprinting analysis showed that the wild-type DnaA protein binds to both DnaA boxes. Thus, as in Escherichia coli and Bacillus subtilis, the S. lividans dnaA gene may be autoregulated. SPR analysis showed that the affinity of the DnaA protein for a DNA fragment containing both DnaA boxes from the dnaA promoter region (KD = 1.25 nM) is 10 times higher than its affinity for the single 'strong' DnaA box (KD = 12.0 nM). The mobility-shift assay suggests the presence of at least two classes of complex containing different numbers of bound DnaA molecules. The above data reveal that the DnaA protein binds to the two DnaA boxes in a cooperative manner. To deduce structural features of the Streptomyces domain II of DnaA protein, the DnaA amino acid sequences of three Streptomyces species were compared. However, according to the secondary structure prediction, Streptomyces domain II does not contain any common relevant secondary structural element(s). It can be assumed that domain II of DnaA protein acts as a flexible protein spacer between the N-terminal domain I and the highly conserved C-terminal part of DnaA protein containing ATP-binding domain III and DNA-binding domain IV.
