1.

A model for the evaluation of domain based classification of GPCR

Tannu Kumari Bhaskar Pant Kamalraj Raj Pardasani 《Bioinformation》2009,4(4):138-142

G-protein-coupled receptors (GPCRs) are the largest family of membrane-bound receptors and play a vital role in various biological processes; their amenability to drug intervention makes them a focus of the pharmaceutical industry. Experimental methods are both time-consuming and expensive, so there is a need for a computational approach to classification that can expedite the drug discovery process. In the present study, a domain-based classification model has been developed by employing and evaluating various machine learning approaches, including Bagging, J48, Bayes net, and Naive Bayes. Various software tools are available for predicting domains, but their results and accuracy vary for the same input, which makes it difficult to choose any one of them. To address this problem, a simulation model has been developed using five well-known domain prediction tools to identify the predicted result with maximum accuracy. The classifier is developed for classification up to three levels for class A. Accuracies of 98.59% with Naive Bayes at level I, 92.07% with J48 at level II, and 82.14% with Bagging at level III were achieved.

2.

Application of recurrence quantification analysis for the automated identification of epileptic EEG signals

Acharya UR Sree SV Chattopadhyay S Yu W Ang PC 《International journal of neural systems》2011,21(3):199-211

Epilepsy is a common neurological disorder that is characterized by the recurrence of seizures. Electroencephalogram (EEG) signals are widely used to diagnose seizures. Because of the non-linear and dynamic nature of the EEG signals, it is difficult to effectively decipher the subtle changes in these signals by visual inspection and by using linear techniques. Therefore, non-linear methods are being researched to analyze the EEG signals. In this work, we use the recorded EEG signals in Recurrence Plots (RP), and extract Recurrence Quantification Analysis (RQA) parameters from the RP in order to classify the EEG signals into normal, ictal, and interictal classes. A Recurrence Plot is a graph that shows all the times at which a state of the dynamical system recurs. Studies have reported significantly different RQA parameters for the three classes. However, more studies are needed to develop classifiers that use these promising features and present good classification accuracy in differentiating the three types of EEG segments. Therefore, in this work, we have used ten RQA parameters to quantify the important features in the EEG signals. These features were fed to seven different classifiers: Support Vector Machine (SVM), Gaussian Mixture Model (GMM), Fuzzy Sugeno Classifier, K-Nearest Neighbor (KNN), Naive Bayes Classifier (NBC), Decision Tree (DT), and Radial Basis Probabilistic Neural Network (RBPNN). Our results show that the SVM classifier was able to identify the EEG class with an average efficiency of 95.6%, and a sensitivity and specificity of 98.9% and 97.8%, respectively.
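
The recurrence plot described in this abstract can be illustrated with a minimal sketch. It assumes a scalar signal compared point by point with a fixed threshold `epsilon` (no time-delay embedding), and it computes only the recurrence rate, one of the simplest RQA measures; the abstract does not say which ten parameters the authors used, so the function names and the toy signal below are illustrative assumptions.

```python
from typing import List, Sequence


def recurrence_matrix(signal: Sequence[float], epsilon: float) -> List[List[int]]:
    """Return R where R[i][j] = 1 if |x_i - x_j| <= epsilon, else 0."""
    n = len(signal)
    return [
        [1 if abs(signal[i] - signal[j]) <= epsilon else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(matrix: List[List[int]]) -> float:
    """Fraction of recurrent points in the plot."""
    n = len(matrix)
    if n == 0:
        return 0.0
    return sum(sum(row) for row in matrix) / (n * n)


# Toy periodic signal (hypothetical, not EEG data).
toy_signal = [0.0, 1.0, 0.0, 1.0]
rp = recurrence_matrix(toy_signal, epsilon=0.1)
rr = recurrence_rate(rp)  # half of all point pairs recur for this signal
```

In practice the states are usually delay-embedded vectors rather than single samples, and the choice of threshold strongly affects every RQA parameter.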

3.

GPCR-MPredictor: multi-level prediction of G protein-coupled receptors using genetic ensemble

Naveed M Khan A Khan AU 《Amino acids》2012,42(5):1809-1823

G protein-coupled receptors (GPCRs) are transmembrane proteins, which transduce signals from extracellular ligands to intracellular G protein. Automatic classification of GPCRs can provide important information for the development of novel drugs in the pharmaceutical industry. In this paper, we propose an evolutionary approach, GPCR-MPredictor, which combines individual classifiers for predicting GPCRs. GPCR-MPredictor is a web predictor that can efficiently predict GPCRs at five levels. The first level determines whether a protein sequence is a GPCR or a non-GPCR. If the predicted sequence is a GPCR, then it is further classified into family, subfamily, sub-subfamily, and subtype levels. In this work, our aim is to analyze the discriminative power of different feature extraction and classification strategies for GPCR prediction and then to use an evolutionary ensemble approach for enhanced prediction performance. Features are extracted using amino acid composition, pseudo amino acid composition, and dipeptide composition of protein sequences. Different classification approaches, such as k-nearest neighbor (KNN), support vector machine (SVM), probabilistic neural networks (PNN), J48, Adaboost, and Naive Bayes, have been used to classify GPCRs. The proposed hierarchical GA-based ensemble classifier exploits the prediction results of SVM, KNN, PNN, and J48 at each level. The GA-based ensemble yields an accuracy of 99.75, 92.45, 87.80, 83.57, and 96.17% at the five levels, on the first dataset. We further perform predictions on a dataset consisting of 8,000 GPCRs at the family, subfamily, and sub-subfamily level, and on two other datasets of 365 and 167 GPCRs at the second and fourth levels, respectively. In comparison with the existing methods, the results demonstrate the effectiveness of our proposed GPCR-MPredictor in classifying GPCR families. It is accessible at .
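
Two of the feature types named in this abstract, amino acid composition and dipeptide composition, are simple enough to sketch. The code below is a generic illustration rather than the authors' implementation: the function names and the toy sequence are assumptions, and residues outside the 20 standard letters are counted in the sequence length but get no feature of their own.

```python
from collections import Counter
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Relative frequency of each standard residue (20 features)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def dipeptide_composition(sequence: str) -> Dict[str, float]:
    """Relative frequency of each adjacent residue pair (400 features)."""
    if len(sequence) < 2:
        raise ValueError("sequence needs at least two residues")
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {a + b: counts[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


# Hypothetical toy sequence, far shorter than any real GPCR.
toy_sequence = "ACCA"
aac = amino_acid_composition(toy_sequence)
dpc = dipeptide_composition(toy_sequence)
```

Both vectors have a fixed length regardless of sequence length, which is what lets classifiers such as KNN or SVM consume proteins of different sizes.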

4.

Functional prediction of unidentified lipids using supervised classifiers

Laxman Yetukuri Jarkko Tikka Jaakko Hollmén Matej Orešič 《Metabolomics : Official journal of the Metabolomic Society》2010,6(1):18-26

Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. In order to avoid bias in data interpretation, it would be advantageous for the data analysis to include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret the changes related to unidentified peaks. In this paper, we address the challenge by predicting the class membership of unknown peaks by applying and comparing multiple supervised classifiers to selected lipidomics datasets. The employed classifiers include k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares and discriminant analysis (PLS-DA) and Naive Bayes methods, which are known to be effective and efficient in predicting the labels for unseen data. Here, the class label predictions are sought for unidentified lipid profiles coming from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that k-NN and SVM classifiers outperform both PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs worst among all models, and this observation seems logical as lipids are highly co-regulated and do not respect the Naive Bayes assumption that features are conditionally independent given the class. Common label predictions from k-NN and SVM can serve as a good starting point to explore the full data and thereby facilitate exploratory studies where label information is critical for the data interpretation.
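
The conditional-independence assumption mentioned in this abstract can be written out explicitly. In the expression below, c denotes a class label and x_1, ..., x_n the features (for illustration, the measured intensities of a lipid profile); this notation is an assumption of the note, not taken from the paper.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

When features are strongly co-regulated, the product treats correlated measurements as independent evidence, which is consistent with the weaker Naive Bayes results the authors report.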

5.

A Bayesian network classification methodology for gene expression data.

Paul Helman Robert Veroff Susan R Atlas Cheryl Willman 《Journal of computational biology》2004,11(4):581-615

We present new techniques for the application of a Bayesian network learning framework to the problem of classifying gene expression data. The focus on classification permits us to develop techniques that address in several ways the complexities of learning Bayesian nets. Our classification model reduces the Bayesian network learning problem to the problem of learning multiple subnetworks, each consisting of a class label node and its set of parent genes. We argue that this classification model is more appropriate for the gene expression domain than are other structurally similar Bayesian network classification models, such as Naive Bayes and Tree Augmented Naive Bayes (TAN), because our model is consistent with prior domain experience suggesting that a relatively small number of genes, taken in different combinations, is required to predict most clinical classes of interest. Within this framework, we consider two different approaches to identifying parent sets which are supported by the gene expression observations and any other currently available evidence. One approach employs a simple greedy algorithm to search the universe of all genes; the second approach develops and applies a gene selection algorithm whose results are incorporated as a prior to enable an exhaustive search for parent sets over a restricted universe of genes. Two other significant contributions are the construction of classifiers from multiple, competing Bayesian network hypotheses and algorithmic methods for normalizing and binning gene expression data in the absence of prior expert knowledge. Our classifiers are developed under a cross validation regimen and then validated on corresponding out-of-sample test sets. The classifiers attain a classification rate in excess of 90% on out-of-sample test sets for two publicly available datasets. We present an extensive compilation of results reported in the literature for other classification methods run against these same two datasets. 
Our results are comparable to, or better than, any we have found reported for these two sets, when a train-test protocol as stringent as ours is followed.

6.

Prediction of protein homo-oligomer types by pseudo amino acid composition: Approached with an improved feature extraction and Naive Bayes Feature Fusion

Zhang SW Pan Q Zhang HC Shao ZC Shi JY 《Amino acids》2006,30(4):461-468

Summary. The interaction of non-covalently bound monomeric protein subunits forms oligomers. The oligomeric proteins are superior to the monomers within the scope of functional evolution of biomacromolecules. Such complexes are involved in various biological processes, and play an important role. It is highly desirable to predict oligomer types automatically from their sequence. Here, based on the concept of pseudo amino acid composition, an improved feature extraction method of weighted auto-correlation function of amino acid residue index and a Naive Bayes multi-feature fusion algorithm are proposed and applied to predict protein homo-oligomer types. We used the support vector machine (SVM) as the base classifier, in order to obtain better results. For example, the total accuracies of the A, B, C, D and E sets based on this improved feature extraction method are 77.63, 77.16, 76.46, 76.70 and 75.06% respectively in the jackknife test, which are 6.39, 5.92, 5.22, 5.46 and 3.82% higher than that of the G set based on the conventional amino acid composition method with the same SVM. Compared with Chou’s feature extraction method of incorporating the quasi-sequence-order effect, our method can increase the total accuracy at a level of 3.51 to 1.01%. The total accuracy improves from 79.66 to 80.83% by using the Naive Bayes Feature Fusion algorithm. These results show: 1) The improved feature extraction method is effective and feasible, and the feature vectors based on this method may contain more protein quaternary structure information and appear to capture essential information about the composition and hydrophobicity of residues in the surface patches that are buried in the interfaces of associated subunits; 2) The Naive Bayes Feature Fusion algorithm and SVM can be regarded as a powerful computational tool for predicting protein homo-oligomer types.

7.

Certain Investigations on Melanoma Detection Using Non-Subsampled Bendlet Transform with Different Classifiers

S. Poovizhi T. R. Ganesh Babu R. Praveena 《Molecular & cellular biomechanics : MCB》2021,18(4):201-219

Skin is the largest organ and the outer enclosure of the integumentary system that protects the human body from pathogens. Among the various cancers in the world, skin cancer is one of the most commonly diagnosed, and it can be either melanoma or non-melanoma. Melanoma is far more often fatal than non-melanoma skin cancer, but the chances of survival are high when it is diagnosed and treated early. The main aim of this work is to analyze and investigate the performance of the Non-Subsampled Bendlet Transform (NSBT) with various classifiers for detecting melanoma from dermoscopic images. NSBT is a multiscale and multidirectional transform based on a second-order shearlet system which classifies curvature more precisely than other directional representation systems. Here, two-phase classification is employed using k-Nearest Neighbour (kNN), Naive Bayes (NB), Decision Trees (DT) and Support Vector Machines (SVM). The first phase classifies the images of the PH2 database into normal and abnormal images, and the second phase classifies the abnormal images into benign and malignant. Experimental results show improved classification accuracy, sensitivity and specificity compared with state-of-the-art methods.

8.

Non-linear models based on simple topological indices to identify RNase III protein members

Agüero-Chapin G de la Riva GA Molina-Ruiz R Sánchez-Rodríguez A Pérez-Machado G Vasconcelos V Antunes A 《Journal of theoretical biology》2011,273(1):167-178

Alignment-free classifiers are especially useful in the functional classification of protein classes with variable homology and different domain structures. Thus, the Topological Indices to BioPolymers (TI2BioP) methodology (Agüero-Chapin et al., 2010), inspired by both the TOPS-MODE and the MARCH-INSIDE methodologies, allows the calculation of simple topological indices (TIs) as alignment-free classifiers. These indices were derived from the clustering of the amino acids into four classes of hydrophobicity and polarity, revealing higher sequence-order information beyond the amino acid composition level. The predictive power of such TIs was evaluated for the first time on the RNase III family, due to the high diversity of its members (primary sequence and domain organization). Three non-linear models were developed for RNase III class prediction: Decision Tree Model (DTM), Artificial Neural Networks (ANN)-model and Hidden Markov Model (HMM). The first two are alignment-free approaches, using TIs as input predictors. Their performances were compared with a non-classical HMM, modified according to our amino acid clustering strategy. The alignment-free models showed similar performances on the training and the test sets, reaching values above 90% in the overall classification. The non-classical HMM showed the highest classification rate, with values above 95% in training and 100% in test. Despite the higher accuracy of the HMM, the DTM offered a simple RNase III classification at low computational cost. This simplicity was evaluated with respect to the HMM and ANN models for the functional annotation of a new bacterial RNase III class member, isolated and annotated by our group.

9.

Automated diagnosis of epileptic EEG using entropies

U. Rajendra Acharya Filippo Molinari S. Vinitha Sree Subhagata Chattopadhyay Kwan-Hoong Ng Jasjit S. Suri 《Biomedical signal processing and control》2012,7(4):401-408

Epilepsy is a neurological disorder characterized by the presence of recurring seizures. Like many other neurological disorders, epilepsy can be assessed by the electroencephalogram (EEG). The EEG signal is highly non-linear and non-stationary, and hence, it is difficult to characterize and interpret it. However, it is a well-established clinical technique with low associated costs. In this work, we propose a methodology for the automatic detection of normal, pre-ictal, and ictal conditions from recorded EEG signals. Four entropy features namely Approximate Entropy (ApEn), Sample Entropy (SampEn), Phase Entropy 1 (S1), and Phase Entropy 2 (S2) were extracted from the collected EEG signals. These features were fed to seven different classifiers: Fuzzy Sugeno Classifier (FSC), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Probabilistic Neural Network (PNN), Decision Tree (DT), Gaussian Mixture Model (GMM), and Naive Bayes Classifier (NBC). Our results show that the Fuzzy classifier was able to differentiate the three classes with a high accuracy of 98.1%. Overall, compared to previous techniques, our proposed strategy is more suitable for diagnosis of epilepsy with higher accuracy.

10.

Enzyme family classification by support vector machines

Cai CZ Han LY Ji ZL Chen YZ 《Proteins》2004,55(1):66-76

One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, Support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
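
The Matthews correlation coefficient reported in this abstract can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name and the counts in the example are hypothetical and are not taken from the paper, which reports the coefficient as a percentage.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier, in [-1, 1]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 45 true positives, 45 true negatives, 5 of each error.
mcc = matthews_corrcoef(tp=45, tn=45, fp=5, fn=5)
mcc_percent = 100.0 * mcc  # percentage form, as used in the abstract
```

Unlike plain accuracy, the coefficient stays informative when one class (for example, non-members of an enzyme family) greatly outnumbers the other.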

11.

Automated protein subfamily identification and classification

Brown DP Krishnamurthy N Sjölander K 《PLoS computational biology》2007,3(8):e160

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.

12.

An eco-informatics tool for microbial community studies: supervised classification of Amplicon Length Heterogeneity (ALH) profiles of 16S rRNA

Yang C Mills D Mathee K Wang Y Jayachandran K Sikaroodi M Gillevet P Entry J Narasimhan G 《Journal of microbiological methods》2006,65(1):49-62

Support vector machines (SVM) and K-nearest neighbors (KNN) are two computational machine learning tools that perform supervised classification. This paper presents a novel application of such supervised analytical tools for microbial community profiling and to distinguish patterning among ecosystems. Amplicon length heterogeneity (ALH) profiles from several hypervariable regions of 16S rRNA gene of eubacterial communities from Idaho agricultural soil samples and from Chesapeake Bay marsh sediments were separately analyzed. The profiles from all available hypervariable regions were concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. Each profile was labeled with information about the location or time of its sampling. We hypothesized that after a learning phase using feature vectors from labeled ALH profiles, both these classifiers would have the capacity to predict the labels of previously unseen samples. The resulting classifiers were able to predict the labels of the Idaho soil samples with high accuracy. The classifiers were less accurate for the classification of the Chesapeake Bay sediments suggesting greater similarity within the Bay's microbial community patterns in the sampled sites. The profiles obtained from the V1+V2 region were more informative than that obtained from any other single region. However, combining them with profiles from the V1 region (with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition of profiles from the V9 region appeared to confound the classifiers. Our results show that SVM and KNN classifiers can be effectively applied to distinguish between eubacterial community patterns from different ecosystems based only on their ALH profiles.

13.

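A classification and prediction model for HBV reactivation based on genetic-algorithm feature selection

Wu Guanpeng Liu Yihui Wang Shuai Huang Wei Liu Tonghai Yin Yong 《Chinese Journal of Bioinformatics (生物信息学)》2016,14(4):243-248

This study investigates the risk features of, and a classification prediction model for, hepatitis B virus (HBV) reactivation in patients with primary liver cancer after precise radiotherapy. A feature selection method based on a genetic algorithm is proposed to select the optimal feature subset for HBV reactivation from the initial feature set of the primary liver cancer data. Bayesian and support vector machine classification models for predicting HBV reactivation are built, and their classification performance is evaluated on the optimal feature subset and on the initial feature set. The experimental results show that genetic-algorithm-based feature selection improves HBV reactivation classification performance, with the optimal feature subset clearly outperforming the initial feature set. The optimal feature subset affecting HBV reactivation comprises HBV DNA level, TNM tumor stage, Child-Pugh, outer margin, and maximum dose to the whole liver. The classification accuracy reaches up to 82.89% for the Bayesian classifier and up to 83.34% for the support vector machine.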

14.

Improving cancer classification accuracy using gene pairs

Chopra P Lee J Kang J Lee S 《PloS one》2010,5(12):e14305

Recent studies suggest that the deregulation of pathways, rather than individual genes, may be critical in triggering carcinogenesis. The pathway deregulation is often caused by the simultaneous deregulation of more than one gene in the pathway. This suggests that robust gene pair combinations may exploit the underlying bio-molecular reactions that are relevant to the pathway deregulation and thus they could provide better biomarkers for cancer, as compared to individual genes. In order to validate this hypothesis, in this paper, we used gene pair combinations, called doublets, as input to the cancer classification algorithms, instead of the original expression values, and we showed that the classification accuracy was consistently improved across different datasets and classification algorithms. We validated the proposed approach using nine cancer datasets and five classification algorithms including Prediction Analysis for Microarrays (PAM), C4.5 Decision Trees (DT), Naive Bayesian (NB), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN).

15.

Substring selection for biomedical document classification

Han B Obradovic Z Hu ZZ Wu CH Vucetic S 《Bioinformatics (Oxford, England)》2006,22(17):2136-2142

MOTIVATION: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Owing to high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes. RESULTS: The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [with area under the ROC curve (AUC) accuracy in range 0.92-0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in 0.86-0.93 range). The proposed approach is particularly useful when labeled datasets are small.

16.

Protein solubility: sequence based prediction and experimental verification

Smialowski P Martin-Galiano AJ Mikolajka A Girschick T Holak TA Frishman D 《Bioinformatics (Oxford, England)》2007,23(19):2536-2542

MOTIVATION: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects. RESULTS: We present a machine-learning approach called PROSO to assess the chance of a protein to be soluble upon heterologous expression in Escherichia coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine (SVM) classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient (MCC) of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75 and 68% for positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.

17.

Classifying DNA repair genes by kernel-based support vector machines

Jiang H Ching WK 《Bioinformation》2011,7(5):257-263

18.

Length analyses of mammalian G-protein-coupled receptors.

J M Otaki S Firestein 《Journal of theoretical biology》2001,211(2):77-100

G-protein-coupled receptors (GPCRs) play a crucial role in mediating effects of extracellular messengers in a wide variety of biological systems, comprising the largest gene superfamily at least in mammals. Mammalian GPCRs are broadly classified into three families based on pharmacological properties and sequence similarities. These sequence similarities are largely confined to the seven transmembrane domains, and much less in the extracellular and intracellular loops and terminals (LTs). Together with the fact that the LTs vary considerably in length and sequence, the LT length of GPCRs has not been studied systematically. Here we have applied a statistical analysis to the length of the LTs of a wide variety of mammalian GPCRs in order to examine the existence of any trends in molecular architecture among a known mammalian GPCR population. Tree diagrams constructed by cluster analyses, using eight length factors in a given GPCR, revealed possible length relations among GPCRs and defined at least three groups. Most samples in Group J (joined) and Group M (minor) had an exceptionally long N-terminal and I3 loop, respectively; and other samples were considered as Group O (other/original). This length-based classification largely coincided with the conventional sequence- and pharmacology-based classification, suggesting that the LT length contains some biological information when analysed at the population level. Principle component analyses suggested the existence of inherent length differences between loops and terminals as well as between extracellular and intracellular LTs. Wilcoxon rank transformation tests unveiled statistically significant differences between Group O and Group J, not only in the N-terminal and I3 loop, but also in the E3 loop. Correlation analyses identified an E1-I2 length-correlation in Group O and Group J and an N-E3 length-correlation in Group J. 
Taken together, these results suggest a possible functional importance of LT length in the GPCR superfamily.

19.

Decision tree based information integration for automated protein classification

Camoğlu O Can T Singh AK Wang YF 《Journal of bioinformatics and computational biology》2005,3(3):717-742

We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. We achieve accurate classification by combining the decisions of multiple methods using the consensus of a committee (or an ensemble) classifier. Our technique, based on decision trees, is rooted in machine learning which shows that by judiciously employing component classifiers, an ensemble classifier can be constructed to outperform its components. We use two sequence- and three structure-comparison tools as component classifiers. Given a protein structure and using the joint hypothesis, we first determine if the protein belongs to an existing category (family, superfamily, fold) in the SCOP hierarchy. For the proteins that are predicted as members of the existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that we can significantly improve the classification accuracy compared to the individual component classifiers. In particular, we achieve error rates that are 3-12 times less than the individual classifiers' error rates at the family level, 1.5-4.5 times less at the superfamily level, and 1.1-2.4 times less at the fold level.

20.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction

Qi Y Bar-Joseph Z Klein-Seetharaman J 《Proteins》2006,63(3):490-500

Protein–protein interactions play a key role in many biological systems. High‐throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false‐positive and false‐negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co‐complex relationship, and (3) pathway co‐membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity‐based k‐Nearest‐Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co‐complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top‐ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. 
Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast‐2‐hybrid system were not among the top‐ranking features under any condition. Proteins 2006. © 2006 Wiley‐Liss, Inc.
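
The Gini index that this abstract uses to estimate feature importance can be sketched as follows. The code shows the impurity of a set of labels and the decrease in impurity produced by one binary split; a Random Forest typically accumulates such decreases per feature over all trees. The function names and toy labels are assumptions for illustration, not the authors' code.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_decrease(parent: Sequence[Hashable],
                  left: Sequence[Hashable],
                  right: Sequence[Hashable]) -> float:
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical labels: 1 = interacting pair, 0 = non-interacting pair.
parent_labels = [1, 1, 0, 0]
decrease = gini_decrease(parent_labels, left=[1, 1], right=[0, 0])
```

A feature whose splits repeatedly produce large decreases, as this perfect toy split does, would rank high under this measure.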