Similar literature
20 similar articles found.
1.
Huang WL, Tung CW, Huang HL, Hwang SF, Ho SY. Bio Systems 2007, 90(2):573-581
Accurate prediction methods of protein subnuclear localizations rely on the cooperation between informative features and classifier design. Support vector machine (SVM) based learning methods are shown effective for predictions of protein subcellular and subnuclear localizations. This study proposes an evolutionary support vector machine (ESVM) based classifier with automatic selection from a large set of physicochemical composition (PCC) features to design an accurate system for predicting protein subnuclear localization, named ProLoc. ESVM using an inheritable genetic algorithm combined with SVM can automatically determine the best number m of PCC features and identify m out of 526 PCC features simultaneously. To evaluate ESVM, this study uses two datasets SNL6 and SNL9, which have 504 proteins localized in 6 subnuclear compartments and 370 proteins localized in 9 subnuclear compartments. Using a leave-one-out cross-validation, ProLoc utilizing the selected m=33 and 28 PCC features has accuracies of 56.37% for SNL6 and 72.82% for SNL9, which are better than 51.4% for the SVM-based system using k-peptide composition features applied on SNL6, and 64.32% for an optimized evidence-theoretic k-nearest neighbor classifier utilizing pseudo amino acid composition applied on SNL9, respectively.

2.
The identification of catalytic residues is an essential step in functional characterization of enzymes. We present a purely structural approach to this problem, which is motivated by the difficulty of evolution-based methods to annotate structural genomics targets that have few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z scoring. Special attention is paid to the class imbalance problem that stems from the overwhelming number of non-catalytic residues in enzymes compared to catalytic residues. This problem is tackled by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non-catalytic residues; (2) under-sampling non-catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more than errors in learning non-catalytic residues. Tested on four enzyme datasets, one specifically designed by us to mimic the structural genomics scenario and three previously evaluated datasets, our structure-based classifier is never inferior to similar structure-based classifiers and comparable to classifiers that use both structural and evolutionary features. In addition to the evaluation of the performance of catalytic residue identification, we also present detailed case studies on three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, our own-designed database, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.

3.
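Ubiquitination is a post-translational modification that currently attracts wide attention and plays an important regulatory role in many cellular processes, such as protein degradation and DNA repair. Drawing on domestic and international research on the prediction of protein ubiquitination sites, this paper analyzes the feature attributes used to predict ubiquitination sites, summarizes the feature selection methods used to optimize these features, and gives an overview of the machine learning classifiers used in the prediction process.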

4.
IRBM 2014, 35(5):244-254
Objective: The overall goal of the study is to detect coronary artery lesions regardless of their nature, calcified or hypo-dense. To avoid explicit modelling of heterogeneous lesions, we adopted an approach based on machine learning and using unsupervised or semi-supervised classifiers. The success of the classifiers based on machine learning strongly depends on the appropriate choice of features differentiating between lesions and regular appearance. The specific goal of this article is to propose a novel strategy devised to select the best feature set for the classifiers used, out of a given set of candidate features. Materials and methods: The features are calculated in image planes orthogonal to the artery centerline, and the classifier assigns to each of these cross-sections a label “healthy” or “diseased”. The contribution of this article is a feature-selection strategy based on the empirical risk function that is used as a criterion in the initial feature ranking and in the selection process itself. We have assessed this strategy in association with two classifiers based on the density-level detection approach that seeks outliers from the distribution corresponding to the regular appearance. The method was evaluated using a total of 13,687 cross-sections extracted from 53 coronary arteries in 15 patients. Results: Using the feature subset selected by the risk-based strategy, balanced error rates achieved by the unsupervised and semi-supervised classifiers respectively were equal to 13.5% and 15.4%. These results were substantially better than the rates achieved using feature subsets selected by supervised strategies. The unsupervised and semi-supervised methods also outperformed supervised classifiers using feature subsets selected by the corresponding supervised strategies. Discussion: Supervised methods require large data sets annotated by experts, both to select the features and to train the classifiers, and collecting these annotations is time-consuming. With these methods, lesions whose appearance differs from the training data may remain undetected. The lesion-detection problem is highly imbalanced, since healthy cross-sections usually are much more numerous than the diseased ones. Training the classifiers based on the density-level detection approach needs a small number of annotations or no annotations at all. The same annotations are sufficient to compute the empirical risk and to perform the selection. Therefore, our strategy associated with an unsupervised or semi-supervised classifier requires a considerably smaller number of annotations as compared to conventional supervised selection strategies. The approach proposed is also better suited for highly imbalanced problems and can detect lesions differing from the training set. Conclusion: The risk-based selection strategy, associated with classifiers using the density-level detection approach, outperformed other strategies and classifiers when used to detect coronary artery lesions. It is well suited for highly imbalanced problems, where the lesions are represented as low-density regions of the feature space, and it can be used in other anomaly detection problems interpretable as a binary classification problem where the empirical risk can be calculated.
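
The balanced error rate quoted in this abstract averages the error rates of the two classes, so the far more numerous healthy cross-sections cannot dominate the score. The sketch below assumes the usual definition of the balanced error rate (the abstract does not spell out its formula); the function name and the confusion-matrix counts are made up for illustration and are not the study's data.

```python
def balanced_error_rate(tp, fn, tn, fp):
    """Mean of the per-class error rates (assumed standard definition).

    tp/fn: diseased cross-sections labelled correctly / missed.
    tn/fp: healthy cross-sections labelled correctly / flagged by mistake.
    """
    miss_rate = fn / (tp + fn)          # error rate on the diseased class
    false_alarm_rate = fp / (tn + fp)   # error rate on the healthy class
    return 0.5 * (miss_rate + false_alarm_rate)


# Hypothetical counts for an imbalanced problem: 10 diseased, 100 healthy.
ber = balanced_error_rate(tp=8, fn=2, tn=90, fp=10)
plain_error = (2 + 10) / 110
print(f"balanced error rate: {ber:.3f}, plain error rate: {plain_error:.3f}")
```

With these counts the plain error rate looks lower than the balanced one, because it is dominated by the healthy class.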

5.
MOTIVATION: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. RESULTS: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT: alexander.statnikov@vanderbilt.edu.

6.
MicroRNAs are key regulators of eukaryotic gene expression whose fundamental role has already been identified in many cell pathways. The correct identification of miRNA targets is still a major challenge in bioinformatics and has motivated the development of several computational methods to overcome inherent limitations of experimental analysis. Indeed, the best results reported so far in terms of specificity and sensitivity are associated with machine learning-based methods for microRNA-target prediction. Following this trend, in the current paper we discuss and explore a microRNA-target prediction method based on a random forest classifier, namely RFMirTarget. Despite their well-known robustness in general classification tasks, to the best of our knowledge, random forests have not been deeply explored in the specific context of predicting microRNA targets. Our framework first analyzes alignments between candidate microRNA-target pairs and extracts a set of structural, thermodynamic, alignment, seed and position-based features, upon which classification is performed. Experiments have shown that RFMirTarget outperforms several well-known classifiers with statistical significance, and that its performance is not impaired by the class imbalance problem or by feature correlation. Moreover, comparing it against other algorithms for microRNA target prediction using independent test data sets from TarBase and starBase, we observe a very promising performance, with higher sensitivity than other methods. Finally, tests performed with RFMirTarget show the benefits of feature selection even for a classifier with embedded feature importance analysis, and the consistency between the relevant features identified and important biological properties for effective microRNA-target gene alignment.

7.
When the standard approach to predict protein function by sequence homology fails, other alternative methods can be used that require only the amino acid sequence for predicting function. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, there are two issues to consider before successful functional prediction can take place: identifying discriminatory features, and overcoming the challenge of a large imbalance in the training data. We show that by applying feature subset selection followed by undersampling of the majority class, significantly better support vector machine (SVM) classifiers are generated compared with standard machine learning approaches. As well as revealing that the features selected could have the potential to advance our understanding of the relationship between sequence and function, we also show that undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
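
As a rough illustration of "full undersampling", the Python sketch below randomly discards samples from the larger classes until every class is as large as the smallest one. The abstract does not say how samples were chosen for removal, so uniform random selection is an assumption here, and `undersample_majority` is a hypothetical helper name rather than anything from the paper.

```python
import random


def undersample_majority(samples, labels, seed=0):
    """Return a class-balanced random subset of (samples, labels)."""
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy data: 2 positive and 10 negative samples (identifiers only).
X, y = undersample_majority(list(range(12)), [1, 1] + [0] * 10)
print(sorted(y))
```

In a cross-validated experiment the undersampling would normally be applied to the training folds only, so that the test folds keep the original class ratio.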

8.
This study investigated whether infrared spectroscopy combined with a deep learning algorithm could be a useful tool for determining causes of death by analyzing pulmonary edema fluid from forensic autopsies. A newly designed convolutional neural network‐based deep learning framework, named DeepIR, and eight popular machine learning algorithms were used to construct classifiers. The prediction performances of these classifiers demonstrated that DeepIR outperformed the machine learning algorithms in establishing classifiers to determine the causes of death. Moreover, DeepIR was generally less dependent on preprocessing procedures than were the machine learning algorithms; its validation accuracy stayed within a narrow range from 0.9661 to 0.9856, and its test accuracy ranged from 0.8774 to 0.9167, on the raw pulmonary edema fluid spectral dataset and the nine preprocessing protocol‐based datasets in our study. In conclusion, this study demonstrates that the deep learning‐equipped Fourier transform infrared spectroscopy technique has the potential to be an effective aid for determining causes of death.

9.
SUMMARY: Several papers have been published where nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and extract specificity rules. We show that the dataset used in these studies is linearly separable and that it is a misuse of nonlinear classifiers to apply them to this problem. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease. MOTIVATION: Understanding HIV-1 protease specificity is important when designing HIV inhibitors and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used. RESULTS: We show that the problem is much easier than what has previously been reported and that linear classifiers like the simple perceptron or linear support vector machines are at least as good predictors as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers. AVAILABILITY: The datasets used are available at http://www.hh.se/staff/bioinf/
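
The link between linear separability and the simple perceptron can be sketched in a few lines of Python: on linearly separable data the perceptron eventually completes a pass over the training set without a single mistake, and the learned weights can then be read directly as rules. The toy data below are an illustrative assumption (logical AND versus XOR), not the HIV-1 protease dataset, and the function names are hypothetical.

```python
def train_perceptron(X, y, epochs=100):
    """Classic perceptron; labels must be -1 or +1.

    Returns (weights, bias, converged). `converged` is True if one full
    pass over the data produced no misclassification.
    """
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(X, y):
            activation = sum(w * f for w, f in zip(weights, features)) + bias
            if label * activation <= 0:  # wrong side of (or on) the boundary
                weights = [w + label * f for w, f in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            return weights, bias, True
    return weights, bias, False


def predict(weights, bias, features):
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
and_labels = [-1, -1, -1, 1]   # linearly separable
xor_labels = [-1, 1, 1, -1]    # not linearly separable

w, b, and_converged = train_perceptron(inputs, and_labels)
_, _, xor_converged = train_perceptron(inputs, xor_labels)
print("AND separable:", and_converged, "weights:", w, "bias:", b)
print("XOR separable within 100 epochs:", xor_converged)
```

A failure to converge within a fixed number of epochs does not prove that data are inseparable, so a check like this is only evidence in the positive direction.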

10.
PCP: a program for supervised classification of gene expression profiles (cited by 1; self-citations: 0; citations by others: 1)
PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns (vectors of measurements). The principal use of PCP in bioinformatics is design and evaluation of classifiers for use in clinical diagnostic tests based on measurements of gene expression. PCP implements leading pattern classification and gene selection algorithms and incorporates cross-validation estimation of classifier performance. Importantly, the implementation integrates gene selection and class prediction stages, which is vital for computing reliable performance estimates in small-sample scenarios. Additionally, the program includes automated and efficient model selection (optimization of parameters) for the support vector machine (SVM) classifier. The distribution includes Linux and Windows/Cygwin binaries. The program can easily be ported to other platforms. AVAILABILITY: Free download at http://pcp.sourceforge.net

11.
Protein–protein interactions play a key role in many biological systems. High‐throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false‐positive and false‐negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co‐complex relationship, and (3) pathway co‐membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity‐based k‐Nearest‐Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co‐complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top‐ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. 
Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast‐2‐hybrid system were not among the top‐ranking features under any condition. Proteins 2006. © 2006 Wiley‐Liss, Inc.
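
The Gini index mentioned in this abstract measures how mixed the class labels are at a tree node, and a feature that produces splits with a large drop in Gini impurity is considered important. The Python sketch below shows the textbook impurity and impurity-decrease calculations for a single split; how the authors aggregated these values over a whole forest is not described here, so this is an illustration of the quantity only, with hypothetical function names and labels.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


# Hypothetical node with two interacting (1) and two non-interacting (0) pairs.
node = [0, 0, 1, 1]
print(gini_decrease(node, [0, 0], [1, 1]))  # informative split
print(gini_decrease(node, [0, 1], [0, 1]))  # uninformative split
```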

12.
Identification and characterization of antigenic determinants on proteins has received considerable attention, using both experimental and computational methods. Computational routines have mostly used structural and physicochemical parameters to predict the antigenic propensity of protein sites. However, the performance of computational routines has been low when compared to experimental alternatives. Here we describe the construction of machine learning based classifiers to enhance the prediction quality for identifying linear B-cell epitopes on proteins. Our approach combines several parameters previously associated with antigenicity, and includes novel parameters based on frequencies of amino acids and amino acid neighborhood propensities. We utilized machine learning algorithms for deriving antigenicity classification functions assigning antigenic propensities to each amino acid of a given protein sequence. We compared the prediction quality of the novel classifiers with that of established routines for epitope scoring, and tested prediction accuracy on experimental data available for HIV proteins. The major finding is that machine learning classifiers clearly outperform the reference classification systems on the HIV epitope validation set.

13.
MOTIVATION: Promoter prediction is important for the analysis of gene regulation. Although a number of promoter prediction algorithms have been reported in the literature, significant improvement in prediction accuracy remains a challenge. In this paper, an effective promoter identification algorithm, which is called PromoterExplorer, is proposed. In our approach, we analyze the different roles of various features, that is, local distribution of pentamers, positional CpG island features and digitized DNA sequence, and then combine them to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most 'informative' or 'discriminating' features to build a sequence of weak classifiers, which are combined to form a strong classifier so as to achieve a better performance. The cascade structure used for identification can also reduce the number of false positives. RESULTS: PromoterExplorer is tested based on large-scale DNA sequences from different databases, including the EPD, DBTSS, GenBank and human chromosome 22. Experimental results show that consistent and promising performance can be achieved.

14.
This paper presents a novel application of particle swarm optimization (PSO) in combination with another computational intelligence (CI) technique, namely, proximal support vector machine (PSVM) for machinery fault detection. Both real-valued and binary PSO algorithms have been considered along with linear and nonlinear versions of PSVM. The time domain vibration signals of a rotating machine with normal and defective bearings are processed for feature extraction. The features extracted from original and preprocessed signals are used as inputs to the classifiers (PSVM) for detection of machine condition. Input features are selected using a PSO algorithm. The classifiers are trained with a subset of experimental data for known machine conditions and are tested using the remaining data. The procedure is illustrated using the experimental vibration data of a rotating machine. The influences of the number of features, PSO algorithms and type of classifiers (linear or nonlinear PSVM) on the detection success are investigated. Results are compared with a genetic algorithm (GA) and principal component analysis (PCA). The PSO based approach achieved a test classification success rate above 90%, which was comparable with the GA and much better than PCA. The results show the effectiveness of the selected features and classifiers in detection of machine condition.

15.
Meissner M, Koch O, Klebe G, Schneider G. Proteins 2009, 74(2):344-352
We present machine learning approaches for turn prediction from the amino acid sequence. Different turn classes and types were considered based on a novel turn classification scheme. We trained an unsupervised (self-organizing map) and two kernel-based classifiers, namely the support vector machine and a probabilistic neural network. Turn versus non-turn classification was carried out for turn families containing intramolecular hydrogen bonds and three to six residues. Support vector machine classifiers yielded a Matthews correlation coefficient (mcc) of approximately 0.6 and a prediction accuracy of 80%. Probabilistic neural networks were developed for beta-turn type prediction. The method was able to distinguish between five types of beta-turns yielding mcc > 0.5 and at least 80% overall accuracy. We conclude that the proposed new turn classification is distinct and well-defined, and machine learning classifiers are suited for sequence-based turn prediction. Their potential for sequence-based prediction of turn structures is discussed.
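
For readers unfamiliar with the Matthews correlation coefficient, the Python sketch below computes it from the four cells of a binary confusion matrix using the standard formula. The counts in the example are invented so that the numbers come out round; they are not taken from the turn-prediction study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical balanced test set of 100 residues (turn vs. non-turn).
tp, tn, fp, fn = 40, 40, 10, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("accuracy:", accuracy, "mcc:", matthews_corrcoef(tp, tn, fp, fn))
```

Unlike accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which is why it is often reported alongside accuracy on imbalanced data.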

16.
Plant‐pathogenic fungi secrete effector proteins to facilitate infection. We describe extensive improvements to EffectorP, the first machine learning classifier for fungal effector prediction. EffectorP 2.0 is now trained on a larger set of effectors and utilizes a different approach based on an ensemble of classifiers trained on different subsets of negative data, offering different views on classification. EffectorP 2.0 achieves an accuracy of 89%, compared with 82% for EffectorP 1.0 and 59.8% for a small size classifier. Important features for effector prediction appear to be protein size, protein net charge as well as the amino acids serine and cysteine. EffectorP 2.0 decreases the number of predicted effectors in secretomes of fungal plant symbionts and saprophytes by 40% when compared with EffectorP 1.0. However, EffectorP 1.0 retains value, and combining EffectorP 1.0 and 2.0 results in a stringent classifier with a low false positive rate of 9%. EffectorP 2.0 predicts significant enrichments of effectors in 12 of 13 sets of infection‐induced proteins from diverse fungal pathogens, whereas a small cysteine‐rich classifier detects enrichment in only seven of 13. EffectorP 2.0 will fast track the prioritization of high‐confidence effector candidates for functional validation and aid in improving our understanding of effector biology. EffectorP 2.0 is available at http://effectorp.csiro.au.

17.
N7-methylguanosine (m7G) is an essential, ubiquitous, and positively charged modification at the 5′ cap of eukaryotic mRNA, modulating its export, translation, and splicing processes. Although several machine learning (ML)-based computational predictors for m7G have been developed, all of them utilized a specific computational framework. In this study, we explored four different computational frameworks for the first time and identified the best approach. Based on that, we developed a novel predictor, THRONE (a three-layer ensemble predictor for identifying human RNA N7-methylguanosine sites), to accurately identify m7G sites from the human genome. THRONE employs a wide range of sequence-based features that are fed into several ML classifiers and combines these models through ensemble learning. The three-step ensemble learning is as follows: 54 baseline models were constructed in the first layer, and the predicted probability of m7G was considered as a new feature vector for the subsequent step. Subsequently, six meta-models were created using the new feature vector, and their predicted probabilities were again used as novel features. Finally, using a systematic approach that incorporated these novel features, random forest was identified as the best super learner for the final prediction. THRONE outperformed other existing methods in the prediction of m7G sites in both cross-validation analysis and independent evaluation. The proposed method is publicly accessible at http://thegleelab.org/THRONE/ and is expected to help the scientific community identify putative m7G sites and formulate novel testable biological hypotheses.

18.
An increasing number of genes have been experimentally confirmed in recent years as causative genes for various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate additional biological and computational resources for better disease gene predictions.

19.
MOTIVATION: Various studies have shown that cancer tissue samples can be successfully detected and classified by their gene expression patterns using machine learning approaches. One of the challenges in applying these techniques for classifying gene expression data is to extract accurate, readily interpretable rules providing biological insight as to how classification is performed. Current methods generate classifiers that are accurate but difficult to interpret. This is the trade-off between credibility and comprehensibility of the classifiers. Here, we introduce a new classifier in order to address these problems. It is referred to as k-TSP (k-Top Scoring Pairs) and is based on the concept of 'relative expression reversals'. This method generates simple and accurate decision rules that only involve a small number of gene-to-gene expression comparisons, thereby facilitating follow-up studies. RESULTS: In this study, we have compared our approach to other machine learning techniques for class prediction in 19 binary and multi-class gene expression datasets involving human cancers. The k-TSP classifier performs as efficiently as Prediction Analysis of Microarray and support vector machine, and outperforms other learning methods (decision trees, k-nearest neighbour and naïve Bayes). Our approach is easy to interpret as the classifier involves only a small number of informative genes. For these reasons, we consider the k-TSP method to be a useful tool for cancer classification from microarray gene expression data. AVAILABILITY: The software and datasets are available at http://www.ccbm.jhu.edu CONTACT: actan@jhu.edu.
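
The idea of a 'relative expression reversal' can be sketched as follows: a gene pair (i, j) scores highly when the ordering "gene i is expressed below gene j" holds in most samples of one class and in few samples of the other, and a new sample is classified by a majority vote over the k best non-overlapping pairs. The Python code below is a simplified reading of that idea for two classes, not the authors' implementation; the function names, the tie handling and the toy expression values are all assumptions made for illustration.

```python
from itertools import combinations


def score_pairs(X, y):
    """Score every gene pair by |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    class0 = [x for x, label in zip(X, y) if label == 0]
    class1 = [x for x, label in zip(X, y) if label == 1]
    scored = []
    for i, j in combinations(range(len(X[0])), 2):
        p0 = sum(x[i] < x[j] for x in class0) / len(class0)
        p1 = sum(x[i] < x[j] for x in class1) / len(class1)
        # The last field records which class the ordering x_i < x_j votes for.
        scored.append((abs(p0 - p1), i, j, 0 if p0 > p1 else 1))
    return sorted(scored, key=lambda item: item[0], reverse=True)


def top_k_disjoint(scored, k):
    """Greedily keep the k best pairs that share no gene."""
    used, chosen = set(), []
    for _, i, j, class_if_less in scored:
        if i in used or j in used:
            continue
        chosen.append((i, j, class_if_less))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def predict(pairs, x):
    """Majority vote over the selected pairs (ties go to class 0)."""
    votes_for_1 = 0
    for i, j, class_if_less in pairs:
        vote = class_if_less if x[i] < x[j] else 1 - class_if_less
        votes_for_1 += vote
    return 1 if 2 * votes_for_1 > len(pairs) else 0


# Toy expression matrix: rows are samples, columns are 4 hypothetical genes.
X = [[1, 5, 3, 3.5], [2, 6, 4, 1], [1, 4, 2, 6],
     [7, 2, 3, 4], [8, 1, 5, 2], [6, 3, 1, 2]]
y = [0, 0, 0, 1, 1, 1]

pairs = top_k_disjoint(score_pairs(X, y), k=1)
print(pairs, predict(pairs, [2, 9, 5, 5]), predict(pairs, [9, 2, 5, 5]))
```

Because each rule is a single comparison between two genes, the resulting classifier does not depend on how the expression values were scaled, provided the scaling preserves order within a sample.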

20.
Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained with novel experimental results, need to be built regularly for more precise predictions. ML methods can build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and need feature calculation algorithms to generate quantitative features. Programming skills along with domain knowledge are required to develop these algorithms. Therefore, the process of knowledge discovery through ML is slowed by the lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, fungal adhesins and nuclear receptor proteins were used for building classifiers which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum, which causes human malaria, because the latest corresponding classifiers are not publicly accessible. Our classifier attained 98.18% accuracy and a Matthews correlation coefficient of 0.95 in fivefold cross-validation and outperformed existing classifiers on an independent test set. We implemented this schema as a user-friendly, open-source application, Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
