Similar literature
 20 similar articles found
1.
Data mining in bioinformatics using Weka   Total citations: 8 (self-citations: 0, by others: 8)
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, all common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY: http://www.cs.waikato.ac.nz/ml/weka.

2.
Guo J  Chen H  Sun Z  Lin Y 《Proteins》2004,54(4):738-743
A high-performance method was developed for protein secondary structure prediction based on a dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied to problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolutionary information. The final prediction results were generated from the output of the second SVM layer. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2%, while segment overlap (SOV) accuracy increased to 80.0%. On the CB396 data set, the Q3 of our method reached 74.0% and the SOV reached 78.1%. A web server utilizing the method has been constructed and is available at http://www.bioinfo.tsinghua.edu.cn/pmsvm.
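
The Q3 figure quoted above is the fraction of residues whose three-state label is predicted correctly. As a minimal sketch of that calculation, assuming secondary structure is encoded as strings over H (helix), E (strand) and C (coil), one could write the following; the function name and the example strings are hypothetical and are not taken from the paper.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Fraction of residues whose three-state label (H, E or C) matches."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    if not observed:
        raise ValueError("empty sequence")
    # Count positions where the predicted state equals the observed state.
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


# Hypothetical 10-residue example (not data from the paper).
observed = "HHHHCCEEEC"
predicted = "HHHCCCEEEE"
print(f"Q3 = {q3_accuracy(observed, predicted):.1%}")
```

SOV, the other measure reported, scores overlapping segments rather than single residues and is not covered by this sketch.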

3.
MOTIVATION: Small non-coding RNA (ncRNA) genes play important regulatory roles in a variety of cellular processes. However, detection of ncRNA genes is a great challenge for both experimental and computational approaches. In this study, we describe a new approach called positive sample only learning (PSoL) to predict ncRNA genes in the Escherichia coli genome. Although PSoL is a machine learning method for classification, it requires no negative training data, which are generally hard to define properly and strongly affect the performance of machine learning. In addition, using the support vector machine (SVM) as the core learning algorithm, PSoL can integrate many different kinds of information to improve the accuracy of prediction. Beyond predicting ncRNAs, PSoL is applicable to many other bioinformatics problems. RESULTS: The PSoL method is assessed by 5-fold cross-validation experiments, which show that PSoL can achieve about 80% accuracy in recovering known ncRNAs. We compared PSoL predictions with five previously published results. The PSoL method has the highest percentage of predictions overlapping with those from other methods.

4.
Classification kernel models that categorize very large sets of unknown data samples are a valuable tool for the scientific community. Excel2SVM, a stand-alone Python mathematical analysis tool, bridges the gap between researchers and computer science with a simple graphical user interface that allows users to examine data and perform maximal margin classification. The software trains support vector machines and classifies unknown data files quickly and efficiently, giving researchers full access to this complicated, high-level algorithm. Excel2SVM can convert data to the proper sparse format and offers a variety of kernel functions along with cost factors/modes, grids, cross-validation, and several other functions. The program works with any type of quantitative data, which makes Excel2SVM suitable for analyzing a wide variety of input. The software is free and available at www.bioinformatics.org/excel2svm. A link to the software may also be found at www.kernel-machines.org. The graphical user interface has been shown to produce kernel models that give accurate results and classify data through a decision boundary.

5.
In a social network, users hold and express positive and negative attitudes (e.g. support/opposition) towards other users. Those attitudes form binary relationships among the users, which play an important role in social network analysis. However, some of those binary relationships are likely to be latent as the scale of the social network increases. The problem of predicting latent binary relationships has recently begun to draw researchers' attention. In this paper, we propose a machine learning algorithm for predicting positive and negative relationships in social networks, inspired by structural balance theory and social status theory. More specifically, we show that when two users in the network have fewer common neighbors, the prediction accuracy of the relationship between them deteriorates. Accordingly, in the training phase, we propose a segment-based training framework that divides the training data into two subsets according to the number of common neighbors between users, and builds a prediction model for each subset based on the support vector machine (SVM). Moreover, to deal with large-scale social network data, we employ a sampling strategy that selects a small amount of training data while maintaining high prediction accuracy. We compare our algorithm with traditional algorithms and with adaptive boosting of them. Experimental results on typical data sets show that our algorithm can deal with large social networks and consistently outperforms other methods.
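
The segment-based split described above can be sketched as follows. This is only an illustration under stated assumptions: edges are given as hypothetical `(user, user, sign)` tuples, neighborhoods ignore direction and sign, and the `threshold` separating the two subsets is a free parameter that the abstract does not specify. Each subset would then be used to train its own SVM, which is omitted here.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Map each user to the set of users they are linked to (direction and sign ignored)."""
    neighbors = defaultdict(set)
    for u, v, _sign in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def split_by_common_neighbors(edges, threshold):
    """Partition signed edges into (few, many) by the number of common neighbors."""
    neighbors = build_neighbors(edges)
    few, many = [], []
    for u, v, sign in edges:
        common = len(neighbors[u] & neighbors[v])
        # Edges with at least `threshold` common neighbors go to the "many" subset.
        (many if common >= threshold else few).append((u, v, sign))
    return few, many


# Hypothetical toy network: +1 is a positive link, -1 a negative one.
edges = [("a", "b", +1), ("a", "c", +1), ("b", "c", -1), ("c", "d", +1)]
few, many = split_by_common_neighbors(edges, threshold=1)
print("few common neighbors:", few)
print("many common neighbors:", many)
```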

6.
The Maximal Margin (MAMA) linear programming classification algorithm has recently been proposed and tested for cancer classification based on expression data. It demonstrated sound performance on publicly available expression datasets. We developed a web interface to give potential users easy access to the MAMA classification tool. Basic and advanced options provide flexibility in use. The input data format is the same as that used in most publicly available datasets. This makes the web resource particularly convenient for users who work on expression data analysis but are not machine learning experts.

7.
李高磊  黄玮  孙浩  李余动 《微生物学报》2021,61(9):2581-2593
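With the arrival of the big-data era, turning massive omics data into knowledge that is easy to understand and visualize has become one of the major challenges facing bioinformatics. To handle complex, high-dimensional microbiome data, machine learning algorithms have been applied to human microbiome research to reveal the complex mechanisms underlying disease. This article first outlines microbiome data-processing methods and commonly used machine learning algorithms, such as support vector machines (SVM), random forests (RF) and artificial neural networks...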

8.
孙远帅  陈垚  玄萍  江弋 《生物信息学》2013,11(3):161-166
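The development of gene chip (microarray) technology has brought opportunities to bioinformatics and made cancer diagnosis at the level of gene expression possible. However, the high-dimensional, small-sample nature of microarray data also challenges traditional machine learning methods. Using real gene expression data, this paper tests how well the main current classification and dimensionality-reduction methods perform in cancer diagnosis. The experimental comparison shows that a support vector machine with a linear kernel can effectively separate tumor from non-tumor gene expression, which provides a reference for cancer diagnosis.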

9.
A DSRPCL-SVM approach to informative gene analysis   Total citations: 1 (self-citations: 0, by others: 1)
Microarray-based tumor diagnosis is a very interesting topic in bioinformatics. One of the key problems is the discovery and analysis of the informative genes of a tumor. Although there are many elaborate approaches to this problem, it is still difficult to select a reasonable set of informative genes for tumor diagnosis from microarray data alone. In this paper, we group the genes expressed in microarray data into a number of clusters via the distance sensitive rival penalized competitive learning (DSRPCL) algorithm and then detect the informative gene cluster or set with the help of a support vector machine (SVM). Moreover, the critical or powerful informative genes can be found through further classification and detection on the obtained informative gene clusters. Experiments on the colon, leukemia, and breast cancer datasets demonstrate that the proposed DSRPCL-SVM approach leads to a reasonable selection of informative genes for tumor diagnosis.

10.
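The support vector machine is a new type of learning machine based on statistical learning theory. This article proposes an SVM-based method for extracting and recognizing epileptic EEG features that makes full use of the SVM's strong generalization ability. In a comparison with neural network methods, it shows a lower miss rate and better robustness; it merits further study and has good application prospects.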

11.
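Splice site recognition based on support vector machines (SVM)   Total citations: 14 (self-citations: 1, by others: 13)
Splice site recognition is an important step in gene identification and has long attracted researchers' attention. Because sequences near splice sites are conserved, several methods based on statistical properties have been applied to splice site recognition, but their performance still needs improvement. Support Vector Machines, a new kind of learning machine based on statistical learning theory, have developed considerably in recent years and have been applied to many pattern recognition problems. This paper applies them to splice site recognition. Because false splice sites far outnumber true sites among sequence samples that satisfy the GT-AG rule, the paper proposes an SVM-based balanced minimum-taking method to obtain better recognition. The experimental results show that SVM-based splice site recognition extracts the statistical features of the conserved sequences near the sites more effectively, generalizes better to the test set, and is simpler to use. This result provides a new method for splice site recognition and offers new clues for solving structure and site recognition problems in the study of biological macromolecules.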

12.
As an important post-translational modification of prokaryotic proteins, pupylation plays a key role in regulating various biological processes. The accurate identification of pupylation sites is crucial for understanding the underlying mechanisms of pupylation. Although several computational methods have been developed for identifying pupylation sites, their prediction accuracy is still unsatisfactory. Here, a novel bioinformatics tool named IMP–PUP is proposed to improve the prediction of pupylation sites. IMP–PUP is built on the composition of k-spaced amino acid pairs and trained with a modified semi-supervised self-training support vector machine (SVM) algorithm. The proposed algorithm iteratively trains a series of support vector machine classifiers on both annotated and non-annotated pupylated proteins. Computational results show that IMP–PUP achieves areas under the receiver operating characteristic curve of 0.91, 0.73, and 0.75 on our training set, Tung's testing set, and our testing set, respectively, which are better than those of the different error costs SVM algorithm and the original self-training SVM algorithm. Independent tests also show that IMP–PUP significantly outperforms three other existing pupylation site predictors: GPS–PUP, iPUP, and pbPUP. Therefore, IMP–PUP can be a useful tool for accurate prediction of pupylation sites. A MATLAB software package for IMP–PUP is available at https://juzhe1120.github.io/.
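
The feature encoding named above, the composition of k-spaced amino acid pairs, can be illustrated with a short sketch. The abstract does not give the exact window size, the range of k, or the normalization, so the choices below (the 20 standard residues, frequencies normalized by the number of residue pairs at spacing k, and the function name itself) are assumptions rather than the IMP–PUP implementation, which is distributed as a MATLAB package.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def k_spaced_pair_composition(peptide: str, k: int) -> dict:
    """Normalized frequency of each residue pair separated by exactly k residues."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    windows = len(peptide) - k - 1  # number of (i, i + k + 1) position pairs
    if windows <= 0:
        raise ValueError("peptide is too short for this k")
    for i in range(windows):
        pair = peptide[i] + peptide[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return {pair: n / windows for pair, n in counts.items()}


# Hypothetical peptide fragment; with k = 1 the pairs are (A,A), (K,K), (A,A).
features = k_spaced_pair_composition("AKAKA", k=1)
print(len(features), features["AA"], features["KK"])
```

Concatenating the 400-dimensional vectors for several values of k would give one feature vector per candidate site, which could then be passed to an SVM.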

13.
This paper applies and studies the behavior of three learning algorithms, namely the Support Vector Machine (SVM), the Radial Basis Function network (RBF network), and k-Nearest Neighbor (k-NN), for predicting HIV-1 drug resistance from genotype data. In addition, a new algorithm for classifier combination is proposed. A comparison of the predictive performance of the three learning algorithms shows that SVM yields the highest average accuracy, the RBF network gives the highest sensitivity, and k-NN yields the best specificity. Finally, a comparison of the composite classifier with the three learning algorithms demonstrates that the proposed composite classifier provides the highest average accuracy.
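
The three measures compared above follow directly from the confusion matrix. The sketch below shows how accuracy, sensitivity and specificity are computed for a binary task; encoding resistant samples as 1 and susceptible samples as 0 is an assumption made for illustration, and the labels are invented rather than taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels encoded as 1 (positive) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of positives that are detected
        "specificity": tn / (tn + fp),  # fraction of negatives that are detected
    }


# Hypothetical labels: 1 = resistant, 0 = susceptible.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

A classifier can lead on one of these measures and trail on another, which is why the abstract reports a different winner for each.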

14.
PCP: a program for supervised classification of gene expression profiles   Total citations: 1 (self-citations: 0, by others: 1)
PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns (vectors of measurements). The principal use of PCP in bioinformatics is the design and evaluation of classifiers for use in clinical diagnostic tests based on measurements of gene expression. PCP implements leading pattern classification and gene selection algorithms and incorporates cross-validation estimation of classifier performance. Importantly, the implementation integrates the gene selection and class prediction stages, which is vital for computing reliable performance estimates in small-sample scenarios. Additionally, the program includes automated and efficient model selection (optimization of parameters) for the support vector machine (SVM) classifier. The distribution includes Linux and Windows/Cygwin binaries. The program can easily be ported to other platforms. AVAILABILITY: Free download at http://pcp.sourceforge.net

15.
AH Beiki  S Saboor  M Ebrahimi 《PloS one》2012,7(9):e44164
Various methods have been used to identify cultivars of olive trees; here we used different bioinformatics algorithms to propose new tools for classifying 10 olive cultivars based on RAPD and ISSR genetic marker datasets generated from PCR reactions. Five RAPD markers (OPA0a21, OPD16a, OP01a1, OPD16a1 and OPA0a8) and five ISSR markers (UBC841a4, UBC868a7, UBC841a14, U12BC807a and UBC810a13) were selected as the most important markers by all attribute weighting models. K-Medoids unsupervised clustering run on the SVM dataset was able to assign every olive cultivar to the right class. All 176 trees induced by the decision tree models were meaningful, and the UBC841a4 attribute clearly distinguished between foreign and domestic olive cultivars with 100% accuracy. Predictive machine learning algorithms (SVM and Naïve Bayes) were also able to predict the right class of olive cultivars with 100% accuracy. For the first time, our results show that data mining techniques can be used effectively to distinguish between plant cultivars, and that the machine learning based systems proposed in this study can predict new olive cultivars with the best possible accuracy.

16.
Stiglic G  Kocbek S  Pernek I  Kokol P 《PloS one》2012,7(3):e33812

Purpose

Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible.

Methods

This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models with a so-called one-button data mining approach, where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during tuning; the model is constrained exclusively by the dimensions of the produced decision tree.

Results

The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase in accuracy for the less complex, visually tuned decision trees. We observe higher accuracy gains on the bioinformatics datasets than on the classical machine learning benchmark datasets. Additionally, a user study was carried out to confirm the assumption that tree tuning times are significantly lower for the proposed method than for manual tuning of the decision tree.

Conclusions

The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one achieves not only good comprehensibility but also very good classification performance that does not differ from that of the usually more complex models built with the default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes, which are very common in bioinformatics.

17.
MOTIVATION: Structural genomics projects are beginning to produce protein structures with unknown function; therefore, accurate, automated predictors of protein function are required if all these structures are to be properly annotated in reasonable time. Identifying the interface between two interacting proteins provides important clues to the function of a protein and can reduce the search space required by docking algorithms to predict the structures of complexes. RESULTS: We have combined a support vector machine (SVM) approach with surface patch analysis to predict protein-protein binding sites. Using a leave-one-out cross-validation procedure, we were able to successfully predict the location of the binding site on 76% of our dataset, which is made up of proteins with both transient and obligate interfaces. With heterogeneous cross-validation, where we trained the SVM on transient complexes to predict on obligate complexes (and vice versa), we still achieved success rates comparable to those of the leave-one-out cross-validation, suggesting that sufficient properties are shared between transient and obligate interfaces. AVAILABILITY: A web application based on the method can be found at http://www.bioinformatics.leeds.ac.uk/ppi_pred. The dataset of 180 proteins used in this study is also available via the same web site. CONTACT: westhead@bmb.leeds.ac.uk SUPPLEMENTARY INFORMATION: http://www.bioinformatics.leeds.ac.uk/ppi-pred/supp-material.

18.
Li ZC  Zhou XB  Lin YR  Zou XY 《Amino acids》2008,35(3):581-590
Structural class characterizes the overall folding type of a protein or its domain. Most existing methods for determining the structural class of a protein are based on a group of features that carries only one kind of discriminative information for predicting the structural class. Other types of discriminative information associated with the primary sequence are missed entirely, which undoubtedly reduces the success rate of prediction. We present a novel method for predicting protein structural class that couples an improved genetic algorithm (GA) with the support vector machine (SVM). The improved GA was applied to the selection of an optimized feature subset and the optimization of the SVM parameters. Jackknife tests on the working datasets indicated that the prediction accuracies for the different classes were in the range of 97.8–100%, with an overall accuracy of 99.5%. The results indicate that the approach has a high potential to become a useful tool in bioinformatics.

19.
《Genomics》2020,112(3):2524-2534
The development of embryonic cells involves several continuous stages, and some genes are related to embryogenesis. To date, few studies have systematically investigated changes in gene expression profiles during mammalian embryogenesis. In this study, a computational analysis using machine learning algorithms was performed on the gene expression profiles of mouse embryonic cells at seven stages. First, the profiles were analyzed with a powerful Monte Carlo feature selection method to generate a feature list. Second, incremental feature selection was applied to the list by incorporating two classification algorithms: support vector machine (SVM) and repeated incremental pruning to produce error reduction (RIPPER). Through SVM, we extracted several latent gene biomarkers that indicate the stages of embryonic cells, and constructed an optimal SVM classifier that produced a nearly perfect classification of embryonic cells. Furthermore, some interesting rules were obtained with the RIPPER algorithm, suggesting different expression patterns for different stages.
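
The second step above, applying feature selection incrementally to a ranked list, can be sketched generically: score the top 1, top 2, ... top n features and keep the best-scoring prefix. This is a sketch under assumptions, not the study's pipeline. The `evaluate` callback stands in for whatever performance estimate is used (for example a cross-validated SVM score), and the toy scorer and gene names below are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate, step=1):
    """Score growing prefixes of a ranked feature list and return the best prefix."""
    if step < 1:
        raise ValueError("step must be at least 1")
    best_subset, best_score = [], float("-inf")
    history = []  # (subset size, score) for every prefix tried
    for size in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:size]
        score = evaluate(subset)
        history.append((size, score))
        if score > best_score:  # ties keep the smaller subset
            best_subset, best_score = list(subset), score
    return best_subset, best_score, history


# Hypothetical scorer: rewards informative genes, lightly penalizes the rest.
INFORMATIVE = {"g1", "g2", "g3"}


def toy_score(subset):
    hits = sum(f in INFORMATIVE for f in subset)
    return 10 * hits - (len(subset) - hits)


ranked = ["g1", "g2", "g3", "g4", "g5"]  # e.g. the order produced by a ranking method
best, score, history = incremental_feature_selection(ranked, toy_score)
print(best, score, history)
```

Keeping the first prefix that reaches the maximum favors the smallest feature set among equally good ones, which suits biomarker discovery.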

20.
Pluripotent stem cells are able to self-renew and to differentiate into all adult cell types. Many studies report data describing these cells and characterize them in molecular terms. Machine learning yields classifiers that can accurately identify pluripotent stem cells, but there is a lack of studies yielding minimal sets of best biomarkers (genes/features). We assembled gene expression data of pluripotent stem cells and non-pluripotent cells from the mouse. After normalization and filtering, we applied machine learning, classifying samples into pluripotent and non-pluripotent with high cross-validated accuracy. Furthermore, to identify minimal sets of best biomarkers, we used three methods: information gain, random forests and a wrapper of genetic algorithm and support vector machine (GA/SVM). We demonstrate that the GA/SVM biomarkers work best in combination with each other; pathway and enrichment analyses show that they cover the widest variety of processes implicated in pluripotency. The GA/SVM wrapper yields the best biomarkers, no matter which classification method is used. The consensus best biomarker based on the three methods is Tet1, which was only recently implicated in pluripotency. The best biomarker based on the GA/SVM wrapper approach alone is Fam134b, possibly a missing link between pluripotency and some standard surface markers of unknown function processed by the Golgi apparatus.
