Similar literature (20 results)
1.
Ho SY, Hsieh CH, Chen HM, Huang HL. Bio Systems, 2006, 85(3): 165-176
An accurate classifier with linguistic interpretability that uses a small number of relevant genes is beneficial to microarray data analysis and to the development of inexpensive diagnostic tests. Several frequently used techniques for designing classifiers of microarray data, such as support vector machines, neural networks, k-nearest neighbor, and logistic regression models, suffer from low interpretability. This paper proposes an interpretable gene expression classifier (named iGEC) with an accurate and compact fuzzy rule base for microarray data analysis. The design of iGEC has three objectives to be simultaneously optimized: maximal classification accuracy, minimal number of rules, and minimal number of used genes. An "intelligent" genetic algorithm (IGA) is used to efficiently solve the design problem with a large number of tuning parameters. The performance of iGEC is evaluated using eight commonly used data sets. It is shown that iGEC has an accurate, concise, and interpretable rule base (1.1 rules per class) on average in terms of test classification accuracy (87.9%), rule number (3.9), and used gene number (5.0). Moreover, iGEC not only performs better than the existing fuzzy rule-based classifier in terms of the above-mentioned objectives, but is also more accurate than some existing non-rule-based classifiers.

2.
MOTIVATION: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method to select top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure. RESULTS: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select 'hub' genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than the univariate feature selection based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant type of models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show false negative prognosis can be controlled via a cost-adjusted discriminant function. AVAILABILITY: R code for the ELDA algorithm is available from the author upon request.

3.
Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify a special type of rules and potential biomarkers from biological datasets using an integrated approach of statistical and binary inclusion-maximal biclustering techniques. At first, a novel statistical strategy is used to eliminate the insignificant, low-significant, and redundant genes in such a way that the significance level satisfies the data distribution property (viz., either normal or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we classify the data to determine how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we compare the average classification accuracy and other related factors with other rule-based classifiers. Statistical significance tests are also performed to verify the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation to determine the epigenetic effect (viz., the effect of methylation) on gene expression level.

4.
In this paper, we propose a new hybrid method based on the Correlation-based Feature Selection method and the Artificial Bee Colony algorithm, namely Co-ABC, to select a small number of relevant genes for accurate classification of gene expression profiles. Co-ABC consists of three fully cooperating stages. The first stage filters noisy and redundant genes in high-dimensional domains by applying the Correlation-based Feature Selection (CFS) filter method. In the second stage, the Artificial Bee Colony (ABC) algorithm is used to select the informative and meaningful genes. In the third stage, we adopt a Support Vector Machine (SVM) algorithm as the classifier, using the genes preselected in the second stage. The overall performance of the proposed Co-ABC algorithm was evaluated using six gene expression profiles for binary and multi-class cancer datasets. In addition, in order to demonstrate the efficiency of the proposed Co-ABC algorithm, we compare it with previously known related methods. Two of these methods were re-implemented for the sake of a fair comparison using the same parameters: Co-GA, which is CFS combined with a genetic algorithm (GA), and Co-PSO, which is CFS combined with a particle swarm optimization algorithm (PSO). The experimental results show that the proposed Co-ABC algorithm achieves accurate classification performance using a small number of predictive genes. This indicates that Co-ABC is an efficient approach for biomarker gene discovery using cancer gene expression profiles.

5.
Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high-dimensional microarray datasets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets, as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.

6.
7.
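GA/WV: a feature gene selection algorithm for mining cancer gene expression profiles   Cited by: 1 (self-citations: 0, citations by others: 1)
Identifying the set of feature genes in cancer expression profiles can advance research on cancer type classification, and may also give patients better clinical diagnoses. Although some methods have succeeded in gene expression profile analysis, cancer classification based on gene expression profile data remains a major challenge, mainly because a general and reliable method for assessing gene importance is lacking. GA/WV is a new method for assessing the importance of genes for classification from complex biological expression data. The feature gene set obtained by combining a genetic algorithm (GA) with a weighted voting (WV) classification algorithm is suitable not only for the WV classifier but also for other classifiers. Validation of the GA/WV method on cancer gene expression profile datasets shows that it is a successful and reliable feature gene selection method.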

8.
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small-n, large-p classification problems despite their importance in medical decision making. In this paper, we introduce two criteria for the assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed two extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not "anticonservative" using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
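
As a rough illustration of what checking calibration can look like, the sketch below bins predicted probabilities and compares the mean prediction in each bin with the observed event frequency, and also computes the Brier score. These are standard textbook measures, not the evaluation measures developed in the paper; the function names `calibration_table` and `brier_score` and the binning scheme are assumptions made for this example.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one tuple per non-empty bin:
    (bin index, count, mean predicted probability, observed frequency).
    A well-calibrated classifier has mean prediction close to observed
    frequency in every bin. (Illustrative helper, not from the paper.)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        table.append((idx, len(members), mean_p, freq))
    return table


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    # Toy predictions for a binary outcome (made-up numbers).
    probs = [0.1, 0.1, 0.9, 0.9]
    outcomes = [0, 0, 1, 1]
    for row in calibration_table(probs, outcomes, n_bins=2):
        print(row)
    print("Brier score:", brier_score(probs, outcomes))
```

In a small-n, large-p setting the predicted probabilities fed into such a table would normally come from held-out folds of a cross-validation, since probabilities computed on the training samples tend to look better calibrated than they are.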

9.
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.

10.
Fung ES, Ng MK. Bioinformation, 2007, 2(5): 230-234
One application of discriminant analysis to microarray data is the classification of patient and normal samples based on gene expression values. The analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An l2-l1 norm minimization method is applied to the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets show that the new algorithm can generate classification results as good as those of other classification methods and can effectively determine relevant genes for classification purposes. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

11.
Summary: This paper proposes a modified radial basis function classification algorithm for non-linear cancer classification. In the algorithm, a modified simulated annealing method is developed and combined with the linear least square and gradient paradigms to optimize the structure of the radial basis function (RBF) classifier. The proposed algorithm can be adopted to perform non-linear cancer classification based on gene expression profiles and applied to two microarray data sets involving various human tumor classes: (1) Normal versus colon tumor; (2) acute myeloid leukemia (AML) versus acute lymphoblastic leukemia (ALL). Finally, accuracy and stability for the proposed algorithm are further demonstrated by comparing with the other cancer classification algorithms.

12.
One important application of gene expression analysis is to classify tissue samples according to their gene expression levels. Gene expression data are typically characterized by high dimensionality and small sample size, which makes the classification task quite challenging. In this paper, we present a data-dependent kernel for microarray data classification. This kernel function is engineered so that the class separability of the training data is maximized. A bootstrapping-based resampling scheme is introduced to reduce the possible training bias. The effectiveness of this adaptive kernel for microarray data classification is illustrated with a k-Nearest Neighbor (KNN) classifier. Our experimental study shows that the data-dependent kernel leads to a significant improvement in the accuracy of KNN classifiers. Furthermore, this kernel-based KNN scheme has been demonstrated to be competitive with, if not better than, more sophisticated classifiers such as Support Vector Machines (SVMs) and the Uncorrelated Linear Discriminant Analysis (ULDA) for classifying gene expression data.

13.
Microarray data classification using automatic SVM kernel selection   Cited by: 1 (self-citations: 0, citations by others: 1)
Nahar J, Ali S, Chen YP. DNA and Cell Biology, 2007, 26(10): 707-712
Microarray data classification is one of the most important emerging clinical applications in the medical community. Machine learning algorithms are most frequently used to complete this task. We selected one of the state-of-the-art kernel-based algorithms, the support vector machine (SVM), to classify microarray data. As a large number of kernels are available, a significant research question is: what is the best kernel for patient diagnosis based on microarray data classification using SVM? We first suggest three solutions based on data visualization and quantitative measures. The proposed solutions are then tested on different types of microarray problems. Finally, we found that the rule-based approach is most useful for automatic kernel selection for SVM to classify microarray data.

14.
Paul TK, Iba H. Bio Systems, 2005, 82(3): 208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on the signal-to-noise (S2N) ratio, unsupervised learning methods such as clustering and supervised learning methods such as the k-nearest neighbor (k-NN) classifier are widely used. Instead of the S2N ratio, adaptive searches like the Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied to select a smaller gene subset that classifies patient samples more accurately. In this paper, we propose a new PMBGA-based method for the identification of informative genes from microarray data. By applying our proposed method to the classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.
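
For readers unfamiliar with the S2N baseline mentioned above, here is a minimal sketch of signal-to-noise gene ranking for a two-class problem. It assumes the commonly used definition S2N = (mean1 - mean2) / (sd1 + sd2) computed per gene; the abstract does not spell out the formula, and the names `s2n_scores` and `top_genes` and the toy matrix are hypothetical. It illustrates only the baseline filter, not the PMBGA method proposed in the paper.

```python
from statistics import mean, stdev


def s2n_scores(expr, labels):
    """Per-gene signal-to-noise ratio for a two-class problem.

    expr   : list of samples, each a list of gene expression values
    labels : list of 0/1 class labels, one per sample
    Assumes the common definition (mean0 - mean1) / (sd0 + sd1) and at
    least two samples per class (needed for the sample standard deviation).
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        class0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        class1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        denom = stdev(class0) + stdev(class1)
        # A gene that is constant in both classes carries no usable signal here.
        scores.append((mean(class0) - mean(class1)) / denom if denom > 0 else 0.0)
    return scores


def top_genes(expr, labels, k):
    """Indices of the k genes with the largest absolute S2N score."""
    scores = s2n_scores(expr, labels)
    ranked = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: 4 samples x 3 genes (made-up values).
    toy_expr = [
        [1.0, 5.0, 2.0],
        [1.2, 5.1, 8.0],
        [3.0, 5.0, 2.5],
        [3.2, 4.9, 7.5],
    ]
    toy_labels = [0, 0, 1, 1]
    print(s2n_scores(toy_expr, toy_labels))
    print(top_genes(toy_expr, toy_labels, 2))
```

Because each gene is scored independently, a filter of this kind can keep several highly correlated genes; that limitation is part of the motivation for subset searches such as the one the abstract describes.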

15.
MOTIVATION: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. RESULTS: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT: alexander.statnikov@vanderbilt.edu.

16.
17.
MOTIVATION: Various studies have shown that cancer tissue samples can be successfully detected and classified by their gene expression patterns using machine learning approaches. One of the challenges in applying these techniques for classifying gene expression data is to extract accurate, readily interpretable rules providing biological insight as to how classification is performed. Current methods generate classifiers that are accurate but difficult to interpret. This is the trade-off between credibility and comprehensibility of the classifiers. Here, we introduce a new classifier in order to address these problems. It is referred to as k-TSP (k-Top Scoring Pairs) and is based on the concept of 'relative expression reversals'. This method generates simple and accurate decision rules that only involve a small number of gene-to-gene expression comparisons, thereby facilitating follow-up studies. RESULTS: In this study, we have compared our approach to other machine learning techniques for class prediction in 19 binary and multi-class gene expression datasets involving human cancers. The k-TSP classifier performs as efficiently as Prediction Analysis of Microarray and support vector machine, and outperforms other learning methods (decision trees, k-nearest neighbour and naïve Bayes). Our approach is easy to interpret as the classifier involves only a small number of informative genes. For these reasons, we consider the k-TSP method to be a useful tool for cancer classification from microarray gene expression data. AVAILABILITY: The software and datasets are available at http://www.ccbm.jhu.edu CONTACT: actan@jhu.edu.
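
The idea of a "relative expression reversal" can be sketched for the simplest case of a single pair (k = 1) and two classes. The score used below, |P(gene i < gene j | class 0) - P(gene i < gene j | class 1)|, is the usual top-scoring-pair criterion; the abstract does not state it, so treat it as an assumption, along with the names `top_scoring_pair` and `predict`. The full k-TSP classifier combines several disjoint pairs by voting and handles ties and multi-class data, none of which is shown here.

```python
from itertools import combinations


def top_scoring_pair(expr, labels):
    """Find the gene pair whose ordering differs most between two classes.

    expr   : list of samples, each a list of gene expression values
    labels : list of 0/1 class labels, one per sample
    Returns (score, i, j, less_means_class0), where less_means_class0 is
    True when "gene i < gene j" is more frequent in class 0 than in class 1.
    Ties between pairs are broken by first occurrence (a simplification).
    """
    n_genes = len(expr[0])
    class0 = [row for row, y in zip(expr, labels) if y == 0]
    class1 = [row for row, y in zip(expr, labels) if y == 1]
    best = None
    for i, j in combinations(range(n_genes), 2):
        p0 = sum(r[i] < r[j] for r in class0) / len(class0)
        p1 = sum(r[i] < r[j] for r in class1) / len(class1)
        score = abs(p0 - p1)
        if best is None or score > best[0]:
            best = (score, i, j, p0 > p1)
    return best


def predict(pair, sample):
    """Classify one sample with the single if-then-else rule of the pair."""
    _, i, j, less_means_class0 = pair
    is_less = sample[i] < sample[j]
    return 0 if is_less == less_means_class0 else 1


if __name__ == "__main__":
    # Toy data: gene 0 < gene 1 in class 0, reversed in class 1; gene 2 is noise.
    expr = [
        [1.0, 2.0, 5.0],
        [0.5, 3.0, 0.1],
        [4.0, 1.0, 5.0],
        [3.0, 2.0, 0.2],
    ]
    labels = [0, 0, 1, 1]
    pair = top_scoring_pair(expr, labels)
    print(pair)
    print(predict(pair, [2.0, 9.0, 0.0]), predict(pair, [9.0, 2.0, 0.0]))
```

The resulting rule depends only on the ordering of two expression values within a sample, which is why classifiers of this kind are insensitive to monotone normalization of the data and easy to read as a decision rule.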

18.
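Because gene expression data have a high attribute dimension and a low sample dimension (many features, few samples), the Fisher classifier does not perform very well on such data. This paper proposes Fisher-List, an improved version of the Fisher algorithm. Its distinctive feature is that it determines a decision threshold for each class, where each threshold incorporates both information from the overall sample and information from certain individual samples that are critical for classification. Experiments show that the new algorithm achieves better performance in classifying gene expression data than Fisher, LogitBoost, AdaBoost, k-nearest neighbors, decision trees, and support vector machines.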

19.
MOTIVATION: It is understood that clustering genes is useful for exploring scientific knowledge from DNA microarray gene expression data. The explored knowledge can ultimately be used for annotating the biological function of novel genes. Representing the explored knowledge in an efficient manner is then closely related to the classification accuracy. However, this issue has not yet received the attention it deserves. RESULT: A novel method based on template theory in cognitive psychology and pattern recognition is developed in this study for effectively representing knowledge extracted from cluster analysis. The basic principle is to represent knowledge according to the relationship between genes and a found cluster structure. Based on this novel knowledge representation method, a pattern recognition algorithm (the decision tree algorithm C4.5) is then used to construct a classifier for annotating biological functions of novel genes. The experiments on five published datasets show that this method improves classification performance compared with the conventional method. The statistical tests indicate that this improvement is significant. AVAILABILITY: The software package can be obtained upon request from the author.

20.
Advances in DNA microarray technologies have made gene expression profiles a significant candidate in identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples in the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification for two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.
