Similar literature
1.
Microarray data contain a large number of genes (usually more than 1000) and a relatively small number of samples (usually fewer than 100). This presents problems for discriminant analysis of microarray data. One way to alleviate the problem is to reduce the dimensionality of the data by selecting genes that are important to the discriminant problem. Gene selection can be cast as a feature selection problem in the context of pattern classification. Feature selection approaches are broadly grouped into filter methods and wrapper methods. The wrapper method outperforms the filter method but at the cost of more intensive computation. In the present study, we propose a wrapper-like gene selection algorithm based on the Regularization Network. Compared with the classical wrapper method, the computational cost of our gene selection algorithm is significantly reduced, because the evaluation criterion we propose does not demand repeated training in the leave-one-out procedure.
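As a rough illustration of why a regularized least-squares model can avoid retraining during leave-one-out, the sketch below uses single-feature ridge regression without an intercept, for which the leave-one-out residual has the closed form e_i / (1 - h_ii). This is not the evaluation criterion of the abstract above, whose details are not given here; the function names, the regularization value and the toy data are assumptions made for the example.

```python
def ridge_weight(xs, ys, lam):
    """Single-feature ridge regression without intercept: w = Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def loo_residuals_closed_form(xs, ys, lam):
    """Leave-one-out residuals from one fit: e_i / (1 - h_ii)."""
    sxx = sum(x * x for x in xs)
    w = ridge_weight(xs, ys, lam)
    residuals = []
    for x, y in zip(xs, ys):
        h_ii = x * x / (sxx + lam)  # diagonal entry of the hat matrix
        residuals.append((y - w * x) / (1.0 - h_ii))
    return residuals


def loo_residuals_retrain(xs, ys, lam):
    """Reference version: refit the model n times, once per held-out sample."""
    residuals = []
    for i in range(len(xs)):
        w = ridge_weight(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        residuals.append(ys[i] - w * xs[i])
    return residuals


# Hypothetical toy data: one gene's expression values and +/-1 class labels.
xs = [0.5, -1.2, 2.0, 0.3, -0.7]
ys = [1.0, -1.0, 1.0, 1.0, -1.0]
print(loo_residuals_closed_form(xs, ys, lam=0.1))
print(loo_residuals_retrain(xs, ys, lam=0.1))
```

Both functions should agree up to floating-point error; the first needs a single fit, while the second needs one fit per sample.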

2.
Discrimination of disease patients based on gene expression data is a crucial problem in clinical practice. An important step in solving this problem is to find a discriminative subset of genes from the thousands of genes on a microarray or DNA chip. Aiming to find informative genes for disease classification on microarrays, we present a gene selection method based on the forward variable (gene) selection method (FSM) and show, using typical public microarray datasets, that our method can extract a small set of genes that are crucial for discriminating different classes with very high accuracy, close to perfect classification.
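A generic greedy forward selection loop can be sketched as follows. The abstract above does not describe the scoring rule of FSM, so the nearest-centroid classifier, the leave-one-out accuracy score, the stopping rule and all names here are assumptions chosen only to make the idea concrete. The sketch also assumes every class has at least two samples.

```python
def nearest_centroid_loo_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid rule using only `genes` (column indices)."""
    n = len(samples)
    correct = 0
    for i in range(n):
        centroids = {}
        for c in sorted(set(labels)):
            rows = [samples[j] for j in range(n) if j != i and labels[j] == c]
            centroids[c] = [sum(r[g] for r in rows) / len(rows) for g in genes]
        point = [samples[i][g] for g in genes]

        def sq_dist(c):
            return sum((p - q) ** 2 for p, q in zip(point, centroids[c]))

        predicted = min(centroids, key=sq_dist)
        if predicted == labels[i]:
            correct += 1
    return correct / n


def forward_select(samples, labels, max_genes):
    """Add one gene at a time, keeping the gene that most improves the score."""
    selected, best_score = [], 0.0
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < max_genes:
        # Ties are broken in favour of the lower gene index.
        scored = [
            (nearest_centroid_loo_accuracy(samples, labels, selected + [g]), -g, g)
            for g in remaining
        ]
        score, _, gene = max(scored)
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(gene)
        remaining.remove(gene)
        best_score = score
    return selected, best_score


# Hypothetical toy data: rows are samples, columns are genes.
samples = [
    [0.1, 5.0, 1.0], [0.9, 5.2, 2.0], [0.5, 4.8, 1.5],
    [0.2, 1.0, 1.2], [0.8, 1.1, 0.5], [0.4, 0.9, 0.7],
]
labels = [1, 1, 1, 0, 0, 0]
print(forward_select(samples, labels, max_genes=2))
```

Each round costs one leave-one-out evaluation per remaining gene, which is why wrapper-style searches of this kind become expensive on thousands of genes.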

3.
This paper introduces a novel generic approach for classification problems with the objective of achieving maximum classification accuracy with the minimum number of features selected. The method is illustrated with several case studies of gene expression data. Our approach integrates filter and wrapper gene selection methods with an added objective of selecting a small set of non-redundant genes that are most relevant for classification, with the provision of bins for genes to be swapped in the search for their biological relevance. It is capable of selecting relatively few marker genes while giving comparable or better leave-one-out cross-validation accuracy when compared with gene ranking selection approaches. Additionally, gene profiles can be extracted from the evolving connectionist system, which provides a set of rules that can be further developed into expert systems. The approach uses an integration of Pearson correlation coefficient and signal-to-noise ratio methods with an adaptive evolving classifier applied through the leave-one-out method for validation. Datasets of gene expression from four case studies are used to illustrate the method. The results show that the proposed approach leads to an improved feature selection process in terms of reducing the number of variables required and an increase in classification accuracy.

4.
A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.

5.
Li Y, Wang N, Perkins EJ, Zhang C, Gong P. PLoS One. 2010;5(10):e13715
Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with the explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied to independently refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively, with 11 common genes being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracy of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.

6.
Feature extraction is one of the most important and effective methods for reducing dimensionality in data mining, particularly with the emergence of high-dimensional data such as microarray gene expression data. Feature extraction for gene selection mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms, ranking-based feature extraction and set-based feature extraction, are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are generally evaluated on an individual basis, without considering inter-relationships between features, while set-based feature extraction evaluates features based on their role in a feature set by taking into account dependency between features. Like learning methods, feature extraction has a generalization problem, namely robustness. However, the issue of robustness is often overlooked in feature extraction. In order to improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select features from the sample set, the proposed approach is able to improve feature extraction performance. The new approach is tested on gene expression datasets including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The test results show that the performance of this algorithm is better than that of existing solutions.

7.
Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It assumes that a smaller "filter-out" factor in the SVM-RFE, which results in a smaller number of gene features eliminated in each recursion, should lead to extraction of a better gene subset. Because the SVM-RFE is highly sensitive to the "filter-out" factor, our simulations have shown that this assumption is not always correct and that the SVM-RFE is an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and other applications, a new two-stage SVM-RFE algorithm has been developed. It is designed to effectively eliminate most of the irrelevant, redundant and noisy genes while keeping information loss small at the first stage. A fine selection for the final gene subset is then performed at the second stage. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE to achieve better algorithm utility. We have demonstrated that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods based on our analysis of three publicly available microarray expression datasets. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is O(d log2 d), where log2 is the base-2 logarithm and d is the size of the original gene set.

8.
Analysis of recursive gene selection approaches from microarray data   Total citations: 1; self-citations: 0; citations by others: 1
MOTIVATION: Finding a small subset of the most predictive genes from microarray data for disease prediction is a challenging problem. Support vector machines (SVMs) have been found to be successful with a recursive procedure in selecting important genes for cancer prediction. However, it is not well understood how much of the success depends on the choice of the specific classifier and how much on the recursive procedure. We answer this question by examining multiple classifiers [SVM, ridge regression (RR) and Rocchio] with feature selection in recursive and non-recursive settings on three DNA microarray datasets (ALL-AML Leukemia data, Breast Cancer data and GCM data). RESULTS: We found recursive RR most effective. On the AML-ALL dataset, it achieved zero error rate on the test set using only three genes (selected from over 7000), which is more encouraging than the best published result (zero error rate using 8 genes by recursive SVM). On the Breast Cancer dataset and the two largest categories of the GCM dataset, the results achieved by recursive RR are also very encouraging. A further analysis of the experimental results shows that different classifiers penalize redundant features to different extents and that this property plays an important role in the recursive feature selection process. The RR classifier tends to penalize redundant features to a much larger extent than the SVM does. This may be the reason why recursive RR performs better in selecting genes.

9.
Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computation burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on the choice of specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of the MFMW-selected genes have been confirmed by other studies to be biomarkers or to contribute to the development of particular cancers.

10.
MOTIVATION: Feature selection approaches, such as filter and wrapper, have been applied to address the gene selection problem in the literature of microarray data analysis. In wrapper methods, the classification error is usually used as the evaluation criterion of feature subsets. Due to the high dimensionality and small sample size of microarray data, however, counting-based error estimation may not necessarily be an ideal criterion for the gene selection problem. RESULTS: Our study reveals that evaluating genes in terms of counting-based error estimators such as resubstitution error, leave-one-out error, cross-validation error and bootstrap error may encounter a severe ties problem, i.e. two or more gene subsets score equally, and this in turn results in uncertainty in gene selection. Our analysis finds that the ties problem is caused by the discrete nature of counting-based error estimators and could be avoided by using continuous evaluation criteria instead. Experimental results show that continuous evaluation criteria such as the generalised |w|^2 measure for support vector machines and the modified Relief's measure for k-nearest neighbors produce improved gene selection compared with counting-based error estimators. AVAILABILITY: The companion website is at http://www.ntu.edu.sg/home5/pg02776030/wrappers/ The website contains (1) the source code of all the gene selection algorithms and (2) the complete set of tables and figures of experiments.
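The ties problem can be seen on a small example. In the sketch below, two genes both give zero resubstitution errors under a nearest-class-mean rule, so an error count cannot rank them, while a continuous score separates them. The gap between class means is used here only as a simple stand-in for a continuous criterion; it is not the |w|^2 or Relief-based measure of the abstract, and the function names and data are assumptions.

```python
from statistics import mean


def class_means(values, labels):
    """Mean expression of one gene in class 1 and in class 0."""
    m1 = mean([v for v, c in zip(values, labels) if c == 1])
    m0 = mean([v for v, c in zip(values, labels) if c == 0])
    return m1, m0


def error_count(values, labels):
    """Counting-based criterion: resubstitution errors of a nearest-class-mean rule."""
    m1, m0 = class_means(values, labels)
    errors = 0
    for v, c in zip(values, labels):
        predicted = 1 if abs(v - m1) < abs(v - m0) else 0
        if predicted != c:
            errors += 1
    return errors


def mean_gap(values, labels):
    """Continuous stand-in criterion: absolute gap between the class means."""
    m1, m0 = class_means(values, labels)
    return abs(m1 - m0)


# Hypothetical toy data: two genes measured on the same six samples.
labels = [1, 1, 1, 0, 0, 0]
gene_a = [5.0, 5.2, 4.8, 1.0, 1.1, 0.9]
gene_b = [2.2, 2.4, 2.3, 1.8, 1.9, 1.7]
for name, gene in (("gene_a", gene_a), ("gene_b", gene_b)):
    print(name, error_count(gene, labels), mean_gap(gene, labels))
```

With n samples an error count can take at most n + 1 distinct values, so ties among thousands of candidate genes are unavoidable; a real-valued score does not have this limitation.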

11.
Li L, Jiang W, Li X, Moser KL, Guo Z, Du L, Wang Q, Topol EJ, Wang Q, Rao S. Genomics. 2005;85(1):16-23
Development of a robust and efficient approach for extracting useful information from microarray data continues to be a significant and challenging task. Microarray data are characterized by a high dimension, high signal-to-noise ratio, and high correlations between genes, but with a relatively small sample size. Current methods for dimensional reduction can be further improved for the scenario in which a single (or a few) highly influential gene(s) is present and its effect in the feature subset would prohibit inclusion of other important genes. We have formalized a robust gene selection approach based on a hybrid between genetic algorithm and support vector machine. The major goal of this hybridization was to exploit fully their respective merits (e.g., robustness to the size of solution space and capability of handling a very large dimension of feature genes) for identification of key feature genes (or molecular signatures) for a complex biological phenotype. We have applied the approach to the microarray data of diffuse large B cell lymphoma to demonstrate its behaviors and properties for mining the high-dimension data of genome-wide gene expression profiles. The resulting classifier(s) (the optimal gene subset(s)) has achieved the highest accuracy (99%) for prediction of independent microarray samples in comparison with marginal filters and a hybrid between genetic algorithm and K nearest neighbors.

12.

Background  

Feature gene extraction is a fundamental issue in microarray-based biomarker discovery. It is normally treated as an optimization problem of finding the best predictive feature genes that can effectively and stably discriminate distinct types of disease conditions, e.g. tumors and normals. Since gene microarray data normally involve thousands of genes measured on tens or hundreds of samples, the gene extraction process may fall into local optima if the gene set is optimized according to the maximization of classification accuracy of the classifier built from it.

13.
The most widely used statistical methods for finding differentially expressed genes (DEGs) are essentially univariate. In this study, we present a new T^2 statistic for analyzing microarray data. We implemented our method using a multiple forward search (MFS) algorithm that is designed for selecting a subset of feature vectors in high-dimensional microarray datasets. The proposed T^2 statistic is a corollary to that originally developed for multivariate analyses and possesses two prominent statistical properties. First, our method takes into account the multidimensional structure of microarray data. The utilization of the information hidden in gene interactions allows for finding genes whose differential expressions are not marginally detectable in univariate testing methods. Second, the statistic has a close relationship to discriminant analyses for classification of gene expression patterns. Our search algorithm sequentially maximizes gene expression difference/distance between two groups of genes. Including such a set of DEGs into initial feature variables may increase the power of classification rules. We validated our method by using a spike-in HGU95 dataset from Affymetrix. The utility of the new method was demonstrated by application to the analyses of gene expression patterns in human liver cancers and breast cancers. Extensive bioinformatics analyses and cross-validation of DEGs identified in the application datasets showed the significant advantages of our new algorithm.

14.
Paul TK, Iba H. Bio Systems. 2005;82(3):208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on signal-to-noise (S2N) ratio, unsupervised learning such as clustering and supervised learning such as the k-nearest neighbor (kNN) classifier are widely used. Instead of the S2N ratio, adaptive searches like the Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied for selection of a smaller gene subset that would classify patient samples more accurately. In this paper, we propose a new PMBGA-based method for identification of informative genes from microarray data. By applying our proposed method to classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.
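For reference, the S2N baseline mentioned above can be sketched in a few lines. One common definition scores each gene as (mean1 - mean2) / (sd1 + sd2) and ranks genes by the absolute score; the abstract does not state which variant was used, so this definition, the function names and the toy data are assumptions. The sketch also assumes each class has at least two samples and that no gene is constant in both classes.

```python
from statistics import mean, stdev


def s2n(values_a, values_b):
    """Signal-to-noise ratio of one gene between two classes."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes_by_s2n(expr, labels, top_k):
    """expr maps gene name -> expression values; labels holds 0/1 per sample."""
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, c in zip(values, labels) if c == 1]
        class0 = [v for v, c in zip(values, labels) if c == 0]
        scores[gene] = s2n(class1, class0)
    # Rank by magnitude so that up- and down-regulated genes are treated alike.
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:top_k]


# Hypothetical toy data: three genes measured on six samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],
    "g2": [2.0, 3.0, 1.0, 2.1, 2.9, 1.2],
    "g3": [0.0, 0.1, 0.2, 3.0, 3.5, 2.5],
}
print(rank_genes_by_s2n(expr, labels, top_k=2))
```

Because each gene is scored on its own, a filter of this kind is fast but cannot account for redundancy between genes, which is the gap that subset searches such as PMBGA aim to fill.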

15.
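Early diagnosis of cancer can significantly improve patient survival, and this is especially true for patients with hepatocellular carcinoma. Machine learning is an effective tool for cancer classification. Selecting a low-dimensional feature subset with high classification accuracy from complex, high-dimensional cancer datasets is a difficult problem in cancer classification. This paper proposes a two-stage feature selection method, SC-BPSO. A new filter method, the SC filter, is designed by combining the Spearman correlation coefficient and the chi-square test of independence as the filter's evaluation function; the SC filter is then combined with a wrapper method based on binary particle swarm optimization (BPSO) to achieve two-stage feature selection. The method is applied to cancer classification on high-dimensional data, distinguishing normal samples from hepatocellular carcinoma samples. First, microRNA sequencing data from 130 liver tissue samples (64 hepatocellular carcinoma, 66 normal liver tissue) obtained from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) are preprocessed, and the MiRME algorithm is used to extract three types of features from the raw sequence files: microRNA expression levels, editing levels, and post-editing expression levels. Next, the parameters of the SC-BPSO algorithm are tuned for the hepatocellular carcinoma classification setting and a key feature subset is selected. Finally, classification models are built and predictions are made; the results are compared with those obtained from feature subsets selected by an information gain filter, an information gain ratio filter, and a BPSO wrapper feature selection algorithm, using four classifiers (random forest, support vector machine, decision tree, and KNN) with identical parameters. The feature subset selected by SC-BPSO achieves a classification accuracy of up to 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO effectively finds feature subsets that are smaller and more accurate. This may be of considerable significance for cancer classification problems with few samples and high-dimensional data.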

16.
MOTIVATION: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method to select top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure. RESULTS: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select 'hub' genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than the univariate feature selection based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant type of models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show that false negative prognosis can be controlled via a cost-adjusted discriminant function. AVAILABILITY: R code for the ELDA algorithm is available from the author upon request.

17.
One frontier of modern statistical research is the problems arising from data sets with extremely large k (>1000) populations, e.g. microarray and neuroimaging data. For many such problems the focus shifts from testing for significance to selecting, filtering, or screening. Classical Ranking and Selection Methodology (RSM) studied the probability of correct selection (PCS). PCS is the probability that the "best" (t = 1) of k populations is truly selected, according to some specified criteria of best. This paper extends and adapts two selection goals from the RSM literature that are suitable for large k problems (d-best and G-best selection). It is then shown how estimation of PCS for selecting multiple (t > 1) populations with d-best and G-best selection can be implemented to provide a useful measure of the quality of a given selection. A simulation study and the application of the proposed method to a benchmark microarray data set show it is an effective and versatile tool for assessing the probability that a particular gene selection or gene filtering step truly obtains the best genes. Moreover, the proposed method is fully general and may be applied to any such extremely large k problem.

18.
Fu LM, Fu-Liu CS. FEBS Letters. 2004;561(1-3):186-190
Differential diagnosis among a group of histologically similar cancers poses a challenging problem in clinical medicine. Constructing a classifier based on gene expression signatures comprising multiple discriminatory molecular markers derived from microarray data analysis is an emerging trend for cancer diagnosis. To identify the best genes for classification using a small number of samples relative to the genome size remains the bottleneck of this approach, despite its promise. We have devised a new method of gene selection with reliability analysis, and demonstrated that this method can identify a more compact set of genes than other methods for constructing a classifier with optimum predictive performance for both small round blue cell tumors and leukemia. High consensus between our result and the results produced by methods based on artificial neural networks and statistical techniques confers additional evidence of the validity of our method. This study suggests a way for implementing a reliable molecular cancer classifier based on gene expression signatures.

19.
High-throughput biological technologies offer the promise of finding feature sets to serve as biomarkers for medical applications; however, the sheer number of potential features (genes, proteins, etc.) means that there needs to be massive feature selection, far greater than that envisioned in the classical literature. This paper considers performance analysis for feature-selection algorithms from two fundamental perspectives: How does the classification accuracy achieved with a selected feature set compare to the accuracy when the best feature set is used and what is the optimal number of features that should be used? The criteria manifest themselves in several issues that need to be considered when examining the efficacy of a feature-selection algorithm: (1) the correlation between the classifier errors for the selected feature set and the theoretically best feature set; (2) the regressions of the aforementioned errors upon one another; (3) the peaking phenomenon, that is, the effect of sample size on feature selection; and (4) the analysis of feature selection in the framework of high-dimensional models corresponding to high-throughput data.

20.
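Application of a genetic optimization algorithm to gene data classification   Total citations: 1; self-citations: 0; citations by others: 1
This paper proposes a genetic-algorithm-based feature extraction method for gene microarray data. The raw data are first standardized, analysis of variance is then used to reduce the dimensionality of the data, and finally a genetic algorithm is used to optimize the data. The genetic operators and fitness function are configured for gene data, and feature genes are selected from the optimized dataset to obtain a small feature subset. To validate the selected features, the samples are partitioned and a classifier built by discriminant analysis is used for evaluation. Experiments show that the method achieves good classification performance and that the algorithm is stable and efficient.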
