Similar literature
 20 similar articles found.
1.
Paul TK  Iba H 《Bio Systems》2005,82(3):208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on the signal-to-noise (S2N) ratio, unsupervised learning methods such as clustering and supervised learning methods such as the k-nearest neighbor (kNN) classifier are widely used. Instead of the S2N ratio, adaptive searches such as the Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied to select a smaller gene subset that classifies patient samples more accurately. In this paper, we propose a new PMBGA-based method for identification of informative genes from microarray data. By applying our proposed method to classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.
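
The S2N-based gene ranking mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming the commonly used definition S2N = (mean_a - mean_b) / (sd_a + sd_b) for a two-class problem; the function names and the toy expression values are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def s2n_ratio(class_a, class_b):
    """Signal-to-noise ratio of one gene: (mean_a - mean_b) / (sd_a + sd_b).

    Assumes each class has at least two samples and nonzero spread.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes_by_s2n(expression, labels, top_k):
    """Return the top_k gene names ordered by absolute S2N ratio.

    expression: dict mapping gene name -> list of values (one per sample)
    labels:     list of 0/1 class labels, aligned with the samples
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(s2n_ratio(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented for illustration only.
expression = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # well separated between classes
    "gene_b": [2.0, 3.0, 2.5, 2.4, 2.9, 2.1],  # overlapping between classes
}
labels = [0, 0, 0, 1, 1, 1]
top = rank_genes_by_s2n(expression, labels, top_k=1)
```

A filter of this kind scores each gene independently, which is why search-based methods such as the one proposed in the abstract can find smaller subsets.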

2.
A random forest method has been selected to perform both gene selection and classification of the microarray data. In this embedded method, the selection of the smallest possible sets of genes with the lowest error rates is the key factor in achieving the highest classification accuracy. Hence, an improved gene selection method using random forest has been proposed to obtain the smallest subset of genes as well as the biggest subset of genes prior to classification. The option for biggest subset selection is provided to assist researchers who intend to use the informative genes for further research. Enhanced random forest gene selection has performed better in terms of selecting the smallest subset as well as the biggest subset of informative genes with the lowest out-of-bag error rates through gene selection. Furthermore, the classification performed on the selected subset of genes using random forest has led to lower prediction error rates compared to the existing method and other similar available methods.

3.
Because of high dimensionality, machine learning algorithms typically rely on feature selection techniques in order to perform effective classification in microarray gene expression data sets. However, the large number of features compared to the number of samples makes the task of feature selection computationally hard and prone to errors. This paper interprets feature selection as a task of stochastic optimization, where the goal is to select among an exponential number of alternative gene subsets the one expected to return the highest generalization in classification. Blocking is an experimental design strategy which produces similar experimental conditions to compare alternative stochastic configurations in order to be confident that observed differences in accuracy are due to actual differences rather than to fluctuations and noise effects. We propose an original blocking strategy for improving feature selection which aggregates in a paired way the validation outcomes of several learning algorithms to assess a gene subset and compare it to others. This is a novelty with respect to conventional wrappers, which commonly adopt a sole learning algorithm to evaluate the relevance of a given set of variables. The rationale of the approach is that, by increasing the amount of experimental conditions under which we validate a feature subset, we can lessen the problems related to the scarcity of samples and consequently come up with a better selection. The paper shows that the blocking strategy significantly improves the performance of a conventional forward selection for a set of 16 publicly available cancer expression data sets. The experiments involve six different classifiers and show that improvements take place independent of the classification algorithm used after the selection step. 
Two further validations based on available biological annotation support the claim that blocking strategies in feature selection may improve the accuracy and the quality of the solution. The first validation is based on retrieving PubMed abstracts associated with the selected genes and matching them to regular expressions describing the biological phenomenon underlying the expression data sets. The biological validation that follows is based on the use of the Bioconductor package GOstats in order to perform Gene Ontology statistical analysis.

4.
5.
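Early diagnosis of cancer can significantly improve patient survival, and this is especially evident in patients with hepatocellular carcinoma. Machine learning is an effective tool for cancer classification. Selecting a low-dimensional feature subset with high classification accuracy from complex, high-dimensional cancer data sets is a difficult problem in cancer classification. This paper proposes a two-stage feature selection method, SC-BPSO. A new filter method, the SC filter, is designed by combining the Spearman correlation coefficient and the chi-square test of independence as the filter's evaluation function; the SC filter is then combined with a wrapper method based on binary particle swarm optimization (BPSO) to achieve two-stage feature selection. The method is applied to cancer classification with high-dimensional data, distinguishing normal samples from hepatocellular carcinoma samples. First, 130 liver tissue microRNA sequencing data sets (64 hepatocellular carcinoma, 66 normal liver tissue) from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) were preprocessed, and the MiRME algorithm was used to extract three types of features from the raw sequence files: microRNA expression level, editing level, and post-editing expression level. Next, the parameters of the SC-BPSO algorithm were tuned for the hepatocellular carcinoma classification setting and a key feature subset was selected. Finally, classification models were built, and the results were compared with those of feature subsets selected by an information gain filter, an information gain ratio filter, and a BPSO wrapper feature selection algorithm, using four classifiers with identical parameters: random forest, support vector machine, decision tree, and KNN. The feature subset selected by SC-BPSO achieved a classification accuracy of up to 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO can effectively find feature subsets that are smaller and more accurate. This may be of considerable importance for cancer classification problems with few samples and high-dimensional data.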
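
The wrapper stage of a method like this relies on binary particle swarm optimization, where each particle is a 0/1 mask over the features. The sketch below shows one standard BPSO position update with a sigmoid transfer function. It is an illustration under stated assumptions, not the SC-BPSO implementation from the paper: the inertia weight `w`, the acceleration constants `c1` and `c2`, the velocity clamp `v_max`, and all names are hypothetical choices.

```python
import math
import random


def bpso_step(position, velocity, personal_best, global_best, rng,
              w=0.7, c1=1.5, c2=1.5, v_max=4.0):
    """Perform one BPSO update for a single particle.

    position, personal_best, global_best: lists of 0/1 bits (1 = feature kept)
    velocity: list of floats, one per feature
    rng: a random.Random instance, passed in for reproducibility
    Returns (new_position, new_velocity).
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Velocity update: inertia plus pulls toward personal and global bests.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))          # clamp (assumed setting)
        prob_one = 1.0 / (1.0 + math.exp(-v))   # sigmoid transfer function
        new_velocity.append(v)
        new_position.append(1 if rng.random() < prob_one else 0)
    return new_position, new_velocity


rng = random.Random(0)
position = [1, 0, 1, 0, 1]
velocity = [0.0] * 5
new_position, new_velocity = bpso_step(
    position, velocity,
    personal_best=[1, 1, 0, 0, 1],
    global_best=[0, 1, 1, 0, 1],
    rng=rng,
)
```

In a full wrapper, each new bit mask would be scored by training a classifier on the selected features, and the personal and global bests would be updated from those scores.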

6.
Accurate molecular classification of cancer using simple rules   Cited by: 1 (self-citations: 0, citations by others: 1)

Background

One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.

Methods

We screened a small number of informative single genes and gene pairs on the basis of their dependency degrees, as defined in rough set theory. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.

Results

We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.

Conclusion

In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, can achieve an ideal cancer classification result. This finding also means that very simple rules may perform well for cancer class prediction.
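
Leave-one-out cross-validation of a one-gene rule, as used in the Methods above, can be sketched as follows. Note that this example swaps in a simple midpoint threshold rule in place of the rough-set decision rules used in the paper; the function names and toy expression values are hypothetical.

```python
from statistics import mean


def fit_rule(values, labels):
    """Fit a one-gene rule: threshold midway between the two class means."""
    m0 = mean(v for v, y in zip(values, labels) if y == 0)
    m1 = mean(v for v, y in zip(values, labels) if y == 1)
    threshold = (m0 + m1) / 2
    high_class = 1 if m1 > m0 else 0  # class predicted above the threshold
    return threshold, high_class


def predict(rule, value):
    threshold, high_class = rule
    return high_class if value > threshold else 1 - high_class


def loocv_accuracy(values, labels):
    """Hold out each sample once, refit the rule on the rest, and score it."""
    correct = 0
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        rule = fit_rule(train_values, train_labels)
        correct += predict(rule, values[i]) == labels[i]
    return correct / len(values)


# Toy expression values for a single gene, invented for illustration.
values = [0.2, 0.4, 0.3, 2.1, 1.9, 2.4]
labels = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(values, labels)
```

Refitting the rule inside each fold matters: choosing the threshold on all samples first would leak the held-out sample into training.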

7.
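GA/WV: a feature gene selection algorithm for mining cancer gene expression profiles   Cited by: 1 (self-citations: 0, citations by others: 1)
Identifying feature gene sets in cancer expression profiles can advance research on cancer type classification, and may also give patients better clinical diagnoses. Although some methods have succeeded in gene expression profile analysis, cancer classification based on gene expression profile data remains a major challenge, mainly because a general and reliable method for assessing gene importance is lacking. GA/WV is a new method for assessing the importance of genes for classification using complex biological expression data. The feature gene set obtained by combining a genetic algorithm (GA) with the weighted voting (WV) classification algorithm is suitable not only for the WV classifier but also for other classifiers. Validation of the GA/WV method on cancer gene expression profile data sets shows that it is a successful and reliable feature gene selection method.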

8.
9.
Feature selection from DNA microarray data is a major challenge because expression data are high-dimensional. The number of samples in a microarray data set is much smaller than the number of genes, so the raw data are poorly suited for use as the training set of a classifier. It is therefore important to select features prior to training the classifier. Only a small subset of genes in the data set exhibits a strong correlation with the class, and finding these relevant genes is often non-trivial. Thus there is a need to develop robust and reliable methods for gene finding in expression data. We describe the use of several hybrid feature selection approaches for gene finding in expression data. These approaches include filtering (filter out the best genes from the data set) and wrapper (best subset of genes from the data set) phases. The methods use information gain (IG) and Pearson Product Moment Correlation (PPMC) as the filtering parameters and biogeography-based optimization (BBO) as the wrapper approach. The K nearest neighbour algorithm (KNN) and a back-propagation neural network are used for evaluating the fitness of gene subsets during feature selection. Our analysis shows that the IG-BBO-KNN combination performs impressively across different data sets, with high accuracy (>90%) and a low error rate.
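
The information gain (IG) filter named in this abstract scores a gene by how much knowing its (discretized) expression level reduces the entropy of the class label: IG = H(Y) - H(Y | X). The sketch below is a minimal illustration; it assumes expression values have already been binned into discrete levels, and the bin labels and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins, labels):
    """IG = H(Y) - H(Y | X) for one discretized feature X and labels Y."""
    n = len(labels)
    conditional = 0.0
    for level in set(feature_bins):
        subset = [y for x, y in zip(feature_bins, labels) if x == level]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


labels = [0, 0, 1, 1]
informative = ["lo", "lo", "hi", "hi"]     # bins align with the classes
uninformative = ["lo", "hi", "lo", "hi"]   # bins are independent of the classes
ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
```

Genes would then be ranked by IG and the top-scoring ones passed on to the wrapper phase.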

10.
Tclass: tumor classification system based on gene expression profile   Cited by: 9 (self-citations: 0, citations by others: 9)
A method that incorporates feature selection into Fisher's linear discriminant analysis for gene expression based tumor classification and a corresponding program Tclass were developed. The proposed method was applied to a public gene expression data set for colon cancer that consists of 22 normal and 40 tumor colon tissue samples to evaluate its performance for classification. Preliminary results demonstrated that using only a subset of genes ranging from 3 to 10 can achieve high classification accuracy.

11.
Considering the two-class classification problem in brain imaging data analysis, we propose a sparse representation-based multi-variate pattern analysis (MVPA) algorithm to localize brain activation patterns corresponding to different stimulus classes/brain states respectively. Feature selection can be modeled as a sparse representation (or sparse regression) problem. This technique has been successfully applied to voxel selection in fMRI data analysis. However, a single selection based on sparse representation or other methods tends to yield only a subset of the most informative features rather than all of them. Herein, our proposed algorithm recursively eliminates informative features selected by a sparse regression method until the decoding accuracy based on the remaining features drops to a threshold close to chance level. In this way, the resultant feature set including all the identified features is expected to involve all the informative features for discrimination. According to the signs of the sparse regression weights, these selected features are separated into two sets corresponding to two stimulus classes/brain states. Next, in order to remove irrelevant/noisy features in the two selected feature sets, we perform a nonparametric permutation test at the individual subject level or the group level. In data analysis, we verified our algorithm with a toy data set and an intrinsic signal optical imaging data set. The results show that our algorithm has accurately localized two class-related patterns. As an application example, we used our algorithm on a functional magnetic resonance imaging (fMRI) data set. Two sets of informative voxels, corresponding to two semantic categories (i.e., “old people” and “young people”), respectively, are obtained in the human brain.

12.
Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It rests on the assumption that a smaller "filter-out" factor in the SVM-RFE, which results in a smaller number of gene features eliminated in each recursion, should lead to extraction of a better gene subset. Because the SVM-RFE is highly sensitive to the "filter-out" factor, our simulations have shown that this assumption is not always correct and that the SVM-RFE is an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and other applications, a new two-stage SVM-RFE algorithm has been developed. It is designed to effectively eliminate most of the irrelevant, redundant and noisy genes while keeping information loss small at the first stage. A fine selection for the final gene subset is then performed at the second stage. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE to achieve better algorithm utility. We have demonstrated that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods based on our analysis of three publicly available microarray expression datasets. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is $O(d \log_2 d)$, where $d$ is the size of the original gene set.
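
To make the role of the "filter-out" factor concrete, the sketch below computes how many features survive each recursion of a generic recursive feature elimination loop. It assumes the factor is the fraction of the remaining features dropped per recursion (with at least one feature dropped each time); that interpretation, the function name, and the parameters are assumptions for illustration, and no SVM is trained here.

```python
def rfe_schedule(n_features, filter_out, n_keep):
    """Return the number of features remaining after each recursion.

    n_features: size of the original feature set
    filter_out: fraction of remaining features dropped per recursion (assumed)
    n_keep:     stop once this many features remain
    """
    counts = [n_features]
    while counts[-1] > n_keep:
        remaining = counts[-1]
        drop = max(1, int(remaining * filter_out))  # always drop at least one
        counts.append(max(n_keep, remaining - drop))
    return counts


halving = rfe_schedule(16, 0.5, 1)   # aggressive elimination, few recursions
gentle = rfe_schedule(16, 0.1, 1)    # small factor, many recursions
```

With a factor of 0.5 the loop needs only about log2(d) recursions, while a small factor approaches one recursion per feature, which is the trade-off between cost and selection granularity that the abstract discusses.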

13.
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used contains over 16 000 genes, and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm performs slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.

14.
《Genomics》2020,112(1):114-126
Gene expression data are expected to make a great contribution to efficient cancer diagnosis and prognosis. Gene expression data comprise a large number of measured genes, and only a few of them carry useful information for distinguishing different classes of samples. Recently, several researchers proposed gene selection methods based on metaheuristic algorithms for analysing and interpreting gene expression data. However, due to the large number of genes relative to the limited number of patient samples and the complex interactions between genes, many gene selection methods have difficulty identifying the most relevant and reliable genes. Hence, in this paper, a hybrid filter/wrapper method called rMRMR-MBA is proposed for the gene selection problem. In this method, robust Minimum Redundancy Maximum Relevancy (rMRMR) serves as the filter to select the most promising genes, and a modified bat algorithm (MBA) serves as the search engine in the wrapper approach to identify a small set of informative genes. The performance of the proposed method has been evaluated using ten gene expression datasets. For performance evaluation, MBA is evaluated by studying its convergence behaviour with and without TRIZ optimisation operators. For comparative evaluation, the results of the proposed rMRMR-MBA were compared against ten state-of-the-art methods using the same datasets. The comparative study demonstrates that the proposed method produced better results in terms of classification accuracy and number of selected genes in two out of ten datasets and competitive results on the remaining datasets. In a nutshell, the proposed method is able to produce very promising results with high classification accuracy, which can be considered a promising contribution to the gene selection domain.

15.

Background  

With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes for, e.g., disease diagnosis. Several widely used gene selection methods often select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these gene selection methods is that they may result in gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy.

16.
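This study investigates the risk features and classification prediction models for hepatitis B virus (HBV) reactivation after precise radiotherapy in patients with primary liver cancer. A feature selection method based on a genetic algorithm is proposed to select the optimal feature subset for HBV reactivation from the initial feature set of primary liver cancer data. Bayesian and support vector machine classification models for HBV reactivation are built, and the classification performance of the optimal feature subset and of the initial feature set is evaluated. The experimental results show that genetic-algorithm-based feature selection improves HBV reactivation classification performance, and that the optimal feature subset performs markedly better than the initial feature set. The optimal feature subset affecting HBV reactivation comprises HBV DNA level, TNM tumor stage, Child-Pugh, outer margin, and maximum whole-liver dose. The Bayesian classifier reaches a classification accuracy of up to 82.89%, and the support vector machine up to 83.34%.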

17.
Fung ES  Ng MK 《Bioinformation》2007,2(5):230-234
One of the applications of discriminant analysis on microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and diagnosis of cancer subtypes. The main contribution of this paper is to propose a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An $\ell_2$-$\ell_1$ norm minimization method is incorporated into the discriminant process to automatically compute the weights of all genes in the samples. The experiments on two microarray data sets have shown that the new algorithm can generate classification results as good as other classification methods, and effectively determine relevant genes for classification purposes. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

18.
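Extraction of human tumor informative genes based on SVM and mean impact value   Cited by: 1 (self-citations: 0, citations by others: 1)
Selecting informative genes for tumor classification from gene expression profiles is an important means of discovering tumor-specific expressed genes and exploring tumor gene expression patterns. Tumor diagnosis using classification information derived from gene expression profiles is an important research direction in bioinformatics, and is expected to become a fast and effective method of molecular tumor diagnosis in clinical medicine. Given that tumor gene expression profile data are high-dimensional, have small sample sizes, and are noisy, an algorithm is proposed that combines the support vector machine with the mean impact value to search for tumor informative genes. Its advantage is that it can find multiple informative gene subsets that contain as few genes as possible while having the strongest possible classification ability. Binary-class tumor data sets are used to verify the feasibility and effectiveness of the algorithm; for the colon cancer sample set, only 3 genes are needed to obtain 100% recognition accuracy under leave-one-out cross-validation. To avoid the influence of different partitions of the sample set on classification performance, a full-fold cross-validation method is further used to assess the classification performance of each informative gene subset and to select more reliable subsets. Compared with other tumor classification methods, the experimental results show clear advantages in both the number of informative genes and classification performance.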

19.
Microarrays have thousands to tens-of-thousands of gene features, but only a few hundred patient samples are available. The fundamental problem in microarray data analysis is identifying genes whose disruption causes congenital or acquired disease in humans. In this paper, we propose a new evolutionary method that can efficiently select a subset of potentially informative genes for support vector machine (SVM) classifiers. The proposed evolutionary method uses SVM with a given subset of gene features to evaluate the fitness function, and new subsets of features are selected based on the estimates of generalization error of SVMs and frequency of occurrence of the features in the evolutionary approach. Thus, in theory, selected genes reflect to some extent the generalization performance of SVM classifiers. We compare our proposed method with several existing methods and find that the proposed method can obtain better classification accuracy with a smaller number of selected genes than the existing methods.

20.
Discrimination of disease patients based on gene expression data is a crucial problem in the clinical area. An important step in solving this problem is to find a discriminative subset of genes from thousands of genes on a microarray or DNA chip. Aiming at finding informative genes for disease classification on microarrays, we present a gene selection method based on the forward variable (gene) selection method (FSM) and show, using typical public microarray datasets, that our method can extract a small set of genes that are crucial for discriminating different classes with very high accuracy, close to perfect classification.
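
Greedy forward selection, the general technique behind FSM, can be sketched as follows. This is a generic illustration rather than the authors' method: the scoring function is supplied by the caller (in practice it might be cross-validated classification accuracy), and the toy scorer, gene names, and parameters below are hypothetical.

```python
def forward_selection(candidates, score, max_features):
    """Greedily add the gene that most improves score(subset).

    candidates:   list of gene names
    score:        callable taking a list of gene names, returning a float
    max_features: upper bound on the subset size
    Stops early when no remaining gene improves the current best score.
    """
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        best_gene = None
        for gene in candidates:
            if gene in selected:
                continue
            trial_score = score(selected + [gene])
            if trial_score > best_score:
                best_score, best_gene = trial_score, gene
        if best_gene is None:  # no candidate improved the score
            break
        selected.append(best_gene)
    return selected, best_score


# Toy scorer, invented for illustration: each gene contributes some
# "information", and every selected gene costs a fixed size penalty.
information = {"g1": 0.6, "g2": 0.3, "g3": 0.05}


def toy_score(subset):
    return sum(information[g] for g in subset) - 0.1 * len(subset)


selected, best = forward_selection(list(information), toy_score, max_features=3)
```

Here the search stops after two genes because adding the third lowers the score, which mirrors how forward selection can end with a small subset.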
