首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
基于SVM和平均影响值的人肿瘤信息基因提取   总被引:1,自引:0,他引:1       下载免费PDF全文
基于基因表达谱的肿瘤分类信息基因选取是发现肿瘤特异表达基因、探索肿瘤基因表达模式的重要手段。借助由基因表达谱获得的分类信息进行肿瘤诊断是当今生物信息学领域中的一个重要研究方向,有望成为临床医学上一种快速而有效的肿瘤分子诊断方法。鉴于肿瘤基因表达谱样本数据维数高、样本量小以及噪音大等特点,提出一种结合支持向量机应用平均影响值来寻找肿瘤信息基因的算法,其优点是能够搜索到基因数量尽可能少而分类能力尽可能强的多个信息基因子集。采用二分类肿瘤数据集验证算法的可行性和有效性,对于结肠癌样本集,只需3个基因就能获得100%的留一法交叉验证识别准确率。为避免样本集的不同划分对分类性能的影响,进一步采用全折交叉验证方法来评估各信息基因子集的分类性能,优选出更可靠的信息基因子集。与基它肿瘤分类方法相比,实验结果在信息基因数量以及分类性能方面具有明显的优势。  相似文献   

2.
Hong H  Tong W  Perkins R  Fang H  Xie Q  Shi L 《DNA and cell biology》2004,23(10):685-694
The wealth of knowledge imbedded in gene expression data from DNA microarrays portends rapid advances in both research and clinic. Turning the prodigious and noisy data into knowledge is a challenge to the field of bioinformatics, and development of classifiers using supervised learning techniques is the primary methodological approach for clinical application using gene expression data. In this paper, we present a novel classification method, multiclass Decision Forest (DF), that is the direct extension of the two-class DF previously developed in our lab. Central to DF is the synergistic combining of multiple heterogenic but comparable decision trees to reach a more accurate and robust classification model. The computationally inexpensive multiclass DF algorithm integrates gene selection and model development, and thus eliminates the bias of gene preselection in crossvalidation. Importantly, the method provides several statistical means for assessment of prediction accuracy, prediction confidence, and diagnostic capability. We demonstrate the method by application to gene expression data for 83 small round blue-cell tumors (SRBCTs) samples belonging to one of four different classes. Based on 500 runs of 10-fold crossvalidation, tumor prediction accuracy was approximately 97%, sensitivity was approximately 95%, diagnostic sensitivity was approximately 91%, and diagnostic accuracy was approximately 99.5%. Among 25 genes selected to distinguish tumor class, 12 have functional information in the literature implicating their involvement in cancer. The four types of SRBCTs samples are also distinguishable in a clustering analysis based on the expression profiles of these 25 genes. The results demonstrated that the multiclass DF is an effective classification method for analysis of gene expression data for the purpose of molecular diagnostics.  相似文献   

3.
Although it is increasingly evident that cancer is influenced by signals emanating from tumor stroma, little is known regarding how changes in stromal gene expression affect epithelial tumor progression. We used laser capture microdissection to compare gene expression profiles of tumor stroma from 53 primary breast tumors and derived signatures strongly associated with clinical outcome. We present a new stroma-derived prognostic predictor (SDPP) that stratifies disease outcome independently of standard clinical prognostic factors and published expression-based predictors. The SDPP predicts outcome in several published whole tumor-derived expression data sets, identifies poor-outcome individuals from multiple clinical subtypes, including lymph node-negative tumors, and shows increased accuracy with respect to previously published predictors, especially for HER2-positive tumors. Prognostic power increases substantially when the predictor is combined with existing outcome predictors. Genes represented in the SDPP reveal the strong prognostic capacity of differential immune responses as well as angiogenic and hypoxic responses, highlighting the importance of stromal biology in tumor progression.  相似文献   

4.
Tissue classification with gene expression profiles.   总被引:29,自引:0,他引:29  
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor(s) and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately equal to 100,000 clones, measured in 32 ovarian samples (unpublished extension of data set described in Schummer et al. (1999)). The third set consists of approximately equal to 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al, 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.  相似文献   

5.
6.
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used, contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm perform slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.  相似文献   

7.
8.
Yang H  Cheng C  Zhang W 《PloS one》2011,6(11):e27579

Background

Deregulation of biological pathways has been shown to be involved in the turmorigenesis of a variety of cancers. The co-regulation of pathways in tumor and normal tissues has not been studied in a systematic manner.

Results

In this study we propose a novel statistic named AR-score (average rank based score) to measure pathway activities based on microarray gene expression profiles. We calculate and compare the AR-scores of pathways in microarray datasets containing expression profiles for a wide range of cancer types as well as the corresponding normal tissues. We find that many pathways undergo significant activity changes in tumors with respect to normal tissues. AR-scores for a small subset of pathways are capable of distinguishing tumor from normal tissues or classifying tumor subtypes. In normal tissues many pathways are highly correlated in their activities, whereas their correlations reduce significantly in tumors and cancer cell lines. The co-expression of genes in the same pathways was also significantly perturbed in tumors.

Conclusions

The co-regulation of genes in the same pathways and co-regulation of different pathways are significantly perturbed in tumors versus normal tissues. Our method provides a useful tool for better understanding the mechanistic changes in tumors, which can also be used for exploring other biological problems.  相似文献   

9.
The NCI60 database is the largest available collection of compounds with measured anti-cancer activity. The strengths and limitations for using the NCI60 database as a source of new anti-cancer agents are explored and discussed in relation to previous studies. We selected a sub-set of 2333 compounds with reliable experimental half maximum growth inhibitions (GI(50)) values for 30 cell lines from the NCI60 data set and evaluated their growth inhibitory effect (chemosensitivity) with respect to tissue of origin. This was done by identifying natural clusters in the chemosensitivity data set and in a data set of expression profiles of 1901 genes for the corresponding tumor cell lines. Five clusters were identified based on the gene expression data using self-organizing maps (SOM), comprising leukemia, melanoma, ovarian and prostate, basal breast, and luminal breast cancer cells, respectively. The strong difference in gene expression between basal and luminal breast cancer cells was reflected clearly in the chemosensitivity data. Although most compounds in the data set were of low potency, high efficacy compounds that showed specificity with respect to tissue of origin could be found. Furthermore, eight potential topoisomerase II inhibitors were identified using a structural similarity search. Finally, a set of genes with expression profiles that were significantly correlated with anti-cancer drug activity was identified. Our study demonstrates that the combined data sets, which provide comprehensive information on drug activity and gene expression profiles of tumor cell lines studied, are useful for identifying potential new active compounds.  相似文献   

10.
Cluster-Rasch models for microarray gene expression data   总被引:1,自引:0,他引:1  
Li H  Hong F 《Genome biology》2001,2(8):research0031.1-research003113

Background

We propose two different formulations of the Rasch statistical models to the problem of relating gene expression profiles to the phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type.

Results

We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes.

Conclusions

The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.  相似文献   

11.
Tumor classification is a well-studied problem in the field of bioinformatics. Developments in the field of DNA chip design have now made it possible to measure the expression levels of thousands of genes in sample tissue from healthy cell lines or tumors. A number of studies have examined the problems of tumor classification: class discovery, the problem of defining a number of classes of tumors using the data from a DNA chip, and class prediction, the problem of accurately classifying an unknown tumor, given expression data from the unknown tumor and from a learning set. The current work has applied phylogenetic methods to both problems. To solve the class discovery problem, we impose a metric on a set of tumors as a function of their gene expression levels, and impose a tree structure on this metric, using standard tree fitting methods borrowed from the field of phylogenetics. Phylogenetic methods provide a simple way of imposing a clear hierarchical relationship on the data, with branch lengths in the classification tree representing the degree of separation witnessed. We tested our method for class discovery on two data sets: a data set of 87 tissues, comprised mostly of small, round, blue-cell tumors (SRBCTs), and a data set of 22 breast tumors. We fit the 87 samples of the first set to a classification tree, which neatly separated into four major clusters corresponding exactly to the four groups of tumors, namely neuroblastomas, rhabdomyosarcomas, Burkitt's lymphomas, and the Ewing's family of tumors. The classification tree built using the breast cancer data separated tumors with BRCA1 mutations from those with BRCA2 mutations, with sporadic tumors separated from both groups and from each other. We also demonstrate the flexibility of the class discovery method with regard to standard resampling methodology such as jackknifing and noise perturbation. To solve the class prediction problem, we built a classification tree on the learning set, and then sought the optimal placement of each test sample within the classification tree. We tested this method on the SRBCT data set, and classified each tumor successfully.  相似文献   

12.
癌症基因表达谱挖掘中的特征基因选择算法GA/WV   总被引:1,自引:0,他引:1  
鉴定癌症表达谱的特征基因集合可以促进癌症类型分类的研究,这也可能使病人获得更好的临床诊断?虽然一些方法在基因表达谱分析上取得了成功,但是用基因表达谱数据进行癌症分类研究依然是一个巨大的挑战,其主要原因在于缺少通用而可靠的基因重要性评估方法。GA/WV是一种新的用复杂的生物表达数据评估基因分类重要性的方法,通过联合遗传算法(GA)和加权投票分类算法(WV)得到的特征基因集合不但适用于WV分类器,也适用于其它分类器?将GA/WV方法用癌症基因表达谱数据集的验证,结果表明本方法是一种成功可靠的特征基因选择方法。  相似文献   

13.
14.
MOTIVATION: Temporal gene expression profiles provide an important characterization of gene function, as biological systems are predominantly developmental and dynamic. We propose a method of classifying collections of temporal gene expression curves in which individual expression profiles are modeled as independent realizations of a stochastic process. The method uses a recently developed functional logistic regression tool based on functional principal components, aimed at classifying gene expression curves into known gene groups. The number of eigenfunctions in the classifier can be chosen by leave-one-out cross-validation with the aim of minimizing the classification error. RESULTS: We demonstrate that this methodology provides low-error-rate classification for both yeast cell-cycle gene expression profiles and Dictyostelium cell-type specific gene expression patterns. It also works well in simulations. We compare our functional principal components approach with a B-spline implementation of functional discriminant analysis for the yeast cell-cycle data and simulations. This indicates comparative advantages of our approach which uses fewer eigenfunctions/base functions. The proposed methodology is promising for the analysis of temporal gene expression data and beyond. AVAILABILITY: MATLAB programs are available upon request.  相似文献   

15.
16.
17.
王蕊平  王年  苏亮亮  陈乐 《生物信息学》2011,9(2):164-166,170
海量数据的存在是现代信息社会的一大特点,如何在成千上万的基因中有效地选出样本的分类特征对癌症的诊治具有重要意义。采用局部非负矩阵分解方法对癌症基因表达谱数据进行特征提取。首先对基因表达谱数据进行筛选,然后构造局部非负矩阵并对其进行分解得到维数低、能充分表征样本的特征向量,最后用支持向量机对特征向量进行分类。结果表明该方法的可行性和有效性。  相似文献   

18.
19.
We show here an example of the application of a novel method, MUTIC (model utilization-based clustering), used for identifying complex interactions between genes or gene categories based on gene expression data. The method deals with binary categorical data which consist of a set of gene expression profiles divided into two biologically meaningful categories. It does not require data from multiple time points. Gene expression profiles are represented by feature vectors whose component features are either gene expression values, or averaged expression values corresponding to gene ontology or protein information resource categories. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models distinguishing the two categories based on the feature vectors corresponding to their members. Each feature is associated with a "model utilization vector", which has an entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of model utilization-based clusters, in which features are gathered together if they are often considered together by classification models - which may be because they are co-expressed, or may be for subtler reasons involving multi-gene interactions. The MUTIC method is illustrated here by applying it to a dataset regarding gene expression in prostate cancer and control samples. Compared to traditional expression-based clustering, MUTIC yields clusters that have higher mathematical quality (in the sense of homogeneity and separation) and that also yield novel insights into the underlying biological processes.  相似文献   

20.
Both microRNA (miRNA) and mRNA expression profiles are important methods for cancer type classification. A comparative study of their classification performance will be helpful in choosing the means of classification. Here we evaluated the classification performance of miRNA and mRNA profiles using a new data mining approach based on a novel SVM (Support Vector Machines) based recursive fea- ture elimination (nRFE) algorithm. Computational experiments showed that information encoded in miRNAs is not sufficient to classify cancers; gut-derived samples cluster more accurately when using mRNA expression profiles compared with using miRNA profiles; and poorly differentiated tumors (PDT) could be classified by mRNA expression profiles at the accuracy of 100% versus 93.8% when using miRNA profiles. Furthermore, we showed that mRNA expression profiles have higher capacity in normal tissue classifications than miRNA. We concluded that classification performance using mRNA profiles is superior to that of miRNA profiles in multiple-class cancer classifications.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号