Similar literature
1.
MOTIVATION: One important application of gene expression microarray data is classification of samples into categories, such as the type of tumor. The use of microarrays allows simultaneous monitoring of thousands of gene expressions per sample. This ability to measure gene expression en masse has resulted in data with the number of variables p (genes) far exceeding the number of samples N. Standard statistical methodologies in classification and prediction do not work well or even at all when N < p. Modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data. RESULTS: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition when PCA particularly fails to predict well relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell-Lung-Carcinoma (NSCLC) versus renal samples. Stability of classification results and methods were further assessed by re-randomization studies.

2.
Sliced inverse regression with regularizations   Total citations: 2, self-citations: 0, citations by others: 2
Li L, Yin X. Biometrics, 2008, 64(1): 124-131
Summary. In high-dimensional data analysis, sliced inverse regression (SIR) has proven to be an effective dimension reduction tool and has enjoyed wide applications. The usual SIR, however, cannot work with problems where the number of predictors, p, exceeds the sample size, n, and can suffer when there is high collinearity among the predictors. In addition, the reduced dimensional space consists of linear combinations of all the original predictors and no variable selection is achieved. In this article, we propose a regularized SIR approach based on the least-squares formulation of SIR. The L2 regularization is introduced, and an alternating least-squares algorithm is developed, to enable SIR to work with n < p and highly correlated predictors. The L1 regularization is further introduced to achieve simultaneous reduction estimation and predictor selection. Both simulations and the analysis of a microarray expression data set demonstrate the usefulness of the proposed method.

3.
Gene selection: a Bayesian variable selection approach   Total citations: 13, self-citations: 0, citations by others: 13
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form and we need to use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data. SUPPLEMENTARY INFORMATION: http://stat.tamu.edu/people/faculty/bmallick.html.

4.
We propose a general framework for prediction of predefined tumor classes using gene expression profiles from microarray experiments. The framework consists of 1) evaluating the appropriateness of class prediction for the given data set, 2) selecting the prediction method, 3) performing cross-validated class prediction, and 4) assessing the significance of prediction results by permutation testing. We describe an application of the prediction paradigm to gene expression profiles from human breast cancers, with specimens classified as positive or negative for BRCA1 mutations and also for BRCA2 mutations. In both cases, the accuracy of class prediction was statistically significant when compared to the accuracy of prediction expected by chance. The framework proposed here for the application of class prediction is designed to reduce the occurrence of spurious findings, a legitimate concern for high-dimensional microarray data. The prediction paradigm will serve as a good framework for comparing different prediction methods and may accelerate the development of molecular classifiers that are clinically useful.
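
Steps 3 and 4 of the framework above (cross-validated class prediction followed by permutation testing) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid classifier, the leave-one-out scheme, the function names `loocv_accuracy` and `permutation_p_value`, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random


def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (illustrative stand-in)."""
    correct = 0
    for i in range(len(X)):
        # Accumulate per-class sums and counts from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, value in enumerate(row):
                acc[k] += value
            counts[label] = counts.get(label, 0) + 1

        def sq_dist(label):
            # Squared Euclidean distance from the held-out sample to a class centroid.
            return sum((X[i][k] - s / counts[label]) ** 2
                       for k, s in enumerate(sums[label]))

        predicted = min(sums, key=sq_dist)
        correct += predicted == y[i]
    return correct / len(X)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """Compare the observed accuracy with accuracies obtained under shuffled labels."""
    rng = random.Random(seed)
    observed = loocv_accuracy(X, y)
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loocv_accuracy(X, labels) >= observed:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two well-separated groups of four samples, two "genes" each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
y = ["A"] * 4 + ["B"] * 4
observed, p_value = permutation_p_value(X, y)
print(observed, p_value)
```

A small p-value here only says that the cross-validated accuracy is unlikely under random labelling; it does not by itself show that the classifier is clinically useful.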

5.
Microarray technology is becoming a powerful tool for clinical diagnosis, as it has potential to discover gene expression patterns that are characteristic for a particular disease. To date, this possibility has received much attention in the context of cancer research, especially in tumor classification. However, most published articles have concentrated on the development of binary classification methods while neglecting ubiquitous multiclass problems. Unfortunately, the few multiclass classification approaches available have had poor predictive accuracy. In an effort to improve classification accuracy, we developed a novel multiclass microarray data classification method. First, we applied a "one versus rest-support vector machine" to classify the samples. Then the classification confidence of each testing sample was evaluated according to its distribution in feature space and some with poor confidence were extracted. Next, a novel strategy, which we named "class priority estimation method based on centroid distance", was used to make decisions about categories for those poor confidence samples. This approach was tested on seven benchmark multiclass microarray datasets, with encouraging results, demonstrating effectiveness and feasibility.

6.
Tumor classification is a well-studied problem in the field of bioinformatics. Developments in the field of DNA chip design have now made it possible to measure the expression levels of thousands of genes in sample tissue from healthy cell lines or tumors. A number of studies have examined the problems of tumor classification: class discovery, the problem of defining a number of classes of tumors using the data from a DNA chip, and class prediction, the problem of accurately classifying an unknown tumor, given expression data from the unknown tumor and from a learning set. The current work has applied phylogenetic methods to both problems. To solve the class discovery problem, we impose a metric on a set of tumors as a function of their gene expression levels, and impose a tree structure on this metric, using standard tree fitting methods borrowed from the field of phylogenetics. Phylogenetic methods provide a simple way of imposing a clear hierarchical relationship on the data, with branch lengths in the classification tree representing the degree of separation witnessed. We tested our method for class discovery on two data sets: a data set of 87 tissues, comprised mostly of small, round, blue-cell tumors (SRBCTs), and a data set of 22 breast tumors. We fit the 87 samples of the first set to a classification tree, which neatly separated into four major clusters corresponding exactly to the four groups of tumors, namely neuroblastomas, rhabdomyosarcomas, Burkitt's lymphomas, and the Ewing's family of tumors. The classification tree built using the breast cancer data separated tumors with BRCA1 mutations from those with BRCA2 mutations, with sporadic tumors separated from both groups and from each other. We also demonstrate the flexibility of the class discovery method with regard to standard resampling methodology such as jackknifing and noise perturbation. 
To solve the class prediction problem, we built a classification tree on the learning set, and then sought the optimal placement of each test sample within the classification tree. We tested this method on the SRBCT data set, and classified each tumor successfully.

7.
Classification of gene microarrays by penalized logistic regression   Total citations: 2, self-citations: 0, citations by others: 2
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.

8.
MOTIVATION: Recent research has shown that gene expression profiles can potentially be used for predicting various clinical phenotypes, such as tumor class, drug response and survival time. While there have been extensive studies on tumor classification, there has been less emphasis on other phenotypic features, in particular, patient survival time or time to cancer recurrence, which are subject to right censoring. We consider in this paper an analysis of censored survival time based on microarray gene expression profiles. RESULTS: We propose a dimension reduction strategy, which combines principal components analysis and sliced inverse regression, to identify linear combinations of genes that both account for the variability in the gene expression levels and preserve the phenotypic information. The extracted gene combinations are then employed as covariates in a predictive survival model formulation. We apply the proposed method to a large diffuse large-B-cell lymphoma dataset, which consists of 240 patients and 7399 genes, and build a Cox proportional hazards model based on the derived gene expression components. The proposed method is shown to provide a good predictive performance for patient survival, as demonstrated by both the significant survival difference between the predicted risk groups and the receiver operator characteristics analysis. AVAILABILITY: R programs are available upon request from the authors. SUPPLEMENTARY INFORMATION: http://dna.ucdavis.edu/~hli/bioinfo-surv-supp.pdf.

9.
MOTIVATION: It is important to consider finding differentially expressed genes in a dataset of microarray experiments for pattern generation. RESULTS: We developed two methods which are mainly based on the q-values approach; the first is a direct extension of the q-values approach, while the second uses two approaches: q-values and maximum-likelihood. We present two algorithms for the second method, one for error minimization and the other for confidence bounding. Also, we show how the method called Patterns from Gene Expression (PaGE) (Grant et al., 2000) can benefit from q-values. Finally, we conducted some experiments to demonstrate the effectiveness of the proposed methods; experimental results on a selected dataset (BRCA1 vs BRCA2 tumor types) are provided. CONTACT: alhajj@cpsc.ucalgary.ca.
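
The basic q-value idea that this abstract builds on can be illustrated with a minimal sketch. It assumes the proportion of true null hypotheses is conservatively fixed at 1, in which case q-values coincide with Benjamini-Hochberg adjusted p-values. The function name `q_values` and the example p-values are hypothetical, and the sketch does not reproduce the authors' two extended methods.

```python
def q_values(p_values):
    """Step-up q-values, assuming the null proportion pi0 is fixed at 1."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical per-gene p-values from a differential expression test.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
print(q_values(p))
```

Genes whose q-value falls below a chosen threshold form a list with an estimated false discovery rate at most that threshold, which is what makes q-values convenient for ranking thousands of genes at once.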

10.
11.
Breast cancer (BRCA) represents the most common malignancy among women worldwide with high mortality. Radiotherapy is a prevalent therapy for BRCA, but its effectiveness is heterogeneous among patients. Here, we proposed to develop a gene expression-based signature for BRCA radiotherapy sensitivity estimation. Gene expression profiles of BRCA samples from the Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) were obtained and used as training and independent testing datasets, respectively. Differentially expressed genes (DEGs) in BRCA samples compared with their paracancerous samples in the training set were identified by using the edgeR Bioconductor package. Univariate Cox regression analysis and the LASSO Cox regression method were applied to screen optimal genes for constructing a radiotherapy sensitivity estimation signature. A nomogram combining independent prognostic factors was used to predict 1-, 3-, and 5-year OS of radiation-treated BRCA patients. Relative proportions of tumor infiltrating immune cells (TIICs) calculated by CIBERSORT and mRNA levels of key immune checkpoint receptors were adopted to explore the relation between the signature and tumor immune response. As a result, 603 DEGs were obtained in BRCA tumor samples, six of which were retained and used to construct the radiotherapy sensitivity prediction model. The signature was proved to be robust in both training and testing sets. In addition, the signature was closely related to the immune microenvironment of BRCA in the context of TIICs and immune checkpoint receptors’ mRNA levels. In conclusion, the present study obtained a radiotherapy sensitivity estimation signature for BRCA, which should shed new light on clinical and experimental research.

12.
Improving missing value estimation in microarray data with gene ontology   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: Gene expression microarray experiments produce datasets with frequent missing expression values. Accurate estimation of missing values is an important prerequisite for efficient data analysis as many statistical and machine learning techniques either require a complete dataset or their results are significantly dependent on the quality of such estimates. A limitation of the existing estimation methods for microarray data is that they use no external information but the estimation is based solely on the expression data. We hypothesized that utilizing a priori information on functional similarities available from public databases facilitates the missing value estimation. RESULTS: We investigated whether semantic similarity originating from gene ontology (GO) annotations could improve the selection of relevant genes for missing value estimation. The relative contribution of each information source was automatically estimated from the data using an adaptive weight selection procedure. Our experimental results in yeast cDNA microarray datasets indicated that by considering GO information in the k-nearest neighbor algorithm we can enhance its performance considerably, especially when the number of experimental conditions is small and the percentage of missing values is high. The increase of performance was less evident with a more sophisticated estimation method. We conclude that even a small proportion of annotated genes can provide improvements in data quality significant for the eventual interpretation of the microarray experiments. AVAILABILITY: Java and Matlab codes are available on request from the authors. SUPPLEMENTARY MATERIAL: Available online at http://users.utu.fi/jotatu/GOImpute.html.

13.
MOTIVATION: Estimation of misclassification error has received increasing attention in clinical diagnosis and bioinformatics studies, especially in small sample studies with microarray data. Current error estimation methods are not satisfactory because they either have large variability (such as leave-one-out cross-validation) or large bias (such as resubstitution and leave-one-out bootstrap). While small sample size remains one of the key features of costly clinical investigations or of microarray studies that have limited resources in funding, time and tissue materials, accurate and easy-to-implement error estimation methods for small samples are desirable and will be beneficial. RESULTS: A bootstrap cross-validation method is studied. It achieves accurate error estimation through a simple procedure with bootstrap resampling and only costs computer CPU time. Simulation studies and applications to microarray data demonstrate that it performs consistently better than its competitors. This method possesses several attractive properties: (1) it is implemented through a simple procedure; (2) it performs well for small samples, with sample sizes as small as 16; (3) it is not restricted to any particular classification rules and thus applies to many parametric or non-parametric methods.
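
One plausible reading of the bootstrap cross-validation procedure described above is to run cross-validation inside each bootstrap resample and average the resulting error estimates. The sketch below follows that reading under stated assumptions: the abstract does not give the exact algorithm, so the leave-one-out scheme, the nearest-centroid stand-in classifier, the names `nearest_centroid_predict`, `loocv_error` and `bootstrap_cv_error`, and the toy data are all hypothetical.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Return the label whose class centroid is closest to x (stand-in classifier)."""
    groups = {}
    for row, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(row)

    def sq_dist(label):
        rows = groups[label]
        return sum((x[k] - sum(r[k] for r in rows) / len(rows)) ** 2
                   for k in range(len(x)))

    return min(groups, key=sq_dist)


def loocv_error(X, y, predict):
    """Leave-one-out misclassification rate for any predict(train_X, train_y, x) rule."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        errors += predict(train_X, train_y, X[i]) != y[i]
    return errors / len(X)


def bootstrap_cv_error(X, y, predict, n_boot=50, seed=0):
    """Average the cross-validated error over bootstrap resamples of the data."""
    rng = random.Random(seed)
    n = len(X)
    total = 0.0
    for _ in range(n_boot):
        # Draw n sample indices with replacement, then cross-validate on the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        total += loocv_error([X[i] for i in idx], [y[i] for i in idx], predict)
    return total / n_boot


# Hypothetical toy data: two well-separated groups of four samples each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
y = ["A"] * 4 + ["B"] * 4
print(bootstrap_cv_error(X, y, nearest_centroid_predict))
```

Because `predict` is passed in as a function, the same wrapper works with any classification rule, which mirrors property (3) in the abstract.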

14.
Dimension reduction methods have been proposed for regression analysis with predictors of high dimension, but have received little attention for problems with censored data. In this article, we present an iterative imputed spline approach based on principal Hessian directions (PHD) for censored survival data in order to reduce the dimension of predictors without requiring a prespecified parametric model. Our proposal is to replace the right-censored survival time with its conditional expectation for adjusting the censoring effect by using the Kaplan-Meier estimator and an adaptive polynomial spline regression in the residual imputation. A sparse estimation strategy is incorporated in our approach to enhance the interpretation of variable selection. This approach can be implemented in not only PHD, but also other methods developed for estimating the central mean subspace. Simulation studies with right-censored data are conducted for the imputed spline approach to PHD (IS-PHD) in comparison with two methods of sliced inverse regression, minimum average variance estimation, and naive PHD in ignorance of censoring. The results demonstrate that the proposed IS-PHD method is particularly useful for survival time responses approximating symmetric or bending structures. Illustrative applications to two real data sets are also presented.

15.
We describe two-dimensional strandness-dependent electrophoresis (2D-SDE) for quantification and length distribution analysis of single-stranded (ss) DNA fragments, double-stranded (ds) DNA fragments, RNA-DNA hybrids, and nicked DNA fragments in complex samples. In the first dimension nucleic acid molecules are separated based on strandness and length in the presence of 7 M urea. After the first-dimension electrophoresis all nucleic acid fragments are heat denatured in the gel. During the second-dimension electrophoresis all nucleic acid fragments are single-stranded and migrate according to length. 2D-SDE takes about 90 min and requires only basic skills and equipment. We show that 2D-SDE has many applications in analyzing complex nucleic acid samples including (1) estimation of renaturation efficiency and kinetics, (2) monitoring cDNA synthesis, (3) detection of nicked DNA fragments, and (4) estimation of quality and in vitro damage of nucleic acid samples. Results from 2D-SDE should be useful to validate techniques such as complex polymerase chain reaction, subtractive hybridization, cDNA synthesis, cDNA normalization, and microarray analysis. 2D-SDE could also be used, e.g., to characterize biological nucleic acid samples. Information obtained with 2D-SDE cannot be readily obtained with other methods. 2D-SDE can be used for preparative isolation of ssDNA fragments, dsDNA fragments, and RNA-DNA hybrids.

16.
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often gives very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.

17.
18.
Local fractal dimension based ECG arrhythmia classification   Total citations: 1, self-citations: 0, citations by others: 1
We propose a local fractal dimension based nearest neighbor classifier for ECG based classification of arrhythmia. Local fractal dimension (LFD) at each sample point of the ECG waveform is taken as the feature. A nearest neighbor algorithm in the feature space is used to find the class of the test ECG beat. The nearest neighbor is found based on the RR-interval-information-biased Euclidean distance, proposed in the current work. Based on the two algorithms used for estimating the LFD, two classification algorithms are validated in the current work, viz. variance based fractal dimension estimation based nearest neighbor classifier and power spectral density based fractal dimension estimation based nearest neighbor classifier. Their performances are evaluated based on various figures of merit. MIT-BIH (Massachusetts Institute of Technology - Boston’s Beth Israel Hospital) Arrhythmia dataset has been used to validate the algorithms. Along with showing good performance against all the figures of merit, the proposed algorithms also proved to be patient independent in the sense that the performance is good even when the test ECG signal is from a patient whose ECG is not present in the training ECG dataset.

19.
Classification methods used in microarray studies for gene expression are diverse in the way they deal with the underlying complexity of the data, as well as in the technique used to build the classification model. The MAQC II study on cancer classification problems has found that performance was affected by factors such as the classification algorithm, cross validation method, number of genes, and gene selection method. In this paper, we study the hypothesis that the disease under study significantly determines which method is optimal, and that additionally sample size, class imbalance, type of medical question (diagnostic, prognostic or treatment response), and microarray platform are potentially influential. A systematic literature review was used to extract the information from 48 published articles on non-cancer microarray classification studies. The impact of the various factors on the reported classification accuracy was analyzed through random-intercept logistic regression. The type of medical question and method of cross validation dominated the explained variation in accuracy among studies, followed by disease category and microarray platform. In total, 42% of the between study variation was explained by all the study specific and problem specific factors that we studied together.

20.

Background  

The most popular methods for significance analysis on microarray data are well suited to find genes differentially expressed across predefined categories. However, identification of features that correlate with continuous dependent variables is more difficult using these methods, and long lists of significant genes returned are not easily probed for co-regulations and dependencies. Dimension reduction methods are much used in the microarray literature for classification or for obtaining low-dimensional representations of data sets. These methods have an additional interpretation strength that is often not fully exploited when expression data are analysed. In addition, significance analysis may be performed directly on the model parameters to find genes that are important for any number of categorical or continuous responses. We introduce a general scheme for analysis of expression data that combines significance testing with the interpretative advantages of the dimension reduction methods. This approach is applicable both for explorative analysis and for classification and regression problems.
