Similar Articles
20 similar articles found
1.
In this paper, we introduce a modified version of linear discriminant analysis, called "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) method (Tibshirani and others, 2003) to classical discriminant analysis. SCRDA is specially designed for classification problems in high-dimension, low-sample-size settings, for example, microarray data. Through both simulated and real-life data, we show that this method performs very well in multivariate classification problems, often outperforming the PAM method (which uses the NSC algorithm) and remaining competitive with support vector machine classifiers. It is also suitable for feature elimination and can be used as a gene selection method. The open-source R package for this method (named "rda") is available on CRAN (http://www.r-project.org) for download and testing.
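The NSC idea the paper builds on is soft-thresholding of class centroids toward the overall centroid. A minimal sketch of that step, under simplifying assumptions (function names are illustrative, and the full NSC/PAM method also scales each gene by its within-class standard error, which is omitted here):

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink each value toward zero by delta; values within delta become 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def shrunken_centroids(X, y, delta):
    """Shrink per-class centroids toward the overall centroid.

    X: (n_samples, n_genes); y: class labels; delta: shrinkage amount.
    Returns a dict mapping class -> shrunken centroid. Simplified sketch:
    the full NSC method divides each gene's deviation by its standard
    error before thresholding.
    """
    overall = X.mean(axis=0)
    centroids = {}
    for c in np.unique(y):
        diff = X[y == c].mean(axis=0) - overall
        centroids[c] = overall + soft_threshold(diff, delta)
    return centroids
```

Genes whose centroid deviations shrink to zero in every class drop out of the classifier, which is how shrinkage doubles as gene selection.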

2.
MOTIVATION: In the context of sample (e.g. tumor) classifications with microarray gene expression data, many methods have been proposed. However, almost all the methods ignore existing biological knowledge and treat all the genes equally a priori. On the other hand, because some genes have been identified by previous studies to have biological functions or to be involved in pathways related to the outcome (e.g. cancer), incorporating this type of prior knowledge into a classifier can potentially improve both the predictive performance and interpretability of the resulting model. RESULTS: We propose a simple and general framework to incorporate such prior knowledge into building a penalized classifier. As two concrete examples, we apply the idea to two penalized classifiers, nearest shrunken centroids (also called PAM) and penalized partial least squares (PPLS). Instead of treating all the genes equally a priori as in standard penalized methods, we group the genes according to their functional associations based on existing biological knowledge or data, and adopt group-specific penalty terms and penalization parameters. Simulated and real data examples demonstrate that, if prior knowledge on gene grouping is indeed informative, our new methods perform better than the two standard penalized methods, yielding higher predictive accuracy and screening out more irrelevant genes.

3.
MOTIVATION: The nearest shrunken centroids classifier has become a popular algorithm in tumor classification problems using gene expression microarray data. Feature selection is an embedded part of the method, selecting top-ranking genes based on a univariate distance statistic calculated for each gene individually. The univariate statistics summarize gene expression profiles outside of the gene co-regulation network context, leading to redundant information being included in the selection procedure. RESULTS: We propose an Eigengene-based Linear Discriminant Analysis (ELDA) to address gene selection in a multivariate framework. The algorithm uses a modified rotated Spectral Decomposition (SpD) technique to select 'hub' genes that associate with the most important eigenvectors. Using three benchmark cancer microarray datasets, we show that ELDA selects the most characteristic genes, leading to substantially smaller classifiers than their univariate feature-selection-based analogues. The resulting de-correlated expression profiles make the gene-wise independence assumption more realistic and applicable for the shrunken centroids classifier and other diagonal linear discriminant-type models. Our algorithm further incorporates a misclassification cost matrix, allowing differential penalization of one type of error over another. In the breast cancer data, we show that false-negative prognoses can be controlled via a cost-adjusted discriminant function. AVAILABILITY: R code for the ELDA algorithm is available from the author upon request.

4.
MOTIVATION: Classification of biological samples by microarrays is a topic of much interest. A number of methods have been proposed and successfully applied to this problem. It has recently been shown that classification by nearest centroids provides an accurate predictor that may outperform much more complicated methods. The 'Prediction Analysis of Microarrays' (PAM) approach is one such example, which the authors strongly motivate by its simplicity and interpretability. In this spirit, I seek to assess the performance of classifiers simpler than even PAM. RESULTS: Surprisingly, I show that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts. Based on these observations, I propose a classification method called 'Classification to Nearest Centroids' (ClaNC). ClaNC ranks genes by standard t-statistics, does not shrink centroids and uses a class-specific gene-selection procedure. Because of these modifications, ClaNC is arguably simpler and easier to interpret than PAM, and it can be viewed as a traditional nearest centroid classifier that uses specially selected genes. I demonstrate that ClaNC error rates tend to be significantly less than those for PAM, for a given number of active genes. AVAILABILITY: Point-and-click software is freely available at http://students.washington.edu/adabney/clanc.
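Per the abstract, ClaNC's building blocks are standard t-statistic gene ranking plus unshrunken centroids. A sketch of two-class, pooled-variance t-based ranking under those assumptions (illustrative only, not the ClaNC implementation, which also selects genes class-specifically):

```python
import numpy as np

def rank_genes_by_t(X, y):
    """Rank genes by |two-sample t-statistic| with pooled variance.

    X: (n_samples, n_genes); y: binary labels in {0, 1}.
    Returns gene indices ordered from most to least class-separating.
    """
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    # Pooled per-gene variance across the two classes.
    sp2 = ((na - 1) * a.var(axis=0, ddof=1)
           + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))
    return np.argsort(-np.abs(t))
```

A nearest-centroid rule restricted to the top-ranked genes then gives a simple classifier in the spirit the abstract describes.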

5.
Differential analysis of DNA microarray gene expression data
Here, we review briefly the sources of experimental and biological variance that affect the interpretation of high-dimensional DNA microarray experiments. We discuss methods using a regularized t-test based on a Bayesian statistical framework that allow the identification of differentially regulated genes with a higher level of confidence than a simple t-test when only a few experimental replicates are available. We also describe a computational method for calculating the global false-positive and false-negative levels inherent in a DNA microarray data set. This method provides a probability of differential expression for each gene based on experiment-wide false-positive and -negative levels driven by experimental error and biological variance.
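The regularized t-test idea can be caricatured as adding a stabilizing constant to the denominator of an ordinary two-sample score, so that genes with tiny variance (common with few replicates) cannot produce inflated statistics. A sketch with a fixed fudge factor s0 (an assumption for illustration; the review's Bayesian framework estimates the variance stabilization from the data rather than fixing it):

```python
import statistics

def regularized_t(xs, ys, s0=0.5):
    """Two-sample t-like score with a constant s0 added to the
    standard-error denominator. With s0 = 0 this reduces to an
    ordinary unequal-variance t-statistic; s0 > 0 damps scores
    from near-zero-variance genes."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    se = (vx / len(xs) + vy / len(ys)) ** 0.5
    return (mx - my) / (se + s0)
```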

6.
Robust PCA and classification in biosciences
MOTIVATION: Principal components analysis (PCA) is a very popular dimension reduction technique that is widely used as a first step in the analysis of high-dimensional microarray data. However, the classical approach that is based on the mean and the sample covariance matrix of the data is very sensitive to outliers. Also, classification methods based on this covariance matrix do not give good results in the presence of outlying measurements. RESULTS: First, we propose a robust PCA (ROBPCA) method for high-dimensional data. It combines projection-pursuit ideas with robust estimation of low-dimensional data. We also propose a diagnostic plot to display and classify the outliers. This ROBPCA method is applied to several bio-chemical datasets. In one example, we also apply a robust discriminant method on the scores obtained with ROBPCA. We show that this combination of robust methods leads to better classifications than classical PCA and quadratic discriminant analysis. AVAILABILITY: All the programs are part of the Matlab Toolbox for Robust Calibration, available at http://www.wis.kuleuven.ac.be/stat/robust.html.

7.
Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p > n), microarray data analysis poses big challenges for statistical analysis. An obvious problem in the 'large p, small n' setting is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and have proved useful in empirical studies. Recently Wu proposed penalized t/F-statistics with shrinkage, derived formally from L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using L1-penalized regression models, and we show that penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.

8.
In the analysis of high-throughput biological data, it is often believed that biological units such as genes behave interactively in groups, that is, pathways in our context. It is conceivable that utilizing prior pathway knowledge would greatly facilitate both interpretation and estimation in the statistical analysis of such high-dimensional biological data. In this article, we propose a 2-step procedure for identifying pathways that are related to and influence the clinical phenotype. In the first step, a nonlinear dimension reduction method is proposed, which permits flexible within-pathway gene interactions as well as nonlinear pathway effects on the response. In the second step, a regularized model-based pathway ranking and selection procedure is developed, built upon the summary features extracted from the first step. Simulations suggest that the new method performs favorably compared to the existing solutions. An analysis of a glioblastoma microarray data set finds 4 pathways that have supporting evidence in the biological literature.

9.

Background  

With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia.
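For orientation, regularized CCA in its primal form adds a ridge term to each covariance block before whitening; the entry's contribution is a dual formulation of the same problem via Gram matrices, so cost scales with the number of samples. A primal-form sketch with a fixed ridge r (illustrative assumptions throughout; this is not the authors' dual implementation):

```python
import numpy as np

def regularized_cca(X, Y, r=1e-3):
    """Canonical correlations of paired data sets X, Y with a ridge
    penalty r added to both covariance blocks before whitening.

    X: (n, p); Y: (n, q). Returns singular values of the whitened
    cross-covariance, i.e. the regularized canonical correlations.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx = Xc.T @ Xc / (n - 1) + r * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + r * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition;
        # the ridge keeps all eigenvalues strictly positive.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)
```

The primal eigenproblem above is p x p and q x q; the dual trick rewrites it in terms of n x n Gram matrices, which is what makes it feasible when variables vastly outnumber samples.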

10.
Covariance matrix estimation is a fundamental statistical task in many applications, but the sample covariance matrix is suboptimal when the sample size is comparable to or less than the number of features. Such high-dimensional settings are common in modern genomics, where covariance matrix estimation is frequently employed as a method for inferring gene networks. To achieve estimation accuracy in these settings, existing methods typically either assume that the population covariance matrix has some particular structure, for example, sparsity, or apply shrinkage to better estimate the population eigenvalues. In this paper, we study a new approach to estimating high-dimensional covariance matrices. We first frame covariance matrix estimation as a compound decision problem. This motivates defining a class of decision rules and using a nonparametric empirical Bayes g-modeling approach to estimate the optimal rule in the class. Simulation results and gene network inference in an RNA-seq experiment in mouse show that our approach is comparable to or can outperform a number of state-of-the-art proposals.
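A common shrinkage baseline in this literature blends the sample covariance with a structured target such as a scaled identity. A sketch with a fixed mixing weight alpha (an assumption for illustration; data-driven weights as in Ledoit-Wolf, and the paper's empirical-Bayes compound-decision rule, are more sophisticated):

```python
import numpy as np

def shrink_covariance(X, alpha):
    """Convex combination of the sample covariance and a scaled identity.

    X: (n_samples, n_features); alpha in [0, 1] is the shrinkage weight.
    The target preserves the average variance (trace/p) while pulling
    off-diagonal entries and extreme eigenvalues toward the center.
    """
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    target = np.eye(p) * np.trace(S) / p
    return (1 - alpha) * S + alpha * target
```

Because the target is well-conditioned, the blended estimate stays invertible even when n < p, which is exactly the regime the entry describes.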

11.

Background  

Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
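The partial-correlation matrix central to graphical Gaussian models can be read off a (regularized) precision matrix: standardize the inverse covariance and flip its sign off-diagonal. A sketch using a simple ridge-regularized inverse as a stand-in for the regularization schemes the entry surveys (an illustrative choice, not the Lasso or PLS approaches themselves):

```python
import numpy as np

def partial_correlations(X, ridge=0.1):
    """Partial correlations from a ridge-regularized precision matrix.

    X: (n_samples, n_features). Adding ridge * I makes the covariance
    invertible even when features outnumber samples; entry (i, j) then
    estimates the correlation of genes i and j given all other genes.
    """
    S = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    P = np.linalg.inv(S)                # precision matrix
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)            # standardize and negate
    np.fill_diagonal(R, 1.0)
    return R
```

Thresholding the off-diagonal entries of R is one simple way to read an undirected network out of the estimate.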

12.
13.
Fung ES, Ng MK. Bioinformation 2007, 2(5):230-234
One of the applications of discriminant analysis on microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and in the diagnosis of cancer subtypes. The main contribution of this paper is to propose a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An l2-l1 norm minimization method is incorporated into the discriminant process to automatically compute the weights of all genes in the samples. Experiments on two microarray data sets have shown that the new algorithm can generate classification results as good as other classification methods while effectively determining the relevant genes for classification. In this study, we demonstrate the gene-selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

14.
Linear discriminant analysis (LDA) is a multivariate classification technique frequently applied to morphometric data in various biomedical disciplines. Canonical variate analysis (CVA), the generalization of LDA for multiple groups, is often used in the exploratory style of an ordination technique (a low-dimensional representation of the data). In the rare case when all groups have the same covariance matrix, maximum likelihood classification can be based on these linear functions. Both LDA and CVA require full-rank covariance matrices, which is usually not the case in modern morphometrics. When the number of variables is close to the number of individuals, groups appear separated in a CVA plot even if they are samples from the same population. Hence, reliable classification and assessment of group separation require many more organisms than variables. A simple alternative to CVA is the projection of the data onto the principal components of the group averages (between-group PCA). In contrast to CVA, these axes are orthogonal and can be computed even when the data are not of full rank, such as for Procrustes shape coordinates arising in samples of any size, and when covariance matrices are heterogeneous. In evolutionary quantitative genetics, the selection gradient is identical to the coefficient vector of a linear discriminant function between the populations before vs. after selection. When the measured variables are Procrustes shape coordinates, discriminant functions and selection gradients are vectors in shape space and can be visualized as shape deformations. Except for applications in quantitative genetics and in classification, however, discriminant functions typically offer no interpretation as biological factors.
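Between-group PCA as described is straightforward: run PCA on the k group means and project all samples onto the resulting axes (at most k - 1 of which carry between-group variation). A minimal sketch, with illustrative names:

```python
import numpy as np

def between_group_pca(X, y, n_axes=1):
    """Project samples onto principal components of the group means.

    X: (n_samples, n_features); y: group labels. Unlike CVA, this needs
    no full-rank covariance matrix: the PCA is computed on only the
    group averages, so it works even when features outnumber samples.
    """
    groups = np.unique(y)
    means = np.array([X[y == g].mean(axis=0) for g in groups])
    grand = means.mean(axis=0)
    # Principal axes of the centered group means via SVD.
    _, _, vt = np.linalg.svd(means - grand, full_matrices=False)
    axes = vt[:n_axes].T               # (n_features, n_axes), orthonormal
    return (X - grand) @ axes          # sample scores on those axes
```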

15.
Many statistical methods have been developed to screen for differentially expressed genes associated with specific phenotypes in microarray data. However, it remains a major challenge to synthesize the observed expression patterns with abundant biological knowledge for a more complete understanding of the biological functions among genes. Various methods, including clustering analysis on genes, neural networks, Bayesian networks and pathway analysis, have been developed toward this goal. In most of these procedures, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps. We propose two novel Bayesian models to integrate the microarray data with the putative pathway structures obtained from the KEGG database and the directional gene–gene interactions in the medical literature. We define the symmetric Kullback–Leibler divergence of a pathway, and use it to identify the pathway(s) most supported by the microarray data. A Markov chain Monte Carlo sampling algorithm is given for posterior computation in the hierarchical model. The proposed method is shown to select the most supported pathway in an illustrative example. Finally, we apply the methodology to a real microarray data set to understand the gene expression profile of osteoblast lineage at defined stages of differentiation. We observe that our method correctly identifies the pathways that are reported to play essential roles in modulating bone mass.

16.
As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five highly used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases from 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.

17.
18.
MOTIVATION: To establish prognostic predictors of various diseases using DNA microarray analysis technology, it is desirable to select genes that are significant for constructing the prognostic model, and to eliminate non-specific or error-prone genes before model construction. RESULTS: We applied projective adaptive resonance theory (PART) to gene screening for DNA microarray data. Genes selected by PART were subjected to our FNN-SWEEP modeling method for the construction of a cancer class prediction model. The model performance was evaluated through comparison with a conventional signal-to-noise (S2N) screening method and the nearest shrunken centroids (NSC) method. The FNN-SWEEP predictor with PART screening could discriminate classes of acute leukemia in blinded data with 97.1% accuracy and classes of lung cancer with 90.0% accuracy, while the predictor with S2N achieved only 85.3 and 70.0%, and the predictor with NSC 88.2 and 90.0%, respectively. The results show that PART is superior for gene screening. AVAILABILITY: The software is available upon request from the authors. CONTACT: honda@nubio.nagoya-u.ac.jp

19.
Zhao W, Li H, Hou W, Wu R. Genetics 2007, 176(3):1879-1892
The biological and statistical advantages of functional mapping result from joint modeling of the mean-covariance structures for developmental trajectories of a complex trait measured at a series of time points. While an increased number of time points can better describe the dynamic pattern of trait development, significant difficulties in performing functional mapping arise from prohibitive computational times required as well as from modeling the structure of a high-dimensional covariance matrix. In this article, we develop a statistical model for functional mapping of quantitative trait loci (QTL) that govern the developmental process of a quantitative trait on the basis of wavelet dimension reduction. By breaking an original signal down into a spectrum by taking its averages (smooth coefficients) and differences (detail coefficients), we used the discrete Haar wavelet shrinkage technique to transform an inherently high-dimensional biological problem into its tractable low-dimensional representation within the framework of functional mapping constructed by a Gaussian mixture model. Unlike conventional nonparametric modeling of wavelet shrinkage, we incorporate mathematical aspects of developmental trajectories into the smooth coefficients used for QTL mapping, thus preserving the biological relevance of functional mapping in formulating a number of hypothesis tests at the interplay between gene actions/interactions and developmental patterns for complex phenotypes. This wavelet-based parametric functional mapping has been statistically examined and compared with full-dimensional functional mapping through simulation studies. It holds great promise as a powerful statistical tool to unravel the genetic machinery of developmental trajectories with large-scale high-dimensional data.
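The Haar decomposition at the heart of this dimension reduction is just pairwise averages and differences, with small detail coefficients shrunk away. A sketch of one transform level plus hard thresholding (illustrative; the article's model treats the smooth coefficients parametrically within a Gaussian mixture rather than simply discarding details):

```python
def haar_step(signal):
    """One level of the discrete Haar transform: pairwise averages
    (smooth coefficients) and pairwise differences (detail
    coefficients). Assumes an even-length signal."""
    smooth = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    return smooth, detail

def shrink(detail, threshold):
    """Hard-threshold detail coefficients: zero out those smaller in
    magnitude than the threshold, keeping only strong local changes."""
    return [d if abs(d) >= threshold else 0.0 for d in detail]
```

Applying haar_step recursively to the smooth coefficients halves the dimension at each level, which is what makes the high-dimensional trajectory problem tractable.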

20.