Similar Literature
20 similar documents retrieved.
1.
Recent developments in microarray technology make it possible to capture the gene expression profiles of thousands of genes at once. With these data, researchers are tackling problems ranging from the identification of 'cancer genes' to the formidable task of adding functional annotations to our rapidly growing gene databases. Specific research questions suggest patterns of gene expression that are interesting and informative: for instance, genes with large variance or groups of genes that are highly correlated. Cluster analysis and related techniques are proving to be very useful. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Given the high dimensionality (thousands of genes) and small sample sizes (often fewer than 30) encountered in these datasets, an honest assessment of sampling variability is crucial and can prevent the over-interpretation of spurious results. We describe a statistical framework that encompasses many of the analytical goals in gene expression analysis; our framework is completely compatible with many of the current approaches and, in fact, can increase their utility. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. In addition to subset membership, the target subset can include information about relationships between genes, such as clustering. This target subset constitutes a parameter of interest that we can estimate by applying the rule to the sample statistics of microarray data. The parametric bootstrap, based on a multivariate normal model, is used to estimate the distribution of these estimated subsets, and relevant summary measures of this sampling distribution are proposed. We focus on rules that operate on the mean and covariance. Using Bernstein's inequality, we obtain consistency of the subset estimates under the assumption that the sample size converges to infinity faster than the logarithm of the number of genes. We also provide a conservative sample size formula guaranteeing that the sample mean and sample covariance matrix are uniformly within a distance ε > 0 of the population mean and covariance. The practical performance of the method using a cluster-based subset rule is illustrated with a simulation study, and the method is then applied to a publicly available leukemia data set.
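A minimal sketch of the parametric-bootstrap idea described above, using NumPy only: a deterministic rule selects a gene subset (here the top-k sample variances, a simple stand-in for the paper's cluster-based rules), bootstrap samples are drawn from a multivariate normal fitted to the data, and the stability of subset membership is summarized. All names, sizes and the choice of rule are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression matrix: n samples (rows) x p genes (columns).
n, p, k = 25, 200, 10
X = rng.normal(size=(n, p)) * rng.uniform(0.5, 2.0, size=p)

def top_variance_subset(data, k):
    """Deterministic rule: indices of the k genes with largest sample variance."""
    return set(np.argsort(data.var(axis=0, ddof=1))[-k:])

# Subset estimated from the observed data.
observed_subset = top_variance_subset(X, k)

# Parametric bootstrap from a multivariate normal fitted to the data.
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False) + 1e-6 * np.eye(p)  # small ridge for numerical stability
B = 200
membership_counts = np.zeros(p)
for _ in range(B):
    Xb = rng.multivariate_normal(mu_hat, sigma_hat, size=n)
    for g in top_variance_subset(Xb, k):
        membership_counts[g] += 1

# Summary of the sampling distribution: how often each observed gene re-enters the subset.
for g in sorted(observed_subset):
    print(f"gene {g}: selected in {membership_counts[g] / B:.2f} of bootstrap samples")
```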

2.
High-dimensional data such as microarrays have brought us new statistical challenges. For example, using a large number of genes to classify samples based on a small number of microarrays remains a difficult problem. Diagonal discriminant analysis, support vector machines, and k-nearest neighbors have been suggested as being among the best methods for small sample size situations, but none has been found to be uniformly superior to the others. In this article, we propose an improved diagonal discriminant approach through shrinkage and regularization of the variances. The performance of our new approach, along with that of existing methods, is studied through simulations and applications to real data. These studies show that the proposed shrinkage-based and regularized diagonal discriminant methods have lower misclassification rates than existing methods in many cases.
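A minimal sketch of the general idea of diagonal discriminant analysis with variance shrinkage: gene-wise variances are shrunk toward a common target before computing the diagonal discriminant score. The shrinkage target and weight here are ad hoc illustrations, not the estimators proposed in the article.

```python
import numpy as np

class ShrunkenDiagonalLDA:
    """Diagonal LDA with gene-wise variances shrunk toward their median.

    Illustrative only: the shrinkage target and weight `alpha` are ad hoc,
    not the estimators proposed in the article.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        pooled_var = np.array([X[y == c].var(axis=0, ddof=1) for c in self.classes_]).mean(axis=0)
        target = np.median(pooled_var)
        # Shrink each gene-wise variance toward the common target.
        self.var_ = (1 - self.alpha) * pooled_var + self.alpha * target
        return self

    def predict(self, X):
        # Diagonal discriminant score: squared distance weighted by 1/variance.
        d2 = ((X[:, None, :] - self.means_[None, :, :]) ** 2 / self.var_).sum(axis=2)
        return self.classes_[np.argmin(d2, axis=1)]

# Toy two-class example with n << p.
rng = np.random.default_rng(1)
n, p = 30, 500
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :20] += 1.0          # first 20 genes are differentially expressed
clf = ShrunkenDiagonalLDA(alpha=0.5).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))
```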

3.
Covariance matrix estimation is a fundamental statistical task in many applications, but the sample covariance matrix is suboptimal when the sample size is comparable to or less than the number of features. Such high-dimensional settings are common in modern genomics, where covariance matrix estimation is frequently employed as a method for inferring gene networks. To achieve estimation accuracy in these settings, existing methods typically either assume that the population covariance matrix has some particular structure, for example, sparsity, or apply shrinkage to better estimate the population eigenvalues. In this paper, we study a new approach to estimating high-dimensional covariance matrices. We first frame covariance matrix estimation as a compound decision problem. This motivates defining a class of decision rules and using a nonparametric empirical Bayes g-modeling approach to estimate the optimal rule in the class. Simulation results and gene network inference in an RNA-seq experiment in mouse show that our approach is comparable to or can outperform a number of state-of-the-art proposals.
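Not the empirical Bayes g-modeling rule studied in this paper, but a quick illustration of the baseline problem it addresses: when the sample size is comparable to the number of features, the sample covariance matrix is noisy, and even a standard linear shrinkage estimator (here scikit-learn's Ledoit-Wolf) already improves on it. The AR(1)-style truth is purely illustrative.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(2)

# True covariance: AR(1)-style structure among p "genes".
p, n = 100, 50
rho = 0.6
truth = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), truth, size=n)

sample_cov = np.cov(X, rowvar=False)
lw_cov = LedoitWolf().fit(X).covariance_

def frob_error(est):
    """Frobenius-norm distance to the true covariance matrix."""
    return np.linalg.norm(est - truth, ord="fro")

print("Frobenius error, sample covariance:", round(frob_error(sample_cov), 2))
print("Frobenius error, Ledoit-Wolf      :", round(frob_error(lw_cov), 2))
```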

4.
Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p ≫ n), microarray data pose big challenges for statistical analysis. An obvious problem arising from 'large p, small n' data is over-fitting: just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in microarray data analysis. The SAM statistic proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and have proven useful in empirical studies. Recently, Wu proposed penalized t/F-statistics with shrinkage by formally using L1-penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discuss the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data, formally derive the ad hoc shrunken centroid used by Tibshirani et al. from L1-penalized regression models, and show that penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
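The connection drawn above between L1 penalties and the nearest shrunken centroid rests on soft-thresholding. A minimal sketch, not the paper's derivation: the soft-threshold operator that solves the one-dimensional lasso problem, alongside scikit-learn's NearestCentroid classifier with its shrink_threshold option, which implements a shrunken-centroid rule on toy 'large p, small n' data.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

def soft_threshold(z, lam):
    """Solution of min_b 0.5*(z - b)^2 + lam*|b|: shrink z toward 0 by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 1.5]), lam=0.5))

# Shrunken-centroid classification on toy "large p, small n" data.
rng = np.random.default_rng(3)
n, p = 40, 300
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :15] += 1.2                      # 15 informative genes
clf = NearestCentroid(shrink_threshold=0.4).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))
```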

5.
We present prediction and variable importance (VIM) methods for longitudinal data sets containing continuous and binary exposures subject to missingness. We demonstrate the use of these methods for prognosis of medical outcomes in severe trauma patients, a field in which current medical practice relies on rules of thumb and scoring methods that use only a few variables and ignore the dynamic and high-dimensional nature of trauma recovery. Well-principled prediction and VIM methods can provide a tool to make care decisions informed by the patient's high-dimensional physiological and clinical history. Our VIM parameters are analogous to slope coefficients in adjusted regressions, but they do not depend on a specific statistical model, nor do they require a particular functional form of the prediction regression to be estimated. In addition, under causal and statistical assumptions they can be interpreted causally as the expected outcome under time-specific clinical interventions, related to the change in the mean of the outcome if each individual experiences a specified change in the variable (keeping the other variables in the model fixed). Moreover, the targeted maximum likelihood estimator (MLE) used is doubly robust and locally efficient. Because the proposed VIM does not constrain the prediction model fit, we use a very flexible ensemble learner (the SuperLearner), which returns a linear combination of a list of user-supplied algorithms. Not only is such a prediction algorithm intuitively appealing, it has theoretical justification as being asymptotically equivalent to the oracle selector. The results of the analysis show effects whose size and significance would not have been found using a parametric approach (such as stepwise regression or the LASSO). The procedure is all the more compelling because the predictor on which it is based showed significant improvements in cross-validated fit, for instance in the area under the receiver operating characteristic (ROC) curve (AUC). Thus, given that (1) our VIM applies to any model-fitting procedure, (2) under assumptions it has meaningful clinical (causal) interpretations and (3) it has asymptotic (influence-curve-based) robust inference, it provides a compelling alternative to existing methods for estimating variable importance in high-dimensional clinical (or other) data.
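The SuperLearner mentioned above combines a user-supplied library of algorithms via cross-validation. A loose analogue can be assembled with scikit-learn's StackingClassifier, which likewise fits a combination of base learners on out-of-fold predictions; this sketch is only an illustration of the ensemble idea and omits the targeted-MLE variable importance step entirely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a high-dimensional clinical data set.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8, random_state=0)

# Library of base learners; the meta-learner combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

auc = cross_val_score(stack, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", auc.mean().round(3))
```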

6.
Survival prediction from a large number of covariates is a current focus of statistical and medical research. In this paper, we study a methodology known as compound covariate prediction, performed under univariate Cox proportional hazards models. We demonstrate via simulations and real data analysis that the compound covariate method generally competes well with ridge regression and the lasso, both well-studied methods for predicting survival outcomes with a large number of covariates. Furthermore, we develop a refinement of the compound covariate method by incorporating likelihood information from multivariate Cox models. The new proposal is an adaptive method that borrows information contained in both the univariate and multivariate Cox regression estimators. We show that the new proposal has a theoretical justification from statistical large-sample theory and is naturally interpreted as a shrinkage-type estimator, a popular class of estimators in the statistical literature. Two datasets, the primary biliary cirrhosis of the liver data and the non-small-cell lung cancer data, are used for illustration. The proposed method is implemented in the R package “compound.Cox”, available on CRAN at http://cran.r-project.org/.
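A minimal sketch of the basic compound covariate idea on simulated data: fit a univariate Cox model to each covariate and use the sum of (univariate coefficient × covariate) as a single prognostic score. The sketch assumes the third-party Python package lifelines and does not reproduce the refined adaptive estimator or the compound.Cox R package described in the abstract.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(4)

# Toy survival data: p covariates, exponential event times, random censoring.
n, p = 200, 30
X = rng.normal(size=(n, p))
risk = X[:, :5] @ np.array([0.5, 0.4, 0.3, -0.4, -0.5])
time = rng.exponential(scale=np.exp(-risk))
censor = rng.exponential(scale=np.median(time) * 2, size=n)
df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)])
df["T"] = np.minimum(time, censor)
df["E"] = (time <= censor).astype(int)

# Univariate Cox coefficient for each covariate.
betas = {}
for j in range(p):
    cph = CoxPHFitter().fit(df[[f"x{j}", "T", "E"]], duration_col="T", event_col="E")
    betas[j] = cph.params_[f"x{j}"]

# Compound covariate: a single risk score built from the univariate coefficients.
score = sum(betas[j] * df[f"x{j}"] for j in range(p))
print("c-index of compound covariate:", round(concordance_index(df["T"], -score, df["E"]), 3))
```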

7.
Predicting survival from microarray data--a comparative study
MOTIVATION: Survival prediction from gene expression data and other high-dimensional genomic data has been the subject of much research in recent years. These kinds of data pose the methodological problem of having many more gene expression values than individuals; in addition, the responses are censored survival times. Most of the proposed methods handle this by using Cox's proportional hazards model and obtaining parameter estimates by some dimension reduction or parameter shrinkage estimation technique. Using three well-known microarray gene expression data sets, we compare the prediction performance of seven such methods: univariate selection, forward stepwise selection, principal components regression (PCR), supervised principal components regression, partial least squares regression (PLS), ridge regression and the lasso. RESULTS: Statistical learning from subsets should be repeated several times in order to obtain a fair comparison between methods. Methods using coefficient shrinkage or linear combinations of the gene expression values perform much better than the simple variable selection methods. For our data sets, ridge regression has the best overall performance. AVAILABILITY: Matlab and R code for the prediction methods are available at http://www.med.uio.no/imb/stat/bmms/software/microsurv/.

8.
We apply a nonparametric method to perform functional principal component analysis for functional curve data that consist of measurements of a random trajectory for a sample of subjects. This design typically consists of an irregular grid of time points on which repeated measurements are taken for a number of subjects. We introduce shrinkage estimates for the functional principal component scores that serve as the random effects in the model. Scatterplot smoothing methods are used to estimate the mean function and covariance surface of this model. We propose improved estimation in the neighborhood of, and at, the diagonal of the covariance surface, where the measurement errors are reflected. The presence of additive measurement errors motivates shrinkage estimates for the functional principal component scores. Shrinkage estimates are developed through best linear prediction and, in a generalized version, by aiming to minimize one-curve-leave-out prediction error. The estimation of individual trajectories combines data obtained from that individual as well as from all other individuals. We apply our methods to new data on the level of 14C-folate in plasma as a function of time since dosing of healthy adults with a small tracer dose of 14C-folic acid. A time transformation was incorporated to handle design irregularity in the time points at which the measurements were taken. The proposed methodology, incorporating shrinkage and data-adaptive features, is seen to be well suited for describing the population kinetics of 14C-folate-specific activity and random effects, and can also be applied to other functional data analysis problems.

9.
Microarray technology is rapidly emerging for genome-wide screening of differentially expressed genes between clinical subtypes or different conditions of human diseases. Traditional statistical testing approaches, such as the two-sample t-test or Wilcoxon test, are frequently used for evaluating statistical significance of informative expressions but require adjustment for large-scale multiplicity. Due to its simplicity, Bonferroni adjustment has been widely used to circumvent this problem. It is well known, however, that the standard Bonferroni test is often very conservative. In the present paper, we compare three multiple testing procedures in the microarray context: the original Bonferroni method, a Bonferroni-type improved single-step method and a step-down method. The latter two methods are based on nonparametric resampling, by which the null distribution can be derived with the dependency structure among gene expressions preserved and the family-wise error rate accurately controlled at the desired level. We also present a sample size calculation method for designing microarray studies. Through simulations and data analyses, we find that the proposed methods for testing and sample size calculation are computationally fast and control error and power precisely.
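A minimal sketch of a resampling-based single-step adjustment of the kind discussed above (Westfall-Young maxT style), using label permutations and two-sample t statistics; the step-down variant and the sample size calculations described in the abstract are not included, and all simulation settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two-group expression data: 1000 genes, 10 samples per group, 20 genes shifted.
n_per, p = 10, 1000
y = np.repeat([0, 1], n_per)
X = rng.normal(size=(2 * n_per, p))
X[y == 1, :20] += 1.5

def tstats(X, y):
    """Gene-wise two-sample t statistics."""
    return stats.ttest_ind(X[y == 0], X[y == 1], axis=0).statistic

obs = np.abs(tstats(X, y))

# Null distribution of the maximum |t| over genes, preserving gene-gene correlation.
B = 500
max_null = np.empty(B)
for b in range(B):
    yb = rng.permutation(y)
    max_null[b] = np.abs(tstats(X, yb)).max()

# Single-step adjusted p-values controlling the family-wise error rate.
adj_p = np.array([(max_null >= t).mean() for t in obs])
print("genes significant at FWER 0.05:", int((adj_p <= 0.05).sum()))
```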

10.
The primary objective of this paper is to provide a guide to implementing Bayesian generalized kernel regression methods for genomic prediction in the statistical software R. Such methods are quite efficient for capturing complex non-linear patterns that conventional linear regression models cannot. Furthermore, these methods are also powerful for leveraging environmental covariates, as in genotype × environment (G×E) prediction, among other settings. In this study we describe the construction of seven kernels: linear, polynomial, sigmoid, Gaussian, exponential, arc-cosine 1 and arc-cosine L. Additionally, we provide illustrative examples of implementing exact kernel methods for genomic prediction under single-environment, multi-environment and multi-trait frameworks, as well as of implementing sparse kernel methods under a multi-environment framework. These examples are followed by a discussion of the strengths and limitations of kernel methods and, subsequently, by conclusions about the main contributions of this paper.
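The paper's examples are in R; the following Python sketch only illustrates the basic kernel-regression recipe under assumed toy data: build a Gaussian kernel from marker genotypes and fit a kernel ridge regression with a precomputed kernel. The other kernels listed above follow the same construction pattern, and the Bayesian, G×E and multi-trait extensions are not shown.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)

# Toy genomic data: n lines genotyped at m markers coded 0/1/2, polygenic trait.
n, m = 300, 1000
M = rng.integers(0, 3, size=(n, m)).astype(float)
y = M @ rng.normal(scale=0.05, size=m) + rng.normal(scale=1.0, size=n)

def gaussian_kernel(A, B, bandwidth):
    """K(a, b) = exp(-||a - b||^2 / bandwidth) for all pairs of rows of A and B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / bandwidth)

train, test = train_test_split(np.arange(n), test_size=0.3, random_state=0)
h = np.median((M[train] ** 2).sum(1))  # crude bandwidth choice, for illustration only
K_train = gaussian_kernel(M[train], M[train], h)
K_test = gaussian_kernel(M[test], M[train], h)

model = KernelRidge(alpha=1.0, kernel="precomputed").fit(K_train, y[train])
pred = model.predict(K_test)
print("predictive correlation:", round(np.corrcoef(pred, y[test])[0, 1], 3))
```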

11.
It was shown recently, using experimental data, that it is possible under certain conditions to determine whether a person with known genotypes at a number of markers was part of a sample from which only allele frequencies are known. Using population genetic and statistical theory, we show that the power of such identification is approximately proportional to the number of independent SNPs divided by the size of the sample from which the allele frequencies are available. We quantify the limits of identification and propose likelihood and regression analysis methods for the analysis of such data. We show that these methods have similar statistical properties and have more desirable properties, in terms of type I error rate and statistical power, than test statistics suggested in the literature.
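A small simulation sketch of the identification problem, using a simple distance-based membership statistic (in the spirit of, but not identical to, the likelihood and regression analyses referred to above): a genotype is compared against the published allele frequencies of the study pool and of an independent reference pool; pool members drift toward positive values. The sizes and frequencies are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate SNP allele frequencies and genotypes for a study pool and a reference pool.
n_snps, n_pool, n_ref = 5000, 100, 100
maf = rng.uniform(0.1, 0.5, size=n_snps)
pool_geno = rng.binomial(2, maf, size=(n_pool, n_snps))
ref_geno = rng.binomial(2, maf, size=(n_ref, n_snps))

pool_freq = pool_geno.mean(axis=0) / 2       # published allele frequencies of the study pool
ref_freq = ref_geno.mean(axis=0) / 2         # frequencies from an independent reference sample

def membership_stat(genotype):
    """Sum over SNPs of |g - f_ref| - |g - f_pool|; large values suggest pool membership."""
    g = genotype / 2
    return np.sum(np.abs(g - ref_freq) - np.abs(g - pool_freq))

in_pool = np.array([membership_stat(pool_geno[i]) for i in range(n_pool)])
outside = np.array([membership_stat(rng.binomial(2, maf)) for _ in range(n_pool)])
print("mean statistic, pool members:", round(in_pool.mean(), 1))
print("mean statistic, non-members :", round(outside.mean(), 1))
```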

12.

Background

In quantitative trait mapping and genomic prediction, Bayesian variable selection methods have gained popularity alongside the increase in marker data and computational resources. Whereas shrinkage-inducing methods are common tools in genomic prediction, rigorous decision making in mapping studies using such models is not well established, and the robustness of posterior results to misspecified assumptions is a concern because biological prior evidence is often weak.

Methods

Here, we evaluate the impact of prior specifications in a shrinkage-based Bayesian variable selection method, presented in a previous study, that places a mixture of uniform priors on genetic marker effects. Unlike most other shrinkage approaches, the use of a mixture of uniform priors provides a coherent framework for inference based on Bayes factors. To evaluate the robustness of genetic associations under varying prior specifications, Bayes factors are compared as signals of positive marker association, whereas genomic estimated breeding values are considered for genomic selection. The impact of any specific prior specification is reduced by calculating combined estimates from multiple specifications. A Gibbs sampler is used to perform Markov chain Monte Carlo (MCMC) estimation, and a generalized expectation-maximization algorithm serves as a faster alternative for maximum a posteriori point estimation. The performance of the method is evaluated using two publicly available data examples: the simulated QTLMAS XII data set and a real data set from a population of pigs.

Results

Combined estimates of Bayes factors were very successful in identifying quantitative trait loci, and the ranking of Bayes factors was fairly stable among markers with positive signals of association under varying prior assumptions, although their magnitudes varied considerably. Genomic estimated breeding values obtained with the mixture of uniform priors compared well to other approaches for both data sets, and the loss of accuracy with the generalized expectation-maximization algorithm was small compared with MCMC.

Conclusions

Since no error-free method to specify priors is available for complex biological phenomena, exploring a wide variety of prior specifications and combining the results provides some solution to this problem. For this purpose, the mixture-of-uniform-priors approach is especially suitable, because it comprises a wide and flexible family of distributions and computationally intensive estimation can be carried out in a reasonable amount of time.

13.
MOTIVATION: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling because of the large number of genes and the small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation that ignores the feature selection step can be severely biased downward. Generic methods such as cross-validation and the non-parametric bootstrap can be very ineffective due to the large variability in the prediction error estimate. RESULTS: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to microarray data by borrowing from the extensive research on identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. AVAILABILITY: R library GeneLogit at http://geocities.com/jg_liao
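The central warning above, that error estimates ignoring the gene-selection step are biased downward, can be illustrated with a generic pipeline (this is not the authors' parametric bootstrap, which uses local false-discovery-rate modeling): on pure-noise data, cross-validating only the classifier after selecting genes on the full data looks impressively accurate, whereas putting selection inside each fold returns the honest near-chance answer.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(8)

# Pure-noise data: no gene is truly associated with the class label.
n, p = 60, 2000
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)

# Wrong: select the 20 "best" genes on all data, then cross-validate only the classifier.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
wrong = cross_val_score(LogisticRegression(C=0.5, max_iter=1000), X_sel, y, cv=5)

# Right: gene selection happens inside each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(C=0.5, max_iter=1000)),
])
right = cross_val_score(pipe, X, y, cv=5)

print("accuracy with selection outside CV:", round(wrong.mean(), 2))  # optimistically high
print("accuracy with selection inside CV :", round(right.mean(), 2))  # near chance (0.5)
```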

14.
We apply geostatistical modeling techniques to investigate spatial patterns of species richness. Unlike most other statistical modeling techniques, which are valid only when observations are independent, geostatistical methods are designed for applications involving spatially dependent observations. When spatial dependencies, sometimes called autocorrelations, exist, geostatistical techniques can be applied to produce optimal predictions in areas (typically proximate to observed data) where no observed data exist. Using tiger beetle (Cicindelidae) data collected in western North America, we investigate the characteristics of spatial relationships in species richness data. First, we compare the accuracy of spatial predictions of species richness when data from grid squares of two different sizes (scales) are used to form the predictions. Next, we examine how prediction accuracy varies as a function of the areal extent of the region under investigation. Then we explore the relationship between the number of observations used to build spatial prediction models and prediction accuracy. Our results indicate that, within the taxon of tiger beetles and for the two scales we investigate, the accuracy of spatial predictions is unrelated to scale and that prediction accuracy is not obviously related to the areal extent of the region under investigation. We also provide information about the relationship between sample size and prediction accuracy, and, finally, we show that prediction accuracy may be substantially diminished if spatial correlations in the data are ignored.
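Geostatistical prediction of this kind can be sketched with Gaussian-process regression, which corresponds to kriging with a chosen covariance function (here a Matérn kernel plus a white-noise nugget term). This is an illustration of the general idea on simulated data, not a reconstruction of the tiger beetle analysis.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(9)

# Simulated species-richness surface observed at scattered grid-square centroids.
n_obs = 150
coords = rng.uniform(0, 10, size=(n_obs, 2))
truth = lambda xy: 10 + 3 * np.sin(xy[:, 0]) + 2 * np.cos(0.8 * xy[:, 1])
richness = truth(coords) + rng.normal(scale=0.8, size=n_obs)

# Matern covariance models the spatial dependence; WhiteKernel absorbs the nugget.
kernel = 1.0 * Matern(length_scale=2.0, nu=1.5) + WhiteKernel(noise_level=0.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, richness)

# Predict richness (with uncertainty) on a regular grid of unobserved locations.
grid = np.array([(x, y) for x in np.linspace(0, 10, 25) for y in np.linspace(0, 10, 25)])
pred, sd = gp.predict(grid, return_std=True)
print("RMSE on the prediction grid:", round(np.sqrt(np.mean((pred - truth(grid)) ** 2)), 2))
```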

15.
Discrete classification is common in genomic signal processing applications, in particular in the classification of discretized gene expression data, in discrete gene expression prediction, and in the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must be employed to obtain reliable estimates of the classification error based on the available data. In genomics, both classifier design and error estimation are complicated by the prevalence of small-sample data sets. This paper presents a broad review of the methodology of classification and error estimation for discrete data in the context of genomics, focusing on performance in small-sample scenarios as well as asymptotic behavior. Keywords: genomics, classification, error estimation, discrete histogram rule, sampling distribution, resubstitution, leave-one-out, ensemble methods, coefficient of determination.
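A minimal sketch of the discrete histogram rule and two of the error estimators named in the keywords, resubstitution (typically optimistic) and leave-one-out; the small-sample and asymptotic analysis of the review is not reproduced, and the data-generating settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)

# Discrete (binary) expression profiles over d genes, two classes.
d, n = 4, 60
y = rng.integers(0, 2, size=n)
probs = np.where(y[:, None] == 1, 0.7, 0.3)       # P(gene "on") depends on the class
X = (rng.random((n, d)) < probs).astype(int)

def histogram_rule(X, y):
    """For each observed cell (discrete profile), predict the majority class."""
    table = {}
    for xi, yi in zip(map(tuple, X), y):
        counts = table.setdefault(xi, [0, 0])
        counts[yi] += 1
    return lambda x: int(np.argmax(table.get(tuple(x), [1, 1])))  # ties/unseen -> class 0

rule = histogram_rule(X, y)
resub = np.mean([rule(X[i]) != y[i] for i in range(n)])

loo_errors = []
for i in range(n):
    mask = np.arange(n) != i
    rule_i = histogram_rule(X[mask], y[mask])
    loo_errors.append(rule_i(X[i]) != y[i])

print("resubstitution error:", round(resub, 2))          # usually optimistic
print("leave-one-out error :", round(float(np.mean(loo_errors)), 2))
```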

16.
MOTIVATION: The false discovery rate (FDR) is defined as the expected percentage of false positives among all claimed positives. In practice, with the true FDR unknown, an estimated FDR can serve as a criterion for evaluating the performance of various statistical methods, provided that the estimated FDR approximates the true FDR well or, at least, does not improperly favor or disfavor any particular method. Permutation methods have become popular for estimating the FDR in genomic studies. The purpose of this paper is twofold. First, we investigate theoretically and empirically whether the standard permutation-based FDR estimator is biased, and if so, whether the bias inappropriately favors or disfavors any method. Second, we propose a simple modification of the standard permutation approach to yield a better FDR estimator, which can in turn serve as a fairer criterion for evaluating various statistical methods. RESULTS: Both simulated and real data examples are used for illustration and comparison. Three commonly used test statistics, the sample mean, the SAM statistic and Student's t-statistic, are considered. The results show that the standard permutation method overestimates the FDR. The overestimation is most severe for the sample mean statistic and least severe for the t-statistic, with the SAM statistic lying between the two extremes, suggesting that one has to be cautious when using standard permutation-based FDR estimates to evaluate various statistical methods. In addition, our proposed FDR estimation method is simple and outperforms the standard method.
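A minimal sketch of the standard permutation-based FDR estimator examined above: for a threshold c, the estimate is the average count of permuted statistics exceeding c divided by the number of observed statistics exceeding c. The proposed modification and any pi0 adjustment are omitted, and the simulation settings are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Two-group data: most genes null, 50 genes truly shifted.
n_per, p = 10, 2000
y = np.repeat([0, 1], n_per)
X = rng.normal(size=(2 * n_per, p))
X[y == 1, :50] += 1.5

def abs_t(X, y):
    """Gene-wise absolute two-sample t statistics."""
    return np.abs(stats.ttest_ind(X[y == 0], X[y == 1], axis=0).statistic)

obs = abs_t(X, y)

# Null statistics from B label permutations.
B = 100
perm = np.vstack([abs_t(X, rng.permutation(y)) for _ in range(B)])

def estimated_fdr(c):
    """Standard permutation FDR estimate at threshold c."""
    false_pos = (perm >= c).sum() / B          # average count over permutations
    called = max((obs >= c).sum(), 1)
    return min(false_pos / called, 1.0)

for c in (2.5, 3.0, 3.5):
    print(f"threshold {c}: {int((obs >= c).sum())} genes called, estimated FDR {estimated_fdr(c):.3f}")
```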

17.
Many late-phase clinical trials recruit subjects at multiple study sites. This introduces a hierarchical structure into the data that can result in a loss of power compared with a more homogeneous single-center trial. Building on a recently proposed approach to sample size determination, we suggest a sample size recalculation procedure for multicenter trials with continuous endpoints. The procedure estimates nuisance parameters at interim from noncomparative data and recalculates the required sample size based on these estimates. In contrast to other sample size calculation methods for multicenter trials, our approach assumes a mixed effects model and does not rely on balanced data within centers. It is therefore advantageous, especially for sample size recalculation at interim. We illustrate the proposed methodology with a study evaluating a diabetes management system. Monte Carlo simulations are carried out to evaluate the operating characteristics of the sample size recalculation procedure using comparative as well as noncomparative data, assessing their dependence on parameters such as between-center heterogeneity, residual variance of observations, treatment effect size and number of centers. We compare two estimators of between-center heterogeneity, an unadjusted and a bias-adjusted estimator, both based on quadratic forms. The type I error probability as well as statistical power are close to their nominal levels for all parameter combinations considered in our simulation study for the proposed unadjusted estimator, whereas the adjusted estimator exhibits some type I error rate inflation. Overall, the sample size recalculation procedure can be recommended to mitigate risks arising from misspecified nuisance parameters at the planning stage.
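A minimal sketch of the recalculation arithmetic for a two-arm comparison of means, using the standard per-arm formula n = 2(z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2 with the nuisance variance replaced by an interim estimate of the total variance (between-center heterogeneity plus residual variance). The variance values are invented for illustration, and the mixed-model heterogeneity estimators compared in the paper are not reproduced.

```python
from scipy.stats import norm

def n_per_arm(total_var, delta, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sample z-test on means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * total_var / delta**2

# Planning stage: assumed residual variance 4.0 and between-center variance 0.5.
delta = 1.0
planned = n_per_arm(total_var=4.0 + 0.5, delta=delta)

# Interim: noncomparative data suggest larger heterogeneity and residual variance.
recalculated = n_per_arm(total_var=5.2 + 1.1, delta=delta)

print(f"planned n per arm     : {planned:.0f}")
print(f"recalculated n per arm: {recalculated:.0f}")
```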

18.
Successful prediction of peptide-MHC binding typically requires a large set of binding data for the specific MHC molecule being examined. Structure-based prediction methods promise to circumvent this requirement by evaluating the physical contacts a peptide can make with an MHC molecule based on the highly conserved 3D structure of peptide:MHC complexes. While several such methods have been described before, most are not publicly available and have not been independently tested for their performance. Here we implemented and evaluated three prediction methods for MHC class II molecules: statistical potentials derived from the analysis of known protein structures; energetic evaluation of different peptide snapshots in a molecular dynamics simulation; and direct analysis of contacts made in known 3D structures of peptide:MHC complexes. These methods are ab initio in that they require structural data for the MHC molecule examined but no specific peptide:MHC binding data. Moreover, they retain the ability to make predictions on a time scale short enough to be useful in real-world applications, such as screening a whole proteome for candidate binding peptides. A rigorous evaluation of each method's prediction performance showed that they perform significantly better than random, but still substantially worse than the best sequence-based class II prediction methods available. While the approaches presented here were developed independently, we have chosen to present our results together to support the notion that generating structure-based predictions of peptide:MHC binding without using binding data is unlikely to give satisfactory results.

19.
In the era of big data analysis, it is of interest to develop diagnostic tools for preliminary scanning of large spatial databases. One problem is the identification of locations where certain characteristics exceed a given norm, e.g. timber volume or mean tree diameter exceeding a user-defined threshold. Among the challenges are the large size of the database, randomness, the complex shape of the spatial mean surface, and heterogeneity. In a step-by-step procedure, we propose a method for achieving this for large spatial data sets. For illustration, we work through a simulated spatial data set as well as a forest inventory data set from Alaska (source: USDA Forest Service). Working within the framework of nonparametric regression modeling, the proposed method attains a high degree of flexibility regarding the shape of the spatial mean surface. Taking advantage of the large sample size, we also provide asymptotic formulas that are easy to implement in any statistical software.
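A minimal sketch of the scanning idea on simulated data (not the authors' nonparametric-regression procedure or its asymptotic formulas): smooth the noisy spatial observations with a k-nearest-neighbour regressor and flag grid cells whose smoothed mean exceeds a user-defined threshold.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(12)

# Simulated forest-inventory-style data: noisy timber volume over a 2-D region.
n = 4000
coords = rng.uniform(0, 100, size=(n, 2))
mean_surface = 50 + 20 * np.exp(-((coords[:, 0] - 70) ** 2 + (coords[:, 1] - 30) ** 2) / 400)
volume = mean_surface + rng.normal(scale=10, size=n)

# Nonparametric smoother of the spatial mean surface.
smoother = KNeighborsRegressor(n_neighbors=50, weights="distance").fit(coords, volume)

# Scan a regular grid and flag locations where the smoothed mean exceeds a norm.
gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
fitted = smoother.predict(grid)
threshold = 60.0
flagged = grid[fitted > threshold]
print(f"{len(flagged)} of {len(grid)} grid cells exceed the threshold of {threshold}")
```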

20.
In this paper we discuss some of the statistical issues that should be considered when conducting experiments involving microarray gene expression data. We discuss statistical issues related to preprocessing the data as well as to its analysis. Analysis of the data is discussed in three contexts: class comparison, class prediction and class discovery. We also review the methods used in two studies that use microarray gene expression to assess the effect of exposure to radiofrequency (RF) fields on gene expression. Our intent is to provide a guide for radiation researchers conducting studies involving microarray gene expression data.

