Similar Articles
20 similar articles retrieved.
1.
Prediction error estimation: a comparison of resampling methods
MOTIVATION: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection. RESULTS: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of available specimens increases. SUPPLEMENTARY INFORMATION: A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
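A minimal sketch of the comparison described above, using scikit-learn; the synthetic data, classifier and fold counts are illustrative assumptions, not the study's actual design. The essential detail is that feature selection is wrapped inside the resampling loop (via a Pipeline), since selecting features on the full data first is exactly what makes resubstitution-style estimates optimistically biased:

```python
# Sketch: resubstitution vs. LOOCV vs. 10-fold CV error estimates when
# features are chosen from many candidates (illustrative settings only).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(0)
n, p, informative = 40, 1000, 20              # few samples, many features
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :informative] += 1.0                # shift the informative features

# Feature selection runs inside every CV fold because it is part of the
# Pipeline; doing it once on the full data would bias all estimates downward.
model = Pipeline([("select", SelectKBest(f_classif, k=10)),
                  ("lda", LinearDiscriminantAnalysis())])

resub = 1 - model.fit(X, y).score(X, y)
loocv = 1 - cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
cv10 = 1 - cross_val_score(model, X, y, cv=10).mean()
print(f"resubstitution={resub:.3f}  LOOCV={loocv:.3f}  10-fold CV={cv10:.3f}")
```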

2.
MOTIVATION: Feature selection approaches, such as filter and wrapper, have been applied to address the gene selection problem in the microarray data analysis literature. In wrapper methods, the classification error is usually used as the evaluation criterion for feature subsets. Due to the high dimensionality and small sample size of microarray data, however, counting-based error estimation may not be an ideal criterion for the gene selection problem. RESULTS: Our study reveals that evaluating genes in terms of counting-based error estimators such as resubstitution error, leave-one-out error, cross-validation error and bootstrap error may encounter a severe ties problem, i.e. two or more gene subsets score equally, which in turn results in uncertainty in gene selection. Our analysis finds that the ties problem is caused by the discrete nature of counting-based error estimators and can be avoided by using continuous evaluation criteria instead. Experimental results show that continuous evaluation criteria, such as a generalised |w|^2 measure for support vector machines and a modified Relief measure for k-nearest neighbors, produce improved gene selection compared with counting-based error estimators. AVAILABILITY: The companion website is at http://www.ntu.edu.sg/home5/pg02776030/wrappers/ The website contains (1) the source code of all the gene selection algorithms and (2) the complete set of tables and figures of the experiments.
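The ties problem is easy to reproduce: a counting-based estimator on n samples can only take values k/n, so distinct gene subsets frequently score equally, while a margin-derived quantity such as the squared weight norm of a linear SVM varies continuously. The sketch below illustrates the contrast on synthetic data; the subsets, the data and the use of |w|^2 as the continuous score are illustrative assumptions, not the paper's exact criteria:

```python
# Sketch: discrete LOOCV errors (multiples of 1/n) often tie across gene
# subsets, while a continuous SVM weight-based score separates them.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(1)
n, p = 30, 50
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :5] += 1.2                          # genes 0-4 carry signal

subsets = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 10], [0, 1, 2, 20, 21]]
for genes in subsets:
    Xs = X[:, genes]
    clf = SVC(kernel="linear")
    err = 1 - cross_val_score(clf, Xs, y, cv=LeaveOneOut()).mean()
    w2 = float(np.sum(clf.fit(Xs, y).coef_ ** 2))   # continuous criterion
    print(f"genes={genes}  LOOCV error={err:.3f}  |w|^2={w2:.3f}")
```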

3.
Cross-validation based point estimates of prediction accuracy are frequently reported in microarray class prediction problems. However, these point estimates can be highly variable, particularly for small sample numbers, and it would be useful to provide confidence intervals of prediction accuracy. We performed an extensive study of existing confidence interval methods and compared their performance in terms of empirical coverage and width. We developed a bootstrap case cross-validation (BCCV) resampling scheme and defined several confidence interval methods using BCCV with and without bias-correction. The widely used approach of basing confidence intervals on an independent binomial assumption of the leave-one-out cross-validation errors results in serious under-coverage of the true prediction error. Two split-sample based methods previously proposed in the literature tend to give overly conservative confidence intervals. Using BCCV resampling, the percentile confidence interval method was also found to be overly conservative without bias-correction, while the bias-corrected and accelerated (BCa) interval method of Efron returns substantially anti-conservative confidence intervals. We propose a simple bias reduction on the BCCV percentile interval. The method provides mildly conservative inference under all circumstances studied and outperforms the other methods in microarray applications with small to moderate sample sizes.
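A simplified sketch of the two interval constructions discussed above: the naive interval that treats cross-validation errors as independent binomial outcomes, and a percentile interval obtained by resampling cases and repeating the cross-validation, in the spirit of BCCV (without the paper's bias correction, and ignoring the duplicate-case subtleties a full BCCV implementation must handle):

```python
# Sketch: naive binomial CI on the CV accuracy vs. a case-bootstrap
# percentile CI (BCCV-flavoured, without bias correction).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 50, 20
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :3] += 0.8
clf = LogisticRegression(max_iter=1000)

acc = cross_val_score(clf, X, y, cv=10).mean()
se = np.sqrt(acc * (1 - acc) / n)             # treats the n CV errors as iid
print(f"binomial CI:  ({acc - 1.96 * se:.3f}, {acc + 1.96 * se:.3f})")

boot = []
for _ in range(200):                          # resample cases, redo the CV
    idx = rng.integers(0, n, size=n)
    boot.append(cross_val_score(clf, X[idx], y[idx], cv=10).mean())
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"bootstrap CI: ({lo:.3f}, {hi:.3f})")
```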

4.
MOTIVATION: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to the large number of genes and small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation that ignores the feature selection step can be severely downward biased. Generic methods such as cross-validation and the non-parametric bootstrap can be very ineffective due to the large variability in the prediction error estimate. RESULTS: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to microarray data by borrowing from the extensive research on identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. AVAILABILITY: R library GeneLogit at http://geocities.com/jg_liao

5.
Is cross-validation valid for small-sample microarray classification?
MOTIVATION: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. RESULTS: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules (linear discriminant analysis, 3-nearest-neighbor and decision trees, CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit, much less than with resubstitution).
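The deviation-distribution methodology is straightforward to emulate: simulate many small training sets from a known model, approximate each classifier's true error on a large independent test set, and summarize the estimated-minus-true deviations. A minimal sketch under an assumed Gaussian model (not the study's exact simulation settings):

```python
# Sketch: distribution of (estimated - true) error for LOOCV and
# resubstitution over repeated small-sample simulations.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(3)

def draw(n):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + 0.8 * y[:, None]
    return X, y

X_big, y_big = draw(5000)                     # proxy for the true error
dev_cv, dev_rs = [], []
for _ in range(100):
    X, y = draw(20)                           # very small training sample
    while np.bincount(y, minlength=2).min() < 3:   # guard degenerate draws
        X, y = draw(20)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    true_err = 1 - clf.score(X_big, y_big)
    dev_rs.append((1 - clf.score(X, y)) - true_err)
    cv = 1 - cross_val_score(LinearDiscriminantAnalysis(), X, y,
                             cv=LeaveOneOut()).mean()
    dev_cv.append(cv - true_err)

for name, d in [("LOOCV", np.array(dev_cv)), ("resub", np.array(dev_rs))]:
    print(f"{name:>6}: bias={d.mean():+.3f}  sd={d.std():.3f}  "
          f"rmse={np.sqrt((d ** 2).mean()):.3f}")
```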

6.
MOTIVATION: Ranking feature sets is a key issue for classification, for instance, phenotype classification based on gene expression. Since ranking is often based on error estimation, and error estimators suffer from differing degrees of imprecision in small-sample settings, it is important to choose a computationally feasible error estimator that yields good feature-set ranking. RESULTS: This paper examines the feature-ranking performance of several kinds of error estimators: resubstitution, cross-validation, bootstrap and bolstered error estimation. It does so for three classification rules: linear discriminant analysis, three-nearest-neighbor classification and classification trees. Two measures of performance are considered. One counts the number of the truly best feature sets appearing among the best feature sets discovered by the error estimator, and the other computes the mean absolute error between the top ranks of the truly best feature sets and their ranks as given by the error estimator. Our results indicate that bolstering is superior to bootstrap, and bootstrap is better than cross-validation, for discovering top-performing feature sets for classification when using small samples. A key issue is that bolstered error estimation is tens of times faster than bootstrap, and faster than cross-validation, and is therefore feasible for feature-set ranking when the number of feature sets is extremely large.

7.
In microarray studies it is common that the number of replications (i.e. the sample size) is small and that the distribution of expression values differs from normality. In this situation, permutation and bootstrap tests may be appropriate for the identification of differentially expressed genes. However, unlike bootstrap tests, permutation tests are not suitable for very small sample sizes, such as three per group. A variety of different bootstrap tests exists. For example, it is possible to adjust the data to have a common mean before the bootstrap samples are drawn. For small significance levels, which can occur when a large number of genes is investigated, the original bootstrap test, as well as a bootstrap test suggested for the Behrens-Fisher problem, have no power in cases of very small sample sizes. In contrast, the modified test based on adjusted data is powerful. Using a Monte Carlo simulation study, we demonstrate that the difference in power can be huge. In addition, the different tests are illustrated using microarray data.
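A sketch of the mean-adjusted bootstrap test described above: each group is shifted to the pooled mean so that resampling occurs under an approximate null, and the observed mean difference is then compared against the resampled distribution. This is a generic construction under stated assumptions, not the authors' exact test statistic:

```python
# Sketch: two-sample bootstrap test on data adjusted to a common mean,
# so that resampling happens under an approximate null hypothesis.
import numpy as np

def bootstrap_test(x, y, B=10000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs = x.mean() - y.mean()
    pooled = np.concatenate([x, y]).mean()
    x0 = x - x.mean() + pooled                # shift both groups to the
    y0 = y - y.mean() + pooled                # common (pooled) mean
    t_null = np.empty(B)
    for b in range(B):
        xb = rng.choice(x0, size=x0.size, replace=True)
        yb = rng.choice(y0, size=y0.size, replace=True)
        t_null[b] = xb.mean() - yb.mean()
    return float(np.mean(np.abs(t_null) >= abs(t_obs)))  # two-sided p-value

x = np.array([2.1, 2.9, 3.4])                 # three replicates per group,
y = np.array([1.2, 1.4, 2.0])                 # as in the scenario above
print("p =", bootstrap_test(x, y))
```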

8.
Non-biological experimental variation, or "batch effects", is commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers, increasing statistical power to detect biological phenomena in studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
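For intuition only, here is a drastically simplified, location-only sketch of the empirical Bayes idea: each gene's batch offset is shrunk toward the batch-wide average offset across genes before being removed. The published method additionally models batch scale effects and covariates and estimates the shrinkage weights from the data; the fixed `shrink` weight here is an assumption:

```python
# Grossly simplified, location-only sketch of empirical-Bayes-style batch
# adjustment; 'shrink' stands in for the data-driven EB shrinkage weight.
import numpy as np

def eb_location_adjust(X, batch, shrink=0.5):
    """X: genes x samples matrix; batch: length-n_samples label array."""
    Xadj = X.astype(float).copy()
    grand = X.mean(axis=1, keepdims=True)             # per-gene grand mean
    for b in np.unique(batch):
        cols = batch == b
        offset = X[:, cols].mean(axis=1, keepdims=True) - grand
        # shrink gene-specific offsets toward the batch-wide average offset
        shrunk = shrink * offset + (1 - shrink) * offset.mean()
        Xadj[:, cols] -= shrunk
    return Xadj

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 12))
batch = np.repeat([0, 1, 2], 4)                       # three small batches
X[:, batch == 1] += 1.5                               # additive batch effect
Xc = eb_location_adjust(X, batch)
print("before:", [round(float(X[:, batch == b].mean()), 2) for b in range(3)])
print("after: ", [round(float(Xc[:, batch == b].mean()), 2) for b in range(3)])
```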

9.
Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which decomposes one large penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach consists of two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
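A sketch of the large-n strategy in its simplest form: draw small bootstrap subsamples of subjects, run the penalized fit on each, and pool the coefficients by averaging. The subsample size, replicate count, lasso penalty and unweighted average are illustrative assumptions; the authors' R package implements the full procedure, including the block-selection strategy for large p:

```python
# Sketch: penalized fits on many small bootstrap subsamples of subjects,
# pooled by averaging the coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p, m, B = 100_000, 20, 2000, 50            # large n, small subsamples
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]        # five true signals
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

coefs = np.zeros((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=m)          # a "small computer" sized task
    coefs[b] = Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_

print("pooled estimate:", np.round(coefs.mean(axis=0)[:6], 2))
```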

10.
MOTIVATION: The protein lysate microarray is a developing proteomic technology for measuring protein expression levels in a large number of biological samples simultaneously. A challenge for accurate quantification is the relatively narrow dynamic range associated with the commonly used chromogenic signal detection system. To facilitate accurate measurement of the relative expression levels, each sample is serially diluted and each diluted version is spotted on a nitrocellulose-coated slide in triplicate. Thus, each sample yields multiple measurements in different dynamic ranges of the detection system. This study aims to develop suitable algorithms that yield accurate representations of the relative expression levels in different samples from multiple data points. RESULTS: We evaluated two algorithms for estimating relative protein expression in different samples on the lysate microarray by means of a cross-validation procedure. For this purpose, as well as for quality control, we designed a 1440-spot lysate microarray containing 80 identical samples of purified bovine serum albumin, printed in triplicate with six 2-fold dilutions. Our analysis showed that the algorithm based on a robust least squares estimator provided the most accurate quantification of the protein lysate microarray data. We also demonstrated our methods by estimating relative expression levels of p53 and p21 in either p53(+/+) or p53(-/-) HCT116 colon cancer cells after two drug treatments and their combinations on another lysate microarray. AVAILABILITY: http://www.cs.tut.fi/~mirceanc/lysate_array_bioinformatics.htm

11.
Tan Y, Liu Y. Bioinformation 2011, 7(8):400-404.
Identification of genes differentially expressed across multiple conditions has become an important statistical problem in the analysis of large-scale microarray data. Many statistical methods have been developed to address this challenging problem, so an extensive comparison among them is extremely important for experimental scientists choosing a valid method for their data analysis. In this study, we conducted simulation studies to compare six statistical methods for identifying differentially expressed genes: the Bonferroni (B-) procedure, the Benjamini and Hochberg (BH-) procedure, the local false discovery rate (Localfdr) method, the Optimal Discovery Procedure (ODP), the Ranking Analysis of F-statistics (RAF), and the Significance Analysis of Microarrays (SAM). We demonstrated that the strength of the treatment effect, the sample size, the proportion of differentially expressed genes and the variance of gene expression all significantly affect the performance of the different methods. The simulation results show that ODP exhibits extremely high power in identifying differentially expressed genes, but significantly underestimates the false discovery rate (FDR) in all data scenarios. SAM performs poorly when the sample size is small, but is among the best-performing methods when the sample size is large. The B-procedure is stringent and thus has low power in all data scenarios. Localfdr and RAF behave comparably to the BH-procedure, with favorable power and conservative FDR estimation. RAF performs best when the proportion of differentially expressed genes is small and the treatment effect is weak, whereas Localfdr is better than RAF when the proportion of differentially expressed genes is large.
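Of the six procedures compared, the BH step-up procedure is the easiest to state compactly: sort the p-values, find the largest k with p_(k) <= qk/m, and reject the k smallest. A sketch with synthetic p-values (the null/alternative mixture is an assumption), with Bonferroni shown for contrast:

```python
# Sketch of the Benjamini-Hochberg step-up procedure on synthetic p-values.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m   # step-up thresholds
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                  # reject the k smallest p-values
    return reject

rng = np.random.default_rng(6)
pv = np.concatenate([rng.uniform(size=900),           # 900 true nulls
                     rng.beta(0.5, 20.0, size=100)])  # 100 shifted genes
print("BH rejections:        ", int(benjamini_hochberg(pv).sum()))
print("Bonferroni rejections:", int((pv < 0.05 / pv.size).sum()))
```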

12.

Background  

Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust, with good generalization performance on new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval for the error rate using repeated design and test sets selected from the available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, different methods for small-sample performance estimation, such as the recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translate into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT).

13.
MOTIVATION: Ranking gene feature sets is a key issue both for phenotype classification, for instance, tumor classification in a DNA microarray experiment, and for prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. RESULTS: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.

14.
Convex bootstrap error estimation is a popular tool for classifier error estimation in gene expression studies. A basic question is how to determine the weight for the convex combination between the basic bootstrap estimator and the resubstitution estimator such that the resulting estimator is unbiased at finite sample sizes. The well-known 0.632 bootstrap error estimator uses asymptotic arguments to propose a fixed 0.632 weight, whereas the more recent 0.632+ bootstrap error estimator attempts to set the weight adaptively. In this paper, we study the finite sample problem in the case of linear discriminant analysis under Gaussian populations. We derive exact expressions for the weight that guarantee unbiasedness of the convex bootstrap error estimator in the univariate and multivariate cases, without making asymptotic simplifications. Using exact computation in the univariate case and an accurate approximation in the multivariate case, we obtain the required weight and show that it can deviate significantly from the constant 0.632 weight, depending on the sample size and Bayes error for the problem. The methodology is illustrated by application on data from a well-known cancer classification study.
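The convex estimator itself is a one-liner once its two ingredients are computed: the resubstitution error and the out-of-bag bootstrap error. A sketch with the fixed 0.632 weight that the paper argues can be far from the unbiased choice at small sample sizes (LDA on synthetic Gaussian data, as an assumed illustration):

```python
# Sketch of the convex bootstrap estimator with the fixed 0.632 weight:
# err = (1 - w) * resubstitution + w * out-of-bag bootstrap error.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
n = 30
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 4)) + y[:, None]      # Gaussian populations

resub = 1 - LinearDiscriminantAnalysis().fit(X, y).score(X, y)

oob_errs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)          # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)     # cases left out of the sample
    if oob.size == 0 or np.unique(y[idx]).size < 2:
        continue                              # skip degenerate draws
    clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    oob_errs.append(1 - clf.score(X[oob], y[oob]))
err_oob = float(np.mean(oob_errs))

w = 0.632
print(f"resub={resub:.3f}  oob={err_oob:.3f}  "
      f".632 estimate={(1 - w) * resub + w * err_oob:.3f}")
```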

15.
Zhou XH, Tu W. Biometrics 2000, 56(4):1118-1125.
In this paper, we consider the problem of interval estimation for the mean of diagnostic test charges. Diagnostic test charge data may contain zero values, and the nonzero values can often be modeled by a log-normal distribution. Under such a model, we propose three different interval estimation procedures: a percentile-t bootstrap interval based on sufficient statistics and two likelihood-based confidence intervals. For theoretical properties, we show that the two likelihood-based one-sided confidence intervals are only first-order accurate and that the bootstrap-based one-sided confidence interval is second-order accurate. For two-sided confidence intervals, all three proposed methods are second-order accurate. A simulation study at finite sample sizes suggests that all three proposed intervals outperform a widely used interval based on the minimum variance unbiased estimator (MVUE), except for one-sided lower end-point intervals when the skewness is very small. Among the proposed one-sided intervals, the bootstrap interval has the best coverage accuracy. For the two-sided intervals, when the sample size is small, the bootstrap method still yields the best coverage accuracy unless the skewness is very small, in which case the bias-corrected ML method has the best accuracy. When the sample size is large, all three proposed intervals have similar coverage accuracy. Finally, we apply the proposed methods to a real example assessing diagnostic test charges among older adults with depression.
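A sketch of a generic percentile-t (bootstrap-t) interval for a mean, applied to zero-inflated log-normal data of the kind described above; the studentized resampling shown here is the textbook construction, not the authors' sufficient-statistic-based variant:

```python
# Sketch: percentile-t (bootstrap-t) interval for the mean of
# zero-inflated log-normal "charge" data.
import numpy as np

rng = np.random.default_rng(8)
n = 40
is_zero = rng.uniform(size=n) < 0.25          # 25% zero charges
x = np.where(is_zero, 0.0, rng.lognormal(mean=4.0, sigma=1.0, size=n))

B = 5000
t = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=n, replace=True)
    se_b = xb.std(ddof=1) / np.sqrt(n)
    t[b] = (xb.mean() - x.mean()) / se_b      # studentized resample
q_lo, q_hi = np.percentile(t, [2.5, 97.5])
se = x.std(ddof=1) / np.sqrt(n)
print(f"bootstrap-t 95% CI: ({x.mean() - q_hi * se:.1f}, "
      f"{x.mean() - q_lo * se:.1f})")
```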

16.
We examined whether the predictive ability of genomic best linear unbiased prediction (GBLUP) could be improved via a resampling method used in machine learning: bootstrap aggregating sampling (“bagging”). In theory, bagging can be useful when the predictor has large variance or when the number of markers is much larger than the sample size, preventing effective regularization. After a brief review of GBLUP, bagging was adapted to the context of GBLUP, both at the level of the genetic signal and at the level of marker effects. The performance of bagging was evaluated in four simulated case studies with known or unknown quantitative trait loci, and an application was made to real data on grain yield in wheat planted in four environments. A metric aimed at quantifying candidate-specific cross-validation uncertainty was proposed and assessed; as expected, model-derived theoretical reliabilities bore no relationship to cross-validation accuracy. We found that bagging can improve the predictive performance of GBLUP and make it more robust against over-fitting. Seemingly, 25–50 bootstrap samples were enough to attain reasonable predictions as well as stable measures of individual predictive mean squared errors.
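A sketch of bagging at the level of the genetic signal, with GBLUP stood in for by ridge regression on the marker matrix (an equivalent formulation of GBLUP up to the choice of the shrinkage parameter); the marker simulation, penalty and 25 bags are assumptions for illustration:

```python
# Sketch: bagged ridge regression on markers as a stand-in for bagged GBLUP,
# averaging predictions over bootstrap samples of the training set.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(9)
n, p = 300, 2000                              # many more markers than lines
X = rng.integers(0, 3, size=(n, p)).astype(float)   # SNP genotypes 0/1/2
beta = rng.normal(scale=0.1, size=p) * (rng.uniform(size=p) < 0.02)
y = X @ beta + rng.normal(size=n)

train, test = np.arange(250), np.arange(250, n)
bags = 25                                     # 25-50 bags sufficed in the paper
preds = np.zeros((bags, test.size))
for b in range(bags):
    idx = rng.choice(train, size=train.size, replace=True)
    preds[b] = Ridge(alpha=100.0).fit(X[idx], y[idx]).predict(X[test])

single = Ridge(alpha=100.0).fit(X[train], y[train]).predict(X[test])
print("r single:", round(float(np.corrcoef(single, y[test])[0, 1]), 3),
      " r bagged:", round(float(np.corrcoef(preds.mean(axis=0), y[test])[0, 1]), 3))
```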

17.
The fine-scale spatial genetic structure (SGS) of alpine plants is receiving increasing attention, from which seed and pollen dispersal can be inferred. However, estimation of SGS may depend strongly on the sampling strategy, including the sample size and the spatial sampling scheme. Here, we examined the effects of sample size and of three spatial schemes, simple-random, line-transect, and random-cluster sampling, on the estimation of SGS in Androsace tapete, an alpine cushion plant endemic to the Qinghai-Tibetan Plateau. Using both real and simulated data of dominant molecular markers, we show that: (i) SGS is highly sensitive to sampling strategy, especially when the sample size is small (e.g., below 100); (ii) the commonly used SGS parameter (the intercept of the autocorrelogram) is more susceptible to sampling error than the newly developed Sp statistic; and (iii) the random-cluster scheme is susceptible to obvious bias in parameter estimation even when the sample size is relatively large (e.g., above 200). Overall, the line-transect scheme is recommendable, in that it performs slightly better than the simple-random scheme in parameter estimation and is more efficient at encompassing broad spatial scales. The consistency between simulated and real data implies that these findings may hold true in other alpine plants, and more species should be examined in future work.

18.
MOTIVATION: Sample size calculation is important in experimental design and is even more so in microarray or proteomic experiments, since only a few repetitions can be afforded. In the multiple testing problems involving these experiments, it is more powerful and more reasonable to control the false discovery rate (FDR) or positive FDR (pFDR) instead of the type I error, e.g. the family-wise error rate (FWER). When controlling FDR, the traditional approach of estimating sample size by controlling type I error is no longer applicable. RESULTS: Our proposed method applies to controlling FDR. The sample size calculation is straightforward and requires minimal computation, as illustrated with two-sample t-tests and F-tests. Based on simulation with the resulting sample size, the power is shown to be achievable by the q-value procedure. AVAILABILITY: Matlab code implementing the described methods is available upon request.
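The general recipe can be sketched as follows: approximate FDR by m0*alpha / (m0*alpha + m1*(1-beta)), solve for the per-test alpha implied by the target FDR, the expected number of true alternatives and the desired power, and then size the two-sample t-test at that alpha. The numbers and the power solver below are assumptions for illustration, not necessarily the authors' formulas:

```python
# Sketch: turn a target FDR into a per-test alpha, then size the t-test.
from statsmodels.stats.power import TTestIndPower

m, m1 = 10_000, 100                 # total tests, expected true alternatives
fdr, power = 0.05, 0.80
m0 = m - m1
alpha = (m1 * power / m0) * fdr / (1 - fdr)   # implied per-test type I error
n = TTestIndPower().solve_power(effect_size=1.0, alpha=alpha,
                                power=power, alternative='two-sided')
print(f"per-test alpha = {alpha:.2e}, n per group = {n:.1f}")
```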

19.

Background:  

In class prediction problems using microarray data, gene selection is essential to improve prediction accuracy and to identify potential marker genes for a disease. Among the numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed from the samples on the margin. However, its performance can easily be affected by noise and outliers when it is applied to noisy, small-sample microarray data.
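Standard SVM-RFE, as summarized above, is available off the shelf: a linear SVM is fit, genes with the smallest squared weights are eliminated, and the process repeats. A minimal sketch via scikit-learn (the data and parameters are illustrative assumptions):

```python
# Sketch of standard SVM-RFE: fit a linear SVM, drop the genes with the
# smallest squared weights, and repeat until 10 genes remain.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(10)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[y == 1, :10] += 1.0                         # ten true marker genes

rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=0.1)
rfe.fit(X, y)
print("selected genes:", np.flatnonzero(rfe.support_))
```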

20.
Bootstrap confidence intervals for adaptive cluster sampling
Consider a collection of spatially clustered objects where the clusters are geographically rare. Of interest is estimation of the total number of objects on the site from a sample of plots of equal size. Under these spatial conditions, adaptive cluster sampling of plots is generally useful in improving efficiency in estimation over simple random sampling without replacement (SRSWOR). In adaptive cluster sampling, when a sampled plot meets some predefined condition, neighboring plots are added to the sample. When populations are rare and clustered, the usual unbiased estimators based on small samples are often highly skewed and discrete in distribution. Thus, confidence intervals based on asymptotic normal theory may not be appropriate. We investigated several nonparametric bootstrap methods for constructing confidence intervals under adaptive cluster sampling. To perform bootstrapping, we transformed the initial sample in order to include the information from the adaptive portion of the sample yet maintain a fixed sample size. In general, coverages of bootstrap percentile methods were closer to nominal coverage than the normal approximation.
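A simplified sketch of the transformation-plus-percentile idea: each initially sampled plot is represented by the mean of the network it grew into (restoring a fixed sample size), and the estimated total is bootstrapped from those transformed values. The simulated rare-cluster data and the plain percentile interval are assumptions; the paper studies several bootstrap variants:

```python
# Sketch: percentile bootstrap for the population total from a transformed
# adaptive cluster sample of fixed size n.
import numpy as np

rng = np.random.default_rng(11)
N, n = 400, 30                                # plots in population / in sample
w = np.zeros(n)                               # network means per initial plot
hit = rng.uniform(size=n) < 0.1               # rare clusters intersected
w[hit] = rng.gamma(shape=2.0, scale=15.0, size=int(hit.sum()))

total_hat = N * w.mean()
boot = np.array([N * rng.choice(w, size=n, replace=True).mean()
                 for _ in range(5000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimated total = {total_hat:.0f}, 95% CI = ({lo:.0f}, {hi:.0f})")
```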
