Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
The bootstrap is a tool that allows for efficient evaluation of prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because there the number of observations is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner it would be used on new data. This includes a selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as using a fixed level of complexity or sampling without replacement, are investigated, and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and censored time-to-event data with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
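
The two resampling schemes contrasted in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the sample size `n = 1000`, and the subsample fraction 0.632 are assumptions chosen for the example. The fraction mirrors the expected share of distinct observations in a bootstrap sample, 1 - (1 - 1/n)^n, which is approximately 0.632 for large n.

```python
import random

def bootstrap_indices(n, rng):
    """Conventional bootstrap: draw n indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def subsample_indices(n, fraction, rng):
    """Subsampling: draw round(fraction * n) distinct indices (no replacement)."""
    return rng.sample(range(n), round(fraction * n))

def distinct_fraction(indices, n):
    """Share of the n original observations that appear at least once."""
    return len(set(indices)) / n

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 1000                 # hypothetical number of observations

boot = bootstrap_indices(n, rng)
sub = subsample_indices(n, 0.632, rng)   # 0.632 is an assumed fraction

# Observations never drawn can serve as held-out data for error estimation.
out_of_bag = set(range(n)) - set(boot)

print("distinct fraction in bootstrap sample:", distinct_fraction(boot, n))
print("duplicates in subsample:", len(sub) - len(set(sub)))
```

A bootstrap sample contains repeated observations, whereas a subsample of comparable information content does not; whether that difference matters for a given complexity-selection rule is exactly what the abstract investigates.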

2.
The bootstrap method has become a widely used tool applied in diverse areas where results based on asymptotic theory are scarce. It can be applied, for example, for assessing the variance of a statistic, a quantile of interest or for significance testing by resampling from the null hypothesis. Recently, some approaches have been proposed in the biometrical field where hypothesis testing or model selection is performed on a bootstrap sample as if it were the original sample. P‐values computed from bootstrap samples have been used, for example, in the statistics and bioinformatics literature for ranking genes with respect to their differential expression, for estimating the variability of p‐values and for model stability investigations. Procedures which make use of bootstrapped information criteria are often applied in model stability investigations and model averaging approaches as well as when estimating the error of model selection procedures which involve tuning parameters. From the literature, however, there is evidence that p‐values and model selection criteria evaluated on bootstrap data sets do not represent what would be obtained on the original data or new data drawn from the overall population. We explain the reasons for this and, through the use of a real data set and simulations, we assess the practical impact on procedures relevant to biometrical applications in cases where it has not yet been studied. Moreover, we investigate the behavior of subsampling (i.e., drawing from a data set without replacement) as a potential alternative solution to the bootstrap for these procedures.

3.
The statistical framework of maximum likelihood estimation is used to examine character weighting in inferring phylogenies. A simple probabilistic model of evolution is used, in which each character evolves independently among two states, and different lineages evolve independently. When different characters have different known probabilities of change, all sufficiently small, the proper maximum likelihood method of estimating phylogenies is a weighted parsimony method in which the weights are logarithmically related to the rates of change. When rates of change are taken extremely small, the weights become more equal and unweighted parsimony methods are obtained. When it is known that a few characters have very high rates of change and the rest very low rates, but it is not known which characters are the ones having the high rates, the maximum likelihood criterion supports use of compatibility methods. By varying the fraction of characters believed to have high rates of change one obtains a ‘threshold method’ whose behavior depends on the value of a parameter. By altering this parameter the method changes smoothly from being a parsimony method to being a compatibility method. This provides us with a spectrum of intermediates between these methods. These intermediate methods may be of use in analysing real data.

4.
The statistical framework of maximum likelihood estimation is used to examine character weighting in inferring phylogenies. A simple probabilistic model of evolution is used, in which each character evolves independently among two states, and different lineages evolve independently. When different characters have different known probabilities of change, all sufficiently small, the proper maximum likelihood method of estimating phylogenies is a weighted parsimony method in which the weights are logarithmically related to the rates of change. When rates of change are taken extremely small, the weights become more equal and unweighted parsimony methods are obtained.
When it is known that a few characters have very high rates of change and the rest very low rates, but it is not known which characters are the ones having the high rates, the maximum likelihood criterion supports use of compatibility methods. By varying the fraction of characters believed to have high rates of change one obtains a 'threshold method' whose behavior depends on the value of a parameter. By altering this parameter the method changes smoothly from being a parsimony method to being a compatibility method. This provides us with a spectrum of intermediates between these methods. These intermediate methods may be of use in analysing real data.

5.
Wu LY, Sun L, Bull SB. Human Heredity 2006, 62(2): 84-96
BACKGROUND/AIMS: In genome-wide linkage analysis of quantitative trait loci (QTL), locus-specific heritability estimates are biased when the original data are used to both localize linkage and estimate effects, due to maximization of the LOD score over the genome. Positive bias is increased by adoption of stringent significance levels to control genome-wide type I error. We propose multi-locus bootstrap resampling estimators for bias reduction in the situation in which linkage peaks at more than one QTL are of interest. METHODS: Bootstrap estimates were based on repeated sample splitting in the original dataset. We conducted simulation studies in nuclear families with 0 to 5 QTLs and applied the methods in a genome-wide analysis of a blood pressure phenotype in extended pedigrees from the Framingham Heart Study (FHS). RESULTS: Compared to naïve estimates in the original simulation samples, bootstrap estimates had reduced bias and smaller mean squared error. In the FHS pedigrees, the bootstrap yielded heritability estimates as much as 70% smaller than in the original sample. CONCLUSIONS: Because effect estimates obtained in an initial study are typically inflated relative to those expected in an independent replication study, successful replication will be more likely when sample size requirements are based on bias-reduced estimates.

6.
We explore the estimation of uncertainty in evolutionary parameters using a recently devised approach for resampling entire additive genetic variance–covariance matrices (G). Large‐sample theory shows that maximum‐likelihood estimates (including restricted maximum likelihood, REML) asymptotically have a multivariate normal distribution, with covariance matrix derived from the inverse of the information matrix, and mean equal to the estimated G. This suggests that sampling estimates of G from this distribution can be used to assess the variability of estimates of G, and of functions of G. We refer to this as the REML‐MVN method. This has been implemented in the mixed‐model program WOMBAT. Estimates of sampling variances from REML‐MVN were compared to those from the parametric bootstrap and from a Bayesian Markov chain Monte Carlo (MCMC) approach (implemented in the R package MCMCglmm). We apply each approach to evolvability statistics previously estimated for a large, 20‐dimensional data set for Drosophila wings. REML‐MVN and MCMC sampling variances are close to those estimated with the parametric bootstrap. Both slightly underestimate the error in the best‐estimated aspects of the G matrix. REML analysis supports the previous conclusion that the G matrix for this population is full rank. REML‐MVN is computationally very efficient, making it an attractive alternative to both data resampling and MCMC approaches to assessing confidence in parameters of evolutionary interest.

7.
Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap‐based approach. The idea is to apply the variable selection procedure on a large number of bootstrap samples successively and to examine the obtained models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we aim to investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories and to give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables even if none of them have any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.

8.
The statistical properties of sample estimation and bootstrap estimation of phylogenetic variability from a sample of nucleotide sequences are studied by using model trees of three taxa with an outgroup and by assuming a constant rate of nucleotide substitution. The maximum-parsimony method of tree reconstruction is used. An analytic formula is derived for estimating the sequence length that is required if P, the probability of obtaining the true tree from the sampled sequences, is to be equal to or higher than a given value. Bootstrap estimation is formulated as a two-step sampling procedure: (1) sampling of sequences from the evolutionary process and (2) resampling of the original sequence sample. The probability that a bootstrap resampling of an original sequence sample will support the true tree is found to depend on the model tree, the sequence length, and the probability that a randomly chosen nucleotide site is an informative site. When a trifurcating tree is used as the model tree, the probability that one of the three bifurcating trees will appear in ≥ 95% of the bootstrap replicates is < 5%, even if the number of bootstrap replicates is only 50; therefore, the probability of accepting an erroneous tree as the true tree is < 5% if that tree appears in ≥ 95% of the bootstrap replicates and if more than 50 bootstrap replications are conducted. However, if a particular bifurcating tree is observed in, say, < 75% of the bootstrap replicates, then it cannot be claimed to be better than the trifurcating tree even if ≥ 1,000 bootstrap replications are conducted. When a bifurcating tree is used as the model tree, the bootstrap approach tends to overestimate P when the sequences are very short, but it tends to underestimate that probability when the sequences are long. Moreover, simulation results show that, if a tree is accepted as the true tree only if it has appeared in ≥ 95% of the bootstrap replicates, then the probability of failing to accept any bifurcating tree can be as large as 58% even when P = 95%, i.e., even when 95% of the samples from the evolutionary process will support the true tree. Thus, if the rate-constancy assumption holds, bootstrapping is a conservative approach for estimating the reliability of an inferred phylogeny for four taxa.

9.
A method is described to determine the number of significant dimensions in metric ordination of a sample. The method is probabilistic, based on bootstrap resampling. An iterative algorithm takes bootstrap samples with replacement from the sample. It finds in each bootstrap sample ordination coordinates and computes, after Procrustean adjustments, the correlation between observed and bootstrap ordination scores. It compares this correlation to the same parameter generated in a parallel bootstrapped ordination of randomly permuted data, which upon many iterations will generate a probability. The method is assessed in principal coordinates analysis of simulated data sets that have varying number of variables and correlation levels, uniform or patterned correlation structure. The results suggest the method is more reliable than other available methods in recovering the true intrinsic dimensionality. Examples with grassland data illustrate utility.

10.
For independent data, the non-parametric bootstrap is realised by resampling the data with replacement. This approach fails for dependent data such as time series. If the data generating process is at least stationary and mixing, the blockwise bootstrap, which draws subsamples or blocks of the data, saves the concept. For the blockwise bootstrap a blocklength has to be selected. We propose a method for selecting the optimal blocklength. To improve the finite size properties of the blockwise bootstrap, studentised statistics are considered. If the statistic can be represented as a smooth function model, this studentisation can be approximated efficiently. The studentised blockwise bootstrap method is applied for testing hypotheses on medical time series.
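
The blockwise idea described above can be sketched as a moving-block bootstrap. This is a minimal illustration under stated assumptions: overlapping blocks with uniformly drawn start positions are one common variant, and the abstract does not say which variant the authors use. The function name `block_bootstrap`, the toy series, and the block length 5 are hypothetical; the blocklength selection and studentisation proposed in the abstract are not shown.

```python
import random

def block_bootstrap(series, block_length, rng):
    """Resample a series by concatenating randomly chosen contiguous blocks.

    Blocks may overlap in the original series; the result is trimmed to the
    original length. Dependence within each block is preserved.
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    last_start = n - block_length          # last valid block start position
    resampled = []
    while len(resampled) < n:
        start = rng.randint(0, last_start)  # inclusive on both ends
        resampled.extend(series[start:start + block_length])
    return resampled[:n]

rng = random.Random(1)
series = list(range(20))   # toy series whose values equal their time index
replicate = block_bootstrap(series, 5, rng)
print(replicate)
```

Because the toy values equal their time indices, each run of five consecutive entries in the printed replicate is visibly a contiguous stretch of the original series.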

11.
Rydin and Källersjö (2002) found that taxon‐sampling effects were a strongly disturbing factor in a high‐level phylogenetic analysis. I have reanalyzed some of their data to assess whether bootstrap frequencies can be used to predict the stability of clades to taxon sampling, and to compare it with the performance of a stability measure based on taxon resampling (“taxon jackknifing”). High bootstrap frequencies correctly identify a small number of stable clades, but miss many other equally stable clades. When the total error rate is considered, no cut‐off level based on bootstrap frequencies performs better than using all clades in the strict consensus, whereas a slight improvement was observed when cut‐off levels based on taxon jackknifing frequencies are used. © The Willi Hennig Society 2006.

12.
Schoen DJ, Clegg MT. Genetics 1986, 112(4): 927-945
Estimation of mating system parameters in plant populations typically employs family-structured samples of progeny genotypes. These estimation models postulate a mixture of self-fertilization and random outcrossing. One assumption of such models concerns the distribution of pollen genotypes among eggs within single maternal families. Previous applications of the mixed mating model to mating system estimation have assumed that pollen genotypes are sampled randomly from the total population in forming outcrossed progeny within families. In contrast, the one-pollen parent model assumes that outcrossed progeny within a family share a single-pollen parent genotype. Monte Carlo simulations of family-structured sampling were carried out to examine the consequences of violations of the different assumptions of the two models regarding the distribution of pollen genotypes among eggs. When these assumptions are violated, estimates of mating system parameters may be significantly different from their true values and may exhibit distributions which depart from normality. Monte Carlo methods were also used to examine the utility of the bootstrap resampling algorithm for estimating the variances of mating system parameters. The bootstrap method gives variance estimates that approximate empirically determined values. When applied to data from two plant populations which differ in pollen genotype distributions within families, the two estimation procedures exhibit the same behavior as that seen with the simulated data.

13.

Background  

In recent years, gene order data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. If gene orders are viewed as one character with a large number of states, traditional bootstrap procedures cannot be applied. Researchers began to use a jackknife resampling method to assess the quality of gene order phylogenies.

14.
The bootstrap is an important tool for estimating the confidence interval of monophyletic groups within phylogenies. Although bootstrap analyses are used in most evolutionary studies, there is no clear consensus as to how best to interpret bootstrap probability values. To study the bootstrap method further, nine small subunit ribosomal DNA (SSU rDNA) data sets were submitted to bootstrapped maximum parsimony (MP) analyses using unweighted and weighted sequence positions. Analyses of the lengths (i.e., parsimony steps) of the bootstrap trees show that the shape and mean of the bootstrap tree distribution may provide important insights into the evolutionary signal within the sequence data. With complex phylogenies containing nodes defined by short internal branches (multifurcations), the mean of the bootstrap tree distribution may differ by 2 standard deviations from the length of the best tree found from the original data set. Weighting sequence positions significantly increases the bootstrap values at internal nodes. There may, however, be strong bootstrap support for conflicting species groupings among different data sets. This phenomenon appears to result from a correlation between the topology of the tree used to create the weights and the topology of the bootstrap consensus tree inferred from the MP analysis of these weighted data. The analyses also show that characteristics of the bootstrap tree distribution (e.g., skewness) may be used to choose between alternative weighting schemes for phylogenetic analyses.

15.
We propose a method for a posteriori evaluation of classification stability which compares the classification of sites in the original data set (a matrix of species by sites) with classifications of subsets of its sites created by without‐replacement bootstrap resampling. Site assignments to clusters of the original classification and to clusters of the classification of each subset are compared using Goodman‐Kruskal's lambda index. Many resampled subsets are classified and the mean of lambda values calculated for the classifications of these subsets is used as an estimation of classification stability. Furthermore, the mean of the lambda values based on different resampled subsets, calculated for each site of the data set separately, can be used as a measure of the influence of particular sites on classification stability. This method was tested on several artificial data sets classified by commonly used clustering methods and on a real data set of forest vegetation plots. Its strength lies in the ability to distinguish classifications which reflect robust patterns of community differentiation from unstable classifications of more continuous patterns. In addition, it can identify sites within each cluster which have a transitional species composition with respect to other clusters.
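
The stability measure described above can be sketched in a few lines. This is an illustration under several assumptions, not the authors' implementation: the asymmetric form of Goodman-Kruskal's lambda is used (the abstract does not say which form was applied), the subset fraction 0.8 and the number of subsets are arbitrary, and `split_at_mean` is a hypothetical stand-in for a real clustering method operating on a species-by-sites matrix.

```python
import random
from collections import Counter

def goodman_kruskal_lambda(target, predictor):
    """Asymmetric lambda: proportional reduction in error when the target
    labels are predicted from the predictor labels instead of the mode."""
    if len(target) != len(predictor) or not target:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(target)
    modal_overall = max(Counter(target).values())
    if modal_overall == n:
        raise ValueError("lambda is undefined when target has a single class")
    modal_within = {}   # largest target count inside each predictor class
    for (p, _t), count in Counter(zip(predictor, target)).items():
        modal_within[p] = max(modal_within.get(p, 0), count)
    return (sum(modal_within.values()) - modal_overall) / (n - modal_overall)

def classification_stability(sites, classify, fraction, n_subsets, rng):
    """Mean lambda between the original classification and classifications
    of subsets drawn without replacement."""
    original = classify(sites)
    size = round(fraction * len(sites))
    lambdas = []
    for _ in range(n_subsets):
        idx = rng.sample(range(len(sites)), size)
        subset_labels = classify([sites[i] for i in idx])
        lambdas.append(
            goodman_kruskal_lambda([original[i] for i in idx], subset_labels))
    return sum(lambdas) / len(lambdas)

def split_at_mean(values):
    """Hypothetical toy classifier: two clusters split at the mean."""
    mean = sum(values) / len(values)
    return [int(v > mean) for v in values]

# Toy one-dimensional "sites" forming two well-separated groups.
sites = list(range(10)) + list(range(100, 110))
stability = classification_stability(sites, split_at_mean, 0.8, 50,
                                     random.Random(0))
print(stability)
```

With two clearly separated groups every subset is classified the same way as the full data set, so the mean lambda is 1; overlapping groups would be expected to give lower values.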

16.
While there has been strong support for Amborella and Nymphaeales (water lilies) as branching from basal-most nodes in the angiosperm phylogeny, this hypothesis has recently been challenged by phylogenetic analyses of 61 protein-coding genes extracted from the chloroplast genome sequences of Amborella, Nymphaea, and 12 other available land plant chloroplast genomes. These character-rich analyses placed the monocots, represented by three grasses (Poaceae), as sister to all other extant angiosperm lineages. We have extracted protein-coding regions from draft sequences for six additional chloroplast genomes to test whether this surprising result could be an artifact of long-branch attraction due to limited taxon sampling. The added taxa include three monocots (Acorus, Yucca, and Typha), a water lily (Nuphar), a ranunculid (Ranunculus), and a gymnosperm (Ginkgo). Phylogenetic analyses of the expanded DNA and protein data sets together with microstructural characters (indels) provided unambiguous support for Amborella and the Nymphaeales as branching from the basal-most nodes in the angiosperm phylogeny. However, their relative positions proved to be dependent on the method of analysis, with parsimony favoring Amborella as sister to all other angiosperms and maximum likelihood (ML) and neighbor-joining methods favoring an Amborella + Nymphaeales clade as sister. The ML phylogeny supported the latter hypothesis, but the likelihood for the former hypothesis was not significantly different. Parametric bootstrap analysis, single-gene phylogenies, estimated divergence dates, and conflicting indel characters all help to illuminate the nature of the conflict in resolution of the most basal nodes in the angiosperm phylogeny. Molecular dating analyses provided median age estimates of 161 MYA for the most recent common ancestor (MRCA) of all extant angiosperms and 145 MYA for the MRCA of monocots, magnoliids, and eudicots. Whereas long sequences reduce variance in branch lengths and molecular dating estimates, the impact of improved taxon sampling on the rooting of the angiosperm phylogeny together with the results of parametric bootstrap analyses demonstrate how long-branch attraction might mislead genome-scale phylogenetic analyses.

17.
Because antimicrobial resistance in food-producing animals is a major public health concern, many countries have implemented antimicrobial monitoring systems at a national level. When designing a sampling scheme for antimicrobial resistance monitoring, it is necessary to consider both cost effectiveness and statistical plausibility. In this study, we examined how sampling scheme precision and sensitivity can vary with the number of animals sampled from each farm, while keeping the overall sample size constant to avoid additional sampling costs. Five sampling strategies were investigated. These employed 1, 2, 3, 4 or 6 animal samples per farm, with a total of 12 animals sampled in each strategy. A total of 1,500 Escherichia coli isolates from 300 fattening pigs on 30 farms were tested for resistance against 12 antimicrobials. The performance of each sampling strategy was evaluated by bootstrap resampling from the observational data. In the bootstrapping procedure, farms, animals, and isolates were selected randomly with replacement, and a total of 10,000 replications were conducted. For each antimicrobial, we observed that the standard deviation and 2.5–97.5 percentile interval of resistance prevalence were smallest in the sampling strategy that employed 1 animal per farm. The proportion of bootstrap samples that included at least 1 isolate with resistance was also evaluated as an indicator of the sensitivity of the sampling strategy to previously unidentified antimicrobial resistance. The proportion was greatest with 1 sample per farm and decreased with larger samples per farm. We concluded that when the total number of samples is pre-specified, the most precise and sensitive sampling strategy involves collecting 1 sample per farm.
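
The three-level resampling described above (farms, then animals, then isolates, each with replacement) can be sketched as follows. This is an illustration under assumptions, not the study's code: one isolate is drawn per sampled animal (the abstract does not state how many were used), the function names are hypothetical, 2,000 replications are used instead of 10,000 to keep the example quick, and the toy data are fully clustered by farm, which real resistance data would not be.

```python
import random
import statistics

def resampled_prevalence(farms, animals_per_farm, total_animals, rng):
    """One bootstrap replicate of resistance prevalence.

    farms: list of farms, each a list of animals, each a list of 0/1
    isolate results (1 = resistant).
    """
    n_farms = total_animals // animals_per_farm
    results = []
    for _ in range(n_farms):
        farm = rng.choice(farms)                # farm drawn with replacement
        for _ in range(animals_per_farm):
            animal = rng.choice(farm)           # animal drawn with replacement
            results.append(rng.choice(animal))  # assumption: one isolate each
    return sum(results) / len(results)

def prevalence_sd(farms, animals_per_farm, total_animals, replications, rng):
    """Standard deviation of prevalence across bootstrap replicates."""
    draws = [resampled_prevalence(farms, animals_per_farm, total_animals, rng)
             for _ in range(replications)]
    return statistics.pstdev(draws)

# Hypothetical toy data: 30 farms x 10 animals x 5 isolates = 1,500 isolates,
# with resistance present on every isolate of 15 farms and absent elsewhere.
resistant_farm = [[1] * 5 for _ in range(10)]
susceptible_farm = [[0] * 5 for _ in range(10)]
farms = [resistant_farm] * 15 + [susceptible_farm] * 15

rng = random.Random(42)
sd_one_per_farm = prevalence_sd(farms, 1, 12, 2000, rng)
sd_six_per_farm = prevalence_sd(farms, 6, 12, 2000, rng)
print(sd_one_per_farm, sd_six_per_farm)
```

For this extreme toy data set the theoretical standard deviations are 0.5/√12 ≈ 0.14 with one animal per farm and 0.5/√2 ≈ 0.35 with six, which illustrates why spreading a fixed sample over more farms improves precision when resistance clusters within farms.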

18.
Reconciling discordant morphological and molecular phylogenies remains a problem in modern systematics. By examining conflicting DNA-hybridization and morphological phylogenies of sand dollars, I show that morphological criteria may be used to help evaluate the reliability of molecular phylogenies where they differ from morphological trees. All available criteria for assessing the reliability of DNA-hybridization phylogenies suggest that the sand dollar DNA-hybridization phylogeny is robust. Standard homology-recognition criteria are used to assess the a priori reliabilities of the morphological attributes associated with the node drawn into question by the DNA data, and it is shown that these attributes are among the least phylogenetically informative of all the morphological characters. Moreover, the questioned node has the smallest number of supporting characters, and most of these characters are associated with the food grooves, which suggests that they may be functionally correlated. Thus, on the basis of the analysis of the morphological data and given the robustness of the DNA tree, the DNA phylogeny is preferred. Further, paleobiogeographic data support the DNA tree rather than the morphological tree, and a plausible heterochronic mechanism has been proposed that may account for the homoplasious morphological evolution that must have occurred if the DNA tree is correct.

19.
In the clade of Penstemon and segregate genera, pollination syndromes are well defined among the 284 species. Most display combinations of floral characters associated with pollination by Hymenoptera, the ancestral mode of pollination for this clade. Forty-one species present characters associated with hummingbird pollination, although some of these ornithophiles are also visited by insects. The ornithophiles are scattered throughout the traditional taxonomy and across phylogenies estimated from nuclear (internal transcribed spacer (ITS)) and chloroplast DNA (trnCD/TL) sequence data. Here, the number of separate origins of ornithophily is estimated, using bootstrap phylogenies and constrained parsimony searches. Analyses suggest 21 separate origins, with overwhelming support for 10 of these. Because species sampling was incomplete, this is probably an underestimate. Penstemons therefore show great evolutionary lability with respect to acquiring hummingbird pollination; this syndrome acts as an attractor to which species with large sympetalous nectar-rich flowers have frequently been drawn. By contrast, penstemons have not undergone evolutionary shifts backwards or to other pollination syndromes. Thus, they are an example of both striking evolutionary lability and constrained evolution.

20.
Because phylogenies can be estimated without stratigraphic data and because estimated phylogenies also infer gaps in sampling, some workers have used phylogeny estimates as templates for evaluating sampling from the fossil record and for "correcting" historical diversity patterns. However, it is not known how sampling intensity (the probability of sampling taxa per unit time) and completeness (the proportion of taxa sampled) affect the accuracy of phylogenetic inferences, nor how phylogenetically inferred estimates of sampling and diversity respond to inaccurate estimates of phylogeny. Both issues are addressed with a series of simulations using simple models of character evolution, varying speciation patterns, and various rates of speciation, extinction, character change, and preservation. Parsimony estimates of simulated phylogenies become less accurate as sampling decreases, and inaccurate trees chronically underestimate sampling. Biotic factors such as rates of morphologic change and extinction both affect the accuracy of phylogenetic estimates and thus affect estimated gaps in sampling, indicating that differences in implied sampling need not reflect actual differences in sampling. Errors in inferred diversity are concentrated early in the history of a clade. This, coupled with failure to account for true extinction times (i.e., the Signor-Lipps effect), inflates relative diversity levels early in clade histories. Because factors other than differences in sampling predict differences in the numbers of gaps implied by phylogeny estimates, inferred phylogenies can be misleading templates for evaluating sampling or historical diversity patterns.

