Similar Articles
20 similar articles found.
1.
Null hypothesis significance testing has been under attack in recent years, partly owing to the arbitrary nature of setting α (the decision-making threshold and probability of Type I error) at a constant value, usually 0.05. If the goal of null hypothesis testing is to present conclusions in which we have the highest possible confidence, then the only logical decision-making threshold is the value that minimizes the probability (or occasionally, cost) of making errors. Setting α to minimize the combination of Type I and Type II error at a critical effect size can easily be accomplished for traditional statistical tests by calculating the α associated with the minimum average of α and β at the critical effect size. This technique also has the flexibility to incorporate prior probabilities of null and alternative hypotheses and/or relative costs of Type I and Type II errors, if known. Using an optimal α results in stronger scientific inferences because it estimates and minimizes both Type I errors and relevant Type II errors for a test. It also results in greater transparency concerning assumptions about relevant effect size(s) and the relative costs of Type I and II errors. By contrast, the use of α = 0.05 results in arbitrary decisions about which effect sizes will likely be considered significant, if real, and in arbitrary amounts of Type II error for meaningful potential effect sizes. We cannot identify a rationale for continuing to arbitrarily use α = 0.05 for null hypothesis significance tests in any field when it is possible to determine an optimal α.
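The optimization this abstract describes is straightforward to carry out numerically. The sketch below finds the α minimizing the average of α and β for a two-sided two-sample t-test; the critical effect size (d = 0.5), group size (n = 30), and equal weighting of the two error types are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy import stats, optimize

def avg_error(alpha, d=0.5, n=30):
    """Average of alpha and beta for a two-sided two-sample t-test
    at the critical effect size d, with n observations per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2.0)              # noncentrality under the alternative
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # beta = P(|T| < t_crit) when T follows a noncentral t distribution
    beta = stats.nct.cdf(t_crit, df, ncp) - stats.nct.cdf(-t_crit, df, ncp)
    return (alpha + beta) / 2.0

res = optimize.minimize_scalar(avg_error, bounds=(1e-4, 0.5), method="bounded")
print(f"optimal alpha = {res.x:.4f}, minimized average error = {res.fun:.4f}")
```

If prior probabilities or unequal error costs were known, they would simply replace the unweighted average with a weighted one.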

2.
A. P. Grieve, Biometrics, 1985, 41(4):979-990
Statisticians have been critical of the use of two-period crossover designs for clinical trials because the estimate of the treatment difference is biased when the carryover effects of the two treatments are not equal. In the standard approach, if the null hypothesis of equal carryover effects is not rejected, data from both periods are used to estimate and test for treatment differences; if the null hypothesis is rejected, data from the first period alone are used. A Bayesian analysis based on the Bayes factor against unequal carryover effects is given. Although this Bayesian approach avoids the "all-or-nothing" decision inherent in the standard approach, it recognizes that with small trials it is difficult to provide unequivocal evidence that the carryover effects of the two treatments are equal, and thus that the interpretation of the difference between treatment effects is highly dependent on a subjective assessment of whether the carryover effects really are equal.

3.
This paper presents a look at the underused procedure of testing for Type II errors when "negative" results are encountered during research. It recommends setting a statistical alternative hypothesis based on anthropologically derived information and then calculating the probability of committing this type of error. In this manner, the process is similar to that used for testing Type I errors, which is clarified by examples from the literature. It is hoped that researchers will use the information presented here as a means of attaching levels of probability to the acceptance of null hypotheses.
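As a concrete illustration of attaching a probability to the acceptance of a null hypothesis, the snippet below computes β against a substantively motivated alternative for an independent-samples t-test; the effect size and sample sizes are invented for the example.

```python
# Probability of a Type II error for a two-sample t-test, evaluated against a
# specific alternative (standardized effect size 0.4, 25 subjects per group).
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.4, nobs1=25, alpha=0.05, ratio=1.0)
print(f"power = {power:.3f}; Type II error probability beta = {1 - power:.3f}")
```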

4.
The permutation test is a popular technique for testing a hypothesis of no effect when the distribution of the test statistic is unknown. To test the equality of two means, a permutation test might use a test statistic which is the difference of the two sample means in the univariate case. In the multivariate case, it might use a test statistic which is the maximum of the univariate test statistics. A permutation test then estimates the null distribution of the test statistic by permuting the observations between the two samples. We will show that, for such tests, if the two distributions are not identical (as, for example, when they have unequal variances, correlations, or skewness), then a permutation test for equality of means based on the difference of sample means can have an inflated Type I error rate even when the means are equal. Our results illustrate that permutation testing should be confined to testing for non-identical distributions. CONTACT: calian@raunvis.hi.is
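A minimal sketch of such a test, and of the inflation the abstract warns about: group labels are permuted to build the null distribution of the difference in means, and the data are simulated with equal means but unequal variances and unequal sample sizes. The data-generating values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test_diff_means(x, y, n_perm=500):
    """Two-sided permutation p-value for the difference of sample means."""
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)

# Equal means, very unequal variances and unequal n: the rejection rate of the
# nominal 5% test can noticeably exceed 0.05.
n_reject = sum(
    perm_test_diff_means(rng.normal(0, 5, 10), rng.normal(0, 1, 50)) < 0.05
    for _ in range(200)
)
print(f"empirical Type I error rate: {n_reject / 200:.3f}")
```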

5.
Likelihood methods for detecting temporal shifts in diversification rates
Maximum likelihood is a potentially powerful approach for investigating the tempo of diversification using molecular phylogenetic data. Likelihood methods distinguish between rate-constant and rate-variable models of diversification by fitting birth-death models to phylogenetic data. Because model selection in this context is a test of the null hypothesis that diversification rates have been constant over time, strategies for selecting best-fit models must minimize Type I error rates while retaining power to detect rate variation when it is present. Here I examine model selection, parameter estimation, and power to reject the null hypothesis using likelihood models based on the birth-death process. The Akaike information criterion (AIC) has often been used to select among diversification models; however, I find that selecting models based on the lowest AIC score leads to a dramatic inflation of the Type I error rate. When appropriately corrected to reduce Type I error rates, the birth-death likelihood approach performs as well as or better than the widely used gamma statistic, at least when diversification rates have shifted abruptly over time. Analyses of datasets simulated under a range of rate-variable diversification scenarios indicate that the birth-death likelihood method has much greater power to detect variation in diversification rates when extinction is present. Furthermore, this method appears to be the only approach available that can distinguish between a temporal increase in diversification rates and a rate-constant model with nonzero extinction. I illustrate use of the method by analyzing a published phylogeny for Australian agamid lizards.

6.
A noninferiority (NI) trial is sometimes employed to show efficacy of a new treatment when it is unethical to randomize current patients to placebo because of the established efficacy of a standard treatment. Under this framework, if the NI trial determines that the treatment advantage of the standard over the new drug (i.e., S-N) is less than the historic advantage of the standard over placebo (S-P), then the efficacy of the new treatment (N-P) is established indirectly. We explicitly combine information from the NI trial with estimates from a random-effects model, allowing study-to-study variability in k historic trials. Existing methods under random effects, such as the synthesis method, fail to account for the variability of the true standard-versus-placebo effect in the NI trial. Our method effectively uses a prediction interval for the missing standard-versus-placebo effect rather than a confidence interval for the mean. The consequences are to increase the variance of the synthesis method by incorporating a prediction variance term and to approximate the null distribution of the new statistic with a t distribution with k-1 degrees of freedom instead of the standard normal. Thus, it is harder to conclude efficacy of the new treatment relative to (predicted) placebo compared with traditional methods, especially when k is small or when between-study variability is large. When the between-study variances are nonzero, we demonstrate substantial Type I error rate inflation with conventional approaches; simulations suggest that the new procedure has only modest inflation, and it is very conservative when between-study variances are zero. An example is used to illustrate practical issues.
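The following sketch illustrates the prediction-interval construction the abstract describes, under stated assumptions: invented historic standard-versus-placebo estimates, a DerSimonian-Laird random-effects fit, and a one-sided test referred to a t distribution with k-1 degrees of freedom.

```python
import numpy as np
from scipy import stats

s_p = np.array([0.40, 0.55, 0.35, 0.60, 0.45])    # historic S-P effects (k trials)
v_p = np.array([0.010, 0.015, 0.012, 0.020, 0.011])
s_n, v_n = 0.10, 0.012                            # NI trial: S-N estimate and variance

# DerSimonian-Laird between-study variance tau^2
w = 1.0 / v_p
mu_fixed = np.sum(w * s_p) / np.sum(w)
q = np.sum(w * (s_p - mu_fixed) ** 2)
k = len(s_p)
tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_star = 1.0 / (v_p + tau2)
mu_hat = np.sum(w_star * s_p) / np.sum(w_star)
var_mu = 1.0 / np.sum(w_star)

# Predicted new-vs-placebo effect N-P = (S-P) - (S-N); the extra tau2 term is
# the prediction variance: the NI trial's true S-P effect is itself a new draw.
est = mu_hat - s_n
se = np.sqrt(var_mu + tau2 + v_n)
t_stat = est / se
p = 1 - stats.t.cdf(t_stat, df=k - 1)             # one-sided test of N-P <= 0
print(f"t = {t_stat:.2f}, one-sided p = {p:.4f} (t with {k - 1} df)")
```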

7.
The most widely used method for evaluating the mode of inheritance of pesticide resistance is based on bioassays of individuals from a backcross between F1 (hybrid of resistant and susceptible strains) and parental resistant or susceptible strains. Monte Carlo simulations of the standard backcross method showed that the probability of incorrectly rejecting the null hypothesis of monogenic inheritance (Type I error) was generally more than double the conventional value of P = 0.05. Conversely, the null hypothesis of monogenic inheritance was likely to be accepted in a relatively large proportion of cases in which resistance is controlled by two or more loci. Expected differences in mortality of backcross offspring between monogenic and additive polygenic models approached zero as dose approached extremely low values, extremely high values, and the LD50 of the backcross generation. Thus, the effectiveness of the backcross method depended strongly on dose. The power of the standard backcross method to correctly reject the null hypothesis of monogenic inheritance increased as number of loci, slope of parental dose-mortality lines, magnitude of resistance, and sample size increased. Guidelines for improving the design and interpretation of backcross experiments are presented.

8.
In clinical trials comparing two treatments, it seems reasonable to stop the study early if either one treatment has proved markedly superior in the main effect, or one severely inferior with respect to an adverse side effect. Two-stage sampling plans are considered for simultaneously testing a main and a side effect, assumed to follow a bivariate normal distribution with known variances but unknown correlation. The test procedure maintains the global significance level under the null hypothesis of no differences in main and side effects. The critical values are chosen under the side condition that the probability of ending at the first or second stage with a rejection of the elementary null hypothesis for the main effect is controlled when a particular constellation of differences in means holds; analogously, the probability of ending with a rejection of the null hypothesis for the side effect, given certain treatment differences, is also controlled. Plans "optimal" with respect to sample size are given.

9.
Observed variations in rates of taxonomic diversification have been attributed to a range of factors, including biological innovations, ecosystem restructuring, and environmental changes. Before inferring causality of any particular factor, however, it is critical to demonstrate that the observed variation in diversity is significantly greater than that expected from natural stochastic processes. Relative tests that assess whether observed asymmetry in species richness between sister taxa in monophyletic pairs is greater than would be expected under a symmetric model have been used widely in studies of rate heterogeneity and are particularly useful for groups in which paleontological data are problematic. Although one such test, introduced by Slowinski and Guyer a decade ago, has been applied to a wide range of clades and evolutionary questions, the statistical behavior of the test has not been examined extensively, particularly when used with Fisher's procedure for combining probabilities to analyze data from multiple independent taxon pairs. Here, certain pragmatic difficulties with the Slowinski-Guyer test are described, further details of the development of a recently introduced likelihood-based relative rates test are presented, and standard simulation procedures are used to assess the behavior of the two tests in a range of situations to determine: (1) the accuracy of the tests' nominal Type I error rate; (2) the statistical power of the tests; (3) the sensitivity of the tests to inclusion of taxon pairs with few species; (4) the behavior of the tests with datasets comprised of few taxon pairs; and (5) the sensitivity of the tests to certain violations of the null model assumptions. Our results indicate that in most biologically plausible scenarios the likelihood-based test has superior statistical properties in terms of both Type I error rate and power, and we found no scenario in which the Slowinski-Guyer test was distinctly superior, although the degree of the discrepancy varies among the different scenarios. The Slowinski-Guyer test tends to be much more conservative (i.e., very disinclined to reject the null hypothesis) in datasets with many small pairs. In most situations, the performance of both the likelihood-based test and particularly the Slowinski-Guyer test improves when pairs with few species are excluded from the computation, although this is balanced against a decline in the tests' power and accuracy as fewer pairs are included in the dataset. The performance of both tests is quite poor when they are applied to datasets in which the taxon sizes do not conform to the distribution implied by the usual null model. Thus, results of analyses of taxonomic rate heterogeneity using the Slowinski-Guyer test can be misleading, because the test's Type I error rate (rejecting the null hypothesis of equal rates when it is true) is often inaccurate and its power (rejecting the null when the alternative of unequal rates is true) is poor, particularly when small taxon pairs are included. Although not always perfect, the likelihood-based test provides a more accurate and powerful alternative as a relative rates test.
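For orientation, the sketch below implements the basic Slowinski-Guyer calculation with Fisher's combining step. Under the equal-rates (Yule) null, every split of n species into sister clades of sizes (i, n-i) is equally likely, which yields the per-pair p-value used here; the species counts are invented for illustration.

```python
import numpy as np
from scipy import stats

def sg_pvalue(n_large, n_small):
    """P of a split at least as asymmetric as observed, under the Yule null."""
    n = n_large + n_small
    # P(smaller clade <= n_small) = 2*n_small/(n-1), capped at 1 for even splits
    return min(1.0, 2.0 * n_small / (n - 1))

pairs = [(45, 3), (12, 10), (30, 2), (8, 7)]     # (larger, smaller) species counts
pvals = [sg_pvalue(a, b) for a, b in pairs]

# Fisher's method: -2 * sum(ln p) ~ chi-square with 2k df under the null
fisher_stat = -2.0 * np.sum(np.log(pvals))
p_combined = stats.chi2.sf(fisher_stat, df=2 * len(pvals))
print(f"per-pair p-values: {np.round(pvals, 3)}")
print(f"Fisher chi-square = {fisher_stat:.2f}, combined p = {p_combined:.4f}")
```

The abstract's warning applies directly to this construction: pairs with few species contribute p-values near 1, which makes the combined test conservative.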

10.
The problem of whether the hominid fossil sample of habiline specimens comprises more than one species has received much attention in paleoanthropology. The core of this debate has significant implications for when and how variation must be explained by taxonomy. In this paper, we examine the problem of whether the observed variation in the habiline sample must be interpreted to reflect species differences. We test the null hypothesis of no difference by examining the degree of variability in the habiline sample in comparison with other single-species early hominid fossil samples from Sterkfontein and Swartkrans (Sterkfontein is earlier than the habiline sample; Swartkrans may be within the habiline time span). We use the standard error test for this analysis, a sampling statistic based on the standard error of the slope of regressions between pairs of specimens that relate all of the homologous measurements each pair shares. We show that the null hypothesis for the habiline sample cannot be rejected. The similarities of specimen pairs within the habiline sample are no greater than those observed between the specimens in the two australopithecine samples we analyzed.

11.
New Zealand Journal of Ecology, 2011, 23(2):139-147
To establish whether poisoning programs affect non-target density, the null hypothesis that density does not decline on poisoned sites needs to be tested. However, where no statistically significant reduction in density is found, there is some probability that a biologically significant reduction has been overlooked. The probability that such an error has occurred (a Type II error) depends on the effect poisoning has on non-target density, the precision with which the reduction is assessed, and the number of poisoning operations sampled. Prospective power analysis can identify minimum sample sizes that reduce the probability of a Type II error to acceptable levels. Equivalence tests require a priori identification of the minimum change in non-target density that can be safely overlooked and the acceptable probability of doing so. As such, they explicitly link the statistical and biological significance of non-target poisoning assessments. We illustrate these principles using an experimental assessment of the effect rabbit poisoning has on the density of large kangaroo populations in Australia. A rule-of-thumb guide was used to estimate appropriate levels of power (0.85) and reductions in kangaroo density (r = -0.12) for the assessment, and a pilot study was conducted to estimate the between-sample standard deviation for estimates of change in kangaroo density (s = 0.089). Prospective power analysis based on these estimates indicated that 6 poisoning programs would provide a robust assessment of the effect of poisoning on kangaroos. However, because the between-sample standard deviation was underestimated, a subsequent assessment based on 6 samples had insufficient power to usefully estimate the effect poisoning had on kangaroos. Retrospective power analysis indicated that at 0.85 power, reductions in kangaroo density as high as r = -2.2 may have been overlooked. Using the between-sample standard deviation from this assessment, changes in kangaroo density would have to be estimated for 17-19 poisoning programs if a subsequent experiment was to achieve a biologically as well as statistically robust result.
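The prospective calculation reported above can be reproduced approximately as follows; a one-sample, one-sided t-test on the estimated density changes is an assumption made here for illustration.

```python
# Approximate sample-size calculation: detect a density reduction of r = -0.12
# with power 0.85, given a between-sample standard deviation of s = 0.089
# (the values from the pilot study described above).
from statsmodels.stats.power import TTestPower

effect = 0.12 / 0.089    # standardized effect size at the minimum change of interest
n = TTestPower().solve_power(effect_size=effect, alpha=0.05, power=0.85,
                             alternative="larger")
print(f"required number of poisoning operations: {n:.1f} (round up)")
```

Rounding up gives the 6 operations the abstract reports, under these assumptions.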

12.
Interim analyses in clinical trials are planned for ethical as well as economic reasons. General results have been published in the literature that allow the use of standard group sequential methodology if one uses an efficient test statistic, e.g., when Wald-type statistics are used in random-effects models for ordinal longitudinal data. These models often assume that the random effects are normally distributed. However, this is not always the case. We will show that, when the random-effects distribution is misspecified in ordinal regression models, the joint distribution of the test statistics over the different interim analyses is still multivariate normal, but a sandwich-type correction is needed to obtain the correct covariance matrix. The independent increment structure is also investigated. A bias in estimation will occur due to the misspecification. However, we will also show that the treatment effect estimate will be unbiased under the null hypothesis, thus maintaining the Type I error rate. Extensive simulations based on a toenail dermatophyte onychomycosis trial are used to illustrate our results.

13.
One of the first things one learns in a basic psychology or statistics course is that you cannot prove the null hypothesis that there is no difference between two conditions, such as a patient group and a normal control group. This remains true. However, thanks to ongoing progress by a special group of devoted methodologists, even when the result of an inferential test is p > .05, it is now possible to rigorously and quantitatively conclude (a) that the null hypothesis is actually unlikely, and (b) that the alternative hypothesis of an actual difference between treatment and control is more probable than the null. Alternatively, it is also possible to conclude quantitatively that the null hypothesis is much more likely than the alternative. Without Bayesian statistics, we could not say anything if a simple inferential analysis like a t-test yielded p > .05. The present, mostly non-quantitative article describes free resources and illustrative procedures for doing Bayesian analysis, with t-test and ANOVA examples.
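One widely used default-prior version of such a Bayesian t-test is the JZS Bayes factor of Rouder et al. (2009); the sketch below computes it by numerical integration for an independent-samples design. This is a common default choice, not necessarily the exact procedure or software the article describes, and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats, integrate

def jzs_bf10(t, n1, n2):
    """JZS Bayes factor BF10 for an independent-samples t statistic."""
    nu = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)
    def integrand(g):
        # marginal likelihood under H1, integrating over the g-prior scale
        return ((1 + n_eff * g) ** -0.5
                * (1 + t**2 / ((1 + n_eff * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))
    marginal_h1, _ = integrate.quad(integrand, 0, np.inf)
    marginal_h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    return marginal_h1 / marginal_h0

rng = np.random.default_rng(1)
x, y = rng.normal(0, 1, 40), rng.normal(0, 1, 40)   # no true difference
t, p = stats.ttest_ind(x, y)
bf10 = jzs_bf10(t, len(x), len(y))
print(f"p = {p:.3f}, BF10 = {bf10:.3f}, BF01 = {1 / bf10:.2f} (evidence FOR the null)")
```

With null data, BF01 is typically well above 1: unlike p > .05 alone, it quantifies positive support for the null.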

14.
15.
Joshua Ladau, Sadie J. Ryan, Oikos, 2010, 119(7):1064-1069
Null model tests of presence–absence data ('NMTPAs') provide important tools for inferring effects of competition, facilitation, habitat filtering, and other ecological processes from observational data. Many NMTPAs have been developed, but they often yield conflicting conclusions when applied to the same data. Type I and II error rates, size, power, robustness, and bias provide important criteria for assessing which tests are valid, but these criteria need to be evaluated contingent on the sample size, the null hypothesis of interest, and the assumptions that are appropriate for the data set being analyzed. In this paper, we confirm that this is the case using the software MPower, evaluating the validity of NMTPAs contingent on the null hypothesis being tested, the assumptions that can be made, and the sample size. Evaluating the validity of NMTPAs contingent on these factors is important for ensuring that reliable inferences are drawn from observational data about the processes controlling community assembly.
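For readers unfamiliar with NMTPAs, the sketch below shows one simple member of the family: the checkerboard score (C-score) compared against matrices randomized under a null model that fixes species (row) totals while shuffling occurrences across sites. The statistic/null combination and the toy matrix are illustrative choices, not the tests evaluated with MPower.

```python
import numpy as np

rng = np.random.default_rng(2)

def c_score(m):
    """Mean number of checkerboard units over all species pairs."""
    r = m.shape[0]
    total, pairs = 0.0, 0
    for i in range(r):
        for j in range(i + 1, r):
            shared = np.sum(m[i] & m[j])
            total += (m[i].sum() - shared) * (m[j].sum() - shared)
            pairs += 1
    return total / pairs

obs = rng.integers(0, 2, size=(10, 15))          # toy species-by-site matrix
observed = c_score(obs)
null = [c_score(np.array([rng.permutation(row) for row in obs]))
        for _ in range(1000)]
p = (np.sum(np.array(null) >= observed) + 1) / 1001
print(f"observed C-score = {observed:.2f}, one-sided p = {p:.3f}")
```

The paper's point is precisely that whether such a test is valid depends on this choice of null, the assumptions it encodes, and the matrix size.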

16.
Switching between testing for superiority and non-inferiority has been an important statistical issue in the design and analysis of active-controlled clinical trials. In practice, it is often conducted with a two-stage testing procedure. It has been assumed that no Type I error rate adjustment is required when either switching to test for non-inferiority once the data fail to support the superiority claim, or switching to test for superiority once the null hypothesis of non-inferiority is rejected with a pre-specified non-inferiority margin in a generalized historical control approach. However, when using a cross-trial comparison approach for non-inferiority testing, controlling the Type I error rate sometimes becomes an issue with the conventional two-stage procedure. We propose to adopt the single-stage simultaneous testing concept of Ng (2003) to test both the non-inferiority and superiority hypotheses simultaneously. The proposed procedure is based on Fieller's confidence interval procedure as proposed by Hauschke et al. (1999).

17.
Dental variation has been used commonly to assess taxonomic composition in morphologically homogeneous fossil samples. While the coefficient of variation (CV) has been used traditionally, range-based measures of variation, such as the range as a percentage of the mean (R%) and the maximum/minimum index (Imax/min), have recently become popular alternatives. The current study compares the performance of these statistics when applied to single- and pooled-species dental samples of extant Cercopithecus species. A common methodology for such problems of species discrimination has been simply to compare the maximum value of a variation statistic observed in extant samples with that observed in the fossil sample. However, regardless of what statistic is used, this approach has an unknowable Type I error rate and usually has low power to detect multiple species. A more appropriate method involves a formal hypothesis test. The null hypothesis is that the level of variation in the fossil sample does not exceed what might be expected in a sample drawn randomly from a reference population, taking into account sampling error and the size of the fossil sample. Previous research using this method with the CV has indicated that it offers considerable power at an acceptable Type I error rate. In the current study, the data of primary interest were posterior dental dimensions for single- and pooled-species samples from extant Cercopithecus species. In addition, the study also investigated the relative performance of variation statistics when applied to highly dimorphic canine dimensions, since much recent work has employed sexually dimorphic dental dimensions for assessing single-species hypotheses. The results indicate that the CV consistently outperformed the range-based statistics when using posterior dental dimensions to test a single-species hypothesis. Regardless of which statistic was used, tests on sexually dimorphic dimensions offered minimal power. In consideration of these results and the problem of studywise Type I error rates, we recommend against the use of multiple measures of variation to test for multiple-species composition, and advocate the CV as the statistic of choice when using the method of Cope & Lacy (1992). For similar reasons, we argue for careful selection of dental variables for inclusion in such analyses, and in particular recommend against including sexually dimorphic dimensions when testing for multiple-species composition.
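The formal hypothesis test described above is easy to implement by resampling: draw many samples of the fossil sample's size from the extant reference data, compute the CV of each, and locate the fossil CV in that null distribution. The simulated measurements below are stand-ins for real dental data.

```python
import numpy as np

rng = np.random.default_rng(3)

def cv(x):
    """Coefficient of variation: sample SD divided by the mean."""
    return x.std(ddof=1) / x.mean()

reference = rng.normal(10.0, 0.8, 200)   # extant single-species tooth dimension
fossil = rng.normal(10.0, 1.3, 8)        # small fossil sample to be tested

null_cvs = np.array([cv(rng.choice(reference, size=len(fossil), replace=True))
                     for _ in range(10000)])
p = (np.sum(null_cvs >= cv(fossil)) + 1) / (len(null_cvs) + 1)
print(f"fossil CV = {cv(fossil):.3f}, p = {p:.4f} "
      f"(small p rejects the single-species null)")
```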

18.
A. G. DiRienzo, Biometrics, 2003, 59(3):497-504
When testing the null hypothesis that treatment arm-specific survival-time distributions are equal, the log-rank test is asymptotically valid when the distribution of time to censoring is conditionally independent of randomized treatment group given survival time. We introduce a test of the null hypothesis for use when the distribution of time to censoring depends on treatment group and survival time. This test does not make any assumptions regarding independence of censoring time and survival time. Asymptotic validity of this test only requires a consistent estimate of the conditional probability that the survival event is observed, given both treatment group and that the survival event occurred before the time of analysis. However, by not making unverifiable assumptions about the data-generating mechanism, there exists a set of possible values of corresponding sample-mean estimates of these probabilities that are consistent with the observed data. Over this subset of the unit square, the proposed test can be calculated and a rejection region identified. A decision on the null hypothesis that accounts for uncertainty due to censoring that may depend on treatment group and survival time can then be made directly. We also present a generalized log-rank test that enables us to provide conditions under which the ordinary log-rank test is asymptotically valid. This generalized test can also be used for testing the null hypothesis when the distribution of censoring depends on treatment group and survival time. However, use of this test requires semiparametric modeling assumptions. A simulation study and an example using a recent AIDS clinical trial are provided.

19.
Hardy-Weinberg equilibrium diagnostics
We propose two diagnostics for the statistical assessment of Hardy-Weinberg equilibrium. One diagnostic is the posterior probability of the complement of the smallest highest posterior density credible region that includes points in the parameter space consistent with the hypothesis of equilibrium. The null hypothesis of equilibrium is to be rejected if this probability is less than a pre-selected critical level. The second diagnostic is the proportion of the parameter space occupied by the highest posterior density credible region associated with the critical level. These Bayesian diagnostics can be interpreted as analogues of the classical Type I and Type II error probabilities. They are broadly applicable: they can be computed for any hypothesis test, using samples of any size generated according to any distribution.

20.
In many applications of generalized linear mixed models to multilevel data, it is of interest to test whether a random effects variance component is zero. It is well known that the usual asymptotic chi-square distribution of the likelihood ratio and score statistics under the null does not necessarily hold. In this note we propose a permutation test, based on randomly permuting the indices associated with a given level of the model, that has the correct Type I error rate under the null. Results from a simulation study suggest that it is more powerful than tests based on mixtures of chi-square distributions. The proposed test is illustrated using data on the familial aggregation of sleep disturbance.
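A hedged sketch of the permutation idea, simplified to a linear rather than generalized linear mixed setting: permuting the cluster indices breaks any within-cluster correlation, giving a null distribution for a statistic sensitive to the random-effect variance. For brevity the statistic is the variance of the cluster means rather than the likelihood ratio or score statistic used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

n_clusters, per_cluster = 20, 10
cluster = np.repeat(np.arange(n_clusters), per_cluster)
u = rng.normal(0, 0.5, n_clusters)               # true random effects (sd 0.5)
y = u[cluster] + rng.normal(0, 1, n_clusters * per_cluster)

def between_cluster_var(y, cluster):
    """Variance of the cluster means; grows with the random-effect variance."""
    means = np.array([y[cluster == c].mean() for c in np.unique(cluster)])
    return means.var(ddof=1)

observed = between_cluster_var(y, cluster)
# Permuting the cluster labels enforces the null of a zero variance component.
null = [between_cluster_var(y, rng.permutation(cluster)) for _ in range(2000)]
p = (np.sum(np.array(null) >= observed) + 1) / 2001
print(f"observed between-cluster variance = {observed:.3f}, p = {p:.4f}")
```

Because the null distribution is built from the data themselves, the test keeps its Type I error rate without relying on the chi-square mixture asymptotics the abstract cautions against.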
