Similar Documents
20 similar documents found.
1.
McGuire G, Prentice MJ, Wright F. Biometrics, 1999, 55(4):1064-1070
The genetic distance between two DNA sequences may be measured by the average number of nucleotide substitutions per position that have occurred since the two sequences diverged from a common ancestor. Estimates of this quantity can be derived from Markov models for the substitution process, while the variances are estimated using the delta method and confidence intervals calculated assuming normality. However, when the sampling distribution of the estimator deviates from normality, such intervals will not be accurate. For simple one-parameter models of nucleotide substitution, we propose a transformation of normal confidence intervals, which yields an almost exact approximation to the true confidence intervals of the distance estimators. To calculate confidence intervals for more complicated models, we propose the saddlepoint approximation. A simulation study shows that the saddlepoint-derived confidence intervals are a real improvement over existing methods.
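For reference, a minimal sketch (in Python, with an illustrative function name and example counts) of the standard one-parameter Jukes-Cantor distance and its delta-method normal interval, the baseline construction whose accuracy the paper examines and improves upon:

```python
import math

def jc69_distance_ci(n_diff, n_sites, z=1.96):
    """Jukes-Cantor (1969) distance with a delta-method normal confidence
    interval -- the conventional construction the paper aims to improve."""
    p = n_diff / n_sites                        # observed proportion of differing sites
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0:
        raise ValueError("proportion of differences too large for the JC69 correction")
    d = -0.75 * math.log(arg)                   # distance estimate
    var = p * (1.0 - p) / (n_sites * arg ** 2)  # delta-method variance
    se = math.sqrt(var)
    return d, (d - z * se, d + z * se)

# Example: 30 differing sites out of 500 compared.
print(jc69_distance_ci(30, 500))
```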

2.
When the number of nucleotides examined is relatively small, the estimators of nucleotide substitutions between DNA sequences often introduce systematic error even if the data used fit the mathematical model underlying the estimation formula. The systematic error of this kind is especially large for models that allow variation in substitution rate among different sites. In the present paper we present a number of formulas that produce virtually bias-free estimates of evolutionary distances for these models.
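As background for the kind of estimator being corrected, the conventional Jukes-Cantor-type distance allowing gamma-distributed rate variation among sites (Jin and Nei 1990) is sketched below; the bias-free formulas of the paper itself are not reproduced here, and the example values are illustrative.

```python
def jc_gamma_distance(p, a):
    """Jukes-Cantor-type distance under gamma-distributed rate variation among
    sites with shape parameter a (Jin and Nei 1990):
    d = (3/4) * a * [(1 - 4p/3)^(-1/a) - 1].
    This is the conventional estimator; the paper derives corrections for its
    systematic error when few nucleotides are examined."""
    return 0.75 * a * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / a) - 1.0)

# Example: 20% observed differences, gamma shape parameter 0.5.
print(jc_gamma_distance(0.20, 0.5))
```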

3.
Two methods are commonly employed for evaluating the extent of the uncertainty of evolutionary distances between sequences: either some estimator of the variance of the distance estimator, or the bootstrap method. However, both approaches can be misleading, particularly when the evolutionary distance is small. We propose using another statistical method which does not have the same defect: interval estimation. We show how confidence intervals may be constructed for the Jukes and Cantor (1969) and Kimura two-parameter (1980) estimators. We compare the exact confidence intervals thus obtained with the approximate intervals derived by the two previous methods, using artificial and biological data. The results show that the usual methods clearly underestimate the variability when the substitution rate is low and when sequences are short. Moreover, our analysis suggests that similar results may be expected for other evolutionary distance estimators.
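To make the comparison concrete, here is a minimal sketch of the Kimura (1980) two-parameter distance with a percentile bootstrap interval obtained by resampling alignment sites, one of the approaches the paper argues can understate uncertainty for short or closely related sequences. It assumes moderate divergence so the logarithm arguments stay positive; the sequences and function names are illustrative, and this is not the exact-interval construction proposed in the paper.

```python
import numpy as np

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura (1980) two-parameter distance between two aligned sequences."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    ts = sum(1 for a, b in pairs
             if a != b and (a in PURINES) == (b in PURINES))  # transitions
    tv = sum(1 for a, b in pairs
             if a != b and (a in PURINES) != (b in PURINES))  # transversions
    P, Q = ts / n, tv / n
    return -0.5 * np.log(1 - 2 * P - Q) - 0.25 * np.log(1 - 2 * Q)

def bootstrap_percentile_ci(seq1, seq2, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling alignment sites."""
    rng = np.random.default_rng(seed)
    s1, s2 = np.array(list(seq1)), np.array(list(seq2))
    n = len(s1)
    dists = [k2p_distance(s1[idx], s2[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.quantile(dists, [alpha / 2, 1 - alpha / 2])

# Illustrative pair of short aligned sequences.
s1 = "ATGACCGTTAGCCTAGGATCCATTGGCAATTCGAAGTCAT"
s2 = "ATGACTGTCAGCCTAGGATCTATTGGTAACTCGAAGTCGT"
print(k2p_distance(s1, s2), bootstrap_percentile_ci(s1, s2))
```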

4.
We consider models of nucleotidic substitution processes where the rate of substitution at a given site depends on the state of the neighbours of the site. We first estimate the time elapsed between an ancestral sequence at stationarity and a present sequence. Second, assuming that two sequences are issued from a common ancestral sequence at stationarity, we estimate the time since divergence. In the simplest non-trivial case of a Jukes-Cantor model with CpG influence, we provide and justify mathematically consistent estimators in these two settings. We also provide asymptotic confidence intervals, valid for nucleotidic sequences of finite length, and we compute explicit formulas for the estimators and for their confidence intervals. In the general case of an RN model with YpR influence, we extend these results under a proviso, namely that the equation defining the estimator has a unique solution.

5.
Using real sequence data, we evaluate the adequacy of assumptions made in evolutionary models of nucleotide substitution and the effects that these assumptions have on estimation of evolutionary trees. Two aspects of the assumptions are evaluated. The first concerns the pattern of nucleotide substitution, including equilibrium base frequencies and the transition/transversion-rate ratio. The second concerns the variation of substitution rates over sites. The maximum-likelihood estimate of tree topology appears quite robust to both these aspects of the assumptions of the models, but evaluation of the reliability of the estimated tree by using simpler, less realistic models can be misleading. Branch lengths are underestimated when simpler models of substitution are used, but the underestimation caused by ignoring rate variation over nucleotide sites is much more serious. The goodness of fit of a model is reduced by ignoring spatial rate variation, but unrealistic assumptions about the pattern of nucleotide substitution can lead to an extraordinary reduction in the likelihood. It seems that evolutionary biologists can obtain accurate estimates of certain evolutionary parameters even with an incorrect phylogeny, while systematists cannot get the right tree with confidence even when a realistic, and more complex, model of evolution is assumed.

6.
When the sample size is not large or when the underlying disease is rare, to assure collection of an appropriate number of cases and to control the relative error of estimation, one may employ inverse sampling, in which one continues sampling subjects until one obtains exactly the desired number of cases. This paper focuses discussion on interval estimation of the simple difference between two proportions under independent inverse sampling. This paper develops three asymptotic interval estimators on the basis of the maximum likelihood estimator (MLE), the uniformly minimum variance unbiased estimator (UMVUE), and the asymptotic likelihood ratio test (ALRT). To compare the performance of these three estimators, this paper calculates the coverage probability and the expected length of the resulting confidence intervals on the basis of the exact distribution. This paper finds that when the underlying proportions of cases in both comparison populations are small or moderate (≤0.20), all three asymptotic interval estimators developed here perform reasonably well even for a pre-determined number of cases as small as 5. When the pre-determined number of cases is moderate or large (≥50), all three estimators are essentially equivalent in all the situations considered here. Because application of the two interval estimators derived from the MLE and the UMVUE does not involve any numerical iterative procedure needed in the ALRT, for simplicity we may use these two estimators without losing efficiency.
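A minimal sketch of an MLE-based Wald-type interval under independent inverse sampling, using the asymptotic variance p^2 (1 - p)/x implied by negative binomial sampling; the exact form used in the paper may differ, and the counts in the example are illustrative.

```python
import math

def inverse_sampling_wald_ci(x1, n1, x2, n2, z=1.96):
    """Wald-type interval for the difference of two proportions under
    independent inverse sampling: in each group, sampling continues until a
    pre-determined number of cases x is reached, and N is the (random) total
    number of subjects sampled. Uses the MLE p_hat = x / N with asymptotic
    variance p^2 (1 - p) / x; a UMVUE-based interval would instead use
    (x - 1) / (N - 1)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 ** 2 * (1 - p1) / x1 + p2 ** 2 * (1 - p2) / x2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Example: 10 pre-determined cases reached after 80 subjects in one group
# and after 120 subjects in the other.
print(inverse_sampling_wald_ci(10, 80, 10, 120))
```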

7.
In the capture‐recapture problem for two independent samples, the traditional estimator, calculated as the product of the two sample sizes divided by the number of sampled subjects appearing commonly in both samples, is well known to be a biased estimator of the population size and have no finite variance under direct or binomial sampling. To alleviate these theoretical limitations, the inverse sampling, in which we continue sampling subjects in the second sample until we obtain a desired number of marked subjects who appeared in the first sample, has been proposed elsewhere. In this paper, we consider five interval estimators of the population size, including the most commonly‐used interval estimator using Wald's statistic, the interval estimator using the logarithmic transformation, the interval estimator derived from a quadratic equation developed here, the interval estimator using the χ2‐approximation, and the interval estimator based on the exact negative binomial distribution. To evaluate and compare the finite sample performance of these estimators, we employ Monte Carlo simulation to calculate the coverage probability and the standardized average length of the resulting confidence intervals in a variety of situations. To study the location of these interval estimators, we calculate the non‐coverage probability in the two tails of the confidence intervals. Finally, we briefly discuss the optimal sample size determination for a given precision to minimize the expected total cost.

8.
A formal mathematical analysis of Kimura's (1981) six-parameter model of nucleotide substitution for the case of unequal substitution rates among different pairs of nucleotides is conducted, and new formulae for estimating the number of nucleotide substitutions and its standard error are obtained. By using computer simulation, the validities and utilities of Jukes and Cantor's (1969) one-parameter formula, Takahata and Kimura's (1981) four-parameter formula, and our six-parameter formula for estimating the number of nucleotide substitutions are examined under three different schemes of nucleotide substitution. It is shown that the one-parameter and four-parameter formulae often give underestimates when the number of nucleotide substitutions is large, whereas the six-parameter formula generally gives a good estimate for all three substitution schemes examined. However, when the number of nucleotide substitutions is large, the six-parameter and four-parameter formulae are often inapplicable unless the number of nucleotides compared is extremely large. It is also shown that as long as the mean number of nucleotide substitutions is smaller than one per nucleotide site the three formulae give more or less the same estimate regardless of the substitution scheme used.

9.
Clegg LX, Gail MH, Feuer EJ. Biometrics, 2002, 58(3):684-688
We propose a new Poisson method to estimate the variance for prevalence estimates obtained by the counting method described by Gail et al. (1999, Biometrics 55, 1137-1144) and to construct a confidence interval for the prevalence. We evaluate both the Poisson procedure and the procedure based on the bootstrap proposed by Gail et al. in simulated samples generated by resampling real data. These studies show that both variance estimators usually perform well and yield coverages of confidence intervals at nominal levels. When the number of disease survivors is very small, however, confidence intervals based on the Poisson method have supranominal coverage, whereas those based on the procedure of Gail et al. tend to have below-nominal coverage. For these reasons, we recommend the Poisson method, which also reduces the computational burden considerably.
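For orientation only, the following sketch shows an exact (Garwood) Poisson interval for a single observed count. It illustrates the Poisson-based style of interval but is not the prevalence-specific method of the paper, and the example count is illustrative.

```python
from scipy.stats import chi2

def poisson_exact_ci(count, alpha=0.05):
    """Exact (Garwood) confidence interval for a Poisson mean given an
    observed count, based on chi-square quantiles."""
    lower = 0.0 if count == 0 else 0.5 * chi2.ppf(alpha / 2, 2 * count)
    upper = 0.5 * chi2.ppf(1 - alpha / 2, 2 * (count + 1))
    return lower, upper

# Example: an observed count of 7.
print(poisson_exact_ci(7))
```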

10.
O'Quigley J. Biometrics, 1992, 48(3):853-862
The problem of point and interval estimation following a Phase I trial, carried out according to the scheme outlined by O'Quigley, Pepe, and Fisher (1990, Biometrics 46, 33-48), is investigated. A reparametrization of the model suggested in this earlier work can be seen to be advantageous in some circumstances. Maximum likelihood estimators, Bayesian estimators, and one-step estimators are considered. The continual reassessment method imposes restrictions on the sample space such that it is not possible for confidence intervals to achieve exact coverage properties, however large a sample is taken. Nonetheless, our simulations, based on a small finite sample of 20, not atypical in studies of this type, indicate that the calculated intervals are useful in most practical cases and achieve coverage very close to nominal levels in a very wide range of situations. The relative merits of the different estimators and their associated confidence intervals, viewed from a frequentist perspective, are discussed.
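A minimal sketch of the one-parameter power ("empiric") working model often used with the continual reassessment method, fitted by maximum likelihood; the skeleton, trial data, and target toxicity rate below are hypothetical, and the original papers also consider other parametrizations of the working model.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical skeleton of prior toxicity probabilities for four dose levels.
skeleton = np.array([0.05, 0.15, 0.30, 0.45])

# Hypothetical trial data: dose level (0-based index) given to each of 10
# patients and whether a dose-limiting toxicity was observed.
doses = np.array([0, 0, 1, 1, 2, 2, 2, 3, 2, 2])
tox = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 1])

def neg_log_lik(a):
    """Negative log-likelihood of the one-parameter power model
    psi(dose, a) = skeleton[dose] ** exp(a)."""
    psi = skeleton[doses] ** np.exp(a)
    return -np.sum(tox * np.log(psi) + (1 - tox) * np.log(1 - psi))

fit = minimize_scalar(neg_log_lik, bounds=(-5.0, 5.0), method="bounded")
a_hat = fit.x

# Plug-in dose-toxicity curve and the dose closest to a target rate of 0.25.
p_hat = skeleton ** np.exp(a_hat)
recommended_level = int(np.argmin(np.abs(p_hat - 0.25)))
print(a_hat, p_hat.round(3), recommended_level)
```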

11.
This paper discusses interval estimation of the simple difference (SD) between the proportions of the primary infection and the secondary infection, given the primary infection, by developing three asymptotic interval estimators using Wald's test statistic, the likelihood‐ratio test, and the basic principle of Fieller's theorem. This paper further evaluates and compares the performance of these interval estimators with respect to the coverage probability and the expected length of the resulting confidence intervals. This paper finds that the asymptotic confidence interval using the likelihood ratio test consistently performs well in all situations considered here. When the underlying SD is within 0.10 and the total number of subjects is not large (say, 50), this paper further finds that the interval estimators using Fieller's theorem would be preferable to the estimator using the Wald's test statistic if the primary infection probability were moderate (say, 0.30), but the latter is preferable to the former if this probability were large (say, 0.80). When the total number of subjects is large (say, ≥200), all the three interval estimators perform well in almost all situations considered in this paper. In these cases, for simplicity, we may apply either of the two interval estimators using Wald's test statistic or Fieller's theorem without losing much accuracy and efficiency as compared with the interval estimator using the asymptotic likelihood ratio test.

12.
Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.

13.
The current variance estimators for Jukes and Cantor's one-parameter model and Kimura's two-parameter model tend to underestimate the true variances when the true proportion of differences between the two sequences under study is not small. In this paper, we developed improved variance estimators, using a higher-order Taylor expansion and empirical methods. The new estimators outperform the conventional estimators and provide accurate estimates of the true variances.
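The conventional first-order (delta-method) variance estimator for the Kimura two-parameter distance, the quantity the paper shows to be an underestimate at larger divergences, can be sketched as follows; the example proportions are illustrative, and the improved higher-order estimators of the paper are not reproduced here.

```python
import math

def k2p_distance_and_variance(P, Q, n):
    """Kimura (1980) two-parameter distance and its conventional first-order
    (delta-method) variance estimator. P and Q are the observed proportions of
    transitional and transversional differences; n is the number of sites."""
    d = -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
    c1 = 1.0 / (1 - 2 * P - Q)
    c2 = 1.0 / (1 - 2 * Q)
    c3 = 0.5 * (c1 + c2)
    var = (c1 ** 2 * P + c3 ** 2 * Q - (c1 * P + c3 * Q) ** 2) / n
    return d, var

# Example: P = 0.15 transitions, Q = 0.10 transversions over 600 sites.
print(k2p_distance_and_variance(0.15, 0.10, 600))
```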

14.
Cross-validation based point estimates of prediction accuracy are frequently reported in microarray class prediction problems. However, these point estimates can be highly variable, particularly for small sample numbers, and it would be useful to provide confidence intervals of prediction accuracy. We performed an extensive study of existing confidence interval methods and compared their performance in terms of empirical coverage and width. We developed a bootstrap case cross-validation (BCCV) resampling scheme and defined several confidence interval methods using BCCV with and without bias-correction. The widely used approach of basing confidence intervals on an independent binomial assumption of the leave-one-out cross-validation errors results in serious under-coverage of the true prediction error. Two split-sample based methods previously proposed in the literature tend to give overly conservative confidence intervals. Using BCCV resampling, the percentile confidence interval method was also found to be overly conservative without bias-correction, while the bias corrected accelerated (BCa) interval method of Efron returns substantially anti-conservative confidence intervals. We propose a simple bias reduction on the BCCV percentile interval. The method provides mildly conservative inference under all circumstances studied and outperforms the other methods in microarray applications with small to moderate sample sizes.
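A stripped-down illustration of the two ideas being compared: the naive binomial (Wald) interval around a leave-one-out cross-validated accuracy, and a simplified bootstrap-of-cases percentile interval standing in for the BCCV scheme. The toy nearest-centroid classifier and the simulated data are assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Toy stand-in classifier: assign each test sample to the class whose
    training-mean vector is nearest."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def loocv_accuracy(X, y):
    """Leave-one-out cross-validated accuracy."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += int(nearest_centroid_predict(X[mask], y[mask], X[i:i + 1])[0] == y[i])
    return hits / len(y)

def naive_binomial_ci(acc, n, z=1.96):
    """Wald interval treating the n LOOCV predictions as independent Bernoulli
    trials -- the approach the paper reports as seriously under-covering."""
    se = np.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

def bootstrap_case_ci(X, y, n_boot=200, alpha=0.05, seed=0):
    """Simplified bootstrap-of-cases percentile interval for the LOOCV
    accuracy (a stripped-down stand-in for the BCCV scheme)."""
    rng = np.random.default_rng(seed)
    accs = []
    while len(accs) < n_boot:
        idx = rng.integers(0, len(y), len(y))
        if np.bincount(y[idx], minlength=2).min() < 2:  # need both classes present
            continue
        accs.append(loocv_accuracy(X[idx], y[idx]))
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

# Simulated two-class data: 15 + 15 samples, 20 features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (15, 20)), rng.normal(0.7, 1.0, (15, 20))])
y = np.repeat([0, 1], 15)
acc = loocv_accuracy(X, y)
print(acc, naive_binomial_ci(acc, len(y)), bootstrap_case_ci(X, y))
```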

15.
Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated--indeed, they reduce the number of free parameters from approximately 200 to 0--it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we then explore the behaviour of prior probability distributions that are 'centred' on the rates specified by the fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.
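A small sketch of the first two ideas, drawing the 190 normalized exchangeability parameters of a general time-reversible amino acid model from a flat Dirichlet prior and from a prior "centred" on a fixed rate vector; the hyperparameters and the placeholder rate vector are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A general time-reversible amino acid model has 20 * 19 / 2 = 190
# exchangeability parameters, treated here as a point on the simplex.
K = 20 * 19 // 2

# Flat prior: Dirichlet(1, ..., 1) over the normalized exchangeabilities.
flat_draw = rng.dirichlet(np.ones(K))

# "Centred" prior: concentrate mass around a fixed rate vector (a placeholder
# random vector stands in here for the rates of a fixed model such as WAG).
fixed_rates = rng.gamma(shape=2.0, size=K)
fixed_rates /= fixed_rates.sum()
concentration = 1000.0   # larger values keep draws closer to the centre
centred_draw = rng.dirichlet(concentration * fixed_rates)

print(flat_draw.sum(), np.abs(centred_draw - fixed_rates).max())
```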

16.
In recent years, likelihood ratio tests (LRTs) based on DNA and protein sequence data have been proposed for testing various evolutionary hypotheses. Because conducting an LRT requires an evolutionary model of nucleotide or amino acid substitution, which is almost always unknown, it becomes important to investigate the robustness of LRTs to violations of assumptions of these evolutionary models. Computer simulation was used to examine performance of LRTs of the molecular clock, transition/transversion bias, and among-site rate variation under different substitution models. The results showed that when correct models are used, LRTs perform quite well even when the DNA sequences are as short as 300 nt. However, LRTs were found to be biased under incorrect models. The extent of bias varies considerably, depending on the hypotheses tested, the substitution models assumed, and the lengths of the sequences used, among other things. A preliminary simulation study also suggests that LRTs based on parametric bootstrapping may be more sensitive to substitution models than are standard LRTs. When an assumed substitution model is grossly wrong and a more realistic model is available, LRTs can often reject the wrong model; thus, the performance of LRTs may be improved by using a more appropriate model. On the other hand, many factors of molecular evolution have not been considered in any substitution models so far built, and the possibility of an influence of this negligence on LRTs is often overlooked. The dependence of LRTs on substitution models calls for caution in interpreting test results and highlights the importance of clarifying the substitution patterns of genes and proteins and building more realistic models.
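For the molecular-clock case, the standard LRT reduces to comparing twice the log-likelihood difference with a chi-square distribution on s - 2 degrees of freedom for s taxa; a minimal sketch with illustrative log-likelihood values follows.

```python
from scipy.stats import chi2

def clock_lrt(lnL_no_clock, lnL_clock, n_taxa):
    """Likelihood ratio test of the molecular clock: the unconstrained tree has
    2s - 3 branch lengths and the clock tree s - 1 node heights, so twice the
    log-likelihood difference is referred to a chi-square distribution with
    s - 2 degrees of freedom."""
    stat = 2.0 * (lnL_no_clock - lnL_clock)
    df = n_taxa - 2
    return stat, chi2.sf(stat, df)

# Illustrative log-likelihood values for a six-taxon data set.
print(clock_lrt(-3250.4, -3257.9, 6))
```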

17.
This paper discusses interval estimation for the ratio of the mean failure times on the basis of paired exponential observations. This paper considers five interval estimators: the confidence interval using an idea similar to Fieller's theorem (CIFT), the confidence interval using an exact parametric test (CIEP), the confidence interval using the marginal likelihood ratio test (CILR), the confidence interval assuming no matching effect (CINM), and the confidence interval using a locally most powerful test (CIMP). To evaluate and compare the performance of these five interval estimators, this paper applies Monte Carlo simulation. This paper notes that with respect to the coverage probability, use of the CIFT, CILR, or CIMP, although all are derived from large-sample theory, can perform well even when the number of pairs n is as small as 10. As compared with use of the CILR, this paper finds that use of the CIEP with equal tail probabilities is likely to lose efficiency. However, this loss can be reduced by using the optimal tail probabilities to minimize the average length when n is small (<20). This paper further notes that use of the CIMP is preferable to the CIEP in a variety of situations considered here. In fact, the average length of the CIMP with use of the optimal tail probabilities can even be shorter than that of the CILR. When the intraclass correlation between failure times within pairs is 0 (i.e., the failure times within the same pair are independent), the CINM, which is derived for two independent samples, is certainly the best one among the five interval estimators considered here. When there is an intraclass correlation but it is small (<0.10), the CIFT is recommended for obtaining a relatively short interval estimate without sacrificing the coverage probability. When the intraclass correlation is moderate or large, either the CILR or the CIMP with the optimal tail probabilities is preferable to the others. This paper also notes that if the intraclass correlation between failure times within pairs is large, use of the CINM can be misleading, especially when the number of pairs is large.

18.
It is well known that Cornfield's confidence interval of the odds ratio with the continuity correction can mimic the performance of the exact method. Furthermore, because the calculation procedure of using the former is much simpler than that of using the latter, Cornfield's confidence interval with the continuity correction is highly recommended by many publications. However, all these papers that draw this conclusion are on the basis of examining the coverage probability exclusively. The efficiency of the resulting confidence intervals is completely ignored. This paper calculates and compares the coverage probability and the average length for Woolf's logit interval estimator, Gart's logit interval estimator of adding 0.50, Cornfield's interval estimator with the continuity correction, and Cornfield's interval estimator without the continuity correction in a variety of situations. This paper notes that Cornfield's interval estimator with the continuity correction is too conservative, while Cornfield's method without the continuity correction can improve efficiency without sacrificing the accuracy of the coverage probability. This paper further notes that when the sample size is small (say, 20 or 30 per group) and the probability of exposure in the control group is small (say, 0.10) or large (say, 0.90), using Cornfield's method without the continuity correction is likely preferable to all the other estimators considered here. When the sample size is large (say, 100 per group) or when the probability of exposure in the control group is moderate (say, 0.50), Gart's logit interval estimator is probably the best.
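A minimal sketch of Woolf's logit interval and Gart's 0.50-adjusted variant for a 2x2 table; the cell counts in the example are illustrative, and Cornfield's intervals, which require iterative computation, are not reproduced here.

```python
import math

def logit_or_ci(a, b, c, d, z=1.96, add=0.0):
    """Woolf's logit interval for the odds ratio of a 2x2 table [[a, b], [c, d]];
    with add=0.5 it becomes Gart's adjusted interval, which also handles zero
    cells."""
    a, b, c, d = (x + add for x in (a, b, c, d))
    log_or = math.log(a * d / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Illustrative table: 12 exposed cases, 5 exposed controls,
# 8 unexposed cases, 20 unexposed controls.
print(logit_or_ci(12, 5, 8, 20))           # Woolf
print(logit_or_ci(12, 5, 8, 20, add=0.5))  # Gart's +0.50 adjustment
```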

19.
Tests of applicability of several substitution models for DNA sequence data
Using linear invariants for various models of nucleotide substitution, we developed test statistics for examining the applicability of a specific model to a given dataset in phylogenetic inference. The models examined are those developed by Jukes and Cantor (1969), Kimura (1980), Tajima and Nei (1984), Hasegawa et al. (1985), Tamura (1992), Tamura and Nei (1993), and a new model called the eight-parameter model. The first six models are special cases of the last model. The test statistics developed are independent of evolutionary time and phylogeny, although the variances of the statistics contain phylogenetic information. Therefore, these statistics can be used before a phylogenetic tree is estimated. Our objective is to find the simplest model that is applicable to a given dataset, keeping in mind that a simple model usually gives an estimate of evolutionary distance (number of nucleotide substitutions per site) with a smaller variance than a complicated model when the simple model is correct. We have also developed a statistical test of the homogeneity of nucleotide frequencies of a sample of several sequences that takes into account possible phylogenetic correlations. This test is used to examine the stationarity in time of the base frequencies in the sample. For Hasegawa et al.'s and the eight-parameter models, analytical formulas for estimating evolutionary distances are presented. Application of the above tests to several sets of real data has shown that the assumption of stationarity of base composition is usually acceptable when the sequences studied are closely related but otherwise it is rejected. Similarly, the simple models of nucleotide substitution are almost always rejected when actual genes are distantly related and/or the total number of nucleotides examined is large.
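For comparison, the naive chi-square test of homogeneity of base composition across sequences, which ignores the phylogenetic correlation that the test proposed in the paper accounts for, can be sketched as follows; the counts are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: sequences; columns: counts of A, C, G, T (illustrative numbers).
base_counts = np.array([
    [260, 240, 250, 250],
    [270, 230, 245, 255],
    [300, 210, 240, 250],
])

# Naive chi-square test of homogeneity of base composition across sequences;
# unlike the test developed in the paper, it ignores phylogenetic correlation.
stat, p_value, dof, expected = chi2_contingency(base_counts)
print(stat, p_value, dof)
```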

20.