1.
The genetic distance between two DNA sequences may be measured by the average number of nucleotide substitutions per position that has occurred since the two sequences diverged from a common ancestor. Estimates of this quantity can be derived from Markov models for the substitution process, while the variances are estimated using the delta method and confidence intervals calculated assuming normality. However, when the sampling distribution of the estimator deviates from normality, such intervals will not be accurate. For simple one-parameter models of nucleotide substitution, we propose a transformation of normal confidence intervals, which yields an almost exact approximation to the true confidence intervals of the distance estimators. To calculate confidence intervals for more complicated models, we propose the saddlepoint approximation. A simulation study shows that the saddlepoint-derived confidence intervals are a real improvement over existing methods.
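As a rough illustration of the baseline approach described in this abstract (a delta-method variance with a normal-theory interval), the sketch below uses the Jukes-Cantor one-parameter model. The function name `jc69_distance_ci` and its parameters are hypothetical, and the code does not implement the transformation or the saddlepoint approximation that the authors propose.

```python
import math

def jc69_distance_ci(p, n, z=1.96):
    """Jukes-Cantor distance with a delta-method, normal-theory interval.

    p: observed proportion of differing sites (assumed input)
    n: number of sites compared (assumed input)
    z: normal quantile, 1.96 for a nominal 95% interval
    """
    if not 0 <= p < 0.75:
        raise ValueError("the JC69 distance is undefined for p >= 3/4")
    # Point estimate: d = -(3/4) ln(1 - 4p/3)
    d = -0.75 * math.log(1 - 4 * p / 3)
    # Delta method: Var(d) ~ (dd/dp)^2 * p(1 - p)/n, with dd/dp = 1/(1 - 4p/3)
    var = p * (1 - p) / (n * (1 - 4 * p / 3) ** 2)
    half_width = z * math.sqrt(var)
    return d, (d - half_width, d + half_width)

d, (lo, hi) = jc69_distance_ci(p=0.1, n=500)
```

The interval is symmetric around the estimate by construction, which is one reason it can be inaccurate when the sampling distribution of the estimator is skewed.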
2.
When the number of nucleotides examined is relatively small, the estimators of nucleotide substitutions between DNA sequences often introduce systematic error even if the data used fit the mathematical model underlying the estimation formula. The systematic error of this kind is especially large for models that allow variation in substitution rate among different sites. In the present paper we present a number of formulas that produce virtually bias-free estimates of evolutionary distances for these models.
Correspondence to: M. Nei
3.
Confidence intervals of evolutionary distances between sequences and comparison with usual approaches including the bootstrap method (total citations: 1; self-citations: 1; citations by others: 0)
Two methods are commonly employed for evaluating the extent of the
uncertainty of evolutionary distances between sequences: either some
estimator of the variance of the distance estimator, or the bootstrap
method. However, both approaches can be misleading, particularly when the
evolutionary distance is small. We propose using another statistical method
which does not have the same defect: interval estimation. We show how
confidence intervals may be constructed for the Jukes and Cantor (1969) and
Kimura two-parameter (1980) estimators. We compare the exact confidence
intervals thus obtained with the approximate intervals derived by the two
previous methods, using artificial and biological data. The results show
that the usual methods clearly underestimate the variability when the
substitution rate is low and when sequences are short. Moreover, our
analysis suggests that similar results may be expected for other
evolutionary distance estimators.
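For orientation, here is a minimal sketch of the variance-based approach that this abstract compares against, applied to the Kimura two-parameter distance. It assumes two aligned sequences of equal length; the function name `k2p_distance` and the choice to skip non-ACGT characters are illustrative assumptions, and the exact intervals proposed by the authors are not implemented.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance and its delta-method variance.

    Returns (d, var). Sites with characters other than A, C, G, T are
    skipped (an assumption made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine): transition
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    P, Q = transitions / n, transversions / n
    w1, w2 = 1 - 2 * P - Q, 1 - 2 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance undefined: sequences too divergent")
    d = -0.5 * math.log(w1) - 0.25 * math.log(w2)
    # Partial derivatives of d with respect to P and Q
    c1 = 1 / w1
    c3 = 0.5 * (1 / w1 + 1 / w2)
    var = (c1 ** 2 * P + c3 ** 2 * Q - (c1 * P + c3 * Q) ** 2) / n
    return d, var

d, var = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
```

With only ten sites, as in this toy call, the variance estimate rests on one transition and one transversion, which hints at why variance-based intervals can mislead for short sequences.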
4.
Mikael Falconnet 《Mathematical biosciences》2010,224(2):101-108
We consider models of nucleotidic substitution processes where the rate of substitution at a given site depends on the state of the neighbours of the site. We first estimate the time elapsed between an ancestral sequence at stationarity and a present sequence. Second, assuming that two sequences are issued from a common ancestral sequence at stationarity, we estimate the time since divergence. In the simplest non-trivial case of a Jukes-Cantor model with CpG influence, we provide and justify mathematically consistent estimators in these two settings. We also provide asymptotic confidence intervals, valid for nucleotidic sequences of finite length, and we compute explicit formulas for the estimators and for their confidence intervals. In the general case of an RN model with YpR influence, we extend these results under a proviso, namely that the equation defining the estimator has a unique solution.
5.
Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation (total citations: 31; self-citations: 15; citations by others: 16)
Using real sequence data, we evaluate the adequacy of assumptions made in
evolutionary models of nucleotide substitution and the effects that these
assumptions have on estimation of evolutionary trees. Two aspects of the
assumptions are evaluated. The first concerns the pattern of nucleotide
substitution, including equilibrium base frequencies and the
transition/transversion-rate ratio. The second concerns the variation of
substitution rates over sites. The maximum-likelihood estimate of tree
topology appears quite robust to both these aspects of the assumptions of
the models, but evaluation of the reliability of the estimated tree by
using simpler, less realistic models can be misleading. Branch lengths are
underestimated when simpler models of substitution are used, but the
underestimation caused by ignoring rate variation over nucleotide sites is
much more serious. The goodness of fit of a model is reduced by ignoring
spatial rate variation, but unrealistic assumptions about the pattern of
nucleotide substitution can lead to an extraordinary reduction in the
likelihood. It seems that evolutionary biologists can obtain accurate
estimates of certain evolutionary parameters even with an incorrect
phylogeny, while systematists cannot get the right tree with confidence
even when a realistic, and more complex, model of evolution is assumed.
6.
Kung-Jong Lui 《Biometrical journal. Biometrische Zeitschrift》1999,41(1):83-92
When the sample size is not large or when the underlying disease is rare, to assure collection of an appropriate number of cases and to control the relative error of estimation, one may employ inverse sampling, in which one continues sampling subjects until one obtains exactly the desired number of cases. This paper focuses discussion on interval estimation of the simple difference between two proportions under independent inverse sampling. This paper develops three asymptotic interval estimators on the basis of the maximum likelihood estimator (MLE), the uniformly minimum variance unbiased estimator (UMVUE), and the asymptotic likelihood ratio test (ALRT). To compare the performance of these three estimators, this paper calculates the coverage probability and the expected length of the resulting confidence intervals on the basis of the exact distribution. This paper finds that when the underlying proportions of cases in both comparison populations are small or moderate (≤0.20), all three asymptotic interval estimators developed here perform reasonably well even when the pre-determined number of cases is as small as 5. When the pre-determined number of cases is moderate or large (≥50), all three estimators are essentially equivalent in all the situations considered here. Because application of the two interval estimators derived from the MLE and the UMVUE does not involve the numerical iterative procedure needed in the ALRT, for simplicity we may use these two estimators without losing efficiency.
7.
Kung‐Jong Lui 《Biometrical journal. Biometrische Zeitschrift》2004,46(4):474-480
In the capture-recapture problem for two independent samples, the traditional estimator, calculated as the product of the two sample sizes divided by the number of sampled subjects appearing in both samples, is well known to be a biased estimator of the population size and to have no finite variance under direct or binomial sampling. To alleviate these theoretical limitations, inverse sampling, in which we continue sampling subjects in the second sample until we obtain a desired number of marked subjects who appeared in the first sample, has been proposed elsewhere. In this paper, we consider five interval estimators of the population size, including the most commonly used interval estimator using Wald's statistic, the interval estimator using the logarithmic transformation, the interval estimator derived from a quadratic equation developed here, the interval estimator using the χ²-approximation, and the interval estimator based on the exact negative binomial distribution. To evaluate and compare the finite sample performance of these estimators, we employ Monte Carlo simulation to calculate the coverage probability and the standardized average length of the resulting confidence intervals in a variety of situations. To study the location of these interval estimators, we calculate the non-coverage probability in the two tails of the confidence intervals. Finally, we briefly discuss the optimal sample size determination for a given precision to minimize the expected total cost. (© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)
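The traditional estimator mentioned in this abstract can be written down in a few lines. The sketch below is only that point estimator; the name `petersen_estimate` and its argument names are hypothetical, and none of the five interval estimators studied in the paper are implemented here.

```python
def petersen_estimate(n1, n2, m):
    """Traditional two-sample capture-recapture estimate of population size.

    n1: size of the first (marked) sample
    n2: size of the second sample
    m:  number of subjects appearing in both samples
    """
    if m <= 0:
        # Under direct sampling m can be zero, leaving the estimate undefined
        raise ValueError("no marked subjects recaptured; estimate undefined")
    if m > min(n1, n2):
        raise ValueError("m cannot exceed either sample size")
    return n1 * n2 / m

estimate = petersen_estimate(n1=100, n2=80, m=20)
```

Under inverse sampling, `m` is fixed in advance and `n2` is the random quantity, so the undefined case `m == 0` cannot arise.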
8.
Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide (total citations: 17; self-citations: 0; citations by others: 17)
Summary: A formal mathematical analysis of Kimura's (1981) six-parameter model of nucleotide substitution for the case of unequal substitution rates among different pairs of nucleotides is conducted, and new formulae for estimating the number of nucleotide substitutions and its standard error are obtained. By using computer simulation, the validities and utilities of Jukes and Cantor's (1969) one-parameter formula, Takahata and Kimura's (1981) four-parameter formula, and our six-parameter formula for estimating the number of nucleotide substitutions are examined under three different schemes of nucleotide substitution. It is shown that the one-parameter and four-parameter formulae often give underestimates when the number of nucleotide substitutions is large, whereas the six-parameter formula generally gives a good estimate for all three substitution schemes examined. However, when the number of nucleotide substitutions is large, the six-parameter and four-parameter formulae are often inapplicable unless the number of nucleotides compared is extremely large. It is also shown that as long as the mean number of nucleotide substitutions is smaller than one per nucleotide site, the three formulae give more or less the same estimate regardless of the substitution scheme used.
On leave of absence from the Department of Biology, Faculty of Science, Kyushu University 33, Fukuoka 812, Japan
9.
We propose a new Poisson method to estimate the variance for prevalence estimates obtained by the counting method described by Gail et al. (1999, Biometrics 55, 1137-1144) and to construct a confidence interval for the prevalence. We evaluate both the Poisson procedure and the procedure based on the bootstrap proposed by Gail et al. in simulated samples generated by resampling real data. These studies show that both variance estimators usually perform well and yield coverages of confidence intervals at nominal levels. When the number of disease survivors is very small, however, confidence intervals based on the Poisson method have supranominal coverage, whereas those based on the procedure of Gail et al. tend to have below-nominal coverage. For these reasons, we recommend the Poisson method, which also reduces the computational burden considerably.
10.
J O'Quigley 《Biometrics》1992,48(3):853-862
The problem of point and interval estimation following a Phase I trial, carried out according to the scheme outlined by O'Quigley, Pepe, and Fisher (1990, Biometrics 46, 33-48), is investigated. A reparametrization of the model suggested in this earlier work can be seen to be advantageous in some circumstances. Maximum likelihood estimators, Bayesian estimators, and one-step estimators are considered. The continual reassessment method imposes restrictions on the sample space such that it is not possible for confidence intervals to achieve exact coverage properties, however large a sample is taken. Nonetheless, our simulations, based on a small finite sample of 20, not atypical in studies of this type, indicate that the calculated intervals are useful in most practical cases and achieve coverage very close to nominal levels in a very wide range of situations. The relative merits of the different estimators and their associated confidence intervals, viewed from a frequentist perspective, are discussed.
11.
Kung‐Jong Lui 《Biometrical journal. Biometrische Zeitschrift》2000,42(1):59-69
This paper discusses interval estimation of the simple difference (SD) between the proportions of the primary infection and the secondary infection, given the primary infection, by developing three asymptotic interval estimators using Wald's test statistic, the likelihood-ratio test, and the basic principle of Fieller's theorem. This paper further evaluates and compares the performance of these interval estimators with respect to the coverage probability and the expected length of the resulting confidence intervals. This paper finds that the asymptotic confidence interval using the likelihood ratio test consistently performs well in all situations considered here. When the underlying SD is within 0.10 and the total number of subjects is not large (say, 50), this paper further finds that the interval estimator using Fieller's theorem is preferable to the estimator using Wald's test statistic if the primary infection probability is moderate (say, 0.30), but the latter is preferable to the former if this probability is large (say, 0.80). When the total number of subjects is large (say, ≥200), all three interval estimators perform well in almost all situations considered in this paper. In these cases, for simplicity, we may apply either of the two interval estimators using Wald's test statistic or Fieller's theorem without losing much accuracy and efficiency as compared with the interval estimator using the asymptotic likelihood ratio test.
12.
Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences (total citations: 13; self-citations: 0; citations by others: 13)
Although phylogenetic inference of protein-coding sequences continues to dominate the literature, few analyses incorporate evolutionary models that consider the genetic code. This problem is exacerbated by the exclusion of codon-based models from commonly employed model selection techniques, presumably due to the computational cost associated with codon models. We investigated an efficient alternative to standard nucleotide substitution models, in which codon position (CP) is incorporated into the model. We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and four CP models. The majority of analyzed gene alignments are best described by CP substitution models, rather than by standard nucleotide models, and without the computational cost of full codon models. These results have significant implications for phylogenetic inference of coding sequences as they make it clear that substitution models incorporating CPs not only are a computationally realistic alternative to standard models but may also frequently be statistically superior.
13.
The current variance estimators for Jukes and Cantor's one-parameter model and Kimura's two-parameter model tend to underestimate the true variances when the true proportion of differences between the two sequences under study is not small. In this paper, we developed improved variance estimators, using a higher-order Taylor expansion and empirical methods. The new estimators outperform the conventional estimators and provide accurate estimates of the true variances.
14.
Jiang W, Varma S, Simon R 《Statistical applications in genetics and molecular biology》2008,7(1):Article8
Cross-validation based point estimates of prediction accuracy are frequently reported in microarray class prediction problems. However, these point estimates can be highly variable, particularly for small sample numbers, and it would be useful to provide confidence intervals of prediction accuracy. We performed an extensive study of existing confidence interval methods and compared their performance in terms of empirical coverage and width. We developed a bootstrap case cross-validation (BCCV) resampling scheme and defined several confidence interval methods using BCCV with and without bias-correction. The widely used approach of basing confidence intervals on an independent binomial assumption of the leave-one-out cross-validation errors results in serious under-coverage of the true prediction error. Two split-sample based methods previously proposed in the literature tend to give overly conservative confidence intervals. Using BCCV resampling, the percentile confidence interval method was also found to be overly conservative without bias-correction, while the bias corrected accelerated (BCa) interval method of Efron returns substantially anti-conservative confidence intervals. We propose a simple bias reduction on the BCCV percentile interval. The method provides mildly conservative inference under all circumstances studied and outperforms the other methods in microarray applications with small to moderate sample sizes.
15.
Huelsenbeck JP, Joyce P, Lakner C, Ronquist F 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2008,363(1512):3941-3953
Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated--indeed, they reduce the number of free parameters from approximately 200 to 0--it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we explore the behaviour of prior probability distributions that are 'centred' on the rates specified by the fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.
16.
J Zhang 《Molecular biology and evolution》1999,16(6):868-875
In recent years, likelihood ratio tests (LRTs) based on DNA and protein sequence data have been proposed for testing various evolutionary hypotheses. Because conducting an LRT requires an evolutionary model of nucleotide or amino acid substitution, which is almost always unknown, it becomes important to investigate the robustness of LRTs to violations of assumptions of these evolutionary models. Computer simulation was used to examine performance of LRTs of the molecular clock, transition/transversion bias, and among-site rate variation under different substitution models. The results showed that when correct models are used, LRTs perform quite well even when the DNA sequences are as short as 300 nt. However, LRTs were found to be biased under incorrect models. The extent of bias varies considerably, depending on the hypotheses tested, the substitution models assumed, and the lengths of the sequences used, among other things. A preliminary simulation study also suggests that LRTs based on parametric bootstrapping may be more sensitive to substitution models than are standard LRTs. When an assumed substitution model is grossly wrong and a more realistic model is available, LRTs can often reject the wrong model; thus, the performance of LRTs may be improved by using a more appropriate model. On the other hand, many factors of molecular evolution have not been considered in any substitution models so far built, and the possibility of an influence of this negligence on LRTs is often overlooked. The dependence of LRTs on substitution models calls for caution in interpreting test results and highlights the importance of clarifying the substitution patterns of genes and proteins and building more realistic models.
17.
Kung-Jong Lui 《Biometrical journal. Biometrische Zeitschrift》1997,39(5):545-558
This paper discusses interval estimation for the ratio of the mean failure times on the basis of paired exponential observations. This paper considers five interval estimators: the confidence interval using an idea similar to Fieller's theorem (CIFT), the confidence interval using an exact parametric test (CIEP), the confidence interval using the marginal likelihood ratio test (CILR), the confidence interval assuming no matching effect (CINM), and the confidence interval using a locally most powerful test (CIMP). To evaluate and compare the performance of these five interval estimators, this paper applies Monte Carlo simulation. This paper notes that, with respect to the coverage probability, the CIFT, CILR, and CIMP, although all derived from large-sample theory, can perform well even when the number of pairs n is as small as 10. As compared with use of the CILR, this paper finds that use of the CIEP with equal tail probabilities is likely to lose efficiency. However, this loss can be reduced by using the optimal tail probabilities to minimize the average length when n is small (<20). This paper further notes that use of the CIMP is preferable to the CIEP in a variety of situations considered here. In fact, the average length of the CIMP with use of the optimal tail probabilities can even be shorter than that of the CILR. When the intraclass correlation between failure times within pairs is 0 (i.e., the failure times within the same pair are independent), the CINM, which is derived for two independent samples, is certainly the best one among the five interval estimators considered here. When the intraclass correlation is nonzero but small (<0.10), the CIFT is recommended for obtaining a relatively short interval estimate without sacrificing coverage probability. When the intraclass correlation is moderate or large, either the CILR or the CIMP with the optimal tail probabilities is preferable to the others. This paper also notes that if the intraclass correlation between failure times within pairs is large, use of the CINM can be misleading, especially when the number of pairs is large.
18.
It is well known that Cornfield's confidence interval of the odds ratio with the continuity correction can mimic the performance of the exact method. Furthermore, because the calculation procedure of the former is much simpler than that of the latter, Cornfield's confidence interval with the continuity correction is highly recommended by many publications. However, the papers that draw this conclusion do so by examining the coverage probability exclusively. The efficiency of the resulting confidence intervals is completely ignored. This paper calculates and compares the coverage probability and the average length for Woolf's logit interval estimator, Gart's logit interval estimator of adding 0.50, Cornfield's interval estimator with the continuity correction, and Cornfield's interval estimator without the continuity correction in a variety of situations. This paper notes that Cornfield's interval estimator with the continuity correction is too conservative, while Cornfield's method without the continuity correction can improve efficiency without sacrificing the accuracy of the coverage probability. This paper further notes that when the sample size is small (say, 20 or 30 per group) and the probability of exposure in the control group is small (say, 0.10) or large (say, 0.90), using Cornfield's method without the continuity correction is likely preferable to all the other estimators considered here. When the sample size is large (say, 100 per group) or when the probability of exposure in the control group is moderate (say, 0.50), Gart's logit interval estimator is probably the best.
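Two of the four estimators named in this abstract have simple closed forms, sketched below under the usual 2x2 table layout. The function name `logit_or_interval`, the cell labels, and the `add` parameter are assumptions made for illustration; Cornfield's method, which requires an iterative solution, is not shown.

```python
import math

def logit_or_interval(a, b, c, d, z=1.96, add=0.0):
    """Logit-based confidence interval for the odds ratio of a 2x2 table.

    a, b: exposed and unexposed counts among cases (assumed layout)
    c, d: exposed and unexposed counts among controls (assumed layout)
    add:  0.0 gives Woolf's interval; 0.5 gives the variant that adds
          0.50 to every cell, as attributed to Gart in the abstract
    """
    a, b, c, d = (x + add for x in (a, b, c, d))
    if min(a, b, c, d) <= 0:
        raise ValueError("all (adjusted) cell counts must be positive")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

woolf = logit_or_interval(10, 20, 5, 25)
gart = logit_or_interval(10, 20, 5, 25, add=0.5)
```

Adding 0.50 to each cell keeps the interval defined when a cell count is zero, a case in which the unadjusted form raises an error here.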
19.
Using linear invariants for various models of nucleotide substitution, we
developed test statistics for examining the applicability of a specific
model to a given dataset in phylogenetic inference. The models examined are
those developed by Jukes and Cantor (1969), Kimura (1980), Tajima and Nei
(1984), Hasegawa et al. (1985), Tamura (1992), Tamura and Nei (1993), and a
new model called the eight-parameter model. The first six models are
special cases of the last model. The test statistics developed are
independent of evolutionary time and phylogeny, although the variances of
the statistics contain phylogenetic information. Therefore, these
statistics can be used before a phylogenetic tree is estimated. Our
objective is to find the simplest model that is applicable to a given
dataset, keeping in mind that a simple model usually gives an estimate of
evolutionary distance (number of nucleotide substitutions per site) with a
smaller variance than a complicated model when the simple model is correct.
We have also developed a statistical test of the homogeneity of nucleotide
frequencies of a sample of several sequences that takes into account
possible phylogenetic correlations. This test is used to examine the
stationarity in time of the base frequencies in the sample. For Hasegawa et
al.'s and the eight-parameter models, analytical formulas for estimating
evolutionary distances are presented. Application of the above tests to
several sets of real data has shown that the assumption of stationarity of
base composition is usually acceptable when the sequences studied are
closely related but otherwise it is rejected. Similarly, the simple models
of nucleotide substitution are almost always rejected when actual genes are
distantly related and/or the total number of nucleotides examined is large.