Similar Documents
A total of 20 similar documents were retrieved.
1.
Recent developments in marginal likelihood estimation for model selection in the field of Bayesian phylogenetics and molecular evolution have emphasized the poor performance of the harmonic mean estimator (HME). Although these studies have shown the merits of new approaches applied to standard normally distributed examples and small real-world data sets, not much is currently known concerning the performance and computational issues of these methods when fitting complex evolutionary and population genetic models to empirical real-world data sets. Further, these approaches have not yet seen widespread application in the field due to the lack of implementations of these computationally demanding techniques in commonly used phylogenetic packages. Here we investigate the performance of some of these new marginal likelihood estimators, specifically path sampling (PS) and stepping-stone (SS) sampling, for comparing models of demographic change and relaxed molecular clocks, using synthetic data and real-world examples for which unexpected inferences were made using the HME. Given the drastically increased computational demands of PS and SS sampling, we also investigate a posterior simulation-based analogue of Akaike's information criterion (AIC) through Markov chain Monte Carlo (MCMC), termed the AICM, a model comparison approach that shares with the HME the appealing feature of a low computational overhead over the original MCMC analysis. We confirm that the HME systematically overestimates the marginal likelihood and fails to yield reliable model classification, and we show that the AICM performs better and may be a useful initial evaluation of model choice, but that it is also, to a lesser degree, unreliable. We show that PS and SS sampling substantially outperform these estimators, and we adjust the conclusions made concerning previous analyses for the three real-world data sets that we reanalyzed. The methods used in this article are now available in BEAST, a powerful user-friendly software package for performing Bayesian evolutionary analyses.
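To make the comparison concrete, here is a minimal sketch (ours, not code from BEAST) of the stepping-stone estimator alongside the HME, assuming posterior log-likelihood samples have already been exported from a set of power-posterior chains; the array layout and variable names are illustrative assumptions.

```python
# Hedged sketch: stepping-stone (SS) vs harmonic mean (HME) log marginal
# likelihood estimates from log-likelihood samples. Assumes loglik[k] holds
# samples drawn from the power posterior at inverse temperature beta[k],
# with beta increasing from 0.0 to 1.0 (the beta = 1.0 chain is unused by SS).
import numpy as np
from scipy.special import logsumexp

def stepping_stone_logml(loglik, beta):
    logml = 0.0
    for k in range(len(beta) - 1):
        delta = beta[k + 1] - beta[k]
        # log of the mean of L^delta under the beta[k] power posterior
        logml += logsumexp(delta * np.asarray(loglik[k])) - np.log(len(loglik[k]))
    return logml

def harmonic_mean_logml(loglik_posterior):
    # HME: (1/n * sum of 1/L_i)^(-1); known to overestimate the marginal likelihood
    ll = np.asarray(loglik_posterior)
    return np.log(len(ll)) - logsumexp(-ll)
```

Working in log space with logsumexp matters in practice: raw likelihood values underflow double precision for any realistically sized alignment.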

2.
Coalescent-based inference of phylogenetic relationships among species takes into account gene tree incongruence due to incomplete lineage sorting, but for such methods to make sense species have to be correctly delimited. Because alternative assignments of individuals to species result in different parametric models, model selection methods can be applied to choose among models of species classification. In a Bayesian framework, Bayes factors (BF), based on marginal likelihood estimates, can be used to test a range of possible classifications for the group under study. Here, we explore BF and the Akaike Information Criterion (AIC) to discriminate between different species classifications in the flowering plant lineage Silene sect. Cryptoneurae (Caryophyllaceae). We estimated marginal likelihoods for different species classification models via the path sampling (PS), stepping-stone (SS) sampling, and harmonic mean estimator (HME) methods implemented in BEAST. To select among alternative species classification models, a posterior simulation-based analog of the AIC through Markov chain Monte Carlo analysis (AICM) was also applied. The results are compared to outcomes from the software BP&P. Our results agree with another recent study in finding that marginal likelihood estimates from the PS and SS methods are useful for comparing different species classifications, and they strongly support the recognition of the newly described species S. ertekinii.
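Once log marginal likelihoods are in hand, the Bayes factor comparison itself is a single subtraction; a small sketch with placeholder numbers (not values from this study):

```python
# Hedged sketch: log Bayes factor from two estimated log marginal likelihoods.
import math

logml_split = -4212.3   # hypothetical PS estimate: S. ertekinii recognized
logml_lumped = -4220.8  # hypothetical PS estimate: lumped classification

log_bf = logml_split - logml_lumped   # log BF in favor of the split model
print(f"log BF = {log_bf:.1f}, BF = {math.exp(log_bf):.3g}")
# On the Kass & Raftery scale, 2*log BF > 10 is conventionally "very strong".
```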

3.
Bayesian phylogenetic methods are generating noticeable enthusiasm in the field of molecular systematics. Many candidate phylogenetic models are typically under consideration, and different approaches are used to compare them within a Bayesian framework. The Bayes factor, defined as the ratio of the marginal likelihoods of two competing models, plays a key role in Bayesian model selection. We focus on alternative estimators of the marginal likelihood, whose computation remains a challenging problem. Several computational solutions have been proposed, none of which simultaneously outperforms the others in simplicity of implementation, computational burden, and precision of the estimates. Practitioners and researchers, often guided by available software, have so far favored the simplicity of the harmonic mean (HM) estimator. However, it is known that the resulting estimates of the Bayesian evidence in favor of one model are biased and often inaccurate, and can even have infinite variance, so that the reliability of the corresponding conclusions is doubtful. We consider improvements of the generalized harmonic mean (GHM) idea that recycle Markov chain Monte Carlo (MCMC) simulations from the posterior and share the computational simplicity of the original HM estimator but, unlike it, overcome the infinite-variance issue. We show the reliability and comparative performance of the improved harmonic mean estimators by comparing them to approximation techniques relying on improved variants of thermodynamic integration.
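One classical member of the GHM family is the Gelfand-Dey identity, 1/p(D) = E_post[g(theta) / (L(theta) * prior(theta))] for any normalized density g; choosing g close to the posterior keeps the ratio bounded and tames the variance. A minimal sketch with our own variable conventions, not code from the paper:

```python
# Hedged sketch: stabilized generalized harmonic mean (Gelfand-Dey) estimator.
# theta: (n, d) posterior draws; log_lik, log_prior: length-n arrays evaluated
# at those draws. g is a Gaussian fitted to the draws (one common choice).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gelfand_dey_logml(theta, log_lik, log_prior):
    g = multivariate_normal(theta.mean(axis=0), np.cov(theta, rowvar=False))
    log_ratio = g.logpdf(theta) - log_lik - log_prior
    # log E_post[g / (L * prior)] equals -log p(D)
    return -(logsumexp(log_ratio) - np.log(len(log_lik)))
```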

4.
Coalescent likelihood is the probability of observing the given population sequences under the coalescent model. Computation of the coalescent likelihood under the infinite sites model is a classic problem in coalescent theory. Existing methods are based on either importance sampling or Markov chain Monte Carlo and are inexact. In this paper, we develop a simple method that can compute the exact coalescent likelihood for many data sets of moderate size, including real biological data whose likelihood was previously thought to be difficult to compute exactly. Our method works for both panmictic and subdivided populations. Simulations demonstrate that the practical range of exact coalescent likelihood computation for panmictic populations is significantly larger than was previously believed. We investigate the application of our method to estimating mutation rates by maximum likelihood. A main application of the exact method is comparing the accuracy of approximate methods. To demonstrate its usefulness, we evaluate the accuracy of the program Genetree in computing the likelihood for subdivided populations.

5.
Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that, moreover, is usually nonrandom. While the widely used γ statistic with the Monte Carlo constant-rates test, or birth-death likelihood analysis with the ΔAICrc test statistic, are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective automated method has been developed for fitting diversification models to nonrandomly sampled phylogenies. Here, we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or nonrandom. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a nonrandom sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate, whereas classic approaches prefer a constant-rate model; in the former clade, results do not differ among methods (as indeed expected, since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.
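For reference, the γ statistic mentioned above (Pybus & Harvey 2000) can be computed directly from the internode intervals of an ultrametric tree; a minimal sketch under the convention that g[k] is the interval during which k lineages exist (an input convention of this sketch, not of the CorSiM package):

```python
# Hedged sketch: Pybus & Harvey's gamma statistic from internode intervals.
import math

def gamma_statistic(g):
    n = len(g) + 1                         # number of tips
    T = [0.0]
    for k, gk in enumerate(g, start=2):    # T_i = sum_{k=2}^{i} k * g_k
        T.append(T[-1] + k * gk)
    total = T[-1]
    mean_inner = sum(T[1:-1]) / (n - 2)    # mean of T_2 .. T_{n-1}
    return (mean_inner - total / 2) / (total * math.sqrt(1 / (12 * (n - 2))))
```

Under a completely sampled constant-rate pure-birth process γ is approximately standard normal; the Monte Carlo constant-rates test recalibrates this null distribution when random sampling is incomplete.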

6.
Generalized linear models play an essential role in a wide variety of statistical applications. This paper discusses an approximation of the likelihood in these models that can greatly facilitate computation. The basic idea is to replace a sum that appears in the exact log-likelihood by an expectation over the model covariates; the resulting “expected log-likelihood” (EL) can in many cases be computed significantly faster than the exact log-likelihood. In many neuroscience experiments the distribution over model covariates is controlled by the experimenter, and the expected log-likelihood approximation becomes particularly useful; for example, estimators based on maximizing this expected log-likelihood (or a penalized version thereof) can often be obtained with orders-of-magnitude computational savings compared to the exact maximum likelihood estimators. A risk analysis establishes that these maximum EL estimators often come with little cost in accuracy (and in some cases even improved accuracy) compared to standard maximum likelihood estimates. Finally, we find that these methods can significantly decrease the computation time of marginal likelihood calculations for model selection and of Markov chain Monte Carlo methods for sampling from the posterior parameter distribution. We illustrate our results by applying these methods to a computationally challenging dataset of neural spike trains obtained via large-scale multi-electrode recordings in the primate retina.
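As a toy illustration of the idea (ours, not the paper's code): for a canonical-link Poisson GLM with covariates drawn i.i.d. from N(0, sigma2 * I), the O(N) sum of exp(x_i . w) in the exact log-likelihood has a closed-form expectation:

```python
# Hedged sketch: exact vs expected log-likelihood for a Poisson GLM with a
# log link. For x ~ N(0, sigma2 * I), E[exp(x.w)] = exp(sigma2 * ||w||^2 / 2).
import numpy as np

def exact_loglik(w, X, y):
    eta = X @ w
    return y @ eta - np.exp(eta).sum()      # up to an additive constant in y

def expected_loglik(w, X, y, sigma2):
    # the O(N) sum of exp(x_i . w) is replaced by N * E[exp(x . w)]
    return y @ (X @ w) - len(y) * np.exp(0.5 * sigma2 * (w @ w))
```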

7.
Parameter inference and model selection are central tasks in mathematical modeling for systems biology, and Bayesian statistics can be used to conduct both. In particular, the framework named approximate Bayesian computation (ABC) is often used for parameter inference and model selection in systems biology. However, Monte Carlo methods need to be used to compute Bayesian posterior distributions, and the posterior distributions of parameters are sometimes almost uniform or very similar to their prior distributions. In such cases, it is difficult to choose one specific parameter value with high credibility as the representative value of the distribution. To overcome these problems, we introduced one of the population Monte Carlo algorithms, population annealing. Although population annealing is usually used in statistical mechanics, we showed that it can be used to compute Bayesian posterior distributions within the ABC framework. To deal with the non-identifiability of representative parameter values, we proposed running simulations with a parameter ensemble sampled from the posterior distribution, named the “posterior parameter ensemble”. We showed that population annealing is an efficient and convenient algorithm for generating the posterior parameter ensemble, and that simulations with the posterior parameter ensemble can not only reproduce the data used for parameter inference but also capture and predict data that were not used for inference. Lastly, we introduced the marginal likelihood within the ABC framework for Bayesian model selection, and showed that population annealing enables us to compute it and to conduct model selection based on the Bayes factor.
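A stripped-down schematic of the annealed-ABC idea follows; it is closer to a generic annealed rejection loop than to the authors' exact algorithm, and `prior_sample`, `simulate`, `distance`, and the Gaussian jitter kernel are problem-specific stand-ins, not a library API:

```python
# Hedged schematic: ABC with an annealed (decreasing) tolerance schedule.
import numpy as np

def abc_population_annealing(prior_sample, simulate, distance, data,
                             tolerances, n_particles=1000, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.array([prior_sample(rng) for _ in range(n_particles)])
    d = np.array([distance(simulate(t, rng), data) for t in theta])
    for eps in tolerances:                      # annealing schedule
        alive = d <= eps
        if not alive.any():
            raise RuntimeError("tolerance schedule too aggressive")
        idx = rng.choice(np.flatnonzero(alive), size=n_particles)  # resample
        theta = theta[idx] + rng.normal(0.0, 0.1, theta[idx].shape)  # toy move
        d = np.array([distance(simulate(t, rng), data) for t in theta])
    return theta        # approximate "posterior parameter ensemble"
```

The surviving, mutated population at the final tolerance plays the role of the posterior parameter ensemble used for prediction.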

8.
Maximum likelihood estimation of the model parameters for a spatial population based on data collected from a survey sample is usually straightforward when sampling and non-response are both non-informative, since the model can then be fitted using the available sample data, with no allowance necessary for the fact that only a part of the population has been observed. Although for many regression models this naive strategy yields consistent estimates, this is not the case for some models, such as spatial auto-regressive models. In this paper, we show that for a broad class of such models, a maximum marginal likelihood approach that uses both sample and population data leads to more efficient estimates, since it uses spatial information from sampled as well as non-sampled units. Extensive simulation experiments based on two well-known data sets are used to assess the impact of the spatial sampling design, the auto-correlation parameter, and the sample size on the performance of this approach. When compared to some widely used methods that use only sample data, the results from these experiments show that the maximum marginal likelihood approach is much more precise.
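The spatial auto-regressive case is the instructive one: the SAR log-likelihood contains a Jacobian term log|I - rho*W| that couples all units through the weights matrix W, which is why fitting on sample data alone discards information carried by non-sampled neighbours. A minimal dense-matrix sketch (ours, not the paper's code):

```python
# Hedged sketch: Gaussian log-likelihood of a SAR model y = rho*W*y + X*beta + e.
import numpy as np

def sar_loglik(rho, beta, sigma2, y, X, W):
    n = len(y)
    A = np.eye(n) - rho * W
    resid = A @ y - X @ beta
    _, logdet = np.linalg.slogdet(A)        # Jacobian term log|I - rho*W|
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            + logdet
            - resid @ resid / (2 * sigma2))
```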

9.
MOTIVATION: There are often many alternative models of a biochemical system. Distinguishing between models and finding the most suitable ones is an important challenge in systems biology, since ranking models by experimental evidence helps to judge the support for the working hypotheses underlying each model. Bayes factors are employed as a measure of evidential preference for one model over another. The marginal likelihood is a key component of Bayes factors; however, computing it is a difficult problem, as it involves integrating nonlinear functions over a multidimensional space. A number of methods are available to compute the marginal likelihood approximately, and a detailed investigation of such methods is required to find ones that perform appropriately for biochemical modelling. RESULTS: We assess four methods for estimating the marginal likelihoods required for computing Bayes factors: the Prior Arithmetic Mean estimator, the Posterior Harmonic Mean estimator, Annealed Importance Sampling, and Annealing-Melting Integration. These are investigated and compared on a typical case study in systems biology, which allows us to understand the stability of the analysis results and to make reliable judgements in an uncertain context. We investigate the variance of the Bayes factor estimates and highlight the stability of Annealed Importance Sampling and Annealing-Melting Integration for the purposes of comparing nonlinear models. AVAILABILITY: The models used in this study are available in SBML format as supplementary material to this article.
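Of the four estimators, annealed importance sampling is compact enough to sketch; the version below follows the usual prior-to-posterior geometric path, with `mcmc_move` a problem-specific kernel assumed to leave the tempered target invariant (names are ours, not from the paper's supplement):

```python
# Hedged sketch: annealed importance sampling (AIS) estimate of log p(D).
import numpy as np
from scipy.special import logsumexp

def ais_logml(prior_sample, log_lik, mcmc_move, betas, n_particles, rng):
    theta = [prior_sample(rng) for _ in range(n_particles)]
    logw = np.zeros(n_particles)
    for k in range(len(betas) - 1):         # betas: 0.0 = beta_0 < ... < 1.0
        for i in range(n_particles):
            logw[i] += (betas[k + 1] - betas[k]) * log_lik(theta[i])
            theta[i] = mcmc_move(theta[i], betas[k + 1], rng)
    return logsumexp(logw) - np.log(n_particles)
```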

10.
Previous studies using thermal imaging cameras (TI) have used target size as an indicator of target altitude when radar was not available, but this approach may lead to errors if birds that differ greatly in size are actually flying at the same altitude. To overcome this potential difficulty and obtain more accurate measures of the flight altitudes and numbers of individual migrants, we have developed a technique that combines a vertically pointed stationary radar beam and a vertically pointed thermal imaging camera (VERTRAD/TI). The TI provides accurate counts of the birds passing through a fixed, circular sampling area in the TI display, and the radar provides accurate data on their flight altitudes. We analyzed samples of VERTRAD/TI video data collected during nocturnal fall migration in 2000 and 2003 and during the daytime arrival of spring trans-Gulf migration in 2003. We used a video peak store (VPS) to make time exposures of target tracks in the video record of the TI and developed criteria to distinguish birds, foraging bats, and insects based on characteristics of the tracks in the VPS images and the altitude of the targets. The TI worked equally well during daytime and nighttime observations, and best when skies were clear, because thermal radiance from clouds often obscured targets. The VERTRAD/TI system, though costly, is a valuable tool for measuring accurate bird migration traffic rates (the number of birds crossing 1609.34 m [1 statute mile] of front per hour) for different altitudinal strata above 25 m. The technique can be used to estimate the potential risk of migrating birds colliding with man-made obstacles of various heights (e.g., communication and broadcast towers and wind turbines), a subject of increasing importance to conservation biologists.
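As a back-of-the-envelope illustration of the traffic-rate unit (our own geometry, not the authors' published calibration): if the TI's circular sampling area subtends a fixed angular field of view, its diameter at altitude h meters is 2*h*tan(fov/2), and stratum counts scale to a 1609.34 m front accordingly:

```python
# Hedged arithmetic sketch: scaling TI counts in one altitudinal stratum to a
# migration traffic rate (birds per 1609.34 m of front per hour). The fixed
# angular field of view and the width formula are assumptions of this sketch.
import math

def traffic_rate(count, hours, altitude_m, fov_deg):
    front_width_m = 2 * altitude_m * math.tan(math.radians(fov_deg) / 2)
    return (count / hours) * (1609.34 / front_width_m)

print(traffic_rate(count=38, hours=2.0, altitude_m=500, fov_deg=12))
```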

11.
Stephens and Donnelly have introduced a simple yet powerful importance sampling scheme for computing the likelihood in population genetic models. Fundamental to the method is an approximation to the conditional probability of the allelic type of an additional gene, given those currently in the sample. As noted by Li and Stephens, the product of these conditional probabilities for a sequence of draws that gives the frequency of allelic types in a sample is an approximation to the likelihood, and can be used directly in inference. The aim of this note is to demonstrate the high level of accuracy of the "product of approximate conditionals" (PAC) likelihood when used with microsatellite data. Results obtained on simulated microsatellite data show that this strategy leads to a negligible bias over a wide range of the scaled mutation parameter theta. Furthermore, the sampling variance of the likelihood estimates, as well as the computation time, are lower than those obtained with importance sampling (IS) over the whole range of theta. It follows that this approach represents an efficient substitute for IS algorithms in computationally intensive (e.g., MCMC) inference methods in population genetics.
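In symbols, the PAC likelihood replaces the intractable likelihood with a product of approximate conditionals over one ordering of the sampled genes (in practice averaged over several random orderings):

\[
\hat{L}_{\mathrm{PAC}}(\theta) \;=\; \prod_{k=1}^{n} \hat{\pi}\left(y_k \mid y_1, \ldots, y_{k-1}; \theta\right)
\]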

12.
The maximum likelihood (ML) method offers excellent performance for direction-of-arrival (DOA) estimation, but it requires a multidimensional nonlinear search that complicates the computation and hinders practical use. To reduce the high computational burden of the ML method and make it more suitable for engineering applications, we apply the artificial bee colony (ABC) algorithm to maximize the likelihood function for DOA estimation. A recently proposed bio-inspired computing algorithm, the ABC algorithm was originally designed to optimize multivariable functions by imitating the behavior of a bee colony locating good nectar sources in its natural environment, and it offers an excellent alternative to conventional methods in ML-DOA estimation. The performance of ABC-based ML and other popular metaheuristic-based ML methods for DOA estimation is compared across various scenarios of convergence, signal-to-noise ratio (SNR), and number of iterations, and the computational loads of ABC-based ML and conventional ML methods are also investigated. Simulation results demonstrate that the proposed ABC-based method is more efficient in computation and statistical performance than other ML-based DOA estimation methods.
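For a uniform linear array, the deterministic ML criterion that such an optimizer maximizes can be written as the trace of the sample covariance projected onto the steering subspace; a minimal sketch assuming half-wavelength sensor spacing (a convention of this sketch, not necessarily of the paper):

```python
# Hedged sketch: concentrated ML objective for DOA estimation. Any global
# optimizer (ABC included) can be used to maximize it over candidate angles.
import numpy as np

def steering_matrix(angles_rad, n_sensors):
    k = np.arange(n_sensors)[:, None]
    return np.exp(1j * np.pi * k * np.sin(np.asarray(angles_rad))[None, :])

def ml_objective(angles_rad, R_hat):
    A = steering_matrix(angles_rad, R_hat.shape[0])
    P = A @ np.linalg.solve(A.conj().T @ A, A.conj().T)   # projector onto span(A)
    return np.real(np.trace(P @ R_hat))                   # to be maximized
```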

13.
We have developed a pruning algorithm for likelihood estimation of a tree of populations. This algorithm enables us to compute the likelihood for large trees. Thus, it gives an efficient way of obtaining the maximum-likelihood estimate (MLE) for a given tree topology. Our method utilizes the differences accumulated by random genetic drift in allele count data from single-nucleotide polymorphisms (SNPs), ignoring the effect of mutation after divergence from the common ancestral population. The computation of the maximum-likelihood tree involves both maximizing likelihood over branch lengths of a given topology and comparing the maximum-likelihood across topologies. Here our focus is the maximization of likelihood over branch lengths of a given topology. The pruning algorithm computes arrays of probabilities at the root of the tree from the data at the tips of the tree; at the root, the arrays determine the likelihood. The arrays consist of probabilities related to the number of coalescences and allele counts for the partially coalesced lineages. Computing these probabilities requires an unusual two-stage algorithm. Our computation is exact and avoids time-consuming Monte Carlo methods. We can also correct for ascertainment bias.

14.
A critical decision in landscape genetic studies is whether to use individuals or populations as the sampling unit. This decision affects the time and cost of sampling and may affect ecological inference. We analyzed 334 Columbia spotted frogs at 8 microsatellite loci across 40 sites in northern Idaho to determine how inferences from landscape genetic analyses would vary with sampling design. At all sites, we compared a proportion available sampling scheme (PASS), in which all samples were used, to resampled datasets of 2–11 individuals. Additionally, we compared a population sampling scheme (PSS) to an individual sampling scheme (ISS) at 18 sites with sufficient sample size. We applied an information-theoretic approach with both restricted maximum likelihood and maximum likelihood estimation to evaluate competing landscape resistance hypotheses. We found that PSS supported a low-density forest model when restricted maximum likelihood was used, but a model combining most variables when maximum likelihood was used. Model support also differed between AIC and BIC. When testing hypotheses about which land cover types create the greatest resistance to gene flow for Columbia spotted frogs, ISS supported this model as well as additional models. Increased sampling density and study extent, seen by comparing PSS to PASS, changed model support. As the number of individuals increased, model support from ISS converged on that from PSS at 7–9 individuals. ISS may be useful for increasing study extent and sampling density, but it may lack the power to provide strong support for the correct model with microsatellite datasets. Our results highlight the importance of additional research on the effects of sampling design on landscape genetics inference.
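For reference, the two scores whose disagreement the study reports differ only in the penalty term; a minimal sketch (formulas only, no values from this study):

```python
# Hedged sketch: AIC vs BIC for a fitted resistance model with k parameters
# estimated from n observations. BIC's log(n) penalty exceeds AIC's 2 once n > 7,
# which is one generic route to the AIC/BIC disagreements reported above.
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)
```

One caveat relevant here: REML log-likelihoods are comparable across models only when the fixed-effects structure is held constant, which is one way REML-based and ML-based rankings can diverge.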

15.
MOTIVATION: Maximum likelihood (ML) methods have become very popular for constructing phylogenetic trees from sequence data. However, despite noticeable recent progress, with large and difficult datasets (e.g. multiple genes with conflicting signals) current ML programs still require huge computing time and can become trapped in bad local optima of the likelihood function. When this occurs, the resulting trees may still show some of the defects (e.g. long branch attraction) of starting trees obtained using fast distance or parsimony programs. METHODS: Subtree pruning and regrafting (SPR) topological rearrangements are usually sufficient to intensively search the tree space. Here, we propose two new methods to make SPR moves more efficient. The first method uses a fast distance-based approach to detect the least promising candidate SPR moves, which are then simply discarded. The second method locally estimates the change in likelihood for any remaining potential SPRs, as opposed to globally evaluating the entire tree for each possible move. These two methods are implemented in a new algorithm with a sophisticated filtering strategy, which efficiently selects potential SPRs and concentrates most of the likelihood computation on the promising moves. RESULTS: Experiments with real datasets comprising 35-250 taxa show that, while indeed greatly reducing the amount of computation, our approach provides likelihood values at least as good as those of the best-known ML methods so far and is very robust to poor starting trees. Furthermore, combining our new SPR algorithm with local moves such as PHYML's nearest neighbor interchanges, the time needed to find good solutions can sometimes be reduced even more.

16.
Wang J. Genetical Research, 2001, 78(3): 243-257
A pseudo maximum likelihood method is proposed to estimate effective population size (Ne) using temporal changes in allele frequencies at multi-allelic loci. The computation is simplified dramatically by (1) approximating the multi-dimensional joint probabilities of all the data by the product of marginal probabilities (hence the name pseudo-likelihood), (2) exploiting the special properties of the transition matrix, and (3) using a hidden Markov chain algorithm. Simulations show that the pseudo-likelihood method has similar performance but needs much less computing time and storage than the full likelihood method in the case of 3 alleles per locus. Owing to these computational savings, I was able to assess the performance of the pseudo-likelihood method against the F-statistic method over a wide range of parameters by extensive simulations. It is shown that the pseudo-likelihood method gives more accurate and precise estimates of Ne than the F-statistic method, and the performance difference is mainly due to the presence of rare alleles in the samples. The pseudo-likelihood method is also flexible and can use three or more temporal samples simultaneously to estimate satisfactorily the Ne of each period, or the growth parameters of the population. The accuracy and precision of both methods depend on the ratio of the product of sample size and the number of generations involved to Ne, and on the number of independent alleles used. In an application of the pseudo-likelihood method to a large data set from an olive fly population, more precise estimates of Ne were obtained than those from the F-statistic method.
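The F-statistic baseline being compared against is simple enough to state; a hedged sketch of the moment estimator, in the spirit of Nei and Tajima's Fc with the usual sample-size correction (the exact correction term depends on the sampling plan, which this sketch glosses over):

```python
# Hedged sketch: temporal-method moment estimator of Ne from allele frequencies
# x (generation 0) and y (generation t), with sample sizes s0 and st.
import numpy as np

def ne_from_temporal_f(x, y, t, s0, st):
    x, y = np.asarray(x, float), np.asarray(y, float)
    fc = np.mean((x - y) ** 2 / ((x + y) / 2 - x * y))   # standardized variance
    return t / (2 * (fc - 1 / (2 * s0) - 1 / (2 * st)))
```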

17.
Wong WS, Yang Z, Goldman N, Nielsen R. Genetics, 2004, 168(2): 1041-1051
The parsimony method of Suzuki and Gojobori (1999) and the maximum likelihood method developed from the work of Nielsen and Yang (1998) are two widely used methods for detecting positive selection in homologous protein-coding sequences. Both methods consider an excess of nonsynonymous (replacement) substitutions as evidence for positive selection. Previously published simulation studies comparing the performance of the two methods show contradictory results. Here we conduct a more thorough simulation study to cover and extend the parameter space used in previous studies. We also reanalyzed an HLA data set that was previously claimed to cause problems when analyzed using the maximum likelihood method. Our new simulations and the reanalysis of the HLA data demonstrate that the maximum likelihood method has good power and accuracy in detecting positive selection over a wide range of parameter values. The poor performance reported in previous studies appears to be due to numerical problems in the optimization algorithms and does not reflect the true performance of the method. The parsimony method has a very low rate of false positives but very little power for detecting positive selection or identifying positively selected sites.
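The ML method's decision rule is a likelihood-ratio test between nested codon models; a minimal sketch with placeholder log-likelihoods (e.g., codeml's M1a vs M2a comparison, which differ by two parameters; the numbers are not from the HLA data):

```python
# Hedged sketch: likelihood-ratio test for positive selection between
# nested site models.
from scipy.stats import chi2

lnl_null, lnl_alt, extra_params = -5021.7, -5009.2, 2   # hypothetical values
lrt = 2 * (lnl_alt - lnl_null)
print(f"2*dlnL = {lrt:.2f}, p = {chi2.sf(lrt, df=extra_params):.2e}")
```

Restarting the optimizer from several initial values is the standard guard against exactly the numerical failures the reanalysis identifies.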

18.
19.
Computer simulation was used to compare minimum variance quadratic estimation (MIVQUE), minimum norm quadratic unbiased estimation (MINQUE), restricted maximum likelihood (REML), maximum likelihood (ML), and Henderson's Method 3 (HM3) on the basis of variance among estimates, mean square error (MSE), bias, and probability of nearness, for estimation of both individual variance components and three ratios of variance components. The investigation also compared three procedures for dealing with negative estimates and included the use of both individual observations and plot means as the experimental unit of the analysis. The structure of the simulated data (field design, mating designs, genetic architecture, and imbalance) represented typical analysis problems in quantitative forest genetics. Comparing the estimation techniques demonstrated that: estimates of probability of nearness did not discriminate among techniques; bias discriminated among procedures for dealing with negative estimates but not among estimation techniques (except ML); sampling variance among estimates discriminated among procedures for dealing with negative estimates, estimation techniques, and units of observation; and MSE provided no information beyond the variance of the estimates. HM3 and REML were the closest competitors under these criteria; however, REML demonstrated greater robustness to imbalance. Of the three negative-estimate procedures, two are of practical significance, and guidelines for their application are presented. Estimates from individual observations were always preferable to those from plot means over the experimental levels of this study.
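For intuition about where negative estimates come from, consider the balanced one-way random-effects layout, where the ANOVA-type (Method 3 style) estimator takes a closed form that REML reproduces in the balanced case; a minimal sketch (ours, a deliberately simplified setting compared to the mating designs simulated above):

```python
# Hedged sketch: ANOVA variance components for a balanced one-way
# random-effects design; the among-group estimate goes negative whenever
# MSA < MSE, which is what the "negative estimate procedures" must handle.
import numpy as np

def anova_variance_components(y):
    """y: (a, n) array of a groups with n observations each."""
    a, n = y.shape
    group_means, grand = y.mean(axis=1), y.mean()
    mse = ((y - group_means[:, None]) ** 2).sum() / (a * (n - 1))
    msa = n * ((group_means - grand) ** 2).sum() / (a - 1)
    return (msa - mse) / n, mse        # (sigma2_among, sigma2_within)
```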

20.
O'Brien SM, Dunson DB. Biometrics, 2004, 60(3): 739-746
Bayesian analyses of multivariate binary or categorical outcomes typically rely on probit or mixed effects logistic regression models that do not have a marginal logistic structure for the individual outcomes. In addition, difficulties arise when simple noninformative priors are chosen for the covariance parameters. Motivated by these problems, we propose a new type of multivariate logistic distribution that can be used to construct a likelihood for multivariate logistic regression analysis of binary and categorical data. The model for individual outcomes has a marginal logistic structure, simplifying interpretation. We follow a Bayesian approach to estimation and inference, developing an efficient data augmentation algorithm for posterior computation. The method is illustrated with application to a neurotoxicology study.
