Similar articles
Found 20 similar articles (search time: 31 ms)
1.
There is an increasing demand for evolutionary models to incorporate relatively realistic dynamics, ranging from selection at many genomic sites to complex demography, population structure, and ecological interactions. Such models can generally be implemented as individual‐based forward simulations, but the large computational overhead of these models often makes simulation of whole chromosome sequences in large populations infeasible. This situation presents an important obstacle to the field that requires conceptual advances to overcome. The recently developed tree‐sequence recording method (Kelleher, Thornton, Ashander, & Ralph, 2018), which stores the genealogical history of all genomes in the simulated population, could provide such an advance. This method has several benefits: (1) it allows neutral mutations to be omitted entirely from forward‐time simulations and added later, thereby dramatically improving computational efficiency; (2) it allows neutral burn‐in to be constructed extremely efficiently after the fact, using “recapitation”; (3) it allows direct examination and analysis of the genealogical trees along the genome; and (4) it provides a compact representation of a population's genealogy that can be analysed in Python using the msprime package. We have implemented the tree‐sequence recording method in SLiM 3 (a free, open‐source evolutionary simulation software package) and extended it to allow the recording of non‐neutral mutations, greatly broadening the utility of this method. To demonstrate the versatility and performance of this approach, we showcase several practical applications that would have been beyond the reach of previously existing methods, opening up new horizons for the modelling and exploration of evolutionary processes.
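As a rough illustration of why a tree sequence is compact, the sketch below stores each parent-child link once together with the genomic interval over which it holds, and recovers the local tree at any position by filtering. This is a toy data structure written for this listing, not the SLiM or msprime API; the `Edge` type, the function names, and the example genealogy are all hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Edge(NamedTuple):
    """One parent-child link that holds over the half-open interval [left, right)."""
    left: float
    right: float
    parent: int
    child: int


def local_tree(edges: list[Edge], position: float) -> dict[int, int]:
    """Return the child -> parent map of the tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def mrca(tree: dict[int, int], a: int, b: int) -> int:
    """Walk up from `a`, then from `b`, until the two ancestor paths meet."""
    ancestors = {a}
    while a in tree:
        a = tree[a]
        ancestors.add(a)
    while b not in ancestors:
        b = tree[b]
    return b


# Hypothetical genealogy of samples 0, 1, 2 on a genome of length 10.
# The links 0 -> 3 and 1 -> 3 are shared by both local trees and stored once;
# only the root differs between the left half (node 4) and right half (node 5).
EDGES = [
    Edge(0, 10, 3, 0),
    Edge(0, 10, 3, 1),
    Edge(0, 5, 4, 3),
    Edge(0, 5, 4, 2),
    Edge(5, 10, 5, 3),
    Edge(5, 10, 5, 2),
]

left_tree = local_tree(EDGES, 2.0)
right_tree = local_tree(EDGES, 7.0)
```

Here six edges describe two local trees that would need eight parent-child links if stored separately; the saving grows as neighbouring trees along a chromosome share more of their structure.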

2.

Background

Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies of DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging.

Results

We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability, and 3) accuracy of recombination hotspot position and intensity. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, has compensated for the deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, could simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms to approximate the standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they are more advantageous than other programs for a spectrum of evolutionary models. To validate recombination hotspots, the LDhat 2.2 rhomap package, sequenceLDhot, and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data.

Conclusions

While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.

3.
With the continued adoption of genome‐scale data in evolutionary biology comes the challenge of adequately harnessing the information to make accurate phylogenetic inferences. Coalescent‐based methods of species tree inference have become common, and concatenation has been shown in simulation to perform well, particularly when levels of incomplete lineage sorting are low. However, simulation conditions are often overly simplistic, leaving empiricists with uncertainty regarding analytical tools. We use a large ultraconserved element data set (>3,000 loci) from rattlesnakes of the Crotalus triseriatus group to delimit lineages and estimate species trees using concatenation and several coalescent‐based methods. Unpartitioned and partitioned maximum likelihood and Bayesian analysis of the concatenated matrix yield a topology identical to coalescent analysis of a subset of the data in bpp. ASTRAL analysis on a subset of the more variable loci also results in a tree consistent with concatenation and bpp, whereas the SVDquartets phylogeny differs at additional nodes. The size of the concatenated matrix has a strong effect on species tree inference using SVDquartets, warranting additional investigation on optimal data characteristics for this method. Species delimitation analyses suggest up to 16 unique lineages may be present within the C. triseriatus group, with divergences occurring during the Neogene and Quaternary. Network analyses suggest hybridization within the group is relatively rare. Altogether, our results reaffirm the Mexican highlands as a biodiversity hotspot and suggest that coalescent‐based species tree inference on data subsets can provide a strongly supported species tree consistent with concatenation of all loci with a large amount of missing data.

4.
The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.
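The scaling claim in this abstract can be written compactly. The notation below is an assumption made for illustration, since the abstract gives none: $P_e(L)$ is the probability of inferring the wrong species tree from $L$ loci, $\Phi^{-1}$ is the probit (inverse standard normal distribution function), and $a$ and $b > 0$ are constants that depend on the method and the true parameters.

```latex
\Phi^{-1}\bigl(P_e(L)\bigr) \;\approx\; a - b\sqrt{L}
\qquad\Longleftrightarrow\qquad
P_e(L) \;\approx\; \Phi\bigl(a - b\sqrt{L}\bigr)
```

Read this way, a more efficient method is one with a larger slope $b$: it needs fewer loci to push the error probability below a given threshold.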

5.
Knudsen B, Miyamoto MM. Genetics, 2007, 176(4): 2335-2342
Coalescent theory provides a powerful framework for estimating the evolutionary, demographic, and genetic parameters of a population from a small sample of individuals. Current coalescent models have largely focused on population genetic factors (e.g., mutation, population growth, and migration) rather than on the effects of experimental design and error. This study develops a new coalescent/mutation model that accounts for unobserved polymorphisms due to missing data, sequence errors, and multiple reads for diploid individuals. The importance of accommodating these effects of experimental design and error is illustrated with evolutionary simulations and a real data set from a population of the California sea hare. In particular, a failure to account for sequence errors can lead to overestimated mutation rates, inflated coalescent times, and inappropriate conclusions about the population. This current model can now serve as a starting point for the development of newer models with additional experimental and population genetic factors. It is currently implemented as a maximum-likelihood method, but this model may also serve as the basis for the development of Bayesian approaches that incorporate experimental design and error.

6.
Bayesian inference operates under the assumption that the empirical data are a good statistical fit to the analytical model, but this assumption can be challenging to evaluate. Here, we introduce a novel R package that utilizes posterior predictive simulation to evaluate the fit of the multispecies coalescent model used to estimate species trees. We conduct a simulation study to evaluate the consistency of different summary statistics in comparing posterior and posterior predictive distributions, the use of simulation replication in reducing error rates and the utility of parallel process invocation towards improving computation times. We also test P2C2M on two empirical data sets in which hybridization and gene flow are suspected of contributing to shared polymorphism, which is in violation of the coalescent model: Tamias chipmunks and Myotis bats. Our results indicate that (i) probability‐based summary statistics display the lowest error rates, (ii) the implementation of simulation replication decreases the rate of type II errors, and (iii) our R package displays improved statistical power compared to previous implementations of this approach. When probabilistic summary statistics are used, P2C2M corroborates the assumption that genealogies collected from Tamias and Myotis are not a good fit to the multispecies coalescent model. Taken as a whole, our findings argue that an assessment of the fit of the multispecies coalescent model should accompany any phylogenetic analysis that estimates a species tree.

7.
The multispecies coalescent (MSC) model accommodates both species divergences and within-species coalescent and provides a natural framework for phylogenetic analysis of genomic data when the gene trees vary across the genome. The MSC model implemented in the program bpp assumes a molecular clock and the Jukes–Cantor model, and is suitable for analyzing genomic data from closely related species. Here we extend our implementation to more general substitution models and relaxed clocks to allow the rate to vary among species. The MSC-with-relaxed-clock model allows the estimation of species divergence times and ancestral population sizes using genomic sequences sampled from contemporary species when the strict clock assumption is violated, and provides a simulation framework for evaluating species tree estimation methods. We conducted simulations and analyzed two real datasets to evaluate the utility of the new models. We confirm that the clock-JC model is adequate for inference of shallow trees with closely related species, but it is important to account for clock violation for distant species. Our simulation suggests that there is valuable phylogenetic information in the gene-tree branch lengths even if the molecular clock assumption is seriously violated, and the relaxed-clock models implemented in bpp are able to extract such information. Our Markov chain Monte Carlo algorithms suffer from mixing problems when used for species tree estimation under the relaxed clock and we discuss possible improvements. We conclude that the new models are currently most effective for estimating population parameters such as species divergence times when the species tree is fixed.

8.
The coalescent with recombination is a fundamental model to describe the genealogical history of DNA sequence samples from recombining organisms. Considering recombination as a process which acts along genomes and which creates sequence segments with shared ancestry, we study the influence of single recombination events upon tree characteristics of the coalescent. We focus on properties such as tree height and tree balance and quantify analytically the changes in these quantities incurred by recombination in terms of probability distributions. We find that changes in tree topology are often relatively mild under conditions of neutral evolution, while changes in tree height are on average quite large. Our results add to a quantitative understanding of the spatial coalescent and provide the neutral reference to which the impact of other evolutionary scenarios, for instance tree distortion by selective sweeps, can be compared.
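For orientation, the tree height that such studies take as their starting point can be simulated in a few lines. The sketch below is not the method of the abstract: it covers only the standard neutral coalescent without recombination, and it assumes time is measured in units in which each pair of lineages coalesces at rate 1 (2N generations for a diploid population of size N). The function names are hypothetical.

```python
import random


def tree_height(n: int, rng: random.Random) -> float:
    """Simulate the time to the most recent common ancestor of n sampled lineages."""
    height = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the next coalescence occurs at rate C(k, 2).
        rate = k * (k - 1) / 2
        height += rng.expovariate(rate)
    return height


def expected_height(n: int) -> float:
    """Closed-form mean of the tree height: sum of 2 / (k (k - 1)) for k = 2..n."""
    return 2.0 * (1.0 - 1.0 / n)


def mean_height(n: int, replicates: int, seed: int = 1) -> float:
    """Average simulated tree height over independent replicates."""
    rng = random.Random(seed)
    return sum(tree_height(n, rng) for _ in range(replicates)) / replicates
```

Comparing `mean_height(10, 20000)` with `expected_height(10)` gives a quick sanity check of the simulation. Most of the expected height comes from the final interval with two lineages, which is one intuition for why a single recombination event can change tree height substantially.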

9.
We implement an isolation with migration model for three species, with migration occurring between two closely related species while an out-group species is used to provide further information concerning gene trees and model parameters. The model is implemented in the likelihood framework for analyzing multilocus genomic sequence alignments, with one sequence sampled from each of the three species. The prior distribution of gene tree topology and branch lengths at every locus is calculated using a Markov chain characterization of the genealogical process of coalescent and migration, which integrates over the histories of migration events analytically. The likelihood function is calculated by integrating over branch lengths in the gene trees (coalescent times) numerically. We analyze the model to study the gene tree-species tree mismatch probability and the time to the most recent common ancestor at a locus. The model is used to construct a likelihood ratio test (LRT) of speciation with gene flow. We conduct computer simulations to evaluate the LRT and find that the test is in general conservative, with the false positive rate well below the significance level. For the test to have substantial power, hundreds of loci are needed. Application of the test to a human-chimpanzee-gorilla genomic data set suggests gene flow around the time of speciation of the human and the chimpanzee.

10.
Dating the time of divergence and understanding speciation processes are central to the study of the evolutionary history of organisms but are notoriously difficult. The difficulty is largely rooted in variations in the ancestral population size or in the genealogy variation across loci. To depict the speciation processes and divergence histories of three monophyletic Takydromus species endemic to Taiwan, we sequenced 20 nuclear loci and combined them with one mitochondrial locus published in GenBank. They were analysed by a multispecies coalescent approach within a Bayesian framework. Divergence dating based on the gene tree approach showed high variation among loci, and the divergence was estimated at an earlier date than when derived by the species‐tree approach. To test whether variations in the ancestral population size accounted for the majority of this variation, we conducted computer inferences using isolation‐with‐migration (IM) and approximate Bayesian computation (ABC) frameworks. The results revealed that gene flow during the early stage of speciation was strongly favoured over the isolation model, and the initiation of the speciation process was far earlier than the dates estimated by gene‐ and species‐based divergence dating. Due to their limited dispersal ability, it is suggested that geographical isolation may have played a major role in the divergence of these Takydromus species. Nevertheless, this study reveals a more complex situation and demonstrates that gene flow during the speciation process cannot be overlooked and may have a great impact on divergence dating. By using multilocus data and incorporating Bayesian coalescence approaches, we provide a more biologically realistic framework for delineating the divergence history of Takydromus.

11.

Background  

In recent years there has been a trend of leaving the strict molecular clock in order to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes formulation of informative prior distributions for branch lengths possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies.

12.
Hybridization and introgression have important consequences in evolution, such as increasing the genetic diversity and adaptive potential of a species. One of their most conspicuous footprints is discordance among gene trees or between genes and phenotypes. However, most studies that report introgression fail to disprove the null hypothesis that genetic incongruence may result from stochastic sorting of ancestral allelic polymorphisms. In the case of ancient introgression, these two processes may be especially difficult to distinguish topologically, but they make different predictions about the patterns of coalescence among loci. Here we apply three methods, molecular dating, multispecies coalescent models, and gene tree simulation under coalescence, to compare these two hypotheses that explain the polyphyletic mtDNA of the butterfly peacock bass, Cichla orinocensis. In comparison with a species tree based on 20 unlinked nuclear loci, we determined that mtDNA divergences were too recent to be explained by ancestral polymorphism. Similarly, coalescent species tree branches were significantly shorter when putative introgressed mtDNA was incorporated, and simulations showed the mtDNA topology to be unlikely under lineage sorting only. We conclude that introgression approximately 1.5 million years ago resulted in capture by C. orinocensis of an mtDNA lineage ancestral to the modern subspecies C. oc. monoculus.

13.
Genome-scale sequence data have become increasingly available in phylogenetic studies for understanding the evolutionary histories of species. However, it is challenging to develop probabilistic models to account for heterogeneity of phylogenomic data. The multispecies coalescent model describes gene trees as independent random variables generated from a coalescence process occurring along the lineages of the species tree. Since the multispecies coalescent model allows gene trees to vary across genes, coalescent-based methods have been widely used to account for heterogeneous gene trees in phylogenomic data analysis. In this paper, we summarize and evaluate the performance of coalescent-based methods for estimating species trees from genome-scale sequence data. We investigate the effects of deep coalescence and mutation on the performance of species tree estimation methods. We found that the coalescent-based methods perform well in estimating species trees for a large number of genes, regardless of the degree of deep coalescence and mutation. The performance of the coalescent methods is negatively correlated with the lengths of internal branches of the species tree.

14.
Estimates of the timing of divergence are central to testing the underlying causes of speciation. Relaxed molecular clocks and fossil calibration have improved these estimates; however, these advances are implemented in the context of gene trees, which can overestimate divergence times. Here we couple recent innovations for dating speciation events with the analytical power of species trees, where multilocus data are considered in a coalescent context. Divergence times are estimated in the bird genus Aphelocoma to test whether speciation in these jays coincided with mountain uplift or glacial cycles. Gene trees and species trees show general agreement that diversification began in the Miocene amid mountain uplift. However, dates from the multilocus species tree are more recent, occurring predominantly in the Pleistocene, consistent with theory that divergence times can be significantly overestimated with gene‐tree based approaches that do not correct for genetic divergence that predates speciation. In addition to coalescent stochasticity, Haldane's rule could account for some differences in timing estimates between mitochondrial DNA and nuclear genes. By incorporating a fossil calibration applied to the species tree, in addition to the process of gene lineage coalescence, the present approach provides a more biologically realistic framework for dating speciation events, and hence for testing the links between diversification and specific biogeographic and geologic events.

15.
Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals, each with many genes, splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are four species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendant branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.
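The four-species statement can be made concrete with a small calculation. The sketch below assumes the standard quartet result for the multispecies coalescent with one gene per species: the unrooted gene tree matching the species tree has probability 1 - (2/3)e^(-t), and each of the two alternatives has probability (1/3)e^(-t), where t is the total internal branch length of the unrooted species tree in coalescent units. The function name and topology labels are hypothetical.

```python
import math


def quartet_probabilities(internal_length: float) -> dict:
    """Unrooted gene tree topology probabilities for the species quartet ab|cd.

    `internal_length` is the internal edge of the unrooted species tree in
    coalescent units (assumed parameterisation, one gene sampled per species).
    """
    if internal_length < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-internal_length) / 3.0
    return {
        "ab|cd": 1.0 - 2.0 * discordant,  # matches the species tree
        "ac|bd": discordant,
        "ad|bc": discordant,
    }


# For a balanced rooted species tree ((a,b):x,(c,d):y), the two internal
# branches merge into one internal edge of length x + y once the tree is
# unrooted, so different (x, y) pairs with the same sum look identical.
same_sum_1 = quartet_probabilities(0.3 + 0.7)
same_sum_2 = quartet_probabilities(0.5 + 0.5)
```

Under this assumption the topology distribution depends on the species tree only through t, which is consistent with the abstract's point that, for four species, neither the root position nor the individual rooted branch lengths can be recovered from unrooted topology frequencies alone.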

16.
Although Pleistocene glaciations had a major impact on the population genetic patterns of many species in North America and Europe, it remains unclear how these climatic fluctuations contributed to species diversification in East Asia. One reason for this is the difficulty of distinguishing genetic admixture following secondary contact from incomplete lineage sorting, both of which can generate similar patterns of genetic variation. Using a combination of multilocus analyses and coalescent simulation, we explore how these two processes occurred in the Pleistocene evolutionary history of a widespread East Asian bird, the Vinous‐throated parrotbill, Paradoxornis webbianus. A maximum likelihood (ML) tree identified two major mitochondrial lineages, which are geographically separated in most parts of the species' range but are sympatric at a few sampling sites. An NJ tree and Structure analysis of the microsatellite data set revealed an extensive level of admixture and little population structure, suggesting recent admixture between two formerly separated groups. Networks from nuclear DNA data sets, however, did not indicate any geographically isolated groups but rather a panmictic population, thus supporting incomplete lineage sorting. By using coalescent simulation approaches, we show that both processes did occur, although at different temporal scales. During the Pleistocene glaciations, probably around 0.1–0.5 Ma (the Marine Isotope Stage 6, MIS6), P. webbianus contracted into two separate refugia, and subsequently accumulated genetic divergence. During the interglacial MIS5, the species expanded into previously glaciated areas allowing the once separated groups to come into contact and become admixed. Taken together, our results indicate that the current genetic variation within P. webbianus reflects a combination of widespread distribution in the pre‐Pleistocene, contraction and fragmentation into separate refugia during glacial advance, and recent postglacial expansion and admixture.

17.
The relationship between speciation times and the corresponding times of gene divergence is of interest in phylogenetic inference as a means of understanding the past evolutionary dynamics of populations and of estimating the timing of speciation events. It has long been recognized that gene divergence times might substantially pre-date speciation events. Although the distribution of the difference between these has previously been studied for the case of two populations, this distribution has not been explicitly computed for larger species phylogenies. Here we derive a simple method for computing this distribution for trees of arbitrary size. A two-stage procedure is proposed which (i) considers the probability distribution of the time from the speciation event at the root of the species tree to the gene coalescent time conditionally on the number of gene lineages available at the root; and (ii) calculates the probability mass function for the number of gene lineages at the root. This two-stage approach dramatically simplifies numerical analysis, because in the first step the conditional distribution does not depend on an underlying species tree, while in the second step the pattern of gene coalescence prior to the species tree root is irrelevant. In addition, the algorithm provides intuition concerning the properties of the distribution with respect to the various features of the underlying species tree. The methodology is complemented by developing probabilistic formulae and software, written in R. The method and software are tested on five-taxon species trees with varying levels of symmetry. The examples demonstrate that more symmetric species trees tend to have larger mean coalescent times and are more likely to have a unimodal gamma-like distribution with a long right tail, while asymmetric trees tend to have smaller mean coalescent times with an exponential-like distribution. 
In addition, species trees with longer branches generally have shorter mean coalescent times, with branches closest to the root of the tree being most influential.

18.
Genealogical discordance, or when different genes tell distinct stories although they evolved under a shared history, often emerges from either coalescent stochasticity or introgression. In this study, we present a strong case of mito‐nuclear genealogical discordance in the Australian rainforest lizard species complex of Saproscincus basiliscus and S. lewisi. One of the lineages that comprises this complex, the Southern S. basiliscus lineage, is deeply divergent at the mitochondrial genome but shows markedly less divergence at the nuclear genome. By placing our results in a comparative context and reconstructing the lineages' demography via multilocus and coalescent‐based approximate Bayesian computation methods, we test hypotheses for how coalescent variance and introgression contribute to this pattern. These analyses suggest that the observed genealogical discordance likely results from introgression. Further, to generate such strong discordance, introgression probably acted in concert with other factors promoting asymmetric gene flow between the mitochondrial and nuclear genomes, such as selection or sex‐biased dispersal. This study offers a framework for testing sources of genealogical discordance and suggests that historical introgression can be an important force shaping the genetic diversity of species and their populations.

19.
The multispecies coalescent (MSC) is a statistical framework that models how gene genealogies grow within the branches of a species tree. The field of computational phylogenetics has witnessed an explosion in the development of methods for species tree inference under MSC, owing mainly to the accumulating evidence of incomplete lineage sorting in phylogenomic analyses. However, the evolutionary history of a set of genomes, or species, could be reticulate due to the occurrence of evolutionary processes such as hybridization or horizontal gene transfer. We report on a novel method for Bayesian inference of genome and species phylogenies under the multispecies network coalescent (MSNC). This framework models gene evolution within the branches of a phylogenetic network, thus incorporating reticulate evolutionary processes, such as hybridization, in addition to incomplete lineage sorting. As phylogenetic networks with different numbers of reticulation events correspond to points of different dimensions in the space of models, we devise a reversible-jump Markov chain Monte Carlo (RJMCMC) technique for sampling the posterior distribution of phylogenetic networks under MSNC. We implemented the methods in the publicly available, open-source software package PhyloNet and studied their performance on simulated and biological data. The work extends the reach of Bayesian inference to phylogenetic networks and enables new evolutionary analyses that account for reticulation.

20.