20 similar articles found (search time: 0 ms)
1.
Wang Y Rannala B 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2008,363(1512):3921-3930
Recently, several statistical methods for estimating fine-scale recombination rates using population samples have been developed. However, currently available methods that can be applied to large-scale data are limited to approximated likelihoods. Here, we developed a full-likelihood Markov chain Monte Carlo method for estimating recombination rate under a Bayesian framework. Genealogies underlying a sampling of chromosomes are effectively modelled by using marginal individual single nucleotide polymorphism genealogies related through an ancestral recombination graph. The method is compared with two existing composite-likelihood methods using simulated data. Simulation studies show that our method performs well for different simulation scenarios. The method is applied to two human population genetic variation datasets that have been studied by sperm typing. Our results are consistent with the estimates from sperm crossover analysis.
2.
Zhang Y 《Bioinformatics (Oxford, England)》2008,24(7):965-971
Motivation: Inferring population structures using genetic data sampled from a group of individuals is a challenging task. Many methods either consider a fixed population number or ignore the correlation between populations. As a result, they can lose sensitivity and specificity in detecting subtle stratifications. In addition, when a large number of genetic markers are used, many existing algorithms perform rather inefficiently. Result: We propose a new Bayesian method to infer population structures using multiple unlinked single nucleotide polymorphisms (SNPs). Our approach explicitly considers the population correlation through a tree hierarchy, and treats the population number as a random variable. Using both simulated and real datasets of worldwide samples, we demonstrate that an incorporated tree can consistently improve the power in detecting subtle population stratifications. A tree-based model often involves a large number of unknown parameters, and the corresponding estimation procedure can be highly inefficient. We further implement a partition method to analytically integrate out all nuisance parameters in the tree. As a result, our method can analyze large SNP datasets with significantly improved convergence rate. Availability: http://www.stat.psu.edu/~yuzhang/tips.tar Contact: yuzhang{at}stat.psu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Keith Crandall
3.
Pekka Marttinen Adam Baldwin William P Hanage Chris Dowson Eshwar Mahenthiralingam Jukka Corander 《BMC bioinformatics》2008,9(1):421
Background
We consider the discovery of recombinant segments jointly with their origins within multilocus DNA sequences from bacteria representing heterogeneous populations of fairly closely related species. The currently available methods for recombination detection capable of probabilistic characterization of uncertainty have a limited applicability in practice as the number of strains in a data set increases.
4.
Finlay EK Gaillard C Vahidi SM Mirhoseini SZ Jianlin H Qi XB El-Barody MA Baird JF Healy BC Bradley DG 《Biology letters》2007,3(4):449-452
The past population dynamics of four domestic and one wild species of bovine were estimated using Bayesian skyline plots, a coalescent Markov chain Monte Carlo method that does not require an assumed parametric model of demographic history. Four domestic species share a recent rapid population expansion not visible in the wild African buffalo (Syncerus caffer). The estimated timings of the expansions are consistent with the archaeological records of domestication.
5.
Comparison of Bayesian and maximum-likelihood inference of population genetic parameters (cited by: 9 total; self-citations: 0; citations by others: 9)
Beerli P 《Bioinformatics (Oxford, England)》2006,22(3):341-345
Comparison of the performance and accuracy of different inference methods, such as maximum likelihood (ML) and Bayesian inference, is difficult because the inference methods are implemented in different programs, often written by different authors. Both methods were implemented in the program MIGRATE, which estimates population genetic parameters, such as population sizes and migration rates, using coalescence theory. Both inference methods use the same Markov chain Monte Carlo algorithm and differ from each other in only two aspects: parameter proposal distribution and maximization of the likelihood function. Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance. MOTIVATION: The Markov chain Monte Carlo-based ML framework can fail on sparse data and can deliver non-conservative support intervals. A Bayesian framework with appropriate prior distribution is able to remedy some of these problems. RESULTS: The program MIGRATE was extended to allow not only for maximum likelihood (ML) estimation of population genetics parameters but also for using a Bayesian framework. Comparisons between the Bayesian approach and the ML approach are facilitated because both modes estimate the same parameters under the same population model and assumptions.
6.
Meiotic recombination is a fundamental cellular mechanism in sexually reproducing organisms, and its different forms, crossing over and gene conversion, both play an important role in shaping genetic variation in populations. Here, we describe a coalescent-based full-likelihood Markov chain Monte Carlo (MCMC) method for jointly estimating the crossing-over, gene-conversion, and mean tract length parameters from population genomic data under a Bayesian framework. Although computationally more expensive than methods that use approximate likelihoods, the relative efficiency of our method is expected to be optimal in theory. Furthermore, it is also possible to obtain a posterior sample of genealogies for the data using this method. We first check the performance of the new method on simulated data and verify its correctness. We also extend the method for inference under models with variable gene-conversion and crossing-over rates and demonstrate its ability to identify recombination hotspots. Then, we apply the method to two empirical data sets that were sequenced in the telomeric regions of the X chromosome of Drosophila melanogaster. Our results indicate that gene conversion occurs more frequently than crossing over in the su-w and su-s gene sequences while the local rates of crossing over as inferred by our program are not low. The mean tract lengths for gene-conversion events are estimated to be ~70 bp and 430 bp, respectively, for these data sets. Finally, we discuss ideas and optimizations for reducing the execution time of our algorithm.
7.
Over the past decades, the use of molecular markers has revolutionized biology and led to the foundation of a new research discipline, phylogeography. Of particular interest has been the inference of population structure and biogeography. While initial studies focused on mtDNA as a molecular marker, it has become apparent that selection and genealogical lineage sorting could lead to erroneous inferences. As it is not clear to what extent these forces affect a given marker, it has become common practice to use the combined evidence from a set of molecular markers as an attempt to recover the signals that approximate the true underlying demography. Typically, the number of markers used is determined by either budget constraints or by statistical power required to recognize significant population differentiation. Using microsatellite markers from Drosophila and humans, we show that even large numbers of loci (>50) can frequently result in statistically well-supported, but incorrect inference of population structure using the software BAPS. Most importantly, genomic features, such as chromosomal location, variability of the markers, or recombination rate, cannot explain this observation. Instead, it can be attributed to sampling variation among loci with different realizations of the stochastic lineage sorting. This phenomenon is particularly pronounced for low levels of population differentiation. Our results have important implications for ongoing studies of population differentiation, as we unambiguously demonstrate that statistical significance of population structure inferred from a random set of genetic markers cannot necessarily be taken as evidence for a reliable demographic inference.
8.
9.
Bayesian coalescent inference of past population dynamics from molecular sequences (cited by: 31 total; self-citations: 0; citations by others: 31)
We introduce the Bayesian skyline plot, a new method for estimating past population dynamics through time from a sample of molecular sequences without dependence on a prespecified parametric model of demographic history. We describe a Markov chain Monte Carlo sampling procedure that efficiently samples a variant of the generalized skyline plot, given sequence data, and combines these plots to generate a posterior distribution of effective population size through time. We apply the Bayesian skyline plot to simulated data sets and show that it correctly reconstructs demographic history under canonical scenarios. Finally, we compare the Bayesian skyline plot model to previous coalescent approaches by analyzing two real data sets (hepatitis C virus in Egypt and mitochondrial DNA of Beringian bison) that have been previously investigated using alternative coalescent methods. In the bison analysis, we detect a severe but previously unrecognized bottleneck, estimated to have occurred 10,000 radiocarbon years ago, which coincides with both the earliest undisputed record of large numbers of humans in Alaska and the megafaunal extinctions in North America at the beginning of the Holocene.
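As a rough illustration of the skyline idea this abstract builds on, the sketch below computes a classic (non-Bayesian) skyline plot from a single fixed genealogy. It is not the Bayesian skyline procedure described above, which samples genealogies and grouped intervals by MCMC. The function name `classic_skyline` and its input format are hypothetical, and the estimate assumes contemporaneous samples and the standard coalescent, where an interval with k lineages has expected length 2N / (k(k-1)).

```python
def classic_skyline(coalescent_times):
    """Classic skyline estimate from one genealogy (illustrative sketch only).

    coalescent_times: increasing times, measured back from the present
    (time 0), of the n - 1 coalescent events among n contemporaneous samples.
    Returns a list of (interval_start, interval_end, size_estimate) tuples.
    """
    n = len(coalescent_times) + 1  # number of sampled sequences
    estimates = []
    previous = 0.0
    for index, time in enumerate(coalescent_times):
        if time < previous:
            raise ValueError("coalescent times must be non-decreasing")
        k = n - index  # lineages present during this interval
        # Interval length scaled by the number of lineage pairs, k(k-1)/2.
        estimates.append((previous, time, (time - previous) * k * (k - 1) / 2))
        previous = time
    return estimates


if __name__ == "__main__":
    # Intervals set to their expected lengths for a constant size of 100.
    for start, end, size in classic_skyline([100 / 6, 50.0, 150.0]):
        print(f"{start:8.2f} to {end:8.2f}: estimated size {size:.1f}")
```

Each interval yields one piecewise-constant estimate, which is why a single genealogy gives a noisy staircase; averaging over many sampled genealogies is what the Bayesian version adds.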
10.
Bamshad MJ Wooding S Watkins WS Ostler CT Batzer MA Jorde LB 《American journal of human genetics》2003,72(3):578-589
A major goal of biomedical research is to develop the capability to provide highly personalized health care. To do so, it is necessary to understand the distribution of interindividual genetic variation at loci underlying physical characteristics, disease susceptibility, and response to treatment. Variation at these loci commonly exhibits geographic structuring and may contribute to phenotypic differences between groups. Thus, in some situations, it may be important to consider these groups separately. Membership in these groups is commonly inferred by use of a proxy such as place-of-origin or ethnic affiliation. These inferences are frequently weakened, however, by use of surrogates, such as skin color, for these proxies, the distribution of which bears little resemblance to the distribution of neutral genetic variation. Consequently, it has become increasingly controversial whether proxies are sufficient and accurate representations of groups inferred from neutral genetic variation. This raises three questions: how many data are required to identify population structure at a meaningful level of resolution, to what level can population structure be resolved, and do some proxies represent population structure accurately? We assayed 100 Alu insertion polymorphisms in a heterogeneous collection of approximately 565 individuals, approximately 200 of whom were also typed for 60 microsatellites. Stripped of identifying information, correct assignment to the continent of origin (Africa, Asia, or Europe) with a mean accuracy of at least 90% required a minimum of 60 Alu markers or microsatellites and reached 99%-100% when ≥100 loci were used. Less accurate assignment (87%) to the appropriate genetic cluster was possible for a historically admixed sample from southern India. These results set a minimum for the number of markers that must be tested to make strong inferences about detecting population structure among Old World populations under ideal experimental conditions. We note that, whereas some proxies correspond crudely, if at all, to population structure, the heuristic value of others is much higher. This suggests that a more flexible framework is needed for making inferences about population structure and the utility of proxies.
11.
Polytomies and Bayesian phylogenetic inference (cited by: 16 total; self-citations: 0; citations by others: 16)
Bayesian phylogenetic analyses are now very popular in systematics and molecular evolution because they allow the use of much more realistic models than currently possible with maximum likelihood methods. There are, however, a growing number of examples in which large Bayesian posterior clade probabilities are associated with very short branch lengths and low values for non-Bayesian measures of support such as nonparametric bootstrapping. For the four-taxon case when the true tree is the star phylogeny, Bayesian analyses become increasingly unpredictable in their preference for one of the three possible resolved tree topologies as data set size increases. This leads to the prediction that hard (or near-hard) polytomies in nature will cause unpredictable behavior in Bayesian analyses, with arbitrary resolutions of the polytomy receiving very high posterior probabilities in some cases. We present a simple solution to this problem involving a reversible-jump Markov chain Monte Carlo (MCMC) algorithm that allows exploration of all of tree space, including unresolved tree topologies with one or more polytomies. The reversible-jump MCMC approach allows prior distributions to place some weight on less-resolved tree topologies, which eliminates misleadingly high posteriors associated with arbitrary resolutions of hard polytomies. Fortunately, assigning some prior probability to polytomous tree topologies does not appear to come with a significant cost in terms of the ability to assess the level of support for edges that do exist in the true tree. Methods are discussed for applying arbitrary prior distributions to tree topologies of varying resolution, and an empirical example showing evidence of polytomies is analyzed and discussed.
12.
MRBAYES: Bayesian inference of phylogenetic trees (cited by: 108 total; self-citations: 0; citations by others: 108)
SUMMARY: The program MRBAYES performs Bayesian inference of phylogeny using a variant of Markov chain Monte Carlo. AVAILABILITY: MRBAYES, including the source code, documentation, sample data files, and an executable, is available at http://brahms.biology.rochester.edu/software.html.
13.
A Markov chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data (cited by: 1 total; self-citations: 1; citations by others: 1)
Nonrandom mating induces correlations in allelic states within and among loci that can be exploited to understand the genetic structure of natural populations (Wright 1965). For many species, it is of considerable interest to quantify the contribution of two forms of nonrandom mating to patterns of standing genetic variation: inbreeding (mating among relatives) and population substructure (limited dispersal of gametes). Here, we extend the popular Bayesian clustering approach STRUCTURE (Pritchard et al. 2000) for simultaneous inference of inbreeding or selfing rates and population-of-origin classification using multilocus genetic markers. This is accomplished by eliminating the assumption of Hardy-Weinberg equilibrium within clusters and, instead, calculating expected genotype frequencies on the basis of inbreeding or selfing rates. We demonstrate the need for such an extension by showing that selfing leads to spurious signals of population substructure using the standard STRUCTURE algorithm with a bias toward spurious signals of admixture. We gauge the performance of our method using extensive coalescent simulations and demonstrate that our approach can correct for this bias. We also apply our approach to understanding the population structure of the wild relative of domesticated rice, Oryza rufipogon, an important partially selfing grass species. Using a sample of n = 16 individuals sequenced at 111 random loci, we find strong evidence for existence of two subpopulations, which correlates well with geographic location of sampling, and estimate selfing rates for both groups that are consistent with estimates from experimental data (s ≈ 0.48-0.70).
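The replacement of Hardy-Weinberg proportions by inbreeding-adjusted genotype frequencies can be illustrated with the textbook formulas for a biallelic locus. The sketch below is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, and it uses the standard equilibrium relation F = s / (2 - s) between selfing rate s and inbreeding coefficient F.

```python
def selfing_to_inbreeding(s):
    """Equilibrium inbreeding coefficient F = s / (2 - s) for selfing rate s."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    return s / (2.0 - s)


def genotype_frequencies(p, f):
    """Expected genotype frequencies at a biallelic locus.

    p: frequency of allele A; f: inbreeding coefficient.
    With f = 0 these reduce to Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    return {
        "AA": p * p + f * p * q,      # excess homozygotes under inbreeding
        "Aa": 2.0 * p * q * (1.0 - f),  # heterozygote deficit
        "aa": q * q + f * p * q,
    }


if __name__ == "__main__":
    for s in (0.0, 0.48, 0.70):
        f = selfing_to_inbreeding(s)
        print(s, round(f, 3), genotype_frequencies(0.3, f))
```

The heterozygote deficit grows with s, which is the signal a clustering model assuming Hardy-Weinberg equilibrium can misread as substructure or admixture.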
14.
SUMMARY: BAli-Phy is a Bayesian posterior sampler that employs Markov chain Monte Carlo to explore the joint space of alignment and phylogeny given molecular sequence data. Simultaneous estimation eliminates bias toward inaccurate alignment guide-trees, employs more sophisticated substitution models during alignment and automatically utilizes information in shared insertion/deletions to help infer phylogenies. AVAILABILITY: Software is available for download at http://www.biomath.ucla.edu/msuchard/bali-phy.
15.
Ronquist F 《Trends in ecology & evolution》2004,19(9):475-481
Much recent progress in evolutionary biology is based on the inference of ancestral states and past transformations in important traits on phylogenetic trees. These exercises often assume that the tree is known without error and that ancestral states and character change can be mapped onto it exactly. In reality, there is often considerable uncertainty about both the tree and the character mapping. Recently introduced Bayesian statistical methods enable the study of character evolution while simultaneously accounting for both phylogenetic and mapping uncertainty, adding much needed credibility to the reconstruction of evolutionary history.
16.
Single-nucleotide polymorphisms (SNPs) are a class of attractive genetic markers for population genetic studies and for identifying genetic variations underlying complex traits. However, the usefulness and efficiency of SNPs in comparison to microsatellites in different scientific contexts, e.g., population structure inference or association analysis, still must be systematically evaluated through large empirical studies. In this article, we use the Collaborative Studies on Genetics of Alcoholism (COGA) data from Genetic Analysis Workshop 14 (GAW14) to compare the performance of microsatellites and SNPs in the whole human genome in the context of population structure inference. A total of 328 microsatellites and 15,840 SNPs are used to infer population structure in 236 unrelated individuals. We find that, on average, the informativeness of random microsatellites is four to twelve times that of random SNPs for various population comparisons, which is consistent with previous studies. Our results also indicate that for the combined set of microsatellites and SNPs, SNPs constitute the majority among the most informative markers and the use of these SNPs leads to better inference of population structure than the use of microsatellites. We also find that the inclusion of less informative markers may add noise and worsen the results.
17.
Marttinen P Hanage WP Croucher NJ Connor TR Harris SR Bentley SD Corander J 《Nucleic acids research》2012,40(1):e6
Analysis of important human pathogen populations is currently under transition toward whole-genome sequencing of growing numbers of samples collected on a global scale. Since recombination in bacteria is often an important factor shaping their evolution by enabling resistance elements and virulence traits to rapidly transfer from one evolutionary lineage to another, it is highly beneficial to have access to tools that can detect recombination events. Multiple advanced statistical methods exist for such purposes; however, they are typically limited either to only a few samples or to data from relatively short regions of a total genome. By harnessing the power of recent advances in Bayesian modeling techniques, we introduce here a method for detecting homologous recombination events from whole-genome sequence data for bacterial population samples on a large scale. Our statistical approach can efficiently handle hundreds of whole genome sequenced population samples and identify separate origins of the recombinant sequence, offering an enhanced insight into the diversification of bacterial clones at the level of the whole genome. A data set of 241 whole genome sequences from an important pandemic lineage of Streptococcus pneumoniae is used together with multiple simulated data sets to demonstrate the potential of our approach.
18.
Most phylogenetic tree estimation methods assume that there is a single set of hierarchical relationships among sequences in a data set for all sites along an alignment. Mosaic sequences produced by past recombination events will violate this assumption and may lead to misleading results from a phylogenetic analysis due to the imposition of a single tree along the entire alignment. Therefore, the detection of past recombination is an important first step in an analysis. A Bayesian model for the changes in topology caused by recombination events is described here. This model relaxes the assumption of one topology for all sites in an alignment and uses the theory of Hidden Markov models to facilitate calculations, the hidden states being the underlying topologies at each site in the data set. Changes in topology along the multiple sequence alignment are estimated by means of the maximum a posteriori (MAP) estimate. The performance of the MAP estimate is assessed by application of the model to data sets of four sequences, both simulated and real.
19.
High-throughput genotyping and sequencing technologies can generate dense sets of genetic markers for large numbers of individuals. For most species, these data will contain many markers in linkage disequilibrium (LD). To utilize such data for population structure inference, we investigate the use of haplotypes constructed by combining the alleles at single-nucleotide polymorphisms (SNPs). We introduce a statistic derived from information theory, the gain of informativeness for assignment (GIA), which quantifies the additional information for assigning individuals to populations using haplotype data compared to using individual loci separately. Using a two-loci-two-allele model, we demonstrate that combining markers in linkage equilibrium into haplotypes always leads to nonpositive GIA, suggesting that combining the two markers is not advantageous for ancestry inference. However, for loci in LD, GIA is often positive, suggesting that assignment can be improved by combining markers into haplotypes. Using GIA as a criterion for combining markers into haplotypes, we demonstrate for simulated data a significant improvement of assigning individuals to candidate populations. For the many cases that we investigate, incorrect assignment was reduced between 26% and 97% using haplotype data. For empirical data from French and German individuals, the incorrectly assigned individuals can, for example, be decreased by 73% using haplotypes. Our results can be useful for challenging population structure and assignment problems, in particular for studies where large-scale population-genomic data are available.
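The two-loci-two-allele result can be checked numerically with a small sketch. The abstract does not give the formula, so the code assumes that informativeness for assignment is the mutual information between population label and allele (equal population weights, natural logarithm) and that GIA is the haplotype informativeness minus the sum of the two single-locus values. All function names and the `(pA, pB, D)` parameterization are hypothetical.

```python
import math


def informativeness(freqs_by_pop):
    """Mutual information between population label and allele.

    freqs_by_pop: one dict per population mapping allele -> frequency.
    Populations are weighted equally (an assumption of this sketch).
    """
    k = len(freqs_by_pop)
    alleles = set().union(*freqs_by_pop)
    total = 0.0
    for allele in alleles:
        mean = sum(f.get(allele, 0.0) for f in freqs_by_pop) / k
        if mean > 0.0:
            total -= mean * math.log(mean)
        for f in freqs_by_pop:
            p = f.get(allele, 0.0)
            if p > 0.0:
                total += p * math.log(p) / k
    return total


def haplotype_freqs(p_a, p_b, d):
    """Two-locus haplotype frequencies from allele frequencies and LD coefficient d."""
    return {
        "AB": p_a * p_b + d,
        "Ab": p_a * (1 - p_b) - d,
        "aB": (1 - p_a) * p_b - d,
        "ab": (1 - p_a) * (1 - p_b) + d,
    }


def gia(pops):
    """Assumed GIA: haplotype informativeness minus the two single-locus values.

    pops: list of (p_a, p_b, d) tuples, one per population.
    """
    hap = informativeness([haplotype_freqs(*pop) for pop in pops])
    locus1 = informativeness([{"A": pa, "a": 1 - pa} for pa, _, _ in pops])
    locus2 = informativeness([{"B": pb, "b": 1 - pb} for _, pb, _ in pops])
    return hap - locus1 - locus2


if __name__ == "__main__":
    print("linkage equilibrium:", gia([(0.2, 0.7, 0.0), (0.6, 0.3, 0.0)]))
    print("opposite-sign LD:   ", gia([(0.5, 0.5, 0.2), (0.5, 0.5, -0.2)]))
```

In the second case the allele frequencies are identical in both populations, so each locus alone carries no information, while the haplotypes still separate the populations through the sign of the LD.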
20.
Although growing numbers of single nucleotide polymorphisms (SNPs) and microsatellites (short tandem repeat polymorphisms or STRPs) are used to infer population structure, their relative properties in this context remain poorly understood. SNPs and STRPs mutate differently, suggesting multi-locus genotypes at these loci might differ in ability to detect population structure. Here, we use coalescent simulations to measure the power of sets of SNPs and STRPs to identify population structure. To maximize the applicability of our results to empirical studies, we focus on the popular STRUCTURE analysis and evaluate the role of several biological and practical factors in the detection of population structure. We find that: (1) fewer unlinked STRPs than SNPs are needed to detect structure at recent divergence times <0.3 Ne generations; (2) accurate estimation of the number of populations requires many fewer STRPs than SNPs; (3) for both marker types, declines in power due to modest gene flow (Nem=1.0) are largely negated by increasing marker number; (4) variation in the STRP mutational model affects power modestly; (5) SNP haplotypes (θ=1, no recombination) provide power comparable with STRP loci (θ=10); (6) ascertainment schemes that select highly variable STRP or SNP loci increase power to detect structure, though ascertained data may not be suitable to other inference; and (7) when samples are drawn from an admixed population and one of its parent populations, the reduction in power to detect two populations is greater for STRPs than SNPs. These results should assist the design of multi-locus studies to detect population structure in nature.