Searching for Footprints of Positive Selection in Whole-Genome SNP Data From Nonequilibrium Populations
Authors: Pavlos Pavlidis, Jeffrey D. Jensen, Wolfgang Stephan
Affiliation: Department of Biology II, Ludwig-Maximilians-University Munich, 82152 Planegg, Germany, and Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, Massachusetts
Abstract: A major goal of population genomics is to reconstruct the history of natural populations and to infer the neutral and selective scenarios that can explain the present-day polymorphism patterns. However, the separation between neutral and selective hypotheses has proven hard, mainly because both may predict similar patterns in the genome. This study focuses on the development of methods that can be used to distinguish neutral from selective hypotheses in equilibrium and nonequilibrium populations. These methods utilize a combination of statistics on the basis of the site frequency spectrum (SFS) and linkage disequilibrium (LD). We investigate the patterns of genetic variation along recombining chromosomes using a multitude of comparisons between neutral and selective hypotheses, such as selection or neutrality in equilibrium and nonequilibrium populations and recurrent selection models. We perform hypothesis testing using the classical P-value approach, but we also introduce methods from the machine-learning field. We demonstrate that the combination of SFS- and LD-based statistics increases the power to detect recent positive selection in populations that have experienced past demographic changes.

GENOMES contain information related to the history of natural populations. Past neutral and selective processes may have left footprints in the genome. Recent advances in population genetics aim to understand the patterns of genetic diversity and identify events that have led to genetic adaptations. Among them, positive selection has been a focus of many recent studies (Harr et al. 2002; Kim and Stephan 2002; Glinka et al. 2003; Akey et al. 2004; Orengo and Aguadé 2004). Their goal is to (i) provide evidence of positive selection, (ii) estimate the strength and the rate of selection, and (iii) localize the targets of selection.
These objectives form the basis of a long-term pursuit, which is the understanding of the molecular basis of adaptation of populations in a changing environment.

Positive selection can cause genetic hitchhiking when a beneficial mutation spreads in the population (Maynard Smith and Haigh 1974). When a strongly beneficial mutation occurs and spreads in a population, linked neutral or slightly deleterious variants hitchhike with it, and their frequency increases. According to Maynard Smith and Haigh's model, three patterns are generated locally around the position of the beneficial mutation. First, the level of variability will be reduced since standing variation of the population that is not linked to the beneficial allele vanishes, and tightly linked polymorphisms may fix (Kaplan et al. 1989; Stephan et al. 1992). Second, the site frequency spectrum (SFS), which describes the frequency of allelic variants, shifts from its neutral expectation toward rare and high-frequency derived variants (Braverman et al. 1995; Fay and Wu 2000). The third signature describes the emergence of specific linkage disequilibrium (LD) patterns around the target of positive selection, such as an elevated level of LD in the early phase of the fixation process of the beneficial mutation and a decay of LD across the selected site at the end of the selective phase (Kim and Nielsen 2004; Stephan et al. 2006).

The availability of genome-wide SNP data has made possible the scanning of genomes and the identification of loci that may have been targets of recent selective events. Several approaches have been developed in recent years that can detect the molecular signatures of positive selection (Kim and Stephan 2002; Jensen et al. 2005; Nielsen et al. 2005). While the methods of Kim and Stephan (2002) and Jensen et al. (2005) are designed to analyze subgenomic SNP data, the approach of Nielsen et al. (2005) can be applied to both subgenomic and whole-genome data (reviewed in Pavlidis et al. 2008). For this reason we concentrate here on the latter procedure. This method, called SweepFinder, calculates the probability P(x) that a polymorphism of multiplicity x is linked to a beneficial mutation using a simple selective model and the SFS prior to the selective event. Then, for each location in the genome it compares a selective with a neutral model assuming independence between the SNPs, therefore calculating the composite likelihood ratio Λ. Thus, it identifies regions where the likelihood of the selective sweep is greater than that of the neutral model using the maximum value Λ_MAX of Λ.

The ω-statistic, developed by Kim and Nielsen (2004), detects specific LD patterns caused by genetic hitchhiking (described above). In the study by Kim and Nielsen (2004) the maximum value of the ω-statistic was used to identify the targets of selective sweeps. Later, Jensen et al. (2007) studied its performance in separating demographic from selective scenarios. An important result by Jensen et al. (2007) is the demonstration that for demographic parameters relevant to nonequilibrium populations (such as the cosmopolitan populations of Drosophila melanogaster) the ω-statistic can distinguish between neutral and selective scenarios.

This article further develops SweepFinder and the ω-statistic such that they can eventually be applied to whole-genome SNP data sets that have been collected from nonequilibrium populations. In particular, populations undergoing population-size bottlenecks are of interest as these size changes may confound the patterns of selective sweeps (Barton 1998). For this reason we use the following approach: first, we theoretically analyze the genealogies of bottlenecked populations under neutrality and show to what extent they resemble the genealogies of single hitchhiking (SHH) events. We also point out the importance of high-frequency derived variants in the identification of selective sweeps.
Second, we study the statistical properties of SweepFinder and the ω-statistic separately and in combination. As the main result, we demonstrate that the combination of these two methods (which together include both SFS and LD information) increases the power for detecting recent SHH events in nonequilibrium populations, in particular when machine-learning techniques are employed. Third, we analyze the performance of SweepFinder and the ω-statistic in the detection of recurrent hitchhiking (RHH) events.
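
To make the two ingredients concrete, the sketch below computes an unfolded SFS and a maximum ω value from a small 0/1 haplotype matrix. It is an illustration only, not the authors' implementation and not SweepFinder. The ω formula is not spelled out in the text above; the code assumes the commonly cited form from Kim and Nielsen (2004), in which the SNPs are split into a left and a right group, ω is the mean r² within the two groups divided by the mean r² between them, and the split is chosen to maximize ω. The function names (`unfolded_sfs`, `r_squared`, `omega_max`) and the input layout are hypothetical, and no filtering of low-frequency sites is applied.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# Assumed input layout: rows are sampled chromosomes, columns are SNPs,
# 0 = ancestral allele, 1 = derived allele.
Haplotypes = Sequence[Sequence[int]]


def unfolded_sfs(haps: Haplotypes) -> List[int]:
    """counts[k-1] = number of SNPs whose derived allele is carried by k chromosomes, k = 1..n-1."""
    n = len(haps)
    counts = [0] * (n + 1)
    for site in zip(*haps):      # iterate over columns (SNPs)
        counts[sum(site)] += 1
    return counts[1:n]           # drop the monomorphic classes 0 and n


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation between the derived alleles at two SNPs."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:               # one of the sites is monomorphic
        return 0.0
    d = pab - pa * pb
    return d * d / denom


def omega_max(haps: Haplotypes) -> Tuple[float, int]:
    """Return (largest omega, number of SNPs in the left group at that split)."""
    sites = list(zip(*haps))
    s = len(sites)
    r2 = {(i, j): r_squared(sites[i], sites[j])
          for i, j in combinations(range(s), 2)}
    best = (0.0, 0)
    for l in range(2, s - 1):    # keep at least two SNPs on each side
        within = sum(v for (i, j), v in r2.items() if j < l or i >= l)
        between = sum(v for (i, j), v in r2.items() if i < l <= j)
        if between == 0:         # omega is undefined for this split
            continue
        n_within = l * (l - 1) // 2 + (s - l) * (s - l - 1) // 2
        n_between = l * (s - l)
        omega = (within / n_within) / (between / n_between)
        if omega > best[0]:
            best = (omega, l)
    return best
```

A real scan would slide such a computation along the chromosome and compare the observed maximum against values simulated under a neutral demographic model. The quadratic pair enumeration here is chosen for readability, not speed.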