Similar literature
1.
We propose a multilocus version of FST and a measure of haplotype diversity using localized haplotype clusters. Specifically, we use haplotype clusters identified with BEAGLE, which is a program implementing a hidden Markov model for localized haplotype clustering and performing several functions including inference of haplotype phase. We apply this methodology to HapMap phase 3 data. With this haplotype-cluster approach, African populations have highest diversity and lowest divergence from the ancestral population, East Asian populations have lowest diversity and highest divergence, and other populations (European, Indian, and Mexican) have intermediate levels of diversity and divergence. These relationships accord with expectation based on other studies and accepted models of human history. In contrast, the population-specific FST estimates obtained directly from single-nucleotide polymorphisms (SNPs) do not reflect such expected relationships. We show that ascertainment bias of SNPs has less impact on the proposed haplotype-cluster-based FST than on the SNP-based version, which provides a potential explanation for these results. Thus, these new measures of FST and haplotype-cluster diversity provide an important new tool for population genetic analysis of high-density SNP data.
Genome-wide data sets from worldwide panels of individuals provide an outstanding opportunity to investigate the genetic structure of human populations (Conrad et al. 2006; International HapMap Consortium 2007; Jakobsson et al. 2008; Auton et al. 2009). Populations around the globe form a continuum rather than discrete units (Serre and Paabo 2004; Weiss and Long 2009). However, notions of discrete populations can be appropriate when, for example, ancestral populations were separated by geographic distance or barriers such that little gene flow occurred.
FST (Wright 1951; Weir and Cockerham 1984; Holsinger and Weir 2009) is a measure of population divergence.
It measures variation between populations vs. within populations. One can calculate a global measure, assuming that all populations are equally diverged from an ancestral population, or one can calculate FST for specific populations or for pairs of populations while utilizing data from all populations (Weir and Hill 2002). One use of FST is to test for signatures of selection (reviewed in Oleksyk et al. 2010).
FST may be calculated for single genetic markers. For multiallelic markers, such as microsatellites, this is useful, but single-nucleotide polymorphisms (SNPs) contain much less information when taken one at a time, and thus it is advantageous to calculate averages over windows of markers (Weir et al. 2005) or even over the whole genome. The advantage of windowed FST is that it can be used to find regions of the genome that show different patterns of divergence, indicative of selective forces at work during human history.
Another measure of human evolutionary history is haplotype diversity. Haplotype diversity may be measured using a count of the number of observed haplotypes in a region or by the expected haplotype heterozygosity based on haplotype frequencies in a region. Application of this regional measure to chromosomal data can be achieved by a haplotype block strategy (Patil et al. 2001) or by windowing (Conrad et al. 2006; Auton et al. 2009).
One problem with the analysis of population structure based on genome-wide panels of SNPs is that a large proportion of the SNPs were ascertained in Caucasians, potentially biasing the results of the analyses. Analysis based on haplotypes is less susceptible to such bias (Conrad et al. 2006). This is because haplotypes can be represented by multiple patterns of SNPs; thus lack of ascertainment of a particular SNP does not usually prevent observation of the haplotype.
On a chromosome-wide scale, one cannot directly use entire haplotypes, because all the haplotypes in the sample will almost certainly be unique, thus providing no information on population structure. Instead one can use haplotypes on a local basis, either by using windows of adjacent markers or by using localized haplotype clusters, for example those obtained from fastPHASE (Scheet and Stephens 2006) or BEAGLE (Browning 2006; Browning and Browning 2007a).
Localized haplotype clusters are a clustering of haplotypes on a localized basis. At the position of each genetic marker, haplotypes are clustered according to their similarity in the vicinity of the position. Both fastPHASE and BEAGLE use hidden Markov modeling to perform the clustering, although the specific models used by the two programs differ.
Localized haplotype clusters derived from fastPHASE have been used to investigate haplotype diversity, to create neighbor-joining trees of populations, and to create multidimensional scaling (MDS) plots (Jakobsson et al. 2008). It was found that haplotype clusters showed different patterns of diversity to SNPs, while the neighbor-joining and MDS plots were similar between haplotype clusters and SNPs.
In this work, we apply windowed FST methods to localized haplotype clusters derived from the BEAGLE program (Browning and Browning 2007a,b, 2009). We consider population-average, population-specific, and pairwise FST estimates (Weir and Hill 2002). Population-average FST's either assume that all the populations are equally diverged from a common ancestor, which is not realistic, or represent the average of a set of population-specific values. This can be convenient in that the results are summarized by a single statistic; however, information is lost. A common procedure is to calculate FST for each pair of populations, and these values reflect the degree of divergence between the two populations.
Different levels of divergence are allowed for each pair of populations but each estimate uses data from only that pair of populations. On the other hand, population-specific FST's allow unequal levels of divergence in a single analysis that makes use of all the data.
We compare results from the localized haplotype clusters to those using SNPs directly. The results of applying localized haplotype clusters to population-specific FST estimation are very striking, showing better separation of populations and a more realistic pattern of divergence than for population-specific FST estimation using SNPs directly. We also use BEAGLE's haplotype clusters in a haplotype diversity measure and investigate the relationship between this measure of haplotype-cluster diversity and the recombination rate.
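As a rough illustration of the two quantities discussed in the abstract above, the following Python sketch computes the expected haplotype heterozygosity from haplotype (or haplotype-cluster) labels at one position, and a simple fixation index of the form (H_T − H_S)/H_T with equal population weights. It is not the Weir and Hill (2002) population-specific estimator used in the paper and it does not read BEAGLE output; the function names and the toy cluster labels are assumptions made purely for illustration.

```python
from collections import Counter


def expected_heterozygosity(labels):
    """Expected heterozygosity 1 - sum(p_k^2) for one list of haplotype labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def simple_fst(populations):
    """Illustrative fixation index (H_T - H_S) / H_T with equal population weights."""
    k = len(populations)
    # H_S: mean within-population expected heterozygosity.
    h_s = sum(expected_heterozygosity(pop) for pop in populations) / k
    # Average each label's frequency across populations (equal weights).
    mean_freq = Counter()
    for pop in populations:
        for label, count in Counter(pop).items():
            mean_freq[label] += count / len(pop) / k
    # H_T: expected heterozygosity of the averaged frequencies.
    h_t = 1.0 - sum(f ** 2 for f in mean_freq.values())
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical cluster labels at one marker position in two populations.
pop_a = ["c1", "c1", "c2", "c3"]
pop_b = ["c1", "c1", "c1", "c2"]
print(expected_heterozygosity(pop_a), simple_fst([pop_a, pop_b]))
```

A windowed version would average the numerator and denominator of such a ratio over adjacent marker positions before dividing, which is the kind of multilocus averaging the abstract refers to.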

2.
For various species, high quality sequences and complete genomes are nowadays available for many individuals. This makes data analysis challenging, as methods need not only to be accurate, but also time efficient given the tremendous amount of data to process. In this article, we introduce an efficient method to infer the evolutionary history of individuals under the multispecies coalescent model in networks (MSNC). Phylogenetic networks are an extension of phylogenetic trees that can contain reticulate nodes, which allow modeling of complex biological events such as horizontal gene transfer, hybridization and introgression. We present a novel way to compute the likelihood of biallelic markers sampled along genomes whose evolution involved such events. This likelihood computation is at the heart of a Bayesian network inference method called SnappNet, as it extends the Snapp method inferring evolutionary trees under the multispecies coalescent model, to networks. SnappNet is available as a package of the well-known BEAST 2 software.
Recently, the MCMC_BiMarkers method, implemented in PhyloNet, also extended Snapp to networks. Both methods take biallelic markers as input, rely on the same model of evolution and sample networks in a Bayesian framework, though using different methods for computing priors. However, SnappNet relies on algorithms that are exponentially more time-efficient on non-trivial networks. Using simulations, we compare performances of SnappNet and MCMC_BiMarkers. We show that both methods enjoy similar abilities to recover simple networks, but SnappNet is more accurate than MCMC_BiMarkers on more complex network scenarios. Also, on complex networks, SnappNet is found to be far faster than MCMC_BiMarkers in terms of time required for the likelihood computation. We finally illustrate SnappNet performances on a rice data set. SnappNet infers a scenario that is consistent with previous results and provides additional understanding of rice evolution.

3.
Determining the ancestry of unidentified human remains is a major task for bioarchaeologists and forensic anthropologists. Here, we report an assessment of the computer program that has become the main tool for accomplishing this task. Called Fordisc, the program determines ancestry through discriminant function analysis of cranial measurements. We evaluated the utility of Fordisc with 200 specimens of known ancestry. We ran the analyses with and without the test specimen's source population included in the program's reference sample, and with and without specifying the sex of the test specimen. We also controlled for the possibility that the number of variables employed affects the program's ability to attribute ancestry. The results of the analyses suggest that Fordisc's utility in research and medico-legal contexts is limited. Fordisc will only return a correct ancestry attribution when an unidentified specimen is more or less complete, and belongs to one of the populations represented in the program's reference samples. Even then Fordisc can be expected to classify no more than 1 per cent of specimens with confidence.

4.
5.
Leaves of Vitis californica Benth. (California wild grape) exposed to a photon flux density (PFD) equivalent to full sun exhibited temperature-dependent reductions in the rates or efficiencies of component photosynthetic processes. During high-PFD exposure, net CO2 uptake, photon yield of oxygen evolution, and photosystem II chlorophyll fluorescence at 77 Kelvin (Fm, Fv, and Fv/Fm) were more severely inhibited at high and low temperatures than at intermediate temperatures. Sun leaves tolerated high PFD more than growth chamber-grown leaves but exhibited qualitatively similar temperature-dependent responses to high-PFD exposures. Photosystem II fluorescence and net CO2 uptake exhibited different sensitivities to PFD and temperature. Fluorescence and gas exchange kinetics during exposure to high PFD suggested an interaction of multiple, temperature-dependent processes, involving both regulation of energy distribution and damage to photosynthetic components. Comparison of Fv/Fm to photon yield of oxygen evolution yielded a single, curvilinear relationship, regardless of growth condition or treatment temperature, whereas the relationship between Fm (or Fv) and photon yield varied with growth conditions. This indicated that Fv/Fm was the most reliable fluorescence indicator of PSII photochemical efficiency for leaves of different growth conditions and treatments.

6.
Effective population size (Ne) is a central evolutionary concept, but its genetic estimation can be significantly complicated by age structure. Here we investigate Ne in Atlantic salmon (Salmo salar) populations that have undergone changes in demography and population dynamics, applying four different genetic estimators. For this purpose we use genetic data (14 microsatellite markers) from archived scale samples collected between 1951 and 2004. Through life table simulations we assess the genetic consequences of life history variation on Ne. Although variation in reproductive contribution by mature parr affects age structure, we find that its effect on Ne estimation may be relatively minor. A comparison of estimator models suggests that even low iteroparity may upwardly bias Ne estimates when ignored (semelparity assumed) and should thus empirically be accounted for. Our results indicate that Ne may have changed over time in relatively small populations, but otherwise remained stable. Our ability to detect changes in Ne in larger populations was, however, likely hindered by sampling limitations. An evaluation of Ne estimates in a demographic context suggests that life history diversity, density-dependent factors, and metapopulation dynamics may all affect the genetic stability of these populations.
The effective size of a population (Ne) is an evolutionary parameter that can be informative on the strength of stochastic evolutionary processes, the relevance of which relative to deterministic forces has been debated for decades (e.g., Lande 1988). Stochastic forces include environmental, demographic, and genetic components, the latter two of which are thought to be more prominent at reduced population size, with potentially detrimental consequences for average individual fitness and population persistence (Newman and Pilson 1997; Saccheri et al. 1998; Frankham 2005).
The quantification of Ne in conservation programs is thus frequently advocated (e.g., Luikart and Cornuet 1998; Schwartz et al. 2007), although gene flow deserves equal consideration given its countering effects on genetic stochasticity (Frankham et al. 2003; Palstra and Ruzzante 2008).
Effective population size is determined mainly by the lifetime reproductive success of individuals in a population (Wright 1938; Felsenstein 1971). Variance in reproductive success, sex ratio, and population size fluctuations can reduce Ne below census population size (Frankham 1995). Given the difficulty in directly estimating Ne through quantification of these demographic factors (reviewed by Caballero 1994), efforts have been directed at inferring Ne indirectly through measurement of its genetic consequences (see Leberg 2005, Wang 2005, and Palstra and Ruzzante 2008 for reviews). Studies employing this approach have quantified historical levels of genetic diversity and genetic threats to population persistence (e.g., Nielsen et al. 1999b; Miller and Waits 2003; Johnson et al. 2004). Ne has been extensively studied in (commercially important) fish species, due to the common availability of collections of archived samples that facilitate genetic estimation using the temporal method (e.g., Hauser et al. 2002; Shrimpton and Heath 2003; Gomez-Uchida and Banks 2006; Saillant and Gold 2006).
Most models relating Ne to a population's genetic behavior make simplifying assumptions regarding population dynamics. Chief among these is the assumption of discrete generations, frequently violated in practice given that most natural populations are age structured with overlapping generations. Here, theoretical predictions still apply, provided that population size and age structure are constant (Felsenstein 1971; Hill 1972).
Ignored age structure can introduce bias into temporal genetic methods for the estimation of Ne, especially for samples separated by time spans that are short relative to generation interval (Jorde and Ryman 1995; Waples and Yokota 2007; Palstra and Ruzzante 2008). Moreover, estimation methods that do account for age structure (e.g., Jorde and Ryman 1995) still assume this structure to be constant. Population dynamics will, however, likely be altered as population size changes, thus making precise quantifications of the genetic consequences of acute population declines difficult (Nunney 1993; Engen et al. 2005; Waples and Yokota 2007). This problem may be particularly relevant when declines are driven by anthropogenic impacts, such as selective harvesting regimes, that can affect age structure and Ne simultaneously (Ryman et al. 1981; Allendorf et al. 2008). Demographic changes thus have broad conservation implications, as they can affect a population's sensitivity to environmental stochasticity and years of poor recruitment (Warner and Chesson 1985; Ellner and Hairston 1994; Gaggiotti and Vetter 1999). Consequently, although there is an urgent need to elucidate the genetic consequences of population declines, relatively little is understood about the behavior of Ne when population dynamics change (but see Engen et al. 2005, 2007).
Here we focus on age structure and Ne in Atlantic salmon (Salmo salar) river populations in Newfoundland and Labrador. The freshwater habitat in this part of the species' distribution range is relatively pristine (Parrish et al. 1998), yet Atlantic salmon in this area have experienced demographic declines, associated with a commercial marine fishery, characterized by high exploitation rates (40–80% of anadromous runs; Dempson et al. 2001). A fishery moratorium was declared in 1992, with rivers displaying differential recovery patterns since then (Dempson et al.
2004b), suggesting a geographically variable impact of deterministic and stochastic factors, possibly including genetics. An evaluation of those genetic consequences thus requires accounting for potential changes in population dynamics as well as in life history. Life history in Atlantic salmon can be highly versatile (Fleming 1996; Hutchings and Jones 1998; Fleming and Reynolds 2004), as exemplified by the high variation in age-at-maturity displayed among and within populations (Hutchings and Jones 1998), partly reflecting high phenotypic plasticity (Hutchings 2004). This diversity is particularly evident in the reproductive biology of males, which can mature as parr during juvenile freshwater stages (Jones and King 1952; Fleming and Reynolds 2004) and/or at various ages as anadromous individuals, when returning to spawn in freshwater from ocean migration. Variability in life history strategies is further augmented by iteroparity, which can be viewed as a bet-hedging strategy to deal with environmental uncertainty (e.g., Orzack and Tuljapurkar 1989; Fleming and Reynolds 2004). Life history diversity and plasticity may allow salmonid fish populations to alter and optimize their life history under changing demography and population dynamics, potentially acting to stabilize Ne. Reduced variance in individual reproductive success at low breeder abundance (genetic compensation) will achieve similar effects and might be a realistic aspect of salmonid breeding systems (Ardren and Kapuscinski 2003; Fraser et al. 2007b). Little is currently known about the relationships between life history plasticity, demographic change and Ne, partly due to scarcity of the multivariate data required for these analyses.
Our objective in this article is twofold. First, we use demographic data for rivers in Newfoundland to quantify how life history variation influences age structure in Atlantic salmon and hence Ne and its empirical estimation from genetic data.
We find that variation in reproductive contribution by mature parr has a much smaller effect on the estimation of Ne than is often assumed. Second, we use temporal genetic data to estimate Ne and quantify the genetic consequences of demographic changes. We attempt to account for potential sources of bias, associated with (changes in) age structure and life history, by using four different analytical models to estimate Ne: a single-sample estimator using the linkage disequilibrium method (Hill 1981), the temporal model assuming discrete generations (Nei and Tajima 1981; Waples 1989), and two temporal models for species with overlapping generations (Waples 1990a,b; Jorde and Ryman 1995) that differ principally in assumptions regarding iteroparity. A comparison of results from these different estimators suggests that iteroparity may often warrant analytical consideration, even when it is presumably low. Although sometimes limited by statistical power, a quantification and comparison of temporal changes in Ne among river populations suggests a more prominent impact of demographic changes on Ne in relatively small river populations.
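To make the discrete-generation temporal method mentioned above more concrete, here is a minimal Python sketch. It assumes the standardized allele-frequency variance Fc in the form usually attributed to Nei and Tajima (1981) and the sampling correction usually attributed to Waples (1989), Ne ≈ t / (2[Fc − 1/(2S0) − 1/(2St)]). The function names, the single-locus simplification, and the toy frequencies are assumptions for illustration; the sketch ignores age structure and iteroparity, which the abstract identifies as important sources of bias.

```python
def nei_tajima_fc(x, y):
    """Mean standardized variance in allele frequency between two samples (one locus).

    x and y hold the frequencies of the same alleles in the first and second sample.
    """
    terms = []
    for xi, yi in zip(x, y):
        denom = (xi + yi) / 2 - xi * yi
        if denom > 0:  # skip alleles that are absent from (or fixed in) both samples
            terms.append((xi - yi) ** 2 / denom)
    return sum(terms) / len(terms)


def temporal_ne(fc, generations, s0, st):
    """Temporal-method Ne assuming discrete generations.

    s0 and st are the numbers of individuals sampled at the two time points.
    """
    drift = fc - 1 / (2 * s0) - 1 / (2 * st)  # remove the sampling contribution
    if drift <= 0:
        return float("inf")  # no drift signal beyond sampling noise
    return generations / (2 * drift)


# Hypothetical biallelic locus sampled two generations apart, 50 individuals each time.
fc = nei_tajima_fc([0.5, 0.5], [0.6, 0.4])
print(temporal_ne(fc, generations=2, s0=50, st=50))
```

With several loci, Fc would normally be averaged over all alleles and loci before being converted to Ne; an infinite estimate simply signals that the observed change is no larger than expected from sampling alone.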

7.
d-Alanyl:d-lactate (d-Ala:d-Lac) and d-alanyl:d-serine ligases are key enzymes in vancomycin resistance of Gram-positive cocci. They catalyze a critical step in the synthesis of modified peptidoglycan precursors that are low binding affinity targets for vancomycin. The structure of the d-Ala:d-Lac ligase VanA led to the understanding of the molecular basis for its specificity, but that of d-Ala:d-Ser ligases had not been determined. We have investigated the enzymatic kinetics of the d-Ala:d-Ser ligase VanG from Enterococcus faecalis and solved its crystal structure in complex with ADP. The overall structure of VanG is similar to that of VanA but has significant differences mainly in the N-terminal and central domains. Based on reported mutagenesis data and comparison of the VanG and VanA structures, we show that residues Asp-243, Phe-252, and Arg-324 are molecular determinants for d-Ser selectivity. These residues are conserved in both enzymes and explain why VanA also displays d-Ala:d-Ser ligase activity, albeit with low catalytic efficiency in comparison with VanG. These observations suggest that d-Ala:d-Lac and d-Ala:d-Ser enzymes have evolved from a common ancestral d-Ala:d-X ligase. The crystal structure of VanG showed an unusual interaction between two dimers involving residues of the omega loop that are deeply anchored in the active site. We constructed an octapeptide mimicking the omega loop and found that it selectively inhibits VanG and VanA but not Staphylococcus aureus d-Ala:d-Ala ligase. This study provides additional insight into the molecular evolution of d-Ala:d-X ligases and could contribute to the development of new structure-based inhibitors of vancomycin resistance enzymes.

8.
Detecting genetic signatures of selection is of great interest for many research issues. Common approaches to separate selective from neutral processes focus on the variance of FST across loci, as does the original Lewontin and Krakauer (LK) test. Modern developments aim to minimize the false positive rate and to increase the power, by accounting for complex demographic structures. Another stimulating goal is to develop straightforward parametric and computationally tractable tests to deal with massive SNP data sets. Here, we propose an extension of the original LK statistic (TLK), named TF–LK, that uses a phylogenetic estimation of the population's kinship matrix, thus accounting for historical branching and heterogeneity of genetic drift. Using forward simulations of single-nucleotide polymorphism (SNP) data under neutrality and selection, we confirm the relative robustness of the LK statistic (TLK) to complex demographic history but we show that TF–LK is more powerful in most cases. This new statistic also outperforms a multinomial-Dirichlet-based model [estimation with Markov chain Monte Carlo (MCMC)], when historical branching occurs. Overall, TF–LK detects 15–35% more selected SNPs than TLK for low type I errors (P < 0.001). Also, simulations show that TLK and TF–LK follow a chi-square distribution provided the ancestral allele frequencies are not too extreme, suggesting the possible use of the chi-square distribution for evaluating significance. The empirical distribution of TF–LK can be derived using simulations conditioned on the estimated kinship matrix. We apply this new test to pig breed SNP data and pinpoint outliers using TF–LK, otherwise undetected using the less powerful TLK statistic. This new test represents one solution for compromise between advanced SNP genetic data acquisition and outlier analyses.
The development of methods aiming at detecting molecular signatures of selection is one of the major concerns of modern population genetics.
Broadly, such methods can be classified into four groups: methods focusing on (i) the interspecific comparison of gene substitution patterns, (ii) the frequency spectrum and models of selective sweeps, (iii) linkage disequilibrium (LD) and haplotype structure, and (iv) patterns of genetic differentiation among populations (for a review see Nielsen 2005). Tests based on the comparison of polymorphism and divergence at the species level inform on mostly ancient selective processes. Population-based approaches, however, are designed to pinpoint modern processes of local adaptation and speciation occurring among populations within a species. Such approaches also become crucial in the fields of agronomical and biomedical sciences, for instance, to pinpoint possible interesting (QTL) regions and disease susceptibility genes. In particular, human, livestock, and cultivated plant genetics may benefit from such methods now that whole-genome single-nucleotide polymorphism (SNP) genotyping technologies are becoming routinely available (e.g., Barreiro et al. 2008; Flori et al. 2009).
In the population genomic era (Luikart et al. 2003), identifying genes under selection or neutral markers influenced by nearby selected genes is a task in itself for quantifying the role of selection in the evolutionary history of species. Conversely, the accurate inference of demographic parameters such as effective population sizes, migration rates, and divergence times between populations relies on the use of neutral marker data sets. One approach to detecting loci under selection (outliers) with population genetic data is based on the genetic differentiation between loci influenced only by neutral processes (genetic drift, mutation, migration) and loci influenced by selection.
Lewontin and Krakauer's (LK) test for the heterogeneity of the inbreeding coefficient (F) across loci was the first to be developed with regard to this concept (Lewontin and Krakauer 1973).
The LK test was immediately subject to criticisms (Nei and Maruyama 1975; Lewontin and Krakauer 1975; Robertson, 1975a,b; Tsakas and Krimbas 1976; Nei and Chakravarti 1977; Nei et al. 1977). Indeed, its assumptions are likely to be violated due to loci with high mutation rate, variation of F due to unequal effective population size (Ne) among demes, and correlation of allele frequencies among demes due to historical branching. The robustness of the LK test to the effects of demography was tested through coalescent simulations by Beaumont and Nichols (1996). They tested the influence of different models of population structure on the joint distribution of FST (i.e., the inbreeding coefficient F) and heterozygosity (He). The FST distribution under an infinite-island model is inflated for low He values under both the infinite-allele model (IAM) and the stepwise mutation model (SMM) (Beaumont and Nichols 1996). This tendency becomes, however, more marked when strong differences in effective size Ne and gene flow among demes occur, that is, when allele frequencies are correlated among local demes. This suggests an excess of false significant loci when one assumes an infinite-island model as a null hypothesis, while correlations of gene frequencies substantially occur. However, the FST distribution shows robustness properties for high He values (typical from microsatellite markers). Therefore, Beaumont and Nichols (1996) suggested the possibility of detecting outliers by using the distribution of neutral FST conditionally on He under the infinite-island model of symmetric migration, with mutation.
The problem of accounting for correlations of allele frequencies among subpopulations was discussed by Robertson (1975a), who showed how these correlations inflated the variance of the LK test. Different approaches were taken to cope with the problem. It was, for instance, proposed to restrict the analysis to pairwise comparisons (Tsakas and Krimbas 1976; Vitalis et al. 2001).
However, as pointed out by Beaumont (2005), reducing the number of populations to be compared to many pairwise comparisons raises the problem of nonindependence in multiple testing and may reduce the power to detect outliers. Another way was to assume that subpopulation allele frequencies are correlated through a common migrant gene pool, that is, the ancestral population in a star-like population divergence. In this case, subpopulations evolve with an unequal number of migrants coming from the migrant pool and/or to different amounts of genetic drift. This demographic scenario can be explicitly modeled using the multinomial-Dirichlet likelihood approach (Balding 2003). This multinomial-Dirichlet likelihood (or Beta-binomial for biallelic markers such as SNPs) was implemented by Beaumont and Balding (2004) and subsequently by Foll and Gaggiotti (2008), Gautier et al. (2009), Guo et al. (2009), and Riebler et al. (2010), in a Bayesian hierarchical model in which the FST is decomposed into two components: a locus-specific (α) effect and a population-specific (β) effect. This Bayesian statistical model together with prior assumptions on α and β was implemented in a Markov chain Monte Carlo (MCMC) algorithm. A substantial improvement made by Foll and Gaggiotti (2008) was to use a reverse-jumping (RJ)-MCMC to simultaneously estimate the posterior distribution of a model with selection (with α and β) and of a model without selection (with β only). More recently, Excoffier et al. (2009) addressed the issue of accounting for “heterogeneous affinities between sampled populations”—in other words, accounting for migrant genes that do not necessarily originate from the same pool—by using a hierarchically structured population model. They showed by simulations that the false positive rate is lower under a hierarchically structured population model than under a simple island model, for the IAM and the SMM applicable to microsatellite markers and for a SNP mutation model. 
Excoffier et al. (2009) thus proposed to extend the Beaumont and Nichols (1996) method to a hierarchically structured population model.
Nowadays, a computational challenge is to analyze data sets with increasing numbers of markers and populations, under complex demographic histories, in a reasonable amount of time. This is especially the case in agronomical and biomedical sciences with the increasingly used biallelic SNP markers. A question arises as to whether FST-based methods would be sufficiently powerful to detect outliers with SNP markers. Indeed, for low He values, the inflation of the FST distribution under the infinite-island model accentuates dramatically when assuming a mutation model typical for SNPs (simulations of Eveno et al. 2008). Excoffier et al. (2009) corroborated these results and also indicated that the FST distribution is generally broader under a model of hierarchically structured populations when using SNP markers. In addition, as the authors pinpoint, although the hierarchical island model is more conservative than the island model, an excess of false positives can be obtained “if the underlying genetic structure is more complex …, for instance in case of complex demographic histories, involving population splits, range expansion, bottleneck or admixture events” (Excoffier et al. 2009, p. 12). The Bayesian hierarchical models developed by Beaumont and Balding (2004) and Foll and Gaggiotti (2008) effectively account for strong effective size and migration rate variation among subpopulations, but they still impose a star-like demographic model in which the current populations share a common migrant pool and are not supposed to have undergone historical branching. More practically, MCMC-based methods might suffer from a computational time requirement when analyzing large marker data sets such as SNP chips data sets.
Therefore, the development of simple parametric tests potentially dealing with a summary of the population tree, including historical branching as well as population size variation, remains an alternative solution to achieve a good compromise between advanced genetic data acquisition and outlier analyses.
In this article, we describe an extension of the original parametric LK test for biallelic markers that deals with complex population trees through a statistic that takes into account the kinship (or coancestry) matrix between populations, under pure drift with no migration. The statistics of the classical test (TLK) and its extension (TF–LK) are expected to follow a chi-square distribution with (n – 1) d.f., where n is the number of populations studied. Through forward simulations of neutral SNP data under increasingly complex demographic histories, we obtained the empirical distribution of both statistics and showed that they follow a chi-square distribution provided the ancestral allele frequencies are not too extreme. These results also emphasize the robustness of these statistics to variation in demographic histories. Forward simulations of the same demographic models but including selection in one population allowed us to evaluate the power of both statistics to detect selection. We show that the extension of the LK test is more powerful at detecting outliers than the classical LK test for complex demographic histories. A comparison with one of the MCMC methods for multinomial-Dirichlet models (Foll and Gaggiotti 2008) also revealed substantial additional power. We apply this new statistical test to a data set of SNP markers in known genes of the pig genome, taking advantage of the availability of microsatellite markers for the estimation of the kinship matrix.
This new parametric test can help to screen large marker data sets and large numbers of populations for outliers in a reasonable amount of time, although we recommend simulating the empirical distribution of the TF–LK statistic conditional on the estimated kinship matrix.
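As a rough illustration of the two statistics compared in this abstract, the sketch below computes a classical LK statistic and a kinship-aware extension for a single biallelic SNP. The text above does not give the formulas, so the forms used here are assumptions taken from the usual presentation of these tests: T_LK = (n - 1) FST / mean FST, and, for the extension, a quadratic form in the population allele frequencies weighted by the inverse of the kinship matrix, with the ancestral frequency estimated by generalized least squares. All function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def t_lk(fst, mean_fst, n_pops):
    """Classical statistic (assumed form): (n - 1) * FST / mean FST."""
    return (n_pops - 1) * fst / mean_fst


def t_flk(freqs, kinship):
    """Kinship-aware statistic (assumed form) for one biallelic SNP.

    freqs   : allele frequency in each of the n populations
    kinship : n x n symmetric kinship (coancestry) matrix between populations
    """
    ones = [1.0] * len(freqs)
    weights = solve(kinship, ones)  # F^-1 1
    # Generalized least-squares estimate of the ancestral allele frequency.
    p0 = sum(w * p for w, p in zip(weights, freqs)) / sum(weights)
    dev = [p - p0 for p in freqs]
    finv_dev = solve(kinship, dev)  # F^-1 (p - p0 1)
    quad = sum(d * f for d, f in zip(dev, finv_dev))
    return quad / (p0 * (1.0 - p0))
```

Under neutrality both values would be compared with a chi-square distribution with (n - 1) d.f., or, as the authors recommend, with an empirical distribution simulated conditional on the estimated kinship matrix.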

9.
The rapidly growing amount of genomic sequence data being generated and made publicly available necessitates the development of new data storage and archiving methods. The vast amount of data being shared and manipulated also creates new challenges for network resources. Thus, developing advanced data compression techniques is becoming an integral part of data production and analysis. The HapMap project is one of the largest public resources of human single-nucleotide polymorphisms (SNPs), characterizing over 3 million SNPs genotyped in over 1000 individuals. The standard format and biological properties of HapMap data suggest that a dedicated genetic compression method can outperform generic compression tools. We propose a compression methodology for genetic data by introducing HapZipper, a lossless compression tool tailored to compress HapMap data beyond benchmarks defined by generic tools such as gzip, bzip2 and lzma. We demonstrate the usefulness of HapZipper by compressing HapMap 3 populations to <5% of their original sizes. HapZipper is freely downloadable from https://bitbucket.org/pchanda/hapzipper/downloads/HapZipper.tar.bz2.
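The generic-tool benchmarks mentioned above can be approximated with the Python standard library. The sketch below measures gzip, bzip2 and lzma compression ratios on a synthetic 0/1 genotype matrix; the data generator and function names are hypothetical stand-ins, and this is not HapZipper or the HapMap file format.

```python
import bz2
import gzip
import lzma
import random


def synthetic_genotypes(n_snps, n_individuals, seed=0):
    """Build a toy text matrix: one row per SNP, '1' marks the minor allele."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_snps):
        maf = rng.uniform(0.01, 0.5)  # hypothetical minor-allele frequency
        rows.append("".join("1" if rng.random() < maf else "0"
                            for _ in range(n_individuals)))
    return "\n".join(rows).encode("ascii")


def compression_ratios(data):
    """Return compressed size / original size for three generic compressors."""
    size = len(data)
    return {
        "gzip": len(gzip.compress(data)) / size,
        "bzip2": len(bz2.compress(data)) / size,
        "lzma": len(lzma.compress(data)) / size,
    }


if __name__ == "__main__":
    data = synthetic_genotypes(200, 100)
    for tool, ratio in compression_ratios(data).items():
        print(f"{tool}: {ratio:.3f} of original size")
```

A dedicated tool would be judged against these ratios on the same input; the abstract reports <5% of the original size for HapMap 3 populations.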

10.
11.
To establish an advantageous method for the production of l-amino acids, microbial isomerization of d- and dl-amino acids to l-amino acids was studied. Screening experiments on a number of microorganisms showed that cell suspensions of Pseudomonas fluorescens and P. miyamizu were capable of isomerizing d- and dl-phenylalanines to l-phenylalanine. Various conditions suitable for isomerization by these organisms were investigated. Cells grown in a medium containing d-phenylalanine showed the highest isomerization activity, and almost completely converted d- or dl-phenylalanine into l-phenylalanine within 24 to 48 hr of incubation. Enzymatic studies on this isomerizing system suggested that the isomerization of d- or dl-phenylalanine is not catalyzed by a single enzyme, “amino acid isomerase,” but that the conversion proceeds by a two-step system as follows: d-phenylalanine is oxidized to phenylpyruvic acid by d-amino acid oxidase, and the acid is converted to l-phenylalanine by transamination or reductive amination.

12.
Captive populations where natural mating in groups is used to obtain offspring typically yield unbalanced population structures with highly skewed parental contributions and unknown pedigrees. Consequently, for genetic parameter estimation, relationships need to be reconstructed or estimated using DNA marker data. With missing parents and natural mating groups, commonly used pedigree reconstruction methods are not accurate and lead to loss of data. Relatedness estimators, however, infer relationships between all animals sampled. In this study, we compared a pedigree relatedness method and a relatedness estimator (“molecular relatedness”) method using accuracy of estimated breeding values. A commercial data set of common sole, Solea solea, with 51 parents and 1953 offspring (“full data set”) was used. Due to missing parents, for 1338 offspring, a pedigree could be reconstructed with 10 microsatellite markers (“reduced data set”). Cross-validation of both methods using the reduced data set showed an accuracy of estimated breeding values of 0.54 with pedigree reconstruction and 0.55 with molecular relatedness. Accuracy of estimated breeding values increased to 0.60 when applying molecular relatedness to the full data set. Our results indicate that pedigree reconstruction and molecular relatedness predict breeding values equally well in a population with skewed contributions to families. This is probably due to the presence of few large full-sib families. However, unlike methods with pedigree reconstruction, molecular relatedness methods ensure availability of all genotyped selection candidates, which results in higher accuracy of breeding value estimation.To estimate genetic parameters, additive genetic relationships between individuals are inferred from known pedigrees (Falconer and Mackay 1996; Lynch and Walsh 1997). However, in natural populations (Ritland 2000; Thomas et al. 
2002) and in captive species where natural mating in groups is used to obtain offspring (Brown et al. 2005; Fessehaye et al. 2006; Blonk et al. 2009) pedigrees are reconstructed. In these populations there is no control on mating structure, and typically unbalanced population structures with highly skewed parental contributions are obtained (Bekkevold et al. 2002; Brown et al. 2005; Fessehaye et al. 2006; Blonk et al. 2009). To reconstruct pedigrees, parental allocation methods are often used (Marshall et al. 1998; Avise et al. 2002; Duchesne et al. 2002). These methods require that all parents be known. For situations where parental information is not available, numerous DNA-marker-based methods for estimating molecular relatedness have been developed (Lynch 1988; Queller and Goodnight 1989; Ritland 2000; Toro et al. 2002). These relatedness estimators determine relationship values between individuals on a continuous scale. Evaluation of relatedness estimators within real and simulated data in both plants and animals (e.g., see Van de Casteele et al. 2001 ; Milligan 2003; Oliehoek et al. 2006; Rodríguez-Ramilo et al. 2007; Bink et al. 2008) has generally focused on bias and sampling error of estimated genetic variances or relatedness values. Relatively little attention has been paid to their efficiency for estimation of breeding values.Two types of relatedness estimators are currently available: method-of-moments estimators and maximum-likelihood estimators. Method-of-moments estimators (e.g., Queller and Goodnight 1989; Li et al. 1993; Ritland 1996; Lynch and Ritland 1999; Toro et al. 2002) determine relationships while calculating sharing of alleles between pairs in different ways. A variant of method-of-moments estimators is the transformation of continuous relatedness values to categorical genealogical relationships using “explicit pedigree reconstruction” (Fernández and Toro 2006) or thresholds (Rodríguez-Ramilo et al. 2007). 
However, correlations of transformed coancestries with known genealogical coancestries are low (Rodríguez-Ramilo et al. 2007). Several studies have compared different method-of-moments estimators but none revealed one single best estimator (Van de Casteele et al. 2001; Oliehoek et al. 2006; Rodríguez-Ramilo et al. 2007; Bink et al. 2008).Maximum-likelihood (ML) approaches classify animals into a limited number of relationship classes (Mousseau et al. 1998; Thomas et al. 2002; Wang 2004; Herbinger et al. 2006; Anderson and Weir 2007). For each pair a likelihood to fall into a possible relatedness class (e.g., full sib vs. unrelated) is calculated given its genotype and phenotype. ML techniques combined with a Markov chain Monte Carlo approach reconstruct groups with specific relationships jointly and are therefore more efficient than other ML approaches. To minimize standard errors, all discussed ML methods require balanced population structures, large sibling groups, and a large variance of relatedness (Thomas et al. 2002; Wang 2004; Anderson and Weir 2007). Therefore, these methods may not be suitable for natural mating systems.Unlike parental allocation methods, a benefit from relatedness estimators is that essentially all selection candidates are maintained for breeding value estimation, even with missing parents. The question is, however, whether such relatedness estimators also give accurate breeding values to perform selection.In this study, we test suitability of a relatedness estimator to obtain breeding values in a population of common sole, Solea solea (n = 1953) obtained by natural mating. First, we estimate breeding values using pedigree relatedness of animals for which a pedigree could be reconstructed (using parental allocation). This data set (n = 1338) is further referred to as “reduced data set.” We compare results with estimated breeding values using a simple but robust method-of-moments relatedness estimator: “molecular relatedness” (Toro et al. 
2002, 2003). Next, we estimate breeding values using molecular relatedness in the full data set (n = 1953). Results show that accuracies of estimated breeding values obtained with molecular relatedness and pedigree relatedness are comparable. Accuracy increases when breeding values are estimated with molecular relatedness in the full data set. This implies that a molecular relatedness estimator can be used to estimate breeding values in captive natural mating populations.
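To make the idea of a marker-based relatedness estimator concrete, the sketch below computes a simple allele-sharing similarity between two genotyped individuals, averaged over loci. The excerpt does not spell out the exact "molecular relatedness" formula, so the identity-in-state average used here is an assumption, and the function names and toy genotypes are hypothetical.

```python
def allele_sharing(ind_a, ind_b):
    """Average probability that two alleles, one drawn at random from each
    individual at the same locus, are identical in state.

    Each individual is a list of (allele1, allele2) tuples, one per locus.
    """
    total = 0.0
    for (a1, a2), (b1, b2) in zip(ind_a, ind_b):
        matches = (a1 == b1) + (a1 == b2) + (a2 == b1) + (a2 == b2)
        total += matches / 4.0
    return total / len(ind_a)


def similarity_matrix(individuals):
    """Pairwise allele-sharing matrix for all genotyped individuals."""
    return [[allele_sharing(x, y) for y in individuals] for x in individuals]


if __name__ == "__main__":
    # Two microsatellite loci, alleles coded by fragment length (toy values).
    fish = [
        [(120, 124), (200, 200)],
        [(120, 124), (200, 204)],
        [(130, 132), (210, 212)],
    ]
    for row in similarity_matrix(fish):
        print([round(v, 3) for v in row])
```

Because every pair of sampled animals gets a value, no individual is dropped for lack of an assigned parent, which is the practical advantage the abstract attributes to relatedness estimators.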

13.
The importance of genes of major effect for evolutionary trajectories within and among natural populations has long been the subject of intense debate. For example, if allelic variation at a major-effect locus fundamentally alters the structure of quantitative trait variation, then fixation of a single locus can have rapid and profound effects on the rate or direction of subsequent evolutionary change. Using an Arabidopsis thaliana RIL mapping population, we compare G-matrix structure between lines possessing different alleles at ERECTA, a locus known to affect ecologically relevant variation in plant architecture. We find that the allele present at ERECTA significantly alters G-matrix structure, in particular the genetic correlations between branch number and flowering time traits, and may also modulate the strength of natural selection on these traits. Despite these differences, however, when we extend our analysis to determine how evolution might differ depending on the ERECTA allele, we find that predicted responses to selection are similar. To compare responses to selection between allele classes, we developed a resampling strategy that incorporates uncertainty in estimates of selection that can also be used for statistical comparisons of G matrices. The structure of the genetic variation that underlies phenotypic traits has important consequences for understanding the evolution of quantitative traits (Fisher 1930; Lande 1979; Bulmer 1980; Kimura 1983; Orr 1998; Agrawal et al. 2001). Despite the infinitesimal model's allure and theoretical tractability (see Orr and Coyne 1992; Orr 1998, 2005a,b for reviews of its influence), evidence has accumulated from several sources (artificial selection experiments, experimental evolution, and QTL mapping) to suggest that genes of major effect often contribute to quantitative traits. 
Thus, the frequency and role of genes of major effect in evolutionary quantitative genetics have been a subject of intense debate and investigation for close to 80 years (Fisher 1930; Kimura 1983; Orr 1998, 2005a,b). Beyond the conceptual implications, the prevalence of major-effect loci also affects our ability to determine the genetic basis of adaptations and species differences (e.g., Bradshaw et al. 1995, 1998). Although the existence of genes of major effect is no longer in doubt, we still lack basic empirical data on how segregating variation at such genes affects key components of the evolutionary process (but see Carrière and Roff 1995). In other words, how does polymorphism at genes of major effect alter patterns of genetic variation and covariation, natural selection, and the likely response to selection? The lack of data stems, in part, from the methods used to detect genes of major effect: experimental evolution (e.g., Bull et al. 1997; Zeyl 2005) and QTL analysis (see Erickson et al. 2004 for a review) often detect such genes retrospectively after they have become fixed in experimental populations or the species pairs used to generate the mapping population. The consequences of polymorphism at these genes on patterns of variation, covariation, selection, and the response to selection, which can be transient (Agrawal et al. 2001), are thus often unobserved. A partial exception to the absence of data on the effects of major genes comes from artificial selection experiments, in which a substantial evolutionary response to selection in the phenotype after a plateau is often interpreted as evidence for the fixation of a major-effect locus (Frankham et al. 1968; Yoo 1980a,b; Frankham 1980; Shrimpton and Robertson 1988a,b; Caballero et al. 1991; Keightley 1998; see Mackay 1990 and Hill and Caballero 1992 for reviews). 
However, many of these experiments report only data on the selected phenotype (e.g., bristle number) or, alternatively, the selected phenotype and some measure of fitness (e.g., Frankham et al. 1968, Yoo 1980b; Caballero et al. 1991; Mackay et al. 1994; Fry et al. 1995; Nuzhdin et al. 1995; Zur Lage et al. 1997), making it difficult to infer how a mutation will affect variation, covariation, selection, and evolutionary responses for a suite of traits that might affect fitness themselves. One approach is to document how variation at individual genes of major effect affects the genetic variance–covariance matrix (“G matrix”; Lande 1979), which represents the additive genetic variance and covariance between traits.Although direct evidence for variation at major-effect genes altering patterns of genetic variation, covariation, and selection is rare, there is abundant evidence for the genetic mechanisms that could produce these dynamics. A gene of major effect could have these consequences due to any of at least three genetic mechanisms: (1) pleiotropy, where a gene of major effect influences several traits, including potentially fitness, simultaneously, (2) physical linkage or linkage disequilibrium (LD), in which a gene of major effect is either physically linked or in LD with other genes that influence other traits under selection, and (3) epistasis, in which the allele present at a major-effect gene alters the phenotypic effect of other loci and potentially phenotypes under selection. Evidence for these three evolutionary genetic mechanisms leading to changes in suites of traits comes from a variety of sources, including mutation accumulation experiments (Clark et al. 1995; Fernandez and Lopez-Fanjul 1996), mutation induction experiments (Keightley and Ohnishi 1998), artificial selection experiments (Long et al. 1995), and transposable element insertions (Rollmann et al. 2006). 
For pleiotropy in particular, major-effect genes that have consequences on several phenotypic traits are well known from the domestication and livestock breeding literature [e.g., myostatin mutations in Belgian blue cattle and whippets (Arthur 1995; Grobet et al. 1997; Mosher et al. 2007), halothane genes in pigs (Christian and Rothschild 1991; Fujii et al. 1991), and Booroola and Inverdale genes in sheep (Amer et al. 1999; Visscher et al. 2000)]. While these data suggest that variation at major-effect genes could, and probably does, influence variation, covariation, and selection on quantitative traits, data on the magnitude of these consequences remain lacking. Recombinant inbred line (RIL) populations are a promising tool for investigating the influence of major-effect loci. During advancement of the lines from F2's to RILs, alternate alleles at major-effect genes (and most of the rest of the genome) will be made homozygous, simplifying comparisons among genotypic classes. Because of the high homozygosity, individuals within RILs are nearly genetically identical, facilitating phenotyping of many genotypes under a range of environments. In addition, because of recombination, alternative alleles are randomized across genetic backgrounds, facilitating robust comparisons between sets of lines differing at a major-effect locus. Here we investigate how polymorphism at an artificially induced mutation, the erecta locus in Arabidopsis thaliana, affects the magnitude of these important evolutionary genetic parameters under ecologically realistic field conditions. We use the Landsberg erecta (Ler) × Columbia (Col) RIL population of A. thaliana to examine how variation at a gene of major effect influences genetic variation, covariation, and selection on quantitative traits in a field setting. 
The Ler × Col RIL population is particularly suitable, because it segregates for an artificially induced mutation at the erecta locus, which has been shown to influence a wide variety of plant traits. The Ler × Col population thus allows a powerful test of the effects of segregating variation at a gene, chosen a priori, with numerous pleiotropic effects. The ERECTA gene encodes a leucine-rich repeat receptor-like kinase (LRR-RLK) (Torii et al. 1996) and has been shown to affect plant growth rates (El-Lithy et al. 2004), stomatal patterning and transpiration efficiency (Masle et al. 2005; Shpak et al. 2005), bacterial pathogen resistance (Godiard et al. 2003), inflorescence and floral organ size and shape (Douglas et al. 2002; Shpak et al. 2003, 2004), and leaf polarity (Xu et al. 2003; Qi et al. 2004). Specifically, we sought to answer the following questions: (1) Is variation at erecta significantly associated with changes to the G matrix? (2) Is variation at erecta associated with changes in natural selection on genetically variable traits? And (3) is variation at erecta associated with significantly different projected evolutionary responses to selection?
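Question (3) concerns projected responses to selection. A standard way to project them, following Lande (1979), is the multivariate breeder's equation Δz = Gβ, where G is the genetic variance-covariance matrix and β the vector of selection gradients. The sketch below applies it to two made-up G matrices that differ only in the sign of the genetic covariance; all numbers and names are hypothetical illustrations, not estimates from the study, and the authors' resampling procedure is not reproduced.

```python
def predicted_response(g_matrix, beta):
    """Multivariate breeder's equation: delta_z = G * beta."""
    return [sum(g * b for g, b in zip(row, beta)) for row in g_matrix]


if __name__ == "__main__":
    # Hypothetical G matrices for (branch number, flowering time) in two
    # allele classes; only the genetic covariance differs.
    g_class_1 = [[1.0, 0.5],
                 [0.5, 1.0]]
    g_class_2 = [[1.0, -0.5],
                 [-0.5, 1.0]]
    beta = [0.5, -0.2]  # hypothetical selection gradients

    print(predicted_response(g_class_1, beta))
    print(predicted_response(g_class_2, beta))
```

In this toy case the change in covariance alters the projected response of both traits, which shows how a G-matrix difference can matter; the study itself found the predicted responses for the two ERECTA allele classes to be similar.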

14.
Codon usage bias is the nonrandom use of synonymous codons for the same amino acid. Most population genetic models of codon usage evolution assume that the population is at mutation–selection–drift equilibrium. Natural populations, however, frequently deviate from equilibrium, often because of recent demographic changes. Here, we construct a matrix model that includes the effects of a recent change in population size on estimates of selection on preferred vs. unpreferred codons. Our results suggest that patterns of synonymous polymorphisms affecting codon usage can be quite erratic after such a change; statistical methods that fail to take demographic effects into account can then give incorrect estimates of important parameters. We propose a new method that can accurately estimate both demographic and codon usage parameters. The method also provides a simple way of testing for the effects of covariates such as gene length and level of gene expression on the intensity of selection, which we apply to a large Drosophila melanogaster polymorphism data set. Our analyses of twofold degenerate codons reveal that (i) selection acts in favor of preferred codons, (ii) there is mutational bias in favor of unpreferred codons, (iii) shorter genes and genes with higher expression levels are under stronger selection, and (iv) there is little evidence for a recent change in population size in the Zimbabwe population of D. melanogaster.CODONS specifying the same amino acid are called synonymous codons. These are often used nonrandomly, with some codons appearing more frequently than others. This biased usage of synonymous codons has been found in many organisms such as Drosophila, yeast, and bacteria (Ikemura 1985; Duret and Mouchiroud 1999; Hershberg and Petrov 2008). Conventionally, synonymous codons for a given amino acid are divided into two classes: preferred and unpreferred codons (Ikemura 1985; Akashi 1994; Duret and Mouchiroud 1999). 
Several observations indicate that codon usage is affected by natural selection. First, in species with codon usage bias, preferred codons generally correspond to the most abundant tRNA species (Ikemura 1981). Second, highly expressed genes usually have higher codon usage bias than genes with low expression (Sharp and Li 1986; Duret and Mouchiroud 1999; Hey and Kliman 2002). Third, the synonymous substitution rate of a gene has been shown to be negatively correlated with its degree of codon usage bias (Sharp and Li 1986; Bierne and Eyre-Walker 2006). The most commonly cited explanations of the apparent fitness differences between preferred and unpreferred codons are selection for translational efficiency, translational accuracy, and mRNA stability (Ikemura 1985; Eyre-Walker and Bulmer 1993; Akashi 1994; Drummond et al. 2005). Recently, it has been proposed that exon splicing also affects codon usage bias (Warnecke and Hurst 2007). From a population genetics perspective, the extent of codon usage bias is ultimately a product of the joint effects of mutation, selection, genetic drift, recombination, and demographic history. The Li–Bulmer model of drift, selection, and reversible mutation between preferred and unpreferred codons at a site is the most widely used model (Li 1987; Bulmer 1991; McVean and Charlesworth 1999). Applications of this model generally assume that the population is at mutation–selection–drift equilibrium. However, empirical studies have suggested that changes in the strengths of various driving forces may not be unusual. For example, in Drosophila melanogaster, there is evidence that the population size (Li and Stephan 2006; Thornton and Andolfatto 2006; Keightley and Eyre-Walker 2007; Stephan and Li 2007), recombinational landscape (Takano-Shimizu 1999), and mutational process (Takano-Shimizu 2001; Kern and Begun 2005) may have changed significantly over the species' evolutionary history. Such changes cause departures from equilibrium. 
Theoretical models show that it takes a very long time, proportional to the reciprocal of the mutation rate, for the population to approach a new equilibrium state (Tachida 2000; Comeron and Kreitman 2002). Before reaching equilibrium, the population often shows counterintuitive patterns of evolution (Eyre-Walker 1997; Takano-Shimizu 1999, 2001; Comeron and Kreitman 2002; Comeron and Guthrie 2005; Charlesworth and Eyre-Walker 2007). Despite these theoretical results, details of the patterns of polymorphism and substitution rates following a recent change in population size, and their effects on estimates of strength of selection, have not been determined.The above findings point to the importance of incorporating nonequilibrium factors into the study of codon usage bias. To this end, we extend the Li–Bulmer model to allow population size to vary over time, by representing the evolutionary process by a transition matrix. By analyzing this matrix model, we show that a recent change in population size can result in erratic patterns of codon usage and that methods failing to take into account these demographic effects can give false estimates of the intensity of selection.To solve these problems, we propose a new method, which does not require polarizing ancestral vs. derived states using outgroup data (cf. Cutter and Charlesworth 2006), but requires only knowledge of preferred vs. unpreferred states defined by patterns of codon usage. We use information on both polymorphic and fixed sites, which enables both mutational bias and the strength of selection to be estimated, in contrast to previous methods that use information on polymorphisms alone. Simulations indicate that this method can accurately estimate both demographic and codon usage parameters and can distinguish between selection and demography. We use the new method to analyze a large D. melanogaster polymorphism data set (Shapiro et al. 2007) and find evidence for natural selection on synonymous codons. 
We use our approach to show that genes with shorter coding sequences and higher levels of expression are under significantly stronger selection than longer genes with lower expression.
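For orientation, the equilibrium prediction of the Li–Bulmer model that the nonequilibrium analysis departs from can be written in one line. The sketch below uses the commonly quoted form P = 1 / (1 + κ e^(-γ)), where γ is the scaled selection coefficient favoring preferred codons and κ is the mutational bias toward unpreferred codons. The exact scaling of γ (for example 2Ns versus 4Ns) depends on convention and is an assumption here, as are the function and parameter names; the transition-matrix method of the paper is not reproduced.

```python
import math


def equilibrium_preferred_fraction(gamma, kappa):
    """Expected fraction of sites fixed for the preferred codon at
    mutation-selection-drift equilibrium (Li-Bulmer form, assumed scaling).

    gamma : scaled selection coefficient in favor of preferred codons
    kappa : ratio of the mutation rate preferred -> unpreferred to the
            mutation rate unpreferred -> preferred
    """
    return 1.0 / (1.0 + kappa * math.exp(-gamma))


if __name__ == "__main__":
    for gamma in (0.0, 0.5, 1.0, 2.0):
        frac = equilibrium_preferred_fraction(gamma, kappa=2.0)
        print(f"gamma={gamma:.1f}  preferred fraction={frac:.3f}")
```

With no selection the fraction is set by mutational bias alone (1/3 for κ = 2), and it rises toward 1 as γ grows. After a recent change in population size the observed fraction can sit far from this curve, which is why equilibrium-based estimates of γ can mislead.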

15.
Joshua S. Paul, Yun S. Song. Genetics 2010, 186(1): 321-338
The multilocus conditional sampling distribution (CSD) describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. The CSD has a wide range of applications in both computational biology and population genomics analysis, including phasing genotype data into haplotype data, imputing missing data, estimating recombination rates, inferring local ancestry in admixed populations, and importance sampling of coalescent genealogies. Unfortunately, the true CSD under the coalescent with recombination is not known, so approximations, formulated as hidden Markov models, have been proposed in the past. These approximations have led to a number of useful statistical tools, but it is important to recognize that they were not derived from, though were certainly motivated by, principles underlying the coalescent process. The goal of this article is to develop a principled approach to derive improved CSDs directly from the underlying population genetics model. Our approach is based on the diffusion process approximation and the resulting mathematical expressions admit intuitive genealogical interpretations, which we utilize to introduce further approximations and make our method scalable in the number of loci. The general algorithm presented here applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model. Empirical results are provided to demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations.THE probability of observing a sample of DNA sequences under a given population genetics model—which is referred to as the sampling probability or likelihood—plays an important role in a wide range of problems in a genetic variation study. 
When recombination is involved, however, obtaining an analytic formula for the sampling probability has hitherto remained a challenging open problem (see Jenkins and Song 2009, 2010 for recent progress on this problem). As such, much research (Griffiths and Marjoram 1996; Kuhner et al. 2000; Nielsen 2000; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008; Wang and Rannala 2008) has focused on developing Monte Carlo methods on the basis of the coalescent with recombination (Griffiths 1981; Kingman 1982a,b; Hudson 1983), a well-established mathematical framework that models the genealogical history of sample chromosomes. These Monte Carlo-based full-likelihood methods mark an important development in population genetics analysis, but a well-known obstacle to their utility is that they tend to be computationally intensive. For a whole-genome variation study, approximations are often unavoidable, and it is therefore important to think of ways to minimize the trade-off between scalability and accuracy. A popular likelihood-based approximation method that has had a significant impact on population genetics analysis is the following approach introduced by Li and Stephens (2003): Given a set Φ of model parameters (e.g., mutation rate, recombination rate, etc.), the joint probability p(h1, … , hn | Φ) of observing a set {h1, … , hn} of haplotypes sampled from a population can be decomposed as a product of conditional sampling distributions (CSDs), denoted by π,
$$p(h_1, \ldots, h_n \mid \Phi) = \pi(h_1 \mid \Phi)\,\pi(h_2 \mid h_1, \Phi) \cdots \pi(h_n \mid h_1, \ldots, h_{n-1}, \Phi), \qquad (1)$$
where π(hk+1|h1, …, hk, Φ) is the probability of an additionally sampled haplotype being of type hk+1, given a set of already observed haplotypes h1, …, hk. 
In the presence of recombination, the true CSD π is unknown, so Li and Stephens proposed using an approximate CSD $\hat{\pi}$ in place of π, thus obtaining the following approximation of the joint probability:
$$p(h_1, \ldots, h_n \mid \Phi) \approx \hat{\pi}(h_1 \mid \Phi)\,\hat{\pi}(h_2 \mid h_1, \Phi) \cdots \hat{\pi}(h_n \mid h_1, \ldots, h_{n-1}, \Phi). \qquad (2)$$
Li and Stephens referred to this approximation as the product of approximate conditionals (PAC) model. In general, the closer $\hat{\pi}$ is to the true CSD π, the more accurate the PAC model becomes. Notable applications and extensions of this framework include estimating crossover rates (Li and Stephens 2003; Crawford et al. 2004) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), phasing genotype data into haplotype data (Stephens and Scheet 2005; Scheet and Stephens 2006), imputing missing data to improve power in association mapping (Stephens and Scheet 2005; Li and Abecasis 2006; Marchini et al. 2007; Howie et al. 2009), inferring local ancestry in admixed populations (Price et al. 2009), inferring human colonization history (Hellenthal et al. 2008), inferring demography (Davison et al. 2009), and so on. Another problem in which the CSD plays a fundamental role is importance sampling of genealogies under the coalescent process (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008). In this context, the optimal proposal distribution can be written in terms of the CSD π (Stephens and Donnelly 2000), and as in the PAC model, an approximate CSD may be used in place of π. The performance of an importance sampling scheme depends critically on the proposal distribution and therefore on the accuracy of the approximation $\hat{\pi}$. Often in conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), importance sampling has been used in estimating fine-scale recombination rates (McVean et al. 
2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009). So far, a good deal of intuition has gone into choosing the approximate CSDs used in these problems (Marjoram and Tavaré 2006). In the case of completely linked loci, Stephens and Donnelly (2000) suggested constructing an approximation by assuming that the additional haplotype hk+1 is an imperfect copy of one of the first k haplotypes, with copying errors corresponding to mutation. Fearnhead and Donnelly (2001) generalized this construction to include crossover recombination, assuming that the haplotype hk+1 is an imperfect mosaic of the first k haplotypes (i.e., hk+1 is obtained by copying segments from h1, …, hk, where crossover recombination can change the haplotype from which copying is performed). The associated CSD, which we denote by $\hat{\pi}_{\mathrm{FD}}$, can be interpreted as a hidden Markov model and so admits an efficient dynamic programming solution. Finally, Li and Stephens (2003) proposed a modification to Fearnhead and Donnelly's model that limits the hidden state space, thereby providing a computational simplification; we denote the corresponding approximate CSD by $\hat{\pi}_{\mathrm{LS}}$. Although these approaches are computationally appealing, it is important to note that they are not derived from, though are certainly motivated by, principles underlying typical population genetics models, in particular the coalescent process (Griffiths 1981; Kingman 1982a,b; Hudson 1983). The main objective of this article is to develop a principled technique to derive an improved CSD directly from the underlying population genetics model. Rather than relying on intuition, we base our work on a mathematical foundation. The theoretical framework we employ is the diffusion process. De Iorio and Griffiths (2004a,b) first introduced the diffusion-generator approximation technique to obtain an approximate CSD in the case of a single locus (i.e., no recombination). Griffiths et al. 
(2008) later extended the approach to two loci to include crossover recombination, assuming a parent-independent mutation model at each locus. In this article, we extend the framework to develop a general algorithm that applies to an arbitrary number of loci and an arbitrary finite-alleles recurrent mutation model.Our work can be summarized as follows. Using the diffusion-generator approximation technique, we derive a recursion relation satisfied by an approximate CSD. This recursion can be used to construct a closed system of coupled linear equations, in which the conditional sampling probability of interest appears as one of the unknown variables. The system of equations can be solved using standard numerical analysis techniques. However, the size of the system grows superexponentially with the number of loci and, consequently, so does the running time. To remedy this drawback, we introduce additional approximations to make our approach scalable in the number of loci. Specifically, the recursion admits an intuitive genealogical interpretation, and, on the basis of this interpretation, we propose modifications to the recursion, which then can be easily solved using dynamic programming. The computational complexity of the modified algorithm is polynomial in the number of loci, and, importantly, the resulting CSD has little loss of accuracy compared to that following from the full recursion.The accuracy of approximate CSDs has not been discussed much in the literature, except in the application-specific context for which they are being employed. In this article, we carry out an empirical study to explicitly test the accuracy of various CSDs and demonstrate that our new CSDs are in general substantially more accurate than previously proposed approximations. We also consider the PAC framework and show that our approximations also produce more accurate PAC-likelihood estimates. 
We note that for the maximum-likelihood estimation of recombination rates, the actual value of the likelihood may not be so important, as long as it is maximized near the true recombination rate. However, in many other applications (e.g., phasing genotype data into haplotype data, imputing missing data, importance sampling, and so on), the accuracy of the CSD and PAC-likelihood function over a wide range of parameter values may be important. Thus, we believe that the theoretical work presented here will have several practical implications; our method can be applied in a wide range of statistical tools that use CSDs, improving their accuracy. The remainder of this article is organized as follows. To provide intuition for the ensuing mathematics, we first describe a genealogical process that gives rise to our CSD. Using our genealogical interpretation, we consider two additional approximations and relate these to previously proposed CSDs. Then, in the following section, we derive our CSD using the diffusion-generator approach and provide mathematical statements for the additional approximations; some interesting limiting behavior is also described there. This section is self-contained and may be skipped by the reader uninterested in mathematical details. Finally, in the subsequent section, we carry out a simulation study to compare the accuracy of various approximate CSDs and demonstrate that ours are generally the most accurate.
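The "imperfect mosaic" idea described above can be illustrated with a small forward-algorithm sketch. This is not the CSD derived in the article, nor the published parameterization of Fearnhead and Donnelly (2001) or Li and Stephens (2003); it is a minimal copying hidden Markov model under simplifying assumptions: biallelic loci coded 0/1, a constant per-interval switch probability, and a constant miscopy probability. The function name and the parameters `switch_prob` and `miscopy_prob` are hypothetical.

```python
def copying_model_probability(panel, new_hap, switch_prob=0.05, miscopy_prob=0.01):
    """Probability of new_hap as an imperfect mosaic of the haplotypes in panel.

    Hidden state at each locus: which panel haplotype is being copied.
    switch_prob and miscopy_prob are illustrative constants (assumptions),
    standing in for recombination and mutation respectively.
    """
    k = len(panel)

    def emit(j, locus):
        # Copying is faithful with probability 1 - miscopy_prob.
        matches = panel[j][locus] == new_hap[locus]
        return 1.0 - miscopy_prob if matches else miscopy_prob

    # Start by copying a uniformly chosen panel haplotype.
    forward = [emit(j, 0) / k for j in range(k)]

    for locus in range(1, len(new_hap)):
        total = sum(forward)
        # With probability switch_prob the copied haplotype is redrawn
        # uniformly (possibly the same one); otherwise copying continues.
        forward = [
            emit(j, locus)
            * ((1.0 - switch_prob) * forward[j] + switch_prob * total / k)
            for j in range(k)
        ]
    return sum(forward)


panel = [[0, 0, 1], [1, 1, 0]]
exact_copy = copying_model_probability(panel, [0, 0, 1])
mosaic = copying_model_probability(panel, [0, 1, 1])
```

Because the switch step redraws uniformly, each locus costs O(k) rather than O(k²) work, which is the kind of dynamic programming efficiency the text refers to.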

16.
William R. Engels 《Genetics》2009,183(4):1431-1441
Exact conditional tests are often required to evaluate statistically whether a sample of diploids comes from a population with Hardy–Weinberg proportions or to confirm the accuracy of genotype assignments. This requirement is especially common when the sample includes multiple alleles and sparse data, thus rendering asymptotic methods, such as the common χ²-test, unreliable. Such an exact test can be performed using the likelihood ratio as its test statistic rather than the more commonly used probability test. Conceptual advantages in using the likelihood ratio are discussed. A substantially improved algorithm is described to permit the performance of a full-enumeration exact test on sample sizes that are too large for previous methods. An improved Monte Carlo algorithm is also proposed for samples that preclude full enumeration. These algorithms are about two orders of magnitude faster than those currently in use. Finally, methods are derived to compute the number of possible samples with a given set of allele counts, a useful quantity for evaluating the feasibility of the full enumeration procedure. Software implementing these methods, ExactoHW, is provided. When studying the genetics of a population, one of the first questions to be asked is whether the genotype frequencies fit Hardy–Weinberg (HW) expectations. They will fit HW if the population is behaving like a single randomly mating unit without intense viability selection acting on the sampled loci. In addition, testing for HW proportions is often used for quality control in genotyping, as the test is sensitive to misclassifications or undetected null alleles. Traditionally, geneticists have relied on test statistics with asymptotic χ²-distributions to test for goodness-of-fit with respect to HW proportions.
However, as pointed out by several authors (Elston and Forthofer 1977; Emigh 1980; Louis and Dempster 1987; Hernandez and Weir 1989; Guo and Thompson 1992; Chakraborty and Zhong 1994; Rousset and Raymond 1995; Aoki 2003; Maiste and Weir 2004; Wigginton et al. 2005; Kang 2008; Rohlfs and Weir 2008), these asymptotic tests quickly become unreliable when samples are small or when rare alleles are involved. The latter situation is increasingly common as techniques for detecting large numbers of alleles become widely used. Moreover, loci with large numbers of alleles are intentionally selected for use in DNA identification techniques (e.g., Weir 1992). The result is often sparse-matrix data for which the asymptotic methods cannot be trusted. A solution to this problem is to use an exact test (Levene 1949; Haldane 1954) analogous to Fisher's exact test for independence in a 2 × 2 contingency table and its generalization to rectangular tables (Freeman and Halton 1951). In this approach, one considers only potential outcomes that have the same allele frequencies as observed, thus greatly reducing the number of outcomes that must be analyzed. One then identifies all such outcomes that deviate from the HW null hypothesis by at least as much as the observed sample. The total probability of this subset of outcomes, conditioned on HW and the observed allele frequencies, is then the P-value of the test. When it is not possible to enumerate all outcomes, it is still feasible to approximate the P-value by generating a large random sample of tables. The exact HW test has been used extensively and eliminates the uncertainty inherent in the asymptotic methods (Emigh 1980; Hernandez and Weir 1989; Guo and Thompson 1992; Rousset and Raymond 1995).
However, there are two difficulties with the application of this method and its interpretation, both of which are addressed in this report. The first issue is the question of how one decides which of the potential outcomes are assigned to the subset that deviates from HW proportions by as much as or more than the observed sample. If the alternative hypothesis is specifically an excess or a dearth of homozygotes, then the tables can be ordered by Rousset and Raymond's (1995) U-score or, equivalently, by Robertson and Hill's (1984) minimum-variance estimator of the inbreeding coefficient. However, when no specific direction of deviation from HW is suspected, then there are several possible test statistics that can be used (Emigh 1980). These include the χ²-statistic, the likelihood ratio (LR), and the conditional probability itself. The last option is by far the most widely used (Elston and Forthofer 1977; Louis and Dempster 1987; Chakraborty and Zhong 1994; Weir 1996; Wigginton et al. 2005) and implemented in the GENEPOP software package (Rousset 2008). The idea of using the null-hypothesis probability as the test statistic was originally suggested in the context of rectangular contingency tables (Freeman and Halton 1951), but this idea has been criticized for its lack of discrimination between the null hypothesis and alternatives (Gibbons and Pratt 1975; Radlow and Alf 1975; Cressie and Read 1989). For example, suppose a particular sample was found to have a very low probability under the null hypothesis of HW. Such a result would usually tend to argue against the population being in HW equilibrium. However, if this particular outcome also has a very low probability under even the best-fitting alternative hypothesis, then it merely implies that a rare event has occurred regardless of whether the population is in random-mating proportions. The first part of this report compares the use of probability vs. the likelihood ratio as the test statistic in HW exact tests.
Reasons for preferring the likelihood ratio are presented. The second difficulty in performing HW exact tests is the extensive computation needed for large samples when multiple alleles are involved. In this report I present a new algorithm for carrying out these calculations. This method adapts some of the techniques originally developed for rectangular contingency tables in which each possible outcome is represented as a path through a lattice-like network (Mehta and Patel 1983). Unlike the loop-based method currently in use (Louis and Dempster 1987), the new algorithm uses recursion and can be applied to any number of alleles without modification. In addition, it improves the efficiency by about two orders of magnitude, thus allowing the full enumeration procedure to be applied to larger samples and with greater numbers of alleles. The recursion algorithm has been tested successfully on samples with as many as 20 alleles when most of those alleles are rare. However, there are still some samples for which a complete enumeration is not practical. For example, the data from the human Rh locus in Figure 1D would require examining 2 × 10^56 tables (see below). For such cases a Monte Carlo approach must be used (Guo and Thompson 1992). Several improvements to the method of independent random tables are suggested here to make that approach practical for even the largest of realistic samples, thus eliminating the need for the less-accurate Markov chain approach. Figure 1. Sample data sets: examples that have been used in previous discussions of exact tests for HW proportions. For each data set, a triangular matrix of genotype counts is shown next to the vector of allele counts. (A) From Table 2, bottom row, of Louis and Dempster (1987). (B) From Figure 2 of Guo and Thompson (1992). (C) From the documentation included with the GENEPOP software package (Rousset 2008).
(D) From Figure 5 of Guo and Thompson (1992). Finally, I address the problem of determining the number of tables of genotype counts corresponding to a given set of allele counts. This number is needed for determining whether the exact test can be performed by full enumeration. Previously, this number could not be obtained except by actually carrying out the complete enumeration. The methods described are implemented in a software package, ExactoHW, for Mac OS X 10.5 or later. It is available in compiled form (supporting information, File S1) or as source code for academic use on request from the author.
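The full-enumeration exact test described above can be sketched for the simplest case of two alleles. This is not the recursion algorithm of the article or the ExactoHW implementation; it is a minimal illustration that enumerates every genotype table sharing the observed allele counts, computes each table's conditional probability under HW, and sums the probabilities of tables at least as extreme as the observed one under two orderings: by probability and by likelihood ratio. The likelihood ratio is represented here by the G statistic, 2 Σ O ln(O/E), which is an assumption about its exact form; all function names are hypothetical.

```python
from math import comb, factorial, log


def table_probability(n_ab, n_a, n_b):
    """Conditional probability of n_ab heterozygotes given allele counts."""
    n = (n_a + n_b) // 2
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    ways = factorial(n) // (factorial(n_aa) * factorial(n_ab) * factorial(n_bb))
    return ways * 2 ** n_ab / comb(2 * n, n_a)


def g_statistic(n_ab, n_a, n_b):
    """G = 2 * sum(O * ln(O / E)); larger values mean a poorer fit to HW."""
    n = (n_a + n_b) // 2
    p = n_a / (2 * n)
    q = 1.0 - p
    observed = [(n_a - n_ab) // 2, n_ab, (n_b - n_ab) // 2]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


def exact_hw_test(n_aa, n_ab, n_bb):
    """Return (P by probability ordering, P by LR ordering, number of tables)."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    # Heterozygote counts must share the parity of the allele counts.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {h: table_probability(h, n_a, n_b) for h in candidates}
    tol = 1e-9  # guards against floating-point ties
    p_obs = probs[n_ab]
    g_obs = g_statistic(n_ab, n_a, n_b)
    p_by_prob = sum(p for p in probs.values() if p <= p_obs * (1 + tol))
    p_by_lr = sum(
        p for h, p in probs.items() if g_statistic(h, n_a, n_b) >= g_obs - tol
    )
    return p_by_prob, p_by_lr, len(probs)


p_prob, p_lr, n_tables = exact_hw_test(n_aa=3, n_ab=2, n_bb=5)
```

With two alleles the number of tables is only floor(min(n_a, n_b) / 2) + 1, so enumeration is trivial; the combinatorial growth that motivates the article's algorithm arises with many alleles.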

17.
Polyploidy is an important aspect of the evolution of flowering plants. The potential of gene copies to diverge and evolve new functions is influenced by meiotic behavior of chromosomes leading to segregation as a single locus or duplicated loci. Switchgrass (Panicum virgatum) linkage maps were constructed using a full-sib population of 238 plants and SSR and STS markers to assess the degree of preferential pairing and the structure of the tetraploid genome and as a step toward identification of loci underlying biomass feedstock quality and yield. The male and female framework map lengths were 1645 and 1376 cM with 97% of the genome estimated to be within 10 cM of a mapped marker in both maps. Each map coalesced into 18 linkage groups arranged into nine homeologous pairs. Comparative analysis of each homology group to the diploid sorghum genome identified clear syntenic relationships and collinear tracts. The number of markers with PCR amplicons that mapped across subgenomes was significantly lower than expected, suggesting substantial subgenome divergence, while both the ratio of coupling to repulsion phase linkages and pattern of marker segregation indicated complete or near complete disomic inheritance. The proportion of transmission ratio distorted markers was relatively low, but the male map was more extensively affected by distorted transmission ratios and multilocus interactions, associated with spurious linkages. Polyploidy is common among plants (Masterson 1994; Levin 2002) and is an important aspect of plant evolution. Widespread paleopolyploidy in flowering plant lineages suggests that ancient polyploidization events have contributed to the radiation of angiosperms (Soltis et al. 2009; Van de Peer et al. 2009a). Whole genome duplications are thought to be the sources of evolutionary novelty (Osborn et al. 2003; Freeling and Thomas 2006; Chen 2007; Hegarty and Hiscock 2008; Flagel and Wendel 2009; Leitch and Leitch 2008).
Other attributes of polyploids considered to promote evolutionary success include increased vigor, masking of recessive alleles, and reproductive barriers arising from loss of one of the duplicate genes (Soltis and Soltis 2000; Comai 2005; Otto 2007; Van de Peer et al. 2009b). Among crop species, polyploidy likely contributed to trait improvement under artificial selection (Paterson 2005; Udall and Wendell 2006; Dubcovsky and Dvorak 2007; Hovav et al. 2008). Disomic inheritance in polyploids, in contrast to polysomic inheritance, presents opportunities for duplicated genes to diverge and evolve new functions. The relative age of whole genome duplications and the extent of homology between subgenomes greatly influence chromosomal pairing at meiosis (Soltis and Soltis 1995; Wolfe 2001; Ramsey and Schemske 2002). Polysomic inheritance resulting from random chromosome pairing is associated with doubling of a single set of chromosomes. Disomic inheritance resulting from preferential pairing is often associated with polyploidy arising from combinations of divergent genomes. The evolutionary process of diploidization leads to a shift from random to preferential pairing that is not well understood but is genetically defined in systems such as Ph1 of wheat (Triticum aestivum) and PrBn of Brassica napus (Riley and Chapman 1958; Vega and Feldman 1998; Jenczewski et al. 2003). The degree of preferential pairing also affects allelic diversity and the ability to detect linkage. Accurate information about chromosome pairing and whole or partial genome duplications is thus important both for evolutionary studies and for linkage analysis. Such information is extremely limited in the C4 panicoid species Panicum virgatum (switchgrass), which is now viewed as a promising energy crop in the United States and Europe (Lewandowski et al. 2003; McLaughlin and Kszos 2005) and is planted extensively for forage and soil conservation (Vogel and Jung 2001).
Little is known about either its genome structure or inheritance. Much current bioenergy feedstock development is focused on tetraploid cytotypes (2n = 4x = 36) due to their higher yield potentials, and an initial segregation study indicated a high degree of preferential pairing in a single F1 mapping population (Missaoui et al. 2005). A once-dominant component of the tallgrass prairie in North America, switchgrass is largely self-incompatible (Martinez-Reyna and Vogel 2002) with predominantly tetraploid or octoploid cytotypes (Hultquist et al. 1997; Lu et al. 1998). Limited gene flow appears possible between different cytotypes, as suggested by DNA content variation within collection sites and seed lots (Nielsen 1944; Hultquist et al. 1997; Narasimhamoorthy et al. 2008). True diploids appear to be rare (Nielsen 1944; Young et al. 2010). Multivalents in meiosis have not been observed in tetraploids or F1 hybrids between upland and lowland tetraploids, although rare univalents occurred (Barnett and Carver 1967; Martinez-Reyna et al. 2001). However, polysomic inheritance may occur with random bivalent pairing (Howard and Swaminathan 1953). Sustainable production of switchgrass for bioenergy to meet the goal of reducing greenhouse gas emissions will require advances in feedstock production that include improvements in yield (Carroll and Somerville 2009). Switchgrass has extensive genetic diversity and potential for genetic improvements, but each cycle of phenotypic selection can take several years (McLaughlin and Kszos 2005; Parrish and Fike 2005; Bouton 2007). Detailed understanding of genome structure to enable efficient marker-assisted selection (MAS) can speed this process considerably. Complete linkage maps are therefore required both to understand chromosome pairing and to allow MAS. We report the construction of the first complete linkage maps of two switchgrass genotypes.
The linkage maps provide genetic evidence for disomic inheritance in lowland, tetraploid switchgrass. Gene-derived markers enabled a comparative analysis to sorghum, revealing syntenic relationships between the diploid sorghum genome and the tetraploid switchgrass subgenomes. Transmission ratio distortion and multilocus interactions were analyzed in detail to document their potential influence on map accuracy and map-based studies in switchgrass.

18.
The first enzyme in the pathway for L-arabinose catabolism in eukaryotic microorganisms is a reductase, reducing L-arabinose to L-arabitol. The enzymes catalyzing this reduction are in general nonspecific and would also reduce D-xylose to xylitol, the first step in eukaryotic D-xylose catabolism. It is not clear whether microorganisms use different enzymes depending on the carbon source. Here we show that Aspergillus niger makes use of two different enzymes. We identified, cloned, and characterized an L-arabinose reductase, larA, that is different from the D-xylose reductase, xyrA. The larA is up-regulated on L-arabinose, while the xyrA is up-regulated on D-xylose. There is, however, an initial up-regulation of larA on D-xylose as well, which fades after about 4 h. The deletion of the larA gene in A. niger results in a slow growth phenotype on L-arabinose, whereas the growth on D-xylose is unaffected. The L-arabinose reductase can convert L-arabinose and D-xylose to their corresponding sugar alcohols but has a higher affinity for L-arabinose. The Km for L-arabinose is 54 ± 6 mM and for D-xylose 155 ± 15 mM.
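To make the affinity comparison concrete, the reported Km values can be put into the Michaelis–Menten rate law, v/Vmax = [S] / (Km + [S]). The sketch below assumes the enzyme follows simple Michaelis–Menten kinetics, which the abstract does not state explicitly, and it says nothing about the relative Vmax values for the two substrates. The function name is hypothetical.

```python
def fractional_rate(substrate_mM, km_mM):
    """Fraction of Vmax reached at a given substrate concentration,
    assuming simple Michaelis-Menten kinetics (an assumption here)."""
    return substrate_mM / (km_mM + substrate_mM)


KM_ARABINOSE_MM = 54.0   # reported Km for L-arabinose
KM_XYLOSE_MM = 155.0     # reported Km for D-xylose

# At a substrate concentration equal to the Km for L-arabinose, the enzyme
# is half-saturated with L-arabinose but well below half-saturation
# with D-xylose.
arabinose_fraction = fractional_rate(54.0, KM_ARABINOSE_MM)
xylose_fraction = fractional_rate(54.0, KM_XYLOSE_MM)
```

Under that assumption, the roughly threefold difference in Km means a correspondingly higher substrate concentration is needed for D-xylose to reach the same degree of saturation.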

19.
Bayesian inference methods are extensively used to detect the presence of population structure given genetic data. The primary output of software implementing these methods is a set of ancestry profiles of the sampled individuals. While these profiles robustly partition the data into subgroups, currently there is no objective method to determine whether the fixed factor of interest (e.g. geographic origin) correlates with inferred subgroups or not, and if so, which populations are driving this correlation. We present ObStruct, a novel tool to objectively analyse the nature of structure revealed in Bayesian ancestry profiles using established statistical methods. ObStruct evaluates the extent of structural similarity between sampled and inferred populations, tests the significance of population differentiation, provides information on the contribution of sampled and inferred populations to the observed structure and, crucially, determines whether the predetermined factor of interest correlates with inferred population structure. Analyses of simulated and experimental data highlight ObStruct's ability to objectively assess the nature of structure in populations. We show the method is capable of capturing an increase in the level of structure with increasing time since divergence between simulated populations. Further, we applied the method to a highly structured dataset of 1,484 humans from seven continents and a less structured dataset of 179 Saccharomyces cerevisiae from three regions in New Zealand. Our results show that ObStruct provides an objective metric to classify the degree, drivers and significance of inferred structure, as well as providing novel insights into the relationships between sampled populations, and adds a final step to the pipeline for population structure analyses.

20.
It is widely recognized that the mixed linear model is an important tool for parameter estimation in the analysis of complex pedigrees, which includes both pedigree and genomic information, and where mutually dependent genetic factors are often assumed to follow multivariate normal distributions of high dimension. We have developed a Bayesian statistical method based on the decomposition of the multivariate normal prior distribution into products of conditional univariate distributions. This procedure permits computationally demanding genetic evaluations of complex pedigrees, within the user-friendly computer package WinBUGS. To demonstrate and evaluate the flexibility of the method, we analyzed two example pedigrees: a large noninbred pedigree of Scots pine (Pinus sylvestris L.) that includes additive and dominance polygenic relationships and a simulated pedigree where genomic relationships have been calculated on the basis of a dense marker map. The analysis showed that our method was fast and provided accurate estimates and that it should therefore be a helpful tool for estimating genetic parameters of complex pedigrees quickly and reliably. Much effort in genetics has been devoted to revealing the underlying genetic architecture of quantitative or complex traits. Traditionally, the polygenic model has been used extensively to estimate genetic variances and breeding values of natural and breeding populations, where an infinite number of genes is assumed to code for the trait of interest (Bulmer 1971; Falconer and Mackay 1996). The genetic variance of a quantitative trait can be decomposed into an additive part that corresponds to the effects of individual alleles and a part that is nonadditive because of interactions between alleles.
Attention has generally been focused on the estimation of additive genetic variance (and heritability), since additive variation is directly proportional to the response to selection via the breeder's equation (Falconer and Mackay 1996, Chap. 11). However, to estimate additive genetic variation and heritability accurately, it can be important to identify potential nonadditive sources in genetic evaluations (Misztal 1997; Ovaskainen et al. 2008; Waldmann et al. 2008), especially if the pedigree being analyzed contains a large proportion of full-sibs and clones, as these in particular give rise to nonadditive genetic relationships (Lynch and Walsh 1998, p. 145). The polygenic model using pedigree and phenotypic information, i.e., the animal model (Henderson 1984), has been the model of choice for estimating genetic parameters in breeding and natural populations (Abney et al. 2000; Sorensen and Gianola 2002; O'Hara et al. 2008). Recent breakthroughs in molecular techniques have made it possible to create genome-wide, single nucleotide polymorphism (SNP) maps. These maps have helped to uncover a large number of new loci responsible for trait expression and have provided general insights into the genetic architecture of quantitative traits (e.g., Valdar et al. 2006; Visscher 2008; Flint and Mackay 2009). These insights can help when calculating disease risks in humans, when attempting to increase the yield from breeding programs, and when estimating relatedness in conservation programs. High-density SNPs of many species of importance to science and agriculture can now be scored quickly and relatively cheaply, for example, in mice (Valdar et al. 2006), chickens (Muir et al. 2008), and dairy cattle (VanRaden et al. 2009). In the analysis of populations of breeding stock, the inclusion of dense marker data has improved the predictive ability (i.e., reliability) of genetic evaluations compared to the traditional phenotype model, both in simulations (Meuwissen et al.
2001; Calus et al. 2008; Hayes et al. 2009) and when using real data (Legarra et al. 2008; VanRaden et al. 2009; González-Recio et al. 2009). Meuwissen et al. (2001) suggested that the effect of all markers should first be estimated, and then summed, to obtain genomic estimated breeding values (GEBVs). An alternative procedure, where all markers are used to compute the genomic relationship matrix (in place of the additive polygenic relationship matrix) has also been suggested (e.g., Villanueva et al. 2005; VanRaden 2008; Hayes et al. 2009); this matrix is then incorporated into the statistical analysis to estimate GEBVs. A comparison of both procedures (VanRaden 2008) yielded similar estimates of GEBVs in cases where the effect of an individual allele was small. In addition, if not all pedigree members have marker information, a combined relationship matrix derived from both genotyped and ungenotyped individuals could be computed; this has been shown to increase the accuracy of GEBVs (Legarra et al. 2009; Misztal et al. 2009). Another plausible option to incorporate marker information is to use low-density SNP panels within families and to trace the effect of SNPs from high-density genotyped ancestors, as suggested by Habier et al. (2009) and Weigel et al. (2009). However, fast and powerful computer algorithms, which can use the marker information as efficiently as possible in the analysis of quantitative traits, are needed to obtain accurate GEBVs from genome-wide marker data. This study describes the development of an efficient Bayesian method for incorporating general relationships into the genetic evaluation procedure. The method is based on expressing the multivariate normal prior distribution as a product of one-dimensional normal distributions, each conditioned on the descending variables.
When evaluating the genetic parameters of natural and breeding populations, high-dimensional distributions are often used as prior distributions of various genetic effects, such as the additive polygenic effect (Wang et al. 1993), multivariate additive polygenic effects (Van Tassell and Van Vleck 1996), and quantitative trait loci (QTL) effects via the identical-by-descent matrix (Yi and Xu 2000). A Bayesian framework is adopted to obtain posterior distributions of all unknown parameters, estimated by using Markov chain Monte Carlo (MCMC) sampling algorithms in the software package WinBUGS (Lunn et al. 2000, 2009). By performing prior calculations in the form of the factorized product of simple univariate conditional distributions, the computational time of the MCMC estimation procedure is reduced considerably. This feature permits rapid inference for both the polygenic model and the genomic relationship model. Moreover, the decomposition allows for inbreeding of varying degree, since the correct genetic covariance structure can be incorporated into the analysis. In this article, we test the method on two previously published pedigree data sets: phenotype data from a large pedigree of Scots pine, incorporating information on both additive and dominance genetic relationships (Waldmann et al. 2008); and genomic information obtained from a genome-wide scan of a simulated animal population (Lund et al. 2009).
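The factorization of a multivariate normal prior into univariate conditionals can be sketched in a few lines. This is not the authors' WinBUGS implementation; it is a minimal pure-Python illustration that uses a Cholesky factor of the covariance (relationship) matrix to obtain, for each variable in turn, its conditional mean and variance given the variables before it. The conditioning order (each variable on those with lower index) and all function names are assumptions made for this example.

```python
from math import log, pi, sqrt


def cholesky(cov):
    """Lower-triangular factor low with low * low^T == cov."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (log(2.0 * pi * var) + (x - mean) ** 2 / var)


def factorized_logpdf(x, cov):
    """log N(x; 0, cov) as a sum of univariate conditional log densities."""
    low = cholesky(cov)
    z = []  # standardized residuals of the variables already processed
    total = 0.0
    for i in range(len(x)):
        cond_mean = sum(low[i][j] * z[j] for j in range(i))
        cond_var = low[i][i] ** 2
        total += normal_logpdf(x[i], cond_mean, cond_var)
        z.append((x[i] - cond_mean) / low[i][i])
    return total


# Additive relationships for two unrelated parents and their offspring.
relationship = [[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.5],
                [0.5, 0.5, 1.0]]
offspring_cond_var = cholesky(relationship)[2][2] ** 2
```

In the small pedigree example, the offspring's conditional variance given both parents comes out as one half of the additive variance, which is the familiar Mendelian sampling term; each factor in the product involves only a scalar mean and variance.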
