Similar Documents
20 similar documents found (search time: 31 ms)
1.
The BGLR-R package implements various types of single-trait shrinkage/variable-selection Bayesian regressions. The package was first released in 2014 and has since become widely used software in genomic studies. We recently developed functionality for multitrait models. The implementation allows users to include an arbitrary number of random-effects terms. For each set of predictors, users can choose diffuse, Gaussian, or Gaussian spike–slab multivariate priors. Unlike other software packages for multitrait genomic regressions, BGLR offers many specifications for the (co)variance parameters (unstructured, diagonal, factor analytic, and recursive). Samples from the posterior distribution of the models implemented in the multitrait function are generated using a Gibbs sampler, implemented by combining code written in the R and C programming languages. In this article, we provide an overview of the models and methods implemented in BGLR’s multitrait function, present examples that illustrate the use of the package, and benchmark the performance of the software.
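A minimal usage sketch of the multitrait interface described above, on simulated data. The argument names (ETA, resCov, model = "SpikeSlab") reflect my reading of the released BGLR-R package and should be checked against ?Multitrait; treat them as assumptions rather than a definitive API reference.

```r
# Sketch: two-trait Bayesian regression with BGLR's Multitrait() function.
# Y is an n x 2 trait matrix, X an n x p marker matrix (simulated here).
library(BGLR)

set.seed(1)
n <- 500; p <- 1000
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)   # SNPs coded 0/1/2
B <- matrix(rnorm(p * 2, sd = 0.05), p, 2)              # true marker effects
Y <- X %*% B + matrix(rnorm(n * 2), n, 2)               # two traits

fm <- Multitrait(
  y      = Y,
  ETA    = list(list(X = X, model = "SpikeSlab")),  # Gaussian spike-slab prior
  resCov = list(type = "UN"),                       # unstructured residual (co)variance
  nIter  = 6000, burnIn = 1000
)
str(fm)   # inspect posterior means of marker effects and (co)variance components
```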

2.
Daniel Gianola, Genetics, 2013, 194(3): 573-596
Whole-genome-enabled prediction of complex traits has received enormous attention in animal and plant breeding and is making inroads into human and even Drosophila genetics. The term “Bayesian alphabet” denotes a growing number of letters used to label various Bayesian linear regressions that differ in the priors adopted while sharing the same sampling model. We explore the role of the prior distribution in whole-genome regression models for dissecting complex traits in what is now a standard situation with genomic data, where the number of unknown parameters (p) typically exceeds sample size (n). Members of the alphabet aim to confront this overparameterization in various manners, but it is shown here that the prior is always influential unless n ≫ p. This happens because parameters are not likelihood-identified, so Bayesian learning is imperfect. Since inferences are not devoid of the influence of the prior, claims about genetic architecture from these methods should be taken with caution. However, all such procedures may deliver reasonable predictions of complex traits, provided that some parameters (“tuning knobs”) are assessed via a properly conducted cross-validation. It is concluded that members of the alphabet have a role in whole-genome prediction of phenotypes but somewhat doubtful inferential value, at least when sample size is such that p ≫ n.
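To make the identifiability point concrete, consider the simplest member of the alphabet, Bayesian ridge regression; the following is standard conjugate algebra, not material from the paper itself:

```latex
y = X\beta + e, \qquad \beta \sim N(0, \sigma_\beta^2 I_p), \qquad e \sim N(0, \sigma_e^2 I_n),
\qquad
E(\beta \mid y) = (X^\top X + \lambda I_p)^{-1} X^\top y, \qquad \lambda = \sigma_e^2 / \sigma_\beta^2 .
```

When p > n, X'X has rank at most n, so components of β lying outside the row space of X are estimated purely from the prior; no amount of sampling removes that dependence, which is the sense in which Bayesian learning is imperfect here.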

3.
The use of dense SNPs to predict the genetic value of an individual for a complex trait is often referred to as “genomic selection” in livestock and crops, but it is also relevant to human genetics, for example to predict complex genetic disease risk. The accuracy of prediction depends on the strength of linkage disequilibrium (LD) between SNPs and causal mutations. If sequence data were used instead of dense SNPs, accuracy should increase because the causal mutations are present, but demographic history and long-term negative selection also influence accuracy. We therefore evaluated genomic prediction using simulated sequence data in two contrasting populations: one shrinking from an ancestrally large effective population size (Ne) to a small one, with the high LD common in domestic livestock; the other with a large, constant Ne and low LD, similar to some human or outbred plant populations. There were two scenarios in each population: causal variants were either neutral or under long-term negative selection. For the large-Ne population, sequence data led to a 22% increase in accuracy relative to ∼600K SNP chip data with a Bayesian analysis, and a more modest advantage with a BLUP analysis. This advantage increased when causal variants were influenced by negative selection, and the accuracy persisted when 10 generations separated the reference and validation populations. However, in the population with reducing Ne, there was little advantage for sequence data even with negative selection. This study demonstrates the joint influence of demography and selection on the accuracy of prediction and improves our understanding of how best to exploit sequence data for genomic prediction.

4.
As molecular marker density grows, there is a strong need in both genome-wide association studies and genomic selection to fit models with a large number of parameters. Here we present a computationally efficient generalized ridge regression (RR) algorithm for situations in which the number of parameters largely exceeds the number of observations. The computationally demanding parts of the method depend mainly on the number of observations, not the number of parameters. The algorithm was implemented in the R package bigRR, based on the previously developed package hglm. Using this approach, a heteroscedastic effects model (HEM) was also developed, implemented, and tested. Efficiency for different data sizes was evaluated via simulation. The method was tested for a bacteria-hypersensitive trait in a publicly available Arabidopsis data set including 84 inbred lines and 216,130 SNPs. The computation of all the SNP effects required <10 sec using a single 2.7-GHz core. The advantage in run time makes permutation testing feasible for such a whole-genome model, so that a genome-wide significance threshold can be obtained. HEM was found to be more robust than ordinary RR (a.k.a. SNP-best linear unbiased prediction) in terms of QTL mapping, because SNP-specific shrinkage is applied instead of a common shrinkage. The proposed algorithm was also assessed for genomic evaluation and was shown to give better predictions than ordinary RR.
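The claim that cost is driven by the number of observations rather than the number of markers rests on a standard matrix identity, (X'X + λI_p)⁻¹X'y = X'(XX' + λI_n)⁻¹y. Below is a base-R sketch of that trick, not the bigRR/hglm implementation itself:

```r
# Ridge regression solved through an n x n system instead of a p x p one.
set.seed(42)
n <- 200; p <- 20000; lambda <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

K <- tcrossprod(X)                                        # n x n matrix XX'
beta_hat <- crossprod(X, solve(K + lambda * diag(n), y))  # all p ridge effects

# Cost is O(n^2 p + n^3): runtime scales with observations, not markers.
```

bigRR's heteroscedastic effects model (HEM) then replaces the single λ with SNP-specific weights re-estimated from an initial fit; the sketch above covers only the homoscedastic first step.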

5.
C-L Wang, X-D Ding, J-Y Wang, J-F Liu, W-X Fu, Z Zhang, Z-J Yin, Q Zhang, Heredity, 2013, 110(3): 213-219
Estimation of genomic breeding values is the key step in genomic selection (GS). Many methods have been proposed for continuous traits, but methods for threshold traits are still scarce. Here we introduced the threshold model into the framework of GS; specifically, we extended the three Bayesian methods BayesA, BayesB and BayesCπ on the basis of the threshold model for estimating genomic breeding values of threshold traits, and the extended methods are correspondingly termed BayesTA, BayesTB and BayesTCπ. Computing procedures for the three BayesT methods, using a Markov chain Monte Carlo algorithm, were derived. A simulation study was performed to investigate the accuracy gains offered by the presented methods for genomic estimated breeding values (GEBVs) of threshold traits, and factors affecting the performance of the three BayesT methods were addressed. As expected, the three BayesT methods generally performed better than the corresponding normal Bayesian methods, in particular when the number of phenotypic categories was small. In the standard scenario (number of categories = 2, incidence = 30%, number of quantitative trait loci = 50, h² = 0.3), the accuracies were improved by 30.4, 2.4 and 5.7 percentage points, respectively. In most scenarios, BayesTB and BayesTCπ generated similar accuracies and both performed better than BayesTA. In conclusion, our work shows that the threshold model is well suited to predicting GEBVs of threshold traits, and BayesTCπ is recommended as the method of choice for GS of threshold traits.
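The core of any BayesT-style sampler is a data-augmentation step that draws an unobserved liability for each record from a normal distribution truncated at the thresholds. Here is a minimal sketch for a binary trait, assuming a unit residual variance and a threshold at zero (illustrative code, not the authors' implementation):

```r
# Inverse-CDF sampling from a truncated normal with sd = 1.
rtruncnorm1 <- function(mu, lower, upper) {
  u <- runif(length(mu), pnorm(lower - mu), pnorm(upper - mu))
  mu + qnorm(u)
}

# One Gibbs step: given fitted values mu, draw liabilities consistent with
# the observed categories (y = 0 -> liability <= 0; y = 1 -> liability > 0).
draw_liability <- function(y, mu) {
  l <- numeric(length(y))
  l[y == 0] <- rtruncnorm1(mu[y == 0], -Inf, 0)
  l[y == 1] <- rtruncnorm1(mu[y == 1], 0, Inf)
  l   # liabilities then replace the phenotype in the usual BayesA/B/Cpi updates
}

set.seed(2)
y <- rbinom(50, 1, 0.3); mu <- rnorm(50)
summary(draw_liability(y, mu))
```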

6.
In genome-based prediction there is considerable uncertainty about the statistical model and method required to maximize prediction accuracy. For traits influenced by a small number of quantitative trait loci (QTL), predictions are expected to benefit from methods performing variable selection [e.g., BayesB or the least absolute shrinkage and selection operator (LASSO)] compared to methods distributing effects across the genome [ridge regression best linear unbiased prediction (RR-BLUP)]. We investigate the assumptions underlying successful variable selection by combining computer simulations with large-scale experimental data sets from rice (Oryza sativa L.), wheat (Triticum aestivum L.), and Arabidopsis thaliana (L.). We demonstrate that variable selection can be successful when the number of phenotyped individuals is much larger than the number of causal mutations contributing to the trait. We show that the sample size required for efficient variable selection increases dramatically with decreasing trait heritabilities and increasing extent of linkage disequilibrium (LD). We contrast and discuss contradictory results from simulation and experimental studies with respect to the superiority of variable selection methods over RR-BLUP. Our results demonstrate that due to long-range LD, medium heritabilities, and small sample sizes, superiority of variable selection methods cannot be expected in plant breeding populations, even for traits like FRIGIDA gene expression in Arabidopsis and flowering time in rice that are assumed to be influenced by a few major QTL. We extend our conclusions to the analysis of whole-genome sequence data and infer upper bounds for the number of causal mutations that can be identified by the LASSO. Our results have a major impact on the choice of statistical method needed to make credible inferences about genetic architecture and prediction accuracy of complex traits.
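The contrast between the two model families is easy to reproduce in miniature with the glmnet package, which fits both penalties; the simulation settings below are illustrative and much smaller than the data sets analyzed in the paper:

```r
# LASSO (variable selection) versus ridge (RR-BLUP-like) on simulated markers.
library(glmnet)

set.seed(7)
n <- 300; p <- 2000; nQTL <- 10
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
qtl <- sample(p, nQTL)
y <- as.numeric(X[, qtl] %*% rnorm(nQTL) + rnorm(n, sd = 2))

lasso <- cv.glmnet(X, y, alpha = 1)   # alpha = 1: LASSO, sparse solution
ridge <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge, effects spread genome-wide

# Markers kept by the LASSO at the CV-optimal penalty (minus the intercept):
sum(coef(lasso, s = "lambda.min") != 0) - 1
```

With many more phenotyped individuals than causal mutations and high heritability, the LASSO recovers most of the simulated QTL; shrinking n or h² quickly erodes that advantage, mirroring the paper's conclusion.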

7.

Background

Genomic selection can be implemented by a multi-step procedure, which requires a response variable and a statistical method. For pure-bred pigs, it was hypothesised that deregressed estimated breeding values (EBV) with the parent average removed as the response variable generate higher reliabilities of genomic breeding values than EBV, and that the normal, thick-tailed and mixture-distribution models yield similar reliabilities.

Methods

Reliabilities of genomic breeding values were estimated with EBV and deregressed EBV as response variables and under the three statistical methods, genomic BLUP, Bayesian Lasso and MIXTURE. The methods were examined by splitting data into a reference data set of 1375 genotyped animals that were performance tested before October 2008, and 536 genotyped validation animals that were performance tested after October 2008. The traits examined were daily gain and feed conversion ratio.

Results

Using deregressed EBV as the response variable yielded 18 to 39% higher reliabilities of the genomic breeding values than using EBV as the response variable. For daily gain, the increase in reliability due to deregression was significant and approximately 35%, whereas for feed conversion ratio it ranged between 18 and 39% and was significant only when MIXTURE was used. Genomic BLUP, Bayesian Lasso and MIXTURE had similar reliabilities.

Conclusions

Deregressed EBV is the preferred response variable, whereas the choice of statistical method is less critical for pure-bred pigs. The increase of 18 to 39% in reliability is worthwhile, since the reliabilities of the genomic breeding values directly affect the returns from genomic selection.
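For orientation, a heavily simplified sketch of what deregression with the parent average (PA) removed can look like. The reliability formula below is one common approximation and is loudly an assumption here; the paper follows a more careful weighting scheme:

```r
# Simplified deregression: remove the parent average, then scale by the
# reliability of what remains. Illustrative convention only -- NOT the
# exact procedure used in the paper.
deregress <- function(ebv, pa, rel_ebv, rel_pa) {
  rel_own <- (rel_ebv - rel_pa) / (1 - rel_pa)  # reliability net of the PA part (assumed formula)
  (ebv - pa) / rel_own                          # deregressed, PA-free response
}

# Example: EBV 12, PA 4, reliabilities 0.60 (EBV) and 0.25 (PA) -> ~17.1
deregress(ebv = 12, pa = 4, rel_ebv = 0.60, rel_pa = 0.25)
```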

8.

Background

Genomic selection has gained much attention, and its main goal is to increase predictive accuracy and genetic gain in livestock using dense marker information. Most methods addressing the large p (number of covariates), small n (number of observations) problem have dealt only with continuous traits, but many important traits in livestock are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context.

Methods

This study presents two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) for analyzing discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be-observed records. Performances were compared based on the models' predictive ability.

Results

The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals; its phi correlation was up to 81% greater than that of the Bayesian regressions. Random Forest also outperformed the other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated, while boosting and Bayes A were more accurate with crossbred data.

Conclusions

The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals. Among the different alternatives proposed to analyze discrete traits, machine learning showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and competitive predictive ability. Nonetheless, the best method may be case-dependent, and an initial evaluation of different methods is recommended when dealing with a particular problem.
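A small sketch of the random forest side of the comparison, using the randomForest package on a simulated binary trait derived from an underlying liability; settings are illustrative, not those of the study:

```r
library(randomForest)

set.seed(3)
n <- 400; p <- 1000
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
liab <- as.numeric(X[, 1:20] %*% rnorm(20) + rnorm(n, sd = 3))
y <- factor(ifelse(liab > quantile(liab, 0.7), "susceptible", "resistant"))

train <- sample(n, 300)
rf <- randomForest(x = X[train, ], y = y[train], ntree = 500)
pred <- predict(rf, X[-train, ])
table(observed = y[-train], predicted = pred)   # classification performance
```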

9.
Bayesian methods are a popular choice for genomic prediction of genotypic values. The methodology is well established for traits with an approximately Gaussian phenotypic distribution. However, numerous important traits are dichotomous, and the phenotypic counts observed follow a Binomial distribution. Standard Gaussian generalized linear models (GLM) are not statistically valid for this type of data. We therefore implemented Binomial GLM with a logit link function for the BayesB and Bayesian GBLUP genomic prediction methods. We compared these models with their standard Gaussian counterparts using two experimental data sets from plant breeding, one on female fertility in wheat and one on haploid induction in maize, as well as a simulated data set. With the aid of the simulated data, referring to a bi-parental population of doubled haploid lines, we further investigated the influence of training set size (N), the number of independent Bernoulli trials for trait evaluation (n_i) and the genetic architecture of the trait on genomic prediction accuracies and abilities in general and on the relative performance of our models. For BayesB, we additionally implemented a finite-mixture Binomial GLM to account for overdispersion. We found that prediction accuracies increased with increasing N and n_i. For the simulated and experimental data sets, Binomial GLM was superior to Gaussian models for small n_i, but for large n_i Gaussian models may be used as ad hoc approximations. We further show with simulated and real data sets that accounting for overdispersion in Binomial data can markedly increase prediction accuracy.
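The building block of these models is simply a Binomial likelihood with a logit link. The sketch below fits it with base R's glm() on a handful of markers; the paper embeds the same likelihood inside BayesB and Bayesian GBLUP rather than an ordinary GLM, so this shows only the likelihood layer, not their method:

```r
# Binomial-logit likelihood: each line is scored on n_i Bernoulli trials.
set.seed(11)
n <- 200; n_i <- 20
X <- matrix(sample(0:2, n * 5, replace = TRUE), n, 5)
eta <- -0.5 + X %*% c(0.4, -0.3, 0, 0.2, 0)     # linear predictor
success <- rbinom(n, n_i, plogis(eta))          # counts out of n_i trials

fit <- glm(cbind(success, n_i - success) ~ X,
           family = binomial(link = "logit"))
summary(fit)$coefficients
```

With small n_i the Binomial likelihood matters most, which matches the reported superiority of the Binomial GLM in that regime.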

10.
The joint action of multiple genes is an important source of variation for complex traits and human diseases. However, mapping genes with epistatic effects and gene–environment interactions is a difficult problem because of relatively small sample sizes and very large parameter spaces for quantitative trait locus models that include such interactions. Here we present a nonparametric Bayesian method to map multiple quantitative trait loci (QTL) by considering epistatic and gene–environment interactions. The proposed method is not restricted to pairwise interactions among genes, as is typically done in parametric QTL analysis. Rather than modeling each main and interaction term explicitly, our nonparametric Bayesian method measures the importance of each QTL, irrespective of whether it is mostly due to a main effect or due to some interaction effect(s), via an unspecified function of the genotypes at all candidate QTL. A Gaussian process prior is assigned to this unknown function. In addition to the candidate QTL, nongenetic factors and covariates, such as age, gender, and environmental conditions, can also be included in the unspecified function. The importance of each genetic factor (QTL) and each nongenetic factor/covariate included in the function is estimated by a single hyperparameter, which enters the covariance function and captures any main or interaction effect associated with a given factor/covariate. An initial evaluation of the performance of the proposed method is obtained via analysis of simulated and real data.

Traits showing continuous variation are called quantitative traits and are typically controlled by multiple genetic and nongenetic factors, which tend to have relatively small effects individually. Crosses between inbred lines produce suitable populations for quantitative trait locus (QTL) mapping and are available for agricultural plants and for animal (e.g., mouse) models of human diseases. Such crosses are often used to detect QTL. For these inbred line crosses, uniform genetic backgrounds, controlled breeding schemes, and controlled environments ensure that there is little or no confounding of uncontrolled sources of variability with genetic effects. The potential for such confounding complicates and limits the analysis and interpretation of human data. Because of the homology between humans and rodents, rodent models can be extremely useful in advancing our understanding of certain human diseases. In the past two decades, various statistical approaches have been developed to identify QTL in inbred line crosses (see, for example, Doerge et al. 1997 for a review). To perform QTL mapping (identification), a large number of candidate positions (candidate QTL) along the genome are selected. These candidate QTL may all be located at genetic markers (positions of sequence variants in the genome where the genotypes of all individuals in a mapping population can be measured) or also in between markers if the marker density is not high. QTL mapping may then be performed by considering one candidate QTL at a time or multiple candidate QTL simultaneously. For inbred line crosses with low marker density and considering a single candidate QTL at a time, the interval-mapping method was proposed by Lander and Botstein (1989). However, these authors showed that interval mapping tends to identify a “ghost” QTL located in between two actual linked QTL if two or more closely linked QTL exist.

This problem can be reduced or eliminated in two ways: (1) by using composite-interval mapping (Jansen and Stam 1994; Zeng 1994), which still performs a one-dimensional QTL search but conditions on the genotypes at a pair of markers flanking the marker interval containing the current QTL, to absorb the effects of background (nontarget) QTL outside of the target interval; or (2) by performing multiple-QTL mapping, where two or more QTL are mapped simultaneously. Furthermore, if several QTL affect a quantitative trait mostly through their interactions (epistasis) while having nonexistent or weak main effects, then interval mapping or single-marker analysis will fail to detect such QTL. QTL interactions may not be limited to pairwise interactions. Marchini et al. (2005) have shown by simulation that searching for three loci jointly in the presence of a three-way interaction is more powerful than searching for a single or a pair of QTL. There are various implementations of multiple-QTL mapping. Most methods still perform only pairwise searches, with and without epistasis. The most recent methods are based on Bayesian variable selection and consider a group of candidate QTL or all candidate QTL in the genome simultaneously (e.g., Yi et al. 2007). These methods are typically still limited to pairwise interactions among QTL and do not consider gene–environment interactions.

The identification of QTL can be viewed as a very large variable selection problem: for p candidate QTL, with p typically in the hundreds or thousands and sample size in the low hundreds, there are 2^p possible main-effect models, p(p − 1)/2 possible two-way interactions, and C(p, k) possible higher-order (k > 2) interactions. For inbred line crosses, where multiple-QTL mapping models can be represented as multiple linear regression models, classical variable selection methods such as forward and stepwise selection (Broman and Speed 2002) have been used in searching for main and two-way interaction effects. Bayesian analysis implemented by Markov chain Monte Carlo (MCMC) and based on the composite model space framework (Godsill 2001, 2003) has been introduced to genetic mapping (Yi 2004). Well-known Bayesian variable selection methods such as reversible-jump MCMC (Green 1995) and stochastic search variable selection (SSVS) (George and McCulloch 1993) are special cases. SSVS and similar methods employ mixture priors for the regression coefficients, which specify different distributions for the coefficients under the null (effect negligible) and alternative (effect nonnegligible) hypotheses. The marginal posterior probabilities of the alternative hypotheses can be used to identify a subset of important parameters on the basis of Bayesian multiple-comparison rules, including the median probability model (with a threshold of 0.5) and Bayesian false discovery rate control (e.g., Müller et al. 2006).

An alternative to variable selection with mixture priors is classical and Bayesian shrinkage- or penalty-based inference. For the classical approach of penalized regression, while an L2-based shrinkage method (ridge regression) cannot perform variable selection, other methods, in particular the L1-based lasso of Tibshirani (1996) and later lasso extensions, are capable of performing variable selection by reducing the effects of unimportant variables effectively to zero. The lasso has been applied to parametric, regression-based QTL mapping (Yi and Xu 2008). The penalized regression methods can be interpreted as Bayesian regression models with particular sparsity priors imposed on the regression coefficients (Park and Casella 2008).

Regression methods are also used for association mapping in human populations. Recently, Kwee et al. (2008) proposed a semiparametric regression-based approach for candidate regions in human association mapping, where a quantitative trait is regressed on a nonparametric function of the tagSNP genotypes within a region. They analyzed a (small) subset of the genome and tested for the joint significance of the subset. Their method can potentially be used to model interactions among SNPs and covariates. However, Kwee et al. (2008) fit their model using least-squares kernel machines, a dimension-reducing technique that is identical to an analysis based on a specific linear mixed model. Model selection for different types of kernels and different sets of variables is performed using criteria such as Akaike's information criterion (Akaike 1974) and the Bayesian information criterion (Schwarz 1978), which may not be appropriate or feasible in large-scale, sparse variable selection situations.

We (Huang et al. 2010) recently developed a Bayesian semiparametric QTL mapping method, where nongenetic covariate effects are modeled nonparametrically. This method was implemented via MCMC, and a Gaussian process prior (O'Hagan 1978; Neal 1996, 1997) was placed on the unknown covariate function. The Gaussian process is particularly well suited for curve estimation due to its flexible sample-path shapes. This method allows one or more nongenetic covariates to have an arbitrary (nonlinear) relationship with the phenotype. Another strong advantage of the Gaussian process is its ability to deal with high-dimensional data compared to other nonparametric techniques such as spline regression (Wahba 1984; Heckman 1986; Chen 1988; Speckman 1988; Cuzick 1992; Hastie and Loader 1993). There has been growing interest in using Gaussian processes as a unifying framework for studying multivariate regression (Rasmussen 1996), pattern classification (Williams and Barber 1998), and hierarchical modeling (Menzefricke 2000). In this article, we build on this work and propose a nonparametric Bayesian method for multiple-QTL mapping by including not only nongenetic covariates but also all candidate QTL in the unknown function. A Gaussian process prior (GPP) is again placed on the unknown function, and a variable selection approach is implemented for the hyperparameters of the GPP (one for each QTL and nongenetic covariate). Here, we rely on mixture priors and MCMC implementation, and we focus on linkage mapping in inbred line crosses, while in ongoing and future work we consider shrinkage priors, deterministic algorithms, and association mapping. Our application of the GPP differs from “standard” applications in that the QTL covariates included in the unknown function are discrete, not continuous, with a small number (two or three) of possible values (the genotype codes). The goal of using a GPP here is not curve or response-surface modeling but rather high-dimensional variable selection (QTL and nongenetic covariates) with a method requiring only a single parameter for each variable while accounting for any multiway interactions among the candidate variables.

To improve current methods for linkage mapping in inbred line crosses and for association analysis of human populations, we need to be able to detect QTL irrespective of whether they act mostly through main effects, interactions with other QTL, or interactions with the environment. Fitting a parametric model including all these potential effects for a genome-wide search would substantially increase the multiple-testing problem, in addition to being computationally extremely demanding. Here we offer an alternative. We show that our nonparametric Bayesian method can identify QTL irrespective of whether they act through main effects, through interactions with other QTL, or through interactions with environmental factors. This method cannot identify the source(s) of a QTL's importance (main or interaction effects involving this QTL). Therefore, once a small number of important QTL have been identified in a genome-wide scan, these QTL can be further analyzed with detailed parametric models to determine the source(s) of their importance.

The remainder of the article is organized as follows. We first present the nonparametric multiple-QTL model and outline the MCMC sampler. Simulation results and the analysis of a real data set are presented in the following section, and we end with a discussion and conclusions.
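The mechanism that lets one hyperparameter per variable capture main and interaction effects alike is the covariance (kernel) function of the Gaussian process. Below is a sketch of an automatic-relevance-determination kernel of the kind described, written from the verbal description above rather than from the authors' code:

```r
# ARD-style Gaussian kernel: one relevance weight rho[k] per QTL/covariate.
# A variable with rho near 0 drops out of the distance -- and hence out of
# the model -- whether it acts through a main effect or any-order interaction.
ard_kernel <- function(X, rho, sigma2 = 1) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d2 <- sum(rho * (X[i, ] - X[j, ])^2)   # weighted squared distance
    K[i, j] <- sigma2 * exp(-d2)
  }
  K   # covariance of the unknown function at the observed inputs
}

set.seed(5)
X <- cbind(sample(0:2, 30, TRUE), sample(0:2, 30, TRUE), runif(30))
K <- ard_kernel(X, rho = c(1.2, 0, 0.3))   # second variable is irrelevant here
```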

11.
The predictive ability of models for litter size in swine based on different sources of genetic information was investigated. Data represented average litter size on 2598, 1604 and 1897 60K-genotyped sows from two purebred lines and one crossbred line, respectively. The average correlation (r) between observed and predicted phenotypes in a 10-fold cross-validation was used to assess predictive ability. Models were: pedigree-based mixed-effects model (PED), Bayesian ridge regression (BRR), Bayesian LASSO (BL), genomic BLUP (GBLUP), reproducing kernel Hilbert spaces regression (RKHS), Bayesian regularized neural networks (BRNN) and radial basis function neural networks (RBFNN). BRR and BL used the marker matrix or its principal component scores matrix (UD) as covariates; RKHS employed a Gaussian kernel with additive codes for markers, whereas the neural networks employed the additive genomic relationship matrix (G) or UD as inputs. The non-parametric models (RKHS, BRNN, RBFNN) gave predictions similar to their parametric counterparts (average r ranged from 0.15 to 0.23); most of the genome-based models outperformed PED (r = 0.16). Predictive abilities of the linear models and RKHS were similar across lines, but BRNN varied markedly, giving the best prediction (r = 0.31) when G was used in crossbreds, but the worst (r = 0.02) when G was used in one of the purebred lines. The r values for RBFNN ranged from 0.16 to 0.23. Predictive ability was better in crossbreds (0.26) than in purebreds (0.15 to 0.22), which may be related to family structure in the purebred lines.

12.
The Bayesian LASSO (BL) has been shown to be an effective approach to sparse model representation and has been successfully applied to quantitative trait loci (QTL) mapping and genomic breeding value (GBV) estimation using genome-wide dense sets of markers. However, the BL relies on a single parameter, known as the regularization parameter, to simultaneously control the overall model sparsity and the shrinkage of individual covariate effects. This may be idealistic when dealing with a large number of predictors whose effect sizes may differ by orders of magnitude. Here we propose the extended Bayesian LASSO (EBL) for QTL mapping and unobserved phenotype prediction, which introduces an additional level to the hierarchical specification of the BL to explicitly separate these two model features. Compared to the adaptiveness of the BL, the EBL is “doubly adaptive” and thus more robust to tuning. In simulations, the EBL outperformed the BL in regard to the accuracy of both effect-size estimates and phenotypic value predictions, with comparable computational time. Moreover, the EBL proved to be less sensitive to tuning than the related Bayesian adaptive LASSO (BAL), which also introduces locus-specific regularization parameters but involves no mechanism for distinguishing between model sparsity and parameter shrinkage. Consequently, the EBL points to a new direction for QTL mapping, phenotype prediction, and GBV estimation.

Regularization or shrinkage methods are gaining increasing recognition as a valuable alternative to variable selection techniques in dealing with oversaturated or otherwise ill-defined regression problems in both the classical and Bayesian frameworks (e.g., O'Hara and Sillanpää 2009). Many studies (e.g., Xu 2003; Wang et al. 2005; Zhang and Xu 2005; De los Campos et al. 2009; Usai et al. 2009; Wu et al. 2009; Xu et al. 2009) have documented the potential of shrinkage methods for QTL mapping and GBV estimation using genome-wide dense sets of markers. Lee et al. (2008) make a clear connection between phenotype prediction and GBV estimation, suggesting that methods developed for one are also applicable to the other; we thus use the two concepts interchangeably throughout this article. Regularized regression methods, such as ridge regression (Hoerl and Kennard 1970) or the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), are essentially penalized likelihood procedures, where suitable penalty functions are added to the negative log-likelihood to automatically shrink spurious effects (effects of redundant covariates) toward zero, while allowing relevant effects to take values farther from zero.

It has been pointed out that these non-Bayesian shrinkage methods are not suitable for oversaturated models. Zou and Hastie (2005) and Park and Casella (2008) noted that the LASSO cannot select a number of nonzero effects exceeding the sample size. Xu (2003) found that for ridge regression to work, the number of model effects should be of the same order as the number of observations. This is impractical for genomic selection, which capitalizes on the variation due to small marker effects, the number of which can exceed the sample size, in contrast to QTL mapping, where interest lies mostly in a small subset of loci with large effects on the focal phenotype. In connection with the LASSO, the Bayesian LASSO (BL) (Park and Casella 2008; Yi and Xu 2008) has been proposed to overcome this limitation by imposing selective shrinkage across regression parameters. Xu (2003) also proposed a Bayesian shrinkage method for QTL mapping, which extends ridge regression in a similar fashion.

Although the BL has been successfully applied to QTL mapping (e.g., Yi and Xu 2008) and to GBV estimation (e.g., De los Campos et al. 2009), it relies on a single regularization parameter to simultaneously regulate the overall model sparsity and the extent to which individual regression coefficients are shrunken. This is unrealistic when dealing with a large number of predictors whose effect sizes may differ by orders of magnitude. It is therefore natural to ask whether this practice can be relaxed and how such an attempt may impinge on model performance (e.g., Sun et al. 2010). Here we propose an extension to the Bayesian LASSO for QTL mapping and unobserved phenotype prediction. Our method, the extended Bayesian LASSO (EBL), introduces locus-specific regularization parameters and utilizes a parameterization that clearly separates the overall model sparsity from the degree of shrinkage of individual regression parameters. We use simulated data to investigate the performance of the EBL relative to the BL in mapping QTL and in predicting unobserved phenotypes. We also compare the EBL to the Bayesian adaptive LASSO (BAL) recently proposed by Sun et al. (2010), which also assumes locus-specific regularization parameters.
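In symbols, the contrast between the two hierarchies can be sketched roughly as follows; the notation is mine and simplified (the published parameterizations scale by the residual variance and differ in other details):

```latex
\text{BL:}\quad \beta_j \mid \tau_j^2 \sim N(0, \tau_j^2), \qquad \tau_j^2 \sim \mathrm{Exp}(\lambda^2/2);
\qquad
\text{EBL:}\quad \tau_j^2 \sim \mathrm{Exp}(\lambda_j^2/2), \qquad \lambda_j = \delta\,\eta_j .
```

A single δ sets the overall model sparsity while each η_j modulates shrinkage at locus j; this two-factor decomposition is the explicit separation the abstract refers to.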

13.

Background

Residual feed intake (RFI), a measure of feed efficiency, is the difference between observed feed intake and the expected feed requirement predicted from growth and maintenance. Pigs with low RFI have reduced feed costs without compromising their growth. Identification of genes or genetic markers associated with RFI will be useful for marker-assisted selection at an early age of animals with improved feed efficiency.

Methodology/Principal findings

Whole-genome association studies (WGAS) for RFI, average daily feed intake (ADFI), average daily gain (ADG), back fat (BF) and loin muscle area (LMA) were performed on 1,400 pigs from the divergently selected ISU-RFI lines, using the Illumina PorcineSNP60 BeadChip. Various statistical methods were applied to find SNPs and genomic regions associated with the traits, including a Bayesian approach using GenSel software and frequentist approaches such as allele-frequency differences between lines and single-SNP and haplotype analyses using PLINK software. Single-SNP and haplotype analyses showed no significant associations (except for LMA) after genomic control and FDR correction. Bayesian analyses found at least two associations for each trait at a false-positive probability of 0.5. At generation 8, the RFI selection lines differed mainly in allele frequencies for SNPs near (<0.05 Mb) genes that regulate insulin release and leptin function. The Bayesian approach identified associations of genomic regions containing insulin-release genes (e.g., GLP1R, CDKAL, SGMS1) with RFI and ADFI, of regions with energy-homeostasis (e.g., MC4R, PGM1, GPR81) and muscle-growth-related genes (e.g., TGFB1) with ADG, and of fat-metabolism genes (e.g., ACOXL, AEBP1) with BF. In particular, a strongly associated QTL for LMA on SSC7, containing skeletal-myogenesis genes (e.g., KLHL31), was identified for subsequent fine mapping.

Conclusions/significance

Important genomic regions associated with RFI-related traits were identified for future validation studies prior to their incorporation into marker-assisted selection programs.

14.

Background

The prediction accuracy of several linear genomic prediction models, which have previously been used for within-line genomic prediction, was evaluated for multi-line genomic prediction.

Methods

Compared to a conventional BLUP (best linear unbiased prediction) model using pedigree data, we evaluated the following genomic prediction models: genome-enabled BLUP (GBLUP), ridge regression BLUP (RRBLUP), principal component analysis followed by ridge regression (RRPCA), BayesC and Bayesian stochastic search variable selection. Prediction accuracy was measured as the correlation between predicted breeding values and observed phenotypes divided by the square root of the heritability. The data used concerned laying hens with phenotypes for number of eggs in the first production period and known genotypes. The hens were from two closely-related brown layer lines (B1 and B2), and a third distantly-related white layer line (W1). Lines had 1004 to 1023 training animals and 238 to 240 validation animals. Training datasets consisted of animals of either single lines, or a combination of two or all three lines, and had 30 508 to 45 974 segregating single nucleotide polymorphisms.

Results

Genomic prediction models yielded 0.13 to 0.16 higher accuracies than pedigree-based BLUP. When excluding the line itself from the training dataset, genomic predictions were generally inaccurate. Use of multiple lines marginally improved prediction accuracy for B2 but did not affect or slightly decreased prediction accuracy for B1 and W1. Differences between models were generally small except for RRPCA which gave considerably higher accuracies for B2. Correlations between genomic predictions from different methods were higher than 0.96 for W1 and higher than 0.88 for B1 and B2. The greater differences between methods for B1 and B2 were probably due to the lower accuracy of predictions for B1 (~0.45) and B2 (~0.40) compared to W1 (~0.76).

Conclusions

Multi-line genomic prediction did not affect or slightly improved prediction accuracy for closely-related lines. For distantly-related lines, multi-line genomic prediction yielded similar or slightly lower accuracies than single-line genomic prediction. Bayesian variable selection and GBLUP generally gave similar accuracies. Overall, RRPCA yielded the greatest accuracies for two lines, suggesting that using PCA helps to alleviate the “n ≪ p” problem in genomic prediction.

Electronic supplementary material

The online version of this article (doi:10.1186/s12711-014-0057-5) contains supplementary material, which is available to authorized users.
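A compact sketch of the RRPCA idea named above: principal component scores of the marker matrix used as covariates in a ridge regression. Dimensions, the number of components and the penalty are all illustrative choices:

```r
# RRPCA sketch: PCA on the marker matrix, then ridge regression on the scores.
set.seed(9)
n <- 300; p <- 5000
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
y <- as.numeric(X[, 1:50] %*% rnorm(50, sd = 0.2) + rnorm(n))

pc <- prcomp(X, center = TRUE)
k  <- 200                        # number of leading components kept (ad hoc here)
S  <- pc$x[, 1:k]                # n x k score matrix replaces the n x p markers

lambda <- 5
alpha_hat <- solve(crossprod(S) + lambda * diag(k), crossprod(S, y))
gebv <- S %*% alpha_hat          # ridge predictions on the reduced space
```

Because the scores concentrate most marker variation in far fewer dimensions than p, this is one direct way of easing the “n ≪ p” problem the conclusions mention.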

15.

Background

Since both the number of SNPs (single nucleotide polymorphisms) used in genomic prediction and the number of individuals used in training datasets are rapidly increasing, there is an increasing need to improve the efficiency of genomic prediction models in terms of computing time and memory (RAM) required.

Methods

In this paper, two alternative algorithms for genomic prediction are presented that replace the originally suggested residual updating algorithm, without affecting the estimates. The first alternative algorithm continues to use residual updating, but takes advantage of the characteristic that the predictor variables in the model (i.e. the SNP genotypes) take only three different values, and is therefore termed “improved residual updating”. The second alternative algorithm, here termed “right-hand-side updating” (RHS-updating), extends the idea of improved residual updating across multiple SNPs. The alternative algorithms can be implemented for a range of different genomic predictions models, including random regression BLUP (best linear unbiased prediction) and most Bayesian genomic prediction models. To test the required computing time and RAM, both alternative algorithms were implemented in a Bayesian stochastic search variable selection model.

Results

Compared to the original algorithm, the improved residual updating algorithm reduced CPU time by 35.3 to 43.3%, without changing memory requirements. The RHS-updating algorithm reduced CPU time by 74.5 to 93.0% and memory requirements by 13.1 to 66.4% compared to the original algorithm.

Conclusions

The presented RHS-updating algorithm provides an interesting alternative to reduce both computing time and memory requirements for a range of genomic prediction models.
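To illustrate what is being improved: single-site samplers keep a running residual r = y − Xb, and updating one SNP effect normally costs a multiply-add over all n records. Because genotypes take only the values 0/1/2, the same update can be done with plain additions on precomputed index sets. A sketch of both variants follows (the paper's RHS-updating then extends this idea across blocks of SNPs):

```r
# Classic residual update for SNP j: one multiply-add over all records.
update_classic <- function(x, r, b_old, b_new) {
  r - x * (b_new - b_old)
}

# Improved variant: records with genotype 0 are untouched; those with
# genotype 1 or 2 are adjusted by a scalar -- no multiplication by x needed.
update_improved <- function(idx1, idx2, r, b_old, b_new) {
  d <- b_new - b_old
  r[idx1] <- r[idx1] - d       # genotype 1: subtract d
  r[idx2] <- r[idx2] - 2 * d   # genotype 2: subtract 2d
  r
}

# The right-hand side x_j' r simplifies the same way:
# sum(r[idx1]) + 2 * sum(r[idx2]).
```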

16.

Background

Nellore cattle play an important role in beef production in tropical systems, and there is great interest in determining whether genomic selection can help accelerate genetic improvement of production and fertility in this breed. We present the first results of the implementation of genomic prediction in a Bos indicus (Nellore) population.

Methods

Influential bulls were genotyped with the Illumina BovineHD chip in order to assess genomic predictive ability for weight and carcass traits, gestation length, scrotal circumference and two selection indices. In total, 685 samples and 320 238 single nucleotide polymorphisms (SNPs) were used in the analyses. A forward-prediction scheme was adopted to predict the genomic breeding values (DGV). In the training step, the estimated breeding values (EBV) of bulls were deregressed (dEBV) and used as pseudo-phenotypes to estimate marker effects using four methods: genomic BLUP with or without a residual polygenic effect (GBLUP20 and GBLUP0, respectively), a mixture model (Bayes C) and the Bayesian LASSO (BLASSO). Empirical accuracies of the resulting genomic predictions were assessed based on the correlation between DGV and dEBV in the testing group.

Results

Accuracies of genomic predictions ranged from 0.17 (navel at weaning) to 0.74 (finishing precocity). Across traits, Bayesian regression models (Bayes C and BLASSO) were more accurate than GBLUP. The average empirical accuracies were 0.39 (GBLUP0), 0.40 (GBLUP20) and 0.44 (Bayes C and BLASSO). Bayes C and BLASSO tended to produce deflated predictions (i.e. slope of the regression of dEBV on DGV greater than 1). Further analyses suggested that higher-than-expected accuracies were observed for traits for which EBV means differed significantly between two breeding subgroups that were identified in a principal component analysis based on genomic relationships.

Conclusions

Bayesian regression models are of interest for future applications of genomic selection in this population, but further improvements are needed to reduce deflation of their predictions. Recurrent updates of the training population would be required to enable accurate prediction of the genetic merit of young animals. The technical feasibility of applying genomic prediction in a Bos indicus (Nellore) population was demonstrated. Further research is needed to permit cost-effective selection decisions using genomic information.

17.
Genomic best linear unbiased prediction (GBLUP) assumes equal variance for all marker effects, which is suitable for traits that conform to the infinitesimal model. For traits controlled by major genes, Bayesian methods with shrinkage priors or genome-wide association study (GWAS) methods can be used to identify causal variants effectively, and the information from Bayesian/GWAS methods can be used to construct a weighted genomic relationship matrix (G). However, it remains unclear which methods perform best for traits of varying genetic architecture. We therefore developed several methods to optimize the performance of weighted GBLUP and compared them with other available methods using simulated and real data sets. First, two types of methods (marker effects with local-shrinkage or normal priors) were used to obtain test statistics and estimates for each marker effect. Second, three weighted G matrices were constructed based on the marker information from the first step: (1) the genomic-feature-weighted G, (2) the estimated-marker-variance-weighted G, and (3) the absolute-value-of-the-estimated-marker-effect-weighted G. From this process, six different weighted GBLUP methods (local-shrinkage/normal-prior GF/EV/AEWGBLUP) were proposed for genomic prediction. Analyses with both simulated and real data demonstrated that these options offer flexibility for optimizing weighted GBLUP for traits with a broad spectrum of genetic architectures. The advantage of the weighting methods over GBLUP in terms of accuracy was trait-dependent, ranging from 14.8% to marginal for simulated traits and from 44% to marginal for real traits. Local-shrinkage-prior EVWGBLUP is superior for traits mainly controlled by loci of large effect. Normal-prior AEWGBLUP performs well for traits mainly controlled by loci of moderate effect. For traits controlled by some loci with large effects (explaining 25–50% of the genetic variance) and a range of loci with small effects, GFWGBLUP has advantages. In conclusion, the optimal weighted GBLUP method for genomic selection should carefully take into account both the genetic architecture and the number of QTL of the trait. Subject terms: Quantitative trait, Genome-wide association studies, Animal breeding
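A sketch of the shared ingredient of all these weighting schemes: a VanRaden-style genomic relationship matrix with a diagonal of marker weights (estimated marker variances, absolute effects, or feature indicators). The normalization shown is one common convention, not necessarily the exact one used in the paper:

```r
# Weighted genomic relationship matrix G = Z diag(d) Z' / c.
weighted_G <- function(M, d) {
  # M: n x m genotypes coded 0/1/2; d: m marker weights
  p <- colMeans(M) / 2                 # allele frequencies
  Z <- sweep(M, 2, 2 * p)              # center by twice the allele frequency
  d <- d * length(d) / sum(d)          # rescale weights to mean 1
  G <- Z %*% (d * t(Z))                # Z diag(d) Z'
  G / sum(2 * p * (1 - p) * d)         # assumed weighted normalization constant
}

set.seed(13)
M <- matrix(rbinom(100 * 500, 2, 0.3), 100, 500)
G <- weighted_G(M, d = runif(500, 0.5, 1.5))   # e.g. d = estimated marker variances
```

Plugging such a G into the standard GBLUP equations yields the weighted GBLUP variants compared above.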

18.

Background

In quantitative trait mapping and genomic prediction, Bayesian variable selection methods have gained popularity in conjunction with the increase in marker data and computational resources. Whereas shrinkage-inducing methods are common tools in genomic prediction, rigorous decision making in mapping studies using such models is not well established, and the robustness of posterior results is vulnerable to misspecified assumptions because biological prior evidence is weak.

Methods

Here, we evaluate the impact of prior specifications in a shrinkage-based Bayesian variable selection method which is based on a mixture of uniform priors applied to genetic marker effects that we presented in a previous study. Unlike most other shrinkage approaches, the use of a mixture of uniform priors provides a coherent framework for inference based on Bayes factors. To evaluate the robustness of genetic association under varying prior specifications, Bayes factors are compared as signals of positive marker association, whereas genomic estimated breeding values are considered for genomic selection. The impact of specific prior specifications is reduced by calculation of combined estimates from multiple specifications. A Gibbs sampler is used to perform Markov chain Monte Carlo estimation (MCMC) and a generalized expectation-maximization algorithm as a faster alternative for maximum a posteriori point estimation. The performance of the method is evaluated by using two publicly available data examples: the simulated QTLMAS XII data set and a real data set from a population of pigs.

Results

Combined estimates of Bayes factors were very successful in identifying quantitative trait loci, and the ranking of Bayes factors was fairly stable among markers with positive signals of association under varying prior assumptions, but their magnitudes varied considerably. Genomic estimated breeding values using the mixture of uniform priors compared well to other approaches for both data sets and loss of accuracy with the generalized expectation-maximization algorithm was small as compared to that with MCMC.

Conclusions

Since no error-free method to specify priors is available for complex biological phenomena, exploring a wide variety of prior specifications and combining results provides some solution to this problem. For this purpose, the mixture of uniform priors approach is especially suitable, because it comprises a wide and flexible family of distributions and computationally intensive estimation can be carried out in a reasonable amount of time.

19.
Coalescent-based inference of phylogenetic relationships among species takes into account gene-tree incongruence due to incomplete lineage sorting, but for such methods to make sense, species have to be correctly delimited. Because alternative assignments of individuals to species result in different parametric models, model selection methods can be applied to optimise the species classification model. In a Bayesian framework, Bayes factors (BF), based on marginal likelihood estimates, can be used to test a range of possible classifications for the group under study. Here, we explore BF and the Akaike information criterion (AIC) to discriminate between different species classifications in the flowering plant lineage Silene sect. Cryptoneurae (Caryophyllaceae). We estimated marginal likelihoods for different species classification models via the path sampling (PS), stepping-stone sampling (SS), and harmonic mean estimator (HME) methods implemented in BEAST. To select among alternative species classification models, a posterior simulation-based analog of the AIC computed through Markov chain Monte Carlo analysis (AICM) was also used. The results are compared to outcomes from the software BP&P. Our results agree with another recent study that marginal likelihood estimates from the PS and SS methods are useful for comparing different species classifications, and they strongly support the recognition of the newly described species S. ertekinii.
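For orientation, the quantities involved, in standard form (definitions not specific to this paper):

```latex
\mathrm{BF}_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}, \qquad
\hat{p}_{\mathrm{HME}}(y \mid M) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{1}{p\!\left(y \mid \theta^{(i)}, M\right)} \right]^{-1},
\quad \theta^{(i)} \sim p(\theta \mid y, M).
```

PS and SS instead integrate the expected log-likelihood along a path of power posteriors between prior and posterior, which is why they behave far better than the notoriously unstable harmonic mean estimator, consistent with the preference for PS and SS reported here.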

20.
This study aimed to assess the predictive ability of different machine learning (ML) methods for genomic prediction of reproductive traits in Nellore cattle. The studied traits were age at first calving (AFC), scrotal circumference (SC), early pregnancy (EP) and stayability (STAY). The numbers of genotyped animals and SNP markers available were 2342 and 321 419 (AFC), 4671 and 309 486 (SC), 2681 and 319 619 (STAY) and 3356 and 319 108 (EP). The predictive abilities of support vector regression (SVR), Bayesian regularized artificial neural networks (BRANN) and random forest (RF) were compared with results obtained using parametric models (genomic best linear unbiased predictor, GBLUP, and Bayesian least absolute shrinkage and selection operator, BLASSO). A 5-fold cross-validation strategy was performed and the average prediction accuracy (ACC) and mean squared errors (MSE) were computed. The ACC was defined as the linear correlation between predicted and observed breeding values for categorical traits (EP and STAY) and as the correlation between predicted and observed adjusted phenotypes divided by the square root of the estimated heritability for continuous traits (AFC and SC). The average ACC varied from low to moderate depending on the trait and model under consideration, ranging between 0.56 and 0.63 (AFC), 0.27 and 0.36 (SC), 0.57 and 0.67 (EP), and 0.52 and 0.62 (STAY). SVR provided slightly better accuracies than the parametric models for all traits, increasing prediction accuracy for AFC by around 6.3 and 4.8% compared with GBLUP and BLASSO, respectively. Likewise, there were increases of 8.3% for SC, 4.5% for EP and 4.8% for STAY when comparing SVR with both GBLUP and BLASSO. In contrast, RF and BRANN did not present competitive predictive ability compared with the parametric models. The results indicate that SVR is a suitable method for genome-enabled prediction of reproductive traits in Nellore cattle. Further, the optimal kernel bandwidth parameter in the SVR model was trait-dependent; thus, fine-tuning of this hyper-parameter in the training phase is crucial.
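A small sketch of SVR with a radial (Gaussian) kernel on marker data, including a grid search over the bandwidth γ, since the abstract stresses that tuning this hyper-parameter is crucial. It uses the e1071 package; the grid values and data dimensions are illustrative assumptions:

```r
library(e1071)

set.seed(21)
n <- 300; p <- 1000
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
y <- as.numeric(X[, 1:30] %*% rnorm(30, sd = 0.3) + rnorm(n))

train <- sample(n, 240)
tuned <- tune.svm(x = X[train, ], y = y[train],
                  gamma = c(5e-4, 1e-3, 5e-3), cost = 1)  # bandwidth grid
fit  <- tuned$best.model
pred <- predict(fit, X[-train, ])
cor(pred, y[-train])   # holdout accuracy, analogous to the cross-validation above
```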
