Graph-Based Data Selection for the Construction of Genomic Prediction Models |
| |
Authors: | Steven Maenhout, Bernard De Baets, Geert Haesaert |
| |
Affiliation: | *Department of Biosciences and Landscape Architecture, University College Ghent, B-9000 Gent, Belgium, §Department of Applied Mathematics, Biometrics and Process Control, Ghent University, B-9000 Gent, Belgium |
| |
Abstract: | Efficient genomic selection in animals or crops requires the accurate prediction of the agronomic performance of individuals from their high-density molecular marker profiles. Using a training data set that contains the genotypic and phenotypic information of a large number of individuals, each marker or marker allele is associated with an estimated effect on the trait under study. These estimated marker effects are subsequently used for making predictions on individuals for which no phenotypic records are available. As most plant and animal breeding programs are currently still phenotype driven, the continuously expanding collection of phenotypic records can only be used to construct a genomic prediction model if a dense molecular marker fingerprint is available for each phenotyped individual. However, as the genotyping budget is generally limited, the genomic prediction model can only be constructed using a subset of the tested individuals and possibly a genome-covering subset of the molecular markers. In this article, we demonstrate how an optimal selection of individuals can be made with respect to the quality of their available phenotypic data. We also demonstrate how the total number of molecular markers can be reduced while a maximum genome coverage is ensured. The third selection problem we tackle is specific to the construction of a genomic prediction model for a hybrid breeding program where only molecular marker fingerprints of the homozygous parents are available. We show how to identify the set of parental inbred lines of a predefined size that has produced the highest number of progeny. 
These three selection approaches are put into practice in a simulation study where we demonstrate how the trade-off between sample size and sample quality affects the prediction accuracy of genomic prediction models for hybrid maize.
Despite the numerous studies devoted to molecular marker-based breeding, the genetic progress of most complex traits in today's plant and animal breeding programs still relies heavily on phenotypic selection. Most breeding companies have established dedicated databases that store the vast number of phenotypic records that are routinely collected throughout the course of their breeding programs. These phenotypic records are, however, gradually being complemented by various types of molecular marker scores and it is to be expected that effective marker-based selection schemes will eventually allow current phenotyping efforts to be reduced (Bernardo 2008; Hayes et al. 2009). The available marker and phenotypic databases already allow for the construction and validation of marker-based selection schemes. Mining the phenotypic databases of a breeding company is, however, quite different from analyzing the data that are generated by a carefully designed experiment. Genetic evaluation data are often severely unbalanced, as elite individuals are usually tested many times on their way to becoming a commercial variety or sire, while lower-performing individuals are often disregarded after a single trial. Furthermore, the different phenotypic evaluation trials are separated in time and space and are, as such, subjected to different environmental conditions. Therefore, ranking the performance of individuals that were evaluated in different phenotypic trials is usually a nontrivial task.
Animal breeders are well experienced when it comes to handling unbalanced genetic evaluation data. 
The best linear unbiased predictor or BLUP approach (Henderson 1975) represented a major breakthrough in this respect, especially when combined with restricted maximum-likelihood or REML estimation of the required variance components (Patterson and Thompson 1971). Somewhat later, this linear mixed modeling approach was also adopted by plant breeders as the de facto standard for handling unbalanced phenotypic data. The more recent developments in genomic selection (Bernardo 1995; Meuwissen et al. 2001; Gianola and van Kaam 2008) and marker-trait association studies (Yu et al. 2006) are, at least partially, BLUP-based and are therefore, in theory, perfectly suited for mining the large marker and phenotypic databases that back each breeding program. In practice, however, the unbalancedness of the available genetic evaluation data often reduces their total information content and the construction of a marker-based selection model is limited to a more balanced subset of the data.
As phenotypic data are already available, it is the genotyping costs that limit the total number of individuals that can be included in the construction of a genomic prediction model. The best results will be obtained by selecting a subset of individuals for which the phenotypic evaluation data exhibit the least amount of unbalancedness. In this article we demonstrate how this phenotypic subset selection problem can be translated into a standard graph theory problem that can be solved with exact algorithms or less time-consuming heuristics.
In most plant and animal species, the number of available molecular markers is rapidly increasing, while the genotyping cost per marker is decreasing. Nevertheless, as budgets are always limited, genotyping all mapped markers for a small number of individuals might be less efficient than genotyping a restricted set of well-chosen markers on a wider set of individuals. 
One should therefore be able to select a subset of molecular markers that covers the entire genome as uniformly as possible. We demonstrate how this marker selection problem can also be translated into a well-known graph theory problem that has an exact solution.
The third problem we tackle by means of graph theory is more specific to hybrid breeding programs where the parental individuals are nearly or completely homozygous. This implies that we can deduce the molecular marker fingerprint of a hybrid individual from the marker scores of its parents. As the phenotypic data are collected on the hybrids, genotyping costs can be reduced by selecting a subset of parental inbreds that have produced the maximum number of genetically distinct offspring among themselves. Obviously, the phenotypic data on these offspring should be as balanced as possible.
Besides solving the above-mentioned selection problems by means of graph theory algorithms, we demonstrate their use in a simulation study that allows us to determine the optimum trade-off between the number of individuals and the size of the genotyped molecular marker fingerprint for predicting the phenotypic performance of hybrid maize by means of ε-insensitive support vector machine regression (ε-SVR) (Maenhout et al. 2007, 2008, 2010) and best linear prediction (BLP) (Bernardo 1994, 1995, 1996). |
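The parental inbred selection problem described above can be pictured as a graph in which each inbred line is a vertex and each genetically distinct hybrid is an edge between its two parents. The goal is then to pick a vertex subset of predefined size that contains as many edges as possible. The following is a minimal sketch under that assumption. It uses plain exhaustive enumeration, which is only feasible for small examples and is not necessarily the algorithm used in the article. The function name `best_parent_subset` and the toy pedigree are hypothetical and serve only as illustration.

```python
from itertools import combinations


def best_parent_subset(hybrids, k):
    """Return (parents, n_hybrids) for the k parents sharing the most hybrids.

    hybrids: iterable of (parent_1, parent_2) pairs, one per phenotyped hybrid.
    k: predefined number of parental inbred lines that can be genotyped.
    Illustrative brute-force search (assumption: small number of parents).
    """
    # Each genetically distinct hybrid is one undirected edge; reciprocal
    # crosses (A x B and B x A) and repeated records collapse to one edge.
    edges = {frozenset(pair) for pair in hybrids if pair[0] != pair[1]}
    parents = sorted({p for edge in edges for p in edge})

    best_subset, best_count = (), -1
    for subset in combinations(parents, k):
        chosen = set(subset)
        # Count the hybrids whose two parents are both in the chosen subset.
        count = sum(1 for edge in edges if edge <= chosen)
        if count > best_count:
            best_subset, best_count = subset, count
    return best_subset, best_count


# Hypothetical toy data: five inbreds, six hybrid records (one reciprocal).
toy_hybrids = [("A", "B"), ("A", "C"), ("B", "C"),
               ("C", "D"), ("D", "E"), ("B", "A")]
subset, count = best_parent_subset(toy_hybrids, 3)
print(subset, count)  # ('A', 'B', 'C') 3
```

Because the number of candidate subsets grows combinatorially with the number of parents, a realistic breeding data set would call for a dedicated exact algorithm or a heuristic rather than this enumeration.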