Enhancing Genome-Enabled Prediction by Bagging Genomic BLUP


Daniel Gianola 1,2,3*, Kent A. Weigel 2, Nicole Krämer 4, Alessandra Stella 5, Chris-Carolin Schön 4

1 Department of Animal Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 2 Department of Dairy Science, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 3 Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America, 4 Department of Plant Breeding, Technical University of Munich, Weihenstephan Center, Munich, Germany, 5 Bioinformatics and Statistics Unit, Parco Tecnologico Padano, Lodi, Italy

Abstract

We examined whether or not the predictive ability of genomic best linear unbiased prediction (GBLUP) could be improved via a resampling method used in machine learning: bootstrap aggregating sampling ("bagging"). In theory, bagging can be useful when the predictor has large variance or when the number of markers is much larger than sample size, preventing effective regularization. After presenting a brief review of GBLUP, bagging was adapted to the context of GBLUP, both at the level of the genetic signal and of marker effects. The performance of bagging was evaluated with four simulated case studies including known or unknown quantitative trait loci, and an application was made to real data on grain yield in wheat planted in four environments. A metric aimed at quantifying candidate-specific cross-validation uncertainty was proposed and assessed; as expected, model-derived theoretical reliabilities bore no relationship with cross-validation accuracy. It was found that bagging can ameliorate the predictive performance of GBLUP and make it more robust against overfitting. Seemingly, 25-50 bootstrap samples were enough to attain reasonable predictions as well as stable measures of individual predictive mean squared errors.

Citation: Gianola D, Weigel KA, Krämer N, Stella A, Schön C-C (2014) Enhancing Genome-Enabled Prediction by Bagging Genomic BLUP. PLoS ONE 9(4): e91693. doi:10.1371/journal.pone.0091693

Editor: Qin Zhang, China Agricultural University, China

Received December 18, 2013; Accepted February 13, 2014; Published April 10, 2014

Copyright: © 2014 Gianola et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Research was partially supported by the Federal Ministry of Education and Research (BMBF, Germany) within the AgroClustEr Synbreed - Synergistic plant and animal breeding (FKZ A), by a U.S. Department of Agriculture Hatch Grant (14-PRJ63CV) to DG, and by the Wisconsin Agriculture Experiment Station. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* gianola@ansci.wisc.edu

Introduction

A method for whole-genome enabled prediction of quantitative traits known as GBLUP, standing for genomic best linear unbiased prediction, was seemingly suggested first by VanRaden [1-2]. In GBLUP, a pedigree-based relationship matrix among individuals [3] is replaced by a matrix-valued measure of genomic similarities constructed using molecular markers, such as single nucleotide polymorphisms (SNPs) [4].
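For orientation, the following is a minimal sketch, not the authors' code, of one common way (a VanRaden-type scaling) to build such a genomic similarity matrix from SNP codes; sizes, frequencies and all object names are illustrative.

```r
# Minimal sketch (illustrative, not the paper's software): a VanRaden-type
# genomic relationship matrix from SNP genotypes coded 0/1/2.
set.seed(1)
n <- 100; p <- 500
M <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)  # SNP codes

freq <- colMeans(M) / 2                              # estimated allele frequencies
Z    <- sweep(M, 2, 2 * freq)                        # center each marker column
G    <- tcrossprod(Z) / sum(2 * freq * (1 - freq))   # G = ZZ' / sum 2p(1-p)
G[1:3, 1:3]
```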
This genomic relationship matrix or variants thereof [5-6], referred to as G hereinafter, defines a covariance structure among individuals (even if genetically unrelated in the standard sense), stemming from molecular similarity in state at additive marker loci among members of a sample. Given G and values of some needed variance components, one can use the theory of best linear unbiased prediction (BLUP) to obtain point and interval (e.g., prediction error variances) estimates of the genetic values of a set of individuals as marked by the battery of SNPs. While G can be constructed in different manners, we do not address this issue here. However, we note that it is possible to separate genomic from residual variance components statistically even in the absence of genetic relatedness. Hence, care must be exercised when relating the genomic to the genetic variance; this is discussed in [7].

Using theory developed by Henderson [8-10], it can be shown that GBLUP is equivalent to a linear regression model on additive genotypic codes of markers, with the allelic substitution effects at marker loci treated as drawn independently from a distribution possessing a constant variance over markers [11-13]. There is also an equivalence between pedigree-based BLUP or GBLUP (or models using both pedigree and marker relationships) and nonparametric regression [14]. For instance, if the $n \times p$ marker matrix $X$ ($n$ = number of observations, $p$ = number of markers) is used to construct a kernel matrix $XX'$, it can be established that GBLUP is a reproducing kernel Hilbert spaces regression procedure [15-16]. Also, BLUP and GBLUP can be represented as linear neural networks where inputs are entries of the pedigree-based or G matrices, respectively [17]. Hence, GBLUP can be motivated from several different perspectives.

There are many competing procedures for genome-enabled prediction, such as the members of the growing Bayesian alphabet [14],[18-19], but most of these require a Bayesian Markov chain Monte Carlo (MCMC) implementation. On the other hand, GBLUP is simple, easy to understand, explain and compute, and there is software available for likelihood-based variance component estimation and for prediction of random effects. Also, GBLUP handles cross-sectional and longitudinal data flexibly and extends to multiple-trait settings in a direct manner. Further, GBLUP delivers a competitive predictive ability, since members of the Bayesian alphabet typically differ little in predictive performance and differences among methods are typically masked by cross-validation noise [19-23]. Last but not least, some members of this alphabet produce predictions that are sensitive to hyper-parameter specification [24].

Given these considerations, GBLUP or extensions thereof [25] are good candidates for routine whole-genome prediction in animal and plant breeding applications, and possibly for prediction of complex traits in humans as well [7],[26].

The closeness between predictions and realized values depends mainly on three factors: prediction bias, variance of prediction error and amount of noise associated with future observations. The latter cannot be reduced by any prediction machine based on training data, so it is impossible to construct predictors attaining a perfect predictive correlation, even if the model holds. Theoretical and empirical results [1], [27-30] indicate that the proportion of (cross-validation) variance explained by a linear predictor increases up to a point with training sample size, then reaching a plateau. However, when a small number of individuals is available, any prediction machine is bound to produce predictions with a large variance. In this context, it seems worthwhile to explore avenues for enhancing accuracy (i.e., reducing bias) and reliability (i.e., decreasing variance) of predictions when training size is small.

The question examined here is whether or not the predictive ability of GBLUP can be improved by recourse to resampling methods used in machine learning. These include bootstrap aggregating sampling ("bagging") and iterated bagging or debiasing [31-34]. Bagging uses bootstrap sampling to reduce variance and can enhance reliability and reduce mean squared error [31]. The conditional bias of GBLUP [19] cannot be removed by bagging, but iterated bagging has the potential of reducing variance while removing bias simultaneously. This study investigates bagging of GBLUP, with consideration of debiasing deferred to a future investigation.

The second section of this paper gives a review of GBLUP and of its inherent inaccuracy (bias). The third section describes bagging in the context of GBLUP, at the level of the genetic signal and of marker effects. The fourth section examines the performance of bagging in four simulated case studies, and the fifth section presents an application to real data on grain yield in wheat planted in four environments. The paper concludes with a discussion and with a proposal of a metric aimed at quantifying candidate-specific cross-validation uncertainty.

Materials and Methods

GBLUP

Idealized conditions. Assume that effects of nuisance factors (e.g., year to year variation) have been removed in a pre-processing stage (this can also be done in a single stage, but we ignore this for simplicity). GBLUP can be motivated by positing the linear regression model on markers

$$y = Xb + e, \tag{1}$$

where $y$ is an $n \times 1$ vector of observations or pre-processed data measured on a set of individuals or lines; $X$ is an $n \times p$ matrix of marker genotypes, with typical element $x_{ij}$ being the genotype code at locus $j$ observed in individual $i$, and with $\mathrm{rank}(X) \leq \min(n,p)$; $b$ is a $p \times 1$ vector of unknown allelic substitution effects when marker genotypes are coded, e.g., as $-1$, $0$ and $1$ for $aa$, $Aa$ and $AA$ at locus $A$, say, or when these coded values are deviated from the corresponding column means or standardized. Above, $e \sim N(0, N\sigma^2_e)$ is a vector of residuals, where $\sigma^2_e$ is the residual variance and $N$ is an $n \times n$ diagonal matrix with typical element $N_{ii}$; if $y$ consists of single measurements on individuals, $N_{ii} = 1$ for all $i = 1,2,\dots,n$, and if $y$ is a vector of means, $N_{ii}$ would be the number of observations entering into the mean.
In BLUP, a distribution is assigned to $b$, and the simplest one is $b\,|\,\sigma^2_b \sim N(0, I\sigma^2_b)$, where $\sigma^2_b$ is the variance of marker allele substitution effects. Using this assumption together with model (1) gives as marginal distribution of the data (after assuming that $X$ and $y$ have been centered)

$$y\,|\,X \sim N(0,\; XX'\sigma^2_b + N\sigma^2_e). \tag{2}$$

In BLUP, $\sigma^2_b$ and $\sigma^2_e$ are treated as known, but these parameters are typically estimated from the data at hand [35]. With markers, most often $n < p$, so it is convenient to form the best linear unbiased predictor of $b$ as $\hat b = \sigma^2_b X' V^{-1} y$, where $V = XX'\sigma^2_b + N\sigma^2_e$. If model (1) holds, it can be shown that $\hat b$ is unbiased in the sense that $E(\hat b\,|\,X) = E(b) = 0$. Its covariance matrix (given $X$) is $V_{\hat b} = \sigma^4_b X' V^{-1} X$, and its prediction error covariance matrix (also the covariance matrix of the conditional distribution of $b$ given $y$ and $X$ under normality assumptions) is

$$V_{b|y} = V_b - V_{\hat b} = \sigma^2_b\left[I - \sigma^2_b X' V^{-1} X\right], \tag{3}$$

where $V_b = I\sigma^2_b$. The diagonal elements of $V_{b|y}$ ($v_{jj}$) lead to a measure of reliability of prediction of marker effect $j$: $\mathrm{rel}_j = 1 - v_{jj}/\sigma^2_b$. A matrix of reliabilities and co-reliabilities is

$$R = I - V_{b|y} V_b^{-1} = V_{\hat b} V_b^{-1}. \tag{4}$$

If one wishes to predict a future vector $y_f = X_f b + e_f$, with future residuals independent of past ones and provided that future and past residuals stem from the same stochastic process, under normality assumptions the predictive distribution [19] is

$$y_f\,|\,y, X, X_f \sim N\!\left(X_f \hat b,\; X_f V_{b|y} X_f' + I_f\sigma^2_e\right). \tag{5}$$

Further, if $y_f$ is the predictand and $\mathrm{BLUP}(X_f b) = X_f\hat b = \hat y_f$ is the predictor, BLUP theory yields

$$\mathrm{Var}(y_f\,|\,y) = \mathrm{Var}(y_f) - \mathrm{Var}(\hat y_f), \tag{6}$$

so that $\mathrm{Var}(\hat y_f) = X_f V_{\hat b} X_f'$. If the marker effects could be estimated such that $V_{b|y} \to 0$, then $\mathrm{Var}(\hat y_f) \to X_f V_b X_f'$ and $\mathrm{Var}(y_f\,|\,y) \to I_f\sigma^2_e$. Hence, the distribution in (5) would still have variance, indicating that it is not possible to attain a predictive correlation equal to 1 even if a training set has an infinite size (more formally, if the training process conveys an infinite amount of information about markers); [7] gives a discussion of related matters.
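As a concrete illustration of equations (2)-(4), the snippet below computes the BLUP of marker effects and the associated reliabilities on a small simulated data set. Treating the variance components as known is an assumption of this sketch, not a recommendation; all names are ours.

```r
# Sketch of equations (2)-(4): BLUP of marker effects and their reliabilities.
# Variance components are taken as known; in practice they are estimated.
set.seed(2)
n <- 60; p <- 200
X   <- scale(matrix(rbinom(n * p, 2, 0.4), n, p), scale = FALSE)  # centered codes
b   <- rnorm(p, 0, sqrt(0.1))
y   <- drop(X %*% b + rnorm(n, 0, sqrt(5)))
s2b <- 0.1; s2e <- 5

V     <- tcrossprod(X) * s2b + diag(n) * s2e    # V = XX' s2b + N s2e (N = I here)
Vinv  <- solve(V)
bhat  <- drop(s2b * crossprod(X, Vinv %*% y))   # BLUP(b) = s2b X' V^-1 y
Vbhat <- s2b^2 * crossprod(X, Vinv %*% X)       # covariance matrix of BLUP
vjj   <- s2b - diag(Vbhat)                      # prediction error variances
rel   <- 1 - vjj / s2b                          # reliability of each marker effect
```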

BLUP is conditionally inaccurate

While BLUP theory is well established, quantitative geneticists tend to interpret the unbiasedness property of BLUP as if it pertained to the true unknown $b$, when in fact it applies to the average of the distribution of $b$, that is, 0. If $b$ in (1) were viewed as a model on unknown effects of known quantitative trait loci (QTL), it is obvious that one should think in terms of a fixed effects model, as per the standard finite-number-of-loci model of quantitative genetics [36]. Accordingly, if effects of markers are sought because these flag some genomic region of interest, the random sampling assumption made in BLUP is not relevant, although it might lead to a more stable estimator. In the fixed effects case, both $b$ and the marked genetic signal $Xb$ are estimated with bias by BLUP even if the model holds [19].

Markers are not QTL, and the latter are the causes generating a signal to phenotype. Suppose that the true model is linear on effects ($q$) of QTL relating to phenotypes via incidence matrix $Q$, that is,

$$y = Qq + e. \tag{7}$$

If the QTL effects $q$ are viewed as fixed entities (arguably, geneticists have this in mind in their quest of finding genes), $E(y\,|\,Qq) = Qq$. In this situation, BLUP produces the following average outcomes:

$$E(\hat b\,|\,q,Q,X) = E\!\left[\sigma^2_b X' V^{-1} y\,|\,Q\right] = \sigma^2_b X' V^{-1} Qq$$

and

$$E(X\hat b\,|\,q,Q,X) = E\!\left[\sigma^2_b XX' V^{-1} y\,|\,Q\right] = \sigma^2_b XX' V^{-1} Qq,$$

so the bias of the estimated signal is $Qq - \sigma^2_b XX' V^{-1} Qq = \left[I - \sigma^2_b XX' V^{-1}\right]Qq$. On the other hand, if $q$ is assigned a distribution, say $N(0, I\sigma^2_q)$, and the $p$ markers are treated as random as well, e.g., $b\,|\,\sigma^2_b \sim N(0, I\sigma^2_b)$, under normality assumptions one has

$$E(q\,|\,\hat b, Q, X) = E(q) + \mathrm{Cov}(q, \hat b')\,V_{\hat b}^{-1}\,\hat b = \hat q_{markers}, \tag{8}$$

where

$$\mathrm{Cov}(q, \hat b') = \mathrm{Cov}(q, \sigma^2_b y' V^{-1} X) = \sigma^2_b\,\mathrm{Cov}(q, q')\,Q' V^{-1} X = \sigma^2_b\sigma^2_q Q' V^{-1} X.$$

Using this in (8),

$$\hat q_{markers} = \sigma^2_b\sigma^2_q Q' V^{-1} X V_{\hat b}^{-1}\hat b,$$

where $Q' V^{-1} X$ conveys unknown linkage disequilibrium relationships between QTL and marker genotypes; similarly, the best statement about the signal is $E(Qq\,|\,\hat b, Q, X) = Q\hat q_{markers}$. Unfortunately, neither $\sigma^2_q$ nor $Q$ are known, so statements made about QTL from markers are based on untestable assumptions, including the view that the QTL effects and the genotypes are linearly related, as in (7).

Predictive correlation when markers are the QTL

Imagine a best case scenario where the markers are the QTL, and consider predicting $y_f = Q_f q + e_f$. Here, $Q_f$ is the incidence matrix relating QTL effects to yet-to-be-realized phenotypes $y_f$, and $e_f$ is a future vector of residuals. BLUP theory, using $Q_f\hat q$ (here, $\hat q$ is the BLUP of $q$ under the true model) as predictor, gives the following squared correlation between the $i$th elements of $y_f$ and $Q_f\hat q$ (below, $Q'_{f,i}$ is the $i$th row of $Q_f$):

$$r^2_{ii} = \frac{\mathrm{Var}(Q'_{f,i}\hat q)}{\mathrm{Var}(y_{f,i})} = \frac{Q'_{f,i}\,\mathrm{Var}(\hat q)\,Q_{f,i}}{\sigma^2_q Q'_{f,i} Q_{f,i} + \sigma^2_e}.$$

Let now $n \to \infty$, in which case BLUP theory yields $Q'_{f,i}\mathrm{Var}(\hat q) Q_{f,i} \to Q'_{f,i} Q_{f,i}\sigma^2_q$, so that

$$r^2_{ii} \to 1\big/\left[1 + \sigma^2_e/\!\left(\sigma^2_q Q'_{f,i} Q_{f,i}\right)\right].$$

This shows that it is impossible to attain a perfect predictive correlation even when the markers are the QTL. Further, $Q'_{f,i} Q_{f,i} = \sum_{j=1}^{n_q} Q^2_{f,ij}$, where $Q_{f,ij}$ is the genotype at QTL locus $j$ ($j = 1,\dots,n_q$) of individual $i$ in the testing set. If QTL genotypes are centered and assumed to be in Hardy-Weinberg equilibrium, $E(Q^2_{f,ij}) = 2p_j(1-p_j)$, so approximately

$$r^2_{ii} \approx \left[1 + \frac{\sigma^2_e}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}\right]^{-1} = h^2, \tag{9}$$

where $p_j$ is the frequency of a reference allele, $\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q$ is the additive genetic variance and

$$h^2 = \frac{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q + \sigma^2_e}$$

is trait heritability. Hence, the predictive correlation has an upper bound at $h$. If, instead of predicting individual phenotypes, the problem is that of predicting an average, the upper bound for the predictive correlation is higher. The corresponding formula is easy to arrive at: it just requires replacing $\sigma^2_e$ in (9) by $\sigma^2_e/n_{Ave}$, where $n_{Ave}$ is the number of observations intervening in the average.
Then

$$r^2_{ii} \approx \left[1 + \frac{\sigma^2_e/n_{Ave}}{\sum_{j=1}^{n_q} 2p_j(1-p_j)\sigma^2_q}\right]^{-1} = \frac{n_{Ave}\,h^2}{1 + (n_{Ave}-1)h^2},$$

which is heritability as used in plant breeding, or heritability of a mean [36]. The predictive correlation never reaches 1 but can get close to 1 if $n_{Ave}$ is very large. Large squared predictive correlations have been attained in dairy cattle progeny tests, where predictands are processed averages of production records of cows sired by a bull [2].
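A quick numerical check of these two bounds, under parameter values that are purely illustrative:

```r
# Worked check of equation (9) and its n_Ave version; all values illustrative.
nq <- 200; pj <- rep(0.3, nq); s2q <- 0.5; s2e <- 10
varA <- sum(2 * pj * (1 - pj)) * s2q        # additive genetic variance
h2   <- varA / (varA + s2e)                 # bound on squared predictive corr.
nAve <- 50
h2m  <- nAve * h2 / (1 + (nAve - 1) * h2)   # bound when predicting a mean
c(h2 = h2, h = sqrt(h2), h2_mean = h2m)     # h2_mean approaches 1 as nAve grows
```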

Bagging GBLUP (BGBLUP) and marker effects

Bagging GBLUP. Bagging exploits the idea that predictors can be rendered more stable by repeated bootstrapping and averaging over bootstrap samples, thus reducing variance [31]. Bagging has been found to have advantages in cases where predictors are unstable, i.e., when small perturbations of the training set produce marked changes in model training [31], [34], [37]; for example, with ordinary least-squares under severe multicollinearity. An important application of bagging is in prediction using random forests [38]. Prediction methods that use regularization, such as those applied in genome-enabled selection, are often stable because penalties on model complexity reduce the effective number of parameters drastically, thus lowering variance. However, this is attained at the expense of bias with respect to marker effects and to the unknown function to be predicted (the marked genetic value). A priori, it would seem that bagging would not bring advantages in the context of a regularized method such as GBLUP. However, this issue has not been examined so far, and there may be cases, e.g., in small populations, where random variation in training sets of small size has a marked impact on predictive ability.

To motivate bagging, we recall that GBLUP is a regression of phenotypes on genomic relationships between individuals. Let $g = Xb$ be a vector of genomic values, $G \propto XX'$, and assume that $G^{-1}$ exists; then (1) can be written in the equivalent form

$$y = GG^{-1}g + e = Gg^* + e, \tag{10}$$

where $g \sim N(0, G\sigma^2_b)$ and $g^* = G^{-1}g \sim N(0, G^{-1}\sigma^2_b)$. In scalar form, the datum for individual $i$ is expressible under this parameterization as the regression

$$y_i = \sum_{j=1}^{n} G_{ij}\,g^*_j + e_i = G'_i g^* + e_i, \tag{11}$$

where $G_{ij}$ is an element of the $n \times n$ matrix $G$ and $G'_i$ is its $i$th row. From basic principles, $\mathrm{BLUP}(g^*) = \mathrm{Cov}(g^*, y')V^{-1}y = \sigma^2_b V^{-1}y = \hat g^*$, and since $Gg^* = g$, use of invariance gives

$$\mathrm{BLUP}(g) = G\,\mathrm{BLUP}(g^*) = \sigma^2_b G V^{-1} y = \hat g. \tag{12}$$

Note that $\sigma^2_b G V^{-1}$ is an $n \times n$ matrix of "heritabilities" and "co-heritabilities"; the diagonal elements of $G$ may be distinct from each other, so each individual may have a different heritability ascribed to it, as noted by de los Campos et al. (2013).

The intuitive idea behind bagging was outlined in [1]. Suppose there is a large number of training samples from the same population; by averaging over the predictions made from these samples, we would end up with a reduction of variance, but with the bias properties remaining the same as those of the predictor derived from a single training set. This large supply of training sets can be emulated by bootstrap sampling: by averaging over samples, one gets closer (in the mean squared error sense) to the true value, on average. Technical details of why this works are given in Appendix S1.

Let the predictor formed from a single training set of size $N_{Train}$ be $\hat w(G'_i) = G'_i\hat g^*$ ($i = 1,2,\dots,N_{Train}$). Its variance can be lowered by taking $B$ bootstrap copies of size $N_{Train}$ (i.e., sampling with replacement from the training set) and then averaging over copies. A given $(y_i, G'_i)$ may not appear at all or may be repeated several times over the $B$ bootstrap samples. The bagging algorithm, implemented in the sketch that follows, is:

- For each copy $b$ ($b = 1,2,\dots,B$), run GBLUP using estimates of variance components obtained from the entire data set (to simplify computations), find the regressions $\hat g^*_b$, and form a bootstrap draw for $\mathrm{BLUP}(g)$ as

$$\hat g_b = G_b\hat g^*_b; \quad b = 1,2,\dots,B. \tag{13}$$

- After running the $B$ GBLUP implementations, take the averages

$$\hat g^*_{Bagged} = \frac{\sum_{b=1}^{B}\hat g^*_b}{B} \quad \text{and} \quad \hat w_{bagged} = \frac{\sum_{b=1}^{B}\hat g_b}{B}. \tag{14}$$

- Predict the vector $y_{Test}$ in the testing set as $\hat w_{Test} = G_{Test,Train}\,\hat g^*_{Bagged}$, where $G_{Test,Train}$ is a matrix of genomic relationships between individuals in the testing and training sets. If $y_{Test}$ pertains to records on the same individuals, the predictor is $\hat w_{Test} = \hat w_{bagged}$.
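A compact R sketch of this algorithm is given below. It is our illustration, not the authors' code: variance components are treated as known, G is built from simulated markers, and test-set predictions are averaged over bootstrap copies, which is the operational form of equation (14) when each bootstrap copy resamples a different set of individuals.

```r
# Sketch of bagging GBLUP (BGBLUP); all sizes and names are illustrative.
set.seed(3)
n <- 300; p <- 1000
M <- matrix(rbinom(n * p, 2, 0.3), n, p)
Z <- scale(M, scale = FALSE)
G <- tcrossprod(Z) / p                               # genomic relationship matrix
g <- drop(t(chol(G + diag(1e-6, n))) %*% rnorm(n))   # true signal, cov ~ G
y <- g + rnorm(n, 0, 1)

train <- 1:250; test <- 251:300
s2g <- 1; s2e <- 1; B <- 25                          # variance components "known"
pred <- matrix(0, length(test), B)
for (b in 1:B) {
  idx <- sample(train, replace = TRUE)               # bootstrap copy, eq. (13)
  Vb  <- G[idx, idx] * s2g + diag(length(idx)) * s2e
  gstar_b <- s2g * solve(Vb, y[idx])                 # BLUP(g*) within the copy
  pred[, b] <- G[test, idx] %*% gstar_b              # test prediction, this copy
}
w_bagged <- rowMeans(pred)                           # bagged predictor, eq. (14)
cor(w_bagged, y[test])                               # predictive correlation
```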
To see how improvement arises, consider the following argument. We seek to learn the signal $g$ from the GBLUP predictor $\hat g$. Under the setting of BLUP (where $g$ and $y$ both vary at random over conceptual repeated sampling), the best predictor of $g$ in the mean squared error sense is $\hat g$, because this is the conditional expectation under normality [39-40]. However, if $g$ is the signal of a fixed target set of candidates, $\hat g$ is biased, as shown earlier. Thus, $E(\hat g\,|\,g) = g + d$, where $d$ is a vector of biases, and the mean squared error matrix of $\hat w_{bagged}$ is

$$\mathrm{MSE}(\hat w_{bagged}\,|\,g) = E(\hat w_{bagged} - g)\,E(\hat w_{bagged} - g)' + \mathrm{Var}(\hat w_{bagged}).$$

Now, if the $B$ bootstrap copies of GBLUP are drawn from the same distribution, $\hat w_{bagged}$ has the same expectation as $\hat g$ and, therefore, the same bias: $E(\hat w_{bagged} - g\,|\,g) = d$. Further, if the $B$ copies are viewed as drawn independently, the variance of $\hat w_{bagged}$ should be $B$ times smaller than that of any $\hat g_b$. Thus,

$$\mathrm{MSE}(\hat w_{bagged}) = dd' + \frac{\mathrm{Var}(\hat g)}{B};$$

a formula taking the correlation between samples into account is not available, but the reduction in MSE would obviously be smaller than in the idealized situation where samples are independent. Hence, $\hat w_{bagged}$ has the same bias as GBLUP but, at best, is $B$ times less variable.

If a prediction machine has little variance, as is the case for shrinkage methods with a large amount of regularization, predictive performance might be degraded [31], but whether or not this occurs in a particular problem can be assessed only empirically.

Bagging marker effects

Alternatively, one can work directly with the linear model (1), but this is more involved computationally because the problem becomes $p$-dimensional (in GBLUP there are $n < p$ unknowns). Assuming centered data, the ridge regression estimator (BLUP of marker effects) is

$$\hat b = \left(X'X + I\lambda\right)^{-1} X'y,$$

where some $\lambda = \sigma^2_e/\sigma^2_b$ (in the BLUP framework) is available, estimated from the data at hand or chosen over a cross-validation grid. Given the data matrix $D = [y, X]$, of order $n \times (1+p)$, draw $B$ bootstrap copies by randomly sampling $n$ rows from $D$ with replacement, such that a particular bootstrap sample is $D_b = [y_b, X_b]$. The bagged ridge regression estimator is

$$\hat b_{Bagged} = \frac{1}{B}\sum_{b=1}^{B}\left(X_b'X_b + I\lambda_b\right)^{-1} X_b' y_b.$$

Then, given an out-of-sample case with marker matrix $X_{Test}$, the yet-to-be-realized phenotypes are predicted as $\hat y_{Bagged} = X_{Test}\,\hat b_{Bagged}$. Using the relationship $g = Xb$, one can obtain indirect samples of the bootstrap distribution of $\hat g$ as $X_b\left(X_b'X_b + I\lambda_b\right)^{-1}X_b' y_b = H_b y_b$, where $H_b = X_b\left(X_b'X_b + I\lambda_b\right)^{-1}X_b'$ is the hat matrix for sample $b$, and form the indirect bagged GBLUP as

$$\hat g_{Bagged,indirect} = \frac{1}{B}\sum_{b=1}^{B} H_b y_b. \tag{15}$$
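The following sketch, again illustrative rather than the authors' implementation, bags the ridge-regression estimator of marker effects at a fixed λ:

```r
# Sketch of bagging at the level of marker effects (bagged ridge regression).
# lambda is taken as given; in practice it is estimated or chosen over a grid.
set.seed(4)
n <- 200; p <- 500
X    <- scale(matrix(rbinom(n * p, 2, 0.3), n, p), scale = FALSE)
beta <- rnorm(p, 0, sqrt(0.02))
y    <- drop(X %*% beta + rnorm(n))
lambda <- 50; B <- 25

bhat_bag <- rep(0, p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)          # bootstrap rows of D = [y, X]
  Xb <- X[idx, ]; yb <- y[idx]
  bhat_bag <- bhat_bag +
    drop(solve(crossprod(Xb) + diag(lambda, p), crossprod(Xb, yb))) / B
}
# Out-of-sample prediction for a marker matrix X_test: X_test %*% bhat_bag
```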
Results

Simulated case studies

Case Study 1: model training with 200 known QTL and $N_{Train} = 500$. To evaluate the impact of bagging on model training, we simulated 200 QTL with known binary genotypes (as in inbred lines, where genotypes at a given locus are either $aa$ or $AA$). Genotypes were sampled with a frequency of 0.05 at all loci, and their effects were drawn independently from a $N(0, 1/2)$ distribution; sample size was $N_{Train} = 500$, so all QTL effects were likelihood-identified (estimable) in the regression model described below. Any resulting disequilibrium was due to finite sample size only. Phenotypes were formed by summing the product of QTL genotype codes (0,1) times their corresponding effects over the 200 QTL and then adding a random residual drawn from $N(0,10)$; the effective heritability (variance among simulated realized genetic values as a fraction of the total variance) attained in the simulation was about 0.34. The true model was employed in the training process using the true variance ratio $\lambda = 20$, and the effect of the number of bootstrap samples ($B$), each of size $N_{Train} = 500$, on the bagged predictor was examined by taking $B = 50$, 200 and 500. While $B = 25$ or 50 is often adequate [31], a larger number of bootstrap samples is not harmful.

Regressions of the 200 elements of $q$ on their ordinary least-squares estimates (OLS), on BLUP-ridge regression ($\lambda = 20$) and on a bagged mean, obtained by averaging over bootstrap sample estimates of the BLUPs of $q$, were calculated. This was also done for the regression of the 500 simulated genetic signals ($g = Qq$) on GBLUP and on the average of the GBLUP bootstrap samples, $\hat g_{Bagged}$. Table 1 presents results. As expected from BLUP theory [10], the regressions of either $q$ or $g$ on their corresponding BLUPs were near their expected value: 1. By construction, $\mathrm{Var}[\mathrm{BLUP}(q)] = \mathrm{Cov}(q, \mathrm{BLUP}(q))$, so the expected regression is necessarily 1. For QTL effects, OLS, even though it is an unbiased estimator of $q$, produced a regression of about 0.4. The bagging procedure produced regressions that exceeded 1, both at the level of the QTL effects and of the genetic signal $g$. From the point of view of goodness of fit to the data, ridge regression and bagging produced models that accounted for more variation of the training data than OLS: for OLS, $R^2$ was about 0.49, whereas it ranged between 0.51 and 0.53 for bagging and ridge regression. Increasing the number of bootstrap copies in the bag increased $R^2$ mildly, with no sizable gain resulting from increasing $B$ from 200 to 500. The regressions of $g$ on $\hat g_{Bagged}$ were larger than 1, but the bagged means accounted for the same proportion of variation in true $g$ values as $\hat g_{GBLUP}$ did, with $R^2 = 0.63$ for the latter and for bagged GBLUP with $B = 200$ or 500.

Figure 1 (left panel) gives the distribution of estimates of the 200 QTL effects within each of 4 bootstrap samples for the run with $B = 500$, as well as within the vector of bagged means. As expected, bagging produced less variability among individual effect estimates, as extreme values are tempered by the averaging procedure. The impact of bagging is seen in the right panel of Figure 1: the variance among bagged means of individual effects was much less than the variance within any of the 500 bootstrap copies. Figure 2 (left panel) shows the variance reduction when bagging was applied to the GBLUPs of individuals: the horizontal line at the bottom is the variance among the bagged GBLUPs of the 500 individuals in the training sample. The right panel of Figure 2 shows that bagged GBLUP (BGBLUP) produces an understatement of genetic values relative to what is predicted by GBLUP for individuals ranked as high by the latter, but an overstatement otherwise; however, the two procedures give aligned predictions of genetic values. Bagging regresses extreme GBLUP estimates towards their average, which might attenuate the influence of idiosyncratic samples on which GBLUP is trained. Hence, bagging enhances the shrinkage towards 0 inherent to GBLUP.

Case Study 2: model training with 1000 known QTL and $N_{Train} = 500$. As in case 1, sample size was $N_{Train} = 500$, but 1000 QTL with known binary genotypes and unknown effects were simulated. Here $n$ is smaller than the number of columns of $Q$, so least-squares cannot be used due to lack of estimability. The setting was as in case 1, but effective heritability was 0.68, that is, twice that attained when 200 QTL were simulated. The true model was employed in the training process using the true variance ratio $\lambda = 20$, and the number of bootstrap samples for bagging was $B = 25$, 50 or 100. Results are presented in Table 2. The regressions of $q$ on $\hat q_{Bagged}$ were much larger than 1, and the slope seemingly stabilized at about 1.3 with a bag consisting of 50 bootstrap copies. As expected, the regression of $q$ on $\hat q_{Ridge}$ was near 1. However, the proportion of variation in true $q$ accounted for by the variation in bagged or ridge regression estimates was smaller than in case 1, with $R^2 \approx 0.29$-$0.30$ for bagging and ridge regression versus $R^2 \approx 0.51$-$0.53$ in case 1.

Table 1. Regression coefficients of true QTL effects ($q$) on their ordinary least-squares ($\hat q_{OLS}$), ridge regression BLUP ($\hat q_{Ridge}$) and bagged ridge regression BLUP ($\hat q_{Bagged}$) estimates, and regressions of true genetic signal ($g$) on genomic BLUP ($\hat g_{GBLUP}$) and bagged genomic BLUP ($\hat g_{Bagged}$) at varying numbers of bootstrap samples ($B$).

  Regressions of q on:
    q_OLS:    0.4  (R² = 0.49)
    q_Ridge:  0.98 (R² = 0.53)
    q_Bagged: 1.05 (R² = 0.51) at B = 50;  1.07 (R² = 0.53) at B = 200;  1.08 (R² = 0.53) at B = 500
  Regressions of g on:
    g_GBLUP:  0.99 (R² = 0.63)
    g_Bagged: 1.05 (R² = 0.61) at B = 50;  1.08 (R² = 0.63) at B = 200;  1.08 (R² = 0.63) at B = 500

R² is the coefficient of determination of the regression fitted. The simulation involved a training set of 500 individuals with 200 true additive QTL fitted in the model.

On the other hand, bagged GBLUP accounted for about 73% of the variation of the true genetic values $g$, providing a fit to signal similar to that of GBLUP. The regression of $g$ on $\hat g_{GBLUP}$ was also near its expected value of 1, whereas true differences in genetic values among individuals were exaggerated by a factor of about 1.3 by bagging. The left panel in Figure 3 shows the alignment between 4 randomly chosen bootstrap samples and the 500 GBLUP estimates obtained in $N_{Train}$. The right panel illustrates clearly that bagging ($B = 100$) pulls down individuals with large GBLUP values and pulls up those placed at the left tail of the distribution of GBLUPs.

Case Study 3: model training with 20 unknown QTL, 200 markers and $N_{Train} = 500$. The setting was as in case 1 ($N_{Train} = 500$), but here the genetic signal was generated from $n_q = 20$ QTL with unknown binary genotypes that were in linkage disequilibrium (LD) with $p = 200$ binary markers, as specified below. The true model was $y = Qq + e$, where $q$ is a $20 \times 1$ vector of allelic substitution effects. QTL genotypes were equally frequent at all loci, and their effects were drawn independently from a $N(0, 1/2)$ distribution.

Figure 1. Simulation with 200 known QTL, 500 individuals in the training sample and 500 bootstrap copies for bagging. Left panel: distribution of 200 effects within each of 4 bootstrap samples (1-4), and within the average (bag) of 500 samples (5). Right panel: distribution of variance among 200 effects within each of 500 bootstrap samples and within their average (item 501, flagged with arrow).

Figure 2. Simulation with 200 known QTL, 500 individuals in the training sample and 500 bootstrap copies for bagging. Left panel: variance (VAR) among 500 GBLUPs in each of 500 bootstrap samples. The horizontal line gives the variance among bootstrap (bagged) means for the 500 individuals. Right panel: scatter plot of bagged GBLUP (BGBLUP) versus exact GBLUP for 500 individuals.

Figure 3. Simulation with 1000 known QTL, 500 individuals in the training sample and 100 bootstrap copies for bagging. Left panel: relationship between GBLUP with the entire sample and GBLUPs in 4 bootstrap samples of size 500. Right panel: relationship between bagged GBLUP (BGBLUP) and GBLUP of the 500 individuals.

Phenotypes were formed by summing the product of QTL genotype codes (0,1) times their corresponding effects over the 20 QTL, and adding a random residual drawn from $N(0,10)$. The method employed for training was ridge regression (BLUP) on the 200 markers using the true variance ratio $\lambda = 20$, and the number of bootstrap samples for bagging was $B = 100$. The simulation generated an effective heritability of about 0.2.

Table 2. Regression coefficients of true QTL effects ($q$) on their ridge regression BLUP ($\hat q_{Ridge}$) and bagged ridge regression BLUP ($\hat q_{Bagged}$) estimates, and regressions of true genetic signal ($g$) on genomic BLUP ($\hat g_{GBLUP}$) and bagged genomic BLUP ($\hat g_{Bagged}$) at varying numbers of bootstrap samples ($B$).

  Regressions of q on:
    q_Ridge:  0.97 (R² = 0.30)
    q_Bagged: 1.30 (R² = 0.29) at B = 25;  1.32 (R² = 0.29) at B = 50;  1.32 (R² = 0.29) at B = 100
  Regressions of g on:
    g_GBLUP:  0.99 (R² = 0.73)
    g_Bagged: 1.25 (R² = 0.73) at B = 25;  1.27 (R² = 0.73) at B = 50;  1.27 (R² = 0.73) at B = 100

R² is the coefficient of determination of the regression fitted. The simulation involved a training set of 500 individuals with 1000 true additive QTL fitted in the model.

LD was simulated statistically by introducing correlations among columns of the $Q$ (QTL genotypes) and $X$ (marker genotypes) matrices, respectively. This was done by drawing $N_{Train}$ independent Beta($a$,$b$) random variables corresponding to the rows of these matrices, and then sampling a Bernoulli random variable (i.e., 0 for $aa$ and 1 for $AA$, say) with probability of success given by the draw from the Beta distribution. Thus, columns of matrices $Q$ or $X$ (with entries 0 or 1) had a beta-binomial distribution (e.g., Casella and George, 1992) and an expected correlation equal to $1/(a+b+1)$; employing $a = 0.30$ and $b = 0.70$, a correlation equal to $1/2$ would be expected.
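A sketch of this device follows; the values of a, b and the block sizes match the description above, while everything else (including the marginal frequency of the independent columns) is illustrative.

```r
# Sketch of the beta-binomial device for inducing LD among binary genotype
# columns: one Beta(a, b) draw per individual serves as the Bernoulli success
# probability for every column of a correlated block, so two block columns
# have expected correlation 1 / (a + b + 1) = 1/2 for a = 0.3, b = 0.7.
set.seed(5)
nTrain <- 500; a <- 0.3; b <- 0.7
theta  <- rbeta(nTrain, a, b)             # row-specific success probabilities

ld_block <- function(ncols) {             # mutually correlated binary columns
  matrix(rbinom(nTrain * ncols, 1, rep(theta, ncols)), nrow = nTrain)
}
le_block <- function(ncols) {             # independent columns (LE)
  matrix(rbinom(nTrain * ncols, 1, 0.5), nrow = nTrain)
}
Q <- cbind(ld_block(10),  le_block(10))   # 20 QTL: first 10 in mutual LD
X <- cbind(ld_block(100), le_block(100))  # 200 markers: first 100 in mutual LD
round(cor(Q[, 1], Q[, 2]), 2)             # near 0.5 within the LD block
```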

Using this approach, the first 10 QTL were in LD among themselves as well as in LD with the first 100 markers; these markers were in mutual LD themselves, with a correlation of 1/2 also. On the other hand, QTL 11-20 were in mutual linkage equilibrium, as well as with all other markers. To illustrate, in the sample simulated, realized genotypes at QTL 1 and 2 had a correlation of 0.48, whereas QTL 1 and 15 had a correlation of 0.02. Likewise, the correlation between markers 1 and 2 was 0.50, that between markers 1 and 200 was 0.04, and QTL 2 was correlated with marker 105. QTL genotypes at each locus were multiple-regressed on the 200 marker genotypes: for QTL 1-10 (in LD with markers), the $R^2$ of the regression of QTL genotypes on the 200 marker genotypes was about 0.70 or larger. For the 10 QTL in LE with markers, $R^2$ fluctuated around a lower value. This last result illustrates that even a null association accounts for some variation, merely because the likelihood increases monotonically with model complexity (in this case there are 200 partial regressions of each QTL genotype on markers). The regression of phenotypes on QTL genotypes or on markers had an $R^2$ at 0.42 and 0.43, respectively, with the latter being larger simply because more parameters are fitted in the model. On the other hand, the squared correlation between true signal and fitted values was 0.87 for the QTL model versus 0.15 for the marker-based model: even though markers captured more variation (because of higher model complexity) than a regression on true genotypes, their ability to capture signal was much lower.

We measured similarity among individuals by constructing genomic correlation matrices. This was done by centering both $Q$ and $X$, calculating $Q_{Centered}Q'_{Centered}$ and $X_{Centered}X'_{Centered}$, and converting these into correlation matrices $R_Q$ and $R_M$, respectively. A plot of the off-diagonal elements of $R_M$ versus those of $R_Q$ is shown in Figure 4 for the corresponding pairs of individuals. Although there is an association between genomic correlations at the QTL and at the marker levels, the latter were smaller in absolute value. This association was not perfect: at a given level of correlation at the QTL level, there was much variation in relationships when measured by markers. Implications of this for accuracy of genome-enabled prediction are discussed in [7].

GBLUP and BGBLUP were calculated at each of $\lambda$, $2\lambda$ and $3\lambda$ (with $\lambda = 20$) as variance ratio, to investigate the impact of regularization on the ability to capture the true signal, $Qq$. The regression of signal on GBLUP was 0.48, 0.54 and 0.60 for the three values of the regularization parameter, respectively, whereas that for BGBLUP was 0.50, 0.59 and 0.65, respectively. This illustrates that the regression of true values on GBLUP predictions is not 1 when the model is incorrect, as is the case when markers are not QTL, as simulated here. The corresponding $R^2$ values were 0.25, 0.27 and 0.28 for GBLUP, and 0.24, 0.28 and 0.30 for BGBLUP.
While $\lambda = 20$ is the correct regularization to be exerted at the QTL level, this is not so for the model based on markers, where a stronger degree of regularization is needed (the number of markers was larger than the number of QTL). An approximation is that the correct variance ratio for the marker-based model should be $p/n_q = 10$ times larger than $\lambda$, where $n_q = 20$ is the number of QTL. We found (results not shown) that the regressions of signal on GBLUP and BGBLUP increased as larger values of the regularization parameter were applied to the marker-based model, with the optimum being near $10\lambda$, as expected.

Figure 4. Simulation with 20 unknown QTL, 200 markers and 500 individuals: genomic correlations among 500 individuals at QTL or marker loci.

Case Study 4: predictive cross-validation with 100 unknown QTL and 500 markers. The simulation posed 100 unknown QTL whose additive effects were drawn from a $N(0,1.5)$ distribution, and 500 individuals genotyped for 500 markers. The LD structure was similar to that of case 3: the first 50 QTL were in mutual linkage disequilibrium as well as in LD with the first 50 markers; the remaining QTL were in LE among themselves as well as with all markers; the first 50 markers were in mutual LD and the remaining markers were in LE. Residuals were drawn from $N(0,10)$. The 500 individuals were distributed at random into two non-overlapping training and testing sets with 250 members in each. This was done 100 times at random, producing 100 training-testing pairs enabling estimation of the cross-validation distribution. GBLUP and BGBLUP were fitted to the training data ($N_{Train} = 250$) for each of 13 values of the regularization parameter $\lambda$ in the sequence

$$\lambda_{Seq} = \left\{\tfrac{1}{16}, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4, 8, 16, 32, 64, 128, 256\right\}\lambda_{\text{"True"}}, \tag{16}$$

where $\lambda_{\text{"True"}} = \sigma^2_e/\sigma^2_b$ with $\sigma^2_b \approx \sigma^2_q n_q/p$. In each training instance, BGBLUP was implemented with $B = 25$ bootstrap samples of size $N_{Train}$ drawn from the training set of size 250. Three metrics were used to evaluate the two prediction methods: goodness of fit in the training set (correlation between fitted and observed values), predictive correlation (between predicted phenotypes in the testing set and realized values) and predictive mean squared error, that is, the average squared difference between predicted and realized values over the $N_{Test} = 250$ cases in the testing set.
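The grid in (16) and the repeated random partitions can be set up in a few lines; the scaffold below is our sketch (it uses the "true" λ of 400 referred to later in the text and omits the model fitting itself):

```r
# Scaffold for the cross-validation design of case study 4 (illustrative).
lambda_true <- 400                        # the "true" variance ratio used later
lambda_seq  <- c(1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32, 64, 128, 256) *
               lambda_true                # the 13-point grid of equation (16)

n <- 500; n_train <- 250; reps <- 100
set.seed(7)
partitions <- replicate(reps, sample(n, n_train))  # training rows per replicate
# For each replicate and each lambda: fit GBLUP and BGBLUP (B = 25) on the 250
# training cases, then record training correlation, predictive correlation and
# predictive MSE on the 250 held-out cases.
```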

The design just described involved 13 BGBLUP and GBLUP implementations in each of the 100 random cross-validations, for a total of 1,300 comparisons. Overall (Figure 5), BGBLUP attained a better predictive performance than GBLUP, because predictive correlations were typically larger (in some cases more than twice as large as for GBLUP) and mean squared errors of prediction were lower as well. Thus, BGBLUP was more reliable (larger correlation) and more accurate (smaller mean squared error) than GBLUP. The superiority of BGBLUP over GBLUP became smaller when regularization was stronger over the grid defined by $\lambda_{Seq}$. However, it was not until $\log_{10}\lambda_{Seq}$ became 5 or more times larger than $\log_{10}\lambda_{\text{"True"}}$ that GBLUP caught up with bagging, and it was never better. Actually, it was not until the $\lambda$ value used for model training was 32 times larger than $\lambda_{True}$ that the two predictors delivered the same performance, suggesting a robustness property of BGBLUP that GBLUP seems to lack. Model complexity is reduced as $\lambda$ increases (the effective number of parameters decreases), so the variance of GBLUP decreases as well, in which case bagging offers little help. Contrary to what was stated by [7], we did not find evidence that bagging damaged predictive performance under any of the regularization regimes entertained. Results for selected settings are discussed in the following paragraphs.

Figure 6 shows training and predictive correlations and mean squared errors for BGBLUP (y-axis) and GBLUP (x-axis) at two levels of under-regularization: $\lambda = \lambda_{True}/16$ and $\lambda = \lambda_{True}/2$. Here, where shrinkage is less than it should be given the model, BGBLUP reduced overfitting (smaller training set correlations), increased predictive correlations and reduced mean squared errors, relative to GBLUP. Differences were marked: in the upper (lower) middle panels of Figure 6 it is seen that the largest predictive correlation attained by GBLUP was smaller than 0.30 (0.39), and that bagging increased the corresponding correlation to about 0.38 (0.41). The effect of bagging on reducing predictive mean squared error was also clear.

Figure 7 presents results when the true value of $\lambda$ (400) was used for training. GBLUP was again close to overfitting, and BGBLUP reduced training correlations, thus tempering the problem. Predictively, BGBLUP was better than GBLUP in all 100 comparisons, both in the correlation and MSE senses. Over the 100 cross-validation runs, the median predictive correlation was 0.28 for GBLUP and 0.330 for BGBLUP, and the median MSE of prediction was 38.47 for GBLUP and 35.01 for BGBLUP. Comparing the medians of the distributions, BGBLUP enhanced the predictive correlation by 17%, and its MSE was 91% of that of GBLUP.

Figure 8 depicts what was found when excessive shrinkage was applied in training: the regularization parameter values were $8\lambda_{True}$ (upper panel) and $16\lambda_{True}$ (lower panel). BGBLUP was only marginally better than GBLUP. Strong shrinkage rendered the training model exceedingly simple, so both methods delivered similar predictive ability. Differences between methods vanished when $256\lambda_{True}$ was used as shrinkage parameter.

We repeated the experiment with the setting of case study 4, but increasing training and testing sample sizes to 2,500 each. Here, training sample size was 5 times larger than the number of markers: no differences between BGBLUP and GBLUP were found at any level of regularization. The reason is that $n$ now exceeds $p$, making GBLUP fairly stable, in which case the variance reduction property of BGBLUP does not help much.

In summary, in our simulations BGBLUP was typically better than GBLUP at most points of the regularization grid considered. Its performance was better than that of GBLUP at values close to optimal regularization, and differences were large when shrinkage was small, because bootstrap sampling with averaging reduced variance. If $\lambda$ is small, GBLUP tends to overfit and to be variable, but BGBLUP alleviates these problems. In addition to helping with overfitting, BGBLUP was robust with respect to departures from optimal regularization, e.g., to errors in the variance ratio. The experiment with a much larger training sample size than the number of markers indicated that the performance of bagging depends on $p/N_{Train}$: we conjecture that, as this ratio increases (as will surely be the case with DNA sequence data), use of BGBLUP may enhance predictive ability in some real data situations, simply because overfitting and collinearity will be exacerbated by introducing a massive number of variates in the model.

Analysis of wheat data

The data set is available in the BLR package in R (http://cran.r-project.org/web/packages/BLR/index.html) and has been used by, e.g., [17],[42-43]. The data represent 599 wheat inbred lines, each genotyped with 1,279 DArT (Diversity Array Technology) markers and planted in 4 distinct environments. DArT markers may take one of two values, denoting presence or absence of an allele. Records came from several international trials conducted at the International Maize and Wheat Improvement Center (CIMMYT), Mexico. The trait considered was average grain yield for each line in each of the four environments. This response variable is an average over a balanced set of plots and replicates and, within environment, the residual variance is expected to be constant over lines.
The reason is that n now exceeds p, making GBLUP fairly stable, in which case the variance reduction property of BGBLUP does not help much. In summary, in our simulations BGBLUP was typically better than GBLUP at most points of the regularization grid considered. Its performance was better than that of GBLUP at values close to optimal regularization, and differences were large when shrinkage was small, because bootstrap sampling with averaging reduced variance. If l is small, GBLUP tends to overfit and to be variable but BGBLUP alleviates these problems. In addition to helping with overfitting, BGBLUP was robust with respect to departures from optimal regularization, e.g., to errors in the variance ratio. The experiment with a much larger training sample size than the number of markers indicated that the performance of bagging depends on p=n Train : we conjecture that as this ratio increases (as it will surely be the case with DNA sequence data) use of BGBLUP may enhance predictive ability in some real data situations, simply because overfitting and colinearity will be exacerbated by introducing a massive number of variates in the model. Analysis of wheat data The data set is at BLR/index.html, in the BLR package in R, and has been used by, e.g., [17],[4 43]. The data represents 599 wheat inbred lines each genotyped with 179 DArT (Diversity Array Technology) markers, and planted in 4 distinct environments. DArT markers may take on one of two values, denoting presence or absence of an allele. Records came from several international trials conducted at the International Maize and Wheat Improvement Center (CIMMYT), Mexico. The trait considered was average grain yield for each line in each of the four environments. This response variable is an average over a balanced set of plots and replicates and, within environment, the residual variance is expected to be constant over lines. To provide proof of concept, we used a ridge-regression BLUP model with 600 randomly chosen markers. All markers were not employed in order to facilitate calculations, since matrix inversion was used for every bootstrap sample and crossvalidation. With a large number of markers or individuals or for routine application, GBLUP can be computed in using iterative algorithms, but this is a numerical, as opposed to conceptual, issue. Further, N Train ~449 and N Test ~599{449~150; the model was trained using a grid with 10 values of l: 5, 10, 50, 100, 150, 00, 50, 300, 350 and 400. All lines were present in each partition of the data into the two disjoint training and testing sets, with the process repeated at random 100 times, to estimate the crossvalidation distribution. Each implementation of BGBLUP used 5 bootstrap samples and performance was evaluated via predictive correlation, predictive mean squared error and mean absolute deviation between predicted and realized phenotypes. We found (results not shown) that the optimum l was around in environments 1 and with stronger regularization needed in environments 3 and 4. BGBLUP was slightly better than GBLUP in environment 1 near optimum regularization, and clearly better at the lower values of l because GBLUP nearly overfitted; a similar picture emerged in environment. Overall, BGBLUP was better than GBLUP when l was below the optimum, sometimes slightly better at the optimum, with no difference if regularization was excessive, due to the fact that the two models were rendered effectively simpler as l grew. 
The upper and lower panels of Figure 9 give results for environments 3 and 4, respectively; results for environments 1 and 2 were similar. As found with the simulated data, over the 100 repetitions × 10 values of $\lambda$ = 1,000 comparisons, BGBLUP tended to produce larger predictive correlations and smaller mean squared errors than GBLUP. However, this superiority was not uniform and depended on whether the model was under- or over-regularized, with BGBLUP having slightly better performance in the former situation but slightly worse in the latter.

Figure 5. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training correlations, predictive correlations and predictive mean squared error (MSE) for 1,300 comparisons between bagged GBLUP (BGBLUP, 25 bootstrap samples) and GBLUP.

This is illustrated for environment 4 in the upper left panel of Figure 10, where average differences between BGBLUP and GBLUP over the grid of lambda values are shown for predictive correlations, predictive mean squared error and average absolute difference between predicted and realized values. The other panels of Figure 10 indicate that, over the 100 cross-validations, BGBLUP was better at low values of $\lambda$, slightly better at near-optimum values of $\lambda$ and mildly worse when regularization was extreme ($\lambda = 400$). Hence, it would seem that BGBLUP performs at least as well as GBLUP unless a gross error is made in assessing the value of the regularization parameter in model training. Such a large error is unlikely if the variances are estimated from training data (unless the sample is small) or evaluated over a grid of suitable candidate values. One should be cautious about elicitations of the regularization parameter based on simple theoretical arguments that may not hold.

Discussion

We examined whether or not bootstrap sampling in the context of GBLUP can enhance predictive ability in cross-validation. Simulations (with known or unknown QTL) and a wheat data set with grain yield information were used for this purpose. In the simulations, it was found that bagging BLUP estimates of marker effects or of genomic signal increased the slope of the regression of true marker effect or marked breeding value on the predictor, relative to what is expected under BLUP theory. When an individual was evaluated as extreme by GBLUP, bagging made the estimate less extreme. If the linear model entertained holds, the regression of true signal on GBLUP is expected to be 1, but the regression on BGBLUP is steeper, because the latter has smaller variance. This is easy to see: if $T$ is a predictand and $\hat T$ is its BLUP, then $\mathrm{Cov}(T, \hat T) = \mathrm{Var}(\hat T)$, so the slope of the regression of $T$ on $\hat T$ is one. Now, if $\hat T_{Bagged}$ is the average of $B$ bootstrap copies of $\hat T$, the variance of $\hat T_{Bagged}$ is about $B$ times smaller (assuming samples are mildly correlated) than that of $\hat T$, but $\mathrm{Cov}(T, \hat T) = \mathrm{Cov}(T, \hat T_{Bagged})$. Hence, the regression must be larger than 1.
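In symbols, and under the stated assumption that averaging leaves the covariance with the predictand unchanged while shrinking the variance, the argument condenses to

$$\frac{\mathrm{Cov}(T, \hat T_{Bagged})}{\mathrm{Var}(\hat T_{Bagged})} = \frac{\mathrm{Var}(\hat T)}{\mathrm{Var}(\hat T_{Bagged})} > \frac{\mathrm{Var}(\hat T)}{\mathrm{Var}(\hat T)} = 1, \qquad \text{since } \mathrm{Var}(\hat T_{Bagged}) < \mathrm{Var}(\hat T).$$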

Figure 6. Simulation with 100 unknown QTL, 500 markers and 250 individuals in each training and testing set: training correlations, predictive correlations and predictive mean squared error (MSE) for 100 comparisons between bagged GBLUP (BGBLUP, 25 bootstrap samples) and GBLUP at two levels of under-regularization.

It was also found that bagging conferred robustness to GBLUP, because it is less prone to over-fitting and often delivered better predictions in terms of correlation and mean squared error even when regularization was optimal. At least in simulated data, BGBLUP was not inferior to GBLUP when shrinkage was beyond what it should be.

Bagging allows estimating a cross-validation prediction error mean squared error for each subject tested. In theory, given training data, GBLUP in a testing set can be computed as

$$\hat g_{Train} = G_{Train}\left(G_{Train} + I_{N_{Train}}\frac{\sigma^2_e}{\sigma^2_b}\right)^{-1} y = \left(I_{N_{Train}} + G_{Train}^{-1}\lambda_{Optimum}\right)^{-1} y,$$

with $\lambda_{Optimum}$ being some optimum value of the regularization parameter. If the problem is that of predicting a future set of records $y_{Test}$ on the same individuals, the variance-covariance matrix of prediction errors (under normality assumptions) is

$$\mathrm{Var}(y_{Test}\,|\,y_{Train}) = \mathrm{Var}(g\,|\,y_{Train}) + I\sigma^2_e.$$

Given that the assumptions hold, there are two sources of uncertainty here. The first is uncertainty about signal (breeding value) given training data, and the second is noise variability associated with the yet-to-be-realized observations. The model-derived (expected) reliabilities of the predicted genomic values from the training data are given by the diagonals of the corresponding reliability matrix.


More information

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL)

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) Paulino Pérez 1 José Crossa 2 1 ColPos-México 2 CIMMyT-México September, 2014. SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

(Genome-wide) association analysis

(Genome-wide) association analysis (Genome-wide) association analysis 1 Key concepts Mapping QTL by association relies on linkage disequilibrium in the population; LD can be caused by close linkage between a QTL and marker (= good) or by

More information

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to 1 1 1 1 1 1 1 1 0 SUPPLEMENTARY MATERIALS, B. BIVARIATE PEDIGREE-BASED ASSOCIATION ANALYSIS Introduction We propose here a statistical method of bivariate genetic analysis, designed to evaluate contribution

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature25973 Power Simulations We performed extensive power simulations to demonstrate that the analyses carried out in our study are well powered. Our simulations indicate very high power for

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation

Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation Libraries Conference on Applied Statistics in Agriculture 015-7th Annual Conference Proceedings Best Linear Unbiased Prediction: an Illustration Based on, but Not Limited to, Shelf Life Estimation Maryna

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Lecture 2: Linear and Mixed Models

Lecture 2: Linear and Mixed Models Lecture 2: Linear and Mixed Models Bruce Walsh lecture notes Introduction to Mixed Models SISG, Seattle 18 20 July 2018 1 Quick Review of the Major Points The general linear model can be written as y =

More information

Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef sire evaluation

Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef sire evaluation Genet. Sel. Evol. 36 (2004) 325 345 325 c INRA, EDP Sciences, 2004 DOI: 10.1051/gse:2004004 Original article Should genetic groups be fitted in BLUP evaluation? Practical answer for the French AI beef

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

APPENDIX 1 BASIC STATISTICS. Summarizing Data

APPENDIX 1 BASIC STATISTICS. Summarizing Data 1 APPENDIX 1 Figure A1.1: Normal Distribution BASIC STATISTICS The problem that we face in financial analysis today is not having too little information but too much. Making sense of large and often contradictory

More information

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty Estimation of Parameters in Random Effect Models with Incidence Matrix Uncertainty Xia Shen 1,2 and Lars Rönnegård 2,3 1 The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden; 2 School

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Tests for Hardy Weinberg Equilibrium Karin S. Dorman Department of Statistics Iowa State University September 7, 2006 Statistical Hypothesis Testing Identify a hypothesis,

More information

Lecture 2. Fisher s Variance Decomposition

Lecture 2. Fisher s Variance Decomposition Lecture Fisher s Variance Decomposition Bruce Walsh. June 008. Summer Institute on Statistical Genetics, Seattle Covariances and Regressions Quantitative genetics requires measures of variation and association.

More information

Model Building: Selected Case Studies

Model Building: Selected Case Studies Chapter 2 Model Building: Selected Case Studies The goal of Chapter 2 is to illustrate the basic process in a variety of selfcontained situations where the process of model building can be well illustrated

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

arxiv: v1 [stat.me] 10 Jun 2018

arxiv: v1 [stat.me] 10 Jun 2018 Lost in translation: On the impact of data coding on penalized regression with interactions arxiv:1806.03729v1 [stat.me] 10 Jun 2018 Johannes W R Martini 1,2 Francisco Rosales 3 Ngoc-Thuy Ha 2 Thomas Kneib

More information

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power Proportional Variance Explained by QTL and Statistical Power Partitioning the Genetic Variance We previously focused on obtaining variance components of a quantitative trait to determine the proportion

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017 Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Distinctive aspects of non-parametric fitting

Distinctive aspects of non-parametric fitting 5. Introduction to nonparametric curve fitting: Loess, kernel regression, reproducing kernel methods, neural networks Distinctive aspects of non-parametric fitting Objectives: investigate patterns free

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

A simple method to separate base population and segregation effects in genomic relationship matrices

A simple method to separate base population and segregation effects in genomic relationship matrices Plieschke et al. Genetics Selection Evolution (2015) 47:53 DOI 10.1186/s12711-015-0130-8 Genetics Selection Evolution RESEARCH ARTICLE Open Access A simple method to separate base population and segregation

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Evolutionary quantitative genetics and one-locus population genetics

Evolutionary quantitative genetics and one-locus population genetics Evolutionary quantitative genetics and one-locus population genetics READING: Hedrick pp. 57 63, 587 596 Most evolutionary problems involve questions about phenotypic means Goal: determine how selection

More information

The Admixture Model in Linkage Analysis

The Admixture Model in Linkage Analysis The Admixture Model in Linkage Analysis Jie Peng D. Siegmund Department of Statistics, Stanford University, Stanford, CA 94305 SUMMARY We study an appropriate version of the score statistic to test the

More information

BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

Case-Control Association Testing. Case-Control Association Testing

Case-Control Association Testing. Case-Control Association Testing Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits. Technological advances have made it feasible to perform case-control association studies

More information

A nonparametric two-sample wald test of equality of variances

A nonparametric two-sample wald test of equality of variances University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Statistical Power of Model Selection Strategies for Genome-Wide Association Studies Zheyang Wu 1, Hongyu Zhao 1,2 * 1 Department of Epidemiology and Public Health, Yale University School of Medicine, New

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS. Universidad de Quintana Roo, Chetumal, Quintana Roo, México.

BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS. Universidad de Quintana Roo, Chetumal, Quintana Roo, México. G3: Genes Genomes Genetics Early Online, published on October 28, 2016 as doi:10.1534/g3.116.035584 1 BAYESIAN GENOMIC PREDICTION WITH GENOTYPE ENVIRONMENT INTERACTION KERNEL MODELS Jaime Cuevas 1, José

More information

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013 Lecture 9 Short-Term Selection Response: Breeder s equation Bruce Walsh lecture notes Synbreed course version 3 July 2013 1 Response to Selection Selection can change the distribution of phenotypes, and

More information

Chapter 5 Prediction of Random Variables

Chapter 5 Prediction of Random Variables Chapter 5 Prediction of Random Variables C R Henderson 1984 - Guelph We have discussed estimation of β, regarded as fixed Now we shall consider a rather different problem, prediction of random variables,

More information

Chapter 11 MIVQUE of Variances and Covariances

Chapter 11 MIVQUE of Variances and Covariances Chapter 11 MIVQUE of Variances and Covariances C R Henderson 1984 - Guelph The methods described in Chapter 10 for estimation of variances are quadratic, translation invariant, and unbiased For the balanced

More information

Lecture 3. Introduction on Quantitative Genetics: I. Fisher s Variance Decomposition

Lecture 3. Introduction on Quantitative Genetics: I. Fisher s Variance Decomposition Lecture 3 Introduction on Quantitative Genetics: I Fisher s Variance Decomposition Bruce Walsh. Aug 004. Royal Veterinary and Agricultural University, Denmark Contribution of a Locus to the Phenotypic

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Variance Components: Phenotypic, Environmental and Genetic

Variance Components: Phenotypic, Environmental and Genetic Variance Components: Phenotypic, Environmental and Genetic You should keep in mind that the Simplified Model for Polygenic Traits presented above is very simplified. In many cases, polygenic or quantitative

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Evolution of quantitative traits

Evolution of quantitative traits Evolution of quantitative traits Introduction Let s stop and review quickly where we ve come and where we re going We started our survey of quantitative genetics by pointing out that our objective was

More information

Repeated Records Animal Model

Repeated Records Animal Model Repeated Records Animal Model 1 Introduction Animals are observed more than once for some traits, such as Fleece weight of sheep in different years. Calf records of a beef cow over time. Test day records

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

2 Inference for Multinomial Distribution

2 Inference for Multinomial Distribution Markov Chain Monte Carlo Methods Part III: Statistical Concepts By K.B.Athreya, Mohan Delampady and T.Krishnan 1 Introduction In parts I and II of this series it was shown how Markov chain Monte Carlo

More information