GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL)

Paulino Pérez (1) and José Crossa (2)
(1) ColPos-México; (2) CIMMyT-México
September 2014, SLU, Sweden

Contents
1. General comments
2. LASSO
3. Application examples
4. Extension of BL to include infinitesimal effect

General comments

The linear regression model is given by

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + e_i, \qquad (1)$$

where $e_i \sim N(0, \sigma^2_e)$, $i = 1, \dots, n$.

The key idea is to obtain estimates of $\beta$ and then use them to obtain GEBVs (genomic estimated breeding values). $\hat\beta$ can be obtained using penalized regression methods, for example ridge regression (G-BLUP). We now review another penalized regression method, the LASSO (Least Absolute Shrinkage and Selection Operator).
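
The parenthetical "ridge regression (G-BLUP)" can be checked numerically. The sketch below is only an illustration on simulated data: the sample sizes, the marker effects and the penalty value `lambda` are assumptions, not values from the workshop. Phenotypes and markers are centered so the intercept can be dropped, and G is taken as XX' without any scaling.

```r
# Toy data: every number here is an illustrative assumption.
set.seed(1)
n <- 20; p <- 50
X <- matrix(rbinom(n * p, 1, 0.5), n, p)     # markers coded 0/1
X <- scale(X, center = TRUE, scale = FALSE)  # center each marker
beta <- rnorm(p, sd = 0.3)
y <- as.vector(X %*% beta + rnorm(n))
y <- y - mean(y)                             # centering y removes mu
lambda <- 5                                  # ridge penalty (assumed)

# Marker-effect form: beta_hat = (X'X + lambda I)^(-1) X'y, GEBV = X beta_hat
beta_hat <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
gebv_ridge <- as.vector(X %*% beta_hat)

# Genomic-relationship form: GEBV = G (G + lambda I)^(-1) y, with G = XX'
G <- tcrossprod(X)
gebv_gblup <- as.vector(G %*% solve(G + lambda * diag(n), y))

# Largest absolute difference between the two sets of GEBVs
max(abs(gebv_ridge - gebv_gblup))
```

The two forms differ only in the size of the linear system solved (p x p versus n x n), which is why the relationship-matrix form is convenient when n << p.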

LASSO

In the LASSO, estimates of $\beta$ are obtained by minimizing the penalized sum of squares:

$$\min_{\beta} \left\{ \Big(y - \sum_{j} X_j\beta_j\Big)' \Big(y - \sum_{j} X_j\beta_j\Big) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (2)$$

where $\lambda \ge 0$ is a regularization parameter that controls the trade-off between goodness of fit (measured by the residual sum of squares) and model complexity (measured by $\sum_j |\beta_j|$).

Notes:
1. The value of $\lambda$ can be chosen by cross-validation.
2. Some entries of $\hat\beta$ are exactly 0, so the LASSO is also useful as a variable selection method.
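
To make the criterion concrete, here is a minimal sketch of one standard way to minimize it: cyclic coordinate descent with soft-thresholding. This is not the method used in the workshop (which relies on the Bayesian version); the function name `lasso_cd`, the simulated data and the value `lambda = 40` are all assumptions chosen for illustration.

```r
# Soft-thresholding operator: S(z, g) = sign(z) * max(|z| - g, 0)
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for (y - X b)'(y - X b) + lambda * sum(|b_j|).
# Hypothetical helper written for this example, not part of any package.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p  <- ncol(X)
  b  <- rep(0, p)
  xx <- colSums(X^2)
  r  <- y                                  # residual y - X b (b = 0 at start)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r    <- r + X[, j] * b[j]            # partial residual without marker j
      b[j] <- soft(sum(X[, j] * r), lambda / 2) / xx[j]
      r    <- r - X[, j] * b[j]            # restore the full residual
    }
  }
  b
}

# Simulated example with n < p and only three non-zero effects (assumed values)
set.seed(2)
n <- 50; p <- 100
X <- scale(matrix(rnorm(n * p), n, p))
beta <- c(2, -1.5, 1, rep(0, p - 3))
y <- as.vector(X %*% beta + rnorm(n))
y <- y - mean(y)

b_hat <- lasso_cd(X, y, lambda = 40)
sum(b_hat != 0)   # number of markers left in the model
```

The threshold is `lambda / 2` because the sum of squares in the criterion is not halved. Increasing `lambda` sets more entries of `b_hat` exactly to zero.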

LASSO (continued)

Problems with the LASSO:
1. At most n entries of $\beta$ can be different from 0. This is problematic in GS, where usually n << p (curse of dimensionality).
2. It can be difficult to select the value of $\lambda$.
3. It is difficult to obtain estimates of $\sigma^2_e$.
4. It is difficult to obtain confidence intervals for $\beta_j$, $j = 1, \dots, p$.

Alternatives: Bayesian estimation methods...

Bayesian LASSO

Bayesian LASSO (continued)

In ridge regression, $p(\beta_j \mid \sigma^2_\beta) = N(\beta_j \mid 0, \sigma^2_\beta)$, $j = 1, \dots, p$.

In the LASSO, $p(\beta_j \mid \sigma^2_e, \lambda) = DE(\beta_j \mid 0, \lambda/\sigma^2_e)$, where DE denotes the double-exponential density.

[Figure 1: Prior in BL and in BRR. x-axis: $\beta$; y-axis: density function.]
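
The two priors can be compared with a few lines of R. This is a sketch under assumptions: the helper `dde` is defined here (it is not a base R function), and the rate `sqrt(2)` is chosen only so that the double exponential has variance 1, matching a standard normal. These are not necessarily the settings used to draw Figure 1.

```r
# Double-exponential (Laplace) density with location 0 and rate `rate`.
# Hypothetical helper defined for this example.
dde <- function(b, rate) 0.5 * rate * exp(-rate * abs(b))

# Var(double exponential) = 2 / rate^2, so rate = sqrt(2) gives variance 1
rate <- sqrt(2)
b <- seq(-4, 4, length.out = 401)

plot(b, dde(b, rate), type = "l", lty = 1,
     xlab = expression(beta), ylab = "Density function")
lines(b, dnorm(b, mean = 0, sd = 1), lty = 2)
legend("topright", lty = 1:2,
       legend = c("Double exponential (BL)", "Normal (BRR)"))

# Compare the two densities at the origin and in the tail
c(DE = dde(0, rate), Normal = dnorm(0))
c(DE = dde(4, rate), Normal = dnorm(4))
```

With equal variances, the double exponential places more mass both very near zero and in the tails. This is why BL shrinks small effects more strongly than BRR while shrinking large effects less.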

Joint posterior distribution of model unknowns

The joint posterior distribution of the model unknowns is given by:

$$p(\beta, \sigma^2_e, \mu \mid \text{data}) \propto \prod_{i=1}^{n} N\Big(y_i \,\Big|\, \mu + \sum_{j=1}^{p} x_{ij}\beta_j,\ \sigma^2_e\Big) \prod_{j=1}^{p} p(\beta_j \mid \omega)\; p(\sigma^2_e)\, p(\mu)\, p(\lambda^2), \qquad (3)$$

where $p(\mu) \propto 1$, $p(\sigma^2_e) = \chi^{-2}(\sigma^2_e \mid df, S)$ (scaled inverse chi-square) and $p(\lambda^2) = \text{Gamma}(\lambda^2 \mid \text{rate}, \text{shape})$.

This model can be fitted using MCMC methods; for more detail see Park and Casella (2008) and de los Campos et al. (2009). The model is implemented in the R package BLR.

Example 1: Barley dataset

This example comes from Yi and Xu (2008).
- DH (doubled haploid) population with n = 145 lines, each line tested in 25 environments.
- The response variable is grain yield.
- We have p = 127 molecular markers (MM) covering 7 chromosomes.
- The BL model was fitted using the BLR package in R with B = 20,000 iterations, burn-in = 10,000 and thin = 10.

[Figure 2: Point estimates for $\beta$ (B = 20,000, burn-in = 10,000, a = b = 0.1). x-axis: marker index j; y-axis: $\beta_j$.]

[Figure 3: Posterior distributions for $\beta_2$, $\beta_{12}$, $\beta_{13}$, $\beta_{27}$, $\beta_{32}$ and $\beta_{34}$.]

[Figure 4: Posterior distributions for $\beta_{37}$, $\beta_{43}$, $\beta_{95}$, $\beta_{101}$ and $\beta_{102}$.]

Example 2: Wheat dataset (CIMMyT)

Data for n = 599 wheat lines evaluated in 4 environments, from the CIMMyT wheat improvement program. The dataset includes p = 1279 molecular markers ($x_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, p$), coded as 0/1. The pedigree information is also available.

Let's load the dataset in R:
1. Start R
2. Install the BGLR package (if not yet installed)
3. Load the package
4. Load the data

Example 2: Wheat dataset (CIMMyT) (continued)

Let's assume that we want to predict the grain yield for environment 1 using the Bayesian LASSO (BL). We do not know the values of $\sigma^2_e$ and $\lambda$, so we estimate them from the data. We will use the function BGLR. The R code below fits the BL model using a Bayesian approach with non-informative priors for $\sigma^2_e$ and $\lambda$:

    rm(list=ls())
    library(BGLR)
    data(wheat)      #Loads wheat.Y, wheat.X and wheat.A
    Y=wheat.Y
    X=wheat.X
    y=Y[,1]          #Grain yield in environment 1
    setwd("/tmp/")   #BGLR writes its output files here

    #Linear predictor
    ETA=list(list(X=X,model="BL"))
    fmL<-BGLR(y=y,ETA=ETA,nIter=10000,
              burnIn=5000,thin=10)
    plot(fmL$yHat,Y[,1])

Example 2: Wheat dataset (CIMMyT) (continued)

[Figure: observed (Y[, 1]) vs predicted (fmL$yHat) grain yield.]

The figure shows observed vs predicted grain yield. Predictions $\hat{y} = \hat\mu + X\hat\beta$ and estimates of $\sigma^2_e$ and $\lambda$ can be obtained easily in R:

    > fmL$yHat
    > fmL$varE
    [1] 0.5379243
    > fmL$ETA[[1]]$lambda
    [1] 19.19093

Example 2: Wheat dataset (CIMMyT) (continued)

[Figure: two scatter plots comparing Bayesian LASSO and Bayesian Ridge Regression, one for predicted marker effects and one for predicted genetic values.]

Example 2: Wheat dataset (CIMMyT) (continued)

The GEBVs can be obtained easily in R:

    #GEBVs
    #option 1
    X%*%fmL$ETA[[1]]$b
    #option 2
    fmL$yHat-fmL$mu

Exercise: Let's assume that we want to predict the grain yield for some wheat lines for which we have only the genotypic information. Write the R code for fitting a BL model.

Extension of BL to include infinitesimal effect

de los Campos et al. (2009) extended the basic BL model to include an infinitesimal effect, that is:

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + u_i + e_i, \qquad (4)$$

where $u \sim N(0, \sigma^2_u A)$ and A is the pedigree (relationship) matrix. The model can be implemented using Bayesian methods.

Example 3: Including an infinitesimal effect

In this example we continue with the analysis of the wheat dataset, and we include an infinitesimal effect in the model.

    rm(list=ls())
    setwd("/tmp")
    library(BGLR)
    data(wheat)   #Loads the wheat dataset
    X=wheat.X     #Markers
    A=wheat.A     #Pedigree relationship matrix
    Y=wheat.Y
    y=Y[,1]       #Grain yield in environment 1

    #Linear predictor: markers (BL) + infinitesimal effect (RKHS with K=A)
    ETA=list(list(X=X,model="BL"),
             list(K=A,model="RKHS"))

    ### Runs the Gibbs sampler
    fm<-BGLR(y=y,ETA=ETA,
             nIter=30000,burnIn=5000,thin=10)

[Figure 5: Posterior distribution for $\sigma^2_e$ and $\sigma^2_u$. For each parameter: trace plot against iteration and estimated density.]

[Figure 6: Marker effects. x-axis: marker; y-axis: $\beta_j$.]

Narrow-sense heritability calculated according to Yi and Xu (2008):

$$h^2_j = \frac{V_j \hat\beta_j^2}{V_y},$$

where $V_y$ is the phenotypic variance and $V_j$ is the sample variance of $x_{ij}$, $i = 1, \dots, n$.

[Figure 7: Heritability. x-axis: marker; y-axis: $h^2$.]
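
The per-marker heritability formula translates directly into R. The sketch below uses simulated markers and a made-up vector `b_hat` standing in for the estimated marker effects; both are assumptions for illustration. With a fitted BGLR model, the estimated effects of the first term of the linear predictor are stored in `fm$ETA[[1]]$b`.

```r
# Simulated stand-ins; all values are illustrative assumptions.
set.seed(3)
n <- 100; p <- 10
X <- matrix(rbinom(n * p, 1, 0.5), n, p)   # markers coded 0/1
b_hat <- c(0.5, -0.3, rep(0, p - 2))       # stand-in for estimated effects
y <- as.vector(X %*% b_hat + rnorm(n))     # simulated phenotypes

V_j <- apply(X, 2, var)      # sample variance of each marker
V_y <- var(y)                # phenotypic variance
h2  <- V_j * b_hat^2 / V_y   # heritability attributed to each marker

round(h2, 4)
```

Plotting `h2` against the marker index gives a display of the same kind as Figure 7.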

[Figure 8: Observed vs predicted values. Axes: phenotype and predicted genetic value.]

Questions?

References

- Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103: 681-686.
- Yi, N. and Xu, S. (2008). Bayesian Lasso for Quantitative Trait Loci Mapping. Genetics, 179: 1045-1055.
- de los Campos, G., Naya, H., Gianola, D., Crossa, J., Legarra, A., Manfredi, E., Weigel, K. and Cotes, J. (2009). Predicting Quantitative Traits with Regression Models for Dense Molecular Markers and Pedigree. Genetics, 182: 375-385.
- Pérez-Rodríguez, P., de los Campos, G., Crossa, J. and Gianola, D. (2010). Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. The Plant Genome, 3(2): 106-116.