GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL)


Paulino Pérez (1), José Crossa (2)
(1) ColPos-México; (2) CIMMyT-México
September, 2014. SLU, Sweden

Contents
1 General comments
2 LASSO
3 Application examples
4 Extension of BL to include infinitesimal effect

General comments

The linear regression model is given by

\[ y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + e_i, \qquad (1) \]

where $e_i \sim N(0, \sigma_e^2)$, $i = 1, \ldots, n$.

The key idea is to obtain estimates for $\beta$ and then use them to obtain GEBVs. $\hat{\beta}$ can be obtained using penalized regression methods, for example ridge regression (equivalent to G-BLUP).

Now we review another penalized regression method called LASSO (Least Absolute Shrinkage and Selection Operator).
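As a rough illustration of model (1) and of how GEBVs follow from $\hat{\beta}$, the sketch below simulates a small data set and fits ridge regression in base R. The sample sizes, the simulated effects, and the value of `lambda` are assumptions made only for this example; they do not come from the workshop data.

```r
# Simulate data from model (1) and compute GEBVs with ridge regression.
# All sizes and the value of lambda are illustrative assumptions.
set.seed(1)
n <- 100; p <- 300
X <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n)  # markers coded 0/1
beta <- rnorm(p, mean = 0, sd = 0.1)                         # simulated marker effects
mu <- 2
y <- as.vector(mu + X %*% beta + rnorm(n, sd = 1))           # e_i ~ N(0, 1)

# Centre markers and phenotype so that mu is estimated by mean(y)
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

lambda <- 50                                                 # assumed penalty
beta_hat <- solve(crossprod(Xc) + lambda * diag(p), crossprod(Xc, yc))

gebv  <- as.vector(Xc %*% beta_hat)  # genomic estimated breeding values
y_hat <- mean(y) + gebv              # fitted phenotypes
cor(y, y_hat)                        # in-sample correlation
```

In practice `lambda` would be chosen from the data (for example by cross-validation) rather than fixed in advance.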

LASSO

In LASSO, estimates for $\beta$ are obtained by minimizing the penalized sum of squares:

\[ \min_{\beta} \Big\{ \Big(y - \sum_{j} X_j\beta_j\Big)' \Big(y - \sum_{j} X_j\beta_j\Big) + \lambda \sum_{j} |\beta_j| \Big\}, \qquad (2) \]

where $X_j$ is the $j$-th column of the marker matrix and $\lambda \geq 0$ is a regularization parameter that controls the trade-off between goodness of fit (measured with the sum of squares of error, SSE) and model complexity (measured with $\sum_j |\beta_j|$). The intercept $\mu$ is omitted in (2) for brevity; $y$ can be taken as centered.

Notes:
1 The value for $\lambda$ can be fixed by using cross-validation methods.
2 Some of the entries in $\hat{\beta}$ take the value of 0, so LASSO can be useful as a variable selection method.
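The minimization in (2) can be sketched with cyclic coordinate descent and soft-thresholding. This is only one of several ways to solve the problem and is not necessarily what any particular package does; the function names `soft` and `lasso_cd`, the simulated data, and the two values of `lambda` are assumptions for illustration.

```r
# Soft-thresholding operator: shrinks z towards 0 by g, exactly 0 if |z| <= g
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclic coordinate descent for (y - X b)'(y - X b) + lambda * sum(|b|)
lasso_cd <- function(X, y, lambda, n_sweeps = 200) {
  p <- ncol(X)
  beta <- numeric(p)
  xx <- colSums(X^2)
  r <- y                                   # residual y - X beta (beta starts at 0)
  for (s in seq_len(n_sweeps)) {
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]            # remove marker j from the fit
      beta[j] <- soft(sum(X[, j] * r), lambda / 2) / xx[j]
      r <- r - X[, j] * beta[j]            # put the updated marker j back
    }
  }
  beta
}

set.seed(2)
n <- 50; p <- 200                          # n << p, as in genomic selection
X <- scale(matrix(rbinom(n * p, 1, 0.5), n, p), center = TRUE, scale = FALSE)
beta_true <- c(rep(1, 5), rep(0, p - 5))   # only 5 markers have an effect
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.5))
y <- y - mean(y)

b_small <- lasso_cd(X, y, lambda = 2)
b_large <- lasso_cd(X, y, lambda = 10)
c(sum(b_small != 0), sum(b_large != 0))    # number of non-zero effects
```

Comparing the two counts shows how the penalty drives entries of $\hat{\beta}$ to exactly zero, which is what makes LASSO usable for variable selection.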

Problems with LASSO:
1 At most, n entries in $\hat{\beta}$ can be different from 0. This is problematic in GS, where usually n << p (curse of dimensionality).
2 It can be difficult to select the value for $\lambda$.
3 It is difficult to obtain estimates for $\sigma_e^2$.
4 It is difficult to obtain confidence intervals for $\beta_j$, $j = 1, \ldots, p$.

Alternatives: Bayesian estimation methods...

Bayesian LASSO


In ridge regression, $p(\beta_j \mid \sigma_\beta^2) = N(\beta_j \mid 0, \sigma_\beta^2)$, $j = 1, \ldots, p$.

In LASSO, $p(\beta_j \mid \sigma_e^2, \lambda) = DE(\beta_j \mid 0, \lambda/\sigma_e)$.

[Figure 1: Prior in BL and in BRR. x-axis: $\beta$; y-axis: density function.]
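The two priors can be compared with a few lines of base R. In this sketch the helper `ddexp` and the choice of matching both priors to variance 1 are assumptions made for illustration; the slides do not state which parameter values were used in Figure 1.

```r
# Double-exponential (Laplace) density with location 0 and rate `rate`
ddexp <- function(b, rate) 0.5 * rate * exp(-rate * abs(b))

b <- seq(-4, 4, length.out = 401)
sigma2_beta <- 1                 # assumed variance of the Gaussian prior
rate <- sqrt(2 / sigma2_beta)    # DE rate with the same variance (2 / rate^2)

dens_brr <- dnorm(b, mean = 0, sd = sqrt(sigma2_beta))  # BRR prior
dens_bl  <- ddexp(b, rate)                              # BL prior

plot(b, dens_bl, type = "l", lty = 1,
     xlab = expression(beta), ylab = "Density function")
lines(b, dens_brr, lty = 2)
legend("topright", legend = c("BL (double exponential)", "BRR (Gaussian)"),
       lty = 1:2)
```

With equal variances, the double exponential places more mass near zero and in the tails than the Gaussian, so small effects are shrunk more strongly and large effects less strongly.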

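Joint posterior distribution of model unknowns

The joint posterior distribution of the model unknowns is given by:

\[ p(\beta, \sigma_e^2, \mu, \lambda^2 \mid \text{data}) \propto \prod_{i=1}^{n} N\Big(y_i \,\Big|\, \mu + \sum_{j=1}^{p} x_{ij}\beta_j, \sigma_e^2\Big) \prod_{j=1}^{p} p(\beta_j \mid \omega)\; p(\sigma_e^2)\; p(\mu)\; p(\lambda^2), \qquad (3) \]

where $p(\mu) \propto 1$, $p(\sigma_e^2) = \chi^{-2}(\sigma_e^2 \mid df, S)$ (a scaled inverse chi-square), $p(\lambda^2) = Gamma(\lambda^2 \mid rate, shape)$, and $\omega$ denotes the parameters on which the prior of $\beta_j$ depends ($\sigma_e^2$ and $\lambda$ in the double-exponential prior).

This model can be implemented using MCMC methods; for more detail see Park and Casella (2008) and de los Campos et al. (2009).

The model is implemented in the package BLR.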
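MCMC implementations of this model usually rely on writing the double-exponential prior as a scale mixture of normals, as in Park and Casella (2008). A sketch of that representation follows; the auxiliary variances $\tau_j^2$ are not shown on the slides and are introduced here only for illustration.

```latex
\begin{align*}
\beta_j \mid \sigma_e^2, \tau_j^2 &\sim N(0,\; \sigma_e^2 \tau_j^2), \\
\tau_j^2 \mid \lambda &\sim \mathrm{Exp}\!\left(\tfrac{\lambda^2}{2}\right), \\
p(\beta_j \mid \sigma_e^2, \lambda)
  &= \int_0^\infty N(\beta_j \mid 0, \sigma_e^2 \tau_j^2)\,
     \tfrac{\lambda^2}{2} e^{-\lambda^2 \tau_j^2 / 2}\, d\tau_j^2
   = \frac{\lambda}{2\sigma_e} \exp\!\left(-\frac{\lambda |\beta_j|}{\sigma_e}\right).
\end{align*}
```

Conditional on the $\tau_j^2$, the prior of $\beta$ is Gaussian, so the full conditional distributions have closed forms and a Gibbs sampler can be used.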


Application examples

Example 1: Barley dataset

This example comes from Yi and Xu (2008).
Doubled-haploid (DH) population with n = 145 lines, each line tested in 25 environments.
The response variable is grain yield.
We have p = 127 molecular markers covering 7 chromosomes.
The BL model was fitted using the BLR package in R with B = 20,000 iterations, burn-in = 10,000 and thin = 10.

[Figure 2: Point estimates for $\beta$ (B = 20,000, burn-in = 10,000, a = b = 0.1). x-axis: j; y-axis: $\beta_j$.]

[Figure 3: Posterior distributions for $\beta$'s. Panels: $\beta_2$, $\beta_{12}$, $\beta_{13}$, $\beta_{27}$, $\beta_{32}$, $\beta_{34}$.]

[Figure 4: Posterior distributions for $\beta$'s. Panels: $\beta_{37}$, $\beta_{43}$, $\beta_{95}$, $\beta_{101}$, $\beta_{102}$.]


Example 2: Wheat dataset (CIMMyT)

Data for n = 599 wheat lines evaluated in 4 environments, from the CIMMyT wheat improvement program.
The dataset includes p = 1279 molecular markers ($x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$), coded as 0, 1.
The pedigree information is also available.

Let's load the dataset in R:
1 Load R
2 Install the BGLR package (if not yet installed)
3 Load the package
4 Load the data
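The steps above can be sketched as follows, assuming an R session with internet access for the one-time installation. The object names `wheat.X`, `wheat.Y` and `wheat.A` are the ones created by `data(wheat)` in BGLR.

```r
# install.packages("BGLR")   # run once if the package is not installed
library(BGLR)                # load the package
data(wheat)                  # load the data: wheat.X, wheat.Y, wheat.A, ...

dim(wheat.X)                 # markers: lines x markers
dim(wheat.Y)                 # phenotypes: lines x environments
dim(wheat.A)                 # pedigree-derived relationship matrix
```

The dimensions should agree with the description above (599 lines, 1279 markers, 4 environments).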

Let's assume that we want to predict the grain yield for environment 1 using the Bayesian LASSO (BL).
We do not know the values of $\sigma_e^2$ and $\lambda$, so we estimate them from the data.
We will use the function BGLR.

The R code below fits the BL model using a Bayesian approach with non-informative priors for $\sigma_e^2$ and $\lambda$:

    rm(list=ls())
    library(BGLR)
    data(wheat)
    Y=wheat.Y
    X=wheat.X
    y=Y[,1]          #grain yield in environment 1
    setwd("/tmp/")   #BGLR writes its output files here

    #Linear predictor
    ETA=list(list(X=X,model="BL"))

    fmL<-BGLR(y=y,ETA=ETA,nIter=10000,
              burnIn=5000,thin=10)

    #Predicted vs observed
    plot(fmL$yHat,Y[,1])

[Figure: observed vs predicted grain yield. x-axis: fmL$yHat; y-axis: Y[, 1].]

Predictions $\hat{y} = \hat{\mu} + X\hat{\beta}$ and estimates for $\sigma_e^2$ and $\lambda$ can be obtained easily in R:

    > fmL$yHat
    > fmL$varE
    [1] 0.5379243
    > fmL$ETA[[1]]$lambda
    [1] 19.19093

[Figure, two panels comparing Bayesian LASSO and Bayesian Ridge Regression: "Predicted Marker effects" and "Predicted Genetic Values". In each panel one axis corresponds to Bayesian LASSO and the other to Bayesian Ridge Regression.]

The GEBVs can be obtained easily in R:

    #GEBVs
    #option 1
    X%*%fmL$ETA[[1]]$b
    #option 2
    fmL$yHat-fmL$mu

Exercise: Let's assume that we want to predict the grain yield for some wheat lines, and that we have only the genotypic information for those lines. Write the R code for fitting a BL model.
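One possible sketch for the exercise, assuming the lines without phenotypes are represented by setting their entries of y to NA (BGLR then predicts them from the markers). The random choice of 100 lines, the seed, the short chain, and the `saveAt` prefix are assumptions made for illustration only.

```r
library(BGLR)
data(wheat)
X <- wheat.X
y <- wheat.Y[, 1]

set.seed(123)
tst <- sample(seq_along(y), size = 100)  # hypothetical lines with genotypes only
yNA <- y
yNA[tst] <- NA                           # mask their phenotypes

ETA <- list(list(X = X, model = "BL"))
fm <- BGLR(y = yNA, ETA = ETA, nIter = 6000, burnIn = 1000, thin = 10,
           saveAt = file.path(tempdir(), "bl_"), verbose = FALSE)

pred <- fm$yHat[tst]                     # predictions for the masked lines
cor(pred, y[tst])                        # predictive correlation in this split
```

Because the true phenotypes of the masked lines are known here, the last line gives a rough idea of predictive ability; a longer chain and repeated splits would be needed for a stable estimate.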

Extension of BL to include infinitesimal effect

de los Campos et al. (2009) extended the basic BL model to include an infinitesimal effect, that is:

\[ y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + u_i + e_i, \qquad (4) \]

where $u \sim N(0, \sigma_u^2 A)$ and $A$ is the pedigree matrix.

The model can be implemented using Bayesian methods.
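To make the role of $u$ in model (4) concrete, the sketch below simulates an infinitesimal effect with covariance $\sigma_u^2 A$ for a toy relationship matrix. The family structure, variances, and marker effects are all assumptions for illustration and are unrelated to the wheat pedigree.

```r
set.seed(3)
# Toy relationship matrix: 3 families of 4 full sibs
# (1 on the diagonal, 0.5 within family, 0 between families)
fam <- rep(1:3, each = 4)
A <- 0.5 * outer(fam, fam, "==") + 0.5 * diag(length(fam))
n <- nrow(A)

sigma2_u <- 0.3                                # assumed variance of u
L <- t(chol(A))                                # A = L L'
u <- as.vector(sqrt(sigma2_u) * L %*% rnorm(n))  # u ~ N(0, sigma2_u * A)

# Marker part and residual, as in model (4)
p <- 20
X <- matrix(rbinom(n * p, 1, 0.5), n, p)       # markers coded 0/1
beta <- rnorm(p, sd = 0.2)                     # simulated marker effects
mu <- 1
y <- as.vector(mu + X %*% beta + u + rnorm(n, sd = 0.5))
```

Lines from the same family share part of $u$ through $A$, which is how the model captures resemblance among relatives that the markers do not explain.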

Example 3: Including an infinitesimal effect

In this example we continue with the analysis of the wheat dataset, and we include an infinitesimal effect in the model.

    rm(list=ls())
    setwd("/tmp")
    library(BGLR)
    data(wheat)   #Loads the wheat dataset
    X=wheat.X
    A=wheat.A
    Y=wheat.Y
    y=Y[,1]

    #Linear predictor: markers (BL) + pedigree (RKHS with K=A)
    ETA=list(list(X=X,model="BL"),
             list(K=A,model="RKHS"))

    ### Runs the Gibbs sampler
    fm<-BGLR(y=y,ETA=ETA,
             nIter=30000,burnIn=5000,thin=10)

[Figure 5: Posterior distribution for $\sigma_e^2$ and $\sigma_u^2$. Panels: trace plot (x-axis: Iter) and density for each variance.]

[Figure 6: Marker effects. x-axis: Marker; y-axis: $\beta_j$.]

Narrow sense heritability calculated according to Yi and Xu (2008),

\[ h_j^2 = \frac{V_j \hat{\beta}_j^2}{V_y}, \]

where $V_y$ is the phenotypic variance, and $V_j$ is the sample variance of $x_{ij}$; $i = 1, \ldots, n$.

[Figure 7: Heritability. x-axis: Marker; y-axis: $h^2$.]
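The formula above can be computed with a short helper. The function name `marker_h2` and the toy data are assumptions for illustration; with a fitted BGLR model, the estimated marker effects would be taken from `fm$ETA[[1]]$b`.

```r
# Per-marker heritability: h2_j = V_j * beta_hat_j^2 / V_y
marker_h2 <- function(X, beta_hat, y) {
  Vj <- apply(X, 2, var)   # sample variance of each marker
  Vy <- var(y)             # phenotypic variance
  Vj * beta_hat^2 / Vy
}

# Toy data, only to show the call
set.seed(4)
n <- 30; p <- 10
X <- matrix(rbinom(n * p, 1, 0.5), n, p)
beta_hat <- rnorm(p, sd = 0.1)        # stands in for estimated marker effects
y <- as.vector(X %*% beta_hat + rnorm(n))

h2 <- marker_h2(X, beta_hat, y)
plot(seq_len(p), h2, type = "h", xlab = "Marker", ylab = expression(h^2))

# With the wheat example: marker_h2(wheat.X, fm$ETA[[1]]$b, y)
```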

[Figure 8: Observed vs predicted values. Axes: Phenotype and Pred. Gen. Value.]

Questions?

References

Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103: 681-686.

Yi, N. and Xu, S. (2008). Bayesian Lasso for Quantitative Trait Loci Mapping. Genetics, 179: 1045-1055.

de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra, E. Manfredi, K. Weigel and J. Cotes. (2009). Predicting Quantitative Traits with Regression Models for Dense Molecular Markers and Pedigree. Genetics, 182: 375-385.

Pérez-Rodríguez, P., G. de los Campos, J. Crossa and D. Gianola. (2010). Genomic-enabled prediction based on molecular markers and pedigree using the BLR package in R. The Plant Genome, 3(2): 106-116.