GENOME-ENABLED PREDICTION: RIDGE REGRESSION, GENOMIC BLUP, AND ROBUST METHODS (TMAP AND LMAP)


2 RIDGE REGRESSION THE ESTIMATOR (HOERL AND KENNARD, 1970) WAS DERIVED USING A CONSTRAINED MINIMIZATION ARGUMENT:

β̂_ridge = (X'X + ωI)⁻¹ X'y

where ω is the reciprocal of the Lagrange multiplier (the regularization parameter in machine learning).

Expected value (the estimator is biased):
E(β̂_ridge) = (X'X + ωI)⁻¹ X'X β

Covariance matrix:
Var(β̂_ridge) = σ²_e (X'X + ωI)⁻¹ X'X (X'X + ωI)⁻¹
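A minimal R sketch of the estimator above (the simulated toy data and the value of ω are assumptions for illustration, not from the slides):

## Ridge estimator on simulated data: shrinkage relative to OLS
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y <- X %*% beta + rnorm(n)
omega <- 5
# (X'X + omega*I)^{-1} X'y
bhat_ridge <- solve(crossprod(X) + omega * diag(p), crossprod(X, y))
bhat_ols <- solve(crossprod(X), crossprod(X, y))  # OLS for comparison
cbind(ols = bhat_ols, ridge = bhat_ridge)         # ridge coefficients are shrunken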

3 NOTE: THE INTERCEPT IS NOT REGULARIZED

4 WAYS IN WHICH RIDGE REGRESSION CAN BE INTERPRETED
- As a shrunken estimator of regressions or marker effects (ridge)
- As a predictor of random effects (BLUP) [THIS IS A CRYPTIC INTERPRETATION BECAUSE WE WISH TO LEARN GENE EFFECTS: these do not vary at random, but over a conceptual distribution]
- As the mean of a conditional posterior distribution (Bayes)
- As the maximum of a penalized likelihood under the L2 norm (PMLE)
In all cases we need ω OR the variance ratio (and the individual variances for interval inference). The PMLE view is illustrated in the sketch below.
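A small numerical check of the PMLE interpretation: minimizing the L2-penalized residual sum of squares reproduces the closed-form ridge solution (toy data and the value of ω are assumptions):

## Penalized-likelihood (L2) optimum equals the closed-form ridge estimator
set.seed(2)
n <- 80; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
omega <- 3
pls <- function(b) sum((y - X %*% b)^2) + omega * sum(b^2)  # penalized SS
b_opt <- optim(rep(0, p), pls, method = "BFGS")$par
b_closed <- drop(solve(crossprod(X) + omega * diag(p), crossprod(X, y)))
max(abs(b_opt - b_closed))  # ~0: same estimator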

5 BLUP


8 Bayes IN THE BAYESIAN INTERPRETATION, β HAS A TRUE, FIXED VALUE. THE RANDOMNESS REPRESENTS UNCERTAINTY PRIOR AND POSTERIOR TO OBSERVING DATA

9 RIDGE REGRESSION EXAMPLE 1

10 RIDGE REGRESSION EXAMPLE 2
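The worked examples announced on slides 9-10 did not survive extraction. As a stand-in, here is a small illustrative sketch (all data simulated, not the original examples) of how the ridge solution changes as ω grows:

## Ridge coefficient paths over a grid of omega values
set.seed(3)
n <- 60; p <- 20
X <- matrix(rbinom(n * p, 2, 0.3), n, p)     # toy 0/1/2 genotype codes
X <- scale(X, center = TRUE, scale = FALSE)
y <- X %*% c(rnorm(5), rep(0, p - 5)) + rnorm(n)
omegas <- c(0.1, 1, 10, 100)
B <- sapply(omegas, function(w) solve(crossprod(X) + w * diag(p), crossprod(X, y)))
matplot(log10(omegas), t(B), type = "l",
        xlab = "log10(omega)", ylab = "coefficient")  # all paths shrink toward 0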


14 ASSIGNING A VALUE TO ω: GENERALIZED CROSS-VALIDATION; VARIANCE COMPONENTS; OR BRUTE FORCE (a grid search over candidate values).

15 GCV: MOTIVATION The residual sum of squares will be called T(λ). With H(λ) = X(X'X + λI)⁻¹X' the "hat" matrix of the ridge fit, generalized cross-validation chooses λ to minimize

GCV(λ) = n T(λ) / [tr(I − H(λ))]²
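A sketch of this criterion in R, reusing the X and y simulated in the previous sketch (the grid of λ values is an arbitrary assumption):

## Generalized cross-validation over a lambda grid
gcv <- function(lam, X, y) {
  n <- nrow(X); p <- ncol(X)
  H <- X %*% solve(crossprod(X) + lam * diag(p), t(X))  # hat matrix
  Tlam <- sum((y - H %*% y)^2)                          # residual SS, T(lambda)
  n * Tlam / (n - sum(diag(H)))^2                       # tr(I - H) = n - tr(H)
}
lams <- 10^seq(-2, 3, length.out = 50)
scores <- sapply(lams, gcv, X = X, y = y)
lams[which.min(scores)]   # GCV-optimal lambda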

16 SAME TRAIT IN 4 DIFFERENT ENVIRONMENTS: FOUR DIFFERENT BEHAVIORS!

17 GENOMIC BLUP


19 RECALL Genotypes (the random variable W denotes the genotype at a locus).

Coding 1: W_aa = -1, W_Aa = 0, W_AA = 1. Under Hardy-Weinberg,
E_HW(W) = p² − q² = p − q
Var_HW(W) = E(W²) − [E(W)]² = (p² + q²) − (p − q)² = 2pq

Coding 2: W_aa = 0, W_Aa = 1, W_AA = 2. Under Hardy-Weinberg,
E_HW(W) = 2pq + 2p² = 2p
Var_HW(W) = (2pq + 4p²) − 4p² = 2pq

Coding does not affect the variance of genotypes, but the mean shifts (p − q = 2p − 1 versus 2p). Deviations from the means are invariant to this type of coding (checked numerically below):

Genotype | W − E(W), Coding 1   | W − E(W), Coding 2
aa       | −1 − (p − q) = −2p   | 0 − 2p = −2p
Aa       |  0 − (p − q) = q − p | 1 − 2p = q − p
AA       |  1 − (p − q) = 2q    | 2 − 2p = 2q
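A quick R check of the claims above, for an arbitrary allele frequency (the value p = 0.3 is an assumption):

## Both codings: equal variance, identical centered deviations
p <- 0.3; q <- 1 - p
freqs <- c(aa = q^2, Aa = 2 * p * q, AA = p^2)  # HW genotype frequencies
w1 <- c(-1, 0, 1); w2 <- c(0, 1, 2)
m1 <- sum(freqs * w1); m2 <- sum(freqs * w2)
v1 <- sum(freqs * w1^2) - m1^2
v2 <- sum(freqs * w2^2) - m2^2
c(v1, v2, 2 * p * q)        # all three equal
rbind(w1 - m1, w2 - m2)     # identical rows: deviations invariant to coding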

20 A LOOK AT VAN RADEN'S GENOMIC RELATIONSHIP MATRIX

X_{ind,marker} =
[ x_11 ... x_1p
  x_21 ... x_2p
  ...
  x_n1 ... x_np ]   (n individuals, p markers)

XX' =
[ Σ_j x²_1j      Σ_j x_1j x_2j  ...  Σ_j x_1j x_nj
  Σ_j x_2j x_1j  Σ_j x²_2j      ...  Σ_j x_2j x_nj
  ...
  Σ_j x_nj x_1j  ...                 Σ_j x²_nj ]

21 In Van Raden's G-matrix, with genotypes coded 0/1/2 and allele frequency p_j at marker j:

E(x_ij) = 2p_j
Var(x_ij) = E(x²_ij) − [E(x_ij)]² = 2 p_j q_j
Cov(x_1j, x_2j) = a_12 · 2 p_j q_j
E(x_1j x_2j) = Cov(x_1j, x_2j) + E(x_1j) E(x_2j) = a_12 · 2 p_j q_j + 4 p²_j

The scaling factor is Σ_j 2 p_j q_j: if all elements of G(VR) are divided by this factor, then the scale is consistent with A. Or, if the x's are centered, the expected off-diagonal (after division by Σ_j 2 p_j q_j) is the additive relationship.

Note: LD does not enter into this form of genomic relationship matrix.

22 UNDER HARDY-WEINBERG AND IDEALIZED CONDITIONS

E(XX') = 2 Σ_j p_j q_j ·
[ 1          a_12  ...  a_1n
             1     ...  a_2n
  symmetric        ...  a_{n,n-1}
                        1 ]

that is, each off-diagonal element is a_{ii'} · 2 Σ_j p_j q_j (additive relationships) and each diagonal element is 2 Σ_j p_j q_j.

23 Likewise, if the x's are centered:

E{[X − E(X)][X − E(X)]'} / (2 Σ_j p_j q_j) =
[ 1          a_12  ...  a_1n
             1     ...  a_2n
  symmetric        ...  a_{n,n-1}
                        1 ]
= A

A = n × n matrix of additive relationships.

24 Then, the genomic relationship matrix

G = [X − E(X)][X − E(X)]' / (2 Σ_j p_j q_j) = X_c X_c' / V_{M,HW}

is the realization of a process. If this process is the HW process, then its expectation is

E{[X − E(X)][X − E(X)]'} / (2 Σ_j p_j q_j) = A.

For example: a parent and its offspring are expected to have a relationship of 0.5, but in a given realization it could be larger or smaller.

25 MANY G-MATRICES (each may provide a different variance component decomposition). Examples:

G_VR = X_cent X_cent' / (2 Σ_j p_j (1 − p_j))
G_ST = (1/p) X_std X_std', where x_std,ij = (x_ij − x̄_j) / √Var(x_ij)
G_0.5 = ½ (G_VR + G_ST), followed by some scaling?
G_W = (1/p) X_std W X_std', where W uses LD information?
G_Blend = α G_ST + (1 − α) A, after some re-scaling of the matrices
G_VR scaled into (0, 2) with a min-max function,
  y = y_min + (x − x_min)(y_max − y_min)/(x_max − x_min), so that
  g_ij,scaled = 2 (g_ij,VR − min g_VR) / (max g_VR − min g_VR)

Two of these are sketched in code below.
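A sketch constructing the first two of these (simulated 0/1/2 genotypes; the sample size, marker count, and frequency range are arbitrary assumptions, and fixed markers are assumed absent):

## G_VR and G_ST on simulated genotypes
set.seed(4)
n <- 50; m <- 200
pfreq <- runif(m, 0.1, 0.9)
M <- sapply(pfreq, function(pj) rbinom(n, 2, pj))   # n x m genotype matrix
phat <- colMeans(M) / 2                             # estimated allele frequencies
Xc <- sweep(M, 2, 2 * phat)                         # centered genotypes
Gvr <- tcrossprod(Xc) / sum(2 * phat * (1 - phat))  # Van Raden G
Xstd <- sweep(Xc, 2, sqrt(2 * phat * (1 - phat)), "/")
Gst <- tcrossprod(Xstd) / m                         # standardized-marker G
c(mean(diag(Gvr)), mean(diag(Gst)))                 # both average near 1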

26 FOCUS: GWAS FOCUS: HERITABILITY FOCUS: KINSHIP

27 MARKERS ARE NOT QTL: a disconnect

28 QTLs in LD with markers vs. QTLs in LE with markers

29 [Figure: distribution of the off-diagonals of genomic correlations among 500 individuals. Markers are OBSERVABLE; QTLs are UNOBSERVABLE.]

30 BACK TO BASIC SETTING: linear regression on markers

31 BLUP

β̂ = Cov(β, y') [Var(y)]⁻¹ [y − E(y)]

BRUTE FORCE 1: β̂ = σ²_β X' (XX' σ²_β + I σ²_e)⁻¹ y
BRUTE FORCE 2: β̂ = X' (XX' + I σ²_e/σ²_β)⁻¹ y

BRUTE FORCE: invert n × n and then map onto the p × 1 vector of marker effects. MME: invert p × p. NO COMPELLING REASON FOR MME HERE (p is much larger than n). See the sketch below.
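A numerical check of the two routes (toy data; λ = σ²_e/σ²_β fixed arbitrarily), using the identity X'(XX' + λI_n)⁻¹ = (X'X + λI_p)⁻¹X':

## n x n "brute force" route vs p x p (MME/ridge) route
set.seed(5)
n <- 40; p <- 300
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p, 0, 0.1) + rnorm(n)
lambda <- 2
b_nxn <- t(X) %*% solve(tcrossprod(X) + lambda * diag(n), y)     # invert n x n
b_pxp <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y)) # invert p x p
max(abs(b_nxn - b_pxp))   # ~0: identical marker-effect BLUPs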

32 MARKED GENOTYPIC VALUE: ĝ = X β̂ = XX' (XX' + I σ²_e/σ²_β)⁻¹ y. SAME RESULT: BOTH ARE n × n COMPUTATIONS.

33 Estimate marker effects from genomic BLUP? Use standard BLUP theory under normality!

β̂ = E(β | y, variance components) = E_g[ E(β | g) | y ]   "ITERATED EXPECTATIONS"

Step 1: E(β | g) under normality, with g = Xβ, E(β) = 0, Var(β) = I σ²_β:
E(β | Xβ) = E(β) + Cov(β, β'X') [Var(Xβ)]⁻¹ [Xβ − E(Xβ)]
          = σ²_β X' (XX' σ²_β)⁻¹ Xβ
so E(β | g) = X'(XX')⁻¹ g.

Step 2: condition on the data:
β̂ = E(β | y) = X'(XX')⁻¹ E(g | y) = X'(XX')⁻¹ ĝ

with ĝ = E(g | y, variance components) = XX' [XX' + I σ²_e/σ²_β]⁻¹ y = G [G + I σ²_e/(σ²_β V_{M,HW})]⁻¹ y   [REMEMBER THIS]

34 BRUTE FORCE DEFINITION: BLUP is a conditional expectation under normality

β̂ = E(β | y, variance components) = Cov(β, y')[Var(y)]⁻¹ y
   = σ²_β X' (XX' σ²_β + I σ²_e)⁻¹ y
   = X' (XX' + I σ²_e/σ²_β)⁻¹ y   [REMEMBER?]

CAN GO BACK AND FORTH BETWEEN GENOMIC BLUP AND RIDGE REGRESSION ESTIMATES OF MARKER EFFECTS:

β̂ = X'(XX')⁻¹ ĝ   and   ĝ = X β̂
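Continuing the toy objects from the previous sketch (X, b_nxn, with p > n), a check that this mapping is exact when XX' has full rank n:

## Back and forth: marker effects -> genomic values -> marker effects
ghat_toy <- X %*% b_nxn                            # ghat = X betahat
b_back <- t(X) %*% solve(tcrossprod(X), ghat_toy)  # betahat = X'(XX')^{-1} ghat
max(abs(b_back - b_nxn))                           # ~0: interchangeable representations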

35 BACK TO GENOMIC BLUP. When should a specific representation of GBLUP be used? Suppose p < n. Then G has at most rank = p and the inverse of G does not exist.

rm(list = ls(all = TRUE))

### LOAD PACKAGES
library(MASS)
library(BGLR)
library(lattice)
library(Matrix)
set.seed(123)   # the seed value on the slide was lost in extraction; any fixed value works

### LOAD DATA (wheat data shipped with BGLR)
data(wheat)
Y <- wheat.Y
X <- wheat.X
y <- Y[, 1]
n <- length(y)
X <- X[, 1:50]   # keep p = 50 markers, so p < n

freq <- numeric(ncol(X))
for (j in 1:ncol(X)) {
  freq[j] <- mean(X[, j])
}
### Markers are binary, so the variance of marker codes is p(1-p)
### per locus instead of 2p(1-p)
varHW <- sum(freq * (1 - freq))
varHW

X <- scale(X, center = TRUE, scale = FALSE)

### GVR = genomic relationship a la Van Raden (2008)
GVR <- X %*% t(X) / varHW

par(mfrow = c(2, 1))
vecGVR <- as.vector(GVR)
hist(GVR, main = "Distribution of elements of GVR", xlab = "GVR values")
diagGVR <- diag(GVR)
plot(diagGVR, ylab = "diagonal values", main = "Diagonal values of GVR")
par(mfrow = c(1, 1))
summary(vecGVR)
summary(diagGVR)

[Figure: histogram of the elements of GVR and plot of its diagonal values.]

ISSUE HERE: SCALE DIFFERS FROM THAT OF A!

36 Calculation of GBLUP (once one has arrived at some G):

ĝ = E(g | y) = Cov(g, y') [Var(y)]⁻¹ [y − E(y)]
  = G σ²_g [G σ²_g + I σ²_e]⁻¹ y
  = G [G + I σ²_e/σ²_g]⁻¹ y          (no inverse of G required)
  = [I + G⁻¹ σ²_e/σ²_g]⁻¹ y          (requires G⁻¹)

Prediction error variance:
Var(g | y) = Var(g − ĝ) = G σ²_g − G σ²_g [G σ²_g + I σ²_e]⁻¹ G σ²_g
           = [I − G (G + I σ²_e/σ²_g)⁻¹] G σ²_g

37 #### Does GVR have an inverse in this case? No: rank(GVR) is at most p = 50 < n.
> GVRinv <- chol2inv(chol(GVR))
Error in chol2inv(chol(GVR)) : error in evaluating the argument 'x' in
  selecting a method for function 'chol2inv': Error in chol.default(GVR) :
  the leading minor of order 38 is not positive definite
> rankMatrix(GVR)
Warning in rankMatrix(GVR) : rankMatrix(<large sparse Matrix>, method = 'tolNorm2')
  coerces to dense matrix. Probably should rather use method = 'qrLINPACK' !?
[1] 50
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] e-11

MUST USE THE STRONG ARM FOR CALCULATING GBLUP:

ĝ = G_VR [G_VR + I σ²_e/σ²_g]⁻¹ y
Var(ĝ − g) = [I − G_VR (G_VR + I σ²_e/σ²_g)⁻¹] G_VR σ²_g

38 Suppose σ²_g = 0.30 and σ²_e = 0.70.

#### Compute GBLUP using the strong-arm method.
#### varg = 0.30, vare = 0.70, lambda = vare/varg
varg <- 0.30
vare <- 0.70
lambda <- vare / varg
Vstar <- GVR + lambda * diag(n)
Vstarinv <- chol2inv(chol(Vstar))
ghat <- GVR %*% Vstarinv %*% y
plot(ghat, ylab = "GBLUP", main = "Genomic BLUP (Van Raden G) varg=0.30 vare=0.70")

### Compute the prediction error variance-covariance matrix
PEVMAT <- varg * (diag(n) - GVR %*% Vstarinv) %*% GVR

### CALCULATE MODEL-DERIVED RELIABILITIES: REL_i = 1 - PEV_i/varg
RELS <- (varg * diag(n) - PEVMAT) / varg
RELGBLUPS <- diag(RELS)
plot(RELGBLUPS, ylab = "REL", main = "Reliabilities of G-BLUP (Van Raden G)")

[Figure: GBLUP values and reliabilities for the Van Raden G.] No evidence of overfit.

39 ##### Impact of the G-matrix on GBLUP
##### Assume the same variance decomposition
##### Scale GVR to be in (0, 2)
GVscaled <- matrix(nrow = nrow(X), ncol = nrow(X))
VRmin <- min(GVR)
VRmax <- max(GVR)
for (i in 1:nrow(X)) {
  for (j in 1:nrow(X)) {
    GVscaled[i, j] <- 2 * (GVR[i, j] - VRmin) / (VRmax - VRmin)
  }
}

#### How does it compare with A?
A <- wheat.A
par(mfrow = c(2, 1))
hist(A, main = "Histogram of A scaled")
hist(GVscaled, main = "Histogram of GVR scaled in (0,2)")
par(mfrow = c(1, 1))
cor(as.vector(A), as.vector(GVscaled))

[Figure: histograms of A and of GVR scaled into (0, 2).]

40 ##### BLUP with A and with the scaled G (assume the same variance decomposition)
varg <- 0.30
vare <- 0.70
lambda <- vare / varg

VstarGVS <- GVscaled + lambda * diag(n)
VstarinvGVS <- chol2inv(chol(VstarGVS))
ghatGVS <- GVscaled %*% VstarinvGVS %*% y

VstarA <- A + lambda * diag(n)
VstarinvA <- chol2inv(chol(VstarA))
ghatA <- A %*% VstarinvA %*% y

par(mfrow = c(3, 1))
plot(ghatA, ghatGVS, main = "BLUP A vs GBLUP GVS")
plot(ghatA, ghat, main = "BLUP A vs GBLUP GVR")
plot(ghat, ghatGVS, main = "GBLUP GVR vs GBLUP GVS")
par(mfrow = c(1, 1))

cor(ghatA, ghat)
cor(ghatA, ghatGVS)
cor(ghat, ghatGVS)

mseA <- sum((y - ghatA)^2) / n
msehat <- sum((y - ghat)^2) / n
msehatGVS <- sum((y - ghatGVS)^2) / n
mseA; msehat; msehatGVS

[Figure: pairwise scatter plots of ghatA, ghat, and ghatGVS.]

BLUP(A) FITS BETTER (but may predict worse).

41 GENERALIZED CV IN GBLUP (zero-means model, wheat data)
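The code for this slide was lost in extraction; below is a hedged sketch of generalized cross-validation for the GBLUP smoother on the wheat data (zero-means model), reusing GVR and y from slide 35. The λ grid is an arbitrary assumption.

## GCV for GBLUP: the smoother is H(lambda) = G (G + lambda I)^{-1}
gcv_gblup <- function(lam, G, y) {
  n <- length(y)
  H <- G %*% chol2inv(chol(G + lam * diag(n)))
  n * sum((y - H %*% y)^2) / (n - sum(diag(H)))^2
}
lams <- seq(0.1, 10, length.out = 50)
scores <- sapply(lams, gcv_gblup, G = GVR, y = y)
lams[which.min(scores)]   # GCV choice of lambda = vare/varg
plot(lams, scores, type = "l", xlab = "lambda", ylab = "GCV")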

