Lecture 9 GxE Mixed Models. Lucia Gutierrez Tucson Winter Institute


Genotypic Means
GENOTYPIC MEANS: $y_{ijk} = G_i + E_j + GE_{ij} + \varepsilon_{ijk}$
The environment includes the non-genetic factors that affect the phenotype, and it usually has a large influence on quantitative traits.
o Micro-environment. The environment of a single plant. It needs to be controlled through experimental design.
o Macro-environment. The environment associated with a location and time.
GxE is the norm and not the exception in plants. Defining the target environments is therefore a crucial part of plant breeding, both for variance component estimation and for identifying superior genotypes. (Bernardo, 2010)

Outline
1. How to control for micro-environmental variation?
   1. Advanced Experimental Designs
   2. Spatial Variation
   3. Mixed Models for assumption flexibility
2. How to model macro-environmental variation to account for correlations and heterogeneity?
   1. GxE
   2. GxE Mixed Models
3. How to include GxE into QTL analysis?
   1. QTLxE
   2. QTLxE Example

Experimental Design and Analysis
1. Experimental units
   1. Homogeneous: Completely Randomized Design (CRD)
   2. Heterogeneous in one way: Randomized Complete Block Design (RCBD)
   3. Heterogeneous in more than one way: Latin squares or latinized designs
2. Large number of treatments
   1. Incomplete Block Designs (IBD or Alpha)
   2. Unreplicated experiments (or Federer)
3. Modeling (post-blocking, spatial analysis)
4. Assumptions
   1. Independence: MM to model correlations
   2. Homogeneous variances: MM to model heterogeneity
[Figure: example field layouts for CRD, RCBD and IBD]
CRD: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
RCBD: $y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$
IBD: $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{k(j)} + \varepsilon_{ijk}$

Assumptions
Classical models are based on some limiting assumptions: errors are independent random variables with a normal distribution and homogeneous variances.
DEPENDENCIES
o Design factors impose restrictions on randomization that induce correlations (e.g. plots within a block are more similar to each other than to plots in a different block). If correlations exist, they should be included in the model to make valid inferences.
o Genotypes may be related, which imposes a correlation.
o Field heterogeneity also induces correlations.
o There might be a correlation between environments.
NON-HOMOGENEOUS VARIANCES
Both genetic and environmental variances are affected by the environment (i.e. they are properties of the population). Therefore, heterogeneous variances are common in field experiments.

Mixed Models in Field Experiments
Mixed models are more flexible: correlations and heterogeneous variances can be modeled.
FIXED EFFECTS
o inference is about specific treatments
o all levels of a fixed factor are included in the experiment
o interest in testing differences in means between treatments
o need for identification constraints (sum to zero, cornerstone)
RANDOM EFFECTS
o inference is about a population of treatments
o testing the population variance of a treatment
o assumed to have a distribution, $t_i \sim N(0, \sigma_t^2)$
o structuring of variance-covariance, imposing correlations
o prediction from random effects provides the best estimate of treatment rankings (BLUPs)

Fixed Effects vs. Random Effects
FIXED EFFECTS
o Estimation by generalized least squares (conditional on VCOV parameters)
o $H_0: t_i = 0$ for all levels
o The Wald statistic is $\chi^2_r$ distributed, with r = number of levels - 1
o F-approximations to the Wald test can also be used: Wald / r is approximately F-distributed
RANDOM EFFECTS
o Estimation of VCOV by (RE)ML
o $H_0: \sigma_t^2 = 0$
o Compare the likelihood (deviance = $-2L$, with L the log-likelihood) of nested models, i.e. models with and without the variance component under test
o Approximate deviance differences by a chi-square on 1 df
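The deviance comparison for a single variance component can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `lrt_variance_component` and the two log-likelihood values are hypothetical, and the log-likelihoods are assumed to come from two nested (RE)ML fits with identical fixed effects. The p-value uses the closed form of the chi-square tail on 1 df, so no statistics library is needed.

```python
import math

def lrt_variance_component(loglik_full, loglik_reduced):
    """Deviance-difference test for one variance component (chi-square, 1 df)."""
    # Deviance = -2 * log-likelihood; the reduced model omits the component.
    deviance_diff = (-2.0 * loglik_reduced) - (-2.0 * loglik_full)
    deviance_diff = max(deviance_diff, 0.0)  # guard against tiny negative rounding
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(deviance_diff / 2.0))
    return deviance_diff, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p = lrt_variance_component(loglik_full=-512.3, loglik_reduced=-515.1)
print(f"deviance difference = {stat:.2f}, p = {p:.4f}")
```

A small p-value suggests that the model with the variance component fits better than the model without it.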

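MM to flexibilize assumptions
Blocks may be considered random effects to model the correlation of plots within a block. This makes sense if the number of blocks is sufficient to estimate a variance.
$y_{ij} = \mu + \tau_i + b_j + \varepsilon_{ij}$, with i = genotype index and j = block index
o $b_j \sim N(0, \sigma_b^2)$ = random block effect
o $\varepsilon_{ij} \sim N(0, \sigma^2)$
o $b_j$ and $\varepsilon_{ij}$ are assumed to be independent, i.e. $\mathrm{cov}(b_j; \varepsilon_{ij}) = 0$
o for observations in the same block, covariance $\mathrm{cov}(y_{ij}; y_{i'j}) = \sigma_b^2$
o for observations in the same block, correlation $\mathrm{corr}(y_{ij}; y_{i'j}) = \sigma_b^2 / (\sigma_b^2 + \sigma^2)$
o for observations in different blocks, covariance $\mathrm{cov}(y_{ij}; y_{i'j'}) = 0$
o for observations in general, $\mathrm{var}(y_{ij}) = \sigma_b^2 + \sigma^2$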

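MM to flexibilize assumptions
Independent observations (residual variance $\sigma^2$):
$\Sigma = \sigma^2 I$
Observations in the same block correlated (compound symmetry), shown for one block of three plots with block variance $\sigma_b^2$:
$\Sigma = \begin{pmatrix} \sigma_b^2 + \sigma^2 & \sigma_b^2 & \sigma_b^2 \\ \sigma_b^2 & \sigma_b^2 + \sigma^2 & \sigma_b^2 \\ \sigma_b^2 & \sigma_b^2 & \sigma_b^2 + \sigma^2 \end{pmatrix}$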
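The covariance structure implied by random blocks can be written out explicitly. The sketch below assumes hypothetical variance values and a hypothetical helper name (`rcbd_covariance`); it builds the full matrix for a small RCBD and reads off the within-block correlation.

```python
def rcbd_covariance(n_genotypes, n_blocks, var_block, var_error):
    """Covariance matrix of plot observations when blocks are random effects."""
    # Observations are ordered block by block: (block 0, geno 0), (block 0, geno 1), ...
    n = n_genotypes * n_blocks
    sigma = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            same_block = (a // n_genotypes) == (b // n_genotypes)
            if a == b:
                sigma[a][b] = var_block + var_error   # variance of any observation
            elif same_block:
                sigma[a][b] = var_block               # covariance within a block
            # plots in different blocks keep covariance 0
    return sigma

# Hypothetical variance components, for illustration only.
sigma = rcbd_covariance(n_genotypes=3, n_blocks=2, var_block=2.0, var_error=6.0)
intra_block_corr = sigma[0][1] / sigma[0][0]
print(intra_block_corr)
```

The matrix is block diagonal: each block contributes one compound-symmetry sub-matrix, and plots in different blocks are uncorrelated.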

Number of treatments
HOW TO DEAL WITH A HIGH NUMBER OF TREATMENTS?
1. STRATIFICATION: Group genotypes with similar characteristics (maturity, color, family) and compare within groups. NO BETWEEN-GROUP COMPARISONS.
2. PRODUCE HOMOGENEOUS EXPERIMENTAL UNITS: Make every effort to homogenize the experimental area (look for soil similarity and field conditions that reduce variation, choose seeds of similar vigor).
3. USE REPEATED CHECKS: Checks can be used in a systematic way to control or model soil heterogeneity.
4. EXPERIMENTAL DESIGN WITH SPATIAL CONSIDERATIONS: Use experimental designs that accommodate a large number of treatments while controlling variability (e.g. alpha designs, unreplicated designs, etc.).

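Repeated checks in a RCBD
Randomized Complete Block Design: Mixed Models
Modeling variance components (test lines as random effects):
$y_{ijk} = \mu + B_i + C_j + T_{k(j)} + \varepsilon_{ijk}$, in matrix form $y = X\beta + Zu + e$
Estimating genotypic means (test lines as fixed effects):
$y_{ijk} = \mu + B_i + C_j + T_{k(j)} + \varepsilon_{ijk}$, in matrix form $y = X\beta + e$
Here $B_i$ is the block, $C_j$ the repeated check and $T_{k(j)}$ the test line.
[Figure: field layout showing the repeated check plots (C)]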

Advanced Designs: Alpha Designs
What are ALPHA-DESIGNS (Williams et al.)? Designs that allow the construction of incomplete blocks for a large number of treatments (t) with blocks of size k, so that t is a multiple of k. They include α(0,1)-lattice designs, IBD, row-column designs, etc.

BI 1   BI 2   BI 3   BI 4   BI 5   BI 6
1      4      7      1      2      3
2      5      8      4      5      6
3      6      9      7      8      9

Pairs of treatments that share incomplete blocks:
o 1 time (18 pairs): 1-2, 1-3, 2-3, 4-5, 4-6, 5-6, 7-8, 7-9, 8-9, 1-4, 1-7, 4-7, 2-5, 2-8, 5-8, 3-6, 3-9, 6-9
o 0 times (18 pairs): 1-5, 1-6, 1-8, 1-9, 2-4, 2-6, 2-7, 2-9, 3-4, 3-5, 3-7, 3-8, 4-8, 4-9, 5-7, 5-9, 6-7, 6-8
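The pair counts on this slide can be checked by enumerating, for every pair of treatments, how many incomplete blocks contain both. The sketch below is only an illustration; the block contents are read from the slide's layout for nine treatments in six blocks of size three, and the helper name `concurrences` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Incomplete blocks BI 1 to BI 6 (9 treatments, block size 3).
blocks = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 4, 7), (2, 5, 8), (3, 6, 9)]

def concurrences(blocks):
    """Count how many blocks each pair of treatments shares."""
    treatments = sorted({t for blk in blocks for t in blk})
    # Start every pair at zero so that pairs never sharing a block are reported too.
    counts = Counter({pair: 0 for pair in combinations(treatments, 2)})
    for blk in blocks:
        for pair in combinations(sorted(blk), 2):
            counts[pair] += 1
    return counts

counts = concurrences(blocks)
summary = Counter(counts.values())  # maps concurrence value -> number of pairs
print(dict(summary))
```

Every pair occurs together either once or never, which is the property that the α(0,1) label refers to.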

Advanced Designs
Two possible arrangements for an incomplete block design with r = 2, v = 9 and k = 3

Arrangement 1
         Replicate 1      Replicate 2
Block    1   2   3        1   2   3
         1   4   7        1   2   3
         2   5   8        4   5   6
         3   6   9        7   8   9

Arrangement 2
         Replicate 1      Replicate 2
Block    1   2   3        1   2   3
         1   4   7        1   5   4
         2   5   8        8   6   2
         3   6   9        3   9   7

Which is the best design?

Advanced Designs
DESIGN
Block 1: ABC; Block 2: ABD; Block 3: ACD; Block 4: BCD
Coincidence of treatments inside incomplete blocks = 2
COMPARISONS
o Direct, for A-B: Blocks 1 and 2: A - B
o Indirect, for A-B: (Block 1, A - C) - (Block 4, B - C) = A - B
o Block totals: Sum Block 3 - Sum Block 4 = (A + C + D) - (B + C + D) = A - B
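The three kinds of comparison can be traced numerically on noise-free data. The sketch below uses the four-block design from the slide; the treatment and block effect values are hypothetical and chosen only for illustration.

```python
# Hypothetical treatment and block effects (illustration only).
treatment = {"A": 5.0, "B": 3.0, "C": 1.0, "D": 0.0}
block_effect = {1: 10.0, 2: -4.0, 3: 2.0, 4: 7.0}
design = {1: "ABC", 2: "ABD", 3: "ACD", 4: "BCD"}

# Noise-free plot values: y = treatment effect + block effect.
y = {(b, t): treatment[t] + block_effect[b]
     for b, trts in design.items() for t in trts}

# Direct comparison: A and B occur together in blocks 1 and 2.
direct = [y[(b, "A")] - y[(b, "B")] for b in (1, 2)]

# Indirect comparison through C: (block 1, A - C) - (block 4, B - C).
indirect = (y[(1, "A")] - y[(1, "C")]) - (y[(4, "B")] - y[(4, "C")])

# Block totals: block 3 (ACD) minus block 4 (BCD).
block_total = {b: sum(y[(b, t)] for t in trts) for b, trts in design.items()}
inter_block = block_total[3] - block_total[4]

print(direct, indirect, inter_block)
```

The direct and indirect comparisons both return the true difference A - B = 2 because block effects cancel within blocks. The difference of block totals equals A - B plus three times the difference between the two block effects, so it only carries usable information about treatments once block effects are treated as random.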

Mixed Models in Advanced Designs
o Direct and indirect comparisons of treatment effects are combined in standard least squares (fixed effects) estimates = intra-block estimates.
o Information on treatment differences from block totals becomes available only when blocks are taken as random = inter-block estimates.
o The combination of intra- and inter-block estimates of treatment differences, weighting the pieces of information by their (inverse) variances, is done automatically in a REML analysis.

Incomplete Block Designs (IBD)
EXAMPLE. OIL CONTENT OF ADVANCED INBRED LINES.
Treatment: sunflower inbred lines (IL)
Experimental design: resolvable IBD with r = 3 and s = 5
Dependent variable: Y = L ha^-1

       Incomplete block
       1    2    3    4    5
R1     1    2    3    4    5
       6    7    8    9   10
      11   12   13   14   15
      16   17   18   19   20
R2     1    2    3    4    5
       7    8    9   10    6
      13   14   15   11   12
      19   20   16   17   18
R3     1    2    3    4    5
       8    9   10    6    7
      15   11   12   13   14
      17   18   19   20   16

TREATMENT ASSIGNMENT: Each treatment is assigned randomly to the experimental units in the first rep. In the following reps, the randomization is restricted so that each pair of treatments is compared the same number of times within an incomplete block.

Incomplete Block Designs (IBD)
$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{k(j)} + \varepsilon_{ijk}$
o $Y_{ijk}$ = response of the i-th treatment on the j-th rep and the k-th incomplete block
o $\mu$ = population mean
o $\alpha_i$ = effect of the i-th treatment
o $\beta_j$ = effect of the j-th rep
o $\gamma_{k(j)}$ = effect of the k-th incomplete block within the j-th rep
o $\varepsilon_{ijk}$ = experimental error (residual)
IBD with augmented checks:
Variance component estimation (random test lines):
$y_{ijkl} = \mu + B_i + S_{j(i)} + C_k + T_{l(k)} + \varepsilon_{ijkl}$
Genotypic means estimation (fixed test lines):
$y_{ijkl} = \mu + B_i + S_{j(i)} + C_k + T_{l(k)} + \varepsilon_{ijkl}$

Row-Column Design (RC)
Similar to incomplete blocks. Two sources of variation are controlled: rows and columns. This gives better control of field heterogeneity.
[Figure: field layout of treatments by row and column within each rep]
$Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_{k(j)} + \lambda_{l(j)} + \varepsilon_{ijkl}$
o $Y_{ijkl}$ = response of the i-th treatment on the j-th rep, the k-th row and the l-th column
o $\mu$ = population mean
o $\alpha_i$ = effect of the i-th treatment
o $\beta_j$ = effect of the j-th rep
o $\gamma_{k(j)}$ = effect of the k-th row within the j-th rep
o $\lambda_{l(j)}$ = effect of the l-th column within the j-th rep
o $\varepsilon_{ijkl}$ = experimental error (residual)

Federer's Unreplicated Design (UR)
MAIN MOTIVATION: In early generation testing there is not enough seed of each genotype to replicate the genotypes. But then:
1. How do we control the sources of variation?
2. How do we estimate experimental error?
[Figure: field layout with the checks T1, T2 and T3 repeated among unreplicated test lines]
3 repeated checks in a RCBD, 36 genotypes

Federer's Unreplicated Designs (UR)
EXAMPLE. BIOMASS YIELD OF 5 BARLEY F5
Treatments: 5 Barley F5
Experimental design: Federer's unreplicated design (checks augmented in RCBD)
Dependent variable: Y = kg ha^-1
TREATMENT ASSIGNMENT: An RCBD is used for the checks. A number of test genotypes is added to each block, with different genotypes in different blocks.
$Y_{ijk} = \mu + B_i + C_j + T_{k(j)} + \varepsilon_{ijk}$
o $Y_{ijk}$ = response
o $\mu$ = population mean
o $B_i$ = effect of the i-th rep
o $C_j$ = effect of the j-th repeated check
o $T_{k(j)}$ = effect of the k-th test line within the j-th check
o $\varepsilon_{ijk}$ = experimental error (residual)

Spatial Modeling
o In mixed models with random blocks, all plots within a block are equally correlated, but plots in different blocks are uncorrelated.
o It is more realistic that the correlation between plots decays with the distance between them.
o The VCOV for individual trials can be modeled as the product of a decaying correlation (for example AR1) in the row direction and another decaying correlation (for example AR1) in the column direction.
o Spatial modeling of the VCOV can be additional to blocks, or a substitute for blocks (but then be careful). Experimental design vs. post-blocking.
AR1 correlation matrix for four plots:
$\begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}$
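A separable AR1 x AR1 correlation structure is the Kronecker product of a row-direction and a column-direction AR1 matrix. The sketch below is a minimal pure-Python illustration; the helper names (`ar1`, `kronecker`), the field size and the two correlation values are assumptions made for the example, and plots are assumed to be ordered row by row.

```python
def ar1(n, rho):
    """AR1 correlation matrix: corr between positions i and j is rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

def kronecker(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]

# Hypothetical field of 4 rows x 3 columns; plot index = row * 3 + column.
row_corr = ar1(4, 0.5)   # correlation decay along rows
col_corr = ar1(3, 0.8)   # correlation decay along columns
field_corr = kronecker(row_corr, col_corr)

# Plot (row 0, col 0) vs plot (row 1, col 1): one step in each direction.
print(field_corr[0][4])
```

Under this structure the correlation between two plots is the product of the row-direction and column-direction correlations, so it decays with distance in both directions.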

Model Comparison
INFORMATION CRITERIA
Especially for non-nested models, information criteria (AIC, BIC) may provide an alternative to likelihood ratio tests.
$AIC = -2L + 2t$
$BIC = -2L + t \log n$
t = number of variance parameters
n = number of residual degrees of freedom = (n_observations - n_fixed_parameters)
o The best model has the smallest AIC/BIC.
o If using REML estimates, make sure the fixed effects are the same across models so that the comparisons are valid.
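The two criteria are simple functions of the log-likelihood and the number of variance parameters. In the sketch below the candidate models, their log-likelihoods, their parameter counts and the residual degrees of freedom are all hypothetical values chosen for illustration; they are not results from this lecture.

```python
import math

def aic(loglik, n_var_params):
    """AIC = -2L + 2t, with t the number of variance parameters."""
    return -2.0 * loglik + 2.0 * n_var_params

def bic(loglik, n_var_params, n_resid_df):
    """BIC = -2L + t * log(n), with n the residual degrees of freedom."""
    return -2.0 * loglik + n_var_params * math.log(n_resid_df)

# Hypothetical fits: name -> (REML log-likelihood, number of variance parameters).
candidates = {"CS": (-520.0, 2), "FA1": (-505.0, 10), "UN": (-502.0, 15)}
n_resid_df = 200  # hypothetical

aic_values = {m: aic(ll, t) for m, (ll, t) in candidates.items()}
bic_values = {m: bic(ll, t, n_resid_df) for m, (ll, t) in candidates.items()}
best_aic = min(aic_values, key=aic_values.get)
best_bic = min(bic_values, key=bic_values.get)
print(best_aic, best_bic)
```

With these made-up numbers the two criteria disagree, because BIC penalizes each extra parameter more heavily than AIC once log(n) exceeds 2.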

Why use Mixed Models?
o Greater flexibility in modeling the variance-covariance structure and the dependencies between observations.
o Accounting for heterogeneity of variance and correlation.
o For many situations, linear mixed models provide a more natural way of modeling than standard linear models.
o Recovery of (inter-block) information.
o Shrinkage prediction of effects (BLUPs), which is of importance in genetics.
o Modeling of dependencies that arise for spatial and temporal (blocking) and genetic reasons.
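Shrinkage can be illustrated in the simplest setting: a single balanced trial with unrelated genotypes and known variance components. Under those assumptions, which are made here for illustration and are not stated on the slide, the BLUP of a genotype effect is its deviation from the grand mean multiplied by a factor between 0 and 1. The function name and all numbers below are hypothetical.

```python
def blup_from_means(genotype_means, var_genotype, var_error, n_reps):
    """BLUPs of genotype effects for balanced data with known variance components."""
    grand_mean = sum(genotype_means.values()) / len(genotype_means)
    # Shrinkage factor: genetic variance over the variance of a genotype mean.
    shrink = var_genotype / (var_genotype + var_error / n_reps)
    return {g: shrink * (m - grand_mean) for g, m in genotype_means.items()}

# Hypothetical genotype means and variance components.
means = {"G1": 4.0, "G2": 5.0, "G3": 6.0}
blups = blup_from_means(means, var_genotype=1.0, var_error=3.0, n_reps=3)
print(blups)
```

Here the shrinkage factor is 0.5, so each predicted effect is half of the observed deviation. More replication or a larger genetic variance moves the factor toward 1.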

Genotype by Environment Interaction
[Figure: response (R) of Genotype 1 (G1) and Genotype 2 (G2) in ENV 1 (E1) and ENV 2 (E2)]

Genotype by Environment Interaction
[Figure: four panels showing the response (R) of G1 and G2 in E1 and E2. Panel titles: No GxE; GxE: divergence; GxE: convergence; GxE: cross-over]

Genotype by Environment Interaction
[Figure: two panels showing the response (R) of G1 and G2 in E1 and E2. Panel titles: GxE: convergence; GxE: divergence]
With both divergence and convergence, it is easy to make predictions because there is no cross-over interaction. G2 is the best genotype in all the environments. However, there is heterogeneity of variance that needs to be taken into account in the models for proper estimation.

Genotype by Environment Interaction
[Figure: two panels showing the response (R) of G1 and G2 in E1 and E2. Panel titles: GxE: cross-over; GxE: cross-over]
With cross-over interaction, predictions should be made by environment. No genotype is best in all the environments. Be careful also with heterogeneity of variance.

Genotype by Environment Interaction
MULTI-ENVIRONMENT TRIALS
Used to characterize a set of genotypes over varying conditions:
o Trials in different locations
o Trials with different practices (agronomy)
o Trials over multiple years
Information:
o Does a genotype perform well over all environments?
o If not, in which specific environments?
o Can a genotype profit from improvements of the environment?

Multiple environments
ONE-STAGE ANALYSIS
$P_{ijk} = \mu + G_i + E_j + D_{k(j)} + GE_{ij} + \varepsilon_{ijk}$
o Analyze field-plot data and model GxE simultaneously.
o Needs information from the experimental design and replications.
TWO-STAGE ANALYSIS
o First stage: analysis per trial
  - Quality control (assumptions, outliers, etc.)
  - Obtain predictions per genotype
o Second stage: use the genotype by environment table of predictions
  - GxE analysis
  - QTLxE analysis
[Figure: Trial 1, Trial 2, ..., Trial n feeding into the GxE table of means (predictions)]
$P_{ij} = \mu + G_i + E_j + GE_{ij}$

Multiple environments
o The analysis of MET data aims at finding an adequate model for the phenotypic responses as a function of genetic and environmental factors: modelling the mean.
o Reliable conclusions depend on an appropriate structure for the residual $\varepsilon_{ij}$.
o The assumption of independence of residuals between environments is highly unrealistic (in which case is this assumption valid?).
o A more realistic model assumes that the residuals come from some multivariate normal distribution.
o Finding an appropriate structure for $\varepsilon_{ij}$ that reflects the heterogeneity of genetic variances and correlations is a necessary first step towards reliable conclusions on $\mu_{ij}$.

Diagonal
[Figure: environments env1 to env5 under the diagonal model]

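Diagonal
$P_{ij} = \mu + E_j + \varepsilon_{ij}$
$\mathrm{VCOV}(\varepsilon_i) = \begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & 0 \\ 0 & 0 & 0 & \sigma_4^2 \end{pmatrix}$
$\mathrm{Corr}(Env_j; Env_{j^*}) = 0$
Each environment has its own (residual) genetic variance (which is confounded with the GxE variance). There is no genetic correlation between environments.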

Compound Symmetry
[Figure: plot illustrating the compound symmetry model]

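Compound Symmetry
$P_{ij} = \mu + E_j + \varepsilon_{ij}$
$\mathrm{VCOV}(\varepsilon_i) = \begin{pmatrix} \sigma_G^2 + \sigma_{GE}^2 & \sigma_G^2 & \sigma_G^2 & \sigma_G^2 \\ \sigma_G^2 & \sigma_G^2 + \sigma_{GE}^2 & \sigma_G^2 & \sigma_G^2 \\ \sigma_G^2 & \sigma_G^2 & \sigma_G^2 + \sigma_{GE}^2 & \sigma_G^2 \\ \sigma_G^2 & \sigma_G^2 & \sigma_G^2 & \sigma_G^2 + \sigma_{GE}^2 \end{pmatrix}$
$\mathrm{Corr}(Env_j; Env_{j^*}) = \dfrac{\sigma_G^2}{\sigma_G^2 + \sigma_{GE}^2}$
Each environment has the same (residual) genetic variance (which is confounded with the GxE variance), and the genetic correlation is also the same between all pairs of environments.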

Unstructured
[Figure: environments env1 to env5 under the unstructured model]

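Unstructured
$P_{ij} = \mu + E_j + \varepsilon_{ij}$
$\mathrm{VCOV}(\varepsilon_i) = \begin{pmatrix} \sigma_1^2 & & & \\ \sigma_{21} & \sigma_2^2 & & \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 & \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2 \end{pmatrix}$ (symmetric, lower triangle shown)
$\mathrm{Corr}(Env_j; Env_{j^*}) = \dfrac{\sigma_{jj^*}}{\sigma_j \sigma_{j^*}}$
Each environment has its own (residual) genetic variance (which is confounded with the GxE variance), and the genetic correlation can change between any pair of environments.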

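Factor Analytic
$\mathrm{VCOV}(\varepsilon_i) = \begin{pmatrix} \lambda_1^2 + \delta_1 & \lambda_1\lambda_2 & \lambda_1\lambda_3 & \lambda_1\lambda_4 \\ \lambda_2\lambda_1 & \lambda_2^2 + \delta_2 & \lambda_2\lambda_3 & \lambda_2\lambda_4 \\ \lambda_3\lambda_1 & \lambda_3\lambda_2 & \lambda_3^2 + \delta_3 & \lambda_3\lambda_4 \\ \lambda_4\lambda_1 & \lambda_4\lambda_2 & \lambda_4\lambda_3 & \lambda_4^2 + \delta_4 \end{pmatrix}$
$\mathrm{Corr}(Env_j; Env_{j^*}) = \dfrac{\lambda_j \lambda_{j^*}}{\sqrt{(\lambda_j^2 + \delta_j)(\lambda_{j^*}^2 + \delta_{j^*})}}$
Heterogeneity of variances and correlations is possible at the price of relatively few parameters.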
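A one-factor analytic VCOV is built from one loading and one specific variance per environment. The sketch below assumes hypothetical loadings and specific variances for four environments and hypothetical helper names (`fa1_vcov`, `cov_to_corr`); it also compares the number of parameters with an unstructured matrix of the same size.

```python
import math

def fa1_vcov(loadings, specific_vars):
    """One-factor analytic VCOV: loading products plus specific variances on the diagonal."""
    n = len(loadings)
    return [[loadings[j] * loadings[k] + (specific_vars[j] if j == k else 0.0)
             for k in range(n)] for j in range(n)]

def cov_to_corr(v):
    """Convert a covariance matrix into a correlation matrix."""
    sd = [math.sqrt(v[j][j]) for j in range(len(v))]
    return [[v[j][k] / (sd[j] * sd[k]) for k in range(len(v))] for j in range(len(v))]

# Hypothetical loadings and specific variances for four environments.
loadings = [1.0, 2.0, 0.5, 1.5]
specific_vars = [1.0, 0.5, 0.75, 1.0]

vcov = fa1_vcov(loadings, specific_vars)
corr = cov_to_corr(vcov)

n_env = len(loadings)
n_params_fa1 = 2 * n_env                          # one loading + one specific variance each
n_params_unstructured = n_env * (n_env + 1) // 2  # all variances and covariances
print(vcov[0][0], vcov[1][1], round(corr[0][1], 4), n_params_fa1, n_params_unstructured)
```

Variances and correlations both differ between environments, yet the parameter count grows linearly with the number of environments rather than quadratically.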

Finding a suitable model for the VCOV
o Use different summary statistics and diagnostic plots:
  - Summary statistics per environment
  - Correlations between environments
  - Boxplots
  - Scatter plots
  - Biplots
o Fit different mixed models assuming different VCOV structures and compare their goodness of fit by some criterion (e.g. AIC or BIC).

QTL x E
WHY DO I NEED TO INCLUDE QTLxE?
o GxE is common in plants, and multi-environment evaluation is required for plant breeding.
o Is modeling GxE to estimate means (BLUE) or predict BLUPs not enough?
o A QTLxE interaction may exist, so that some markers are favorable in one environment but not in another.
o Identifying general QTL and environment-specific QTL is helpful in selecting the best genotypes. Additionally, correct error terms should be used when QTL are being evaluated.

QTL x E
Phenotypic data configuration
[Figure: data layout for QTLxE analysis. Phenotypic data: GxE table of means (genotypes by environments/time). Environmental covariables. Genotypic covariables: genetic predictors derived from the marker scores and the marker map on a grid of genomic positions.]

QTL x E
Genomewide scan with QTLxE

QTL x E
STEPS TO COMPLETE A QTLxE ANALYSIS
1. Identify an appropriate model for GxE.
2. Use the appropriate model to test each genomic position for the presence of a QTL (MR/SIM).
3. Use candidate QTL as cofactors to re-scan the genome (CIM).
4. Fit a final multi-QTL model with backward elimination of candidate QTL. Estimate QTL effects.

Information needed
1. Molecular marker scores: high-throughput panels, controlled conditions, repeatable, cheap, automatic scoring.
2. Genetic map: more standard methods, small population sizes, consensus maps? Needs some more development.
3. Phenotypes: the crucial part; poor phenotypes mean poor QTL mapping.

Phenotyping
1. Field-plot technique
   - Good techniques
   - Control experimental error
2. Experimental design
   - Randomized complete block design
   - Alpha designs (RIBD, R-CD, etc.)
   - Augmented designs
3. Analysis
   - Post-hoc spatial corrections
   - Other modeling
[Figures: Diseases; Plant Height; Flag Leaf Length]

Maize example (CIMMYT, MX)
STRESS TRIALS
o 1992 (Tlaltizapán, México)
  - Well watered (WW)
  - Intermediate stress (IS)
  - Severe stress (SS)
o 1994 (Tlaltizapán, México)
  - Intermediate stress (IS)
  - Severe stress (SS)
o 1996 (Poza Rica, México)
  - Low Nitrogen (2 seasons)
  - High Nitrogen
[Figure: map showing Tlaltizapán and Poza Rica]
(Malosetti, 2011)

Genotypic performance
o Low correlation: GxE
o Mean performances are different
  - Good environments (NS92a = no stress)
  - Bad environments (LN96a, LN96b = low N)
o Variability is also different
  - Higher: NS92a
  - Lower: LN96b
GxE? Possibly yes, but we can't really see it here...

Genotypic performance
3 groups of environments
Best model for this data: Factor analytic (Malosetti, 2011)

Model AIC SIC Deviance NParameters
FA 17471 1754 17439 16
FA 17455 1753 1749 3
OUTSIDE 1753 17554 1755 9
UNSTRUCTURED 17456 17577 17384 36
HCS 1769 177 17674 9
CS 17918 1794 17914
DIAGONAL 1796 17933 1789 8
IDENTITY 1887 189 1885 1

Best model: FA (on the basis of criterion SIC)

Marker information
$P_{ij} = \mu_j + x_i \alpha_j + \varepsilon_{ij}$
o We enrich the original model by including markers: we include genetic predictors.
o Additive effects are environment-specific!
o Partition G and GxE into:
  - the part explained by markers (= QTLs)
  - the part NOT explained (residual G* and GxE* = $\varepsilon$)
o An appropriate model is needed for the residual $\varepsilon$ (variance-covariance model).
(Malosetti, 2011)

QTLxE (SIM; VCOV=FA)
Profile for environment-specific QTLs
o Positive effect of the P1 allele (dark/light blue)
o Positive effect of the P2 allele (red/yellow)
[Figure: genome-wide profile of -log10(p), with a color code for the p-values of positive and negative QTL effects]
(Malosetti, 2011)

SIM, CIM 1, CIM 2
[Figure: genome scan profiles for SIM, CIM 1 and CIM 2]
o After SIM, some extra QTLs are picked up by CIM
o No major change after the second round of CIM, so stop
o Six candidate QTLs
(Malosetti, 2011)

Final Model
Summary
Trait: yld
Population type: F
Number of genotypes: 11
Number of environments: 8
Number of linkage groups: 1
Number of markers: 1
Variance-covariance model: FA

List of QTLs
Locus no. | Locus name | Linkage group | Position | -log10(p) | QTLxE
19 L85 1 141. 13.76 yes
4 CP36 35.9 4.665 yes
73 L35 3 55.7 4.661 yes
11 L71 4 136.6 3.57 no
159 L43 6 15. 3.71 yes
37 C1P6 1 6.15 8.313 yes

o All 6 candidate QTLs are retained in the final model
o But note that the QTL on linkage group 4 is a main effect QTL (QTLxE term dropped)
(Malosetti, 2011)

QTL
Location: linkage group 1, position 141
Environment | Effect | S.e. | P | %Expl. var. | CI_LL | CI_UL
HN96b -37.466 13.35 .4 3.1 . 66.
IS92a 55.33 13.44 . 7. . 66.
IS94a 56.19 14.6 . 7. . 66.
LN96a .117 6.66 .986 . * *
LN96b 1.577 5.915 .79 . * *
NS92a 63.76 18.95 .1 5. . 66.
SS92a 7.19 1.193 . 15.1 14.55 157.45
SS94a 7.543 14.678 .61 1.7 * *
o Effects change from environment to environment
o The sign of the effect also changes: cross-over interaction
o Which allele to select?

Location: linkage group 4, position 136.6
Environment | Effect | S.e. | P | %Expl. var. | CI_LL | CI_UL
HN96b -16.369 4.55 . .6 . 167.5
IS92a -16.369 4.55 . .6 . 167.5
IS94a -16.369 4.55 . .6 . 167.5
LN96a -16.369 4.55 . 3.1 . 167.5
LN96b -16.369 4.55 . 3.4 . 167.5
NS92a -16.369 4.55 . .3 . 167.5
SS92a -16.369 4.55 . .8 . 167.5
SS94a -16.369 4.55 . .6 . 167.5
o What is the difference with the previous one?
o Which allele to select?
(Malosetti, 2011)