A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations


A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations. Joint work with Karim Oualkacha (UQÀM), Yi Yang (McGill), Celia Greenwood (McGill). JSM, Vancouver 2018. sahirbhatnagar.com/ggmix 1 / 28

Motivation: Genetic Analysis Workshop (GAW20, March 4-7, 2017, San Diego, US). GOLDEN project: Genetics of Lipid Lowering Drugs and Diet Network Study. 2 / 28

Motivation: Our contribution in GAW20. Contribution of the Causal modelling group. 3 / 28

Motivation: Our contribution in GAW20 consisted of investigating the causal relationship between DNA methylation (the exposure) within some genes and HDL (the outcome). DNA methylation in these genes has previously been shown to be associated with HDL. We used Mendelian randomization to explore the causal relationship, with SNPs around the analyzed genes serving as instrumental variables (IVs). 4 / 28

Challenges in the GAW20 Data Sets: The GAW20 SNP data were high-dimensional, so regularization was needed to select the SNPs strongly associated with the exposure; penalized least-squares regression can be used (lasso, SCAD, MCP, or elastic net). But the data consist of families! In GAW20, all regularized methods either did not control for the family structure or used a two-step adjustment for it (including our group). 5 / 28

Challenges in the GAW20 Data Sets: The two-step adjustment works as follows. Step 1 uses an LMM to adjust for the relationships between subjects. Step 2 uses the residuals from Step 1 in variable-selection least-squares regression methods to select SNPs. The two-step procedure is a valid approach, but in association testing (GRAMMAR) it is known to suffer from a large loss of power;¹ a sketch of this baseline follows. ¹ Oualkacha et al., Genet. Epidemiol. (2013) 6 / 28
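
For concreteness, here is a minimal Python sketch of a two-step baseline of this kind: Step 1 estimates the heritability under the null LMM, and Step 2 runs a lasso on decorrelated (whitened) data, which is one common variant of regressing the Step 1 residuals. Function names and the use of scikit-learn's Lasso are illustrative, not taken from any GAW20 contribution:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

def two_step(y, X, Phi, alpha=0.1):
    d, U = np.linalg.eigh(Phi)                # spectral decomposition of Phi
    y_t = U.T @ y
    # Step 1: estimate heritability eta under the null model Y ~ N(0, V),
    # with the total variance sigma^2 profiled out
    def null_nll(eta):
        w = eta * d + (1.0 - eta)
        s2 = np.mean(y_t ** 2 / w)
        return 0.5 * (len(y) * np.log(s2) + np.sum(np.log(w)))
    eta = minimize_scalar(null_nll, bounds=(1e-6, 1 - 1e-6), method="bounded").x
    w = eta * d + (1.0 - eta)
    # Step 2: lasso on the decorrelated (whitened) response and SNPs
    y_dec = y_t / np.sqrt(w)
    X_dec = (U.T @ X) / np.sqrt(w)[:, None]
    return Lasso(alpha=alpha).fit(X_dec, y_dec).coef_
```

Because the heritability is estimated once, under the null, and then held fixed, any signal absorbed by the random effect in Step 1 is unavailable to the selection step, which is the source of the power loss noted above.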

Proposal. Aim: we believe that performing variable selection while simultaneously controlling for familial and/or hidden relatedness in high-dimensional settings is likely to be of great interest to the genetics community. Proposal: we propose ggmix, a two-in-one procedure that controls for structured populations and performs variable selection in linear mixed models. 7 / 28

Data and Model
Phenotype: $Y = (y_1, \ldots, y_n) \in \mathbb{R}^n$
SNPs: $X = (X_1; \ldots; X_n)^T \in \mathbb{R}^{n \times p}$, where $p \gg n$
Twice the kinship matrix, or the realized relationship matrix (RRM): $\Phi \in \mathbb{R}^{n \times n}$
Regression coefficients: $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$
Polygenic random effect: $P = (P_1, \ldots, P_n) \in \mathbb{R}^n$
Error: $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \mathbb{R}^n$
We consider the following LMM with a single random effect:
$$Y = X\beta + P + \varepsilon, \qquad P \sim \mathcal{N}(0, \eta\sigma^2\Phi), \qquad \varepsilon \sim \mathcal{N}(0, (1-\eta)\sigma^2 I)$$
where $\sigma^2$ is the total phenotypic variance and $\eta \in [0, 1]$ is the (narrow-sense) heritability of the phenotype. Marginally,
$$Y \mid (\beta, \eta, \sigma^2) \sim \mathcal{N}\left(X\beta,\; \eta\sigma^2\Phi + (1-\eta)\sigma^2 I\right)$$
8 / 28
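
To make the generative model concrete, here is a minimal Python sketch of drawing one phenotype from the marginal distribution above via a Cholesky factor of the covariance; the function name and arguments are illustrative:

```python
import numpy as np

def simulate_phenotype(X, beta, Phi, eta, sigma2, seed=0):
    # Y = X beta + P + eps, drawn via a Cholesky factor of the
    # marginal covariance eta*sigma2*Phi + (1 - eta)*sigma2*I
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    V = eta * sigma2 * Phi + (1.0 - eta) * sigma2 * np.eye(n)
    L = np.linalg.cholesky(V)                 # V = L L'
    return X @ beta + L @ rng.standard_normal(n)
```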

Likelihood
The negative log-likelihood is given by
$$\ell(\Theta) \propto \frac{n}{2}\log(\sigma^2) + \frac{1}{2}\log\det(V) + \frac{1}{2\sigma^2}(Y - X\beta)^T V^{-1}(Y - X\beta), \qquad V = \eta\Phi + (1-\eta)I$$
Assume the spectral decomposition of $\Phi$:
$$\Phi = UDU^T$$
where $U$ is an $n \times n$ orthogonal matrix and $D$ is an $n \times n$ diagonal matrix. One can then write
$$V = U(\eta D + (1-\eta)I)U^T = UWU^T, \qquad W = \operatorname{diag}(w_i)_{i=1}^n, \quad w_i = \eta d_{ii} + (1-\eta)$$
9 / 28

Likelihood
Projecting $Y$ (and the columns of $X$) onto $\operatorname{Span}(U)$ leads to a simplified correlation structure for the transformed data $\tilde{Y} = U^T Y$:
$$\tilde{Y} \mid (\beta, \eta, \sigma^2) \sim \mathcal{N}(\tilde{X}\beta, \sigma^2 W), \qquad \text{with } \tilde{X} = U^T X$$
The negative log-likelihood can then be expressed as
$$\ell(\Theta) \propto \frac{n}{2}\log(\sigma^2) + \frac{1}{2}\sum_{i=1}^n \log(w_i) + \frac{1}{2\sigma^2}\left(\tilde{Y} - \tilde{X}\beta\right)^T W^{-1}\left(\tilde{Y} - \tilde{X}\beta\right)$$
For fixed $\sigma^2$ and $\eta$, solving for $\beta$ is a weighted least squares problem. 10 / 28
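
A minimal Python sketch of the rotation and of the weighted least squares solve for $\beta$ at fixed $(\eta, \sigma^2)$; names are illustrative, and the dense solve assumes $p \le n$ (the penalized estimator on the next slides is what handles $p \gg n$):

```python
import numpy as np

def rotate(y, X, Phi):
    # Phi = U diag(d) U'; returns the transformed data and eigenvalues
    d, U = np.linalg.eigh(Phi)
    return U.T @ y, U.T @ X, d

def beta_wls(y_t, X_t, d, eta):
    # weighted least squares for beta at fixed eta (assumes p <= n here)
    w = eta * d + (1.0 - eta)                 # diagonal of W
    Xw = X_t / w[:, None]                     # W^{-1} X_t
    # solve (X_t' W^{-1} X_t) beta = X_t' W^{-1} y_t
    return np.linalg.solve(X_t.T @ Xw, Xw.T @ y_t)
```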

Penalized Maximum Likelihood Estimator
Define the objective function:
$$Q_\lambda(\Theta) = \ell(\Theta) + \lambda \sum_j p_j(\beta_j)$$
where $p_j(\cdot)$ is a penalty term on $\beta_1, \ldots, \beta_p$. An estimate $\hat{\Theta}_\lambda$ of the model parameters is obtained by
$$\hat{\Theta}_\lambda = \arg\min_{\Theta} Q_\lambda(\Theta)$$
11 / 28

Block Relaxation (De Leeuw, 1994)
To solve the optimization problem we use a block relaxation technique.
Algorithm 1: Block Relaxation Algorithm
Set $k \leftarrow 0$, initial values $\Theta^{(0)}$ for the parameter vector, and a tolerance $\epsilon$;
for $\lambda \in \{\lambda_{\max}, \ldots, \lambda_{\min}\}$ do
  repeat
    for $j = 1, \ldots, p$: $\beta_j^{(k+1)} \leftarrow \arg\min_{\beta_j} Q_\lambda\left(\beta_j^{(k)}, \eta^{(k)}, \sigma^{2\,(k)}\right)$
    $\eta^{(k+1)} \leftarrow \arg\min_{\eta} Q_\lambda\left(\beta^{(k+1)}, \eta, \sigma^{2\,(k)}\right)$
    $\sigma^{2\,(k+1)} \leftarrow \arg\min_{\sigma^2} Q_\lambda\left(\beta^{(k+1)}, \eta^{(k+1)}, \sigma^2\right)$
    $k \leftarrow k + 1$
  until the convergence criterion is satisfied: $\left\|\Theta^{(k+1)} - \Theta^{(k)}\right\|_2 < \epsilon$;
end
12 / 28
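
A minimal Python sketch of one run of Algorithm 1 for a single $\lambda$, assuming a lasso penalty on $\beta$ and the rotated data from the previous slides: the coordinate updates for $\beta$ use soft-thresholding, $\eta$ is updated by a bounded one-dimensional search, and $\sigma^2$ has a closed form. This is an illustrative reimplementation, not the ggmix source:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def block_relaxation(y_t, X_t, d, lam, eta=0.5, sigma2=1.0,
                     tol=1e-6, max_iter=500):
    """One run of Algorithm 1 for a single lambda, with a lasso penalty.
    y_t = U'Y, X_t = U'X, d = eigenvalues of Phi (see previous slides)."""
    n, p = X_t.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old, eta_old, s2_old = beta.copy(), eta, sigma2
        w = eta * d + (1.0 - eta)             # diagonal of W
        r = y_t - X_t @ beta                  # full residual
        # beta block: one pass of coordinate descent on the weighted lasso
        for j in range(p):
            r += X_t[:, j] * beta[j]          # partial residual without j
            zj = np.sum(X_t[:, j] * r / w)
            vj = np.sum(X_t[:, j] ** 2 / w)
            beta[j] = soft_threshold(zj, lam * sigma2) / vj
            r -= X_t[:, j] * beta[j]
        # eta block: bounded 1-D minimization of the profiled objective
        def nll_eta(e, r=r, sigma2=sigma2):
            we = e * d + (1.0 - e)
            return 0.5 * np.sum(np.log(we)) + np.sum(r ** 2 / we) / (2 * sigma2)
        eta = minimize_scalar(nll_eta, bounds=(1e-6, 1 - 1e-6),
                              method="bounded").x
        # sigma^2 block: closed-form weighted residual variance
        w = eta * d + (1.0 - eta)
        sigma2 = np.sum(r ** 2 / w) / n
        step = np.sqrt(np.sum((beta - beta_old) ** 2)
                       + (eta - eta_old) ** 2 + (sigma2 - s2_old) ** 2)
        if step < tol:
            break
    return beta, eta, sigma2
```

For the full solution path one would warm-start this loop over a decreasing grid of $\lambda$ values, as in the outer loop of Algorithm 1.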

Coordinate Gradient Descent Method
We take advantage of the smoothness of $\ell(\Theta)$: we approximate $Q_\lambda(\Theta)$ by a strictly convex quadratic function (using the gradient), use CGD to calculate a descent direction, and then employ a line search to guarantee the descent property for the objective function.
Theorem [Convergence]¹: If $\{\Theta^{(k)}, k = 0, 1, 2, \ldots\}$ is a sequence of iterates generated by the iteration map of Algorithm 1, then each cluster point (i.e., limit point) of $\{\Theta^{(k)}, k = 0, 1, 2, \ldots\}$ is a stationary point of $Q_\lambda(\Theta)$.
¹ Tseng P. & Yun S., Math. Program., Ser. B (2009) 13 / 28

Choice of the tuning parameter
We use the BIC:
$$\text{BIC}_\lambda = 2\,\ell(\hat\beta, \hat\sigma^2, \hat\eta) + c \cdot \widehat{df}_\lambda$$
where $\widehat{df}_\lambda$ is the number of non-zero elements in $\hat\beta_\lambda$ plus two.¹ Several authors² have used this criterion for variable selection in mixed models with $c = \log n$; others³ have proposed $c = \log(\log(n)) \log(n)$.
¹ Zou et al., The Annals of Statistics (2007)
² Bondell et al., Biometrics (2010)
³ Wang et al., JRSS (Ser. B) (2009)
14 / 28
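
A minimal sketch of BIC-based selection of $\lambda$, assuming the `block_relaxation` function from the previous sketch and $c = \log n$; the exact degrees-of-freedom bookkeeping in ggmix may differ:

```python
import numpy as np

def neg_loglik(y_t, X_t, d, beta, eta, sigma2):
    # negative log-likelihood from slide 10 (up to an additive constant)
    n = len(y_t)
    w = eta * d + (1.0 - eta)
    r = y_t - X_t @ beta
    return (0.5 * n * np.log(sigma2) + 0.5 * np.sum(np.log(w))
            + np.sum(r ** 2 / w) / (2.0 * sigma2))

def select_lambda(y_t, X_t, d, lambdas):
    n = len(y_t)
    best = None
    for lam in lambdas:
        beta, eta, sigma2 = block_relaxation(y_t, X_t, d, lam)
        df = np.count_nonzero(beta) + 2       # active betas plus (eta, sigma^2)
        bic = 2.0 * neg_loglik(y_t, X_t, d, beta, eta, sigma2) + np.log(n) * df
        if best is None or bic < best[0]:
            best = (bic, lam, beta, eta, sigma2)
    return best                               # (BIC, lambda, beta, eta, sigma2)
```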

Simulation study
We simulate genotypes from the BN-PSD admixture model.
$a$: percentage of causal SNPs
$X^{(\text{test})}$: $n \times 5000$ matrix of SNPs randomly sampled across the genome
$X^{(\text{causal})}$: $n \times (a \cdot 5000)$ matrix of SNPs that are truly associated with the simulated phenotype, $X^{(\text{causal})} \subseteq X^{(\text{test})}$
$\beta_j$: effect size of the $j$-th SNP, simulated from a $\mathrm{Uniform}(0.3, 0.7)$ for $j = 1, \ldots, (a \cdot 5000)$
$$Y \mid (\beta, \eta, \sigma^2) \sim \mathcal{N}\left(X^{(\text{causal})}\beta,\; \eta\sigma^2\Phi + (1-\eta)\sigma^2 I\right)$$
15 / 28

RRM/Kinship matrix construction
$X^{(\text{other})}$: $n \times 10{,}000$ matrix of simulated SNPs
$X^{(\text{kinship})}$: matrix of SNPs used to construct the RRM/kinship matrix
Scenario 1: $X^{(\text{kinship})} = X^{(\text{other})}$ (no overlap with the causal SNPs)
Scenario 2: $X^{(\text{kinship})} = [X^{(\text{other})}, X^{(\text{causal})}]$ (100% overlap)
In each scenario we considered $a = 0, 0.01$, $\eta = 0.1, 0.5$, and $\sigma^2 = 1$. 16 / 28
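
A minimal sketch of one standard way to build the RRM from a genotype matrix, $\Phi = ZZ^T/p$ with column-standardized genotypes $Z$; this is a common construction, not necessarily the exact one used in these simulations:

```python
import numpy as np

def make_rrm(X_kinship):
    # X_kinship: n x p matrix of 0/1/2 genotype counts (polymorphic SNPs,
    # so allele frequencies are strictly between 0 and 1)
    freq = X_kinship.mean(axis=0) / 2.0                  # allele frequencies
    Z = (X_kinship - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    return Z @ Z.T / X_kinship.shape[1]                  # n x n RRM
```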

Empirical Kinship Matrix
[Figure: (a) empirical kinship matrix with block structure, individuals by individuals; (b) scatterplot of the 1st vs. 2nd principal components] 17 / 28

Correct Sparsity for the Null Model
[Figure: correct sparsity results for the null model, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with no causal SNPs in the kinship matrix] 18 / 28

Correct Sparsity for the Model with 1% Causal SNPs
[Figure: correct sparsity results for the model with 1% causal SNPs, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with all or none of the causal SNPs in the kinship matrix] 19 / 28

True Positive vs. False Positive Rate
[Figure: true positive rate vs. false positive rate (mean ± 1 SD) for the model with 1% causal SNPs, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with all or none of the causal SNPs in the kinship matrix] 20 / 28

Heritability Estimates for the Null Model
[Figure: estimated heritability η̂ for the null model, based on 200 simulations; twostep and ggmix compared at 10% and 50% heritability, with no causal SNPs in the kinship matrix; the horizontal dashed line is the true value] 21 / 28

Heritability Estimates for the Model with 1% Causal SNPs
[Figure: estimated heritability η̂ for the model with 1% causal SNPs, based on 200 simulations; twostep and ggmix compared at 10% and 50% heritability, with all or none of the causal SNPs in the kinship matrix; the horizontal dashed line is the true value] 22 / 28

Error Variance for the Null Model
[Figure: estimated error variance for the null model, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with no causal SNPs in the kinship matrix; the horizontal dashed line is the true value] 23 / 28

Error Variance for the Model with 1% Causal SNPs
[Figure: estimated error variance for the model with 1% causal SNPs, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with all or none of the causal SNPs in the kinship matrix; the horizontal dashed line is the true value] 24 / 28

Model Error vs. Number Active for the Model with 1% Causal SNPs
[Figure: model error vs. number of active variables (mean ± 1 SD) for the model with 1% causal SNPs, based on 200 simulations; twostep, lasso, and ggmix compared at 10% and 50% heritability, with all or none of the causal SNPs in the kinship matrix] 25 / 28

Discussion/Future work
In some situations, prior information about a group structure among the predictors (e.g., SNPs) is available; the theoretical development of the group lasso in LMMs is already done.
In situations where the RRM is of low rank (i.e., $n \gg$ the number of SNPs used to construct the RRM; e.g., UK Biobank), the computational cost of fitting ggmix can be reduced by using the SVD of $X^{(\text{kinship})}$, both to construct $\Phi$ and to transform the data; a sketch follows. The theoretical development of this low-rank trick is already done. 26 / 28
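
A minimal sketch of the low-rank idea, assuming $\Phi = ZZ^T/k$ with $\operatorname{rank}(Z) = k \ll n$: a thin SVD of $Z$ yields the non-zero eigenpairs of $\Phi$ directly, so the full $n \times n$ eigendecomposition is never formed. Names are illustrative:

```python
import numpy as np

def lowrank_rotation(Z):
    # Z: n x k standardized kinship genotypes with k << n; Phi = Z Z' / k.
    # The thin SVD gives the non-zero eigenpairs of Phi in O(n k^2),
    # avoiding the full n x n eigendecomposition of Phi itself.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    d = s ** 2 / Z.shape[1]
    return U, d                               # rotate with U.T @ y, U.T @ X
```

The remaining $n - k$ eigenvalues of $\Phi$ are zero, so the corresponding weights reduce to $1 - \eta$; a full implementation would handle that orthogonal complement explicitly.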

Discussion/Future work
Capturing the relationships between subjects through a random effect requires variance-components estimation, and random-effect modelling leads to a non-convex optimization problem. Fixed-effects models are good alternatives to random-effects models for the analysis of longitudinal/panel data.¹ Capturing familial structure using a penalized fixed-effects model could be an interesting avenue to explore.
¹ Roger Koenker, JMA (2004) 27 / 28

References
1. Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M. Kadie, Robert I. Davidson, and David Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8(10):833-835, 2011.
2. Matti Pirinen, Peter Donnelly, and Chris C.A. Spencer. Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies. The Annals of Applied Statistics, 7(1):369-390, 2013.
3. Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387-423, 2009.
4. Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.
sahirbhatnagar.com/ggmix 28 / 28
