Multivariate Approaches: Joint Modeling of Imaging and Genetic Data

1 Multivariate Approaches: Joint Modeling of Imaging and Genetic Data Bertrand Thirion, INRIA Saclay-Île-de-France, Parietal team CEA, DSV, I2BM, Neurospin

2 Outline
Introduction: multivariate methods
Neuroimaging genetics
Limitations of mass univariate models
Penalized multiple regression
Multivariate multiple regression
Examples on simulated and real data

3 Introduction: Multivariate methods
Two main families: supervised and unsupervised.
Unsupervised: clustering, PCA, ICA. Generic form: X = AD + E, where only X is known. The aim is to model or understand the data through its distribution p(x), without trying to fit any target.
Supervised: regularized regression, kernel machines, discriminant analysis, PLS. Generic form: Y = XB + E, where Y and X are known. The aim is to fit the target data Y given X.

4 Introduction: Multivariate methods
Supervised methods vs. unsupervised methods: using unsupervised techniques to solve supervised problems is simply inefficient.

5 Neuroimaging Genetics: problem statement
Subjects: s = 1, ..., S.
Brain image data: q response variables, Y = {y1, ..., yq}.
Genetic data: p predictor variables, X = {x1, ..., xp}.

6 Genetic-Neuroimaging studies
Small q, small p: [Joyner et al. 2009] q = 4 brain size measures; p = 11 SNPs.
Small q, large p: [Potkin et al. 2009] q = 1 mean BOLD signal; p = 317,503 SNPs.
Large q, small p: [Filippini et al. 2009] q = 29,812 voxels; p = 1 SNP.
Large q, large p: [Stein et al. 2010] q = 31,622 voxels; p = 448,293 SNPs.

7 Mass Univariate Linear Models (MULM)
A commonly used approach is to model one genotype and one phenotype at a time:
1. Fit all univariate linear regression models $y_j = \beta_{jk} x_k + \varepsilon$, $j = 1, \dots, q$, $k = 1, \dots, p$. Two genotype codings are possible: allelic dosage or a categorical model.
2. Search for a subset of $p'$ significant genotypes with indices $\{k_1, k_2, \dots, k_{p'}\} \subset \{1, \dots, p\}$, with $p' \ll p$, by testing all $p \times q$ null hypotheses of no association, $H_0: \beta_{jk} = 0$.
3. Correct for multiple testing by controlling the experiment-wise FWER or FDR.
Possible dependence patterns among genotypes and phenotypes are ignored at the modeling stage.
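The three MULM steps can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the sizes, the planted association and the normal approximation used for the p-values are assumptions of the sketch, not part of the original slides.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 50, 5                                 # hypothetical sizes
X = rng.integers(0, 3, size=(n, p)).astype(float)    # allelic dosage coded 0/1/2
Y = rng.standard_normal((n, q))
Y[:, 0] += 0.8 * X[:, 3]                             # planted effect: SNP 3 -> phenotype 0

# Step 1: all p x q univariate regressions at once.
# With standardized columns, each slope is a Pearson correlation.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
R = Xs.T @ Ys / n                                    # (p, q) correlations
T = R * np.sqrt((n - 2) / (1.0 - R ** 2))            # t statistic of each slope

# Step 2: two-sided p-values (normal approximation to the t distribution).
pvals = np.vectorize(lambda t: math.erfc(abs(t) / math.sqrt(2.0)))(T)

# Step 3: Bonferroni correction controls the FWER over the p x q tests.
alpha = 0.05
significant = np.argwhere(pvals < alpha / (p * q))   # rows are (SNP, phenotype) pairs
print(significant)
```

Each test is run in isolation, so correlations between SNPs or between phenotypes play no role in the fit, which is the limitation the slide points out.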

8 Problems with MULM
If we test each (voxel, SNP) pair, there are up to $10^{12}$ pairs.
Power issue (peak statistic): to detect an effect with a power of 80% based on n = 1000 subjects, the standardized effect needs to be greater than 0.26 (0.19 with n = 2000 subjects).
Reproducibility issue: directly related to the low power. False negatives hamper the reproducibility of the analysis; this is known to be one of the major issues in GWAS.
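The order of magnitude of these effect sizes can be checked with a short calculation. The sketch below assumes a two-sided Bonferroni threshold at level 0.05 over 10^12 tests and a test statistic that is approximately normal with mean sqrt(n) times the standardized effect; the slide does not state which correction was used, so these are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_effect(n, n_tests, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the requested power.

    Assumes a z-like statistic with mean sqrt(n) * effect and a two-sided
    Bonferroni-corrected threshold (assumptions of this sketch).
    """
    z = NormalDist()
    z_alpha = -z.inv_cdf(alpha / (2 * n_tests))   # corrected significance threshold
    z_power = z.inv_cdf(power)                    # extra margin needed for the power
    return (z_alpha + z_power) / sqrt(n)

eff_1000 = min_detectable_effect(1000, 1e12)
eff_2000 = min_detectable_effect(2000, 1e12)
print(round(eff_1000, 2), round(eff_2000, 2))
```

Under these assumptions the values land close to the 0.26 and 0.19 quoted on the slide, and they shrink only as 1/sqrt(n), which is why collecting more subjects is a slow way out of the power problem.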

9 Multivariate predictive modelling
Why model multiple genotypes and multiple phenotypes?
A weak effect may be more apparent when other causal effects are already accounted for.
A false signal may be weakened by inclusion in the model of a stronger signal from a true causal association.
A weak effect may be more apparent if multiple phenotypes are affected.
Basic strategy: build a linear regression model that includes all genotypes (predictors) and all phenotypes (responses), and then perform variable selection.
The models covered here are:
Penalized multiple linear regression (any p and q = 1)
Penalized sparse canonical correlation analysis (any p and q)
Penalized reduced-rank regression (any p and q)

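10 Multiple genotypes (p > 1) and one phenotype (q = 1): multiple regression
The multiple linear regression model with univariate response: $y = X\beta + \varepsilon$, where $y$ is the $n \times 1$ phenotype vector and $X$ the $n \times p$ genotype matrix.
Fit the multiple linear regression model by solving $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2$.
Equivalently, minimize the error function (loss) $L(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2$.
When n > p, the OLS solution is given by $\hat{\beta} = (X^T X)^{-1} X^T y$.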
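As a quick check of the OLS formula, the sketch below compares the normal-equation solution with NumPy's least-squares solver on synthetic data. The sizes and the true coefficient vector are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                                    # n > p, so X'X is invertible here
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Normal equations: solve (X'X) beta = X'y rather than forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from a generic least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_normal, 2))
```

When p exceeds n, as in most genetic studies, X'X is singular and the first route fails, which motivates the penalized estimators that follow.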

11 Penalized multivariate regression for genotype selection
One-step approach: fit the multiple linear regression model while finding a subset of $p'$ important predictors with indices $\{k_1, k_2, \dots, k_{p'}\} \subset \{1, 2, \dots, p\}$, with $p' \ll p$, all having non-zero regression coefficients.
This is achieved by fitting a penalized regression model: $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \psi(\beta)$.
The penalization $\psi(\beta)$ imposes a constraint on $\beta$. We use convex penalties so that any local optimum is a global one.
The coupling parameter $\lambda$ controls the trade-off between the OLS (unpenalized) solution and the penalized solution.

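12 Ridge regression
The problem can be rewritten as $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$, or more compactly $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$.
$\lambda$ controls the amount of shrinkage.
Some properties are:
Closed-form solution: $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$
Useful when the data matrix X is rank-deficient and $X^T X$ is not invertible
Bias-variance trade-off: better predictions
Grouping effect: correlated variables get similar coefficients
No variable selection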
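Two of the listed properties, shrinkage and the grouping effect, are easy to see numerically. The sketch below uses the standard ridge closed form on synthetic data with p > n and one duplicated column; the data and the `ridge` helper name are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X'X + lam I)^{-1} X'y; well defined for any lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 20, 30                          # p > n: X'X is singular, OLS is not unique
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0]                      # two perfectly correlated predictors
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

b_small = ridge(X, y, lam=1.0)
b_large = ridge(X, y, lam=100.0)

print(b_small[:2])                                         # duplicated columns share the weight
print(np.linalg.norm(b_small), np.linalg.norm(b_large))    # larger lam shrinks more
print(np.count_nonzero(b_small))                           # no coefficient is exactly zero
```

The duplicated predictors receive equal coefficients, and none of the 30 coefficients is set to zero, which is why ridge alone does not select variables.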

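13 LASSO regression
Lasso regression finds $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$, or equivalently, in penalized form, $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$.
Performs both continuous shrinkage and variable selection.
$\lambda$ controls the amount of sparsity.
Illustration for p = 2 (figure).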

14 Example of Lasso regularization path The higher the regularization, the sparser the solution Run it yourself
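A regularization path of this kind can be reproduced with a small hand-written solver. The sketch below uses cyclic coordinate descent with soft-thresholding on synthetic data; the objective scaling (0.5/n on the squared error), the data and the function names are assumptions of the sketch, and a library such as scikit-learn or glmnet would normally be used instead.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5/n * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                               # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]                # put the updated coordinate back
    return b

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam_max = np.max(np.abs(X.T @ y)) / n          # above this value every coefficient is zero
lambdas = lam_max * np.logspace(0, -2, 8)      # from strong to weak regularization
n_nonzero = [int(np.count_nonzero(lasso_cd(X, y, lam))) for lam in lambdas]
print(n_nonzero)
```

Reading the printed counts from left to right shows more coefficients entering the model as the regularization is relaxed.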

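15 Elastic net regression: convex combination of L1 and L2 penalties
Elastic net regression solves $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$.
It retains the benefits of both individual penalties.
Setting $\alpha = \lambda_2 / (\lambda_1 + \lambda_2)$, the penalty simplifies to $(1 - \alpha) \|\beta\|_1 + \alpha \|\beta\|_2^2$, up to the overall factor $\lambda_1 + \lambda_2$.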

16 Solving the penalized regression problem
Convex but non-smooth problem: a global optimum can always be found, though not through simple techniques (gradient descent etc.).
3 families of methods:
Homotopy methods: LARS algorithm [Efron et al. 2004]
Proximal methods [Nesterov 2004]
Coordinate descent [Friedman et al. 2009]
Implementations are available in R and Python: glmnet R package, scikit-learn.
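To make the proximal family concrete, here is a minimal sketch of plain (unaccelerated) proximal gradient descent, often called ISTA, applied to the lasso. It is not the accelerated scheme of the cited reference; the objective scaling, step size choice and data are assumptions of the sketch.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5/n * ||y - X b||^2 + lam * ||b||_1 by proximal gradient steps."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n           # gradient of the squared-error term
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 0.2
b_hat = lasso_ista(X, y, lam)

# Optimality check: at a lasso solution, |X'(y - X b)| / n never exceeds lam.
kkt = np.max(np.abs(X.T @ (y - X @ b_hat) / n))
print(np.flatnonzero(b_hat), kkt)
```

The gradient step handles the smooth squared error and the soft-thresholding step handles the non-smooth L1 term, which is how the method sidesteps the failure of ordinary gradient descent.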

17 Variable selection in practice: λ
A common procedure is nested cross-validation (CV). For each value of λ within a given range Λ:
1. Leave m samples out for testing.
2. Use the remaining n − m samples for training: fit the model.
3. Compute the prediction error using the test sample.
4. Repeat for all n/m folds and take the average prediction error.
The optimal λ minimizes the cross-validated prediction error.
Various search strategies can be used to explore the space Λ.
No optimal solution is guaranteed: the global problem (which includes optimizing λ) is NOT convex.
Cost = (n/m) × |Λ| model fits.
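The loop over λ and over folds can be sketched as follows. To keep the example short, ridge regression (closed form) is swapped in for the sparse model; the loop structure and the fit count are the same for any penalized estimator. The data, grid and helper names are assumptions of the sketch.

```python
import numpy as np

def ridge(X, y, lam):
    """Stand-in penalized model for this sketch (closed-form ridge)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_curve(X, y, lambdas, m):
    """Mean held-out squared error per lambda, leaving m samples out per fold."""
    n = len(y)
    folds = [np.arange(start, min(start + m, n)) for start in range(0, n, m)]
    errors, n_fits = [], 0
    for lam in lambdas:
        fold_errors = []
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            beta = ridge(X[train], y[train], lam)            # fit on n - m samples
            fold_errors.append(np.mean((y[test] - X[test] @ beta) ** 2))
            n_fits += 1
        errors.append(np.mean(fold_errors))
    return np.array(errors), n_fits

rng = np.random.default_rng(0)
n, p, m = 60, 40, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.standard_normal(n)

lambdas = np.logspace(-3, 3, 13)
errors, n_fits = cv_curve(X, y, lambdas, m)
best_lambda = lambdas[int(np.argmin(errors))]
print(best_lambda, n_fits)
```

The fit counter makes the cost formula visible: with n/m = 6 folds and 13 candidate values, 78 models are fitted.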

18 Variable selection in practice: caveat
Learning the parameter is different from evaluating the performance of the predictive model! You need to have two cross-validation loops.
Outer loop: split the data set (X, y) into a learning set (Xl, yl) and a test set (Xt, yt).
Inner loop: set Λ and split the learning set into an internal learning set (Xb, yb) and an internal test set (Xa, ya). For all λ in Λ, fit on (Xb, yb) and predict on (Xa, ya). Choose λ* in Λ.
Compute β on the learning set with λ*, predict ŷt = Xt β, and compute the prediction accuracy ‖ŷt − yt‖².
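The two loops can be written as two small functions, one nested in the other. The sketch again uses closed-form ridge as a stand-in for the penalized model; the fold counts, data and function names are assumptions, and the variable names mirror the sets named on the slide.

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold(n, k):
    """Yield (train, test) index arrays for k contiguous folds."""
    idx = np.arange(n)
    for test in np.array_split(idx, k):
        yield np.setdiff1d(idx, test), test

def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))

def choose_lambda(Xl, yl, lambdas, k=4):
    """Inner loop: pick lambda* using only the learning set (Xl, yl)."""
    scores = [np.mean([mse(Xl[a], yl[a], ridge(Xl[b], yl[b], lam))
                       for b, a in kfold(len(yl), k)])   # b: internal learning, a: internal test
              for lam in lambdas]
    return lambdas[int(np.argmin(scores))]

def nested_cv(X, y, lambdas, k_outer=5):
    """Outer loop: score on test sets that never took part in choosing lambda."""
    errors = []
    for learn, test in kfold(len(y), k_outer):
        lam_star = choose_lambda(X[learn], y[learn], lambdas)
        beta = ridge(X[learn], y[learn], lam_star)       # refit on the whole learning set
        errors.append(mse(X[test], y[test], beta))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.standard_normal(n)

err = nested_cv(X, y, np.logspace(-2, 2, 9))
print(err)
```

Reporting the inner-loop error of the best λ instead would be optimistic, because that error was itself minimized over Λ.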

19 Variable selection in practice
Stability selection is an alternative approach which avoids searching for an optimal regularization parameter [Meinshausen and Bühlmann, 2009]. The procedure works as follows:
1. Extract B subsamples (e.g. of size n/2) from the training data set.
2. For each subsample, fit the sparse regression model.
3. Estimate the probability of each predictor being selected.
4. Select all those predictors whose selection probability is above a pre-determined threshold.
Under some assumptions, this procedure controls the expected number of false positives.
Unlike CV, it does not heavily depend upon the regularization parameter λ. See [Brunea et al. 2011].
But it assumes a very sparse solution.
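The four steps translate almost directly into code. The sketch below is a simplified version that uses a single fixed λ and a small coordinate-descent lasso; the published procedure considers a range of regularization values, and the threshold, λ and data here are assumptions chosen for the illustration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent, objective 0.5/n * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b, r = np.zeros(p), y.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def stability_selection(X, y, lam, n_subsamples=40, threshold=0.6, seed=0):
    """Selection frequency of each predictor over random half-samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)    # step 1: subsample of size n/2
        counts += lasso_cd(X[idx], y[idx], lam) != 0       # step 2: fit, record the support
    freq = counts / n_subsamples                           # step 3: selection probabilities
    return freq, np.flatnonzero(freq >= threshold)         # step 4: threshold

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.5]
y = X @ beta_true + rng.standard_normal(n)

freq, selected = stability_selection(X, y, lam=0.3)
print(selected, np.round(freq[:5], 2))
```

Predictors with a real effect are picked in nearly every subsample, while noise predictors enter only occasionally, so the frequency threshold separates the two groups.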

20 Example: Lasso regression, stability path
Lasso path and stability path of the Lasso on the vitamin gene-expression dataset. The paths of the 6 non-permuted genes are plotted as solid red lines, while the paths of the 4082 permuted genes are shown as broken black lines [Meinshausen and Bühlmann, 2009].

21 Summary on multiple regression Latent variable models for one phenotype Simultaneous dimensionality reduction and variable selection

22 Modeling multiple genotypes and phenotypes
Multivariate multiple linear regression: Y = XC + E.
If n were greater than p, C could be estimated by least squares, or with adequate penalization, column-wise.
C would be of full rank R = min(p, q).
This gives the same solutions as q separate regression models.
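The equivalence with q separate regressions is easy to verify numerically. The sketch below fits C jointly and column by column on synthetic data; the sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 8, 4                               # n > p so least squares is well posed
X = rng.standard_normal((n, p))
C_true = rng.standard_normal((p, q))
Y = X @ C_true + 0.1 * rng.standard_normal((n, q))

# Joint fit of the multivariate model Y = XC + E.
C_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# q separate univariate-response regressions, one per column of Y.
C_columns = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(q)]
)

print(np.allclose(C_joint, C_columns), np.linalg.matrix_rank(C_joint))
```

Unconstrained least squares therefore gains nothing from modeling the phenotypes together, which is the motivation for constraining the rank of C.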

23 Reduced Rank Regression / PLS
Alternative approach: impose a rank condition on the regression coefficient matrix, so that rank(C) ≤ min(p, q).
If C has rank r, it can be written as a product of a (p × r) matrix B and an (r × q) matrix A, both of full rank.
The RRR model is written Y = XBA + E.
For a fixed rank r, the matrices A and B are obtained by minimizing the weighted least squares criterion $M = \mathrm{Tr}\{(Y - XBA)\, \Gamma\, (Y - XBA)'\}$ for a given (q × q) positive definite matrix Γ.

24 Reduced Rank Regression/ PLS illustration Canonical Correlation Analysis (CCA) can be used to remove the variance confound, but requires regularization

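25 RRR/PLS Solution
The optimal A and B are obtained as $\hat{A} = H' \Gamma^{-1/2}$ and $\hat{B} = (X'X)^{-1} X' Y \Gamma^{1/2} H$, where H is the (q × r) matrix whose columns are the first r normalized eigenvectors associated with the r largest eigenvalues of the (q × q) matrix $R = \Gamma^{1/2} Y' X (X'X)^{-1} X' Y \Gamma^{1/2}$.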
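For the special case Γ = I the eigenvector construction takes only a few lines. The sketch below assumes n > p so that X'X is invertible, and checks the result against the best rank-r approximation of the OLS fitted values; the data and the function name are illustrative.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r fit of Y = X B A + E with Gamma = I (assumes X'X is invertible)."""
    C_ols = np.linalg.solve(X.T @ X, X.T @ Y)      # (p, q) full-rank OLS estimate
    M = Y.T @ X @ C_ols                            # Y'X (X'X)^{-1} X'Y, shape (q, q)
    M = (M + M.T) / 2.0                            # remove rounding asymmetry
    eigval, eigvec = np.linalg.eigh(M)
    H = eigvec[:, np.argsort(eigval)[::-1][:r]]    # eigenvectors of the r largest eigenvalues
    return C_ols @ H, H.T                          # B is (p, r), A is (r, q)

rng = np.random.default_rng(0)
n, p, q, r = 200, 10, 6, 2
X = rng.standard_normal((n, p))
C_true = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # rank-2 signal
Y = X @ C_true + 0.5 * rng.standard_normal((n, q))

B, A = reduced_rank_regression(X, Y, r)

# Reference: truncated SVD of the OLS fitted values.
C_ols = np.linalg.solve(X.T @ X, X.T @ Y)
U, s, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
fitted_svd = (U[:, :r] * s[:r]) @ Vt[:r]

print(np.linalg.matrix_rank(B @ A), np.allclose(X @ B @ A, fitted_svd))
```

With Γ = I, reduced-rank regression amounts to projecting the OLS fit onto its r leading directions in phenotype space, so B maps SNPs to r latent variables and A maps those to the phenotypes.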

26 Sparse reduced rank regression (sRRR)
Vounou et al. (2010), Le Floch et al. (2011).
Add penalties to induce sparsity on A and/or B.

27 Interpretation of PLS/RRR/CCA
Latent variable models for multiple phenotypes: find latent variable pairs $(t_r, s_r)$ satisfying some optimal properties [Hoggart et al. (2008), Wu et al. (2009), Vounou et al. (2010), Le Floch et al. (2011)].

28 Statistical power comparison
Monte Carlo simulation framework [Vounou et al., 2010]:
Generate an entire population P of 10k individuals, using a forwards-in-time simulation approach (FREGENE) that reproduces features observed in real human populations. Genotypes are coded as minor allele SNP dosage.
Generate B Monte Carlo data sets of sample size n each:
1. Randomly sample n genotypes x from the population P.
2. Simulate the n phenotypes y from a multivariate normal distribution calibrated on real data (ADNI database).
3. Induce an association according to an additive genetic model.
p between 1000 to 40, predictive SNPs with small marginal effects
q = 111 with 6 true responses

29 Simulations results SNP sensitivity with n = 500 SNP sensitivity with n = 1000

30 Simulations results (ctd)
Large p: ratio of SNP sensitivities (sRRR/MULM) as a function of the total number of SNPs

31 More experiments: MULM vs elastic net
Comparison of the performance of MULM and elastic net to predict one simulated phenotype, varying the SNR and varying the sparsity.
Top: using hybrid simulation (real SNPs). Bottom: using i.i.d. data (simulated SNPs).
Also: better support recovery with elastic net.

32 Real data
Q = 19: functional asymmetries in 19 ROIs.
P = 1083 SNPs in 12 genetic regions.
N = 94: Localizer dataset [Pinel et al. Subm.].
One significant association detected using enet [Thirion et al. in prep].

33 Many thanks to
Vincent Frouin, Jean-Baptiste Poline, Edouard Duchesnay, Edith Le Floch and the genim group at Neurospin.
Gaël Varoquaux, Fabian Pedregosa, Alexandre Gramfort, Vincent Michel and the scikit-learn contributors.
Giovanni Montana for providing most of this material.

34 Bibliography
Filippini, N., Rao, A., Wetten, S., et al. (2009). Anatomically-distinct genetic associations of APOE epsilon4 allele load with regional cortical atrophy in Alzheimer's disease. NeuroImage, 44(3).
Hoggart, C., Whittaker, J., De Iorio, M., and Balding, D. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet, 4(7).
Joyner, A. H., Roddey, J. C., Bloss, C. S., et al. (2009). A common MECP2 haplotype associates with reduced cortical surface area in humans in two independent populations. PNAS, 106(36).
Meinshausen, N. and Bühlmann, P. (2009). Stability selection. Annals of Statistics.
Potkin, S. G., Turner, J. A., Guanti, G., et al. (2009). A genome-wide association study of schizophrenia using brain activation as a quantitative phenotype. Schizophrenia Bulletin, 35(1).
Shen, L., Kim, S., Risacher, S. L., Nho, K., et al. (2010). Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: A study of the ADNI cohort. NeuroImage.

35 Bibliography
Stein, J. L., Hua, X., Lee, S., et al. (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage.
Vounou, M., Nichols, T., and Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. NeuroImage.
Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515.
Wu, T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714.
