Latent Variable Models for GWAs
Oliver Stegle
Machine Learning and Computational Biology Research Group
Max-Planck-Institutes Tübingen, Germany
September 2011
Motivation: Why latent variables?

Causal influences on phenotypes:
- Genotype: the primary variable of interest
- Known confounding factors: covariates, population structure
- Unknown (latent) confounders: sample handling, sample history, subtle environmental perturbations

[Figure: genome (SNP matrix, individuals x SNPs) mapped through an unknown function "?" to the phenome (phenotypes y_1, y_2, ..., y_N), with covariates, population structure and confounders as additional influences on y.]
Outline
- Motivation
- Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM)
- Modeling hidden confounders in GWAs
  - Model
  - Applications
- Modeling unobserved cellular phenotypes in genetic analyses
  - Model
  - Applications
- A unifying view
- Summary

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM)
Manifolds and dimension reduction

[Figure: manifold learning illustration, from Olivier Grisel, generated using the Modular Data Processing toolkit and matplotlib.]
Linear dimension reduction

Map G-dimensional data onto a K-dimensional manifold, with K << G:

$$\underbrace{Y}_{N \times G} = \underbrace{H}_{N \times K}\,\underbrace{W}_{K \times G} + \underbrace{\Psi}_{N \times G}$$

- H: latent factors in the low-dimensional space
- W: weights of the factors on the data dimensions
- Ψ: noise, $\psi_{n,g} \sim \mathcal{N}(0, \sigma^2)$

Challenge: neither W nor H is known! Depending on the assumptions placed on W and H, this gives:
- Principal component analysis (PCA)
- Independent component analysis (ICA)
- ...
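The generative model above can be simulated in a few lines. This is a minimal sketch, not code from the talk: the function name, the standard-normal draws for H and W, and all sizes are illustrative assumptions. It also illustrates why the problem is hard: rotating H and counter-rotating W leaves the product H W unchanged.

```python
import numpy as np

def simulate_linear_lvm(N=50, G=200, K=3, sigma=0.1, seed=0):
    """Draw Y = H W + Psi; standard-normal H and W are an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((N, K))            # latent factors, N x K
    W = rng.standard_normal((K, G))            # factor weights, K x G
    Psi = sigma * rng.standard_normal((N, G))  # i.i.d. Gaussian noise, N x G
    return H @ W + Psi, H, W

Y, H, W = simulate_linear_lvm()

# The noise-free part H W has rank at most K, far below min(N, G).
signal_rank = np.linalg.matrix_rank(H @ W)

# Non-identifiability: an orthogonal R applied to H and undone in W
# gives exactly the same product, so H and W are not unique.
R, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((3, 3)))
same_product = np.allclose(H @ W, (H @ R) @ (R.T @ W))
print(Y.shape, signal_rank, same_product)
```

The different methods listed on the slide can be read as different ways of resolving this ambiguity through assumptions on W and H.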
Linear dimension reduction: PCA

PCA corresponds to a noise-free version of the model:

$$\underbrace{Y}_{N \times G} = \underbrace{H}_{N \times K}\,\underbrace{W}_{K \times G}$$

The PCA components (H) correspond to directions of maximum data variance in the original dataset:
- Covariance matrix: $C = Y Y^\top$ (slide notation, with the samples $y_n$ as columns; for the $N \times G$ layout above this is $Y^\top Y$ up to scaling)
- Eigenvalues/eigenvectors: $C v_i = \lambda_i v_i$
- Projection matrix: $P = [v_1, \dots, v_K]$
- Principal components: $H_n = P\,Y_n$ (read as $P^\top y_n$ when the $v_i$ are stored as columns of $P$)
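The PCA recipe on this slide (covariance, eigendecomposition, projection) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: data are stored as N x G (samples in rows), columns are centered first, and the function name `pca_eig` and the toy data are hypothetical.

```python
import numpy as np

def pca_eig(Y, K):
    """PCA of an N x G matrix via eigendecomposition of the G x G covariance."""
    Yc = Y - Y.mean(axis=0)                # center every data dimension
    C = Yc.T @ Yc / Yc.shape[0]            # covariance matrix, G x G
    evals, evecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:K]    # indices of the K largest eigenvalues
    P = evecs[:, order]                    # projection matrix, G x K
    H = Yc @ P                             # principal components, N x K
    return H, P, evals[order]

# Toy data: a rank-2 signal plus a little noise (assumed values).
rng = np.random.default_rng(0)
Y = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 10))
Y += 0.05 * rng.standard_normal(Y.shape)

H, P, evals = pca_eig(Y, K=2)
explained = evals.sum() / np.var(Y, axis=0).sum()  # fraction of total variance
print(H.shape, explained)
```

The variance of each column of H equals the corresponding eigenvalue, which is the sense in which the components capture directions of maximum variance.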
Linear dimension reduction: Bayesian PCA and GPLVM

Assumption: either the samples or the data dimensions are independent given H and W.

Probabilistic PCA [Tipping and Bishop, 1999] factorizes over samples and integrates out H:

$$p(Y \mid H, W) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid h_n W, \sigma^2 I\right)$$
$$p(H) = \prod_{n=1}^{N} \mathcal{N}\left(h_n \mid 0, \sigma_h^2 I\right)$$
$$p(Y \mid W) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid 0, \sigma_h^2 W W^\top + \sigma^2 I\right)$$

(The covariance is written $W W^\top$ as on the slide; with W stored as $K \times G$, the $G \times G$ matrix is $W^\top W$.)

GPLVM [Lawrence, 2005] factorizes over data dimensions and integrates out W:

$$p(Y \mid H, W) = \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \mid H w_g, \sigma^2 I\right)$$
$$p(W) = \prod_{g=1}^{G} \mathcal{N}\left(w_{:,g} \mid 0, \sigma_h^2 I\right)$$
$$p(Y \mid H) = \prod_{g=1}^{G} \mathcal{N}\Big(y_{:,g} \mid 0, \underbrace{\sigma_h^2 H H^\top + \sigma^2 I}_{N \times N}\Big)$$
GPLVM

Marginal likelihood:

$$p(Y \mid H) = \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \mid 0, \sigma_h^2 H H^\top + \sigma^2 I\right)$$

Inference of the most probable hidden factors H:

$$\hat{H} = \operatorname*{argmax}_{H} \, \log \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \mid 0, \sigma_h^2 H H^\top + \sigma^2 I\right)$$
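The objective being maximized can be evaluated directly. The sketch below computes the log marginal likelihood for a given H using a Cholesky factorization; it only evaluates the objective and does not perform the optimization over H. The function name, the simulated data, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def gplvm_log_marginal(Y, H, sigma_h2, sigma2):
    """log prod_g N(y_:g | 0, sigma_h2 * H H^T + sigma2 * I) for an N x G matrix Y."""
    N, G = Y.shape
    Sigma = sigma_h2 * (H @ H.T) + sigma2 * np.eye(N)  # shared N x N covariance
    L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T
    alpha = np.linalg.solve(L, Y)                      # whitened data, L^{-1} Y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))          # log |Sigma|
    quad = np.sum(alpha ** 2)                          # sum_g y_g^T Sigma^{-1} y_g
    return -0.5 * (G * N * np.log(2.0 * np.pi) + G * logdet + quad)

# Simulated check (assumed sizes): the factors that generated the data
# should score higher than unrelated random factors.
rng = np.random.default_rng(0)
N, G, K = 30, 100, 2
H_true = rng.standard_normal((N, K))
Y = H_true @ rng.standard_normal((K, G)) + 0.1 * rng.standard_normal((N, G))

ll_true = gplvm_log_marginal(Y, H_true, 1.0, 0.01)
ll_rand = gplvm_log_marginal(Y, rng.standard_normal((N, K)), 1.0, 0.01)
print(ll_true > ll_rand)
```

Because the covariance is shared across all G data dimensions, one factorization serves every column of Y, which keeps the cost at one N x N Cholesky per evaluation.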
Are linear relationships sufficient?

[Figure: non-linear manifold example, from Olivier Grisel, generated using the Modular Data Processing toolkit and matplotlib.]

Non-linear generalizations introduce a general kernel function in place of the linear covariance $H H^\top$:

$$\hat{H} = \operatorname*{argmax}_{H} \, \log \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \mid 0, \sigma_h^2 K_{H,H} + \sigma^2 I\right)$$
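Swapping the covariance is a small change in code. The slide only says "general kernel function"; the squared-exponential (RBF) kernel below is one common choice and is an assumption of this sketch, as are the function names and hyperparameter values.

```python
import numpy as np

def linear_kernel(H):
    """Linear covariance H H^T, as in the linear GPLVM."""
    return H @ H.T

def rbf_kernel(H, lengthscale=1.0):
    """Squared-exponential kernel on the rows of H (an assumed choice of K_{H,H})."""
    sq = np.sum(H ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (H @ H.T), 0.0)  # squared distances
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gplvm_covariance(H, kernel, sigma_h2=1.0, sigma2=0.1):
    """N x N covariance sigma_h2 * K_{H,H} + sigma2 * I for any kernel function."""
    return sigma_h2 * kernel(H) + sigma2 * np.eye(H.shape[0])

H = np.random.default_rng(0).standard_normal((5, 2))
Sigma_lin = gplvm_covariance(H, linear_kernel)
Sigma_rbf = gplvm_covariance(H, rbf_kernel)
print(Sigma_lin.shape, Sigma_rbf.shape)
```

Any positive semi-definite kernel can be passed in the same way; the rest of the likelihood computation is unchanged.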
Modeling hidden confounders in GWAs
Confounders in eQTL studies

Sources of confounding:
- Experimental procedures
- Gene regulation
- (Translation)
- Sample preparation
- Sample history

It is standard to take known factors (gender, population structure) into account. It is key to account for hidden factors as well.

[Figure: DNA (SNPs) -> transcription -> mRNA -> translation -> proteins, subject to observed and hidden external influences (experimental procedures, environment) and gene regulation (alternative splicing, transcription factors, small RNAs).]
Motivating examples

HapMap II: 3 populations, 90 individuals each, 40K genes, 3 million SNPs (here: a small region on chromosome 7).

[Figure: eQTL LOD scores for SLC35B4 against position on chr. 7 (about 1.3354 to 1.3374 x 10^8), with the FPR 0.01% threshold marked. Panel (a): standard eQTL. Panel (b): VBFA eQTL, accounting for hidden factors.]
Association model: direct effects of confounders

- Start with the standard association model.
- Include a few (K) global hidden factors (confounders) in the model.
- The factors $H = \{h_{:,1}, \dots, h_{:,K}\}$ need to be learned from the expression data.
- How to control model complexity, i.e. choose the number of factors?

$$y_{:,g} = \underbrace{\sum_{n=1}^{N} (\theta_{n,g}\, s_n)}_{\text{genetic}} + \underbrace{\sum_{k=1}^{K} w_{g,k}\, h_k}_{\text{hidden factors}} + \underbrace{\psi}_{\text{noise}}$$

[Figure: genome (SNPs) mapped to the phenome (phenotypes y_1, y_2, ..., y_N), with covariates, population structure and confounders acting on y.]
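To see what the hidden-factor term does to expression data, the model can be simulated with and without it. This is an illustrative sketch only: the sizes, the effect size of 0.5, one causal SNP per gene, binary genotypes, and the helper names are all assumptions, and the number of SNPs is called S here to keep it distinct from the number of samples N.

```python
import numpy as np

def simulate_expression(N=100, S=20, G=50, K=2, factor_scale=1.0, seed=0):
    """Simulate Y = genetic + hidden factors + noise for N samples and G genes."""
    rng = np.random.default_rng(seed)
    snps = rng.integers(0, 2, size=(N, S)).astype(float)    # binary genotypes, N x S
    theta = np.zeros((S, G))
    theta[rng.integers(0, S, size=G), np.arange(G)] = 0.5   # one causal SNP per gene
    H = rng.standard_normal((N, K))                         # hidden factors, N x K
    W = factor_scale * rng.standard_normal((K, G))          # factor weights, K x G
    noise = rng.standard_normal((N, G))
    return snps @ theta + H @ W + noise, snps, H

def mean_abs_gene_correlation(Y):
    """Average absolute off-diagonal correlation between genes (columns of Y)."""
    C = np.corrcoef(Y, rowvar=False)
    off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
    return np.abs(off_diagonal).mean()

Y_conf, _, _ = simulate_expression(factor_scale=1.0)   # with hidden factors
Y_clean, _, _ = simulate_expression(factor_scale=0.0)  # hidden factors switched off
print(mean_abs_gene_correlation(Y_conf), mean_abs_gene_correlation(Y_clean))
```

A handful of global factors induces broad correlation across genes, which is the structure that a latent variable model can pick up from the expression data alone.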
Association model: Gaussian process formulation

Data likelihood of the linear generative model, assuming Gaussian noise:

$$p(Y \mid X, H, \theta, W, \Theta_K) = \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \,\Big|\, \sum_{s=1}^{S} x_s \theta_s + \sum_{k=1}^{K} w_{g,k} h_k,\; \sigma_e^2 I\right)$$

Specify Gaussian priors on the factor weights:

$$P(w_{g,k}) = \mathcal{N}\left(w_{g,k} \mid 0, \sigma_h^2\right)$$

Integrating out the weights yields a Gaussian process marginal likelihood that factorizes over genes given H:

$$P(Y \mid X, H, \Theta_K) = \prod_{g=1}^{G} \mathcal{N}\left(y_{:,g} \,\Big|\, \sum_{s=1}^{S} x_s \theta_s,\; \sigma_h^2 \sum_{k=1}^{K} h_k h_k^\top + \sigma_e^2 I\right)$$
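The step of integrating out the weights can be checked numerically: drawing many weight vectors from the prior and adding noise should reproduce the covariance $\sigma_h^2 \sum_k h_k h_k^\top + \sigma_e^2 I$. The sketch below is a Monte Carlo sanity check with a small hand-picked H; the function names, the values of H, and the variances are illustrative assumptions.

```python
import numpy as np

def marginal_cov(H, sigma_h2, sigma_e2):
    """Analytic covariance after integrating out w: sigma_h2 * H H^T + sigma_e2 * I."""
    return sigma_h2 * (H @ H.T) + sigma_e2 * np.eye(H.shape[0])

def empirical_cov(H, sigma_h2, sigma_e2, draws=200_000, seed=0):
    """Second moment of sum_k w_k h_k + noise over many prior draws of w."""
    rng = np.random.default_rng(seed)
    N, K = H.shape
    W = np.sqrt(sigma_h2) * rng.standard_normal((draws, K))      # w_k ~ N(0, sigma_h2)
    noise = np.sqrt(sigma_e2) * rng.standard_normal((draws, N))  # psi ~ N(0, sigma_e2 I)
    Ysim = W @ H.T + noise                                       # one simulated gene per row
    return Ysim.T @ Ysim / draws                                 # zero mean, so this is the covariance

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])  # N = 4, K = 2
analytic = marginal_cov(H, 1.0, 0.25)
sampled = empirical_cov(H, 1.0, 0.25)
max_gap = np.abs(analytic - sampled).max()
print(max_gap)
```

Note that $\sum_k h_k h_k^\top$ with $h_k$ the columns of H is the same matrix as $H H^\top$, which is how the code computes it.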
Association model: including population structure

$$P(Y \mid X, H, \Theta_K) = \prod_{g=1}^{G} \mathcal{N}\Big(y_{:,g} \,\Big|\, x_i \theta_i,\; \underbrace{\sigma_h^2 \sum_{k=1}^{K} h_k h_k^\top + \sigma_g X X^\top + \sigma_e^2 I}_{N \times N}\Big)$$

- Include an additional covariance term, $\sigma_g X X^\top$, for population structure.
- Association test for a single SNP: the mean $\sum_{s=1}^{S} x_s \theta_s$ reduces to $x_i \theta_i$.
- Challenge: refitting $\sigma_g$, $\sigma_h$ and H for every SNP.
Association model: fit parameters once on the null model

$$\hat{H}, \hat{\sigma}_g, \hat{\sigma}_h, \hat{\sigma}_e = \operatorname*{argmax}_{H,\, \sigma_h,\, \sigma_g,\, \sigma_e} P(Y \mid X, H, \Theta_K)$$

Association testing with fixed hyperparameters:

$$p(y_{:,g} \mid x_i, H, \Theta_K) = \mathcal{N}\left(y_{:,g} \,\Big|\, x_i \theta_i,\; \alpha^2\left(\hat{\sigma}_h^2 \hat{H}\hat{H}^\top + \hat{\sigma}_g X X^\top\right) + \sigma_e^2 I\right)$$
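Once the covariance is fixed, testing a SNP reduces to generalized least squares. The sketch below computes a LOD score (log10 likelihood ratio against $\theta_i = 0$) for one SNP under a fully fixed covariance. It is a simplification, not the method from the talk: it sets $\alpha = 1$, does not refit $\sigma_e$, omits the population-structure term, and uses simulated data with -1/+1 genotypes; all names and values are assumptions.

```python
import numpy as np

def lod_score(y, x, Sigma):
    """LOD for y ~ N(x * theta, Sigma) versus theta = 0, with Sigma held fixed."""
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    yw = np.linalg.solve(L, y)           # whitened phenotype
    xw = np.linalg.solve(L, x)           # whitened genotype
    theta = (xw @ yw) / (xw @ xw)        # generalized least squares estimate
    # Log-likelihood ratio: half the reduction in the whitened residual sum of squares.
    llr = 0.5 * (yw @ yw - np.sum((yw - theta * xw) ** 2))
    return llr / np.log(10.0), theta

# Simulated example: one strong hidden factor, one causal and one unrelated SNP.
rng = np.random.default_rng(0)
N = 200
h = rng.standard_normal((N, 1))                               # hidden factor
x_causal = 2.0 * rng.integers(0, 2, size=N).astype(float) - 1.0
x_null = 2.0 * rng.integers(0, 2, size=N).astype(float) - 1.0
y = 0.5 * x_causal + 2.0 * h[:, 0] + 0.1 * rng.standard_normal(N)

Sigma = 4.0 * (h @ h.T) + 0.01 * np.eye(N)                    # factor term + noise term
lod_causal, theta_causal = lod_score(y, x_causal, Sigma)
lod_null, _ = lod_score(y, x_null, Sigma)
print(lod_causal, lod_null, theta_causal)
```

Since the Cholesky factor depends only on the fixed covariance, it can be computed once and reused for every SNP, which is what makes fitting on the null model attractive.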
Applications: evaluation on real data

[Figure: schematic of a cis association (SNP S in the promoter region of gene YJL213W) and a trans association (SNP S distant from the promoter of gene YJL213W). Panel (a): yeast cis eQTLs. Panel (b): yeast trans eQTLs.]
Applications: evaluation on real data

Application to HapMap II expression analysis.

[Figure: panel (a): probes with a VBeQTL in the pooled population. Panel (f): cis VBeQTL location and strength relative to gene start.]
Modeling unobserved cellular phenotypes in genetic analyses
Association studies: regulatory and external factors

Confounding factors: accounting for hidden confounders.

$$y_{g,j} = \underbrace{\sum_{n} b_{n,g}\,(\theta_{n,g}\, s_{n,j})}_{\text{genetic}} + \underbrace{v_g f_j}_{\text{known factors}} + \underbrace{w_g x_j}_{\text{hidden factors}} + \underbrace{\psi_{g,j}}_{\text{noise}}$$

Accounting for regulatory factors?

[Figure: DNA (SNPs) -> transcription -> mRNA -> translation -> proteins, subject to observed and hidden external influences (experimental procedures, environment) and gene regulation (alternative splicing, transcription factors, small RNAs).]
Association studies: regulatory factors

Account for regulatory factors:
- Transcription factors
- Pathway components

Hypothesis: intermediate cellular factors mediate the observed association signals of target genes.

Measuring X? Difficult and expensive. Instead, learn the unobserved factors X. [Parts et al., 2011]

[Figure: DNA (SNPs) -> transcription -> mRNA -> translation -> proteins, subject to observed and hidden external influences (experimental procedures, environment) and gene regulation (alternative splicing, small RNAs).]
Association studies: transcription factor effects

Inference of regulatory factors using a bilinear model:

$$\underbrace{Y}_{\text{Expr.}} = \underbrace{W}_{\text{Weights}}\,\underbrace{X}_{\text{Factors}} + \underbrace{\Psi}_{\text{Noise}}$$

- W is sparse; each factor regulates only a subset of all genes.
- Incorporate biological prior knowledge so that the factors are interpretable, e.g. transcription factor binding affinities.
- The inferred factors summarize co-expression clusters.

[Figure: TF PHO4 binding the promoter of gene YJL213W; gene expression matrix (5493 genes x 218 segregants); known PHO4 targets from Yeastract; inferred PHO4 factor activation across the 218 segregants.]
Sparse factor analysis: probabilistic model

[Figure: graphical model for $Y = W X + \Psi$.]

Indicators $z_{g,k}$ determine the sparsity pattern:

$$P(w_{g,k} \mid z_{g,k} = 0) = \mathcal{N}\left(w_{g,k} \mid 0, \sigma_0^2\right)$$
$$P(w_{g,k} \mid z_{g,k} = 1) = \mathcal{N}\left(w_{g,k} \mid 0, \sigma_1^2\right)$$

Prior knowledge is encoded in $\pi_{g,k} = P(z_{g,k} = 1)$. Standard conjugate priors are used for the remaining random variables.
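The two-component prior on the weights can be illustrated by sampling from it and by applying Bayes' rule to a single weight. This is a sketch of the prior only, not of the inference scheme used in the talk; the values of $\sigma_0$, $\sigma_1$ and $\pi_{g,k}$ below, and both function names, are assumptions.

```python
import numpy as np

def sample_sparse_weights(pi, sigma0=0.01, sigma1=1.0, seed=0):
    """Draw indicators z ~ Bernoulli(pi) and weights w | z ~ N(0, sigma_z^2)."""
    rng = np.random.default_rng(seed)
    Z = rng.random(pi.shape) < pi              # z_gk = 1 with probability pi_gk
    sd = np.where(Z, sigma1, sigma0)           # wide "on" prior, narrow "off" prior
    W = sd * rng.standard_normal(pi.shape)
    return W, Z

def posterior_on(w, pi, sigma0=0.01, sigma1=1.0):
    """P(z = 1 | w) by Bayes' rule under the two-component Gaussian prior."""
    def normal_pdf(v, sd):
        return np.exp(-0.5 * (v / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    on = pi * normal_pdf(w, sigma1)
    off = (1.0 - pi) * normal_pdf(w, sigma0)
    return on / (on + off)

# Assumed prior: factor 0 is believed to regulate most genes, the others few.
pi = np.full((1000, 5), 0.1)
pi[:, 0] = 0.9
W, Z = sample_sparse_weights(pi)
print(Z.mean(axis=0))
print(posterior_on(0.0, 0.5), posterior_on(1.0, 0.5))
```

A weight near zero is attributed to the "off" component and a large weight to the "on" component, which is how the indicators select the sparsity pattern.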
Sparse factor analysis: inference

Scale is challenging: thousands of genes, hundreds of TFs.

Efficient deterministic approximate inference:
1. Variational Bayesian inference for the core model.
2. Expectation Propagation for the sparsity submodel.
Sparse factor analysis: comparison to MCMC on a (small!) simulated dataset

[Figure: recovery of the true simulated network structure.]
[Figure: recovery of the true simulated factor activations.]
Application to yeast: dataset

The approach was applied to 108 yeast strains (crosses), genotyped and expression profiled in 2 conditions (ethanol and glucose). [Smith and Kruglyak, 2008]

Alternative types of prior information are available:
- TF binding affinities (Yeastract)
- Pathway information (KEGG)
Application to yeast: factor associations

Biological hypotheses:
- Genetic variation (SNPs) may regulate factor activations.
- Similarly for the environmental condition (glucose/ethanol).
- Interaction effects between SNPs, factors and genes.

[Figure: DNA (SNPs) -> transcription -> mRNA -> translation -> proteins, subject to observed and hidden external influences (experimental procedures, environment) and gene regulation (alternative splicing, small RNAs).]
Application to yeast: Factor associations
Genome-wide association density.
[Figure: number of associations against chromosomal position (chromosomes I to XVI); series: Yeastract, KEGG and Freeform factors, and trans genes.]

Application to yeast: Factor interactions
Interaction tests between all gene/SNP/factor triplets.
[Figure: input data are the genotype (218 segregants × 2956 segregating loci), gene expression (218 segregants × 5493 genes) and the growth condition. Panel (c), PHO4 factor activation: segregant count against PHO4 activation. Genotype-factor interaction panel: YJL213W expression against PHO4 activation for BY Glu, RM Glu, BY Eth and RM Eth; interaction LOD = 35.0 (PHO84 locus).]

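The slide does not state which statistic is used for the interaction tests. As a minimal sketch, assuming a standard nested linear-model comparison, one triplet can be scored by comparing an additive model (SNP + factor) against a model that adds the SNP × factor product term, with LOD = (n/2) log10(RSS_additive / RSS_interaction). The function names and the simulated data are hypothetical; only the sample size of 218 segregants comes from the slide.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def interaction_lod(expr, snp, factor):
    """LOD score for a SNP x factor interaction on one gene's expression.

    Assumption: additive model vs. additive model plus product term;
    this is an illustrative choice, not necessarily the test used in the talk.
    """
    n = len(expr)
    ones = np.ones(n)
    X_additive = np.column_stack([ones, snp, factor])
    X_interact = np.column_stack([ones, snp, factor, snp * factor])
    return 0.5 * n * np.log10(rss(X_additive, expr) / rss(X_interact, expr))

# Simulated triplet (hypothetical data): the factor affects expression
# only in segregants carrying allele 1.
rng = np.random.default_rng(0)
n = 218
snp = rng.integers(0, 2, size=n).astype(float)
factor = rng.standard_normal(n)
expr_interacting = 2.0 * snp * factor + 0.1 * rng.standard_normal(n)
expr_additive = factor + 0.1 * rng.standard_normal(n)

lod_interacting = interaction_lod(expr_interacting, snp, factor)
lod_additive = interaction_lod(expr_additive, snp, factor)
print(lod_interacting, lod_additive)
```

Running this over all gene/SNP/factor triplets is a triple loop over the same function; in practice a multiple-testing correction would be needed, which this sketch leaves out.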
Application to yeast: Factor interactions
Example interactions.
(a) YAP1-IRA2 interaction
(b) PHO4-PHO84 interaction
(c) MAT-YOX1 interaction

Application to yeast: Factor interactions
Genome-wide interaction density.
[Figure: number of interacting genes against chromosomal position (chromosomes I to XVI); series: Environment, Yeastract, KEGG and Freeform; labeled loci: OAF1, AMN1, PHO84, HAP1, IRA2, MKT1, CIN5.]

A unifying view
Outline
Motivation
Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM)
Modeling hidden confounders in GWAs
    Model
    Applications
Modeling unobserved cellular phenotypes in genetic analyses
    Model
    Applications
A unifying view
Summary

A unifying view: Matrix variate normal distributions
Confounders induce structure between samples.
Gene regulation induces structure between genes.
Matrix variate models for joint correction.
[Figure: data matrix (samples × genes) with known and hidden confounders.]

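A unifying view: Matrix variate normal distributions
Row covariance (between samples): $K$
Column covariance (between genes): $\Lambda^{-1}$
[Figure: data matrix (samples × genes) with heat maps of the row covariance $K$ (samples × samples) and the column covariance $\Lambda^{-1}$ (genes × genes).]
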
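A unifying view: Matrix variate distributions
Matrix variate normal distribution with column covariance $\Lambda^{-1}$ and row covariance $K$:

$$p(Y \mid \mu = 0, \Lambda^{-1}, K) \propto \exp\left(-\tfrac{1}{2}\,\mathrm{tr}\left[\Lambda Y^{\top} K^{-1} Y\right]\right)$$

Equivalent to a Kronecker covariance structure on the vectorized data:

$$p(\mathrm{vec}\,Y \mid \mu = 0, \Lambda^{-1}, K) = \mathcal{N}\left(\mathrm{vec}\,Y \mid 0, \Lambda^{-1} \otimes K\right)$$

Drawing samples from a matrix variate normal (MN):
1. Sample $Y_{rg} \sim \mathcal{N}(0, 1)$ for $r = 1, \dots, R$ and $g = 1, \dots, G$.
2. $Y \leftarrow \mathrm{chol}(K)\, Y$
3. $Y \leftarrow Y\, \mathrm{chol}(\Lambda^{-1})^{\top}$

Kronecker matrix product:

$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1H}B \\ a_{21}B & a_{22}B & \cdots & a_{2H}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{G1}B & a_{G2}B & \cdots & a_{GH}B \end{pmatrix}$$
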
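The three sampling steps and the Kronecker equivalence can be checked numerically. The following is a minimal NumPy sketch, assuming that chol(·) denotes the lower-triangular Cholesky factor (so that chol(A) chol(A)ᵀ = A) and that vec stacks the columns of Y. The function names and the toy covariance values are hypothetical.

```python
import numpy as np

def sample_matrix_normal(K, Lambda_inv, rng):
    """Draw one R x G matrix with row covariance K and column covariance Lambda_inv."""
    R, G = K.shape[0], Lambda_inv.shape[0]
    L_K = np.linalg.cholesky(K)            # lower triangular, L_K @ L_K.T == K
    L_C = np.linalg.cholesky(Lambda_inv)   # lower triangular, L_C @ L_C.T == Lambda_inv
    Z = rng.standard_normal((R, G))        # step 1: i.i.d. N(0, 1) entries
    return L_K @ Z @ L_C.T                 # steps 2 and 3 combined

def vec(Y):
    """Stack the columns of Y into a single vector (column-major order)."""
    return Y.reshape(-1, order="F")

# Toy covariances (hypothetical values, chosen to be positive definite).
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
Lambda_inv = np.array([[2.0, 0.3, 0.0],
                       [0.3, 1.0, 0.2],
                       [0.0, 0.2, 0.5]])

rng = np.random.default_rng(0)
Y = sample_matrix_normal(K, Lambda_inv, rng)

# The transform applied to vec(Z) is kron(L_C, L_K), so the covariance of
# vec(Y) is kron(L_C, L_K) @ kron(L_C, L_K).T, which should equal kron(Lambda_inv, K).
L_K = np.linalg.cholesky(K)
L_C = np.linalg.cholesky(Lambda_inv)
M = np.kron(L_C, L_K)
covariance_matches = np.allclose(M @ M.T, np.kron(Lambda_inv, K))

# The quadratic form in the matrix variate density equals the one in the
# vectorized Gaussian density.
Lambda = np.linalg.inv(Lambda_inv)
quad_trace = np.trace(Lambda @ Y.T @ np.linalg.inv(K) @ Y)
quad_vec = vec(Y) @ np.linalg.inv(np.kron(Lambda_inv, K)) @ vec(Y)
print(covariance_matches, np.isclose(quad_trace, quad_vec))
```

Forming the full Kronecker product is done here only for the check; the point of the two-sided Cholesky transform is that sampling never needs the (RG) × (RG) covariance explicitly.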
A unifying view: Simulated example, better recovery of the true graph
Recovery of simulated network.
Weak confounding variation (20% variance in row covariance).
[Figure: precision against recall for GLasso, Kronecker GLasso and Ideal GLasso; network diagrams on 50 nodes (0 to 49) for the ground truth, GLasso, Kronecker GLasso and Ideal GLasso.]

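For reference, the precision and recall of a recovered network can be computed by comparing edge sets, where an edge is a non-zero off-diagonal entry of the precision (inverse covariance) matrix. The sketch below illustrates the metric only and does not implement GLasso or Kronecker GLasso; the function names, the tolerance and the 4-node example matrices are hypothetical.

```python
import numpy as np

def edge_set(precision_matrix, tol=1e-8):
    """Undirected edges (i, j), i < j, whose off-diagonal entry exceeds tol in magnitude."""
    P = np.asarray(precision_matrix)
    rows, cols = np.triu_indices_from(P, k=1)
    return {(int(i), int(j)) for i, j in zip(rows, cols) if abs(P[i, j]) > tol}

def precision_recall(true_edges, predicted_edges):
    """Fraction of predicted edges that are true, and fraction of true edges recovered."""
    true_positives = len(true_edges & predicted_edges)
    precision = true_positives / len(predicted_edges) if predicted_edges else 1.0
    recall = true_positives / len(true_edges) if true_edges else 1.0
    return precision, recall

# Hypothetical ground truth: chain graph 0-1-2-3.
truth = np.array([[ 2.0, -1.0,  0.0,  0.0],
                  [-1.0,  2.0, -1.0,  0.0],
                  [ 0.0, -1.0,  2.0, -1.0],
                  [ 0.0,  0.0, -1.0,  2.0]])
# Hypothetical estimate: misses edge (1, 2) and adds a spurious edge (0, 2).
estimate = np.array([[ 2.0, -0.8,  0.3,  0.0],
                     [-0.8,  2.0,  0.0,  0.0],
                     [ 0.3,  0.0,  2.0, -0.9],
                     [ 0.0,  0.0, -0.9,  2.0]])

precision, recall = precision_recall(edge_set(truth), edge_set(estimate))
print(precision, recall)  # both 2/3: two of three predicted edges are true, two of three true edges found
```

A precision-recall curve such as the one on the slide is then obtained by repeating this comparison over a range of sparsity penalties.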
Summary
Latent variables can have a dramatic effect in GWAs.
In expression studies, confounders can be estimated from data.
Cellular features can be learnt from the expression profiles.
Duality of regulatory relationships and confounders in matrix variate normal models.

References
N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783-1816, 2005.
L. Parts, O. Stegle, J. Winn, and R. Durbin. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS Genetics, 7(1):e1001276, 2011.
E. Smith and L. Kruglyak. Gene-environment interaction in yeast gene expression. PLoS Biology, 6(4):e83, 2008.
M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society. Series B, Statistical Methodology, pages 611-622, 1999.