Latent Variable models for GWAs

Similar documents
GWAS IV: Bayesian linear (variance component) models

Introduction to Bioinformatics

GWAS V: Gaussian processes

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Causal Graphical Models in Systems Genetics

Pattern Recognition and Machine Learning

Causal Model Selection Hypothesis Tests in Systems Genetics

Inferring Transcriptional Regulatory Networks from High-throughput Data

Probabilistic Models for Learning Data Representations. Andreas Damianou

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data

Learning in Bayesian Networks

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Chapter 15 Active Reading Guide Regulation of Gene Expression

Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model

Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday

Bayesian learning of sparse factor loadings

Association studies and regression

Linear Regression (1/1/17)

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Bayesian Machine Learning

Inferring Genetic Architecture of Complex Biological Processes

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Computational Approaches to Statistical Genetics

Machine Learning Techniques for Computer Vision

Predicting Protein Functions and Domain Interactions from Protein Interactions

Computational Biology: Basics & Interesting Problems

Research Statement on Statistics Jun Zhang

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

25 : Graphical induced structured input/output models

Network Biology-part II

BTRY 7210: Topics in Quantitative Genomics and Genetics

Factor Analysis (10/2/13)

Bioinformatics 2 - Lecture 4

Physical network models and multi-source data integration

Modelling Transcriptional Regulation with Gaussian Processes

Identifying Signaling Pathways

Probability models for machine learning. Advanced topics ML4bio 2016 Alan Moses

Latent Variable Methods for the Analysis of Genomic Data

The Monte Carlo Method: Bayesian Networks

A Penalized Regression Model for the Joint Estimation of eqtl Associations and Gene Network Structure

Non-Parametric Bayes

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

PCA and admixture models

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Bayesian Regression (1/31/13)

Discovering molecular pathways from protein interaction and ge

Predicting causal effects in large-scale systems from observational data

CS-E3210 Machine Learning: Basic Principles

Introduction to Probabilistic Machine Learning

Latent Variable Models and EM Algorithm

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling

Gaussian Process Dynamical Models Jack M Wang, David J Fleet, Aaron Hertzmann, NIPS 2005

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Integrated Non-Factorized Variational Inference

Causal Discovery by Computer

Relevance Vector Machines

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Linear Factor Models. Sargur N. Srihari

Variable Selection in Structured High-dimensional Covariate Spaces

Recent Advances in Bayesian Inference Techniques

REVIEW SESSION. Wednesday, September 15 5:30 PM SHANTZ 242 E

What is the expectation maximization algorithm?

Graphical Models and Kernel Methods

Eukaryotic Gene Expression

Markov Chain Monte Carlo Algorithms for Gaussian Processes

Asymptotic distribution of the largest eigenvalue with application to genetic data

Bayesian Sparse Correlated Factor Analysis

BME 5742 Biosystems Modeling and Control

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

TOWARD ROBUST GROUP-WISE EQTL MAPPING VIA INTEGRATING MULTI-DOMAIN HETEROGENEOUS DATA. Wei Cheng. Chapel Hill 2015

Nonparameteric Regression:

Lecture 6: April 19, 2002

Prediction of double gene knockout measurements

CELL BIOLOGY. by the numbers. Ron Milo. Rob Phillips. illustrated by. Nigel Orme

Quantitative Biology Lecture 3

ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4

Inferring Latent Functions with Gaussian Processes in Differential Equations

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

Based on slides by Richard Zemel

Dimension Reduction (PCA, ICA, CCA, FLD,

Efficient Variational Inference in Large-Scale Bayesian Compressed Sensing

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim

GAUSSIAN PROCESS REGRESSION

Grouping of correlated feature vectors using treelets

Latent Variable Models and EM algorithm

Causal inference (with statistical uncertainty) based on invariance: exploiting the power of heterogeneous data

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA

Modelling gene expression dynamics with Gaussian processes

Linear Regression and Discrimination

Introduction to Gaussian Processes

Prokaryotic Gene Expression (Learning Objectives)

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Chris Bishop s PRML Ch. 8: Graphical Models

Gaussian with mean ( µ ) and standard deviation ( σ)

Transcription:

Latent Variable models for GWAs Oliver Stegle Machine Learning and Computational Biology Research Group Max-Planck-Institutes Tübingen, Germany September 2011 O. Stegle Latent variable models for GWAs Tübingen 1

Motivation Why latent variables? Causal influences on phenotypes Genotype Primary variable of interest Known confounding factors Covariates Population structure Unknown (latent) confounders Sample handling Sample history Subtle environmental perturbations Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy 1 yy 2... y N phenotypes SNPs O. Stegle Latent variable models for GWAs Tübingen 2

Motivation Why latent variables? Causal influences on phenotypes Genotype Primary variable of interest Known confounding factors Covariates Population structure Unknown (latent) confounders Sample handling Sample history Subtle environmental perturbations Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy 1 yy 2... y N phenotypes SNPs Covariates Population y O. Stegle Latent variable models for GWAs Tübingen 2

Motivation Why latent variables? Causal influences on phenotypes Genotype Primary variable of interest Known confounding factors Covariates Population structure Unknown (latent) confounders Sample handling Sample history Subtle environmental perturbations Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy 1 yy 2... y N phenotypes SNPs Covariates Population y Confounders y O. Stegle Latent variable models for GWAs Tübingen 2

Outline Outline O. Stegle Latent variable models for GWAs Tübingen 3

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Outline Motivation Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Modeling hidden confounders in GWAs Model Applications Modeling unobserved cellular phenotypes in genetic analyses Model Applications A unifying view Summary O. Stegle Latent variable models for GWAs Tübingen 4

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Manifolds and dimension reduction (from Olivier Grisel, Generated using the Modular Data Processing toolkit and matplotlib.) O. Stegle Latent variable models for GWAs Tübingen 5

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction Map G dimensional data on K dimensional manifold; K << G Y }{{} NxG = }{{} H }{{} W NxK KxG + Ψ }{{} NxG H: latent factors in low-dimensional space W: weights for factors on data dimensions Ψ: noise, ψn,g N (0, σ 2 ). Challenge: neither W nor H known! Depending on assumptions on W and H: Principle component analysis (PCA) Independent component analysis (ICA)... O. Stegle Latent variable models for GWAs Tübingen 6

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction Map G dimensional data on K dimensional manifold; K << G Y }{{} NxG = }{{} H }{{} W NxK KxG + Ψ }{{} NxG H: latent factors in low-dimensional space W: weights for factors on data dimensions Ψ: noise, ψ n,g N (0, σ 2 ). Challenge: neither W nor H known! Depending on assumptions on W and H: Principle component analysis (PCA) Independent component analysis (ICA)... O. Stegle Latent variable models for GWAs Tübingen 6

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction Map G dimensional data on K dimensional manifold; K << G Y }{{} NxG = }{{} H }{{} W NxK KxG + Ψ }{{} NxG H: latent factors in low-dimensional space W: weights for factors on data dimensions Ψ: noise, ψ n,g N (0, σ 2 ). Challenge: neither W nor H known! Depending on assumptions on W and H: Principle component analysis (PCA) Independent component analysis (ICA)... O. Stegle Latent variable models for GWAs Tübingen 6

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction PCA PCA is corresponds to a noise-free version of the model Y }{{} NxG = }{{} H }{{} W NxK KxG PCA components (H) correspond to directions of maximum data variance in the original dataset: Covariance matrix: C = YY T Eigenvalue/Eigen vectors Cvi = λ i v i Projection matrix P = [v1,..., v K ] Principle components Hn = P Y n. O. Stegle Latent variable models for GWAs Tübingen 7

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction PCA PCA is corresponds to a noise-free version of the model Y }{{} NxG = }{{} H }{{} W NxK KxG PCA components (H) correspond to directions of maximum data variance in the original dataset: Covariance matrix: C = YY T Eigenvalue/Eigen vectors Cvi = λ i v i Projection matrix P = [v1,..., v K ] Principle components Hn = P Y n. O. Stegle Latent variable models for GWAs Tübingen 7

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction PCA PCA is corresponds to a noise-free version of the model Y }{{} NxG = }{{} H }{{} W NxK KxG PCA components (H) correspond to directions of maximum data variance in the original dataset: Covariance matrix: C = YY T Eigenvalue/Eigen vectors Cvi = λ i v i Projection matrix P = [v1,..., v K ] Principle components Hn = P Y n. O. Stegle Latent variable models for GWAs Tübingen 7

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction PCA PCA is corresponds to a noise-free version of the model Y }{{} NxG = }{{} H }{{} W NxK KxG PCA components (H) correspond to directions of maximum data variance in the original dataset: Covariance matrix: C = YY T Eigenvalue/Eigen vectors Cvi = λ i v i Projection matrix P = [v1,..., v K ] Principle components Hn = P Y n. O. Stegle Latent variable models for GWAs Tübingen 7

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction PCA PCA is corresponds to a noise-free version of the model Y }{{} NxG = }{{} H }{{} W NxK KxG PCA components (H) correspond to directions of maximum data variance in the original dataset: Covariance matrix: C = YY T Eigenvalue/Eigen vectors Cvi = λ i v i Projection matrix P = [v1,..., v K ] Principle components Hn = P Y n. O. Stegle Latent variable models for GWAs Tübingen 7

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction Bayesian PCA and GPLVM Assumption: data dimensions or sample dimension independent given H and W. Probabilistic PCA GPLVM N ( p(y H, W) = N y n hnw, σ 2 ) I n=1 N ( ) 2 p(h) = N h n 0, σ h I n=1 N ( 2 p(y W) = N y n 0, σ h WWT + σ 2 ) I n=1 [Tipping and Bishop, 1999] G ( p(y H, W) = N y :,g Hwg, σ 2 ) I g=1 G ( ) 2 p(w) = N w :,g 0, σ h I g=1 G p(y H) = N y:,g 0, σ 2 h g=1 } HHT +σ 2 I {{} NxN [Lawrence, 2005] O. Stegle Latent variable models for GWAs Tübingen 8

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Linear dimension reduction Bayesian PCA and GPLVM Assumption: data dimensions or sample dimension independent given H and W. Probabilistic PCA GPLVM N ( p(y H, W) = N y n hnw, σ 2 ) I n=1 N ( ) 2 p(h) = N h n 0, σ h I n=1 N ( 2 p(y W) = N y n 0, σ h WWT + σ 2 ) I n=1 [Tipping and Bishop, 1999] G ( p(y H, W) = N y :,g Hwg, σ 2 ) I g=1 G ( ) 2 p(w) = N w :,g 0, σ h I g=1 G p(y H) = N y:,g 0, σ 2 h g=1 } HHT +σ 2 I {{} NxN [Lawrence, 2005] O. Stegle Latent variable models for GWAs Tübingen 8

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) GPLVM Marginal likelihood p(y H) = G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 Inference of most probable hidden factors H: Ĥ = argmax H log G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 O. Stegle Latent variable models for GWAs Tübingen 9

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) GPLVM Marginal likelihood p(y H) = G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 Inference of most probable hidden factors H: Ĥ = argmax H log G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 O. Stegle Latent variable models for GWAs Tübingen 9

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Are linear relationships sufficient? (from Olivier Grisel, Generated using the Modular Data Processing toolkit and matplotlib.) Non-linear generalizations, introducing general kernel function instead of linear covariance Ĥ = argmax H log G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 O. Stegle Latent variable models for GWAs Tübingen 10

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Are linear relationships sufficient? (from Olivier Grisel, Generated using the Modular Data Processing toolkit and matplotlib.) Non-linear generalizations, introducing general kernel function instead of linear covariance Ĥ = argmax H log G N ( y :,g 0, σh 2 HHT + σ 2 I ) g=1 O. Stegle Latent variable models for GWAs Tübingen 10

Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Are linear relationships sufficient? (from Olivier Grisel, Generated using the Modular Data Processing toolkit and matplotlib.) Non-linear generalizations, introducing general kernel function instead of linear covariance Ĥ = argmax H log G N ( y :,g 0, σh 2 K H,H + σ 2 I ) g=1 O. Stegle Latent variable models for GWAs Tübingen 10

Modeling hidden confounders in GWAs Outline Motivation Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Modeling hidden confounders in GWAs Model Applications Modeling unobserved cellular phenotypes in genetic analyses Model Applications A unifying view Summary O. Stegle Latent variable models for GWAs Tübingen 11

Modeling hidden confounders in GWAs Confounders in eqtl studies Confounders in eqtl studies Experimental procedures Gene regulation (Translation) Standard to take known factors (gender, population structure) into account. It is key to account for hidden factors as well. Sample preparation Sample history Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy1 yy2... yn y phenotypes SNPs Covariates Confounders Population y y O. Stegle Latent variable models for GWAs Tübingen 12

Modeling hidden confounders in GWAs [ Confounders in eqtl studies Confounders in eqtl studies Experimental procedures Gene regulation (Translation) Standard to take known factors (gender, population structure) into account. It is key to account for hidden factors as well. Sample preparation Sample history DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing transcription factors small RNAs O. Stegle Latent variable models for GWAs Tübingen 12

Modeling hidden confounders in GWAs [ Confounders in eqtl studies Confounders in eqtl studies Experimental procedures Gene regulation (Translation) Standard to take known factors (gender, population structure) into account. It is key to account for hidden factors as well. Sample preparation Sample history DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing transcription factors small RNAs O. Stegle Latent variable models for GWAs Tübingen 12

Modeling hidden confounders in GWAs [ Confounders in eqtl studies Confounders in eqtl studies Experimental procedures Gene regulation (Translation) Standard to take known factors (gender, population structure) into account. It is key to account for hidden factors as well. Sample preparation Sample history DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing transcription factors small RNAs O. Stegle Latent variable models for GWAs Tübingen 12

Modeling hidden confounders in GWAs Motivating examples HapMAP II, 3 populations, 90 individuals each, 40K genes, 3 million SNPs (here: small region in chromosome 7) 15 SLC35B4 (a) Standard eqtl LOD 10 5 FPR 0.01% FPR 0.01% (b) VBFA accounting for hidden factors VBFA eqtl LOD 0 15 10 5 SLC35B4 SLC35B4 FPR 0.01% FPR 0.01% 0 1.3354 1.3356 1.3358 1.336 1.3362 1.3364 1.3366 1.3368 1.337 1.3372 1.3374 Position in chr. 7 x 10 8 O. Stegle Latent variable models for GWAs Tübingen 13

Modeling hidden confounders in GWAs Model Association model Direct effects of confounders Start with standard association model. Include (K, few) global hidden factors (confounders) in the model. Factors H = {h :,1,..., h :,K } need to be learned from the expression data. How to control model complexity, i.e. choose the number of factors? Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy1 yy2... yn y phenotypes SNPs genetic {}}{ N y :,g = (θ n,g s n ) + n=1 ψ }{{} noise Covariates Population y O. Stegle Latent variable models for GWAs Tübingen 14

Modeling hidden confounders in GWAs Model Association model Direct effects of confounders Start with standard association model. Include (K, few) global hidden factors (confounders) in the model. Factors H = {h :,1,..., h :,K } need to be learned from the expression data. How to control model complexity, i.e. choose the number of factors? Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy1 yy2... yn y phenotypes SNPs genetic {}}{ N y :,g = (θ n,g s n ) + n=1 K w g,k h k k=1 }{{} hidden factors Covariates Confounders Population y + ψ }{{} noise y O. Stegle Latent variable models for GWAs Tübingen 14

Modeling hidden confounders in GWAs Model Association model Direct effects of confounders Start with standard association model. Include (K, few) global hidden factors (confounders) in the model. Factors H = {h :,1,..., h :,K } need to be learned from the expression data. How to control model complexity, i.e. choose the number of factors? Genome? Phenome individuals ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT yy1 yy2... yn y phenotypes SNPs genetic {}}{ N y :,g = (θ n,g s n ) + n=1 K w g,k h k k=1 }{{} hidden factors Covariates Confounders Population y + ψ }{{} noise y O. Stegle Latent variable models for GWAs Tübingen 14

Modeling hidden confounders in GWAs Model Association model Gaussian process formulation Data likelihood of the linear generative model assuming Gaussian noise ( ) G S K p(y X, H, θ, W, Θ K ) = N y :,g x s θ s + w g,k h k, σei 2 g=1 s=1 Specify Gaussian priors on the factor weight P (w g,k ) = N ( w g,k 0, σh 2 ) Integrating out the weights yields a Gaussian process marginal likelihood, factorizing over genes given H ( G S ) K P (Y X, H, Θ K ) = N y :,g x s θ s, σh 2 h k h T k + σei 2 g=1 s=1 k=1 k=1 O. Stegle Latent variable models for GWAs Tübingen 15

Modeling hidden confounders in GWAs Model Association model Gaussian process formulation Data likelihood of the linear generative model assuming Gaussian noise ( ) G S K p(y X, H, θ, W, Θ K ) = N y :,g x s θ s + w g,k h k, σei 2 g=1 s=1 Specify Gaussian priors on the factor weight P (w g,k ) = N ( w g,k 0, σh 2 ) Integrating out the weights yields a Gaussian process marginal likelihood, factorizing over genes given H ( G S ) K P (Y X, H, Θ K ) = N y :,g x s θ s, σh 2 h k h T k + σei 2 g=1 s=1 k=1 k=1 O. Stegle Latent variable models for GWAs Tübingen 15

Modeling hidden confounders in GWAs Model Association model Gaussian process formulation Data likelihood of the linear generative model assuming Gaussian noise ( ) G S K p(y X, H, θ, W, Θ K ) = N y :,g x s θ s + w g,k h k, σei 2 g=1 s=1 Specify Gaussian priors on the factor weight P (w g,k ) = N ( w g,k 0, σh 2 ) Integrating out the weights yields a Gaussian process marginal likelihood, factorizing over genes given H ( G S ) K P (Y X, H, Θ K ) = N y :,g x s θ s, σh 2 h k h T k + σei 2 g=1 s=1 k=1 k=1 O. Stegle Latent variable models for GWAs Tübingen 15

Modeling hidden confounders in GWAs Model Association model Including population structure P (Y X, H, Θ K ) = G N y :,g g=1 S K x s θ s, σh 2 h k h T k + σ2 ei k=1 }{{} NxN s=1 Including additional covariance term for population structure. Association test for single SNP. Challenge: Refitting σ g, σ h, H for every SNP. O. Stegle Latent variable models for GWAs Tübingen 16

Modeling hidden confounders in GWAs Model Association model Including population structure G P (Y X, H, Θ K ) = N g=1 y :,g S K x s θ s, σh 2 h k h T k + σ gxx T + σei 2 k=1 }{{} NxN s=1 Including additional covariance term for population structure. Association test for single SNP. Challenge: Refitting σ g, σ h, H for every SNP. O. Stegle Latent variable models for GWAs Tübingen 16

Modeling hidden confounders in GWAs Model Association model Including population structure P (Y X, H, Θ K ) = G N y K :,g x i θ i, σh 2 h k h T k + σ gxx T + σ 2 ei k=1 }{{} NxN g=1 Including additional covariance term for population structure. Association test for single SNP. Challenge: Refitting σ g, σ h, H for every SNP. O. Stegle Latent variable models for GWAs Tübingen 16

Modeling hidden confounders in GWAs Model Association model Including population structure P (Y X, H, Θ K ) = G N y K :,g x i θ i, σh 2 h k h T k + σ gxx T + σ 2 ei k=1 }{{} NxN g=1 Including additional covariance term for population structure. Association test for single SNP. Challenge: Refitting σ g, σ h, H for every SNP. O. Stegle Latent variable models for GWAs Tübingen 16

Modeling hidden confounders in GWAs Model Association model Fit parameter once on null model Ĥ, ˆσ g, σˆ h, ˆσ e = argmax P (Y X, H, Θ K ) Ĥ, ˆσ e, σˆ g, ˆσ e Association testing for fixed hyperparameters ( ) p(y :,g x i, H, Θ K ) = N y :,g xi θ i, α 2 ( ˆσ e ĤĤT + ˆσ g XX T ) + σei 2 O. Stegle Latent variable models for GWAs Tübingen 17

Modeling hidden confounders in GWAs Model Association model Fit parameter once on null model Ĥ, ˆσ g, σˆ h, ˆσ e = argmax P (Y X, H, Θ K ) Ĥ, ˆσ e, σˆ g, ˆσ e Association testing for fixed hyperparameters ( ) p(y :,g x i, H, Θ K ) = N y :,g xi θ i, α 2 ( ˆσ e ĤĤT + ˆσ g XX T ) + σei 2 O. Stegle Latent variable models for GWAs Tübingen 17

Modeling hidden confounders in GWAs Applications Evaluation on real data CIS gene: YJL213W Promoter S ATGACCTGAAACTGGGCGGATGACGTGGAACGGTATGACCTGAAACTGGGGGACTGACGTGGAACGGTATGACCTGAAACT TRANS gene: YJL213W S Promoter ATGA...ACTGGGCGGATGACGTGGAACGGTATGACCTGAAACTGGGGGACTGACGTGGAACGGTATGACCTGAAACT O. Stegle Latent variable models for GWAs Tübingen 18

Modeling hidden confounders in GWAs Applications Evaluation on real data CIS gene: YJL213W Promoter S ATGACCTGAAACTGGGCGGATGACGTGGAACGGTATGACCTGAAACTGGGGGACTGACGTGGAACGGTATGACCTGAAACT (a) Yeast TRANS S cis eqtls gene: YJL213W Promoter ATGA...ACTGGGCGGATGACGTGGAACGGTATGACCTGAAACTGGGGGACTGACGTGGAACGGTATGACCTGAAACT (b) Yeast O. Stegle Latent variable models for GWAs trans eqtls Tu bingen 18

Modeling hidden confounders in GWAs Applications Evaluation on real data Application to HapMap II expression analysis O. Stegle Latent variable models for GWAs Tübingen 19

Modeling hidden confounders in GWAs Applications Evaluation on real data Application to HapMap II expression analysis (a) Probes with a VBeQTL in pooled population O. Stegle (f) cis VBeQTL location and strength relative to gene start Latent variable models for GWAs Tu bingen 19

Modeling unobserved cellular phenotypes in genetic analyses Outline Motivation Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Modeling hidden confounders in GWAs Model Applications Modeling unobserved cellular phenotypes in genetic analyses Model Applications A unifying view Summary O. Stegle Latent variable models for GWAs Tübingen 20

Modeling unobserved cellular phenotypes in genetic analyses Model [ Association studies Regulatory and external factors Confounding factors Accounting for hidden confounders genetic {}}{ ( ) y g,j = b n,g θn,gsn,j + v gf j + w gx j + ψ n,g. }{{} }{{} }{{} known factors hidden factors noise Accounting for regulatory factors? DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing transcription factors small RNAs O. Stegle Latent variable models for GWAs Tübingen 21

Modeling unobserved cellular phenotypes in genetic analyses Model [ Association studies Regulatory and external factors Confounding factors Accounting for hidden confounders genetic {}}{ ( ) y g,j = b n,g θn,gsn,j + v gf j + w gx j + ψ n,g. }{{} }{{} }{{} known factors hidden factors noise Accounting for regulatory factors? DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing transcription factors small RNAs O. Stegle Latent variable models for GWAs Tübingen 21

Modeling unobserved cellular phenotypes in genetic analyses Model [ Association studies Regulatory factors Account for regulatory factors: Transcription factors Pathway components Hypothesis: intermediate cellular factors mediate the observed association signals of target genes. Measuring X? Difficult and expensive. Learn the unobserved factors X. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs [Parts et al., 2011] O. Stegle Latent variable models for GWAs Tübingen 22

Modeling unobserved cellular phenotypes in genetic analyses Model [ Association studies Regulatory factors Account for regulatory factors: Transcription factors Pathway components Hypothesis: intermediate cellular factors mediate the observed association signals of target genes. Measuring X? Difficult and expensive. Learn the unobserved factors X. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs [Parts et al., 2011] O. Stegle Latent variable models for GWAs Tübingen 22

Modeling unobserved cellular phenotypes in genetic analyses Model [ Association studies Regulatory factors Account for regulatory factors: Transcription factors Pathway components Hypothesis: intermediate cellular factors mediate the observed association signals of target genes. Measuring X? Difficult and expensive. Learn the unobserved factors X. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs [Parts et al., 2011] O. Stegle Latent variable models for GWAs Tübingen 22

Modeling unobserved cellular phenotypes in genetic analyses Model Association studies Transcription factor effects Inference of regulatory factors using a bilinear model Y }{{} Expr. = W }{{} Weights X }{{} Factors + Ψ }{{} Noise. DNA SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT W is sparse; each factor regulates only a subset of all genes. Incorporate biological prior knowledge to that factor are interpretable: Transcription factor binding affinities. Inferred factors summarize co-expression clusters. transcription mrna [ translation proteins [ observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs O. Stegle Latent variable models for GWAs Tübingen 23

Modeling unobserved cellular phenotypes in genetic analyses Model Association studies Transcription factor effects Inference of regulatory factors using a bilinear model Y }{{} Expr. = W }{{} Weights X }{{} Factors + Ψ }{{} Noise. W is sparse; each factor regulates only a subset of all genes. Incorporate biological prior knowledge to that factor are interpretable: Transcription factor binding affinities. Inferred factors summarize co-expression clusters. TF: PHO4 Promoter gene: YJL213W ATGACCTGAAACTGGGCGGATGACGTGGAACGGTATGACCTGAAACTGGGGGACTGACGTGGAACGGTATGACCTGAAACT O. Stegle Latent variable models for GWAs Tübingen 23

Modeling unobserved cellular phenotypes in genetic analyses Model Association studies Transcription factor effects Inference of regulatory factors using a bilinear model Y }{{} Expr. = W }{{} Weights X }{{} Factors + Ψ }{{} Noise. Gene expression W is sparse; each factor regulates only a subset of all genes. Incorporate biological prior knowledge to that factor are interpretable: Transcription factor binding affinities. Inferred factors summarize co-expression clusters. PHO4 Known targets from Yeastract Segregants (218) PHO4 factor activation Segregants (218) O. Stegle Latent variable models for GWAs Tübingen 23 Genes (5493)

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Probabilistic model Graphical model for Y = W X + Ψ. Indicators z g,k determine the sparsity pattern: P (w g,k z g,k = 0) = N ( w g,k 0, σ 2 0 ) P (w g,k z g,k = 1) = N ( w g,k 0, σ 2 1 ). Prior knowledge is encoded in π g,k = P (z g,k = 1). Standard conjugate priors for the remaining random variables. O. Stegle Latent variable models for GWAs Tübingen 24

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Probabilistic model Graphical model for Y = W X + Ψ. Indicators z g,k determine the sparsity pattern: P (w g,k z g,k = 0) = N ( w g,k 0, σ 2 0 ) P (w g,k z g,k = 1) = N ( w g,k 0, σ 2 1 ). Prior knowledge is encoded in π g,k = P (z g,k = 1). Standard conjugate priors for the remaining random variables. O. Stegle Latent variable models for GWAs Tübingen 24

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Probabilistic model Graphical model for Y = W X + Ψ. Indicators z g,k determine the sparsity pattern: P (w g,k z g,k = 0) = N ( w g,k 0, σ 2 0 ) P (w g,k z g,k = 1) = N ( w g,k 0, σ 2 1 ). Prior knowledge is encoded in π g,k = P (z g,k = 1). Standard conjugate priors for the remaining random variables. O. Stegle Latent variable models for GWAs Tübingen 24

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Inference Scale is challenging: thousands of genes, hundreds of TFs. Efficient deterministic approximate inference: 1. Variational Bayesian inference for the core model. 2. Expectation Propagation for the sparsity submodel. O. Stegle Latent variable models for GWAs Tübingen 25

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Inference Scale is challenging: thousands of genes, hundreds of TFs. Efficient deterministic approximate inference: 1. Variational Bayesian inference for the core model. 2. Expectation Propagation for the sparsity submodel. O. Stegle Latent variable models for GWAs Tübingen 25

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Comparison to MCMC on (small!) simulated dataset Recovery of the true simulated network structure. O. Stegle Latent variable models for GWAs Tübingen 26

Modeling unobserved cellular phenotypes in genetic analyses Model Sparse factor analysis Comparison to MCMC on (small!) simulated dataset Recovery of the true simulated factor activations. O. Stegle Latent variable models for GWAs Tübingen 26

Modeling unobserved cellular phenotypes in genetic analyses Applications Application to yeast Dataset Applied the approach to 108 yeast strains (crosses), genotyped and expression profiled in 2 conditions (ethanol and glucose). Alternative types of prior information available TF binding affinities (Yeastract). Pathway information (KEGG). [Smith and Kruglyak, 2008] O. Stegle Latent variable models for GWAs Tübingen 27

Modeling unobserved cellular phenotypes in genetic analyses Applications [ Application to yeast Factor associations Biological hypotheses Genetic variation (SNPs) may regulate factor activations. Similarly for the environment condition (glucose/ethanol). Interaction effects between SNPs, factors and genes. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs O. Stegle Latent variable models for GWAs Tübingen 28

Modeling unobserved cellular phenotypes in genetic analyses Applications [ Application to yeast Factor associations Biological hypotheses Genetic variation (SNPs) may regulate factor activations. Similarly for the environment condition (glucose/ethanol). Interaction effects between SNPs, factors and genes. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs O. Stegle Latent variable models for GWAs Tübingen 28

Modeling unobserved cellular phenotypes in genetic analyses Applications [ Application to yeast Factor associations Biological hypotheses Genetic variation (SNPs) may regulate factor activations. Similarly for the environment condition (glucose/ethanol). Interaction effects between SNPs, factors and genes. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs O. Stegle Latent variable models for GWAs Tübingen 28

Modeling unobserved cellular phenotypes in genetic analyses Applications [ Application to yeast Factor associations Biological hypotheses Genetic variation (SNPs) may regulate factor activations. Similarly for the environment condition (glucose/ethanol). Interaction effects between SNPs, factors and genes. DNA transcription mrna [ translation proteins SNPs ATGACCTGAAACTGGGGGACTGACGTGGAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGCAACTGGGGGACTGACGTGCAACGGT ATGACCTGAAACTGGGGGATTGACGTGGAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT ATGACCTGCAACTGGGGGATTGACGTGCAACGGT observed & hidden external influenes experimental procedures environment gene regulation alternative splicing small RNAs O. Stegle Latent variable models for GWAs Tübingen 28

Modeling unobserved cellular phenotypes in genetic analyses Applications Application to yeast Factor associations Genome-wide association density. # of associations 10 1 Yeastract KEGG Freeform Trans genes # of associations 100 101 I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI O. Stegle Latent Chromosomal variable models for position GWAs Tübingen 29

Modeling unobserved Growth cellular condition phenotypes in genetic analyses Application to yeast Factor interactions Segregating loci (2956) Genotype... Segregants (218)... 20 Applications Interaction tests between all gene/snp/factor triplets. Segregant count 15 10 5 0 1.0 0.5 0.0 0.5 PHO4 activation (c) PHO4 factor activation Genotype-factor interaction Segregating loci (2956) Genes (5493) Genotype... Gene expression... Segregants (218)...... YJL213W expression 4 3 2 1 0 1 2 3 4 Interaction LOD = 35.0 (PHO84 locus) BY Glu RM Glu BY Eth RM Eth 1.0 0.5 0.0 0.5 PHO4 activation O. Stegle Latent variable models for GWAs Tübingen 30

Modeling unobserved cellular phenotypes in genetic analyses Applications Application to yeast Factor interactions Example interactions. (a) YAP1-IRA2 interaction (b) PHO4-PHO84 interaction (c) MAT-YOX1 interaction O. Stegle Latent variable models for GWAs Tübingen 30

Modeling unobserved cellular phenotypes in genetic analyses Applications Application to yeast Factor interactions Genome-wide interaction density. # of interacting genes 1500 OAF1 1000 500 100 AMN1 Environment Yeastract KEGG Freeform PHO84 HAP1 I II III IV V VI VII VIII IX X XI XII Chromosomal position XIII XIV XV XVI IRA2 MKT1 CIN5 O. Stegle Latent variable models for GWAs Tübingen 30

A unifying view Outline Motivation Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Modeling hidden confounders in GWAs Model Applications Modeling unobserved cellular phenotypes in genetic analyses Model Applications A unifying view Summary O. Stegle Latent variable models for GWAs Tübingen 31

A unifying view Matrix variate normal distributions S Confounders induces structure between samples. Gene regulation induces structure between genes. Matrix variate models for joint correction. samples 1 genes confounders hidden known 4 3 2 O. Stegle Latent variable models for GWAs Tübingen 32

A unifying view Matrix variate normal distributions Confounders induces structure between samples. Gene regulation induces structure between genes. Matrix variate models for joint correction. samples S genes O. Stegle Latent variable models for GWAs Tübingen 32

A unifying view Matrix variate normal distributions Confounders induces structure between samples. Gene regulation induces structure between genes. Matrix variate models for joint correction. O. Stegle Latent variable models for GWAs Tübingen 32

A unifying view Matrix variate normal distributions S samples genes Row covariances (samples) K Column covariances (genes) Λ 1 O. Stegle Latent variable models for GWAs Tübingen 33

A unifying view Matrix variate normal distributions S samples genes Row covariances (samples) K 0 sampless 20 40 60 Column covariances (genes) Λ 1 80 100 0 20 40 60 80 100 sampless O. Stegle Latent variable models for GWAs Tübingen 33

A unifying view Matrix variate normal distributions S samples genes Row covariances (samples) K sampless 0 20 40 60 Column covariances (genes) Λ 1 genes 0 10 20 30 80 40 100 50 0 20 40 60 80 100 sampless 0 10 20 30 40 50 O. Stegle Latent variable models for GWAs genes Tübingen 33

A unifying view Matrix variate normal distributions S samples genes Row covariances (samples) K sampless 0 20 40 60 Column covariances (genes) Λ 1 genes 0 10 20 30 80 40 100 50 0 20 40 60 80 100 sampless 0 10 20 30 40 50 O. Stegle Latent variable models for GWAs genes Tübingen 33

A unifying view Matrix variate distributions Matrix variate distribution with column covariance Λ 1 and row covariance K p(y µ = 0, Λ 1, K) exp ( 0.5tr [ ΛY T K 1 Y ]) Equivalent to Kronecker covariance structure on vectorized data p(vecy µ = 0, Λ 1, K) = N ( vecy 0, Λ 1 K ) O. Stegle Latent variable models for GWAs Tübingen 34

A unifying view Matrix variate distributions Matrix variate distribution with column covariance Λ 1 and row covariance K p(y µ = 0, Λ 1, K) exp ( 0.5tr [ ΛY T K 1 Y ]) Equivalent to Kronecker covariance structure on vectorized data p(vecy µ = 0, Λ 1, K) = N ( vecy 0, Λ 1 K ) Drawing samples from a MN Kronecker matrix product 1. sample R Y r=1 g=1 G N (0, 1) 2. Y = chol(k) Y 3. Y = Y chol(λ 1 ) T a 11 B a 12 B... a 1H B a 21 B a 22 B... a 2H B A B =.......... a G1 B a G2 B... a GH B O. Stegle Latent variable models for GWAs Tübingen 34

A unifying view 38 37 39 36 40 35 41 34 42 33 43 32 44 31 45 30 46 29 47 28 48 27 49 26 0 25 1 24 2 23 3 22 4 21 5 20 6 19 7 18 8 17 9 16 10 15 11 14 12 13 Simulated example better recovery of the true graph Recovery of simulated network. Weak confounding variation (20% variance in row covariance). Precision 1.0 0.8 0.6 0.4 GLasso Kronecker GLasso Ideal GLasso 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall ground truth GLasso O. Stegle Latent variable models for GWAs Tübingen 35

A unifying view 38 37 39 36 40 35 41 34 42 33 43 32 44 31 45 30 46 29 47 28 48 27 49 26 0 25 1 24 2 23 3 22 4 21 5 20 6 19 7 18 8 17 9 16 10 15 11 14 12 13 38 37 38 37 39 36 39 36 40 35 40 35 41 34 41 34 42 33 42 33 43 32 43 32 44 31 44 31 45 30 45 30 46 29 46 29 47 28 47 28 48 27 48 27 49 26 49 26 0 25 0 25 1 24 1 24 2 23 2 23 3 22 3 22 4 21 4 21 5 20 5 20 6 19 6 19 7 18 7 18 8 17 8 17 9 16 9 16 10 15 10 15 11 14 11 14 12 13 12 13 Simulated example better recovery of the true graph Recovery of simulated network. Weak confounding variation (20% variance in row covariance). Precision 1.0 0.8 0.6 0.4 GLasso Kronecker GLasso Ideal GLasso ground truth GLasso 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 Recall Kornecker GLasso Ideal GLasso O. Stegle Latent variable models for GWAs Tübingen 35

Summary Outline Motivation Dimension reduction and the Gaussian Process Latent Variable Model (GPLVM) Modeling hidden confounders in GWAs Model Applications Modeling unobserved cellular phenotypes in genetic analyses Model Applications A unifying view Summary O. Stegle Latent variable models for GWAs Tübingen 36

Summary Summary Latent variables can have a dramatic effect in GWAs. In expression studies, confounders can be estimated from data. Cellular features can be learnt from the expression profiles. Duality of regulatory relationships and confounders in matrix variate normal models. O. Stegle Latent variable models for GWAs Tübingen 37

Summary References I N. Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783 1816, 2005. L. Parts, O. Stegle, J. Winn, and R. Durbin. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS genetics, 7(1):e1001276, 2011. E. Smith and L. Kruglyak. Gene environment interaction in yeast gene expression. PLoS biology, 6(4):e83, 2008. M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society. Series B, Statistical Methodology, pages 611 622, 1999. O. Stegle Latent variable models for GWAs Tübingen 38