Latent Variable Methods for the Analysis of Genomic Data

1 Latent Variable Methods for the Analysis of Genomic Data
John D. Storey
Center for Statistics and Machine Learning & Lewis-Sigler Institute for Integrative Genomics

2 Data
m variables and n observations.
High-dimensional variables (variables × observations): gene expression; SNP genotypes.
Primary variables (variables × observations): intercept + treatment; intercept + clinical outcome; intercept + trait.

3 High-dimensional regression model
$Y = BX + E$
The goal is to do inference on each row of $B$, which connects variation in $X$ to each row of $Y$.

4 Example steps for simultaneous inference
1. Formulate the model $Y = BX + E$, which has null version $Y = B_0 X + E$.
2. Fit the models to obtain estimates of $B$ and $B_0$.
3. Compare the goodness of fit of the two estimates for each genomic variable to obtain $m$ test statistics.
4. Calculate a null distribution to obtain $m$ p-values.
5. Perform some type of multiple testing correction (such as FDR or local FDR).
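The five steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the speaker's implementation: the null model is assumed to be intercept-only, the null distribution is obtained by permuting the columns of $X$ and pooling statistics across variables, and the correction is the Benjamini-Hochberg procedure. All function names (`rss_per_row`, `f_statistics`, `permutation_pvalues`, `benjamini_hochberg`) are hypothetical.

```python
import numpy as np

def rss_per_row(Y, X):
    """Residual sum of squares for each row of Y after a least-squares fit of Y = B X + E."""
    # X is d x n and Y is m x n, so solve X^T B^T ~ Y^T for B^T (d x m).
    B_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    resid = Y - B_T.T @ X
    return np.sum(resid ** 2, axis=1)

def f_statistics(Y, X, X0):
    """Step 3: one F-statistic per row comparing the full fit (X) to the null fit (X0)."""
    n = Y.shape[1]
    d, d0 = X.shape[0], X0.shape[0]
    rss1, rss0 = rss_per_row(Y, X), rss_per_row(Y, X0)
    return ((rss0 - rss1) / (d - d0)) / (rss1 / (n - d))

def permutation_pvalues(Y, X, X0, n_perm=100, seed=0):
    """Step 4: pool permutation statistics over all rows as the null distribution.

    Assumption: X0 is intercept-only, so permuting the columns of X leaves X0 valid.
    """
    rng = np.random.default_rng(seed)
    observed = f_statistics(Y, X, X0)
    null = np.concatenate([
        f_statistics(Y, X[:, rng.permutation(Y.shape[1])], X0)
        for _ in range(n_perm)
    ])
    null.sort()
    # Number of null statistics greater than or equal to each observed statistic.
    n_at_least = null.size - np.searchsorted(null, observed, side="left")
    return (n_at_least + 1) / (null.size + 1)

def benjamini_hochberg(p):
    """Step 5: Benjamini-Hochberg adjusted p-values."""
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

Here `X` would hold an intercept row plus the primary variable, `X0` the intercept row alone, and `Y` the $m \times n$ data matrix. Pooling the permutation statistics across variables is a convenience that assumes the null statistics share a common distribution.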

5 Incorporating latent variables
The five steps above still apply: formulate the model and its null version, fit both, compare goodness of fit for each genomic variable, calculate a null distribution to obtain $m$ p-values, and perform a multiple testing correction.
We have been developing models, theory, and methods that replace $Y = BX + E$ with $Y = BX + \Phi M + E$, where $M$ are low-dimensional latent variables.

6 Known variable + latent variable model
$g(E[Y \mid X, M]) = BX + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$M$ accounts for dependence, unknown sources of systematic variation, structure, etc.

7 Gene expression studies
$g(E[Y \mid X, M]) = BX + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$Y_{m \times n}$: gene expression data
$X_{d \times n}$: study design matrix
$M$: batch effects, biological heterogeneity, etc.
$Y \mid X, M \sim$ Normal, Poisson, or NegBin

8 Genome-wide association studies
$g(E[X \mid Y, M]) = BY + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$X_{m \times n}$: SNP genotypes
$Y_{1 \times n}$: trait variable
$M$: population structure
$X \mid Y, M \sim$ Binomial

9 Surrogate variable analysis (SVA)
Introduced in Leek and Storey (2007) PLoS Genetics.
$E[Y \mid X, M] = BX + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$Y_{m \times n} = (y_{ij})$: response variables, where $y_{ij} \in \mathbb{R}$
$X_{d \times n}$: study design matrix
$Y \mid X, M \sim$ Normal (not necessary, but helpful)

10 Generalized surrogate variable analysis (gSVA)
$g(E[Y \mid X, M]) = BX + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$Y_{m \times n}$: response variables
$X_{d \times n}$: study design matrix
$Y \mid X, M \sim$ exponential family distribution
$g(\cdot)$: link function

11 Part 1. Surrogate variable analysis (SVA) and cross-dimensional inference (CDI)
Leek JT and Storey JD. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3: e161.
Leek JT and Storey JD. (2008) A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105:
Desai KH and Storey JD. (2012) Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association, 107(497):
Storey JD, Desai KH, and Chen X. (2016) Empirical Bayes inference of dependent high-dimensional data. Submitted.

12 Data
High-dimensional variables (variables × observations): $X_{m \times n} = (x_{ij})$, $x_{ij} \in \mathbb{R}$, gene expression.
Primary variables (variables × observations): $Y_{d \times n}$, study design.

13 Motivating study: Inflammation and the Host Response to Injury
Patients: Patient 1, Patient 2, ..., Patient 168
MOF: multiple organ failure score
Leukocyte mRNA expression
Clinical data: ~400 clinical variables

14 Poor replication
[Figure: p-value histograms (frequency against p-value) for Phase 1, Phase 2, Phase 3, and Phase 4.]

15 Accounting for latent variables
[Figure: p-value histograms (frequency against p-value) for Phase 1 through Phase 4, with a second set of panels labeled Surrogate Variable Analysis.]

16 SVA model
$E[Y \mid X, M] = BX + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$Y_{m \times n}$: gene expression data
$X_{d \times n}$: study design matrix
$Y \mid X, M \sim$ Normal
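To make the role of $M$ concrete, here is a deliberately simplified sketch. It is not the SVA algorithm of Leek and Storey (2007); it only takes the top right singular vectors of the least-squares residuals as stand-ins for $M$ and refits with them added to the design. The number of latent variables `r` is assumed known, and every function name is hypothetical.

```python
import numpy as np

def fit_rows(Y, X):
    """Least-squares estimate of B (m x d) in Y = B X + E, one row of B per row of Y."""
    B_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return B_T.T

def residual_surrogates(Y, X, r):
    """Top r right singular vectors of the residual matrix, used as stand-ins for M (r x n)."""
    R = Y - fit_rows(Y, X) @ X
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:r]

def fit_with_surrogates(Y, X, r):
    """Refit with the design augmented by the estimated latent variables."""
    M_hat = residual_surrogates(Y, X, r)
    X_aug = np.vstack([X, M_hat])
    B_aug = fit_rows(Y, X_aug)
    # The first d columns are the coefficients on X; the rest estimate Phi.
    return B_aug[:, :X.shape[0]], M_hat

def residual_ss(Y, X):
    """Total residual sum of squares of the least-squares fit."""
    return float(np.sum((Y - fit_rows(Y, X) @ X) ** 2))
```

One design consequence is worth noting. Because the residuals are orthogonal to the rows of $X$, these naive surrogates leave the least-squares estimate of $B$ unchanged and only shrink the residual variance. Any part of $M$ that is correlated with $X$ is invisible to this sketch.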

17 Normal model
$Y = BX + E$, with $e_j \overset{\text{iid}}{\sim} N(0, \Sigma)$

18 SVA and Cross Dimensional Inference (CDI)
- Leek and Storey (2007) introduce the SVA model, motivated as a way to capture unmodeled systematic variation.
- Leek and Storey (2008) show that the SVA model also captures general forms of dependence among variables.
- Desai and Storey (2012) study the SVA model under the Normal distribution and derive a penalized maximum likelihood approach for estimating it, called cross-dimensional inference (CDI).
- Storey, Desai, and Chen (2016) show that CDI applied to the SVA model yields the Bayes estimator under the multivariate Normal scenario with a Normal prior on $B$.
- Storey, Desai, and Chen (2016) provide an empirical Bayes CDI approach for estimating the SVA model.

19 Part 2. Genotype-conditional association test
Song M, Hao W, and Storey JD. (2015) Testing for genetic associations in arbitrarily structured populations. Nature Genetics, 47:
Hao W, Song M, and Storey JD. (2016) Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics, 32:
Gopalan P, Hao W, Blei DM, and Storey JD. (2016) Scaling probabilistic models of genetic variation to millions of humans. Nature Genetics, in press.

20 Data
High-dimensional variables (variables × observations): $X_{m \times n} = (x_{ij})$, $x_{ij} \in \{0, 1, 2\}$, SNP genotypes.
Primary variables (variables × observations): $y = (y_1, y_2, \ldots, y_n)$, trait of interest.

21 Trait models
Quantitative trait: $y_j = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j + \epsilon_j$
Binary trait: $\log\left( \frac{\Pr(y_j = 1)}{\Pr(y_j = 0)} \right) = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j$
$\beta_i$ is the genetic effect of SNP $i$ on the trait.
$\lambda_j$ is the random non-genetic effect.
$\epsilon_j \sim N(0, \sigma_j^2)$ is the random noise variation.

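22 Confounding in GWAS
[Figure, panel a: a latent variable $z$ (structure, lifestyle, environment) points to each SNP $x_1, x_2, \ldots, x_i, \ldots, x_m$ through $\pi_1(z), \pi_2(z), \ldots, \pi_i(z), \ldots, \pi_m(z)$, and to the trait $y$ through the non-genetic effect $\lambda(z)$.]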

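23 Confounding in GWAS
Quantitative trait: $y_j = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j + \epsilon_j$
Binary trait: $\log\left( \frac{\Pr(y_j = 1)}{\Pr(y_j = 0)} \right) = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j$
$x_j = (x_{1j}, \ldots, x_{mj})^T$, $\lambda_j$, and $\sigma_j^2$ [from $\epsilon_j \sim N(0, \sigma_j^2)$] may all be functions of $z_j$.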
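A small simulation illustrates why this dependence on $z_j$ matters. It is a sketch under assumptions that are not from the slides: $z_j$ is a binary subpopulation label, allele frequencies $\pi_i(z)$ are drawn uniformly on $[0.1, 0.9]$ for each subpopulation, $\lambda_j = z_j$, and every $\beta_i = 0$. The statistic is $n$ times the squared (partial) correlation between each SNP and the trait, which is roughly $\chi^2_1$ with mean 1 when there is no confounding. Function names are hypothetical.

```python
import numpy as np

def simulate_confounded_gwas(m=2000, n=400, seed=0):
    """Simulate genotypes and a quantitative trait that share a latent variable z but no genetic effect."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                # latent subpopulation label z_j
    freqs = rng.uniform(0.1, 0.9, size=(m, 2))    # pi_i(z) for z = 0 and z = 1
    pi = freqs[:, z]                              # m x n matrix of pi_ij
    X = rng.binomial(2, pi)                       # x_ij | pi_ij ~ Binomial(2, pi_ij)
    y = 1.0 * z + rng.normal(size=n)              # lambda_j = z_j, all beta_i = 0
    return X, y, z

def association_stats(X, y, covariates=None):
    """n times the squared correlation of each SNP with y after regressing out an intercept and covariates."""
    n = y.size
    C = np.ones((n, 1)) if covariates is None else np.column_stack([np.ones(n), covariates])
    Q, _ = np.linalg.qr(C)                        # orthonormal basis for the adjustment variables
    y_res = y - Q @ (Q.T @ y)
    X_res = X - (X @ Q) @ Q.T
    numerator = (X_res @ y_res) ** 2
    denominator = np.sum(X_res ** 2, axis=1) * np.sum(y_res ** 2)
    return n * numerator / denominator
```

Comparing `association_stats(X, y).mean()` with `association_stats(X, y, covariates=z).mean()` shows the inflation from ignoring $z$. In practice $z$ is unobserved, which is what motivates the latent variable approach.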

24 Genotype conditional association test (GCAT)
Perform a hypothesis test of $b_i = 0$ vs. $b_i \neq 0$ in the following model:
$\text{logit}\left( \frac{E[x_{ij} \mid y_j, z_j]}{2} \right) = a_i + b_i y_j + \text{logit}(\pi_{ij})$
1. Formulate and estimate a linear basis for $\text{logit}(\pi_{ij})$, called logistic factors.
2. Fit a logistic regression of SNP $i$ ~ trait + logistic factors.
3. Perform a generalized likelihood ratio test of the trait's coefficient.
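Steps 2 and 3 can be sketched for a single SNP as follows. This is an illustration under assumptions, not the authors' implementation: the logistic factors are taken as a given input `L` (step 1, estimating them, is not shown), the Binomial(2, p) logistic regression is fit by plain Newton/IRLS iterations without step-size safeguards, and the p-value uses the $\chi^2_1$ approximation to the likelihood ratio statistic. The function names are hypothetical.

```python
import math
import numpy as np

def fit_binomial2_logit(x, D, n_iter=50, tol=1e-10):
    """Fit x_j ~ Binomial(2, p_j) with logit(p_j) = D[j] @ coef; return (coef, log-likelihood).

    The log-likelihood omits the binomial coefficient, which cancels in likelihood ratios.
    """
    coef = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(D @ coef)))
        weights = 2.0 * p * (1.0 - p)
        gradient = D.T @ (x - 2.0 * p)
        hessian = D.T @ (D * weights[:, None])
        step = np.linalg.solve(hessian, gradient)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(D @ coef))), 1e-12, 1.0 - 1e-12)
    loglik = float(np.sum(x * np.log(p) + (2 - x) * np.log(1.0 - p)))
    return coef, loglik

def gcat_pvalue(x, y, L):
    """Likelihood ratio test of the trait coefficient b_i for one SNP.

    x: genotypes in {0, 1, 2} (length n); y: trait (length n);
    L: logistic factors, shape (n,) or (n, k), assumed to be supplied by the caller.
    """
    n = x.size
    D_null = np.column_stack([np.ones(n), L])     # intercept + logistic factors
    D_full = np.column_stack([D_null, y])         # ... + trait
    _, ll_null = fit_binomial2_logit(x, D_null)
    _, ll_full = fit_binomial2_logit(x, D_full)
    deviance = max(2.0 * (ll_full - ll_null), 0.0)
    # Survival function of a chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(deviance / 2.0))
```

The genotype is the response and the trait is a covariate, which mirrors the model on this slide: the roles of SNP and trait are reversed relative to a standard trait regression.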

25 Theorem
Quantitative trait: $y_j = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j + \epsilon_j$
Binary trait: $\log\left( \frac{\Pr(y_j = 1)}{\Pr(y_j = 0)} \right) = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \lambda_j$
GC model: $\text{logit}\left( \frac{E[x_{ij} \mid y_j, z_j]}{2} \right) = a_i + b_i y_j + \text{logit}(\pi_{ij})$
Suppose that trait values $y_j$ are distributed according to either of the above trait models and SNPs $x_{ij} \mid \pi_{ij} \sim \text{Binomial}(2, \pi_{ij})$ as described earlier. Then $\beta_i = 0$ in either trait model implies that $b_i = 0$ in the GC model.

26 Latent variable model
$\text{logit}(E[X \mid Y, M] / 2) = BY + \Phi M$
Legend: observed; unobserved; unobserved, inference object.
$X \mid Y, M \sim$ Binomial

27 [Figure: observed false positives against expected false positives for Oracle, GCAT, LMM (EMMAX), LMM (GEMMA), and PCA, with panels for Balding-Nichols, HGDP, PSD, Spatial, and TGP.]

28 Part 3. Jackstraw method
Chung NC and Storey JD. (2015) Statistical significance of variables driving systematic variation. Bioinformatics, 31:

29 Focus on latent variable part of the model
$Y = \Phi M + E$
Legend: observed; unobserved.

30 How to do inference on $\Phi$?
$Y = \Phi M + E$
Legend: observed; unobserved; unobserved, inference object.

31 Estimating M
Let $Y$ be row-wise mean-centered and $Y = \Phi M + E$.
Take the SVD $Y = U D V^T$, where $V_r^T$ is the first $r$ rows of $V^T$.
Leek (2010) Biometrics shows that, under reasonable assumptions, the row space of $V_r^T$ converges to the row space of $M$ (after accounting for the average noise variance $\sigma^2_{\text{avg}}$) as the number of variables (rows of $Y$) goes to $\infty$ for fixed sample size (columns of $Y$).
In other words, $V_r^T$ is a reasonable estimate of $M$.
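The estimate described on this slide takes only a few lines of NumPy. The sketch below assumes $r$ is known. The distance measure (Frobenius norm of the difference between the two row-space projectors) is a choice made here for illustration, not one taken from Leek (2010), and the function names are hypothetical.

```python
import numpy as np

def estimate_latent_rowspace(Y, r):
    """Row-center Y and return V_r^T, the first r right singular vectors (r x n)."""
    Y_centered = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Y_centered, full_matrices=False)
    return Vt[:r]

def rowspace_distance(A, B):
    """Frobenius norm of the difference between the projectors onto the row spaces of A and B."""
    def projector(R):
        Q, _ = np.linalg.qr(R.T)      # orthonormal basis for the row space of R
        return Q @ Q.T
    return float(np.linalg.norm(projector(A) - projector(B)))
```

Simulating $Y = \Phi M + E$ with a fixed number of columns and an increasing number of rows, then tracking `rowspace_distance(estimate_latent_rowspace(Y, r), M)`, gives a quick empirical check of the convergence. Note that $V_r^T$ can only match $M$ up to an invertible linear transformation, which is why the comparison is between row spaces.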

32 Models Utilized
$Y = \Phi M + E = \Gamma V_r^T + E$
We perform inference on the rows of $\Gamma$: $\gamma_1, \ldots, \gamma_m$.
Suppose we want to test $\gamma_i = 0$ vs. $\gamma_i \neq 0$ for each $i = 1, \ldots, m$.

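33 Preserve structure and create synthetic nulls
[Figure: the data matrix $Y$ ($m$ variables × $n$ observations) has $s$ rows permuted to give $Y^*$; $V_r^T$ is computed from $Y$ and $V_r^{*T}$ from $Y^*$, and F-statistics are computed from each.]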

34 Preserve structure and create synthetic nulls
[Figure: as on the previous slide, with the permutation of $s$ rows repeated to give many $Y^*$, each with its own $V_r^{*T}$ and F-statistics; the observed F-statistics are compared to these synthetic null statistics to get p-values.]
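The procedure in these two slides can be sketched as below. This is a minimal reading of the figure, not the authors' software: in each iteration `s` randomly chosen rows are permuted independently, the SVD is recomputed on the modified matrix, and only the F-statistics of the permuted rows are kept as synthetic nulls. The add-one p-value and all names (`jackstraw_pvalues`, `n_iter`, and so on) are assumptions made for this example.

```python
import numpy as np

def top_right_singular_vectors(Y, r):
    """First r right singular vectors of Y, as the rows of an r x n matrix."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return Vt[:r]

def f_statistics(Y, Vt_r):
    """F-statistic per row of row-centered Y for association with the rows of Vt_r."""
    n, r = Y.shape[1], Vt_r.shape[0]
    rss_null = np.sum(Y ** 2, axis=1)             # intercept-only fit on centered rows
    fitted = (Y @ Vt_r.T) @ Vt_r                  # projection onto the orthonormal rows of Vt_r
    rss_full = np.sum((Y - fitted) ** 2, axis=1)
    return ((rss_null - rss_full) / r) / (rss_full / (n - r - 1))

def jackstraw_pvalues(Y, r, s, n_iter, seed=0):
    """Permute s rows at a time to build synthetic null F-statistics, then compare."""
    rng = np.random.default_rng(seed)
    Y_centered = Y - Y.mean(axis=1, keepdims=True)
    m = Y_centered.shape[0]
    observed = f_statistics(Y_centered, top_right_singular_vectors(Y_centered, r))
    null = []
    for _ in range(n_iter):
        rows = rng.choice(m, size=s, replace=False)
        Y_star = Y_centered.copy()
        # Permute each chosen row independently; the other m - s rows keep the structure.
        Y_star[rows] = np.array([rng.permutation(row) for row in Y_centered[rows]])
        Vt_star = top_right_singular_vectors(Y_star, r)
        null.append(f_statistics(Y_star[rows], Vt_star))
    null = np.sort(np.concatenate(null))
    n_at_least = null.size - np.searchsorted(null, observed, side="left")
    return (n_at_least + 1) / (null.size + 1)
```

Keeping `s` much smaller than `m` is what preserves the overall latent structure in each $Y^*$, so that $V_r^{*T}$ stays close to $V_r^T$ while the permuted rows are known to be null.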

35 Yeast cell cycle analysis
[Figure: (a) principal component values (1st PC, 2nd PC) against minutes post elutriation; (b) percent variance explained by principal component.]

36 Part 4. Consistently estimating M
Chen X and Storey JD. (2015) Consistent estimation of low-dimensional latent structure in high-dimensional data. arXiv:

37 Consistently estimating M
$E[Y \mid M] = \Phi M$
$Y_{m \times n} = (y_{ij})$ is distributed according to a single-parameter exponential family distribution.
We show how to estimate the row space of $M$ with probability 1 and consistently estimate its dimension.

38 Acknowledgements
Funding: National Institutes of Health; Office of Naval Research
Co-authors: Xiongzhi "Chee" Chen, Neo Chung, Keyur Desai, Wei Hao, Jeff Leek, Minsun Song
Lab members: Andrew Bass, Irineo Cabreros, Chee Chen, Sean Hackett, Wei Hao, Emily Nelson, Alejandro Ochoa
