Latent Variable Methods for the Analysis of Genomic Data

Similar documents
Implicit Causal Models for Genome-wide Association Studies

Association studies and regression

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Sta$s$cs for Genomics ( )

Confounder Adjustment in Multiple Hypothesis Testing

PCA and admixture models

Asymptotic distribution of the largest eigenvalue with application to genetic data

An Adaptive Association Test for Microbiome Data

High-Throughput Sequencing Course

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

Methods for Cryptic Structure. Methods for Cryptic Structure

Marginal Screening and Post-Selection Inference

Latent Variable models for GWAs

GWAS IV: Bayesian linear (variance component) models

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. BY DANIELA M. WITTEN 1 AND ROBERT TIBSHIRANI 2 Stanford University

Variable selection and machine learning methods in causal inference

CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING

In many areas of science, there has been a rapid increase in the

Non-specific filtering and control of false positives

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2016

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Chapter 1. Modeling Basics

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

Introduction to Bioinformatics

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Probabilistic Models for Learning Data Representations. Andreas Damianou

Correction for Hidden Confounders in the Genetic Analysis of Gene Expression

25 : Graphical induced structured input/output models

Classification. Chapter Introduction. 6.2 The Bayes classifier

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

3 Comparison with Other Dummy Variable Methods

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

Sample Size Estimation for Studies of High-Dimensional Data

Dimensionality Reduction for Exponential Family Data

Machine Learning Linear Classification. Prof. Matteo Matteucci

GWAS with mixed models

Multi-level Models: Idea

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Generalized Linear Models (1/29/13)

A General Framework for Variable Selection in Linear Mixed Models with Applications to Genetic Studies with Structured Populations

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Introduction to Logistic Regression

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Causal Model Selection Hypothesis Tests in Systems Genetics

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

25 : Graphical induced structured input/output models

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Niche Modeling. STAMPS - MBL Course Woods Hole, MA - August 9, 2016

Bayesian Regression (1/31/13)

Vector Space Models. wine_spectral.r

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

The gpca Package for Identifying Batch Effects in High-Throughput Genomic Data

Measures of Association and Variance Estimation

MACAU 2.0 User Manual

Supplementary Information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Statistics 572 Semester Review

Applied Statistics Qualifier Examination (Part II of the STAT AREA EXAM) January 25, 2017; 11:00AM-1:00PM

Lecture 1 Introduction to Multi-level Models

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

Probabilistic Graphical Models

Introduction to QTL mapping in model organisms

Package KMgene. November 22, 2017

PQL Estimation Biases in Generalized Linear Mixed Models

Figure 36: Respiratory infection versus time for the first 49 children.

Fractal functional regression for classification of gene expression data by wavelets

Research Statement on Statistics Jun Zhang

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Estimating subgroup specific treatment effects via concave fusion

LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018

Logistic Regression. Machine Learning Fall 2018

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

STA 216, GLM, Lecture 16. October 29, 2007

High-throughput Testing

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Computational Systems Biology: Biology X

SUPPLEMENTARY INFORMATION

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Bayes methods for categorical data. April 25, 2017

Stat 642, Lecture notes for 04/12/05 96

Unlocking RNA-seq tools for zero inflation and single cell applications using observation weights

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Introduction to Signal Detection and Classification. Phani Chavali

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Transcription:

John D. Storey Center for Statistics and Machine Learning & Lewis-Sigler Institute for Integrative Genomics Latent Variable Methods for the Analysis of Genomic Data http://genomine.org/talks/

Data m variables and n observations high5dimensional"variables" observa9ons" primary"vars" observa9ons" gene expression SNP genotypes intercept + treatment intercept + clinical outcome intercept + trait

High-dimensional regression model = + Y = B X + E Goal is to do inference on each row of B, which connects variation in X to each row of Y

Example steps for simultaneous inference 1 Formulate model Y = BX + E, which has null version Y = B 0 X + E 2 Fit models to obtain B and B 0 3 Compare goodness of fits B and B 0 for each genomic variable to obtain m test statistics 4 Calculate a null distribution to obtain m p-values 5 Perform some type of multiple testing correction (such as FDR or local FDR)

Incorporating latent variables 1 Formulate model Y = BX + E, which has null version Y = B 0 X + E 2 Fit models to obtain B and B 0 3 Compare goodness of fits B and B 0 for each genomic variable to obtain m test statistics 4 Calculate a null distribution to obtain m p-values 5 Perform some type of multiple testing correction (such as FDR or local FDR) We have been developing models, theory, and methods that replaces Y = BX + E with Y = BX + ΦM + E where M are low-dimensional latent variables.

Known variable + latent variable model observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M M accounts for dependence, unknown sources of systematic variation, structure, etc.

Gene expression studies observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M Y m n gene expression data, X d n study design matrix M batch effects, biological heterogeneity, etc. Y X, M Normal, Poisson, or NegBin

Genome-wide association studies observed" unobserved" unobserved,"inference"object" =" +" g(e[x Y,M]) = B Y + Φ M X m n SNP genotypes, Y 1 n trait variable M population structure X Y,M Binomial

Surrogate variable analysis (SVA) Introduced in Leek and Storey (2007) PLoS Genetics observed" unobserved" unobserved,"inference"object" =" +" E[Y X,M] = B X + Φ M Y m n = (y ij ) response variables where y ij R X d n study design matrix Y X, M Normal (not necessary, but helpful)

Generalized surrogate variable analysis (gsva) observed" unobserved" unobserved,"inference"object" =" +" g(e[y X,M]) = B X + Φ M Y m n response variables X d n study design matrix Y X, M exponential family distribution g( ) link function

Part 1. Surrogate variable analysis (SVA) and cross-dimensional inference (CDI) Leek JT and Storey JD. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3: e161. Leek JT and Storey JD. (2008) A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105: 18718 18723. Desai KH and Storey JD. (2012) Cross-dimensional inference of dependent high-dimensional data. Journal of the American Statistical Association, 107(497): 135 151. Storey JD, Desai KH, and Chen X. (2016) Empirical Bayes inference of dependent high-dimensional data. Submitted.

Data high5dimensional"variables" observa9ons" primary"vars" observa9ons" X m n = (x ij ) x ij R gene expression Y d n study design

Motivating study: Inflammation and the Host Response to Injury Patient 1 Patient 2. Patient 168 MOF multiple organ failure score Leukocyte) mrna) Expression)) Clinical Data ~400 clinical variables

Poor replication Phase&1& Frequency) Frequency) Phase&2& P6value) P6value) Phase&3& Frequency) Frequency) Phase&4& P6value) P6value)

Accounting for latent variables Phase 1 Phase 2 Phase 3 Phase 4 Frequency) Frequency) Frequency) Frequency) Frequency) P6value) P6value) P6value) P6value) Surrogate)Variable)Analysis) Frequency) Frequency) Frequency) P6value) P6value) P6value) P6value)

SVA model observed" unobserved" unobserved,"inference"object" =" +" E[Y X,M] = B X + Φ M Y m n gene expression data X d n study design matrix Y X,M Normal

Normal model =" +" Y = BX + E e j iid N(0,Σ)

SVA and Cross Dimensional Inference (CDI) Leek and Storey (2007) introduce the SVA model, motivated as a way to capture unmodeled systemic variation Leek and Storey (2008) show that the SVA model also captures general forms of dependence among variables Desai and Storey (2012) study the SVA model under the Normal distribution and derive a penalized maximum likelihood approach for estimating it called cross-dimensional inference (CDI) Storey, Desai, and Chen (2016) show that CDI applied to the SVA model yields the Bayes estimator under the multivariate Normal scenario with Normal prior on B Storey, Desai, and Chen (2016) provide an empirical Bayes CDI approach for estimating the SVA model

Part 2. Genotype-conditional association test Song M, Hao W, and Storey JD. (2015) Testing for genetic associations in arbitrarily structured populations. Nature Genetics, 47: 550 554. Hao W, Song M, and Storey JD. (2016) Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics, 32: 713 721. Gopalan P, Hao W, Blei DM, and Storey JD. (2016) Scaling probabilistic models of genetic variation to millions of humans. Nature Genetics, in press.

Data high5dimensional"variables" observa9ons" primary"vars" observa9ons" X m n = (x ij ) x ij {0,1,2} SNP genotypes y = (y 1,y 2,...,y n ) trait of interest

Trait models Quantitative trait: Binary trait: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) ) = α + m i=1 β ix ij + λ j β i is the genetic effect of SNP i on the trait λ j is the random non-genetic effect ɛ j N(0,σ 2 ) is the random noise variation j

a Confounding in GWAS π 1 (z) SNP x 1 π 2 (z) SNP x 2 Structure, Lifestyle, Environment z. non-genetic λ(z) Trait y SNP x i π i (z). π m (z) SNP x m

Confounding in GWAS Quantitative trait: Binary trait: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) ) = α + m i=1 β ix ij + λ j x j = (x 1j,...,x m,j ) T, λ j, and σ 2 j [from ɛ j N(0,σ 2 j )] may all be functions of z j

Genotype conditional association test (GCAT) Perform a hypothesis test of b i = 0 vs. b i 0 in the following model: ( ) E[xij y j,z j ] logit = a i + b i y j + logit(π ij ) 2 1 formulate and estimate a linear basis for logit(π ij ), called logistic factors 2 fit a logistic regression of SNP i trait + logistic factors 3 perform a generalized likelihood ratio test of the trait s coefficient

Theorem Quantitative trait: Binary trait: GC model: y j = α + m i=1 β ix ij + λ j + ɛ j log ( Pr(yj =1) Pr(y j =0) 2 ) = α + m i=1 β ix ij + λ j ( ) E[xij y j,z j ] logit = a i + b i y j + logit(π ij ) Suppose that trait values y j are distributed according to either of the above trait models and SNPs x ij π ij Binomial(2,π ij ) as described earlier. Then β i = 0 in either trait model implies that b i = 0 in the GC model.

Latent variable model observed" unobserved" unobserved,"inference"object" =" +" logit(e[x Y, M]/2) = B Y + Φ M X Y,M Binomial

Oracle - - GCAT LMM EMMAX LMM GEMMA PCA Balding Nichols 50-50 0 0 60 40 200 Observed False Positives Expected False Positives 0 HGDP 20 100 20 0 40 100 50 PSD α = 0.1 50 25 0 0 25 400 Spatial a = 0.1 50 300 200 0 100 0 500 50 400 25 300 0 200 25 100 50 0 50 100 150 200 250 0 50 100 150 200 250 0 50 100 150 200 250 0 Expected False Positives 50 100 150 200 250 0 TGP 75 0 50 100 150 200 250

Part 3. Jackstraw method Chung NC and Storey JD. (2015) Statistical significance of variables driving systematic variation. Bioinformatics, 31: 545 554.

Focus on latent variable part of the model observed" unobserved" =" +" Y = Φ M + E

How to do inference on Φ? observed" unobserved" unobserved,"inference"object" =" +" Y = Φ M + E

Estimating M Let Y be row-wise mean-centered and Y = ΦM + E. SVD of Y = UDV T, where V T r is first r rows of V T. Leek (2010) Biometrics shows that under reasonable assumptions: [V ] T r V r σ 2 avgi MM T rowspace 0 as number of variables (rows of Y ) goes to for fixed sample size (columns of Y ). In other words, V T r is a reasonable estimate of M.

Models Utilized Y = ΦM + E = ΓV T r + E We perform inference on the rows of Γ: γ 1,...,γ m. Suppose we want to test γ i = 0 vs. γ i 0 for each i = 1,...,m.

Preserve structure and create synthetic nulls n observations m variables Y Permute s rows Y* V T r V *T r F-statistics F-statistics

Preserve structure and create synthetic nulls n observations m variables Y Permute s rows Y* Y* Y* Y* Y* Y* Y* V T r F-statistics V *T LF* r LF* LF* LF* LF* LF* F-statistics Compare to get p-values

Yeast cell cycle analysis (a) (a) (c) Minutes Principal Magnitude Component Values 0.4 0.0 0.2 0.4 1st PC 2nd PC 0 100 200 300 400 Minutes Post Elucidation (b) Percent Variance Explained 0 5 10 15 20 25 2 4 6 8 10 12 Principal Component

Part 4. Consistently estimating M Chen X and Storey JD. (2015) Consistent estimation of low-dimensional latent structure in high-dimensional data. arxiv: 1510.03497.

Consistently estimating M =" E[Y M] = Φ M Y m n = (y ij ) distributed according to a single parameter exponential family distribution We show how to estimate with probability 1 the row-space of M and consistently estimate its dimension

Acknowledgements Funding: National Institutes of Health Office of Naval Research Co-authors: Xiongzhi Chee Chen Neo Chung Keyur Desai Wei Hao Jeff Leek Minsun Song Lab members: Andrew Bass Irineo Cabreros Chee Chen Sean Hackett Wei Hao Emily Nelson Alejandro Ochoa