Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Similar documents
Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

Computational Systems Biology: Biology X

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Causality II: How does causal inference fit into public health and what it is the role of statistics?

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Bayesian Inference of Interactions and Associations

Power and sample size calculations for designing rare variant sequencing association studies.

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Lecture 21: October 19

An Adaptive Association Test for Microbiome Data

Exam: high-dimensional data analysis January 20, 2014

Multiple Testing. Gary W. Oehlert. January 28, School of Statistics University of Minnesota

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Statistical testing. Samantha Kleinberg. October 20, 2009

SNP-SNP Interactions in Case-Parent Trios

Correlation and regression

HYPOTHESIS TESTING. Hypothesis Testing

Latent Variable Methods for the Analysis of Genomic Data


Association studies and regression

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson

Estimating direct effects in cohort and case-control studies

Statistical methods for comparing multiple groups. Lecture 7: ANOVA. ANOVA: Definition. ANOVA: Concepts

High-throughput Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Non-specific filtering and control of false positives

Lecture 5: ANOVA and Correlation

Lectures 5 & 6: Hypothesis Testing

An Introduction to Mplus and Path Analysis

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

Applied Machine Learning Annalisa Marsico

STA441: Spring Multiple Regression. More than one explanatory variable at the same time

LECTURE 5. Introduction to Econometrics. Hypothesis testing

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

Tests for the Odds Ratio in Logistic Regression with One Binary X (Wald Test)

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

STA 216, GLM, Lecture 16. October 29, 2007

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

An introduction to biostatistics: part 1

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions

Advanced Statistical Methods: Beyond Linear Regression

Lecture 1 Introduction to Multi-level Models

Marginal Screening and Post-Selection Inference

Regression models. Categorical covariate, Quantitative outcome. Examples of categorical covariates. Group characteristics. Faculty of Health Sciences

Bayesian Methods for Highly Correlated Data. Exposures: An Application to Disinfection By-products and Spontaneous Abortion

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Lecture 7 Time-dependent Covariates in Cox Regression

Part IV Statistics in Epidemiology

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Probabilistic Causal Models

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

An Introduction to Path Analysis

Lecture 12: Effect modification, and confounding in logistic regression

Bayesian Regression (1/31/13)

In some settings, the effect of a particular exposure may be

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

COMPARING SEVERAL MEANS: ANOVA

Sample Size Estimation for Studies of High-Dimensional Data

Statistical Applications in Genetics and Molecular Biology

Psych 230. Psychological Measurement and Statistics

Statistical aspects of prediction models with high-dimensional data

Introduction to the Analysis of Variance (ANOVA)

STAT 135 Lab 9 Multiple Testing, One-Way ANOVA and Kruskal-Wallis

One-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation

Two Sample Problems. Two sample problems

More Statistics tutorial at Logistic Regression and the new:

Tests for Two Correlated Proportions in a Matched Case- Control Design

Statistics in medicine

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Binary Logistic Regression

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

STA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information.

3 Comparison with Other Dummy Variable Methods

Machine Learning Linear Classification. Prof. Matteo Matteucci

Review of Statistics 101

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Simple Regression Model. January 24, 2011

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

Confidence Intervals for the Odds Ratio in Logistic Regression with One Binary X

Estimating the Marginal Odds Ratio in Observational Studies

The t-test: A z-score for a sample mean tells us where in the distribution the particular mean lies

CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

A Bayesian multi-dimensional couple-based latent risk model for infertility

Introduction to the Analysis of Variance (ANOVA) Computing One-Way Independent Measures (Between Subjects) ANOVAs

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

Causal Hazard Ratio Estimation By Instrumental Variables or Principal Stratification. Todd MacKenzie, PhD

Transcription:

Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39

Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction Standard Gene-Environment Interaction Testing Some More Sophisticated GxE Tests Even Fancier Methods High order Interactions 2 / 39

Interaction Interaction means different things to different people: Biological Mechanistic Additive Synergism and Antogonisms Statistical (Primarily Multiplicative ) Others a lot of general vagueness Statistical (multiplicative) interactions: effect modification (one variable changes the effect of the other on outcome); deviation from additivity 3 / 39

Statistical Interaction Multiplicative interactions: combined effect exceeds the additive effects of individual variables Example Additive Effects (Not interaction) Multiplicative Effects (interaction) Biomarker Levels (mg/ml) 0.0 0.5 1.0 1.5 2.0 2.5 Biomarker Levels (mg/ml) 0 1 2 3 4 5 Smoke Drugs Smoke AND Drink Individuals with Particular Exposures Smoke Drugs Smoke AND Drink Individuals with Particular Exposures 4 / 39

GxE Testing Gene-Environment Interactions (G E) Complex diseases are caused by interplay of genes & environment. Identification of G E aids in disease prevention. 5 / 39

GxE Testing What is environment (E)? Environment is just as loaded as Interaction NIEHS (NIH): Basically, chemical exposures or objective measures (e.g. metabolites) not primary smoking but second hand is OK Anything that is not genetics (G): BMI, race, education, gender, diet, etc. Treatment? Another SNP (Gene-gene interaction, epistasis): main difference between this and GxE is issue of scale (number of pair-wise tests) Operationally: often doesn t matter, but particular scenarios can change assumptions (e.g. independence between E and G) 6 / 39

GxE Testing Marginal Analysis of GxE Interactions Idea: Assess statistical interaction between a single exposure of interest and each SNP Testing Approaches: Two-way interaction in regression model (standard) Alternative designs Testing joint G and GxE effects Others. Multiple comparisons correction: FDR or Bonferroni 7 / 39

GxE Testing Standard 2-way interaction analysis: Model (quantitative trait): y i = β 0 + β g G i + β e E i + β ix G i E i + ε i Then to test for interaction effect: H 0 : β ix = 0. If H 0 is true, then G and E can have effects (in the presence of each other), but their effects do not modify each other: G i = 0 E[y i ] = β 0 + β e E i G i = 1 E[y i ] = β 0 + β g 1 + β e E i If H 0 is false (reject null), then total effect of G and E differs depending on other variable: G i = 0 E[y i ] = β 0 + β e E i G i = 1 E[y i ] = β 0 + β g + (β e + β i x)e i 8 / 39

GxE Testing Standard 2-way interaction analysis: Operationally Regress y on G, E and product of G and E. Then can test H 0 : β i x = 0 using any 1-df test. Things to be careful... Scale: particularly for continuous y Interaction testing is harder because the null model still has genetics in it. Under H 0 y i = β 0 + β g G i + β e E i + ε i If this model is not correctly specified or captured, then there can be considerable inflation of type I error. 9 / 39

GxE Testing Power of GxE Tests is Low Power is bad for GxE analysis: Needs many times as many subjects to test for interaction that is equally powerful. Power 0.0 0.2 0.4 0.6 0.8 1.0 Main Genetic Effect Interaction Power as function of sample size: α = 0.05 level, disease pop. risk of 0.01%, SNP with MAF of 0.25, environment with prevalence of 20%, both main SNP and interaction effect are 1.25 (OR). 0 2000 4000 6000 N 10 / 39

GxE Testing Alternative Strategies? Exploit additional assumptions Case-only analysis Multi-SNP by E Testing (extension of gene/pathway analysis, but harder) Intelligently selecting which SNPs to test Many more fancy things constantly being developed 11 / 39

Advanced GxE Testing Joint G and GxE Testing Joint Test of G + GxE Main Idea Instead of testing just H 0 : β ix = 0, we test H 0 : β g = β ix = 0 via 2-df test. Primarily useful for gene discovery: significance does not explicitly inform interaction analysis. References Gauderman and Siegmund (2001) Hum Herid 52:34 46. Selinger-Leneman et al. (2003) Gen Epi 24:200 7. Kraft et al. (2007) Hum Herid 63:111-9. Huang et al. (2011) Genome Med 3:42. 12 / 39

Advanced GxE Testing Joint G and GxE Testing Joint G and GxE Testing: Toy data Consider the data - a binary response Y, a binary environmental variable E and a binary gene G: Y = 1 Y = 0 G = 1 G = 0 G = 1 G = 0 E = 1 112 64 E = 1 100 100 E = 0 112 112 E = 0 100 100 13 / 39

Advanced GxE Testing Joint G and GxE Testing Joint G and GxE Testing logit(pr(y = 1 G, E)) = α 0 + α 1 G + α 2 E P-value for H 0 : α 1 = 0 is 0.070. Not significant! logit(pr(y = 1 G, E)) = β 0 + β 1 G + β 2 E + β 3 GE P-value for H 0 : β 3 = 0 is 0.051. Not significant! But... P-value for H 0 : β 1 = β 3 = 0 is 0.029. Significant! 14 / 39

Advanced GxE Testing Case-Only Analysis Case-Only Analysis Suppose we have case-control study. Case-Only Analysis involves analyzing *only* the cases. Key Assumption: Genotype MUST be independent of environment Almost necessarily true for randomized treatment E Often true for traditional exposures (e.g. toxicants, pollution), but can be weird confounding issues Need to be careful for some E like BMI, alcohol use, smoking, etc. Generally: need to consider this situationally and with care Assuming the above, then case-only analysis proceeds by looking at the odds-ratio relating environment to genotype. 15 / 39

Advanced GxE Testing Case-Only Analysis Case-Only Analysis G = 0 G = 1 E = 0 E = 1 E = 0 E = 1 Y = 0 p 01 p 02 p 03 p 04 Y = 1 p 11 p 12 p 13 p 14 For multiplicative interaction: logitp(y = 1 G, E) = β 0 + β g G + β e E + β ix G E OR 11 exp(β ix ) = OR 10 OR 01 = p 11p 14 / p 01p 04 p 12 p 13 p 02 p 03 GxE odds ratio in cases = GxE odds ratio in controls GxE odds ratio in controls = 1 under G-E independence!!! 16 / 39

Advanced GxE Testing Case-Only Analysis Case-Only Analysis Instead, we model dependency between genotype and environment: P(G = 1 E, Y = 1) P(G = 0 E, Y = 1) P(Y = 1 G = 1, E)/P(Y = 0 G = 1, E) P(Y = 0, G = 1, E) = P(Y = 1 G = 0, E)/P(Y = 0 G = 0, E) P(Y = 0, G = 0, E) = exp(β 0 + β g + β e E + β ix E) P(G = 1 Y = 0, E) exp(β 0 + β e E) P(G = 0 Y = 0, E) P(G = 1) = exp(β g + β ix E) P(G = 0) with last line holding due to G-E independence (in controls). 17 / 39

Advanced GxE Testing Case-Only Analysis Power of Case-Only Analysis Case-only analysis can lead to improved power, but be careful of assumptions. 18 / 39

Advanced GxE Testing Multi-SNP by E Interactions Multi-SNP by E Interactions Instead of looking at one-snp at a time, can we again conduct analysis at multi-snp level? Idea: 1. Group SNPs in gene/pathway/region 2. Test joint interaction between all SNPs and an environmental variable Many approaches for main SNP effects are intuitively applicable, but fail! Interaction term = G E is correlated with both E and G; this makes permutation methods more challenging We have to correctly capture null model 19 / 39

Advanced GxE Testing Multi-SNP by E Interactions Multi-SNP by E Interactions Consider the following generalized linear model: g (µ i ) = X i α 1 + α 2 E i + G i α 3 + E i G i β Outcome: Y i, has distribution from exponential family and µ i = E(Y i X i ). q non-genetic covariates: X i. environmental factor: E i. group of p variants: G i = (G i1,, G ip ). p G E interaction terms: S i = (E i G i1,, E i G ip ). We are interested in testing if there is any G E: H 0 : β = 0. 20 / 39

Advanced GxE Testing Multi-SNP by E Interactions Averaging/Collapsing Tests for Interactions Idea: let G be a (weighted) average of genotypes within a gene/region/pathway. To test for main effects: H 1m : g (µ i ) = α 1 + α 2E i + α 3G i H 0m : α 3 = 0 Can we use it to test for interactions? H 1x : g (µ i ) = α1 + α2e i + α3g i H 0x : β = 0 + β E i G i 21 / 39

Advanced GxE Testing Multi-SNP by E Interactions Bias analysis for Collapsing G E tests Intuition Null model has to be correctly specified for valid inference. Collapsing G E tests may not give valid inference as main effects of the SNVs may not be sufficiently accounted for. Continuous Outcome: No, even if G E. G and E are independent: Model for mean of Y is valid; Model for variance of Y is not valid. G and E not independent: Model for mean of Y is not valid; Model for variance of Y is not valid. 22 / 39

Advanced GxE Testing Multi-SNP by E Interactions Bias analysis for Collapsing G E tests Binary Outcome: Yes if disease is rare and G E. G and E are independent: Model for mean of Y is valid; Model for variance of Y is valid approximately. G and E not independent: Model for mean of Y is not valid; Model for variance of Y is valid approximately. 23 / 39

Advanced GxE Testing Multi-SNP by E Interactions GESAT: Model To test if there is any G E (H 0 : β = 0): H 0 : logit [P(Y i = 1 E i, X i, G i )] = X i α 1 + α 2 E i + G i α 3 H A : logit [P(Y i = 1 E i, X i, G i )] = X i α 1 + G i (α 3 + E i β) + α 2 E i In principle, we can do the same thing as with SKAT, but... Difficulties Need to fit null model: Need to estimate main effect of variants Lots of variants LD and rarity make fitting difficult Modifications are necessary. GESAT: Extension of SKAT (global test) for GxE 24 / 39

Advanced GxE Testing Multi-SNP by E Interactions GESAT: Test Statistic Assume (β 1,, β p ) are random and independent with mean zero and common variance τ. Testing H 0 reduces to testing H 0 : τ = 0. Following Lin (1997), the score test statistic is T = (Y µ) SS (Y µ) = [Y µ ( α)] SS [Y µ ( α)]. µ = µ ( α) is estimated under the null model, g (µ i X i, E i, G i ) = X i α 1 + α 2 E i + G i α 3 = X i α. Use ridge regression to estimate α, impose a penalty only on α 3. Under H 0, T p v=1 d v χ 2 1 approximately. Invert characteristic function to get p-value (Davies, 1980). 25 / 39

Advanced GxE Testing Pre-Screening Which SNPs to Test? Genome-wide analysis: screen association between all SNPs and outcome Candidate genes or pathways (functional groups) SNPs with significant main effects More sophisticated algorithms: data adaptive procedures that use two-stage screening Which set to use can influence multiple testing adjustments. Not always clear how many tests to adjust for if considering main effects too. 26 / 39

Advanced GxE Testing Additional Structure on Problem Additional Work Already a lot of work assuming independence Can model E better: multi-e analysis Not always clear which E to use: smoking can be yes/no, never/ever, pack-years, cotinine etc. Mixtures of toxicants: many toxicants or exposures happen in conjunction Monotonicity constraints Omnibus strategies Weighted hypothesis testing Innovative screening strategies 27 / 39

High Order Interactions Higher order interactions Given that 2-order interactions are already hard to fine, why are we interested in higher order interactions? power, computational, and interpretation, we should only be interested in higher order interactions when we focus attention on a few targeted regions (e.g. genes), selected because of studies (carried out on other data sets), biology,... 28 / 39

High Order Interactions It is not a surprise that... The power is small. As such we may want to see these methods as hypothesis generating - i.e. we may identify a limited number of interactions that we can follow up on in new studies. 29 / 39

High Order Interactions Models SNPs as 3 level categorical variables: C low low low high low high low high high D A Decision tree models. True A B False B False True False True False Boolean rules like: You are at increased risk if you have at least one mutant for SNP1 or two mutants for SNP2. Classical interaction model g[e(y G)] = β 0 + β 1 G 1 + β 2 G 2 + β 3 G 3 + β 4 G 1 G 2 +β 5 G 1 G 3 + β 6 G 2 G 3 + β 7 G 1 G 2 G 3, 30 / 39

High Order Interactions Models MDR SNPs as 3 level categorical variables: C low low low high low high low high high D A CART Decision tree models. True A B False B False True False True False Logic Regression Boolean rules like: You are at increased risk if you have at least one mutant for SNP1 or two mutants for SNP2. Classical interaction model g[e(y G)] = β 0 + β 1 G 1 + β 2 G 2 + β 3 G 3 + β 4 G 1 G 2 31 / 39

High Order Interactions MDR Multifactor Dimensionality Reduction [Ritchie et al. (2001) Am J Hum Gen 69:138 47] [Hahn et al. (2003) Bioinformatics 19:376 82] modification of [Nelson et al. (2001) Genome Res 11:458 70] Complex interactions are hard to detect because of sparse data via standard parametric models Inaccurate parameter estimates and large standard errors with relatively small sample sizes. Reduce the dimensionality and identify SNP combinations that lead to high risk of disease. Hunting for: low low low high low high low high high 32 / 39

High Order Interactions MDR MDR 33 / 39

High Order Interactions MDR MDR For a particular model with M SNPs (or environmental factors): 10-fold Cross-validation 1. Consider each cell (if factors are SNPs, there are 3 M ). 2. On 9/10th of the data decide whether a cell is high or low risk (for a case-control study the typical cut-off in each cell would be the case/control ratio in the study). 3. Evaluate the prediction on the remaining 1/10th of the data. 4. Check how many of the MDR models are the same. Not entirely clear how this is done - if each cell should be consistent, this would work against models that have (m)any cells that are close to 50/50. Repeat this a number of times - to achieve stability of the cross-validation. If you have enough computing power, always a good idea. Select the model with the lowest prediction error, provided the consistency is better than by chance. 34 / 39

High Order Interactions MDR Sporadic breast cancer 200 women with sporadic primary invasive breast cancer with age-matched hospital based controls, 10 estrogen metabolism SNPs 35 / 39

High Order Interactions MDR Issues While making things binary helps, computation can explode if the number of SNPs in the study is substantial. The selected models do not adhere to the usual parsimony that we like in statistics: if a model with, say, 4 factors is ɛ better than a model with 3 factors, MDR will pick 4 factors. Usually we would prefer 3. Conceivably this could be changed fairly easy. The MDR implementation of cross-validation makes this worse, however (next slide). The models are very hard to interpret. To me, it would make more sense to identify a smaller number of cells with extreme high or extreme low risk. 36 / 39

High Order Interactions MDR Bias in their implementation of Cross Validation Consider the number of models with M SNPs out of a total T. 0 1 2 3 4 5 6 7 8 10 1 10 45 120 210 252 210 120 45 25 1 30 435 4060 27405 142506 593775 2035800 5852925 Imagine what happens if there is no signal, and every model is equally likely, which size would we most likely end up with... The consistency reduces this problem a little, but not by much. Think about the situation where there is one SNP with a strong effect... 37 / 39

High Order Interactions MDR Take home message well beyond MDR When using cross-validation for model selection, if the number of models of size M is different for different M, you can use crossvalidation to find the best model of each size, but you cannot use it to find the best size. You need another test dataset for that! Even more generally: beware of fancy methods, particularly anything for interaction analysis!! 38 / 39

High Order Interactions MDR A sobering note There likely have been more papers written about methods to identify GxE and GxG interactions, than the number of interactions that have successfully been identified. 39 / 39