Causal inference in biomedical sciences: causal models involving genotypes. Mendelian randomization: genes as Instrumental Variables

Causal models for observational data
Instrumental variables estimation and Mendelian randomization

Krista Fischer
Estonian Genome Center, University of Tartu, Estonia
36th Finnish Summer School on Probability Theory and Statistics

Sections:
- A general association structure with one genotype and two phenotypes
- References

Mendelian randomization: genes as Instrumental Variables

Most of the exposures of interest in chronic disease epidemiology cannot be randomized. Sometimes, however, nature will randomize for us: there is a SNP (single nucleotide polymorphism, a DNA marker) that affects the exposure of interest, but not directly the outcome.

Example: a SNP associated with an enzyme involved in alcohol metabolism, genetic lactose intolerance, etc.

However, the crucial assumption that the SNP cannot affect the outcome in any way other than through the exposure cannot be tested statistically!

[Figure: a causal graph with exposure X, outcome Y, confounder U and an instrument Z. The edge Z -> X is labelled δ and the edge X -> Y is labelled β.]

Simple regression will yield a biased estimate of the causal effect of X on Y, as the graph implies:

$$Y = \alpha_y + \beta X + U + \epsilon, \qquad E(\epsilon \mid X, U) = 0,$$

so

$$E(Y \mid X) = \alpha_y + \beta X + E(U \mid X).$$

Thus the coefficient of X will also depend on the association between X and U.

How can Z help? If $E(X \mid Z) = \alpha_x + \delta Z$, we get

$$E(Y \mid Z) = \alpha_y + \beta E(X \mid Z) + E(U \mid Z) = \alpha_y + \beta(\alpha_x + \delta Z) = \alpha_y^* + \beta\delta Z,$$

with $\alpha_y^* = \alpha_y + \beta\alpha_x$, provided $E(U \mid Z) = 0$ (Z is not associated with U, and U is centred). As $\delta$ and $\beta\delta$ are estimable, $\beta$ also becomes estimable.

1. Regress X on Z, obtain an estimate $\hat\delta$
2. Regress Y on Z, obtain an estimate $\widehat{\delta\beta}$
3. Obtain $\hat\beta = \widehat{\delta\beta} / \hat\delta$
4. Valid if Z is not associated with U and does not have any effect on Y (other than mediated by X)
5. Standard error estimation: use the sandwich estimator, implemented for instance in R, library(sem), function tsls().
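Steps 1 to 3 above can be tried out on simulated data. The following is a minimal sketch in base R, not the lecture's own code: the sample size, the allele frequency and the values of `beta` and `delta` are illustrative assumptions, and the instrument is generated so that it is independent of U by construction.

```r
# Simulated data; every numeric value below is an illustrative assumption
set.seed(1)
n     <- 100000
beta  <- 0.3   # causal effect of X on Y
delta <- 0.5   # effect of the instrument Z on X

Z <- rbinom(n, size = 2, prob = 0.4)   # e.g. number of risk alleles
U <- rnorm(n)                          # unobserved confounder
X <- 1 + delta * Z + U + rnorm(n)
Y <- 2 + beta * X + U + rnorm(n)

# Naive regression of Y on X: biased, because U affects both X and Y
beta_naive <- unname(coef(lm(Y ~ X))["X"])

# Step 1: regress X on Z; step 2: regress Y on Z; step 3: take the ratio
delta_hat     <- unname(coef(lm(X ~ Z))["Z"])
deltabeta_hat <- unname(coef(lm(Y ~ Z))["Z"])
beta_iv       <- deltabeta_hat / delta_hat

print(c(naive = beta_naive, iv = beta_iv))
```

With these settings the naive coefficient should land well above 0.3, while the ratio estimate should be close to it. The sketch gives no standard error for the ratio; for that, the slides point to tsls() in library(sem).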

Mendelian randomization example

FTO genotype, BMI and blood glucose level (related to Type 2 Diabetes risk; Estonian Biobank, n=3635, aged 45+)

- Average difference in blood glucose level (Glc, mmol/l) per BMI unit is estimated as 0.085 (SE=0.005)
- Average BMI difference per FTO risk allele is estimated as 0.50 (SE=0.09)
- Average difference in Glc level per FTO risk allele is estimated as 0.13 (SE=0.04)
- Instrumental variable estimate of the mean Glc difference per BMI unit is 0.209 (SE=0.078)

IV estimation in R (using library(sem)):

    > summary(tsls(Glc~bmi, ~fto, data=fen), digits=2)

     2SLS Estimates

    Model Formula: Glc ~ bmi

    Instruments: ~fto

    Residuals:
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -6.3700 -1.0100 -0.0943  0.0000  0.8170 13.2000

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -1.210      2.106    -0.6    0.566
    bmi            0.209      0.078     2.7    0.008 **

IV estimation: can untestable assumptions be tested?

    > summary(lm(Glc~bmi+fto, data=fen))
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    1.985      0.106   18.75   <2e-16 ***
    bmi            0.088      0.004   23.36   <2e-16 ***
    fto            0.049      0.030    1.66    0.097 .

For Type 2 Diabetes:

    > summary(glm(t2d~bmi+fto, data=fen, family=binomial))
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -7.515      0.187  -40.18   <2e-16 ***
    bmi            0.185      0.006   31.66   <2e-16 ***
    fto            0.095      0.047    2.01    0.044 *

Does FTO have a direct effect on Glc or T2D? A significant FTO effect would not be a proof here (nor does non-significance prove the opposite)! (Why?)

A general association structure with one genotype and two phenotypes

[Figure: association graph linking genotype G, phenotypes X and Y, and confounder U, with edge coefficients $\beta_{gx}$, $\beta_{gy}$, $\beta_{xy}$, $\beta_{ux}$ and $\beta_{uy}$.]

If $\beta_{gy} \neq 0$, the genotype G is said to have a pleiotropic effect on variables X and Y.
One genotype and two phenotypes

Note that if one fits a linear regression model for Y, with G as the only covariate, one estimates:

$$E(Y \mid G) = E(\mathrm{const} + \beta_{xy} X + \beta_{gy} G + \beta_{uy} U \mid G) = \mathrm{const} + \beta_{xy}(\beta_{gx} G) + \beta_{gy} G = \mathrm{const} + (\beta_{gx}\beta_{xy} + \beta_{gy}) G$$

So when one uses the MR approach here (incorrectly assuming no direct effect of G on Y), one estimates:

$$\frac{\beta_{gx}\beta_{xy} + \beta_{gy}}{\beta_{gx}} = \beta_{xy} + \frac{\beta_{gy}}{\beta_{gx}}$$

Can we test pleiotropy?

A naïve approach would be to fit a linear regression model for Y, with both X and G as covariates. But in this case we estimate:

$$E(Y \mid X, G) = \mathrm{const} + \beta_{gy} G + \beta_{xy} X + \beta_{uy} E(U \mid X, G).$$

As it is possible to show that (assuming standardized variables)

$$E(U \mid X, G) = \mathrm{const} + \frac{\beta_{ux}}{1 - \beta_{gx}^2}\,(X - \beta_{gx} G),$$

we get

$$E(Y \mid X, G) = \mathrm{const} + \left[\beta_{xy} + \frac{\beta_{ux}\beta_{uy}}{1 - \beta_{gx}^2}\right] X + \left[\beta_{gy} - \frac{\beta_{gx}\beta_{ux}\beta_{uy}}{1 - \beta_{gx}^2}\right] G.$$
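The three expressions derived above can be checked numerically. This is a hedged sketch rather than the lecture's own simulation: the coefficient values are illustrative assumptions, G is drawn as a standard normal variable for simplicity (a real genotype is an allele count), and the residual variance of X is chosen so that X has unit variance, matching the "standardized variables" condition.

```r
# All coefficient values are illustrative assumptions
set.seed(2)
n <- 200000
b_gx <- 0.2; b_gy <- 0.1; b_xy <- 0.3; b_ux <- 0.4; b_uy <- 0.45

G <- rnorm(n)                      # standardized stand-in for a genotype
U <- rnorm(n)                      # unobserved confounder
# Residual sd chosen so that Var(X) = 1
X <- b_gx * G + b_ux * U + rnorm(n, sd = sqrt(1 - b_gx^2 - b_ux^2))
Y <- b_xy * X + b_gy * G + b_uy * U + rnorm(n)

fit_yg  <- coef(lm(Y ~ G))
fit_xg  <- coef(lm(X ~ G))
fit_yxg <- coef(lm(Y ~ X + G))

est <- c(
  G_only  = unname(fit_yg["G"]),                 # coef. of G in E(Y | G)
  MR      = unname(fit_yg["G"] / fit_xg["G"]),   # ratio (MR) estimate
  G_adj_X = unname(fit_yxg["G"]),                # coef. of G in E(Y | X, G)
  X_adj_G = unname(fit_yxg["X"])                 # coef. of X in E(Y | X, G)
)
theory <- c(
  G_only  = b_gx * b_xy + b_gy,
  MR      = b_xy + b_gy / b_gx,
  G_adj_X = b_gy - b_gx * b_ux * b_uy / (1 - b_gx^2),
  X_adj_G = b_xy + b_ux * b_uy / (1 - b_gx^2)
)
print(round(rbind(est, theory), 3))
```

In this setup neither the MR ratio nor the X-adjusted coefficient of G equals the true direct effect `b_gy`, which is the point of the derivation.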

One genotype and two phenotypes: linear models for Y

What do we estimate by fitting different linear models for Y?

Estimable coefficients:

    Covariates   Coef. of X                                               Coef. of G
    X, G, U      $\beta_{xy}$                                             $\beta_{gy}$
    G            (not in model)                                           $\beta_{gy} + \beta_{gx}\beta_{xy}$
    X            $\beta_{xy} + \beta_{gx}\beta_{gy} + \beta_{ux}\beta_{uy}$   (not in model)
    X, G         $\beta_{xy} + \beta_{ux}\beta_{uy} / (1 - \beta_{gx}^2)$     $\beta_{gy} - \beta_{gx}\beta_{ux}\beta_{uy} / (1 - \beta_{gx}^2)$

Some references

An excellent overview of Mendelian randomization:
Sheehan, N., Didelez, V., Burton, P., Tobin, M., Mendelian Randomization and Causal Inference in Observational Epidemiology, PLoS Med. 2008 August; 5(8). http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2522255

A recent review on causality in genetics:
Vansteelandt, S., Lange, C., Causation and causal inference for genetic effects. Human Genet, 2012 131:1665-1676

Example: FTO genotype (G), BMI (X) and other outcomes Y (data: Estonian Biobank), N=12,740

Effect of FTO on BMI, E(X | G): 0.08 (se=0.01)

    Outcome Y   Scaled BMI on Y             FTO on Y          BMI-adjusted effect of FTO on Y
                (coef. of X in E(Y | X))    (E(Y | G))        (E(Y | X, G))
    SBP          0.40  (0.008)               0.025 (0.013)    -0.008 (0.011)
    HDL         -0.34  (0.011)              -0.008 (0.016)     0.020 (0.016)
    TG           0.068 (0.002)               0.041 (0.016)    -0.009 (0.015)
    T2D*         1.02  (0.030)               0.160 (0.041)     0.090 (0.040)
    *logistic regression

Direct effect?

A simple simulated example

N=50000, all non-zero coefficients are highly significant

    True parameters                             Estimated parameters
                                                Coef. of X   Coefficient of G                MR
    beta_gx  beta_gy  beta_xy  confounding (U)  Y | X        X | G    Y | G    Y | G, X      Y | X, inst. G
    0.2      0.1      0.3      0.4              0.50         0.20     0.16      0.06         0.8
    0.2      0.1      0.3      0                0.30         0.20     0.16      0.10         0.8
    0.2      0        0.3      0.4              0.50         0.20     0.06     -0.04         0.3
    0        0.1      0.3      0.4              0.50         0        0.10      0.10         NA
    0.2      0.1      0        0.4              0.20         0.20     0.10      0.07         0.5

MultiPhen analysis idea (O'Reilly et al, PLoS One 2012)

The idea: with correlated phenotypes, use the genotype as the outcome and the phenotypes as covariates (proportional odds regression).

Thus in our setting, regress G on X and Y.
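The regression of G on X and Y can be sketched on simulated data. Two loud assumptions apply: ordinary linear regression (`lm`) is used here as a simplified stand-in for the proportional odds model that MultiPhen itself fits, and the helper `reverse_fit()` with all of its coefficient values is hypothetical, made up for illustration. In both scenarios G has no direct effect on Y.

```r
set.seed(3)

# Hypothetical helper: simulate G -> X -> Y with confounder strength b_u
# and return the coefficients of X and Y from the regression of G on both.
reverse_fit <- function(n, b_u) {
  G <- rbinom(n, size = 2, prob = 0.3)   # allele count
  U <- rnorm(n)                          # unobserved confounder
  X <- 0.3 * G + b_u * U + rnorm(n)
  Y <- 0.3 * X + b_u * U + rnorm(n)      # no direct effect of G on Y
  coef(lm(G ~ X + Y))[c("X", "Y")]
}

no_conf <- reverse_fit(200000, b_u = 0)     # no unobserved confounding
conf    <- reverse_fit(200000, b_u = 0.6)   # X and Y share the confounder U

print(round(rbind(no_conf, conf), 3))
```

Without confounding the coefficient of Y should be near zero. With confounding it becomes clearly negative even though G does not act on Y directly, so a non-zero coefficient signals association rather than a direct causal effect.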
However, if $E(Y \mid X, G) = g_1(X) + g_2(G)$ (for some $g_1$ and $g_2$), then, regardless of the causal mechanism, $E(G \mid X, Y) = h_1(X) + h_2(Y)$ (for some $h_1$ and $h_2$).

MultiPhen is a useful tool for detecting associations with correlated phenotypes, but NOT for causal parameter estimates.

A simple simulated example

N=50000, all non-zero coefficients are highly significant

    True parameters                             Estimated parameters
                                                Coef. of X   Coefficient of G                MR               MultiPhen
    beta_gx  beta_gy  beta_xy  confounding (U)  Y | X        X | G    Y | G    Y | G, X      Y | X, inst. G   G | Y adj. X   G | X adj. Y
    0.2      0.1      0.3      0.4              0.50         0.20     0.16      0.06         0.8               0.13           0.10
    0.2      0.1      0.3      0                0.30         0.20     0.16      0.10         0.8               0.13           0.15
    0.2      0        0.3      0.4              0.50         0.20     0.06     -0.04         0.3              -0.06           0.23
    0        0.1      0.3      0.4              0.50         0        0.10      0.10         NA                0.15          -0.09
    0.2      0.1      0        0.4              0.20         0.20     0.10      0.07         0.5               0.09           0.15
    0.2      0        0.3      0                0.30         0.20     0.06      0            0.3               0              0.20

Mendelian randomization: more on assumptions

The causal effect is defined via potential outcomes:

$$E(Y - Y_0 \mid G, X) = \beta (X - X_0)$$

$Y_0$: potential exposure-free outcome (if $X_0 = 0$), or the outcome at a potential baseline exposure level.

This assumes the same effect of X at each level of G: no exposure effect heterogeneity.

One way to understand this assumption is via principal stratification, which is easily understood in the context of noncompliance analysis of randomized trials.
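The noncompliance setting mentioned above can be sketched with simulated data. This is an illustrative assumption throughout, not an analysis from the lecture: the stratum proportions, baseline risks and the treatment effect of -0.10 are made up. Random assignment R plays the role of the instrument, and the ratio of the two assignment effects estimates the effect among compliers (those whose received treatment follows the assignment).

```r
# All proportions and risks below are illustrative assumptions
set.seed(4)
n <- 100000
stratum <- sample(c("always", "complier", "never"), n, replace = TRUE,
                  prob = c(0.2, 0.5, 0.3))
R <- rbinom(n, 1, 0.5)                      # random assignment
X <- ifelse(stratum == "always", 1,
     ifelse(stratum == "never", 0, R))      # treatment actually received

# Baseline risk differs by stratum (this is what confounds X and Y);
# treatment lowers the risk by 0.10 in every stratum.
base <- unname(c(always = 0.5, complier = 0.4, never = 0.3)[stratum])
Y <- rbinom(n, 1, base - 0.10 * X)

itt_y <- mean(Y[R == 1]) - mean(Y[R == 0])  # effect of assignment on outcome
itt_x <- mean(X[R == 1]) - mean(X[R == 0])  # estimated share of compliers
cace  <- itt_y / itt_x                      # ratio (IV) estimate

as_treated <- mean(Y[X == 1]) - mean(Y[X == 0])  # naive comparison

print(round(c(compliers = itt_x, cace = cace, as_treated = as_treated), 3))
```

Under these assumptions the ratio should come out near -0.10, while the naive as-treated comparison is pulled towards zero because always takers start from a higher baseline risk and never takers from a lower one.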

Classical vs Mendelian Randomization

A Randomized Clinical Trial (RCT)

[Figure: graph R -> X -> Y with unobserved confounders U of X and Y. R: random assignment (treatment or control); X: received treatment; Y: outcome.]

The association between R and Y is unconfounded and present only when the X-Y association is present.

Mendelian Randomization (MR)

[Figure: graph G -> X -> Y with unobserved confounders U of X and Y. G: genotype; X: exposure phenotype; Y: outcome phenotype.]

The association between G and Y is unconfounded and present only when the X-Y association is present.

Estimating Complier Average Causal Effect (CACE): assumptions

As R has no direct effect on Y, there is:
- No assignment effect in never takers
- No assignment effect in always takers

The estimated causal effect is only valid for compliers.

Outcome probabilities: $p_{is} = P(Y = 1 \mid R = i, \text{Stratum} = s)$, with R the assigned treatment.

- Always takers: $p_{1A} = p_{0A}$
- Compliers: $p_{1C}$ and $p_{0C}$ can differ
- Never takers: $p_{1N} = p_{0N}$

Principal stratification and Mendelian randomization (ignoring heterozygotes)

- Always takers: overweight regardless of their genotype
- Compliers: overweight when having risk alleles of the FTO genotype, normal weight otherwise
- Never takers: normal weight even when having the FTO risk alleles

[Figure: the strata shown for genotypes A/A and T/T.]

Do the principal strata exist?

- There is a proven causal association between FTO genotype and overweight status.
- This means there must exist individuals whose overweight is caused by their FTO risk alleles.
- So there also exist individuals who have normal weight only because they do not have FTO risk alleles.
- Any differences in the T2D risk between people with different genotypes can only come from this stratum of compliers.

T2D, overweight and FTO example

- The estimated effect is valid in the stratum of compliers (estimated as 10% of the individuals).
- Extending this to other principal strata involves assumptions on no exposure effect heterogeneity.

Summary on causal analysis in genomics data

- Association is not causality: an old truth, but one that still needs repeating when analyzing omics data.
- In most cases, causal inference relies on statistically untestable assumptions.
- The assumptions should be verified based on external knowledge (biology).
- There are no forbidden models, but it is important to understand the interpretation of model parameters given realistic assumptions.
- There are always unobserved confounders between health phenotypes!