Causal inference in biomedical sciences: causal models involving genotypes

Krista Fischer
Estonian Genome Center, University of Tartu, Estonia
36th Finnish Summer School on Probability Theory and Statistics

Causal models for observational data
Instrumental variables estimation and Mendelian randomization
A general association structure with one genotype and two phenotypes
References

Mendelian randomization: genes as instrumental variables

- Most of the exposures of interest in chronic disease epidemiology cannot be randomized.
- Sometimes, however, nature will randomize for us: there is a SNP (single nucleotide polymorphism, a DNA marker) that affects the exposure of interest, but not directly the outcome.
- Example: a SNP that is associated with the enzyme involved in alcohol metabolism, with genetic lactose intolerance, etc.
- However, the crucial assumption that the SNP cannot affect the outcome in any way other than through the exposure cannot be tested statistically!

A causal graph with exposure X, outcome Y, confounder U and an instrument Z:

[Figure: causal graph with Z -> X (effect $\delta$), X -> Y (effect $\beta$), U -> X and U -> Y]

Simple regression will yield a biased estimate of the causal effect of X on Y, as the graph implies:

$$Y = \alpha_y + \beta X + U + \varepsilon, \qquad E(\varepsilon \mid X, U) = 0,$$

so

$$E(Y \mid X) = \alpha_y + \beta X + E(U \mid X).$$

Thus the coefficient of X in this regression reflects not only $\beta$ but also the association between X and U, through $E(U \mid X)$.

How can Z help?

With the same model, $Y = \alpha_y + \beta X + U + \varepsilon$, $E(\varepsilon \mid X, U) = 0$: if $E(X \mid Z) = \alpha_x + \delta Z$, we get

$$E(Y \mid Z) = \alpha_y + \beta E(X \mid Z) + E(U \mid Z) = \alpha_y + \beta(\alpha_x + \delta Z) = \alpha_y^* + \beta\delta Z,$$

where the second equality uses $E(U \mid Z) = 0$ (Z is not associated with U) and $\alpha_y^* = \alpha_y + \beta\alpha_x$. As $\delta$ and $\beta\delta$ are estimable, $\beta$ becomes estimable as well.

1. Regress X on Z, obtain an estimate $\hat\delta$.
2. Regress Y on Z, obtain an estimate $\widehat{\delta\beta}$.
3. Obtain $\hat\beta = \widehat{\delta\beta} / \hat\delta$.
4. Valid if Z is not associated with U and does not have any effect on Y other than that mediated by X.
5. Standard error estimation: use the sandwich estimator, implemented for instance in R, library(sem), function tsls().
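The three regression steps above can be tried out on simulated data. The following is a minimal sketch in base R, not the analysis from the slides: the variable names, the allele frequency and the values of `delta` and `beta` are all illustrative assumptions, and the data are generated so that Z affects Y only through X.

```r
# Minimal sketch: ratio (Wald) IV estimator on simulated data.
# All names and parameter values below are illustrative assumptions.
set.seed(1)
n     <- 50000
delta <- 0.5   # effect of instrument Z on exposure X (assumed)
beta  <- 0.3   # causal effect of X on outcome Y (assumed)

z <- rbinom(n, size = 2, prob = 0.3)   # SNP coded as number of risk alleles
u <- rnorm(n)                          # unobserved confounder
x <- 1 + delta * z + u + rnorm(n)      # exposure depends on Z and U
y <- 2 + beta * x + u + rnorm(n)       # outcome depends on X and U, not directly on Z

beta_ols  <- unname(coef(lm(y ~ x))["x"])   # naive regression, biased by U
delta_hat <- unname(coef(lm(x ~ z))["z"])   # step 1: regress X on Z
db_hat    <- unname(coef(lm(y ~ z))["z"])   # step 2: regress Y on Z
beta_iv   <- db_hat / delta_hat             # step 3: ratio estimate of beta

round(c(ols = beta_ols, iv = beta_iv), 3)
```

In this setup the naive coefficient is pulled upwards by the confounder, while the ratio estimate should land close to the assumed `beta`. The sketch gives no standard error; for that the slides point to tsls() in the sem package.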
Mendelian randomization example

FTO genotype, BMI and Blood Glucose level (related to Type 2 Diabetes risk; Estonian Biobank, n=3635, aged 45+)

- Average difference in Blood Glucose level (Glc, mmol/l) per BMI unit is estimated as 0.085 (SE=0.005)
- Average BMI difference per FTO risk allele is estimated as 0.50 (SE=0.09)
- Average difference in Glc level per FTO risk allele is estimated as 0.13 (SE=0.04)
- Instrumental variable estimate of the mean Glc difference per BMI unit is 0.209 (SE=0.078)

IV estimation in R (using library(sem)):

    > library(sem)
    > summary(tsls(glc~bmi, ~fto,data=fen),digits=2)

     2SLS Estimates

    Model Formula: Glc ~ bmi

    Instruments: ~fto

    Residuals:
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    -6.3700 -1.0100 -0.0943  0.0000  0.8170 13.2000

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -1.210      2.106    -0.6    0.566
    bmi            0.209      0.078     2.7    0.008 **

IV estimation: can untestable assumptions be tested?

    > summary(lm(glc~bmi+fto,data=fen))
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    1.985      0.106   18.75   <2e-16 ***
    bmi            0.088      0.004   23.36   <2e-16 ***
    fto            0.049      0.030    1.66    0.097 .

For Type 2 Diabetes:

    > summary(glm(t2d~bmi+fto,data=fen,family=binomial))
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -7.515      0.187  -40.18   <2e-16 ***
    bmi            0.185      0.006   31.66   <2e-16 ***
    fto            0.095      0.047    2.01    0.044 *

Does FTO have a direct effect on Glc or T2D? A significant FTO effect would not be a proof here (nor does non-significance prove the opposite)! (WHY?)

A general association structure with one genotype and two phenotypes

[Figure: graph of genotype G, phenotypes X and Y and unobserved confounder U, with edge coefficients $\beta_{gx}$, $\beta_{gy}$, $\beta_{xy}$, $\beta_{ux}$, $\beta_{uy}$]

If $\beta_{gy} \neq 0$, the genotype G is said to have a pleiotropic effect on variables X and Y.
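One genotype and two phenotypes

[Figure: the same graph of G, X, Y and U with edge coefficients $\beta_{gx}$, $\beta_{gy}$, $\beta_{xy}$, $\beta_{ux}$, $\beta_{uy}$]

Note that if one fits a linear regression model for Y, with G as the only covariate, one estimates:

$$E(Y \mid G) = E(\text{const} + \beta_{xy} X + \beta_{gy} G + \beta_{uy} U \mid G) = \text{const} + \beta_{xy}(\beta_{gx} G) + \beta_{gy} G = \text{const} + (\beta_{gx}\beta_{xy} + \beta_{gy})\, G.$$

So when one uses the MR approach here (incorrectly assuming no direct effect of G on Y), one estimates:

$$\frac{\beta_{gx}\beta_{xy} + \beta_{gy}}{\beta_{gx}} = \beta_{xy} + \frac{\beta_{gy}}{\beta_{gx}}.$$

Can we test pleiotropy?

A naïve approach would be to fit a linear regression model for Y, with both X and G as covariates. But in this case we estimate:

$$E(Y \mid X, G) = \text{const} + \beta_{gy} G + \beta_{xy} X + \beta_{uy} E(U \mid X, G).$$

As it is possible to show that (assuming standardized variables)

$$E(U \mid X, G) = \text{const} + \frac{\beta_{ux}}{1 - \beta_{gx}^2}\,(X - \beta_{gx} G),$$

we get

$$E(Y \mid X, G) = \text{const} + \left[\beta_{xy} + \frac{\beta_{ux}\beta_{uy}}{1 - \beta_{gx}^2}\right] X + \left[\beta_{gy} - \frac{\beta_{gx}\beta_{ux}\beta_{uy}}{1 - \beta_{gx}^2}\right] G.$$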
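The expression for $E(U \mid X, G)$ can be reconstructed as a linear projection. The derivation below is a sketch under assumptions that are only partly stated on the slide: G and U are uncorrelated, G, U and X all have unit variance, the error term $\varepsilon_x$ (a symbol introduced here) is uncorrelated with G and U, and the conditional expectation is linear, as it is under joint normality. The projection coefficients $a_X$ and $a_G$ are likewise notation introduced for this sketch.

```latex
\begin{align*}
X &= \beta_{gx} G + \beta_{ux} U + \varepsilon_x, \qquad
  \operatorname{Cov}(G, U) = 0, \quad
  \operatorname{Var}(G) = \operatorname{Var}(U) = \operatorname{Var}(X) = 1, \\
\operatorname{Cov}(U, X) &= \beta_{ux}, \qquad
  \operatorname{Cov}(U, G) = 0, \qquad
  \operatorname{Cov}(X, G) = \beta_{gx}, \\
\begin{pmatrix} a_X \\ a_G \end{pmatrix}
  &= \begin{pmatrix} 1 & \beta_{gx} \\ \beta_{gx} & 1 \end{pmatrix}^{-1}
     \begin{pmatrix} \beta_{ux} \\ 0 \end{pmatrix}
   = \frac{1}{1 - \beta_{gx}^2}
     \begin{pmatrix} \beta_{ux} \\ -\beta_{gx}\beta_{ux} \end{pmatrix}, \\
E(U \mid X, G) &= \text{const} + a_X X + a_G G
   = \text{const} + \frac{\beta_{ux}}{1 - \beta_{gx}^2}\,(X - \beta_{gx} G).
\end{align*}
```

Multiplying this by $\beta_{uy}$ and collecting the terms in X and G gives the two bracketed coefficients in $E(Y \mid X, G)$.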
One genotype and two phenotypes: linear models for Y

What do we estimate by fitting different linear models for Y?

Estimable coefficients:

Covariates   Coefficient of X                                              Coefficient of G
X, G, U      $\beta_{xy}$                                                  $\beta_{gy}$
G            -                                                             $\beta_{gy} + \beta_{gx}\beta_{xy}$
X            $\beta_{xy} + \beta_{gx}\beta_{gy} + \beta_{ux}\beta_{uy}$    -
X, G         $\beta_{xy} + \frac{\beta_{ux}\beta_{uy}}{1-\beta_{gx}^2}$    $\beta_{gy} - \frac{\beta_{gx}\beta_{ux}\beta_{uy}}{1-\beta_{gx}^2}$

Some references

An excellent overview of Mendelian randomization:
Sheehan, N., Didelez, V., Burton, P., Tobin, M. Mendelian Randomization and Causal Inference in Observational Epidemiology. PLoS Med. 2008 August; 5(8). http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2522255

A recent review on causality in genetics:
Vansteelandt, S., Lange, C. Causation and causal inference for genetic effects. Human Genet. 2012; 131:1665-1676.
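The coefficients listed for the different linear models for Y can be checked numerically. The sketch below, in base R, simulates the one-genotype, two-phenotype structure with G, U and X standardized and compares fitted regression coefficients with the formulas. All parameter values are illustrative assumptions, and G is drawn as a continuous standardized score rather than a 0/1/2 allele count to keep the standardization simple.

```r
# Sketch: check the coefficient formulas by simulation (illustrative values only).
set.seed(2)
n    <- 200000
b_gx <- 0.3; b_ux <- 0.5                 # effects of G and U on X (assumed)
b_xy <- 0.2; b_gy <- 0.1; b_uy <- 0.4    # effects of X, G and U on Y (assumed)

g <- rnorm(n)   # standardized genotype score (continuous for simplicity)
u <- rnorm(n)   # standardized unobserved confounder
# Residual variance chosen so that Var(X) = 1, as the formulas assume
x <- b_gx * g + b_ux * u + rnorm(n, sd = sqrt(1 - b_gx^2 - b_ux^2))
y <- b_xy * x + b_gy * g + b_uy * u + rnorm(n)

bias <- b_ux * b_uy / (1 - b_gx^2)   # confounding term in the model with X and G

expected <- c(
  G_only   = b_gy + b_gx * b_xy,
  X_only   = b_xy + b_gx * b_gy + b_ux * b_uy,
  XG_coefX = b_xy + bias,
  XG_coefG = b_gy - b_gx * bias
)

fit_xg <- coef(lm(y ~ x + g))
estimated <- c(
  G_only   = unname(coef(lm(y ~ g))["g"]),
  X_only   = unname(coef(lm(y ~ x))["x"]),
  XG_coefX = unname(fit_xg["x"]),
  XG_coefG = unname(fit_xg["g"])
)

round(rbind(expected, estimated), 3)
```

With these values the coefficient of G in the model containing both X and G is smaller than the assumed direct effect `b_gy`, which illustrates why adjusting for X does not isolate the pleiotropic effect when U is unobserved.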
Example: FTO genotype (G), BMI (X) and other outcomes (data: Estonian Biobank)

N = 12,740
Effect of FTO on BMI, E(X|G): estimated coefficient of G = 0.08 (se = 0.01)

Outcome Y   Scaled BMI on Y:         FTO on Y:        BMI-adjusted effect of FTO on Y:
            coef. of X in E(Y|X)     E(Y|G)           E(Y|X,G)
SBP          0.40  (0.008)            0.025 (0.013)   -0.008 (0.011)
HDL         -0.34  (0.011)           -0.008 (0.016)    0.020 (0.016)
TG           0.068 (0.002)            0.041 (0.016)   -0.009 (0.015)
T2D*         1.02  (0.030)            0.160 (0.041)    0.090 (0.040)
*logistic regression

Direct effect?

A simple simulated example

N = 50000, all non-zero coefficients are highly significant

True parameters                 Estimated parameters
                                Coef. of X   Coefficient of G          MR
G->X   G->Y   X->Y   U          Y|X          X|G     Y|G     Y|G,X     Y|X inst. G
0.2    0.1    0.3    0.4        0.50         0.20    0.16     0.06     0.8
0.2    0.1    0.3    0          0.30         0.20    0.16     0.10     0.8
0.2    0      0.3    0.4        0.50         0.20    0.06    -0.04     0.3
0      0.1    0.3    0.4        0.50         0       0.10     0.10     NA
0.2    0.1    0      0.4        0.20         0.20    0.10     0.07     0.5

MultiPhen analysis idea (O'Reilly et al., PLoS One 2012)

The idea: with correlated phenotypes, use the genotype as the outcome and the phenotypes as covariates (proportional odds regression).
Thus in our setting, regress G on X and Y.
However, if $E(Y \mid X, G) = g_1(X) + g_2(G)$ (for some $g_1$ and $g_2$), then, regardless of the causal mechanism, $E(G \mid X, Y) = h_1(X) + h_2(Y)$ (for some $h_1$ and $h_2$).
MultiPhen is a useful tool for detecting associations with correlated phenotypes, but NOT for estimating causal parameters.

A simple simulated example

N = 50000, all non-zero coefficients are highly significant

True parameters                 Estimated parameters
                                Coef. of X   Coefficient of G          MR            MultiPhen
G->X   G->Y   X->Y   U          Y|X          X|G     Y|G     Y|G,X     Y|X inst. G   G|Y adj. X   G|X adj. Y
0.2    0.1    0.3    0.4        0.50         0.20    0.16     0.06     0.8            0.13         0.10
0.2    0.1    0.3    0          0.30         0.20    0.16     0.10     0.8            0.13         0.15
0.2    0      0.3    0.4        0.50         0.20    0.06    -0.04     0.3           -0.06         0.23
0      0.1    0.3    0.4        0.50         0       0.10     0.10     NA             0.15        -0.09
0.2    0.1    0      0.4        0.20         0.20    0.10     0.07     0.5            0.09         0.15
0.2    0      0.3    0          0.30         0.20    0.06     0        0.3            0            0.20

Mendelian randomization: more on assumptions

The causal effect is defined via potential outcomes:

$$E(Y - Y_0 \mid G, X) = \beta\,(X - X_0),$$

where $Y_0$ is the potential exposure-free outcome (if $X_0 = 0$) or the outcome at a potential baseline exposure level $X_0$.

This assumes the same effect of X at each level of G: no exposure effect heterogeneity.

One way to understand this assumption is via principal stratification, which is easily understood in the context of noncompliance analysis of randomized trials.
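For the noncompliance setting, the principal-stratification argument can be written out as follows. This is a sketch in assumed notation: R is the randomized assignment, X the treatment actually received, $p_{is} = P(Y = 1 \mid R = i, \text{Stratum} = s)$, and $\pi_A$, $\pi_C$, $\pi_N$ are the proportions of always takers, compliers and never takers. It further assumes that there are no defiers and that assignment affects Y only through X, so that $p_{1A} = p_{0A}$ and $p_{1N} = p_{0N}$.

```latex
\begin{align*}
P(Y = 1 \mid R = 1) - P(Y = 1 \mid R = 0)
  &= \sum_{s \in \{A, C, N\}} \pi_s\,(p_{1s} - p_{0s})
   = \pi_C\,(p_{1C} - p_{0C}), \\
P(X = 1 \mid R = 1) - P(X = 1 \mid R = 0)
  &= (\pi_A + \pi_C) - \pi_A = \pi_C, \\
p_{1C} - p_{0C}
  &= \frac{P(Y = 1 \mid R = 1) - P(Y = 1 \mid R = 0)}
          {P(X = 1 \mid R = 1) - P(X = 1 \mid R = 0)}.
\end{align*}
```

The last line has the same ratio form as the instrumental variable estimate, with the genotype G taking the role of R in Mendelian randomization. It identifies the effect among compliers only; carrying it over to the other strata requires the assumption of no exposure effect heterogeneity.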
Classical vs Mendelian Randomization

A Randomized Clinical Trial (RCT)
[Figure: graph R -> X -> Y with unobserved confounders U of X and Y; R: random assignment (treatment / control), X: received treatment, Y: outcome]
- Association between R and Y is unconfounded and present only when the X-Y association is present

Mendelian Randomization (MR)
[Figure: graph G -> X -> Y with unobserved confounders U of X and Y; G: genotype, X: exposure phenotype, Y: outcome phenotype]
- Association between G and Y is unconfounded and present only when the X-Y association is present

Estimating Complier Average Causal Effect (CACE): assumptions

As R has no direct effect on Y, there is:
- No assignment effect in never takers
- No assignment effect in always takers
The estimated causal effect is only valid for compliers.

Stratum          Outcome probabilities
Always takers    $p_{1A} = p_{0A}$
Compliers        $p_{1C}$ and $p_{0C}$ may differ
Never takers     $p_{1N} = p_{0N}$

Outcome probabilities: $p_{is} = P(Y = 1 \mid R = i, \text{Stratum} = s)$, where R is the assigned treatment.

Principal stratification and Mendelian randomization (ignoring heterozygotes)

[Figure: principal strata by FTO genotype, A/A vs T/T]
- Always takers: overweight regardless of their genotype
- Compliers: overweight when having risk alleles of the FTO genotype, normal weight otherwise
- Never takers: normal weight even when having the FTO risk genotype

Do the principal strata exist?

- There is a proven causal association between FTO genotype and overweight status.
- This means there must exist individuals whose overweight is caused by their FTO risk alleles.
- So there also exist individuals who have normal weight only because they do not have FTO risk alleles.
- Any differences in T2D risk between people with different genotypes can only come from this stratum of compliers.

T2D, overweight and FTO example

- The estimated effect is valid in the stratum of compliers (estimated as 10% of the individuals).
- Extending this to other principal strata involves the assumption of no exposure effect heterogeneity.

Summary on causal analysis in genomics data

Association is not causality: an old truth, but one that still bears repeating when analyzing omics data. In most cases, causal inference relies on statistically untestable assumptions.
The assumptions should be verified based on external knowledge (biology). There are no forbidden models, but it is important to understand the interpretation of model parameters given realistic assumptions. There are always unobserved confounders between health phenotypes!