Causal Graphical Models in Systems Genetics


2013 Network Analysis Short Course - UCLA Human Genetics
Elias Chaibub Neto and Brian S Yandell
July 17, 2013

Motivation and basic concepts

Motivation

Suppose the expression of gene G is associated with a clinical phenotype C. We want to know whether G → C or C → G. We cannot distinguish between these models using the data alone, since f(G) f(C | G) = f(G, C) = f(C) f(G | C), so their likelihood scores are identical.
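This likelihood equivalence is easy to verify numerically. The sketch below (an illustration added here, not part of the original slides) fits Gaussian models in both causal directions by maximum likelihood and confirms that the two factorizations score the data identically:

```python
import numpy as np

def loglik_direction(x, y):
    """Maximized log-likelihood of the model x -> y, i.e. f(x) * f(y | x),
    with Gaussian errors and MLE plug-in estimates."""
    n = len(x)
    # marginal f(x): Normal with MLE mean and variance (divide by n)
    vx = np.var(x)
    ll_x = -0.5 * n * (np.log(2 * np.pi * vx) + 1)
    # conditional f(y | x): simple linear regression, MLE residual variance
    b = np.cov(x, y, bias=True)[0, 1] / vx
    a = np.mean(y) - b * np.mean(x)
    resid = y - (a + b * x)
    vr = np.var(resid)
    ll_y_given_x = -0.5 * n * (np.log(2 * np.pi * vr) + 1)
    return ll_x + ll_y_given_x

rng = np.random.default_rng(1)
G = rng.normal(size=200)
C = 0.8 * G + rng.normal(scale=0.5, size=200)

ll_GC = loglik_direction(G, C)   # model G -> C
ll_CG = loglik_direction(C, G)   # model C -> G
# The two factorizations attain the same maximized likelihood.
```

Both fits recover the same bivariate Gaussian, so the log-likelihoods agree to numerical precision, regardless of which direction generated the data.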

Schadt et al. (2005)

However, if G and C map to the same QTL Q, we can use genetics to infer the causal ordering among the phenotypes:

causal: Q → G → C
reactive: Q → C → G
independent: G ← Q → C

[Figure: LOD profiles showing G and C mapping to the same QTL Q.]

Schadt et al. (2005)

For a drug company, it is important to determine which genes are causal and which are reactive: causal genes have the potential to become drug targets, whereas reactive genes are of lesser interest.

Genetics and causal inference

The integration of genetic and phenotype data allows us to infer causal relations between phenotypes for two reasons:
1. In experimental crosses, the association of a QTL and a phenotype is causal.
2. A causal QTL can be used to determine the causal order between phenotypes using the concept of conditional independence.

Causal relations between QTLs and phenotypes

In experimental crosses, the association of a QTL and a phenotype is causal. Why is it so? QTL mapping is analogous to a randomized experiment (Li et al. 2006), and randomization is considered the gold standard for causal inference. Causality can be inferred from a randomized experiment since:
1. Application of a treatment to an experimental unit precedes the observation of the outcome (genotype precedes phenotype).
2. Because the treatment levels are randomized across the experimental units, the effects of confounding variables get averaged out (the Mendelian randomization of alleles during meiosis averages out the effects of other unlinked loci on the phenotype).

Causal relations between QTLs and phenotypes

[Figure: LOD profiles for QTLs A (chromosome 1) and B (chromosome 2); phenotype distributions by genotype (Aa vs AA, Bb vs BB) show the effect of each QTL after the effect of the other is averaged out.]

Conditional independence as the key to causal ordering

Model: Q → G → C. Marginally, both G and C depend on the QTL genotype; but conditionally on G, the residuals res(C | G) show no genotype effect.

[Figure: boxplots of G, C, and res(C | G) by genotype (Aa vs AA), illustrating marginal dependence and conditional independence.]
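A quick simulation (illustrative only; the effect sizes are arbitrary assumptions) shows the pattern described above: under Q → G → C, the phenotype C is marginally associated with the genotype, while the residuals res(C | G) are not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
Q = rng.integers(0, 2, size=n).astype(float)   # genotype, e.g. AA vs Aa
G = 1.0 * Q + rng.normal(scale=0.5, size=n)    # Q -> G
C = 0.9 * G + rng.normal(scale=0.5, size=n)    # G -> C

def residuals(y, x):
    """Residuals of the least-squares regression y ~ x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

marginal = abs(np.corrcoef(Q, C)[0, 1])                    # clearly nonzero
conditional = abs(np.corrcoef(Q, residuals(C, G))[0, 1])   # near zero
```

The marginal correlation between Q and C is substantial, while after regressing out G it collapses toward zero, which is exactly the conditional independence C ⊥ Q | G implied by the chain.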

Causal ordering between phenotypes

[Figure: boxplots of res(C | G) and res(G | C) by genotype (Aa vs AA) under the causal (Q → G → C), reactive (Q → C → G), and independent models.]

Causal ordering between phenotypes

In general (although not always): models that share the same set of conditional independence relations (Markov-equivalent models) cannot be distinguished using the data, since they have equivalent likelihood functions, whereas models with distinct sets of conditional independence relations can be distinguished.

Causality tests for pairs of phenotypes

causal: Q → G → C
reactive: Q → C → G
independent: G ← Q → C

Pairwise models as collapsed versions of more complex networks

[Figure: examples (a)-(e) of larger networks over Q, Y1, Y2, and additional phenotypes, together with the pairwise models they collapse to.]

A causal relation might be direct or mediated by other phenotypes. Pairwise models are misspecified.

Schadt et al. (2005)

Using this approach, Schadt et al. (2005) were able to identify, and experimentally validate, genes related to obesity in a mouse cross. So what is the issue, then? Model selection via AIC or BIC scores does not provide a measure of uncertainty associated with the model selection call. With noisy data, model selection can lead to a large number of false positives.

The issue: an illustration

For each of 1,000 simulations we:
1. Generate noisy data from the model Q → Y1 → Y2.
2. Fit models M1: Q → Y1 → Y2 and M2: Q → Y2 → Y1.
3. Compute the log-likelihood ratio LR12. If LR12 > 0, select M1; if LR12 < 0, select M2.

[Figure: scatter of R²(Y2 = Q + ε) against R²(Y1 = Q + ε); 682 true positives and 318 false positives.]
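A scaled-down version of this simulation can be sketched as follows (200 rather than 1,000 replicates, with arbitrary effect sizes; f(Q) is common to both models and cancels in the likelihood ratio):

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_ll(resid):
    """Maximized Gaussian log-likelihood given regression residuals."""
    n, v = len(resid), np.var(resid)
    return -0.5 * n * (np.log(2 * np.pi * v) + 1)

def reg_resid(y, x):
    """Residuals of the least-squares regression y ~ x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

picks = []
for _ in range(200):
    n = 100
    Q = rng.integers(0, 2, n).astype(float)
    Y1 = 0.5 * Q + rng.normal(size=n)      # weak QTL effect -> noisy data
    Y2 = 0.5 * Y1 + rng.normal(size=n)
    # M1: Q -> Y1 -> Y2 scores f(Y1|Q) f(Y2|Y1);  M2: Q -> Y2 -> Y1
    ll_m1 = gauss_ll(reg_resid(Y1, Q)) + gauss_ll(reg_resid(Y2, Y1))
    ll_m2 = gauss_ll(reg_resid(Y2, Q)) + gauss_ll(reg_resid(Y1, Y2))
    picks.append(1 if ll_m1 > ll_m2 else 2)

true_pos = picks.count(1)    # correct calls (M1 generated the data)
false_pos = picks.count(2)   # wrong calls, with no uncertainty attached
```

With weak QTL effects a noticeable fraction of replicates selects the wrong direction, and the raw likelihood-ratio rule offers no way to flag which calls are unreliable.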

Issue: no measure of uncertainty for a model selection call

We want a statistical procedure that attaches a measure of uncertainty to a model selection call. Given the characteristics of our application problem, it:
1. Needs to handle misspecified models.
2. Needs to handle non-nested models, e.g., M1: Q → Y1 → Y2 versus M2: Y1 ← Q → Y2.
3. Should, ideally, be fully analytical for the sake of computational efficiency.

Assessing the significance of a model selection call

Vuong's model selection test (Vuong 1989) satisfies these three criteria.

[Figure: the same simulation as before; with Vuong's test there are 65 true positives, 1 false positive, and 934 no-calls.]

Vuong's model selection test (Vuong 1989)

Consider two competing models, M1 and M2. Vuong's test considers the hypotheses:

H0: M1 is not closer to the true model than M2,
H1: M1 is closer to the true model than M2,

where, under H0, the scaled log-likelihood-ratio test statistic

Z_12 = LR_12 / (sqrt(n) σ̂_12) → N(0, 1) in distribution,

with LR_12 = Σ_{i=1}^{n} (log f̂_{1,i} − log f̂_{2,i}), and σ̂_12 the sample standard deviation of the individual log-likelihood-ratio scores.
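A minimal implementation of the test statistic, assuming the per-observation fitted log-likelihoods for both models are available:

```python
import math
import numpy as np

def vuong_test(ll1, ll2):
    """Vuong's (1989) test from per-observation fitted log-likelihoods.

    ll1, ll2: arrays of log f-hat_{1,i} and log f-hat_{2,i}.
    Returns (z, p) for the one-sided test of
    H0: M1 is not closer to the truth than M2."""
    d = np.asarray(ll1, dtype=float) - np.asarray(ll2, dtype=float)
    n = len(d)
    # scaled log-likelihood ratio: sum(d) / (sqrt(n) * sd(d))
    z = d.sum() / (math.sqrt(n) * d.std())
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail N(0,1) p-value
    return z, p
```

Reversing the model order flips the sign of z, so the same function serves for testing either direction.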

Causal Model Selection Tests (CMST)

Vuong's test handles model selection for two models only. However, we want to use data from experimental crosses to distinguish among four models:

M1 (causal): Q → Y1 → Y2
M2 (reactive): Q → Y2 → Y1
M3 (independent): Y1 ← Q → Y2
M4 (full): Q affects both phenotypes, which remain dependent given Q

M4 has three likelihood-equivalent versions, M4a, M4b, and M4c, which differ in how the dependence between Y1 and Y2 is represented (Y1 → Y2, Y2 → Y1, or a latent common cause).

Causal Model Selection Tests (CMST)

Combine several separate Vuong tests into a single one. Three versions:
1. Parametric CMST: intersection-union test of three Vuong tests (M1 vs M2, M1 vs M3, M1 vs M4), testing
H0: M1 is not closer to the true model than M2, M3, or M4, against
H1: M1 is closer to the true model than each of M2, M3, and M4.
2. Non-parametric CMST: intersection-union test of three paired sign tests (Clarke's test).
3. Joint-parametric CMST: extension of the parametric CMST that accounts for the correlation among the test statistics of the Vuong tests.
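Because an intersection-union test rejects only if every component test rejects, its p-value is the maximum of the component p-values. A sketch of the parametric CMST decision rule (function name and interface are illustrative):

```python
def cmst_parametric(p12, p13, p14, alpha=0.05):
    """Intersection-union combination of three pairwise Vuong p-values
    (M1 vs M2, M1 vs M3, M1 vs M4). M1 is declared closest to the truth
    only if every pairwise test rejects, so the combined p-value is the
    maximum of the three."""
    p = max(p12, p13, p14)
    return p, p < alpha
```

For example, p-values (0.01, 0.02, 0.03) yield a combined p-value of 0.03 and a significant causal call at alpha = 0.05, while (0.01, 0.2, 0.03) yields 0.2 and a no-call.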

Yeast data analysis

Budding yeast genetical genomics data set (Brem and Kruglyak 2005), with data on 112 strains:
- Expression measurements on 5,740 transcripts.
- Dense genotype data on 2,956 markers.
Most importantly, we evaluated the precision of the causal predictions using validated causal relationships extracted from a database of 247 knockout experiments in yeast (Hughes et al. 2000, Zhu et al. 2008).

Knockout signatures

In each experiment, one gene was knocked out, and the expression levels of the remaining genes in control and knockout strains were interrogated for differential expression. The set of differentially expressed genes forms the knockout signature (ko-signature) of the knocked-out gene (ko-gene). The ko-signature represents a validated set of causal relations.

Validation using yeast knockout signatures

To leverage the ko information, we:
1. Determined which of the 247 ko-genes also showed a significant QTL in our data set.
2. For each ko-gene showing significant linkages, determined which other genes co-mapped to the ko-gene's QTL, generating in this way a list of putative targets of the ko-gene.
3. For each ko-gene/putative-targets list, applied all methods using the ko-gene as the Y1 phenotype, the putative target genes as the Y2 phenotypes, and the ko-gene's QTL as the causal anchor.

Validation using yeast knockout signatures

In total, 135 ko-genes showed significant linkages (both cis and trans). The number of genes in the target lists varied from ko-gene to ko-gene, but in total there were 31,936 targets.

Validation using yeast knockout signatures

Performance is measured in terms of biologically validated TP, FP, and precision:
- TP: a statistically significant causal relation between a ko-gene and a putative target gene, where the putative target belongs to the ko-signature of the ko-gene.
- FP: a statistically significant causal relation between a ko-gene and a putative target gene, where the target does not belong to the ko-signature.
The validated precision is computed as the ratio of true positives to the sum of true and false positives.

Results: cis and trans ko-genes

[Figure: number of true positives, number of false positives, and precision as a function of the nominal significance level. Black: BIC; blue: joint CMST + BIC; green: parametric CMST + BIC; red: non-parametric CMST + BIC.]

Results: cis ko-genes only

27 out of the 135 candidate regulator ko-genes mapped in cis.

[Figure: true positives, false positives, and precision versus nominal significance level for cis ko-genes. Black: BIC; blue: joint CMST + BIC; green: parametric CMST + BIC; red: non-parametric CMST + BIC.]

Precision side by side

[Figure: precision versus nominal significance level for the cis-and-trans and cis-only analyses. Black: BIC; blue: joint CMST + BIC; green: parametric CMST + BIC; red: non-parametric CMST + BIC.]

Cis-vs-trans case

Why is the cis-vs-trans case easier than the trans-vs-trans case? In general, cis-linkages tend to be stronger than trans-linkages.

[Figure: scatter of R²(Y2 = Q + ε) against R²(Y1 = Q + ε) for cis and trans linkages.]

Conclusions

CMST tests trade a reduction in the rate of false positives for a decrease in statistical power. Whether a more powerful but less precise method, or a less powerful but more precise one, is more adequate depends on the biologist's research goals and resources:
- If the biologist can easily validate several genes, a larger list generated by a more powerful and less precise method might be more appealing.
- If follow-up studies are time consuming and expensive, and only a few candidates can be studied in detail, a more precise method that conservatively identifies candidates with high confidence can be more appealing.

Causal Bayesian networks and the QTLnet algorithm

[Figure: inferred causal network over yeast transcripts (e.g., YOL084W, YOR028C, YAL061W) and QTL nodes (chr15@61.14, chr15@170.71, chr14@247.65, chr10@47.16).]

Standard Bayesian networks

A graphical model is a multivariate probabilistic model whose conditional independence relations are represented by a graph. Bayesian networks are directed acyclic graph (DAG) models. For the example DAG with edges 1 → 3, 2 → 3, 3 → 5, 4 → 5, and 5 → 6, and assuming the Markov property, the joint distribution factors according to the conditional independence relations:

P(1, 2, 3, 4, 5, 6) = P(6 | 5) P(5 | 3, 4) P(4) P(3 | 1, 2) P(2) P(1),

so that 6 ⊥ {1, 2, 3, 4} | 5, 5 ⊥ {1, 2} | {3, 4}, and so on; i.e., each node is independent of its non-descendants given its parents.
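The factorization can be checked by brute-force enumeration on a small discrete network. The sketch below builds this DAG with binary nodes and random conditional probability tables, and verifies both that the factorized product is a proper joint distribution and that node 6 depends on its non-descendants only through its parent, node 5:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Random CPTs for binary nodes; DAG from the slide:
# 1, 2, 4 are roots; 3 <- {1, 2}; 5 <- {3, 4}; 6 <- 5
p1, p2, p4 = rng.uniform(0.2, 0.8, 3)
p3 = rng.uniform(0.2, 0.8, (2, 2))   # P(3=1 | 1, 2)
p5 = rng.uniform(0.2, 0.8, (2, 2))   # P(5=1 | 3, 4)
p6 = rng.uniform(0.2, 0.8, 2)        # P(6=1 | 5)

def bern(p, x):
    """Bernoulli probability mass: P(X = x) for P(X = 1) = p."""
    return p if x == 1 else 1 - p

joint = {}
for x in itertools.product([0, 1], repeat=6):
    x1, x2, x3, x4, x5, x6 = x
    joint[x] = (bern(p1, x1) * bern(p2, x2) * bern(p4, x4)
                * bern(p3[x1, x2], x3) * bern(p5[x3, x4], x5)
                * bern(p6[x5], x6))

total = sum(joint.values())   # a valid joint distribution sums to 1

def cond6(x1, x2, x3, x4, x5):
    """P(6 = 1 | 1..5): by the Markov property it depends only on x5."""
    num = joint[(x1, x2, x3, x4, x5, 1)]
    den = num + joint[(x1, x2, x3, x4, x5, 0)]
    return num / den
```

However nodes 1 through 4 are set, the conditional probability of node 6 stays fixed at P(6 = 1 | 5), which is the statement 6 ⊥ {1, 2, 3, 4} | 5 in computational form.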

Standard Bayesian networks and causality

Even though the directed edges in a Bayes net are often interpreted as causal relations, in reality they only represent conditional dependencies. Different phenotype networks, for instance

Y1 → Y2 → Y3,  Y1 ← Y2 ← Y3,  Y1 ← Y2 → Y3,

can represent the same set of conditional independence relations (Y1 ⊥ Y3 | Y2, in this example). When this is the case, we say the networks are Markov equivalent.

Standard Bayesian networks and causality

In general, Markov equivalence implies distribution equivalence (equivalence of likelihood functions). Hence, model selection criteria cannot distinguish between Markov-equivalent networks. The best we can do is to learn equivalence classes of likelihood-equivalent phenotype networks from the data.

Genetics as a way to reduce the size of equivalence classes

The incorporation of genetic information can help distinguish between likelihood-equivalent networks in two distinct ways:
1. By creating priors for the network structures, using the results of causality tests (Zhu et al. 2007).
2. By augmenting the phenotype network with QTL nodes, creating new sets of conditional independence relations (Chaibub Neto et al. 2010).

Genetic priors

Consider the networks M1: Y1 → Y2 → Y3 and M2: Y1 ← Y2 ← Y3. These Markov-equivalent networks have the same likelihood, i.e., P(D | M1) = P(D | M2). If the phenotypes are associated with QTLs, we can use the results of the causality tests to compute prior probabilities for the network structures. If P(M1) / P(M2) ≠ 1, then

P(M1 | D) / P(M2 | D) = [P(D | M1) P(M1)] / [P(D | M2) P(M2)] ≠ 1,

and we can use the posterior probability ratio to distinguish between the networks.
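In code the point is one line: for Markov-equivalent models the likelihood ratio equals 1, so any difference in the genetic priors carries straight through to the posterior odds (the numbers below are illustrative assumptions):

```python
import math

def posterior_odds(loglik1, loglik2, prior1, prior2):
    """Posterior odds P(M1 | D) / P(M2 | D) via Bayes' rule:
    likelihood ratio times prior odds."""
    return math.exp(loglik1 - loglik2) * (prior1 / prior2)

# Markov-equivalent networks: identical log-likelihoods (here both -100,
# an arbitrary illustrative value), so the causality-test priors alone
# determine the posterior odds.
odds = posterior_odds(-100.0, -100.0, 0.7, 0.3)
```

With equal likelihoods and priors of 0.7 versus 0.3, the posterior odds are 7 : 3 in favor of M1.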

Augmenting the phenotype network with QTLs

Consider the Markov-equivalent networks M1: Y1 → Y2 → Y3 and M2: Y1 ← Y2 ← Y3. By augmenting the phenotype network with a QTL node, giving M1: Q → Y1 → Y2 → Y3 and M2: Q → Y1 ← Y2 ← Y3, the two models acquire distinct sets of conditional independence relations:

Y2 ⊥ Q | Y1 and Y1 ⊥ Y3 | Y2 in M1;
Y2 ⊥ Q marginally (but not given the collider Y1) and Y1 ⊥ Y3 | Y2 in M2.

Hence, M1 and M2 are no longer likelihood equivalent.

Learning Bayesian networks from data

Posterior probability of network M_k given the observed data D:

P(M_k | D) = P(D | M_k) P(M_k) / Σ_k P(D | M_k) P(M_k),

where P(D | M_k) = ∫ P(D | θ, M_k) P(θ | M_k) dθ is the prior predictive distribution of D given M_k, and P(M_k) is the prior distribution of network M_k. The marginal distribution of the data, P(D) = Σ_k P(D | M_k) P(M_k), cannot generally be computed analytically because the number of networks is too large.

Learning Bayesian networks from data

Complexity of the learning task:

# of nodes   # of networks
1            1
2            3
3            25
4            543
5            29,281
6            3,781,503
10           4.175099e+18
20           2.344880e+72
30           2.714854e+158

Hence, heuristic search algorithms are essential to traverse the network space efficiently.
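The table follows Robinson's recurrence for counting labeled DAGs, a_n = Σ_{k=1}^{n} (-1)^{k+1} C(n, k) 2^{k(n-k)} a_{n-k} with a_0 = 1, which can be sketched as:

```python
from math import comb

def num_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence).

    The inclusion-exclusion runs over the k nodes with no incoming
    edges; each of the remaining n-k nodes may or may not receive an
    edge from each of those k, giving the 2^(k(n-k)) factor."""
    a = [1]  # a[0] = 1
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k)
                     * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]
```

The recurrence reproduces the table exactly (25 for 3 nodes, 543 for 4, 29,281 for 5, 3,781,503 for 6), and the super-exponential growth is what forces heuristic search.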

QTLnet algorithm

Perform joint inference of the causal phenotype network and the associated genetic architecture. The genetic architecture is inferred conditional on the phenotype network. Because the phenotype network structure is itself unknown, the algorithm iterates between updating the network structure and the genetic architecture using a Markov chain Monte Carlo (MCMC) approach. QTLnet corresponds to a mixed Bayesian network with continuous and discrete nodes representing phenotypes and QTLs, respectively.

QTLnet algorithm - standard structure sampler

Bayesian model averaging

[Figure: posterior probabilities of ten four-node networks M1-M10.]

Pr(Y1 → Y2) = Pr(M1) + Pr(M3) + Pr(M4) = 0.54
Pr(no edge between Y1 and Y2) = Pr(M2) + Pr(M5) + Pr(M7) = 0.34
Pr(Y1 ← Y2) = Pr(M6) + Pr(M8) + Pr(M9) + Pr(M10) = 0.12
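Model averaging of a network feature simply sums posterior model probabilities over the models containing that feature. The individual model probabilities below are hypothetical; only the group totals (0.54, 0.34, 0.12) come from the slide:

```python
# Hypothetical posterior model probabilities (assumed values, chosen so
# the group totals match the slide's 0.54 / 0.34 / 0.12):
post = {"M1": 0.20, "M2": 0.14, "M3": 0.18, "M4": 0.16, "M5": 0.12,
        "M6": 0.05, "M7": 0.08, "M8": 0.04, "M9": 0.02, "M10": 0.01}

edge_y1_to_y2 = ["M1", "M3", "M4"]          # models containing Y1 -> Y2
no_edge       = ["M2", "M5", "M7"]          # models with no Y1-Y2 edge
edge_y2_to_y1 = ["M6", "M8", "M9", "M10"]   # models containing Y1 <- Y2

def feature_prob(models):
    """Model-averaged posterior probability of a network feature."""
    return sum(post[m] for m in models)
```

Because the three feature sets partition the model space, the three averaged probabilities sum to 1, and edge-wise summaries like these can be reported even when no single network dominates the posterior.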

Yeast data analysis

We build a causal phenotype network around PHM7. PHM7 is physically located close to the hotspot QTL on chromosome 15.

[Figure: counts of linked transcripts along the genome, showing the chromosome 15 hotspot.]

PHM7 is the cis-gene with the largest number of significant causal calls across all hotspots (23 significant calls at α = 0.001 for the joint CMST).

Yeast data analysis

PHM7 (yellow) shows up at the top of the transcriptional network.

[Figure: inferred transcriptional network with PHM7 at the top, including QTL nodes chr15@61.14, chr15@170.71, chr14@247.65, and chr10@47.16.]

References

1. Chaibub Neto et al. (2013) Modeling causality for pairs of phenotypes in systems genetics. Genetics 193: 1003-1013.
2. Chaibub Neto et al. (2010) Causal graphical models in systems genetics: a unified framework for joint inference of causal network and genetic architecture for correlated phenotypes. Annals of Applied Statistics 4: 320-339.

Software: R/qtlhot and R/qtlnet packages.

Further references:
1. Brem and Kruglyak (2005) PNAS 102: 1572-1577.
2. Clarke (2007) Political Analysis 15: 347-363.
3. Hughes et al. (2000) Cell 102: 109-116.
4. Kullback (1959) Information Theory and Statistics. John Wiley, New York.
5. Li et al. (2006) PLoS Genetics 2: e114.
6. Schadt et al. (2005) Nature Genetics 37: 710-717.
7. Vuong (1989) Econometrica 57: 307-333.
8. Zhu et al. (2008) Nature Genetics 40: 854-861.

Acknowledgments

Co-authors:
Brian S Yandell (Statistics, UW-Madison)
Mark P Keller (Biochemistry, UW-Madison)
Alan D Attie (Biochemistry, UW-Madison)
Bin Zhang (Genetics and Genomic Sciences, MSSM)
Jun Zhu (Genetics and Genomic Sciences, MSSM)
Aimee T Broman (Biochemistry, UW-Madison)

Thank you!