High-dimensional causal inference, graphical modeling and structural equation models

1 High-dimensional causal inference, graphical modeling and structural equation models Peter Bühlmann Seminar für Statistik, ETH Zürich

2 cannot do confirmatory causal inference without randomized intervention experiments... but we can do better than proceeding naively

3 Goal in genomics: if we made an intervention at a single gene, what would be its effect on a phenotype of interest? We want to infer/predict such effects without actually doing the intervention, i.e. from observational data (from observations of a steady-state system). It doesn't need to be genes, and we can generalize to interventions at more than one variable/gene.

5 Policy making James Heckman: Nobel Prize Economics 2000 e.g.: Pritzker Consortium on Early Childhood Development identifies when and how child intervention programs can be most influential

6 Genomics 1. Flowering of Arabidopsis thaliana phenotype/response variable of interest: Y = days to bolting (flowering) covariates X = gene expressions of p genes remark: gene expression = the process by which information from a gene is used in the synthesis of a functional gene product (e.g. a protein) question: infer/predict the effect of knocking-out/knocking-down (or enhancing) a single gene (expression) on the phenotype/response variable Y

7 2. Gene expressions of yeast p = 5360 genes phenotype of interest: Y = expression of first gene covariates X = gene expressions from all other genes and then phenotype of interest: Y = expression of second gene covariates X = gene expressions from all other genes and so on infer/predict the effects of a single gene knock-down on all other genes

8 consider the framework of an intervention effect = causal effect (mathematically defined; see later)

9 Regression, the statistical workhorse : the wrong approach
we could use a linear model (fitted from n observational data points)
Y = Σ_{j=1}^p β_j X^(j) + ε,  Var(X^(j)) ≡ 1 for all j
β_j measures the effect of variable X^(j) in terms of association, i.e. the change of Y as a function of X^(j) when keeping all other variables X^(k) fixed
not very realistic for the intervention problem: if we change e.g. one gene, some others will also change, and these others are not (cannot be) kept fixed

11 and indeed: [figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing] can do much better than (penalized) regression!

13 Effects of single gene knock-downs on all other genes (yeast) (Maathuis, Colombo, Kalisch & PB, 2010)
p = 5360 genes (expression of genes)
231 gene knock-downs → intervention effects; the truth is known in good approximation (thanks to intervention experiments)
goal: prediction of the true large intervention effects based on observational data with no knock-downs, n = 63 observational data points
[figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing]

14 A bit more specifically
univariate response Y, p-dimensional covariate X
question: what is the effect of setting the jth component of X to a certain value x: do(X^(j) = x)
this is a question of intervention type, not the effect of X^(j) on Y when keeping all other variables fixed (regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl, Glymour, Scheines, Spirtes,...

15 Intervention calculus (a review)
dynamic notion of an effect: if we set a variable X^(j) to a value x (intervention), some other variables X^(k) (k ≠ j) and maybe Y will change
we want to quantify the total effect of X^(j) on Y, including all changed X^(k) on Y
a graph or influence diagram will be very useful
[figure: example DAG on X^(1), X^(2), X^(3), X^(4) and Y]

16 for simplicity: just consider DAGs (Directed Acyclic Graphs) [with hidden variables (Spirtes, Glymour & Scheines (1993); Colombo et al. (2012)): much more complicated and not validated with real data]
random variables are represented as nodes in the DAG
assume a Markov condition, saying that X^(j), given X^(pa(j)), is conditionally independent of its non-descendant variables
→ recursive factorization of the joint distribution:
P(Y, X^(1),..., X^(p)) = P(Y | X^(pa(Y))) ∏_{j=1}^p P(X^(j) | X^(pa(j)))
for intervention calculus: use the truncated factorization (e.g. Pearl)

18 assume Markov property for the causal DAG
[figure: DAG without intervention and the truncated DAG under the intervention do(X^(2) = x)]
non-intervention:
P(Y, X^(1), X^(2), X^(3), X^(4)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2)) P(X^(2) | X^(3), X^(4)) P(X^(3)) P(X^(4))
intervention do(X^(2) = x):
P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2) = x) P(X^(3)) P(X^(4))

19 truncated factorization for do(X^(2) = x):
P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2) = x) P(X^(3)) P(X^(4))
P(Y | do(X^(2) = x)) = ∫ P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) dx^(1) dx^(3) dx^(4)
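
To make the truncated factorization concrete, here is a minimal R sketch (not from the slides) for the five-variable example above, with arbitrarily chosen linear Gaussian structural equations (the coefficients 0.8, 0.7, 1.2, 0.5, -0.9 are illustrative assumptions): it simulates the observational distribution and the intervention distribution under do(X^(2) = x) by replacing the structural equation for X^(2) with the constant x, and compares E[Y | do(X^(2) = x)] with the naive conditional mean E[Y | X^(2) ≈ x].

# minimal sketch; coefficients are illustrative, not from the slides
set.seed(1)
simulate <- function(n, do.x2 = NULL) {
  x4 <- rnorm(n)
  x3 <- rnorm(n)
  x2 <- if (is.null(do.x2)) 0.8 * x3 + 0.7 * x4 + rnorm(n) else rep(do.x2, n)
  x1 <- 1.2 * x2 + rnorm(n)
  y  <- 0.5 * x1 - 0.9 * x3 + rnorm(n)
  data.frame(y, x1, x2, x3, x4)
}
obs <- simulate(10^5)                      # observational data
int <- simulate(10^5, do.x2 = 1)           # data generated under do(X2 = 1)
mean(int$y)                                # E[Y | do(X2 = 1)]: approx 0.5 * 1.2 = 0.6
mean(obs$y[abs(obs$x2 - 1) < 0.1])         # naive E[Y | X2 ~ 1]: differs, since X3 confounds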

20 the truncated factorization is a mathematical consequence of the Markov condition (with respect to the causal DAG) for the observational probability distribution P (plus assumption that structural eqns. are autonomous )

21 the intervention distribution P(Y | do(X^(2) = x)) can be calculated from
the observational data distribution → need to estimate conditional distributions
an influence diagram (causal DAG) → need to estimate the structure of a graph/influence diagram
intervention effect: E[Y | do(X^(2) = x)] = ∫ y P(y | do(X^(2) = x)) dy
intervention effect at x_0: ∂/∂x E[Y | do(X^(2) = x)] |_{x = x_0}
in the Gaussian case: (Y, X^(1),..., X^(p)) ∼ N_{p+1}(μ, Σ), ∂/∂x E[Y | do(X^(2) = x)] ≡ θ_2 for all x

22 The backdoor criterion (Pearl, 1993)
we only need to know the local parental set: for Z = X^(pa(X)), if Y ∉ pa(X):
P(Y | do(X = x)) = ∫ P(Y | X = x, Z) dP(Z)
the parental set might not be the minimal set but always suffices
this is a consequence of the global Markov property: X_A ⊥ X_B | X_S whenever A and B are d-separated by S

23 Gaussian case
for the Gaussian case: ∂/∂x E[Y | do(X^(j) = x)] ≡ θ_j for all x
for Y ∉ pa(j): θ_j is the regression parameter of X^(j) in
Y = θ_j X^(j) + Σ_{k ∈ pa(j)} θ_k X^(k) + error
→ only need the parental set and regression
[figure: example DAG with j = 2, pa(j) = {3, 4}]
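
A small R sketch of the parent-adjustment idea, continuing the illustrative linear Gaussian example with assumed coefficients used above: regressing Y on X^(2) together with pa(2) = {X^(3), X^(4)} recovers the total causal effect of X^(2) (here 0.5 * 1.2 = 0.6), whereas the coefficient of X^(2) in a regression of Y on all covariates does not.

set.seed(2)
n <- 10^5
x4 <- rnorm(n); x3 <- rnorm(n)
x2 <- 0.8 * x3 + 0.7 * x4 + rnorm(n)        # pa(2) = {3, 4}
x1 <- 1.2 * x2 + rnorm(n)
y  <- 0.5 * x1 - 0.9 * x3 + rnorm(n)        # total effect of X2 on Y: 0.5 * 1.2 = 0.6
coef(lm(y ~ x2 + x3 + x4))["x2"]            # parent adjustment: approx 0.6
coef(lm(y ~ x1 + x2 + x3 + x4))["x2"]       # regression on all variables: approx 0 (association, not causation)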

24 when having no unmeasured confounder (variable): intervention effect (as defined) = causal effect recap: causal effect = effect from a randomized trial (but we want to infer it without a randomized study... because often we cannot do it, or it is too expensive) structural equation models provide a different (but closely related) route for quantifying intervention effects

27 Inferring intervention effects from the observational distribution
main problem: inferring the DAG from observational data is impossible! can only infer an equivalence class of DAGs (several DAGs can encode exactly the same conditional independence relationships)
Example: X → Y (X causes Y) and X ← Y (Y causes X) are Markov equivalent
a lot of work about identifiability: Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian & Pearl; Lauritzen & Richardson (2002); Shpitser & Pearl; VanderWeele & Robins; Drton, Foygel & Sullivant (2011);...

28 Markov equivalence class of DAGs
P a family of probability distributions on R^{p+1}
Definition: M(D) = {P ∈ P; P is Markov w.r.t. DAG D}, and D ∼ D' ⇔ M(D) = M(D')
the definition depends on the family P; typical models:
P = Gaussian distributions
P = nonparametric model/set of all distributions
P = set of distributions from an additive structural equation model (see later)

29 A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991)
for P = {Gaussian distributions} or P = {nonparametric distributions} it holds:
D ∼ D' ⇔ D and D' have the same skeleton and the same v-structures

30 for a DAG D, we write its corresponding Markov equivalence class E(D) = {D'; D' ∼ D}

31 Equivalence class of DAGs
Several DAGs can encode exactly the same conditional independence relationships. Such DAGs form an equivalence class.
Example: unshielded triple X1 - X2 - X3: the orientations X1 → X2 → X3, X1 ← X2 ← X3 and X1 ← X2 → X3 (no v-structure) encode the same conditional independencies, whereas X1 → X2 ← X3 (a v-structure) encodes different ones.
All DAGs in an equivalence class have the same skeleton and the same v-structures.
An equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG).
[figure: a CPDAG and the DAGs it represents]

32 we cannot estimate causal/intervention effects from the observational distribution, but we will be able to estimate lower bounds of causal effects
conceptual procedure: probability distribution P from a DAG, generating the data → true underlying equivalence class of DAGs (CPDAG)
find all DAG-members of the true equivalence class (CPDAG): D_1,..., D_m
for every DAG-member D_r and every variable X^(j): single intervention effect θ_{r,j}
summarize them by the multi-set Θ = {θ_{r,j}; r = 1,..., m; j = 1,..., p} (an identifiable parameter)

34 IDA (oracle version): oracle → (PC-algorithm) CPDAG → DAG_1,..., DAG_m → (do-calculus) effect_1,..., effect_m → multi-set Θ
[figure: schematic of the oracle IDA pipeline]

35 If you want a single number for every variable... instead of the multi-set Θ = {θ_{r,j}; r = 1,..., m; j = 1,..., p}:
minimal absolute value α_j = min_r |θ_{r,j}| (j = 1,..., p), with |θ_{true,j}| ≥ α_j
the minimal absolute effect α_j is a lower bound for the true absolute intervention effect

36 Computationally tractable algorithm
searching all DAGs is computationally infeasible if p is large (we can actually do this only up to p ≈ 15-20)
instead of finding all m DAGs within an equivalence class: compute all intervention effects without finding all DAGs (Maathuis, Kalisch & PB, 2009)
key idea: exploring local aspects of the graph is sufficient

37 [figure: data → (PC-algorithm) CPDAG → (do-calculus, local) effect_1,..., effect_q → multi-set Θ_L]
the local Θ_L = Θ up to multiplicities (Maathuis, Kalisch & PB, 2009)

38 Estimation from finite samples
notation: drop the Y-notation (Y = X^(1), X^(2),..., X^(p))
difficult part: estimation of the CPDAG (equivalence class of DAGs), e.g. with pcAlgo(dm = d, alpha = 0.05) in R
structural learning: P → CPDAG (equivalence class of DAGs)
two main approaches:
multiple testing of conditional dependencies: PC-algorithm as prime example
score-based methods: MLE as prime example

40 Faithfulness assumption (necessary for conditional dependence testing approaches) a distribution P is called faithful to a DAG G if all conditional independencies can be inferred from the graph (can infer some conditional independencies from a Markov assumption; but we require here all conditional independencies) assuming faithfulness: can infer the CPDAG from a list of conditional (in-)dependence relations

42 What does it mean?
X^(1) ← ε^(1), X^(2) ← α X^(1) + ε^(2), X^(3) ← β X^(1) + γ X^(2) + ε^(3), with ε^(1), ε^(2), ε^(3) i.i.d. N(0, 1)
enforce marginal independence of X^(1) and X^(3): β + αγ = 0, e.g. α = β = 1, γ = -1
then Σ = (1, 1, 0; 1, 2, -1; 0, -1, 2) has a zero (1,3)-entry although the DAG contains the edge X^(1) → X^(3), while Σ^{-1} = (3, -2, -1; -2, 2, 1; -1, 1, 1) does not
failure of faithfulness due to cancellation of coefficients.
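
A quick numerical check of this cancellation (a sketch simulating the three equations above): the marginal correlation of X^(1) and X^(3) is essentially zero even though X^(1) is a parent of X^(3), while the partial correlation given X^(2) is not.

set.seed(3)
n  <- 10^5
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)                 # alpha = 1
x3 <- x1 - x2 + rnorm(n)            # beta = 1, gamma = -1, so beta + alpha*gamma = 0
cor(x1, x3)                         # approx 0: unfaithful marginal independence
cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))   # clearly non-zero: dependence given X2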

43 failure of exact faithfulness is rare (Lebesgue measure zero), but for statistical estimation (in the Gaussian case) one often requires strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):
min{ |ρ(i, j | S)|; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ, τ ≍ sqrt(d log(p)/n)
(d is the maximal degree of the skeleton of the DAG)

44 ... strong faithfulness can be rather severe (Uhler, Raskutti, PB & Yu, 2013)
[figure: 3 nodes, full graph; the surface of unfaithful distributions due to exact cancellation, with a strip of strongly-unfaithful distributions around it → large volume!]
strong faithfulness is restrictive in high dimensions

45 The PC-algorithm (Spirtes & Glymour, 1991)
crucial assumption: the distribution P is faithful to the true underlying DAG
less crucial but convenient: Gaussian assumption for Y, X^(1),..., X^(p) → can work with partial correlations
input: the MLE Σ̂, but we only need to consider many small sub-matrices of it (assuming sparsity of the graph)
output: based on a clever data-dependent (random) sequence of multiple tests → estimated CPDAG

46 PC-algorithm: a rough outline for estimating the skeleton of the underlying DAG
1. start with the full graph
2. remove edge i - j if |Ĉor(X^(i), X^(j))| is small (Fisher's Z-transform and null-distribution of zero correlation)
3. partial correlations of order 1: remove edge i - j if Parcor(X^(i), X^(j) | X^(k)) is small for some k in the current neighborhood of i or j (thanks to faithfulness)
[figure: full graph → correlation screening → partial correlations of order 1 → ... → stopped]

47 4. move up to partial correlations of order 2: remove edge i - j if the partial correlation Parcor(X^(i), X^(j) | X^(k), X^(l)) is small for some k, l in the current neighborhood of i or j (thanks to faithfulness)
5. continue until removal of edges is not possible anymore, i.e. stop at the minimal order of partial correlation where edge-removal becomes impossible
[figure: full graph → correlation screening → partial correlations of order 1 → ... → stopped]
an additional step of the algorithm is needed for estimating directions; this yields an estimate of the CPDAG (equivalence class of DAGs)
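
As an illustration of the conditional-independence tests used in these steps, here is a small self-contained R sketch (not the pcalg implementation): it computes a sample partial correlation ρ̂(i, j | S) by inverting the correlation matrix of the involved variables and tests it for zero with Fisher's Z-transform, which is the building block for the edge-removal decisions above; the toy chain example is an assumption for demonstration.

# partial correlation of variables i and j given the set S, from a data matrix x
parcor <- function(x, i, j, S = integer(0)) {
  P <- solve(cor(x[, c(i, j, S), drop = FALSE]))
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}
# Fisher's Z test for parcor(i, j | S) = 0 at level alpha; TRUE means: remove edge i - j
ci.test <- function(x, i, j, S = integer(0), alpha = 0.01) {
  r <- parcor(x, i, j, S)
  z <- 0.5 * log((1 + r) / (1 - r))                              # Fisher's Z-transform
  sqrt(nrow(x) - length(S) - 3) * abs(z) <= qnorm(1 - alpha / 2)
}
# toy example: chain X1 -> X2 -> X3
set.seed(4)
n <- 500
x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x2 + rnorm(n)
x <- cbind(x1, x2, x3)
ci.test(x, 1, 3)            # FALSE: X1, X3 marginally dependent, keep edge at order 0
ci.test(x, 1, 3, S = 2)     # TRUE: X1 indep. X3 given X2, remove edge 1-3 at order 1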

48 one tuning parameter (cut-off parameter) α for truncation of the estimated Z-transformed partial correlations
if the graph is sparse (few neighbors): few iterations only, and only low-order partial correlations play a role
and thus: the estimation algorithm works for p ≫ n problems

49 Computational complexity
crudely bounded to be polynomial in p; the sparser the underlying structure, the faster the algorithm
we can easily do the computations for sparse cases with large p within a few hours of CPU time
[figure: log10(processor time in seconds) versus log10(p) for two expected neighborhood sizes E[N]]

50 IDA (Intervention calculus when the DAG is Absent) (Maathuis, Colombo, Kalisch & PB, 2010)
1. PC-algorithm → CPDAG
2. local algorithm → Θ̂_local
3. lower bounds for absolute causal effects α̂_j
R-package: pcalg
this is what we used in the yeast example to score the importance of the genes according to the size of α̂_j
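
A short usage sketch with the pcalg R package along the lines of these three steps; the data matrix dm (rows = observations, columns = variables), the target/cause positions and the tuning parameters are placeholders, and the exact call details should be checked against the pcalg documentation.

library(pcalg)
# step 1: estimate the CPDAG with the PC-algorithm (Gaussian conditional independence tests)
suffStat <- list(C = cor(dm), n = nrow(dm))
pc.fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01, labels = colnames(dm))
# step 2: multi-set of possible causal effects of variable 2 on variable 1 (local method)
effects <- ida(x.pos = 2, y.pos = 1, mcov = cov(dm), graphEst = pc.fit@graph, method = "local")
# step 3: lower bound for the absolute causal effect of variable 2 on variable 1
alpha.2 <- min(abs(effects))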

51 Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009)
n i.i.d. observational data points; p variables; high-dimensional setting where p ≫ n
assumptions:
X^(1),..., X^(p) ∼ N_p(0, Σ), Markov and faithful to the true DAG
high-dimensionality: log(p) ≪ n
sparsity: maximal degree d = max_j |ne(j)| ≪ n
coherence: maximal (partial) correlations bounded, max{ |ρ_{i,j|S}|; i ≠ j, |S| ≤ d } ≤ C < 1
signal strength/strong faithfulness: min{ |ρ_{i,j|S}|; ρ_{i,j|S} ≠ 0, i ≠ j, |S| ≤ d } ≫ sqrt(d log(p)/n)
Then, for a suitable tuning parameter and 0 < δ < 1:
P[estimated CPDAG = true CPDAG] = 1 - O(exp(-C n^{1-δ}))
P[Θ̂_L = Θ as sets] = 1 - O(exp(-C n^{1-δ}))
(i.e. consistency of the lower bounds for causal effects)

52 Main strategy of a proof
we have to analyze an algorithm...
understand the population version: in particular, one can show that the algorithm stops after d or d - 1 steps
analysis of the noisy version: additional errors; use a union bound argument to control the overall error (seems very rough, but leads to an asymptotically optimal regime, e.g. for the signal strength)

53 The role of sparsity in causal inference
as usual: sparsity is necessary for accurate estimation in the presence of noise
but here: sparsity (so-called protectedness) is crucial for identifiability as well
X → Y (X causes Y) vs. X ← Y (Y causes X): cannot tell from observational data the direction of the arrow; the same situation arises with a full graph with more than 2 nodes
causal identification really needs sparsity; the better the sparsity, the tighter the bounds for causal effects

54 Maximum likelihood estimation
[portrait: R.A. Fisher]
a Gaussian DAG is a Gaussian linear structural equation model:
X^(1) ← ε^(1)
X^(2) ← β_21 X^(1) + ε^(2)
X^(3) ← β_31 X^(1) + β_32 X^(2) + ε^(3)
in general: X^(j) ← Σ_{k=1}^p β_jk X^(k) + ε^(j) (j = 1,..., p), with β_jk ≠ 0 ⇔ edge k → j
in matrix notation: X = BX + ε, ε ∼ N_p(0, diag(σ_1^2,..., σ_p^2))  (a reparameterization)
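
A minimal R sketch of this reparameterization (an illustrative 3-variable example with assumed coefficients): for a B whose non-zero entries respect an acyclic order, X = BX + ε can be solved as X = (I - B)^{-1} ε, which is a convenient way to simulate from a Gaussian linear structural equation model.

set.seed(5)
p <- 3
B <- matrix(0, p, p)
B[2, 1] <- 0.9          # edge 1 -> 2
B[3, 1] <- -0.5         # edge 1 -> 3
B[3, 2] <- 0.7          # edge 2 -> 3
sigma <- c(1, 1, 1)     # error standard deviations
n <- 1000
eps <- matrix(rnorm(n * p), n, p) %*% diag(sigma)
X <- eps %*% t(solve(diag(p) - B))    # each row solves x = B x + eps
round(cov(X), 2)                      # implied covariance (I - B)^{-1} diag(sigma^2) (I - B)^{-T}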

55 (Σ̂, D̂) = argmin_{Σ; D a DAG} -ℓ(Σ, D; data) + λ |D|
= argmin_{B; {σ_j^2; j}} -ℓ(B, {σ_j^2; j}; data) + λ ||B||_0, with ||B||_0 = Σ_{i,j} 1(B_ij ≠ 0),
under the non-convex constraint that B corresponds to no directed cycles
severe non-convex problem due to the no-directed-cycle constraint (an ℓ0-penalty rather than e.g. ℓ1 doesn't make the problem much harder)

56 toy-example:
X^(1) ← β_1 X^(2) + ε_1
X^(2) ← β_2 X^(1) + ε_2
[figure: the feasible region in the (β_1, β_2)-plane consists of the two coordinate axes only (no directed cycle allowed)]
non-convex parameter space! (no straightforward way to do a convex relaxation)

57 computation:
dynamic programming algorithm (up to p ≈ 20) (Silander and Myllymäki, 2006)
greedy algorithms on equivalence classes (Chickering, 2002; Hauser & PB, 2012)
statistical properties of penalized MLE (for large p): van de Geer & PB (2013)
no faithfulness assumption required to obtain the minimal-edge I-MAP!

58 Successes in biology
Effects of single gene knock-downs on all other genes in yeast (Maathuis, Colombo, Kalisch & PB, 2010)
n = 63 observational data points
[figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing]

59 Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
response Y: days to bolting (flowering) of the plant (aim: fast flowering plants)
covariates X: gene-expression profile
observational data with n = 47 observations and p genes
lower bound estimates α̂_j for the causal effect of every gene/variable on Y (using the PC-algorithm)
apply stability selection (Meinshausen & PB, 2010), assigning uncertainties via control of PCER = E[V]/p ≤ q^2/((2 π_thr - 1) p^2) (per-comparison error rate)
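
A sketch of the stability selection idea used here (illustrative: a simple placeholder ranking statistic stands in for the causal lower bounds α̂_j, and the defaults q, π_thr, B are my choices): variables are ranked on many random subsamples of half the data, and one keeps those that land in the top q with relative frequency at least π_thr; the per-comparison error rate is then bounded as on the slide.

# stability selection sketch: x is an n x p data matrix, y the response
stable.top <- function(x, y, q = 20, pi.thr = 0.75, B = 100) {
  p <- ncol(x); n <- nrow(x)
  freq <- numeric(p)
  for (b in 1:B) {
    sub <- sample(n, floor(n / 2))                     # random subsample of half the data
    score <- abs(cor(x[sub, ], y[sub]))                # placeholder ranking statistic
    top <- order(score, decreasing = TRUE)[1:q]        # top-q variables on this subsample
    freq[top] <- freq[top] + 1 / B
  }
  list(selected = which(freq >= pi.thr),               # stably selected variables
       pcer.bound = q^2 / ((2 * pi.thr - 1) * p^2))    # per-comparison error rate bound
}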

60 Causal gene ranking (columns of the original table: rank, summary causal effect, median expression, error (PCER), gene name; numeric entries and full locus identifiers not reproduced here):
1 AT2G (AGL20/SOC1), 2 AT4G (ATCSLG1), 3 AT1G (PDR12), 4 AT3G (replication protein-related), 5 AT5G (ATSUC6), 6 AT4G (FRI), 7 AT1G (ATCSLA10), 8 AT1G (AtGH9B5), 9 AT3G (protein coding), 10 AT1G (protein coding), 11 AT2G (protein coding), 12 AT2G (protein coding), 13 AT2G (AVAP5), 14 AT3G (CYP72A9), 15 AT1G (protein coding), 16 AT5G (CHR4), 17 AT3G (DWF4), 18 AT5G (FLC), 19 AT1G (peroxidase, putative), 20 AT1G (unknown protein)
biological validation by gene knockout experiments in progress; in red (on the slide): biologically known genes responsible for flowering

61 performed a validation experiment with the mutants corresponding to these top 20 - 3 = 17 genes
14 mutants were easily available → could only test 14 genes
more than usual, mutants showed low germination or survival...
9 among the 14 mutants survived (sufficiently strongly), i.e. 9 mutants for which we have an outcome
3 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)
that is: besides the three known genes, we find three additional genes which exhibit a significant difference in terms of time to flowering

62 in short: bounds on causal effects (the α̂_j's) based on observational data lead to interesting predictions for interventions in genomics (i.e. which genes would exhibit a large intervention effect), and these predictions have been validated using experiments

63 Fully identifiable cases
recap: if P = {Gaussian distributions}, then the cardinality of the Markov equivalence class of a DAG D, |E(D)|, is often > 1; the same is true for P = {nonparametric distributions}
but under some additional constraints, the Markov equivalence class becomes smaller or even consists of one DAG only (i.e., identifiable)

64 structural equation model (SEM): X_j ← f_j(X_{pa(j)}, ε_j) (j = 1,..., p)
e.g. X_j ← Σ_{k ∈ pa(j)} f_{jk}(X_k) + ε_j (j = 1,..., p)

65 three types of identifiable SEMs, where the true DAG is identifiable from the observational distribution:
LiNGAM (Shimizu et al., 2006): X_j ← Σ_{k ∈ pa(j)} B_{jk} X_k + ε_j (j = 1,..., p), with the ε_j's non-Gaussian; as with independent component analysis (ICA), identifying Gaussian components is the hardest
linear Gaussian with equal error variances (Peters & PB, 2014): X_j ← Σ_{k ∈ pa(j)} B_{jk} X_k + ε_j (j = 1,..., p), Var(ε_j) ≡ σ^2 for all j

66 nonlinear SEM with additive noise (Schölkopf and co-workers: Hoyer et al., 2009; Peters et al., 2013): X_j ← f_j(X_{pa(j)}) + ε_j (j = 1,..., p)
toy example: p = 2 with DAG X → Y (X = X^(1), Y = X^(2))
truth: Y = X^3 + ε, ε indep. of X
reverse model: X = Y^{1/3} + η, η not indep. of Y
[figure: scatterplots of the data and of the errors ε and η for the forward and reverse models]
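
A small R sketch illustrating why this toy example is identifiable (the sample size, the distribution of X and the kernel-based HSIC dependence measure are my choices, not from the slides): in the true direction the noise ε = Y - X^3 is independent of X, whereas the reverse-model noise η = X - Y^{1/3} clearly depends on Y, which a simple biased HSIC statistic picks up.

set.seed(6)
n <- 300
x <- rnorm(n)
y <- x^3 + rnorm(n)                          # true model: Y = X^3 + eps
eps <- y - x^3                               # forward noise
eta <- x - sign(y) * abs(y)^(1/3)            # reverse-model noise: X = Y^(1/3) + eta
hsic <- function(a, b) {                     # biased HSIC estimate with Gaussian kernels
  rbf <- function(v) { d2 <- as.matrix(dist(v))^2; exp(-d2 / median(d2[d2 > 0])) }
  K <- rbf(a); L <- rbf(b)
  H <- diag(n) - matrix(1 / n, n, n)         # centering matrix
  sum(diag(K %*% H %*% L %*% H)) / n^2
}
hsic(x, eps)    # small: eps is (nearly) independent of X
hsic(y, eta)    # much larger: eta is not independent of Y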

67 strategy to infer structure and model parameters (for identifiable models): order search (order among the variables) & subsequent (sparse) regression

68 order search: MLE for the best permutation/order
candidate order/permutation π (of {1,..., p}) → regressions of X_{π(j)} versus X_{π(j-1)}, X_{π(j-2)},..., X_{π(1)} (when sparse: only some of the regressors are selected)
evaluation score for π: the likelihood of all the regressions X_{π(j)} vs. X_{π(j-1)},..., X_{π(1)}: ∏_{j=1}^p p(x_{π(j)} | x_{π(j-1)},..., x_{π(1)})
need a model for the regressions: functional form of the regressions (e.g. additive, linear) and distribution of the error terms (e.g. Gaussian, nonparametric)
→ can compute the likelihood with estimated parameters and search over the orderings/permutations

69 search over the orderings/permutations: there are p! permutations...
trick: preliminary neighborhood selection aiming for an undirected graph containing the skeleton of the DAG (i.e. the Markov blanket or a superset thereof)
[figure: superset of the skeleton]
and do order search and estimation restricted to the superset-skeleton → much reduced search space
can use unpenalized MLE (for the best order/permutation restricted to the superset-skeleton)
that is: regularization in the neighborhood selection suffices!

70 CAM: Causal Additive Model (PB, Peters & Ernest, 2013)
X_j ← Σ_{k ∈ pa(j)} f_{jk}(X_k) + ε_j (j = 1,..., p), ε_j ∼ N(0, σ_j^2)
the underlying DAG is identifiable from the joint distribution
statistically feasible for estimation due to the additive functions; good empirical performance
log-likelihood equals (up to a constant term): -Σ_{j=1}^p log(σ̂_j^π), with (σ̂_j^π)^2 = n^{-1} Σ_{i=1}^n (X_{i,π(j)} - Σ_{k<j} f̂_{j,π(k)}(X_{i,π(k)}))^2
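
To make this score concrete, here is a small R sketch (a toy with p = 2 and an assumed nonlinear additive model; smoothing splines stand in for the additive fits): for each candidate order, every variable is regressed on its predecessors and the score -Σ_j log(σ̂_j^π) is computed; the causal order X → Y obtains the higher score.

set.seed(8)
n <- 500
x <- rnorm(n)
y <- x^3 + rnorm(n)                      # true order: X before Y
log.var <- function(res) log(mean(res^2))
# score of order (X, Y): variance of X and residual variance of Y after a smooth fit on X
score.xy <- -(log.var(x) + log.var(y - predict(smooth.spline(x, y), x)$y))
# score of order (Y, X): variance of Y and residual variance of X after a smooth fit on Y
score.yx <- -(log.var(y) + log.var(x - predict(smooth.spline(y, x), y)$y))
c(score.xy, score.yx)                    # the true order (X, Y) scores higher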

71 fitting method
1. preliminary neighborhood selection (for a superset of the skeleton of the DAG or the Markov blanket): nodewise sparse additive regression of one variable X_j against all others {X_k; k ≠ j}; easy (using CV), works well and is established (Ravikumar et al. (2009), Meier, van de Geer & PB (2009),...)
2. estimate a best order by unpenalized MLE restricted to the superset-skeleton
3. based on the estimated order π̂: additive sparse model fitting restricted to the superset-skeleton → edge pruning
for step 2: greedy search restricted to the superset-skeleton
[figure: illustration of steps 1-3]

72 for step 2: greedy search restricted to the superset-skeleton
exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
greedy methods are shown to find the optimum for restricted classes of DAGs (Peters, Balakrishnan and Wainwright, in progress)
unpenalized MLE for the best order search: very convenient; regularization is decoupled and only used in the sparse additive regressions

73 statistical consistency: high-dimensional p ≫ n setting (PB, Peters & Ernest, 2013)
main assumptions:
maximal degree of the DAG is O(1) ( sparse )
sufficiently large identifiability constant (w.r.t. Kullback-Leibler divergence) between true and wrong orderings:
ξ_p := p^{-1} min_{π ∉ Π_0} ( E_{θ^0}[-log(p^π_θ(X))] - E_{θ^0}[-log(p_{θ^0}(X))] ), requiring ξ_p ≫ sqrt(log(p)/n)
exponential moments for the X_j's and certain smooth functions thereof
then: P[π̂ ∈ Π_0] → 1 (n → ∞), where Π_0 is the set of true permutations
→ consistent recovery of the true functions f^0_{jk} and the DAG D^0; consistent estimation of E[Y | do(X = x)]

74 remark: preliminary neighborhood selection yields empirical and computational improvements
it seems to improve the theoretical results: e.g. in the linear Gaussian case, less stringent conditions than for ℓ0-regularized MLE (van de Geer & PB, 2013)

75 Empirical results to illustrate what can be achieved with CAM
p = 100, n = 200; the true model is CAM (additive SEM) with Gaussian errors
[figure: boxplots of SHD and SID to the true DAG for CAM, RESIT, LiNGAM, PC, CPC and GES (lower and upper SID bounds for the CPDAG-based methods)]
SHD: Structural Hamming Distance; SID: Structural Intervention Distance (Peters & PB, 2013)
RESIT (Mooij et al. 2009) cannot be used for p = 100
the CAM method is impressive where the true functions are non-monotone and nonlinear (sampled from a Gaussian process); for monotone functions: still good, but less impressive gains

77 Gene expressions from isoprenoid pathways in Arabidopsis thaliana (Wille et al., 2004)
p = 39, n = 118
[figure: chloroplast (MEP pathway), cytoplasm (MVA pathway) and mitochondrion gene networks; left: top 20 edges from CAM, right: edges retained by stability selection]
solid edges: estimated from data
stability selection: expected no. of false positives ≤ 2 (Meinshausen & PB, 2010)

78 When knowing the order of the variables: a new approach...
Theorem (Ernest & PB, 2014). For a general nonlinear SEM with functions f_j ∈ L^2(P) and a regularity condition on the joint density/distribution, namely for all I ⊆ {1,..., p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1, it holds that
E[Y | do(X = x)] = g(x), where g is the component for X in the best additive L^2-approximation of Y versus X, {X_k; k ∈ S(X)}, e.g. S(X) = {k; k < j_X} with j_X the index corresponding to X:
g^app = argmin over additive functions of E[(Y - f_{j_X}(X) - Σ_{k ∈ S(X)} f_k(X_k))^2], and g^app_{j_X}(·) = g(·)

79 implication: we only need to run additive model fitting, even for an arbitrary nonlinear SEM, e.g. even if f_j(X_{pa(j)}, ε_j) is very complicated...!
call it ord-additive modeling/fitting
very robust against model misspecification! misspecification could be dependent ε_j's, i.e. due to hidden variables
if we were to consider E[Y | do(X_1 = x_1, X_2 = x_2)], we would need to fit a first-order interaction model

81 Swiss Air Ltd. ticket pricing and revenue (Ghielmetti et al., in progress) do not need to worry about complicated non-linearities!

82 estimation in practice: additive model fitting of Y versus X, {X_k; k ∈ S(X)} (Hastie & Tibshirani (1990); Wood (2006);...)
high-dimensional scenario when |S(X)| is large, e.g. relative to a sample size n ≈ 50: consistency and optimal convergence rate results for sparse additive model fitting if the problem is sparse (maximal degree of the true DAG is small) under identifiability aka restricted eigenvalue conditions (Ravikumar et al. (2009); Meier, van de Geer & PB (2009))
additive model fitting is well-established and works well, including in the high-dimensional setting
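
A sketch of ord-additive fitting with the mgcv package (the data-generating model, the variable ordering and the use of per-term predictions are my illustrative choices; exact mgcv details should be checked against its documentation): Y is regressed additively on X and the variables preceding it in the order, and the smooth term for X, up to an additive constant, estimates the intervention effect curve x ↦ E[Y | do(X = x)].

library(mgcv)
set.seed(9)
n <- 500
x3 <- rnorm(n)                                   # assumed order: X3, X2, X1, Y
x2 <- sin(x3) + rnorm(n)
x1 <- 0.8 * x2^2 + rnorm(n)                      # X = X1 is the intervention variable
y  <- 2 * x1 - x3 + rnorm(n)                     # true effect curve: E[Y | do(X1 = x)] = 2 x + const
d  <- data.frame(y, x1, x2, x3)
fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = d)  # additive fit of Y on X1 and its predecessors
grid <- data.frame(x1 = seq(-2, 2, length.out = 50), x2 = 0, x3 = 0)
g.hat <- predict(fit, newdata = grid, type = "terms")[, "s(x1)"]   # component for X1, up to a constant
plot(grid$x1, g.hat, type = "l")                 # approximately linear with slope 2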

84 MEP pathway in Arabidopsis thaliana (Wille et al., 2004)
p = 14 gene expressions, n = 118
order of the variables: biological information w.r.t. up-/downstream position in the pathway
[figure: chloroplast MEP pathway with genes DXPS1-3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS, PPDS1, PPDS2]
rank directed gene pairs (X causes Y) by ∫ (Ê[Y | do(X = x)] - Ê[Y])^2 dx
take the top 10 scoring directed gene pairs and check their stability
stability selection: E[false positives] ≤ 1 (Meinshausen & PB, 2010)
→ do not need to worry about complicated nonlinearities!

85 ord-additive regression: a local operation
inferring E[Y | do(X = x)] when the DAG (or order) is known:
ord-additive regression of Y versus X, pa(X): local
integration of all directed paths from X to Y: global
[figure: squared error of the Entire Path, Partial Path and Parent Adjustment approaches versus the SHD to the true DAG (percentage of correct edges), for a non-sparse and a sparse DAG]
ord-additive regression is much more reliable and robust when having imperfect knowledge of the DAG or of the order of the variables, and it is computationally much faster

86 Conclusions
1. Beware of over-interpretation! so far, based on current data, we cannot reliably infer a causal network despite theorems... (perturbation of the data yields unstable networks)
2. Causal inference relies on subtle, uncheckable(!) assumptions → experimental validations are important (simple organisms in biology are great for pursuing this!); statistical (and other) inference is often not confirmatory
3. many technical issues in identifiability, high-dimensional statistical inference and optimization

87 4. but there is potential: for stable ranking/prediction of intervention/causal effects... causal inference from purely observed data could have practical value in the prioritization and design of perturbation experiments Editorial in Nature Methods (April 2010) this can be very useful in computational biology and in this sense: causal inference from observational data is much further developed than 30 years ago when it was thought to be impossible

88 Thank you!
R-package: pcalg (Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal inference. arXiv preprint.
Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order search and penalized regression. arXiv preprint.
Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219-228.
Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. Annals of Statistics 41, 436-463.
van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics 41.
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (11).
Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011). Causal stability ranking. Bioinformatics 28.
Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13.
Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods 7.
Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics 37, 3133-3164.


More information

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension (Last Modified: November 3 2008) Mariko Yamamura 1, Hirokazu Yanagihara 2 and Muni S. Srivastava 3

More information

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables N-1 Experiments Suffice to Determine the Causal Relations Among N Variables Frederick Eberhardt Clark Glymour 1 Richard Scheines Carnegie Mellon University Abstract By combining experimental interventions

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Graphical Models for Non-Negative Data Using Generalized Score Matching

Graphical Models for Non-Negative Data Using Generalized Score Matching Graphical Models for Non-Negative Data Using Generalized Score Matching Shiqing Yu Mathias Drton Ali Shojaie University of Washington University of Washington University of Washington Abstract A common

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

arxiv: v1 [cs.lg] 26 May 2017

arxiv: v1 [cs.lg] 26 May 2017 Learning Causal tructures Using Regression Invariance ariv:1705.09644v1 [cs.lg] 26 May 2017 AmirEmad Ghassami, aber alehkaleybar, Negar Kiyavash, Kun Zhang Department of ECE, University of Illinois at

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Causal Inference & Reasoning with Causal Bayesian Networks

Causal Inference & Reasoning with Causal Bayesian Networks Causal Inference & Reasoning with Causal Bayesian Networks Neyman-Rubin Framework Potential Outcome Framework: for each unit k and each treatment i, there is a potential outcome on an attribute U, U ik,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix

More information

Automatic Causal Discovery

Automatic Causal Discovery Automatic Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour Dept. of Philosophy & CALD Carnegie Mellon 1 Outline 1. Motivation 2. Representation 3. Discovery 4. Using Regression for Causal

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv: v1 [cs.ai] 16 Sep 2015

arxiv: v1 [cs.ai] 16 Sep 2015 Causal Model Analysis using Collider v-structure with Negative Percentage Mapping arxiv:1509.090v1 [cs.ai 16 Sep 015 Pramod Kumar Parida Electrical & Electronics Engineering Science University of Johannesburg

More information

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA Min Jin Ha A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the

More information

11 : Gaussian Graphic Models and Ising Models

11 : Gaussian Graphic Models and Ising Models 10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood

More information

Uncertainty quantification in high-dimensional statistics

Uncertainty quantification in high-dimensional statistics Uncertainty quantification in high-dimensional statistics Peter Bühlmann ETH Zürich based on joint work with Sara van de Geer Nicolai Meinshausen Lukas Meier 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models Journal of Machine Learning Research 12 (211) 2975-326 Submitted 9/1; Revised 6/11; Published 1/11 High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou Department of Statistics

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information