Simplicity of Additive Noise Models
1 Simplicity of Additive Noise Models. Jonas Peters, ETH Zürich / Marie Curie (IEF). Workshop on Simplicity and Causal Discovery, Carnegie Mellon University, 7th June 2014.
2 contains joint work with... ETH Zürich: Peter Bühlmann, Jan Ernest. Max Planck Institute Tübingen: Dominik Janzing, Bernhard Schölkopf. University of Amsterdam / Radboud University Nijmegen: Joris Mooij. UC Berkeley: Sivaraman Balakrishnan, Martin Wainwright.
3 What is the Problem? Theoretical: can the DAG G_0 be recovered from P(X_1, ..., X_5)? Practical: can the DAG G_0 be estimated from iid observations, i.e. from an estimated P(X_1, ..., X_5)? (Figure: an example DAG on X_1, ..., X_5.)
4 Consider the world of Structural Equation Models. Assume P(X_1, ..., X_4) has been generated by X_1 = f_1(X_3, N_1), X_2 = N_2, X_3 = f_3(X_2, N_3), X_4 = f_4(X_2, X_3, N_4), with the N_i jointly independent and G_0 having no cycles. Can the directed acyclic graph be recovered from P(X_1, ..., X_4)?
5 Consider the world of Structural Equation Models. Same setup as slide 4. Can the directed acyclic graph be recovered from P(X_1, ..., X_4)? No.
6 Consider the world of Structural Equation Models. Proposition: Given a distribution P, we can find an SEM generating P for each graph G such that P is Markov with respect to G. JP: Restricted Structural Equation Models for Causal Inference, PhD Thesis 2012 (and probably others?).
7 Consider the world of Structural Equation Models. Proposition: Given a distribution P, we can find an SEM generating P for each graph G such that P is Markov with respect to G, and that is a lot of graphs. JP: Restricted Structural Equation Models for Causal Inference, PhD Thesis 2012 (and probably others?).
8 Consider the world of Structural Equation Models. Assume P(X_1, ..., X_4) has been generated by X_1 = f_1(X_3) + N_1, X_2 = N_2, X_3 = f_3(X_2) + N_3, X_4 = f_4(X_2, X_3) + N_4, with the N_i jointly independent and G_0 having no cycles: Additive Noise Models. Can the DAG be recovered from P(X_1, ..., X_4)? JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
9 Consider the world of Structural Equation Models. Same equations as slide 8, now with N_i ~ N(0, σ_i²) jointly independent and G_0 without cycles: Additive Noise Models with Gaussian noise.
10 Consider the world of Structural Equation Models. Additive Noise Models with Gaussian noise, as on slide 9. Can the DAG be recovered from P(X_1, ..., X_4)?
11 Consider the world of Structural Equation Models. Additive Noise Models with Gaussian noise, as on slide 9. Can the DAG be recovered from P(X_1, ..., X_4)? Yes, iff the f_i are nonlinear. JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
12 Consider the world of Structural Equation Models
13 Consider the world of Structural Equation Models
14 Consider the world of Structural Equation Models. (Cartoon slide playing on "GAUL" / "GAUSS the LINEAR".)
15 Consider the world of Structural Equation Models. Proposition: Given a Gaussian distribution P, we can find a linear Gaussian SEM generating P for each graph G such that P is Markov with respect to G. A. Hauser: Causal Inference from Interventional Data, PhD Thesis 2013 (and probably others?).
16 Consider the world of Structural Equation Models
17 Does X cause Y or vice versa? Consider a distribution generated by Y = f(X) + N_Y with N_Y independent of X. (Graph: X → Y.)
18 Does X cause Y or vice versa? Consider a distribution generated by Y = f(X) + N_Y with N_Y independent of X (graph X → Y). Then, if f is nonlinear, there is no X = g(Y) + M_X with M_X independent of Y (graph X ← Y). JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
19 Does X cause Y or vice versa? Consider a distribution corresponding to Y = X³ + N_Y with N_Y independent of X (graph X → Y), where X ~ N(1, ·) and N_Y ~ N(0, ·).
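To make this example concrete, the following minimal R sketch simulates data from the cubic model; the variances and the sample size are not given on the slide, so the values below are arbitrary assumptions. The scatter plot it produces corresponds to the plots on the next slides.

```r
## Simulate the bivariate example Y = X^3 + N_Y (sketch; parameter values assumed).
set.seed(1)
n  <- 300
x  <- rnorm(n, mean = 1, sd = 1)    # X ~ N(1, 1): variance is an assumption
ny <- rnorm(n, mean = 0, sd = 1)    # N_Y ~ N(0, 1): variance is an assumption
y  <- x^3 + ny                      # forward model X -> Y

plot(x, y, xlab = "X", ylab = "Y")  # scatter plot as on the following slides
```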
20-22 Does X cause Y or vice versa? (Scatter plots of Y against X for this example.)
23 Does X cause Y or vice versa? (Plot of the residuals gam(x ~ s(y))$residuals against Y.)
24 Does X cause Y or vice versa? Method No. 1: Testing for independent residuals. Regress each variable on the other and check whether the residuals are independent of the input/independent/explanatory variable. (Plot: gam(x ~ s(y))$residuals against Y.)
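A hedged R sketch of Method No. 1, using mgcv's gam() as in the slide's own gam(x ~ s(y)) notation. The independence test below is a small, self-contained Gaussian-kernel HSIC permutation test written purely for illustration; it stands in for whatever independence test was actually used on the slides, and any consistent test could be substituted.

```r
library(mgcv)                               # gam() with smooth terms, as in gam(x ~ s(y))

set.seed(1)
n <- 300
x <- rnorm(n, mean = 1, sd = 1)             # same cubic toy example as before (assumed values)
y <- x^3 + rnorm(n)

## Gaussian-kernel HSIC statistic with a permutation p-value (illustrative stand-in).
hsic_test <- function(a, b, B = 200) {
  n <- length(a)
  gram <- function(v) {                     # kernel matrix with the median heuristic
    d <- as.matrix(dist(v))^2
    exp(-d / median(d[d > 0]))
  }
  H  <- diag(n) - 1 / n                     # centering matrix
  Ka <- H %*% gram(a) %*% H
  Kb <- gram(b)
  stat <- function(K) sum(Ka * (H %*% K %*% H)) / n^2
  obs  <- stat(Kb)
  perm <- replicate(B, { idx <- sample(n); stat(Kb[idx, idx]) })
  (1 + sum(perm >= obs)) / (1 + B)          # permutation p-value
}

## Regress each variable on the other and test the residuals for independence.
res_forward  <- residuals(gam(y ~ s(x)))    # residuals of the candidate model X -> Y
res_backward <- residuals(gam(x ~ s(y)))    # residuals of the candidate model Y -> X
c(forward  = hsic_test(res_forward, x),     # typically not significant: independence holds
  backward = hsic_test(res_backward, y))    # typically significant: dependence remains
```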
25 Does X cause Y or vice versa? (Benchmark plot: accuracy (%) against decision rate (%) for IGCI, LiNGAM, Additive Noise and PNL; significant and not significant decisions are marked.)
26-28 Does X cause Y or vice versa? (Figures from F. H. Messerli: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012.)
29 Does X cause Y or vice versa? No (not enough) data for chocolate
30 Does X cause Y or vice versa? No (not enough) data for chocolate... but we have data for coffee!
31 Does X cause Y or vice versa? (Scatter plot: # Nobel Laureates per 10 million inhabitants against coffee consumption per capita (kg); the slide reports the correlation and a p-value < ...)
32 Does X cause Y or vice versa? (Scatter plot as on slide 31.) Coffee → Nobel Prize: dependent residuals (p-value of ...). Nobel Prize → Coffee: dependent residuals (p-value of ...). Model class too small? Causally insufficient?
33 Does X cause Y or vice versa? (Scatter plot as on slide 31.) Coffee → Nobel Prize: dependent residuals (p-value of ...). Nobel Prize → Coffee: dependent residuals (p-value of ...). Model class too small? Causally insufficient? Question: when is a p-value too small?
34 Does X cause Y or vice versa? Method No. 1: Testing for independent residuals. Regress each variable on the other and check whether the residuals are independent of the input/independent/explanatory variable. (Plot: gam(x ~ s(y))$residuals against Y.)
35 Does X cause Y or vice versa? Method No. 1: Testing for independent residuals. Regress each variable on the other and check whether the residuals are independent of the input/independent/explanatory variable. This translates to p > 2 variables. Nice: correct (in population), no distributional assumption about the noise. Problem: it does not scale well to large p, even though it needs only O(p²) independence tests. (Plot: gam(x ~ s(y))$residuals against Y.)
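As a rough illustration of how the idea carries over to p > 2 variables (in the spirit of the RESIT procedure, not a faithful reimplementation), the sketch below repeatedly declares as a sink the variable whose residuals, after an additive regression on all remaining variables, look most independent of them, removes it, and recurses. It reuses hsic_test() and mgcv's gam() from the sketch above.

```r
## Sketch of a residual-independence order search for p > 2 variables (RESIT-like).
estimate_order <- function(dat) {
  remaining <- names(dat)
  sinks <- character(0)
  while (length(remaining) > 1) {
    pvals <- sapply(remaining, function(v) {
      others <- setdiff(remaining, v)
      form   <- as.formula(paste(v, "~", paste(sprintf("s(%s)", others), collapse = " + ")))
      res    <- residuals(gam(form, data = dat))
      min(sapply(others, function(u) hsic_test(res, dat[[u]])))   # worst-case dependence
    })
    sink      <- remaining[which.max(pvals)]   # most plausible sink node
    sinks     <- c(sinks, sink)
    remaining <- setdiff(remaining, sink)
  }
  rev(c(sinks, remaining))                     # causal order, sources first
}

## Tiny example with the chain x -> y -> z (all choices arbitrary).
set.seed(3)
n <- 200
x <- rnorm(n); y <- x^3 + rnorm(n); z <- sin(2 * y) + 0.5 * rnorm(n)
estimate_order(data.frame(x = x, y = y, z = z))   # ideally returns x, y, z
```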
36 Does X cause Y or vice versa?
37 Does X cause Y or vice versa?
38 Does X cause Y or vice versa? Method No. 2: Minimizing KL. Estimate the direction that corresponds to the closest subspace (details follow).
39 Does X cause Y or vice versa?
40 Does X cause Y or vice versa? Proposition: Assume P(X, Y) is generated by Y = βX² + N_Y with independent X ~ N(0, σ_X²) and N_Y ~ N(0, σ_{N_Y}²).
41 Does X cause Y or vice versa? Proposition: Assume P(X, Y) is generated by Y = βX² + N_Y with independent X ~ N(0, σ_X²) and N_Y ~ N(0, σ_{N_Y}²). Then inf_{Q ∈ {Q : Y → X}} KL(P || Q) > 0 if β ≠ 0. This gives us finite-sample guarantees. Model misspecification: how much non-Gaussianity can we allow for? Question: inf_{Q ∈ {Q : Y → X}} KL(P || Q) = ...?
42 Does X cause Y or vice versa? Proposition: Assume P(X, Y) is generated by Y = βX² + N_Y with independent X ~ N(0, σ_X²) and N_Y ~ N(0, σ_{N_Y}²). Then inf_{Q ∈ {Q : Y → X}} KL(P || Q) = (1/2) log(1 + 2β² σ_X⁴ / σ_{N_Y}²). This gives us finite-sample guarantees. Model misspecification: how much non-Gaussianity can we allow for? Question: inf_{Q ∈ {Q : Y → X}} KL(P || Q) = ...?
43 Does X cause Y or vice versa? Proposition: Assume P(X, Y) is generated by Y = f(X) + N_Y with independent X ~ N(0, σ_X²) and N_Y ~ N(0, σ_{N_Y}²). Then inf_{Q ∈ {Q : Y → X}} KL(P || Q) ...
44 Does X cause Y or vice versa? Proposition: Assume P(X, Y) is generated by Y = f(X) + N_Y with independent X ~ N(0, σ_X²) and N_Y ~ N(0, σ_{N_Y}²). Then inf_{Q ∈ {Q : Y → X}} KL(P || Q) ≥ KL(P(Y) || N(0, Var Y)).
45 Does X cause Y or vice versa? Same proposition as slide 44. This gives us finite-sample guarantees. Model misspecification: how much non-Gaussianity can we allow for?
46 Does X cause Y or vice versa? Same proposition as slide 44. This gives us finite-sample guarantees. Model misspecification: how much non-Gaussianity can we allow for? Question: inf_{Q ∈ {Q : Y → X}} KL(P || Q) = ...?
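The lower bound on these slides can be estimated from a sample of Y alone. The following R sketch computes a rough plug-in estimate of KL(P(Y) || N(0, Var Y)) for a simulated cubic example; the kernel density estimate, the arbitrary parameter choices and the reuse of the same sample for fitting and evaluation make this a crude illustration, not a reliable estimator.

```r
## Rough plug-in estimate of the lower bound KL(P(Y) || N(0, Var Y)) (illustrative only).
set.seed(4)
n <- 5000
x <- rnorm(n)                        # X ~ N(0, 1): variance is an assumption
y <- x^3 + rnorm(n)                  # forward additive noise model Y = f(X) + N_Y
dens <- density(y)                                # kernel density estimate of p_Y
logp <- log(approx(dens$x, dens$y, xout = y)$y)   # log p_Y at the sample points
logq <- dnorm(y, mean = 0, sd = sd(y), log = TRUE)
mean(logp - logq)                    # estimated KL(P(Y) || N(0, Var Y)), in nats
```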
47 Can we reconstruct the whole causal network?
48 Can we reconstruct the whole causal network? Theorem: Let P(X_1, ..., X_p) be generated by an additive noise model (+ Gaussian) X_i = f_i(X_{PA_i}) + N_i with jointly independent N_i ~ N(0, σ_i²) and differentiable, non-linear f_i. Then we can identify the corresponding DAG from P(X_1, ..., X_p). JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
49 Can we reconstruct the whole causal network? Same theorem as slide 48. Surprising: this follows from identifiability in the bivariate case. JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
50 Can we reconstruct the whole causal network? Theorem: Let P(X_1, ..., X_p) be generated by a causal additive model (+ Gaussian) X_i = Σ_{k ∈ PA_i} f_{i,k}(X_k) + N_i with jointly independent N_i ~ N(0, σ_i²) and differentiable, non-linear f_{i,k}. Then we can identify the corresponding DAG from P(X_1, ..., X_p). Surprising: this follows from identifiability in the bivariate case. JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, to appear in JMLR.
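To illustrate the model class in this theorem, the R sketch below simulates a small causal additive model over four variables with the DAG of the running example (X_2 → X_3, X_3 → X_1, X_2 → X_4, X_3 → X_4); the particular nonlinearities and noise scales are arbitrary choices.

```r
## Simulate a small causal additive model (CAM); all functional forms are assumptions.
set.seed(2)
n  <- 500
x2 <- rnorm(n)                                   # source node
x3 <- sin(2 * x2)    + 0.5 * rnorm(n)            # nonlinear effect of x2
x1 <- tanh(3 * x3)   + 0.5 * rnorm(n)            # nonlinear effect of x3
x4 <- x2^2 + cos(x3) + 0.5 * rnorm(n)            # additive in its two parents
dat <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4)
```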
51 Can we reconstruct the whole causal network? Given the empirical distribution P̂_n(X_1, ..., X_4). What now?
52 Can we reconstruct the whole causal network? Given P̂_n(X_1, ..., X_4). What now? Consider the model classes Q_G := {Q : Q is generated by a causal additive model (CAM) with DAG G} and optimize min over DAGs G of inf_{Q ∈ Q_G} KL(P̂_n || Q).
53 Can we reconstruct the whole causal network? Given P̂_n(X_1, ..., X_4). What now? Consider the model classes Q_G := {Q : Q is generated by a causal additive model (CAM) with DAG G} and optimize min over DAGs G of inf_{Q ∈ Q_G} KL(P̂_n || Q). By maximum likelihood, this is min over DAGs G of Σ_{i=1}^p log var(residuals of X_i regressed on X_{PA_i^G}).
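A hedged R sketch of this score: for each node, regress it additively (smooth terms via mgcv) on its candidate parents and sum the logs of the residual variances; lower is better. The two parent lists below refer to the CAM simulated in the previous snippet and serve only as an illustrative comparison.

```r
library(mgcv)

## Log-likelihood-based score of a candidate DAG for a Gaussian CAM (sketch).
node_score <- function(dat, v, pa) {
  if (length(pa) == 0) return(log(var(dat[[v]])))            # source node: marginal variance
  form <- as.formula(paste(v, "~", paste(sprintf("s(%s)", pa), collapse = " + ")))
  log(var(residuals(gam(form, data = dat))))                 # log residual variance
}
cam_score <- function(dat, parents)
  sum(sapply(names(dat), function(v) node_score(dat, v, parents[[v]])))

## Score the true DAG of the simulated CAM against an (arbitrary) wrong DAG.
pa_true  <- list(x1 = "x3", x2 = character(0), x3 = "x2", x4 = c("x2", "x3"))
pa_wrong <- list(x1 = character(0), x2 = "x3", x3 = "x4", x4 = "x1")
c(true = cam_score(dat, pa_true), wrong = cam_score(dat, pa_wrong))
```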
54 Can we reconstruct the whole causal network? Same objective as slide 53. Wait, there is no penalization on the number of edges! The optimum is a fully connected graph, i.e. the search is effectively over orderings.
55 Can we reconstruct the whole causal network? There are ... DAGs with 13 nodes.
56 Can we reconstruct the whole causal network? There are ... DAGs with 13 nodes. There are only ... orderings :-). Idea: find a causal order. Find the order of variables that maximizes the likelihood and then perform classical variable selection. This is consistent ...
57 Can we reconstruct the whole causal network? There are ... DAGs with 13 nodes. There are only ... orderings :-). Idea: find a causal order. Find the order of variables that maximizes the likelihood and then perform classical variable selection. This is consistent (but intractable). P. Bühlmann, JP and J. Ernest: CAM: Causal additive models, high-dimensional order search and penalized regression, submitted.
58 Can we reconstruct the whole causal network? STEP 1: Greedy Addition. Include the edge that leads to the largest increase of the log-likelihood. (Figure: include the best edge, then recompute the corresponding column of the gain matrix over X_1, ..., X_6.)
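The sketch below illustrates Step 1, reusing node_score() from the scoring snippet above: repeatedly add the acyclicity-preserving edge with the largest gain in log-likelihood, i.e. the largest drop in log residual variance. It is a simplified illustration rather than the CAM implementation; in particular it recomputes all gains from scratch, whereas the slide's picture indicates that only one column needs to be recomputed after each addition.

```r
## STEP 1 (sketch): greedy edge addition with an explicit acyclicity check.
creates_cycle <- function(adj, from, to) {
  reach <- to                                   # nodes already reachable from `to`
  repeat {
    nxt <- unique(c(reach, colnames(adj)[colSums(adj[reach, , drop = FALSE]) > 0]))
    if (length(nxt) == length(reach)) break
    reach <- nxt
  }
  from %in% reach                               # adding from -> to would close a cycle
}

greedy_addition <- function(dat, max_edges = 10) {
  vars <- names(dat)
  adj  <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
  for (step in seq_len(max_edges)) {
    best <- NULL
    for (from in vars) for (to in vars) {
      if (from == to || adj[from, to] == 1 || creates_cycle(adj, from, to)) next
      pa   <- vars[adj[, to] == 1]
      gain <- node_score(dat, to, pa) - node_score(dat, to, c(pa, from))
      if (is.null(best) || gain > best$gain) best <- list(from = from, to = to, gain = gain)
    }
    if (is.null(best) || best$gain <= 0) break  # stop once no edge improves the fit
    adj[best$from, best$to] <- 1
  }
  adj
}

adj_full <- greedy_addition(dat)                # dat: the CAM sample simulated earlier
```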
59 Can we reconstruct the whole causal network? Theorem (Some correctness of greedy search for CAM) Assume that the skeleton of the correct DAG does not contain any cycles (plus some mild conditions/modifications). Greedy addition then yields a correct causal order (in population). JP, S. Balakrishnan and M. Wainwright: in progress. STEP 1: Greedy Addition. Include the edge that leads to the largest increase of the log-likelihood.
60 Can we reconstruct the whole causal network? STEP 2: Variable Selection. For each node, remove non-relevant edges. (Figure: a densely connected graph over X_1, ..., X_6 is sparsified by removing edges via variable selection.)
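A minimal R sketch of Step 2: for each node, fit an additive model on its current parents and keep only the parents whose smooth terms are significant. Pruning by mgcv's per-term p-values (summary(...)$s.table) and the threshold alpha = 0.001 are simple stand-ins for the penalized variable selection used in the CAM paper; adj_full and dat come from the earlier snippets.

```r
library(mgcv)

## STEP 2 (sketch): prune each node's parent set by significance of the smooth terms.
prune_parents <- function(dat, parents, alpha = 0.001) {
  lapply(setNames(names(dat), names(dat)), function(v) {
    pa <- parents[[v]]
    if (length(pa) == 0) return(pa)
    form  <- as.formula(paste(v, "~", paste(sprintf("s(%s)", pa), collapse = " + ")))
    pvals <- summary(gam(form, data = dat))$s.table[, "p-value"]
    pa[pvals < alpha]                           # keep only significant parents
  })
}

## Turn the adjacency matrix from the greedy step into a parent list, then prune it.
parents <- lapply(setNames(colnames(adj_full), colnames(adj_full)),
                  function(v) rownames(adj_full)[adj_full[, v] == 1])
prune_parents(dat, parents)
```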
61 Can we reconstruct the whole causal network? Easy to add for high-dimensional data: STEP 0: Preliminary Neighbourhood Selection (PNS) by variable selection. (Figure: candidate neighbourhoods over X_1, ..., X_6.)
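A crude R sketch of Step 0 for high-dimensional data: before any search, restrict every node's candidate parents to the k variables with the strongest marginal rank correlation. The CAM paper performs this step with proper additive-model fits (boosting); the correlation filter below is only a self-contained stand-in, and k is an arbitrary choice.

```r
## STEP 0 (sketch): preliminary neighbourhood selection by marginal rank correlation.
pns <- function(dat, k = 10) {
  vars <- names(dat)
  lapply(setNames(vars, vars), function(v) {
    others   <- setdiff(vars, v)
    strength <- sapply(others, function(u) abs(cor(dat[[v]], dat[[u]], method = "spearman")))
    names(sort(strength, decreasing = TRUE))[seq_len(min(k, length(others)))]
  })
}

pns(dat, k = 2)   # candidate neighbourhoods for the small simulated example
```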
62 Can we reconstruct the whole causal network? Simulations: p = 100, n = 200, functions drawn from a Gaussian process. (Boxplots: SHD to the true DAG for CAM, RESIT, LiNGAM, PC, CPC and GES.)
63 Can we reconstruct the whole causal network? Simulations: p = 100, n = 200, functions drawn from a Gaussian process. (Boxplots: SID to the true DAG for CAM, RESIT, LiNGAM, PC (lower/upper), CPC (lower/upper) and GES (lower/upper).) JP and P. Bühlmann: Structural Intervention Distance (SID) for Evaluating Causal Graphs, arXiv.
64 Can we reconstruct the whole causal network? Simulations: p = 10, n = 200, linear functions. (Boxplots: SHD to the true CPDAG for CAM, LiNGAM, PC, CPC and GES.)
65 Can we reconstruct the whole causal network? Real data: Arabidopsis thaliana, p = 39, n = 118, 20 most significant edges. (Figure: the isoprenoid gene network across the chloroplast (MEP pathway), the cytoplasm (MVA pathway) and the mitochondrion, with genes such as DXPS1-3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1-2, AACT1-2, HMGS, HMGR1-2, MK, MPDC1-2, FPPS1-2, UPPS1, GPPS, and several DPPS, GGPPS and PPDS isoforms.) Wille et al.: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11), 2004.
66 Can we reconstruct the whole causal network? Real data: Arabidopsis thaliana, p = 39, n = 118, stability selection with at most 2 expected false positives. (Same isoprenoid pathway figure as on slide 65.) Wille et al.: Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11), 2004.
67 Can we reconstruct the whole causal network? Real data: a ground truth? Assume we are given observational data (j = ...; k = ...), where O_kj is the expression level of gene j in observation k,
68 Can we reconstruct the whole causal network? Real data: a ground truth? Assume we are given observational data (j = ...; k = ...), where O_kj is the expression level of gene j in observation k, and some interventional data (j = ...; i = ...; in total 400 genes), where A_ij is the expression level of gene j under a knock-down of gene i.
69 Can we reconstruct the whole causal network? Real data: a ground truth? Observational data O_kj and interventional (knock-down) data A_ij as on slide 68. How do we evaluate causal inference methods?
70 Summary. Idea: Additive Noise Models. Structural assumptions like additive noise models lead to identifiability: X_i = f_i(X_{pa(i)}) + N_i. Idea: causal order + variable selection. Find the order of variables that maximizes the likelihood (by greedily adding edges) and then perform classical variable selection.
71 Summary. Same summary as slide 70. Open Questions: Identifiability (e.g. in terms of KL distances) depending on the non-linearity, for example. Hidden variables. Do the assumptions (roughly) hold in practice? How do we evaluate causal inference methods?
72 Summary. Same summary and open questions as slide 71. Dankeschön (thank you)!!