Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables

1 Distinguishing between Cause and Effect: Estimation of Causal Graphs with two Variables. Jonas Peters, ETH Zürich. Tutorial, NIPS 2013 Workshop on Causality, 9th December 2013

2 F. H. Messerli: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012

7 Problem: Given $P(X, Y)$, can we infer whether $X \to Y$ or $Y \to X$?
Difficulty: So much symmetry: $P(X)\, P(Y \mid X) = P(X, Y) = P(X \mid Y)\, P(Y)$. We need assumptions! (E.g., Markov and faithfulness do not suffice.)
Surprise (for some assumptions): 2 variables $\Rightarrow$ $p$ variables.
J. Peters, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, arXiv:

8 Idea No. 1: Linear Non-Gaussian Additive Models (LiNGAM)
Structural assumptions like additive non-Gaussian noise models break the symmetry: $Y = \beta X + N_Y$, $N_Y \perp X$, with $N_Y$ non-Gaussian.

10 Asymmetry No. 1
Consider a distribution corresponding to $Y = \beta X + N_Y$, $N_Y \perp X$, $N_Y$ non-Gaussian ($X \to Y$).
Then there is no $X = \phi Y + N_X$, $N_X \perp Y$, $N_X$ non-Gaussian ($X \leftarrow Y$).
S. Shimizu, P.O. Hoyer, A. Hyvärinen and A.J. Kerminen: A linear non-Gaussian acyclic model for causal discovery, JMLR 2006
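The asymmetry can be illustrated numerically. The following is a minimal sketch, not the LiNGAM algorithm from the cited paper: it assumes uniform (hence non-Gaussian) distributions for $X$ and $N_Y$, fits a least-squares line in both directions, and uses the correlation between squared regressor and squared residuals as a crude stand-in for a proper independence test. The function name `direction_score` and all parameter values are hypothetical choices for this illustration.

```python
import numpy as np

def direction_score(regressor, target):
    """Crude dependence score between a regressor and its linear-fit residuals.

    Fits target ~ slope * regressor by least squares on centered data and
    returns |corr(regressor^2, residual^2)|. Under independence this is zero
    in expectation. It is only a rough proxy for an independence test.
    """
    r = regressor - regressor.mean()
    t = target - target.mean()
    slope = np.dot(r, t) / np.dot(r, r)
    residuals = t - slope * r
    return abs(np.corrcoef(r**2, residuals**2)[0, 1])

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1.0, 1.0, size=n)        # assumed non-Gaussian cause
noise = rng.uniform(-1.0, 1.0, size=n)    # assumed non-Gaussian noise, independent of x
y = 1.0 * x + noise                       # beta = 1.0 is an arbitrary choice

score_forward = direction_score(x, y)     # residuals of y given x
score_backward = direction_score(y, x)    # residuals of x given y
print("forward score: ", score_forward)
print("backward score:", score_backward)
```

In this setup the forward residuals are close to the independent noise, so the forward score should be near zero, while the backward residuals remain visibly dependent on $Y$. With Gaussian $X$ and $N_Y$ both scores would be near zero, which is why the non-Gaussianity assumption matters.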

11 Idea No. 2: Additive noise models
Nonlinear functions are also fine! $Y = f(X) + N_Y$, $N_Y \perp X$
P. Hoyer, D. Janzing, J. Mooij, J. Peters and B. Schölkopf: Nonlinear causal discovery with additive noise models, NIPS 2008

13 Asymmetry No. 2
Consider a distribution corresponding to $Y = f(X) + N_Y$ with $N_Y \perp X$ ($X \to Y$).
Then for most combinations $(f, P(X), P(N_Y))$ there is no $X = g(Y) + M_X$ with $M_X \perp Y$ ($X \leftarrow Y$).
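A practical check of this asymmetry regresses in both directions and measures how strongly the residuals depend on the regressor. The sketch below is an illustration under stated assumptions rather than the procedure of the cited paper: polynomial regression (`np.polyfit`) stands in for a nonparametric regressor, the dependence measure is a biased empirical HSIC statistic with a Gaussian kernel and median-distance bandwidth, and the data-generating choices ($f(x) = x^3$, uniform $X$, Gaussian noise) are hypothetical.

```python
import numpy as np

def gaussian_gram(v):
    """Gaussian-kernel Gram matrix; bandwidth^2 is the median squared pairwise distance."""
    d2 = (v[:, None] - v[None, :]) ** 2
    bandwidth2 = np.median(d2[d2 > 0])
    return np.exp(-d2 / (2.0 * bandwidth2))

def hsic(a, b):
    """Biased empirical HSIC statistic, trace(K H L H) / n^2 (larger means more dependent)."""
    n = len(a)
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(gaussian_gram(a) @ h @ gaussian_gram(b) @ h) / n**2

def anm_score(cause, effect, degree=3):
    """Fit effect = poly(cause) + residual and return HSIC(cause, residual)."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2.0, 2.0, size=n)   # assumed input distribution
y = x**3 + rng.normal(size=n)        # assumed f(x) = x^3 with additive Gaussian noise

score_forward = anm_score(x, y)      # candidate direction X -> Y
score_backward = anm_score(y, x)     # candidate direction Y -> X
inferred = "X -> Y" if score_forward < score_backward else "Y -> X"
print(score_forward, score_backward, inferred)
```

The direction with the smaller score is preferred. A full treatment would turn the HSIC statistic into a p-value and would also allow the answer "neither direction fits", which this sketch does not attempt.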

14 $Y = f(X) + N_Y$, $N_Y \perp X$

16 $X = g(Y) + N_X$, $N_X \not\perp Y$

18 Idea No. 3: Gaussian Process Inference (GPI)
We can always write $Y = f(X, N_Y)$, $N_Y \perp X$ and $X = g(Y, N_X)$, $N_X \perp Y$.
Which model is more complex? Use Bayesian model comparison.
J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, B. Schölkopf: Probabilistic latent variable models for distinguishing between cause and effect, NIPS 2010
E.g., J. Peters: Restricted Structural Equation Models for Causal Inference, PhD Thesis

20 Asymmetry No. 3
1. Fix the noise distribution to be $\mathcal{N}(0, 1)$.
2. Put prior $p(\theta_X)$ on input distribution $p(x \mid \theta_X)$ (complexity of $X$).
3. Put prior $p(\theta_f)$ on the functions $p(f \mid \theta_f)$ (complexity of $f$).
4. Approximate marginal likelihood for $X \to Y$:
$$p(x, y) = p(x)\, p(y \mid x) = \int p(x \mid \theta_X)\, p(\theta_X)\, d\theta_X \cdot \int \delta\big(y - f(x, e)\big)\, p(e)\, p(f)\, de\, df$$
5. Approximate marginal likelihood for $Y \to X$.
6. Compare.
(Graphical model with nodes $\theta_f$, $f$, $\theta_X$, $X$, $Y$, $E$.)
J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, B. Schölkopf: Probabilistic latent variable models for distinguishing between cause and effect, NIPS 2010

21 Idea No. 4: Information Geometric Causal Inference (IGCI) Assume a deterministic relationship Y = f (X ) and that f and P(X ) are independent. D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniusis, B. Steudel, B. Schölkopf: Information-geometric approach to inferring causal directions, Artificial Intelligence 2012

22 (Figure for Idea No. 4, with labels $f(x)$, $p(x)$, $p(y)$ and axes $x$, $y$.)

23 Asymmetry No. 4
Consider $Y = f(X)$ with $\mathrm{id} \neq f : [0, 1] \to [0, 1]$ invertible and $X = g(Y)$. If
$$\operatorname{cov}(\log f', p_X) = \int \log f'(x)\, p_X(x)\, dx - \int \log f'(x)\, dx = 0$$
then
$$\operatorname{cov}(\log g', p_Y) = \int \log g'(y)\, p_Y(y)\, dy - \int \log g'(y)\, dy > 0$$
D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniusis, B. Steudel, B. Schölkopf: Information-geometric approach to inferring causal directions, Artificial Intelligence 2012

26 Open Questions 1: Quantifying Identifiability
Proposition: Assume $P(X, Y)$ is generated by $Y = f(X) + N_Y$ with independent $X$ and $N_Y$. Then
$$\inf_{Q \in \{Q \,:\, Y \to X\}} \mathrm{KL}(P \,\|\, Q) = \;?$$
- first steps to understand the geometry
- gives us finite sample guarantees

27 Open Questions 2: Robustness
What happens if assumptions are violated? E.g., in case of confounding? (Diagram: graph over $Z$, $X$, $Y$.) Can we still infer $X \to Y$? How useful is this?

30 Conclusions
In theory, we can break the symmetry between cause and effect.
- restricted structural equation models:
  - linear functions, additive non-Gaussian noise
  - nonlinear functions, additive noise
- complexity measures on functions and distributions
- independence between function and input distribution
- ...
- principles behind new methods from challenge?
Causal inference problem of climate change is solved! Fight the cause! Don't fly! (Zurich-SFO 5.4 t CO2)! Compensate!

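31 IGCI
It turns out that if $X \to Y$, then
$$\int \log f'(x)\, p(x)\, dx < \int \log g'(y)\, p(y)\, dy$$
Estimator:
$$\hat{C}_{X \to Y} := \frac{1}{m} \sum_{j=1}^{m} \log \left| \frac{y_{j+1} - y_j}{x_{j+1} - x_j} \right| \approx \int \log f'(x)\, p(x)\, dx$$
Infer $X \to Y$ if $\hat{C}_{X \to Y} < \hat{C}_{Y \to X}$.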
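The slope-based estimator is short enough to sketch directly. The code below is an illustrative implementation under a few assumptions that are not on the slide: both variables are rescaled to $[0, 1]$, samples are sorted by the candidate cause before taking consecutive differences, pairs with a zero difference are skipped, and the average runs over the available consecutive pairs. The function name `igci_slope_score` and the test function $f(x) = x^3$ with uniform $X$ are hypothetical.

```python
import numpy as np

def igci_slope_score(cause, effect):
    """Slope-based estimate of the mean log-derivative of the map cause -> effect."""
    # Rescale both variables to [0, 1], matching the setting f: [0, 1] -> [0, 1].
    c = (cause - cause.min()) / (cause.max() - cause.min())
    e = (effect - effect.min()) / (effect.max() - effect.min())
    order = np.argsort(c)                 # sort samples by the candidate cause
    dc = np.diff(c[order])
    de = np.diff(e[order])
    keep = (dc != 0) & (de != 0)          # skip ties, where the log-slope is undefined
    return np.mean(np.log(np.abs(de[keep] / dc[keep])))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)      # assumed uniform input on [0, 1]
y = x**3                                  # assumed deterministic, invertible f

c_xy = igci_slope_score(x, y)
c_yx = igci_slope_score(y, x)
inferred = "X -> Y" if c_xy < c_yx else "Y -> X"
print(c_xy, c_yx, inferred)
```

For this choice of $f$ and a uniform input, $\int \log f'(x)\, dx = \log 3 - 2 < 0$, so the forward score should come out negative and smaller than the backward score.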

32 $Y = \beta X + N_Y$, $N_Y \perp X$, $N_Y$ non-Gaussian (figure axes: $X$, $Y$)

34 $X = \phi Y + N_X$, $N_X \not\perp Y$, $N_X$ non-Gaussian (figure axes: $X$, $Y$)

36 Does X cause Y or vice versa? Real Data

39 Does X cause Y or vice versa? No (not enough) data for chocolate... but we have data for coffee!

41 Does X cause Y or vice versa?
(Plot of # Nobel laureates per 10 million and coffee consumption per capita (kg).)
Correlation: 0.698, p-value: <
Nobel Prize $\to$ Coffee: dependent residuals (p-value of 0).
Coffee $\to$ Nobel Prize: dependent residuals (p-value of 0).
Model class too small? Causally insufficient?

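42 The linear Gaussian case
$Y = \beta X + N_Y$ with independent $X \sim \mathcal{N}(0, \sigma_X^2)$ and $N_Y \sim \mathcal{N}(0, \sigma_{N_Y}^2)$.
Then there is a linear SEM with $X = \alpha Y + M_X$. How can we find $\alpha$ and $M_X$?
(Diagram in $L^2$: vectors $X$, $\beta X$, $N_Y$, $Y$.)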
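One standard way to answer the question is to project $X$ onto $Y$ in $L^2$: take $\alpha = \operatorname{Cov}(X, Y) / \operatorname{Var}(Y) = \beta \sigma_X^2 / (\beta^2 \sigma_X^2 + \sigma_{N_Y}^2)$ and $M_X = X - \alpha Y$. Then $M_X$ is uncorrelated with $Y$, and because $(M_X, Y)$ is jointly Gaussian, uncorrelated implies independent. The sketch below checks this numerically; the parameter values and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical parameter values for the forward model Y = beta * X + N_Y.
beta, sigma_x, sigma_n = 2.0, 1.0, 1.0

# Closed-form backward coefficient from the L^2 projection of X onto Y.
alpha = beta * sigma_x**2 / (beta**2 * sigma_x**2 + sigma_n**2)

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(0.0, sigma_x, size=n)
y = beta * x + rng.normal(0.0, sigma_n, size=n)

# Empirical regression coefficient of X on Y, to compare with the closed form.
alpha_hat = np.cov(x, y)[0, 1] / np.var(y, ddof=1)

# Backward noise term; it should be (nearly) uncorrelated with Y.
m_x = x - alpha * y
corr_mx_y = np.corrcoef(m_x, y)[0, 1]

print("alpha (closed form):", alpha)
print("alpha (estimated):  ", alpha_hat)
print("corr(M_X, Y):       ", corr_mx_y)
```

Since both directions admit a linear model with independent Gaussian noise, the direction is not identifiable in this case, in contrast to the non-Gaussian setting.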

Simplicity of Additive Noise Models Jonas Peters ETH Zürich - Marie Curie (IEF) Workshop on Simplicity and Causal Discovery Carnegie Mellon University 7th June 2014 contains joint work with... ETH Zürich: