Recovering the Graph Structure of Restricted Structural Equation Models

Recovering the Graph Structure of Restricted Structural Equation Models
Workshop on Statistics for Complex Networks, Eindhoven
Jonas Peters (1), J. Mooij (3), D. Janzing (2), B. Schölkopf (2), P. Bühlmann (1)
(1) Seminar for Statistics, ETH Zürich, Switzerland
(2) MPI for Intelligent Systems, Tübingen, Germany
(3) Radboud University, Nijmegen, Netherlands
31st January 2013

How to win a Nobel Prize F. H. Messerli: Chocolate Consumption, Cognitive Function, and Nobel Laureates, N Engl J Med 2012

What is the Problem? Given some data, what is the causal structure of the underlying mechanism? Understand the (physical) process in more detail. Intervene! (Alain's talk tomorrow.) Use observational data!

What is the Problem?
Theoretical: P(X_1, ..., X_5) →? DAG G_0
Practical: iid observations →? estimated P(X_1, ..., X_5) →? DAG G_0
[Figure: an example DAG G_0 on X_1, ..., X_5.]

Structural Equation Models (SEMs)
The joint distribution P(X_1, ..., X_p) satisfies a Structural Equation Model (SEM) with DAG G_0 if
X_i = f_i(X_{PA_i}, N_i), 1 ≤ i ≤ p,
with X_{PA_i} being the parents of X_i in G_0. The N_i are required to be jointly independent.
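As a minimal illustration (my own sketch, not code from the talk): sampling from an SEM just means evaluating the structural equations in a topological order of the DAG. The functions and noise distributions below are arbitrary choices, arranged to match the four-variable graph G on the next slide (X1 → X3 → X4, {X3, X4} → X2).

import numpy as np

rng = np.random.default_rng(0)
n = 1000

N = rng.normal(size=(n, 4))             # jointly independent noises N_1..N_4
X = np.zeros((n, 4))                    # columns: X1, X2, X3, X4
X[:, 0] = N[:, 0]                       # X1 = f1(N1)
X[:, 2] = np.tanh(X[:, 0]) + N[:, 2]    # X3 = f3(X1, N3)
X[:, 3] = X[:, 2] ** 2 + N[:, 3]        # X4 = f4(X3, N4)
X[:, 1] = X[:, 2] - X[:, 3] + N[:, 1]   # X2 = f2(X3, X4, N2)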

Structural Equation Models (SEMs)
P(X_1, ..., X_4) could be generated by (graph G)
X_1 = f_1(N_1)
X_2 = f_2(X_3, X_4, N_2)
X_3 = f_3(X_1, N_3)
X_4 = f_4(X_3, N_4)
with N_i jointly independent, while it is generated by (graph G_0)
X_1 = f_1(X_3, N_1)
X_2 = f_2(N_2)
X_3 = f_3(X_2, N_3)
X_4 = f_4(X_2, X_3, N_4)
with N_i jointly independent.

Structural Equation Models (SEMs)
P(X_1, ..., X_4) could also be generated by (graph G)
X_1 = g_1(M_1)
X_2 = g_2(X_3, X_4, M_2)
X_3 = g_3(X_1, M_3)
X_4 = g_4(X_1, X_3, M_4)
with M_i jointly independent, while it is generated by (graph G_0)
X_1 = f_1(X_3, N_1)
X_2 = f_2(N_2)
X_3 = f_3(X_2, N_3)
X_4 = f_4(X_2, X_3, N_4)
with N_i jointly independent.

SEMs are not identifiable
Proposition: Given a distribution P(X_1, ..., X_p), we can find an SEM for each graph G such that P is Markov with respect to G. Special case: two variables.
JP: Restricted Structural Equation Models for Causal Inference, PhD Thesis 2012 (and others)

The Idea
We gain identifiability by restricting the function class (excluding certain combinations of functions, input distributions and noise distributions).

Two Variables - Good I
X_1 = N_1
X_2 = β X_1 + N_2
with N_1, N_2 iid N(0, σ²). Then there is no linear SEM with the same error variances in the backward direction.
[Figure: scatter plot of X_2 against X_1 with the regression line X_2 = β X_1.]
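A small numeric sanity check of this claim (my own sketch, with the arbitrary choices β = 1.5, σ = 1): fit the backward linear model by least squares and compare the two backward error variances.

import numpy as np

rng = np.random.default_rng(1)
beta, sigma, n = 1.5, 1.0, 200_000

X1 = sigma * rng.normal(size=n)
X2 = beta * X1 + sigma * rng.normal(size=n)   # forward SEM, equal error variances

# Backward linear SEM: X2 = M2, X1 = gamma * X2 + M1, gamma by least squares.
C = np.cov(X1, X2)
gamma = C[0, 1] / C[1, 1]
M1 = X1 - gamma * X2
print(np.var(M1), np.var(X2))
# approx sigma^2/(beta^2+1) = 0.31 and (beta^2+1)*sigma^2 = 3.25: the two
# backward error variances differ, so no equal-variance backward SEM exists.

In closed form the backward error variances are σ²/(β²+1) and (β²+1)σ², which coincide only for β = 0.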

Two Variables - Good II
Consider a distribution corresponding to
X_1 = N_1
X_2 = X_1² + N_2   (graph X_1 → X_2)
with N_1 ⊥ N_2, N_1 ~ U[0.1, 0.9] and N_2 ~ U[−0.15, 0.15].

Two Variables - Good II
Consider a distribution corresponding to
X_1 = N_1
X_2 = f(X_1) + N_2   (graph X_1 → X_2)
with N_1 ⊥ N_2. For most combinations (f, P(N_1), P(N_2)) there is no
X_1 = g(X_2) + M_1
X_2 = M_2   (graph X_2 → X_1)
with M_1 ⊥ M_2. More or less one exception: (linear, Gaussian, Gaussian) with different error variances.
P. Hoyer, D. Janzing, J. Mooij, JP and B. Schölkopf: Nonlinear causal discovery with additive noise models, NIPS 2008
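A minimal sketch of how this asymmetry can be tested on data (my own illustration, not the NIPS 2008 implementation): regress in both directions, here with a degree-5 polynomial as a crude stand-in for nonparametric regression, and measure the dependence between residuals and regressor with a biased HSIC statistic using Gaussian kernels and median-heuristic bandwidths.

import numpy as np

def hsic(a, b):
    # Biased HSIC statistic trace(K H L H) / n^2 for 1-d samples a, b.
    n = len(a)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        bw = np.median(d2[d2 > 0])          # median heuristic
        return np.exp(-d2 / bw)
    K, L = gram(a), gram(b)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(2)
n = 300
X1 = rng.uniform(0.1, 0.9, n)               # N1 ~ U[0.1, 0.9]
X2 = X1 ** 2 + rng.uniform(-0.15, 0.15, n)  # N2 ~ U[-0.15, 0.15]

res_fwd = X2 - np.polyval(np.polyfit(X1, X2, 5), X1)  # regress X2 on X1
res_bwd = X1 - np.polyval(np.polyfit(X2, X1, 5), X2)  # regress X1 on X2
print(hsic(res_fwd, X1), hsic(res_bwd, X2))
# The forward statistic should come out much smaller than the backward
# one, pointing to the direction X1 -> X2.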

Two Variables
Is the case of two variables easy or hard?
Easy: visualization; 2 is a very small number.
Hard: it extends to the multivariate case; there are no (conditional) independences that could be exploited.

Restricted Structural Equation Models
Assumption: Assume that P(X_1, ..., X_p) follows a (specific type of) restricted SEM with graph G_0 and assume causal minimality.
Theorem: Then the true causal DAG can be recovered from the joint distribution.

Restricted Structural Equation Models
Linear Gaussian Models with same Error Variance:
X_i = Σ_{j ∈ PA_i} β_j X_j + N_i, 1 ≤ i ≤ p,
with N_i iid N(0, σ²). Assume β_j ≠ 0 (⇒ causal minimality).
Theorem: One can identify G_0 from P(X_1, ..., X_p).
JP, P. Bühlmann: Identifiability of Gaussian Structural Equation Models with Same Error Variances, arXiv e-print 2012
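To make the theorem concrete, here is a small self-contained sketch (my own, under hypothetical choices: two nodes, β = 0.8, OLS fits per node, and the equal-variance MLE σ̂² = total RSS / (np)) of the penalized-likelihood score that reappears on the "Practical Method" slide below. Among the three DAGs on two nodes, the true one attains the smallest score.

import numpy as np

def score(X, parents):
    # Penalized negative Gaussian log-likelihood with one shared error
    # variance; parents[i] lists the parent columns of node i.
    n, p = X.shape
    rss, edges = 0.0, 0
    for i, pa in enumerate(parents):
        if pa:
            coef, *_ = np.linalg.lstsq(X[:, pa], X[:, i], rcond=None)
            rss += np.sum((X[:, i] - X[:, pa] @ coef) ** 2)
            edges += len(pa)
        else:
            rss += np.sum(X[:, i] ** 2)
    sigma2 = rss / (n * p)                        # equal-variance MLE
    nll = 0.5 * n * p * (np.log(2 * np.pi * sigma2) + 1)
    return nll + 0.5 * np.log(n) * edges          # BIC-type penalty

rng = np.random.default_rng(3)
n = 2000
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)                # same error variance
X = np.column_stack([X1, X2])

for pa in ([[], []], [[], [0]], [[1], []]):       # empty, X1->X2, X2->X1
    print(pa, round(score(X, pa), 1))
# The true DAG X1 -> X2 (parents [[], [0]]) gets the smallest score.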

Restricted Structural Equation Models
Non-Linear Additive Noise Models:
X_i = f_i(X_{PA_i}) + N_i, 1 ≤ i ≤ p,
with N_i iid and graph G_0. Assume causal minimality and exclude a few combinations of f_i, P(N_i) and P(X_{PA_i}).
Theorem: Then one can identify G_0 from P(X_1, ..., X_p).
P. Hoyer, D. Janzing, J. Mooij, JP and B. Schölkopf: Nonlinear causal discovery with additive noise models, NIPS 2008
JP, J. M. Mooij, D. Janzing and B. Schölkopf: Identifiability of Causal Graphs using Functional Models, UAI 2011
Very similar results hold for discrete variables:
JP, D. Janzing and B. Schölkopf: Causal inference on discrete data using additive noise models, IEEE TPAMI 2011

Practical Method
There are 18676600744432035186664816926721 DAGs with 13 nodes. How can we find the correct SEM without enumerating all DAGs?
Gaussian SEM with same error variance: BIC with greedy search,
(β̂, σ̂²) = argmin_{β ∈ B, σ² ∈ R⁺} ( −ℓ(β, σ²; X⁽¹⁾, ..., X⁽ⁿ⁾) + (log(n)/2) ‖β‖₀ ).
JP and P. Bühlmann: Identifiability of Gaussian SEMs with same error variances, arXiv e-print 2012
Nonlinear SEM: iterated procedure, always identifying the sink node. (Improvements possible!?)
J. Mooij, D. Janzing, JP and B. Schölkopf: Regression by dependence minimization and its application to causal inference, ICML 2009
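A minimal sketch (my own, not the paper's implementation) of the iterated sink-finding procedure for nonlinear additive-noise SEMs: the node whose residuals, after regressing it on all remaining nodes, depend least on those nodes is taken as a sink, removed, and the step is repeated. It reuses the hsic statistic from the earlier two-variable sketch and an additive polynomial fit as a crude stand-in for nonparametric regression.

import numpy as np

def regress_out(y, Z, deg=3):
    # Crude stand-in for nonparametric regression: an additive
    # polynomial fit of y on the columns of Z, via least squares.
    feats = np.column_stack([Z ** d for d in range(1, deg + 1)])
    coef, *_ = np.linalg.lstsq(feats, y, rcond=None)
    return y - feats @ coef

def causal_order(X):
    # Repeatedly pick as sink the node whose residuals depend least
    # on the remaining nodes; sinks accumulate at the end of the order.
    nodes, order = list(range(X.shape[1])), []
    while len(nodes) > 1:
        deps = []
        for i in nodes:
            rest = [j for j in nodes if j != i]
            r = regress_out(X[:, i], X[:, rest])
            deps.append(sum(hsic(r, X[:, j]) for j in rest))
        sink = nodes[int(np.argmin(deps))]
        order.insert(0, sink)
        nodes.remove(sink)
    return nodes + order          # the remaining node is a source

Run on the four-variable simulation from the first sketch, this should end with X_2 (column 1); pruning unnecessary parents along the recovered order, again via independence tests, would then yield the DAG itself.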

Experiment - Linear SEMs with same Error Variance
[Figure: Structural Hamming Distance to the true DAG (left panel) and to the CPDAG (right panel) as a function of a, comparing GDS_SEV, GES, PC and BEST_SCORE.]

Experiment - Linear SEMs with same Error Variance
Table: BIC scores of GES and GDS with SEV on microarray data (smaller is better).

            Prostate  Lymphoma  DSM   Leukemia  Brain  NCI   Colon
GES         4095      4560      2711  5456      1411   5891  3224
GDS w/ SEV  6057      5404      3236  5481      1343   6288  3201

Experiment How to win a Nobel Prize? No (not enough) data for chocolate... but we have data for coffee!

Experiment - How to win a Nobel Prize?
[Figure: # Nobel Laureates per 10 million inhabitants vs. coffee consumption per capita (kg).]
Correlation: 0.698, p-value < 2.2 × 10⁻¹⁶.

Experiment - How to win a Nobel Prize?
Nobel Prize → Coffee: dependent residuals (p-value of 1.8 × 10⁻¹¹).
Coffee → Nobel Prize: dependent residuals (p-value of < 2.2 × 10⁻¹⁶).
Neither direction fits: model class too small? Causally insufficient?

Experiment Nonlinear SEMs with two continuous variables

Experiment - Nonlinear SEMs with three continuous variables
Random variables: X_1: Altitude, X_2: Temperature, X_3: Hours of sunshine.

Altitude  Sunshine  Temperature
205       1552      9.7
46        1443      8.2
794       1097      6.4
325       1572      8.1
500       1368      6.2
215       1594      9.4
383       1591      7.8
54        1702      8.3
...
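The x-axis "enumerated DAGs" in the two figures below runs over all DAGs on the three variables. A minimal sketch (my own, not tied to the original experiments) confirming that there are exactly 25 of them to enumerate:

from itertools import combinations, permutations

nodes = range(3)
arcs = [(i, j) for i in nodes for j in nodes if i != j]   # 6 possible arcs

def is_dag(edges):
    # A directed graph is acyclic iff some topological order exists.
    return any(all(order.index(a) < order.index(b) for a, b in edges)
               for order in permutations(nodes))

dags = [e for r in range(len(arcs) + 1)
        for e in combinations(arcs, r) if is_dag(e)]
print(len(dags))   # 25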

Experiment - Nonlinear SEMs with three continuous variables
Altitude, Duration of Sunshine, Temperature (349 samples), linear SEM.
[Figure: p-value of a mutual independence test of the residuals for each of the 25 enumerated DAGs (y-axis up to 1); DAG 20 (Alt, Sun, Temp) is marked.]

Experiment - Nonlinear SEMs with three continuous variables
Altitude, Duration of Sunshine, Temperature (349 samples), nonlinear SEM.
[Figure: p-value of a mutual independence test of the residuals for each of the 25 enumerated DAGs (note the much smaller scale, y-axis up to 0.01); DAG 20 (Alt, Sun, Temp) is marked.]

Conclusions
Restricted SEMs...
... exploit different assumptions than traditional methods.
... can identify the true DAG.
... work well in practice for graphs with a small number of nodes.
An interesting tool for causal inference that should be applied to large-scale data sets.
Thank you!