Causal Modeling with Generative Neural Networks


Causal Modeling with Generative Neural Networks Michele Sebag TAO, CNRS INRIA LRI Université Paris-Sud Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner https://arxiv.org/abs/1709.05321 Credit for slides: Yann LeCun Leiden Sept. 2017 1 / 27

Motivation State of the art Causal Generative Neural Nets Naive ML Approach to SW 2 / 27

ML: discriminative or generative modelling. Usually i.i.d. samples from P(X, Y). Given a training set E = {(x_i, y_i), x_i ∈ IR^d, i ∈ [1, n]}, find: supervised learning, h : X → Y or P̂(Y | X); generative model, P̂(X, Y). Predictive modelling might be based on correlations: if umbrellas in the street, then it rains. 3 / 27

The big data promise: ML models are expected to support interventions: health and nutrition, education, economics/management, climate. Intervention (Pearl 2009): do(X = x) forces variable X to take value x. Direct cause X_i → X_j: P(X_j | do(X_i = x, X_{∖ij} = c)) ≠ P(X_j | do(X_i = x′, X_{∖ij} = c)), where X_{∖ij} denotes all other variables. Example: C: Cancer, S: Smoking, G: Genetic factors; P(C | do{S = 0, G = 0}) ≠ P(C | do{S = 1, G = 0}). 4 / 27

Correlations do not support interventions Causal models are needed to support interventions 5 / 27

Why is this relevant to space weather? Causal models support understanding, and causal models are more robust, e.g., to concept drift. Given observations drawn from P(X), P(Y | X), find P̂(Y | X) that minimizes IE_{x ∼ P′(X)} [ arg max_y P̂(y | x) ≠ arg max_y P(y | x) ], where P′(X), the distribution in production, might differ from P(X) in training. 6 / 27

Causal modelling, how? Historically, based on interventions. However, interventions are often impossible (climate), unethical (making people smoke), or too expensive (e.g., in economics). Machine Learning alternatives: observational data, statistical tests, learned models, prior knowledge / assumptions / constraints. 7 / 27

Motivation State of the art Causal Generative Neural Nets Naive ML Approach to SW 8 / 27

Functional Causal Models, a.k.a. Structural Equation Models: X_i = f_i(Pa(X_i), E_i), where Pa(X_i) are the direct causes of X_i and the noise variables E_i gather all unobserved influences. Example: X_1 = f_1(E_1), X_2 = f_2(X_1, E_2), X_3 = f_3(X_1, E_3), X_4 = f_4(E_4), X_5 = f_5(X_3, X_4, E_5). Tasks: finding the structure of the graph (no cycles); finding the functions (f_i). 9 / 27
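To make the example concrete, here is a minimal NumPy sketch that draws an observational sample from the five-variable FCM above. The slide only specifies the graph structure, so the functional forms f_i and the Gaussian noise below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Noise variables E_1..E_5: all unobserved influences (arbitrarily Gaussian here).
E = rng.normal(size=(n, 5))

# The example FCM from the slide; the f_i are illustrative, only the graph matters.
X1 = E[:, 0]                          # X1 = f1(E1)
X2 = np.tanh(X1) + 0.5 * E[:, 1]      # X2 = f2(X1, E2)
X3 = X1 ** 2 + 0.5 * E[:, 2]          # X3 = f3(X1, E3)
X4 = E[:, 3]                          # X4 = f4(E4)
X5 = X3 - X4 + 0.5 * E[:, 4]          # X5 = f5(X3, X4, E5)

data = np.column_stack([X1, X2, X3, X4, X5])
print(data.shape)                     # (1000, 5): one row per observation
```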

Conducting a causal modelling study. Milestones: testing bivariate independence (statistical tests) → find the edges; conditional independence → prune the edges; full causal graph modelling → orient the edges (e.g., for a triple X − Y − Z with X ⊥ Z | Y). Challenges: computational complexity → tractable approximations; conditional independence tests are data hungry; assuming causal sufficiency (can be relaxed). 10 / 27
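As a sketch of the pruning step, the following Gaussian conditional-independence test (Fisher z on partial correlations, the test behind the PC-Gaussian baseline reported later) checks whether X ⊥ Z | Y. The function name and the regression-based residualisation are my own choices.

```python
import numpy as np
from scipy import stats

def fisher_z_ci_test(x, y, z=None):
    """p-value for 'x independent of y given z', assuming joint Gaussianity."""
    if z is None:
        r = np.corrcoef(x, y)[0, 1]
        k = 0
    else:
        z = np.atleast_2d(z.T).T                     # shape (n, k)
        design = np.c_[np.ones(len(x)), z]
        # Regress z out of x and y, then correlate the residuals (partial correlation).
        rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
        r = np.corrcoef(rx, ry)[0, 1]
        k = z.shape[1]
    stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(x) - k - 3)
    return 2 * (1 - stats.norm.cdf(abs(stat)))

# Chain X -> Y -> Z: X and Z are dependent, but independent given Y.
rng = np.random.default_rng(1)
X = rng.normal(size=2000)
Y = X + rng.normal(size=2000)
Z = Y + rng.normal(size=2000)
print(fisher_z_ci_test(X, Z))      # tiny p-value: the X - Z edge would be kept...
print(fisher_z_ci_test(X, Z, Y))   # large p-value: ...then pruned given Y
```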

X ⊥ Y independence? Categorical variables: test P(X, Y) =? P(X)·P(Y). Entropy: H(X) = −Σ_x p(x) log p(x), with x a value taken by X and p(x) its frequency. Mutual information: M(X, Y) = H(X) + H(Y) − H(X, Y). Others: χ², G-test. Continuous variables: t-test, z-test; Hilbert-Schmidt Independence Criterion (HSIC), Gretton et al. 05: Cov(f, g) = IE_{x,y}[f(x) g(y)] − IE_x[f(x)] IE_y[g(y)]; given f : X → IR and g : Y → IR, Cov(f, g) = 0 for all f, g iff X and Y are independent. 11 / 27
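A minimal NumPy sketch of the (biased) empirical HSIC statistic with an RBF kernel, HSIC = (1/n²) tr(KHLH). The bandwidth γ and the helper names are my own choices; in practice the statistic is compared to a permutation or asymptotic threshold to decide independence.

```python
import numpy as np

def rbf_gram(v, gamma):
    """RBF kernel Gram matrix of a 1-D sample v."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-gamma * d2)

def hsic_biased(x, y, gamma=1.0):
    """Biased empirical HSIC: (1/n^2) * trace(K H L H); small iff x, y look independent."""
    n = len(x)
    K, L = rbf_gram(x, gamma), rbf_gram(y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2
```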

An ML approach, Guyon et al. 2014-2015: E = {(A_i, B_i, ℓ_i)}, with ℓ_i a label indicating A_i → B_i, B_i → A_i, or neither. 12 / 27

Exploiting the distribution asymmetry (Hoyer et al. 09; Mooij et al. 2016). True model, with noise ε independent of X: Y = X + ε. Learn Y = f(X) and plot the residual Y − f(X); learn X = g(Y) and plot the residual X − g(Y). 13 / 27

Exploiting the asymmetry, 2 Given A, B 14 / 27

Exploiting the asymmetry, 2. Given A, B, learn A = f(B) and B = g(A); retain the direction whose model fits best, e.g., A → B. Example: A: altitude of a city, B: temperature. 15 / 27
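A sketch of this fit-both-directions heuristic, here scored by how independent the regression residual is from the putative cause (the additive-noise criterion of Hoyer et al. 09), reusing the hsic_biased helper defined earlier; the gradient-boosting regressor and the toy data are my own choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def anm_direction(a, b, gamma=1.0):
    """Fit b = f(a) and a = g(b); prefer the direction whose residual is
    the most independent of its input (smallest HSIC)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    f = GradientBoostingRegressor().fit(a.reshape(-1, 1), b)
    g = GradientBoostingRegressor().fit(b.reshape(-1, 1), a)
    res_ab = b - f.predict(a.reshape(-1, 1))   # residual of the A -> B model
    res_ba = a - g.predict(b.reshape(-1, 1))   # residual of the B -> A model
    score_ab = hsic_biased(a, res_ab, gamma)   # helper defined earlier
    score_ba = hsic_biased(b, res_ba, gamma)
    return "A -> B" if score_ab < score_ba else "B -> A"

# Toy check: B is a noisy nonlinear function of A, so A -> B should win.
rng = np.random.default_rng(2)
A = rng.uniform(-2, 2, 500)
B = A ** 3 + rng.normal(scale=0.5, size=500)
print(anm_direction(A, B))
```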

Find V-structures: A ⊥ C but A and C dependent given B ⇒ A → B ← C ("explaining away" of causes). 16 / 27

Motivation State of the art Causal Generative Neural Nets Naive ML Approach to SW 17 / 27

Auto-Encoders. Training set E = {x_i, x_i ∈ IR^d, i = 1...n}. Structure of the Auto-Encoder: minimization of the Mean Squared Error (MSE), minimize Σ_i ||x_i − x̂_i||². Output: z, a compressed representation of x. 18 / 27
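A minimal PyTorch sketch of such an auto-encoder trained to minimize the mean squared reconstruction error; the layer sizes and the random batch below are placeholders, not values from the slide.

```python
import torch
from torch import nn

# Dense auto-encoder: x in IR^d  ->  code z  ->  reconstruction x_hat.
d, code_dim = 784, 32                    # illustrative sizes
encoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(256, d)                   # stand-in for a real training batch
for _ in range(100):
    z = encoder(x)                       # compressed representation of x
    x_hat = decoder(z)
    loss = ((x - x_hat) ** 2).mean()     # mean squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```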

Stacked Auto-Encoders. E = {x_i, x_i ∈ IR^d, i = 1...n}. Differences: several hidden layers; minimize MSE or the cross-entropy loss, minimize −Σ_{i,j} [ x_{i,j} log x̂_{i,j} + (1 − x_{i,j}) log(1 − x̂_{i,j}) ]. 19 / 27

Variational Auto-Encoders (Kingma et al. 13). E = {x_i, x_i ∈ IR^d, i = 1...n}. Difference: the hidden layer outputs the parameters of a distribution N(µ, σ²), and this distribution is used to generate the code: z = µ + σ · N(0, 1). 20-21 / 27
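A PyTorch sketch of this difference, i.e. the reparameterisation z = µ + σ · N(0, 1); the layer sizes are arbitrary, and the KL term is the standard closed form for a Gaussian code against an N(0, I) prior (Kingma et al. 13).

```python
import torch
from torch import nn

class VAEEncoder(nn.Module):
    """Encoder whose hidden layer outputs (mu, log sigma^2) instead of a fixed code."""
    def __init__(self, d, code_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d, 128), nn.ReLU())
        self.mu = nn.Linear(128, code_dim)
        self.log_var = nn.Linear(128, code_dim)

    def forward(self, x):
        h = self.body(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * log_var) * eps   # z = mu + sigma * N(0, 1)
        # KL divergence of N(mu, sigma^2) from the N(0, I) prior, added to the
        # reconstruction loss when training the full VAE.
        kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - log_var.exp(), dim=1)
        return z, kl

z, kl = VAEEncoder(784, 32)(torch.rand(8, 784))   # toy batch of 8 inputs
```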

Causal Generative Neural Nets (Goudet et al. 17). Observed data E = {x_i, x_i ∈ IR^d, i = 1...n}; generated data Ê = {x̂_i, x̂_i ∈ IR^d, i = 1...n}. Train the generator to minimize the distance between original and generated data in IR^d: MMD(G) = (1/n²) Σ_{i,j} k(x_i, x_j) + (1/n²) Σ_{i,j} k(x̂_i, x̂_j) − (2/n²) Σ_{i,j} k(x_i, x̂_j), with k(x, z) = Σ_i exp(−γ_i ||x − z||²), γ_i ∈ {10⁻², ..., 10²}. 22 / 27
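A NumPy sketch of this multi-bandwidth MMD criterion; the particular grid of γ values is my own choice, the slide only gives the range 10⁻² to 10². In CGNN this quantity, computed between the observed sample and the sample generated by a candidate graph, is the loss used to train and score the generator.

```python
import numpy as np

def mmd2(X, Y, gammas=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Squared MMD between samples X (n, d) and Y (m, d) with the kernel
    k(x, z) = sum_i exp(-gamma_i * ||x - z||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return sum(np.exp(-g * d2) for g in gammas)
    n, m = len(X), len(Y)
    return k(X, X).sum() / n**2 + k(Y, Y).sum() / m**2 - 2 * k(X, Y).sum() / (n * m)

# Sanity check: samples from the same distribution give a small MMD.
rng = np.random.default_rng(3)
print(mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2))))          # close to 0
print(mmd2(rng.normal(size=(500, 2)), rng.normal(loc=2, size=(500, 2))))   # clearly > 0
```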

Relaxing the causal sufficiency assumption: shared noise variables E_{i,j} stand for unobserved common causes. X_2 = f_2(E_2, E_{2,3}), X_3 = f_3(E_3, E_{2,3}, E_{3,5}), X_4 = f_4(E_4, E_{4,5}), X_5 = f_5(X_3, X_4, E_5, E_{3,5}, E_{4,5}). 23 / 27

Graph inference. Results: area under the precision/recall curve.

Algorithm              G2               G3               G4
Constraint-based
  PC-Gaussian          82.3 ±4  (87.8)  80.0 ±7  (89.2)  88.1 ±10 (95.7)
  PC-HSIC              93.4 ±3  (78.5)  93.0 ±4  (77.9)  98.9 ±2  (88.0)
Score-based
  GES                  75.3 ±7  (81.2)  73.6 ±7  (77.7)  69.3 ±11 (78.6)
Pairwise orientation
  LiNGAM               64.4 ±4  (100)   71.1 ±1  (100)   71.6 ±7  (100)
  ANM                  72.9 ±9  (100)   72.5 ±4  (100)   79.9 ±5  (100)
  Jarfo                69.9 ±9  (100)   87.3 ±3  (100)   88.5 ±5  (100)
  CGNN-Fourier         94.5 ±2  (100)   84.9 ±9  (100)   93.6 ±3  (100)
  CGNN-MMD             96.9 ±1  (100)   96.5 ±3  (100)   97.2 ±3  (100)

Python framework available at: https://github.com/diviyan-kalainathan/causaldiscoverytoolbox
Caveat: up to 50 variables. 24 / 27

Motivation State of the art Causal Generative Neural Nets Naive ML Approach to SW 25 / 27

Compact solar state representations

Principle 9

Image preprocessing 10

Autoencoders: dimensionality reduction; input and output similarity; bottleneck; 256×256 → 512. 11

Autoencoders: 512×512 → 512. 12

Autoencoders: 256×256 → 64. 13

Variational Autoencoder: assumption on the latent space distribution; 256×256 → 90. 14

Autoencoders training: intermediate image size; custom loss: loss = (y_true − y_pred)² / (y_true + ε)^α + (y_true − y_pred)² / (1 − y_true + ε)^α. 15
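A PyTorch sketch of this custom loss; ε and α are not given on the slide, so the values below are placeholders. The re-weighting makes errors on extreme pixel values (near 0 and near 1) cost more than errors on mid-range pixels.

```python
import torch

def custom_pixel_loss(y_true, y_pred, eps=1e-3, alpha=1.0):
    """Weighted squared error from the slide; eps and alpha are illustrative values."""
    sq = (y_true - y_pred) ** 2
    return (sq / (y_true + eps) ** alpha + sq / (1 - y_true + eps) ** alpha).mean()

# Example on a toy batch of images scaled to [0, 1].
y_true = torch.rand(4, 1, 256, 256)
y_pred = torch.rand(4, 1, 256, 256)
print(custom_pixel_loss(y_true, y_pred))
```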

Results.

Autoencoder       Conv    Conv + Dense   Conv + PCA   Variational
Reduction rate    1/128   1/1024         1/524        1/728

Visual similarity; smoothness over time; classification for verification. 16

Results.

Event             precision   recall   accuracy   F1-score
Coronal hole      0.74        0.36     0.62       0.48
Lepping           0.90        0.51     0.77       0.65
Pseudo streamer   0.66        0.93     0.78       0.77
Strahl            0.55        0.98     0.73       0.70
* Random predictor performance is 0.625 for accuracy and 0.25 for the other metrics.

Only 8000 labeled images; time distribution; prediction at L1; low performances; let's extract more information. 17

Going further: classification of solar events (more data; caveat: the train/test split); predicting data at L1 (the propagation time from the Sun to L1); help needed! 26 / 27

Thanks Olivier Goudet, Diviyan Kalainathan, Isabelle Guyon, Aris Tritas Mhamed Hajaiej, Cyril Furtlehner, Aurélien Decelle 27 / 27