Causal Modeling with Generative Neural Networks

1 Causal Modeling with Generative Neural Networks
Michele Sebag
TAO, CNRS, INRIA, LRI, Université Paris-Sud
Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner
Credit for slides: Yann LeCun
Leiden, Sept

2 Outline
- Motivation
- State of the art
- Causal Generative Neural Nets
- Naive ML Approach to SW

3 ML: discriminative or generative modelling
Given a training set of (usually iid) samples drawn from $P(X, Y)$: $E = \{(x_i, y_i),\ x_i \in \mathbb{R}^d,\ i \in [[1, n]]\}$
Find
- Supervised learning: $h : X \mapsto Y$ or $\hat{P}(Y \mid X)$
- Generative model: $\hat{P}(X, Y)$
Predictive modelling might be based on correlations: "If umbrellas in the street, then it rains"

4 The big data promise: ML models are expected to support interventions
- health and nutrition
- education
- economics/management
- climate
Intervention (Pearl 2009)
Intervention $do(X = x)$ forces variable $X$ to value $x$
Direct cause $X_i \to X_j$: $P_{X_j \mid do(X_i = x,\, X_{\setminus ij} = c)} \neq P_{X_j \mid do(X_i = x',\, X_{\setminus ij} = c)}$
Example, with C: Cancer, S: Smoking, G: Genetic factors: $P(C \mid do\{S = 0, G = 0\}) \neq P(C \mid do\{S = 1, G = 0\})$

5 Correlations do not support interventions
Causal models are needed to support interventions

6 Why is this relevant to space weather?
- Causal models support understanding
- Causal models are more robust, e.g., to concept drift
Given observations drawn after $P(X)$, $P(Y \mid X)$,
find $\hat{P}(Y \mid X)$ that minimizes $\mathbb{E}_{x \sim P(X)}\left[\arg\max_y \hat{P}(y \mid x) \neq \arg\max_y P(y \mid x)\right]$
But $P(X)$ in production might differ from $P(X)$ in training

7 Causal modelling, how
Historically, based on interventions. However, interventions are often
- impossible (climate)
- unethical (making people smoke)
- too expensive (e.g., in economics)
Machine Learning alternatives
- Observational data
- Statistical tests
- Learned models
- Prior knowledge / Assumptions / Constraints

9 Functional Causal Models, a.k.a. Structural Equation Models
$X_i = f_i(Pa(X_i), E_i)$
- $Pa(X_i)$: direct causes of $X_i$
- $E_i$: noise variables, standing for all unobserved influences
$X_1 = f_1(E_1)$
$X_2 = f_2(X_1, E_2)$
$X_3 = f_3(X_1, E_3)$
$X_4 = f_4(E_4)$
$X_5 = f_5(X_3, X_4, E_5)$
Tasks
- Finding the structure of the graph (no cycles)
- Finding the functions $(f_i)$
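
A functional causal model is also a sampling recipe: draw the noise variables, then evaluate each equation in causal order. The sketch below uses the five-variable graph of this slide; the mechanisms (linear and tanh) and the Gaussian noise are hypothetical choices, since the slide leaves the functions $f_i$ unspecified. The `do` argument forces a variable to a value, in the spirit of Pearl's do-operator.

```python
import math
import random

def sample_fcm(n, do=None, seed=0):
    """Sample n joint observations from a 5-variable FCM.

    The graph follows the slide; the mechanisms f_i and the Gaussian noise
    are hypothetical choices made for this illustration only.
    `do` maps a variable name to a forced value, mimicking do(X = x).
    """
    rng = random.Random(seed)
    do = do or {}
    samples = []
    for _ in range(n):
        # Draw all noise variables first, so an intervention never shifts the noise stream.
        e1, e2, e3, e4, e5 = (rng.gauss(0.0, 1.0) for _ in range(5))
        x1 = do.get("X1", e1)
        x2 = do.get("X2", 2.0 * x1 + 0.5 * e2)
        x3 = do.get("X3", math.tanh(x1) + 0.5 * e3)
        x4 = do.get("X4", e4)
        x5 = do.get("X5", x3 + x4 + 0.5 * e5)
        samples.append({"X1": x1, "X2": x2, "X3": x3, "X4": x4, "X5": x5})
    return samples

def mean_of(samples, name):
    """Empirical mean of one variable."""
    return sum(s[name] for s in samples) / len(samples)

observational = sample_fcm(5000)
do_x3 = sample_fcm(5000, do={"X3": 2.0})  # X3 is a direct cause of X5
do_x2 = sample_fcm(5000, do={"X2": 2.0})  # X2 has no directed path to X5

print("E[X5]              :", round(mean_of(observational, "X5"), 3))
print("E[X5 | do(X3 = 2)] :", round(mean_of(do_x3, "X5"), 3))
print("E[X5 | do(X2 = 2)] :", round(mean_of(do_x2, "X5"), 3))
```

Intervening on a direct cause of $X_5$ moves its distribution, whereas intervening on $X_2$, which has no path to $X_5$, leaves it unchanged.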

10 Conducting a causal modelling study
Milestones
- Testing bivariate independence (statistical tests): find edges, e.g. $X - Y$; $Y - Z$
- Conditional independence: prune the edges, e.g. $X \perp\!\!\!\perp Z \mid Y$
- Full causal graph modelling: orient the edges, e.g. $X \to Y \to Z$
Challenges
- Computational complexity: tractable approximation
- Conditional independence: data-hungry tests
- Assuming causal sufficiency: can be relaxed
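
The "find edges, then prune" milestones can be illustrated on a toy chain. The sketch below assumes a linear-Gaussian chain X, Y, Z (a hypothetical data set, not one from the talk) and uses partial correlation as a stand-in for a conditional independence test; that substitution is only justified in the linear-Gaussian case.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equally long samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(a, b, given):
    """Correlation of a and b after linearly removing the effect of `given`."""
    r_ab, r_ag, r_bg = corr(a, b), corr(a, given), corr(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]   # X causes Y
z = [yi + rng.gauss(0.0, 1.0) for yi in y]   # Y causes Z

marginal_xz = corr(x, z)                # clearly non-zero: a candidate edge X - Z
conditional_xz = partial_corr(x, z, y)  # close to zero: the edge X - Z is pruned

print("corr(X, Z)     =", round(marginal_xz, 3))
print("corr(X, Z | Y) =", round(conditional_xz, 3))
```

The bivariate test alone would keep a spurious X - Z edge; conditioning on Y removes it.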

11 $X \perp\!\!\!\perp Y$: independence
Categorical variables: $P(X, Y) \stackrel{?}{=} P(X) \cdot P(Y)$
- Entropy: $H(X) = -\sum_x p(x) \log p(x)$, with $x$ a value taken by $X$ and $p(x)$ its frequency
- Mutual information: $M(X, Y) = H(X) + H(Y) - H(X, Y)$
- Others: $\chi^2$, G-test
Continuous variables
- t-test, z-test
- Hilbert-Schmidt Independence Criterion (HSIC), Gretton et al., 05
  $Cov(f, g) = \mathbb{E}_{x,y}[f(x) g(y)] - \mathbb{E}_x[f(x)] \, \mathbb{E}_y[g(y)]$
  Given $f : X \to \mathbb{R}$ and $g : Y \to \mathbb{R}$, $Cov(f, g) = 0$ for all $f, g$ iff $X$ and $Y$ are independent
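
Both families of criteria can be computed in a few lines. The sketch below estimates mutual information from frequencies for categorical samples, and a biased HSIC estimate, trace(KHLH)/n^2, for continuous samples. The Gaussian kernel, its bandwidth `gamma` and the toy data are assumptions made for the illustration.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) = -sum_x p(x) log p(x), in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """M(X, Y) = H(X) + H(Y) - H(X, Y) from paired categorical samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def hsic(xs, ys, gamma=1.0):
    """Biased HSIC estimate trace(K H L H) / n^2 with Gaussian kernels."""
    n = len(xs)

    def gram(vals):
        return [[math.exp(-gamma * (a - b) ** 2) for b in vals] for a in vals]

    def center(m):
        # H M H for a symmetric matrix M, with H = I - (1/n) 1 1^T.
        row_means = [sum(row) / n for row in m]
        grand_mean = sum(row_means) / n
        return [[m[i][j] - row_means[i] - row_means[j] + grand_mean
                 for j in range(n)] for i in range(n)]

    k_centered, l_gram = center(gram(xs)), gram(ys)
    return sum(k_centered[i][j] * l_gram[i][j]
               for i in range(n) for j in range(n)) / n ** 2

rng = random.Random(0)

# Categorical case: a coin against a copy of itself, then against another coin.
coin = [rng.randint(0, 1) for _ in range(2000)]
other = [rng.randint(0, 1) for _ in range(2000)]
mi_dependent = mutual_information(coin, list(coin))
mi_independent = mutual_information(coin, other)

# Continuous case: Y = X^2 is uncorrelated with X but fully dependent on it.
x = [rng.uniform(-1.0, 1.0) for _ in range(200)]
hsic_dependent = hsic(x, [v * v for v in x])
hsic_independent = hsic(x, [rng.uniform(-1.0, 1.0) ** 2 for _ in range(200)])

print("MI   dependent / independent:", mi_dependent, mi_independent)
print("HSIC dependent / independent:", hsic_dependent, hsic_independent)
```

In practice the raw statistic is compared with a threshold obtained from a permutation test or an asymptotic approximation; that step is omitted here.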

12 An ML approach (Guyon et al.)
$E = \{(A_i, B_i, \ell_i)\}$, $\ell_i \in \{\to, \leftarrow, \perp\!\!\!\perp\}$

13 Exploiting the distribution asymmetry (Hoyer et al. 09; Mooij et al.)
True model, with noise $\epsilon$ independent of $X$: $Y = X + \epsilon$
- Learn $Y = f(X)$, plot the residual $Y - f(X)$
- Learn $X = g(Y)$, plot the residual $X - g(Y)$
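
The asymmetry can be checked numerically instead of by plotting. The sketch below assumes uniform X and uniform noise (with Gaussian variables and a linear model the two directions would be indistinguishable), fits a line in each direction, and scores how much the residual still depends on the input. The score used here, the correlation between squared input and squared residual, is a crude stand-in for a proper independence test such as HSIC.

```python
import math
import random

def fit_line(u, v):
    """Least-squares fit v ~ a * u + b; returns (a, b)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    a = sum((p - mu) * (q - mv) for p, q in zip(u, v)) / sum((p - mu) ** 2 for p in u)
    return a, mv - a * mu

def corr(a, b):
    """Pearson correlation of two equally long samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def residual_dependence(cause, effect):
    """|corr| between the squared centred input and the squared residual of effect ~ f(cause)."""
    a, b = fit_line(cause, effect)
    residuals = [e - (a * c + b) for c, e in zip(cause, effect)]
    mean_cause = sum(cause) / len(cause)
    return abs(corr([(c - mean_cause) ** 2 for c in cause], [r ** 2 for r in residuals]))

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
y = [xi + rng.uniform(-1.0, 1.0) for xi in x]  # Y = X + eps, eps independent of X

score_x_to_y = residual_dependence(x, y)  # residual of Y = f(X) against X
score_y_to_x = residual_dependence(y, x)  # residual of X = g(Y) against Y
direction = "X -> Y" if score_x_to_y < score_y_to_x else "Y -> X"

print("dependence score, X -> Y:", round(score_x_to_y, 3))
print("dependence score, Y -> X:", round(score_y_to_x, 3))
print("retained direction:", direction)
```

In the causal direction the residual is just the noise and carries no information about the input; in the reverse direction the residual is uncorrelated with the input but not independent of it.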

16 Exploiting the asymmetry, 2
Given A, B:
- Learn $A = f(B)$
- Learn $B = g(A)$
- Retain the model with the best fit: $A \to B$
Example: A: altitude of city, B: temperature

17 Find V-structure: $A \perp\!\!\!\perp C$ and $A \not\perp\!\!\!\perp C \mid B$
Explaining away causes
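
A small simulation shows the explaining-away effect. The sketch assumes a hypothetical collider, B = A + C + noise with independent Gaussian A and C, and approximates conditioning on B by keeping only the samples whose B falls in a narrow slice around zero.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equally long samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

rng = random.Random(0)
n = 20000
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
c = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [ai + ci + 0.5 * rng.gauss(0.0, 1.0) for ai, ci in zip(a, c)]  # collider A -> B <- C

marginal = corr(a, c)  # A and C are independent causes

# Conditioning on B, approximated by a narrow slice |B| < 0.2.
selected = [(ai, ci) for ai, ci, bi in zip(a, c, b) if abs(bi) < 0.2]
conditional = corr([p[0] for p in selected], [p[1] for p in selected])

print("corr(A, C)          =", round(marginal, 3))
print("corr(A, C | B ~ 0)  =", round(conditional, 3))
```

Once B is known, a large A has to be compensated by a small C, so the two independent causes become negatively dependent. This signature is what lets a V-structure be oriented from observational data.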

19 Auto-Encoders
Training set: $E = \{(x_i),\ x_i \in \mathbb{R}^d,\ i = 1 \ldots n\}$
Structure of Auto-Encoder
Minimization of Mean Squared Error (MSE): minimize $\sum_i \|x_i - \hat{x}_i\|^2$
Output: $z$, a compressed representation of $x$

20 Stacked Auto-Encoders
$E = \{(x_i),\ x_i \in \mathbb{R}^d,\ i = 1 \ldots n\}$
Differences
- Several hidden layers
- Minimize MSE or cross-entropy loss: minimize $-\sum_{i,j} \left[ x_{i,j} \log \hat{x}_{i,j} + (1 - x_{i,j}) \log (1 - \hat{x}_{i,j}) \right]$
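
The two reconstruction losses can be written out directly. This is a minimal sketch on plain Python lists, assuming inputs scaled to [0, 1] for the cross-entropy case; the clipping constant `eps` is an addition for numerical safety, not part of the slide.

```python
import math

def mse_loss(batch, reconstruction):
    """Sum over samples of the squared Euclidean reconstruction error."""
    return sum((xj - pj) ** 2
               for x, x_hat in zip(batch, reconstruction)
               for xj, pj in zip(x, x_hat))

def cross_entropy_loss(batch, reconstruction, eps=1e-12):
    """Negative sum over i, j of x log(x_hat) + (1 - x) log(1 - x_hat)."""
    total = 0.0
    for x, x_hat in zip(batch, reconstruction):
        for xj, pj in zip(x, x_hat):
            pj = min(max(pj, eps), 1.0 - eps)  # keep the logarithms finite
            total -= xj * math.log(pj) + (1.0 - xj) * math.log(1.0 - pj)
    return total

batch = [[1.0, 0.0]]
uninformative = [[0.5, 0.5]]
close = [[0.9, 0.1]]

print("MSE          :", mse_loss(batch, uninformative), mse_loss(batch, close))
print("cross-entropy:", cross_entropy_loss(batch, uninformative), cross_entropy_loss(batch, close))
```

Both losses decrease as the reconstruction approaches the input; the cross-entropy penalizes confident mistakes much more heavily than the MSE does.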

21 Variational Auto-Encoders (Kingma et al. 13)
$E = \{(x_i),\ x_i \in \mathbb{R}^d,\ i = 1 \ldots n\}$
Difference
- Hidden layer: parameters of a distribution $\mathcal{N}(\mu, \sigma^2)$
- Distribution used to generate values: $z = \mu + \sigma \times \mathcal{N}(0, 1)$
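
The sampling step $z = \mu + \sigma \times \mathcal{N}(0, 1)$ is easy to check empirically. In the sketch below, `mu` and `sigma` are hypothetical values standing in for the outputs of the encoder's hidden layer.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1) independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
mu, sigma = [1.5, -0.5], [0.5, 2.0]  # hypothetical encoder outputs
draws = [sample_latent(mu, sigma, rng) for _ in range(20000)]

first = [d[0] for d in draws]
mean_first = sum(first) / len(first)
std_first = math.sqrt(sum((v - mean_first) ** 2 for v in first) / len(first))

print("empirical mean and std of z[0]:", round(mean_first, 3), round(std_first, 3))
```

Writing the sample as a deterministic function of $(\mu, \sigma)$ plus external noise is what allows gradients to flow back to the encoder parameters during training.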

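23 Causal Generative Neural Nets (Goudet et al. 17)
Original data: $E = \{(x_i),\ x_i \in \mathbb{R}^d,\ i = 1 \ldots n\}$
Generated data: $\hat{E} = \{(\hat{x}_i),\ \hat{x}_i \in \mathbb{R}^d,\ i = 1 \ldots n'\}$
Train the generator to minimize the distance between original and generated data in $\mathbb{R}^d$:
$MMD(G) = \frac{1}{n^2} \sum_{i,j} k(x_i, x_j) + \frac{1}{n'^2} \sum_{i,j} k(\hat{x}_i, \hat{x}_j) - \frac{2}{n n'} \sum_{i,j} k(x_i, \hat{x}_j)$
$k(x, z) = \sum_i \exp\left(-\frac{\gamma_i}{d} \|x - z\|^2\right)$, $\gamma_i$ in { }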
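
The MMD criterion can be evaluated directly on two point clouds. The sketch below follows the three-term estimate and the multi-bandwidth Gaussian kernel as read from this slide, including the scaling by the dimension d. The bandwidth values in `GAMMAS` are hypothetical, since the slide's values did not survive, and the two "generators" are simply Gaussian samplers used as stand-ins.

```python
import math
import random

GAMMAS = (0.1, 1.0, 10.0)  # hypothetical bandwidths, not the values used in the talk

def kernel(x, z, gammas=GAMMAS):
    """Sum of Gaussian kernels exp(-gamma / d * ||x - z||^2) over several bandwidths."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return sum(math.exp(-g / d * sq_dist) for g in gammas)

def mmd(original, generated, gammas=GAMMAS):
    """Three-term MMD estimate between two samples of points."""
    n, m = len(original), len(generated)
    k_oo = sum(kernel(a, b, gammas) for a in original for b in original) / n ** 2
    k_gg = sum(kernel(a, b, gammas) for a in generated for b in generated) / m ** 2
    k_og = sum(kernel(a, b, gammas) for a in original for b in generated) / (n * m)
    return k_oo + k_gg - 2.0 * k_og

def gaussian_cloud(n, shift, rng):
    """n points in R^2 drawn from N(shift, 1) on each coordinate."""
    return [(rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)) for _ in range(n)]

rng = random.Random(0)
observed = gaussian_cloud(100, 0.0, rng)
good_generator = gaussian_cloud(100, 0.0, rng)  # same distribution as the data
bad_generator = gaussian_cloud(100, 2.0, rng)   # shifted distribution

mmd_good = mmd(observed, good_generator)
mmd_bad = mmd(observed, bad_generator)
print("MMD, matching generator:", round(mmd_good, 4))
print("MMD, shifted generator :", round(mmd_bad, 4))
```

A generator whose samples follow the data distribution gets a value close to zero, so the quantity can serve as a training loss that needs no explicit density estimate.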

24 Relaxing the causal sufficiency assumption
Shared noise variables $E_{i,j}$ stand for unobserved common causes of $X_i$ and $X_j$
$X_2 = f_2(E_2, E_{2,3})$
$X_3 = f_3(E_3, E_{2,3}, E_{3,5})$
$X_4 = f_4(E_4, E_{4,5})$
$X_5 = f_5(X_3, X_4, E_5, E_{3,5}, E_{4,5})$

25 Graph inference
Results: area under the precision/recall curve
Algorithm              G2                G3                G4
Constraint-based
PC-Gaussian            82.3 ±4 (87.8)    80.0 ±7 (89.2)    88.1 ±10 (95.7)
PC-HSIC                93.4 ±3 (78.5)    93.0 ±4 (77.9)    98.9 ±2 (88.0)
Score-based
GES                    75.3 ±7 (81.2)    73.6 ±7 (77.7)    69.3 ±11 (78.6)
Pairwise orientation
LiNGAM                 64.4 ±4 (100)     71.1 ±1 (100)     71.6 ±7 (100)
ANM                    72.9 ±9 (100)     72.5 ±4 (100)     79.9 ±5 (100)
Jarfo                  69.9 ±9 (100)     87.3 ±3 (100)     88.5 ±5 (100)
CGNN-Fourier           94.5 ±2 (100)     84.9 ±9 (100)     93.6 ±3 (100)
CGNN-MMD               96.9 ±1 (100)     96.5 ±3 (100)     97.2 ±3 (100)
Python framework available at :
Caveat: up to 50 variables

27 Compact solar state representations

28 Principle

29 Image preprocessing

33 Autoencoders
- Dimensionality reduction
- Input and Output similarity
- Bottleneck
256x

34 Autoencoders 512x

35 Autoencoders 256x

36 Variational Autoencoder
Assumption on the latent space distribution
256x

38 Autoencoders training
- Intermediate image size
- Custom loss: $loss = \frac{(y_{true} - y_{pred})^2}{(y_{true} + \epsilon)^\alpha} + \frac{(y_{true} - y_{pred})^2}{(1 - y_{true} + \epsilon)^\alpha}$
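
One way to read the custom loss is as a squared error reweighted by the true pixel intensity. The sketch below assumes that the two terms are ratios, which is the reading that makes the role of $\epsilon$ as a guard against division by zero meaningful. It also assumes intensities in [0, 1] and an average over pixels. The default values of `alpha` and `eps` are hypothetical.

```python
def weighted_reconstruction_loss(y_true, y_pred, alpha=1.0, eps=1e-3):
    """Mean over pixels of err / (y + eps)^alpha + err / (1 - y + eps)^alpha,
    with err = (y_true - y_pred)^2 and pixel values assumed to lie in [0, 1]."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = (t - p) ** 2
        total += err / (t + eps) ** alpha + err / (1.0 - t + eps) ** alpha
    return total / len(y_true)

# The same absolute error of 0.5, on a dark pixel and on a mid-grey pixel.
loss_dark_pixel = weighted_reconstruction_loss([0.0], [0.5])
loss_grey_pixel = weighted_reconstruction_loss([0.5], [0.0])

print("error on a dark pixel     :", round(loss_dark_pixel, 3))
print("error on a mid-grey pixel :", round(loss_grey_pixel, 3))
```

Under this reading, errors on pixels whose true value is close to 0 or 1 are weighted far more heavily than errors on mid-range pixels, and `alpha` controls how strong that emphasis is.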

41 Results
Autoencoder        Conv     Conv + Dense   Conv + PCA   Variational
Reduction rate     1/128    1/1024         1/524        1/728
Visual similarity
Smoothness over time
Classification for verification

46 Results
Event              precision   recall   accuracy   F1-score
Coronal hole
Lepping
Pseudo streamer
Strahl
* Random predictor performances are for accuracy and 0.25 for the rest
- Only 8000 labeled images
- Time distribution
- Prediction at L1
- Low performances
Let's extract more information

47 Going further
Classification of solar events
- More data
- Caveat: the train/test split
Predicting data at L1
- the propagation time from sun to L1
- help needed!

48 Thanks
Olivier Goudet, Diviyan Kalainathan, Isabelle Guyon, Aris Tritas,
Mhamed Hajaiej, Cyril Furtlehner, Aurélien Decelle
