
Pairwise measures of causal direction in linear non-gaussian acyclic models

Dept of Mathematics and Statistics & Dept of Computer Science
University of Helsinki, Finland

Abstract

- Estimating causal direction is a fundamental problem in science.
- Bayesian networks or structural equation models are ill-defined for gaussian data.
- They can be estimated using non-gaussianity (Shimizu et al, JMLR 2006).
- Here, we develop a new approach based on likelihood ratios of variable pairs:
  - approximated by simple nonlinear correlations;
  - these further lead to higher-order cumulants, which
    - allow deeper theoretical analysis,
    - give an intuitive interpretation,
    - are noise-robust.

Introduction

- Model connections between the measured variables: which variable causes which?
- Correlation does not equal causation, but we can go beyond correlation.
- Two fundamental approaches:
  - If we have time series and the time resolution of the measurements is fast enough: use autoregressive modelling.
  - Otherwise, use structural equation models (here).


Structural equation models

- How does an externally imposed change in one variable affect the others?

  $$x_i = \sum_{j \neq i} b_{ij} x_j + e_i$$

- Difficult to estimate; not simple regression. Classic methods fail in general.
- Can be estimated if (Shimizu et al., JMLR, 2006)
  1. the $e_i(t)$ are mutually independent,
  2. the $e_i(t)$ are non-gaussian, e.g. sparse,
  3. the $b_{ij}$ are acyclic: there is an ordering of the $x_i$ where effects are all forward.

[Figure: example directed acyclic graph over the variables x1, ..., x7, with connection weights on the edges]

Estimation of SEM by ICA

- We have thus defined a linear non-gaussian acyclic model (LiNGAM; Shimizu et al, JMLR, 2006).
- Previously, we proposed estimation using ICA. Transform

  $$x = Bx + e \quad\Longrightarrow\quad x = (I - B)^{-1} e$$

  This becomes an ICA model!
- But one complication: ICA does not estimate the order of the $e_i$.
  - In SEM, the $e_i$ have a specific order.
  - Acyclicity allows determination of the right order.
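To make the generative model concrete, here is a minimal sketch of sampling from $x = Bx + e$. It assumes Laplace-distributed disturbances as one example of sparse noise, and a strictly lower-triangular `B` (variables already listed in a causal order), so that $x = (I - B)^{-1} e$ reduces to forward substitution. The function name `simulate_lingam` and the example weights are illustrative, not taken from the talk.

```python
import random

def simulate_lingam(B, n_samples, seed=0):
    """Draw samples from x = Bx + e with Laplace (sparse) disturbances.

    Assumes B is strictly lower triangular, so each x[i] depends only on
    x[0..i-1] and (I - B)^{-1} e can be computed by forward substitution.
    """
    rng = random.Random(seed)
    n = len(B)
    samples = []
    for _ in range(n_samples):
        # Laplace noise: a random sign times an exponential variable
        e = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
        x = []
        for i in range(n):
            x.append(sum(B[i][j] * x[j] for j in range(i)) + e[i])
        samples.append(x)
    return samples

# Hypothetical three-variable chain: x0 -> x1 -> x2
B = [[0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, -0.5, 0.0]]
data = simulate_lingam(B, n_samples=5)
```

Each row of `data` is one observation of $(x_0, x_1, x_2)$; the three conditions for identifiability (independent, non-gaussian disturbances and an acyclic `B`) all hold by construction.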

Pairwise likelihood ratio approach

- Consider two variables, $x$ and $y$, both standardized and non-gaussian.
- Goal: distinguish between two causal models:

  $$y = \rho x + d \qquad (x \to y) \qquad (1)$$
  $$x = \rho y + e \qquad (y \to x) \qquad (2)$$

  where the disturbance $d$ is independent of $x$ in (1), and $e$ is independent of $y$ in (2).
- Simple solution: compute the likelihoods of the two models, and take their ratio.

Deriving likelihood ratios

- The two models are special cases of LiNGAM, and the likelihood can be obtained as in (Hyvarinen, JMLR, 2010). For $x \to y$:

  $$\log L(x \to y) = \sum_t \left[ G_x(x_t) + G_d\!\left( \frac{y_t - \rho x_t}{\sqrt{1-\rho^2}} \right) \right] - \log(1-\rho^2)$$

  where $G_x(u) = \log p_x(u)$, and $G_d$ is the standardized log-pdf of the residual when regressing $y$ on $x$.
- Symmetrically obtained for $y \to x$.
- This gives the likelihood ratio in closed form, if we have good approximations of the pdfs of the variables and the residuals.
- As in ICA, we would typically approximate:

  $$G(u) = -2 \log\cosh\!\left( \frac{\pi}{2\sqrt{3}}\, u \right) + \text{const.} \qquad (3)$$
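The likelihood ratio can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the log cosh approximation is used for all four log-pdfs (both variables and both residuals); under that assumption the $\log(1-\rho^2)$ terms and the constants cancel in the difference. The helper names and the simulated data are hypothetical.

```python
import math
import random

def standardize(v):
    """Shift v to zero mean and scale it to unit variance."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]

def log_pdf(u):
    """G(u) = -2 log cosh(pi u / (2 sqrt 3)), additive constant dropped."""
    z = abs(math.pi / (2.0 * math.sqrt(3.0)) * u)
    # log cosh(z) = z + log(1 + exp(-2z)) - log 2, stable for large z
    return -2.0 * (z + math.log1p(math.exp(-2.0 * z)) - math.log(2.0))

def likelihood_ratio(x, y):
    """Average of log L(x -> y) - log L(y -> x); positive suggests x -> y."""
    x, y = standardize(x), standardize(y)
    T = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / T
    s = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for a, b in zip(x, y):
        forward = log_pdf(a) + log_pdf((b - rho * a) / s)   # model x -> y
        backward = log_pdf(b) + log_pdf((a - rho * b) / s)  # model y -> x
        total += forward - backward
    return total / T

# Simulated example (an assumption for illustration): Laplace cause and disturbance
rng = random.Random(0)

def laplace():
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)

x = [laplace() for _ in range(20000)]
y = [0.6 * a + laplace() for a in x]
lr = likelihood_ratio(x, y)  # expected to be positive, since x causes y here
```

Swapping the arguments flips the sign of the result, so a single evaluation is enough to choose between the two models.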

Approximation of likelihood ratios

- Assume the pdfs of $x$ and $y$ are equal, and take the Taylor expansion

  $$G\!\left( \frac{y - \rho x}{\sqrt{1-\rho^2}} \right) = G(y) - \rho\, x\, g(y) + O(\rho^2) \qquad (4)$$

  where $g$ is the derivative of $G$.
- We obtain:

  $$\frac{\log L(x \to y) - \log L(y \to x)}{T} \approx R = \frac{\rho}{T} \sum_t \left[ -x_t\, g(y_t) + g(x_t)\, y_t \right]$$

  where typically $g(u) = -\tanh(u)$, so that $R = \rho\, \hat{E}\{x \tanh(y) - \tanh(x)\, y\}$.
- Choosing between the models is reduced to considering the sign of a nonlinear correlation!
- If $R > 0$, decide $x \to y$; otherwise decide $y \to x$.
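The resulting decision rule is short enough to sketch directly. The code below assumes the sign convention $R = \rho\, \hat{E}\{x \tanh(y) - \tanh(x)\, y\}$, with $R > 0$ read as $x \to y$; the function names and the simulated Laplace data are illustrative assumptions.

```python
import math
import random

def standardize(v):
    """Shift v to zero mean and scale it to unit variance."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]

def nonlinear_correlation(x, y):
    """R = rho * mean(x tanh(y) - tanh(x) y); R > 0 suggests x -> y."""
    x, y = standardize(x), standardize(y)
    T = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / T
    diff = sum(a * math.tanh(b) - math.tanh(a) * b for a, b in zip(x, y)) / T
    return rho * diff

# Simulated example (an assumption for illustration): x causes y, sparse noise
rng = random.Random(1)

def laplace():
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)

x = [laplace() for _ in range(20000)]
y = [0.6 * a + laplace() for a in x]
R = nonlinear_correlation(x, y)
direction = "x -> y" if R > 0 else "y -> x"
```

Compared with the full likelihood ratio, this needs only one pass over the data and no evaluation of residual densities.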

Approach using higher-order cumulants

- Nonlinear correlations can be replaced using cumulants, e.g.

  $$\rho\, \hat{E}\{x^3 y - x y^3\} = \rho\, [\operatorname{cum}(x,x,x,y) - \operatorname{cum}(y,y,y,x)]$$

- Likely to have similar qualitative behaviour as $\rho\, \hat{E}\{x \tanh(y) - \tanh(x)\, y\}$, since the two are related by the Taylor expansion $\tanh(u) = u - \tfrac{1}{3} u^3 + o(u^3)$.
- For skewed distributions, we can use the third-order cumulant $\rho\, \hat{E}\{x^2 y - x y^2\}$.
- Important points:
  - Cumulants can be proven to give the right direction.
  - Immune to additive noise.
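A minimal sketch of the fourth-order cumulant measure, assuming sparse (positive-kurtosis) variables so that a positive value is read as $x \to y$. The second call illustrates the noise-robustness claim by adding independent gaussian measurement noise to both variables; the noise level, sample size and function names are assumptions chosen for illustration.

```python
import math
import random

def standardize(v):
    """Shift v to zero mean and scale it to unit variance."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]

def cumulant_measure(x, y):
    """rho * mean(x^3 y - x y^3) on standardized data.

    Positive values suggest x -> y when the variables are sparse.
    """
    x, y = standardize(x), standardize(y)
    T = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / T
    diff = sum(a ** 3 * b - a * b ** 3 for a, b in zip(x, y)) / T
    return rho * diff

rng = random.Random(2)

def laplace():
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)

x = [laplace() for _ in range(20000)]
y = [0.6 * a + laplace() for a in x]
clean = cumulant_measure(x, y)

# Same data observed through independent gaussian measurement noise
x_noisy = [a + rng.gauss(0.0, 0.5) for a in x]
y_noisy = [b + rng.gauss(0.0, 0.5) for b in y]
noisy = cumulant_measure(x_noisy, y_noisy)
```

In this sketch both `clean` and `noisy` should come out positive: gaussian noise has zero fourth-order cumulants, so it rescales the measure through the standardization but does not change its sign.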

Intuitive interpretation

- Assume $x \to y$, i.e. $y = \rho x + d$, and that the variables are very sparse.
- Regression toward the mean: $|\rho| < 1$ for standardized variables.
- The nonlinear correlation $E\{x^3 y\}$ is larger than $E\{x y^3\}$ because both variables are simultaneously large typically when $x$ takes larger values than $y$, due to regression toward the mean.
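The role of regression toward the mean can be made explicit with a short calculation. This is a sketch assuming the model $y = \rho x + d$ with zero-mean standardized variables and $d$ independent of $x$; the shorthand $\kappa_4(x) = \operatorname{cum}(x,x,x,x)$ for the kurtosis of $x$ is notation introduced here. It uses the multilinearity of cumulants and the fact that any cumulant involving both $x$ and the independent $d$ vanishes.

```latex
\begin{align*}
\operatorname{cum}(x,x,x,y)
  &= \operatorname{cum}(x,x,x,\rho x + d) = \rho\,\kappa_4(x), \\
\operatorname{cum}(y,y,y,x)
  &= \operatorname{cum}(\rho x + d,\ \rho x + d,\ \rho x + d,\ x) = \rho^3\,\kappa_4(x), \\
E\{x^3 y\} - E\{x y^3\}
  &= \operatorname{cum}(x,x,x,y) - \operatorname{cum}(y,y,y,x)
   = \rho\,(1 - \rho^2)\,\kappa_4(x).
\end{align*}
```

The last line uses $E\{x^3 y\} = \operatorname{cum}(x,x,x,y) + 3\rho$ and $E\{x y^3\} = \operatorname{cum}(y,y,y,x) + 3\rho$, which hold for standardized variables. Multiplying by $\rho$ gives $\rho^2 (1 - \rho^2)\, \kappa_4(x)$, which is positive for sparse $x$ (positive kurtosis) whenever $0 < |\rho| < 1$: the factor $1 - \rho^2$ is exactly the shrinkage from regression toward the mean.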

Using pairwise measures with more variables

- Assume the LiNGAM model for $n$ variables

  $$x = Bx + e \qquad (5)$$

- Compute the nonlinear correlation matrix derived above

  $$M = \operatorname{cov}(x) \odot E\{x\, g(x)^T - g(x)\, x^T\} \qquad (6)$$

  where $\odot$ is the elementwise product, for some nonlinearity such as $g(u) = -u^3$, $g(u) = -u^2$, or $g(u) = \tanh(u)$.
- For $x_i$ with no parents, all entries in the $i$-th row of $M$ are non-negative, neglecting random errors.
- This allows us to find the (non-unique) root of the graph.
- Iterating, we find an ordering of the directed acyclic graph.
- Closely related to DirectLiNGAM (Shimizu et al 2009).
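The root-finding iteration can be sketched as follows. This is only an illustration under stated assumptions: it uses $g(u) = \tanh(u)$, picks as root the variable whose smallest off-diagonal row entry of $M$ is largest, and removes the root's effect from the remaining variables by linear regression (a deflation step in the spirit of DirectLiNGAM). The selection rule, the function names and the simulated chain are choices made here, not details given in the talk.

```python
import math
import random

def standardize(v):
    """Shift v to zero mean and scale it to unit variance."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / n)
    return [(a - mean) / std for a in v]

def pairwise_matrix(cols):
    """M[i][j] = cov(x_i, x_j) * mean(x_i tanh(x_j) - tanh(x_i) x_j)."""
    cols = [standardize(c) for c in cols]
    n, T = len(cols), len(cols[0])
    th = [[math.tanh(a) for a in c] for c in cols]
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                cov = sum(a * b for a, b in zip(cols[i], cols[j])) / T
                nl = sum(a * tb - ta * b for a, b, ta, tb
                         in zip(cols[i], cols[j], th[i], th[j])) / T
                M[i][j] = cov * nl
    return M

def causal_order(columns):
    """Repeatedly pick a root from M, then regress it out of the rest."""
    cols = {k: standardize(c) for k, c in enumerate(columns)}
    order = []
    while len(cols) > 1:
        keys = list(cols)
        M = pairwise_matrix([cols[k] for k in keys])
        # Root candidate: row whose smallest off-diagonal entry is largest
        pos = max(range(len(keys)),
                  key=lambda i: min(M[i][j] for j in range(len(keys)) if j != i))
        root = keys[pos]
        order.append(root)
        r = cols.pop(root)
        for k in list(cols):
            beta = sum(a * b for a, b in zip(cols[k], r)) / len(r)
            cols[k] = standardize([a - beta * b for a, b in zip(cols[k], r)])
    order.extend(cols)
    return order

# Simulated chain x0 -> x1 -> x2 with Laplace disturbances (illustrative)
rng = random.Random(3)

def laplace():
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)

x0 = [laplace() for _ in range(20000)]
x1 = [0.8 * a + laplace() for a in x0]
x2 = [0.7 * b + laplace() for b in x1]
order = causal_order([x2, x0, x1])  # columns deliberately shuffled
```

With the columns passed in the shuffled order `[x2, x0, x1]`, the recovered ordering should list column 1 (the root `x0`) first, then column 2, then column 0.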

One simulation

[Figure: four bar-chart panels comparing the algorithms LR, LRap, LRnd, kdir, dir and ICA: a) First variable found, b) Two first variables found, c) Mean rank correlations, d) Computation time (logarithmic axis)]

- Simulation with added measurement noise. Five variables, 10,000 data points.
- Algorithms:
  - LR: true likelihood ratios
  - LRap: LR approximations based on tanh
  - LRnd: simpler variant with no deflation
  - kdir: KernelDirectLiNGAM (Sogawa et al 2010)
  - dir: original DirectLiNGAM (Shimizu et al 2009)
  - ICA: LiNGAM estimated by ICA (Shimizu et al 2006)

Conclusion

- Estimating causal direction is possible using something more than correlations.
- Structural equation models can be estimated by non-gaussianity (Shimizu et al, JMLR, 2006).
- Here I propose likelihood ratio tests for two variables.
- The log-likelihood ratio is approximated by nonlinear correlations.
- This leads to higher-order cumulants (which have no other known intuitive interpretation).
- Pairwise tests can be used to estimate a model with many variables, using methods like DirectLiNGAM.
- Particularly efficient when there is
  - measurement noise, or
  - few data points.