Causal Inference: Discussion
1 Causal Inference: Discussion
Mladen Kolar, The University of Chicago Booth School of Business
Sept 23, 2016
2 Types of machine learning problems
Based on the information available:
- Supervised learning
- Reinforcement learning
- Unsupervised learning
3 Bayesian networks
P(F, A, S, H, N) = P(F) P(A) P(S | F, A) P(H | S) P(N | S)
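To make the factorization concrete, here is a tiny sketch (my own illustration, not code from the talk): the variable names follow the slide's F, A, S, H, N, but every probability value below is invented. Multiplying the conditional probability tables along the DAG yields a valid joint distribution.

```python
# Toy sketch of the factorization P(F,A,S,H,N) = P(F) P(A) P(S|F,A) P(H|S) P(N|S).
# All numeric probability values are made up for illustration only.
from itertools import product

p_F = {True: 0.1, False: 0.9}
p_A = {True: 0.3, False: 0.7}
# P(S | F, A), indexed by the parent assignment (f, a)
p_S = {(f, a): {True: s, False: 1 - s}
       for (f, a), s in {(True, True): 0.9, (True, False): 0.7,
                         (False, True): 0.6, (False, False): 0.05}.items()}
p_H = {s: {True: h, False: 1 - h} for s, h in {True: 0.8, False: 0.1}.items()}
p_N = {s: {True: v, False: 1 - v} for s, v in {True: 0.6, False: 0.2}.items()}

def joint(f, a, s, h, n):
    """Joint probability computed via the DAG factorization."""
    return p_F[f] * p_A[a] * p_S[(f, a)][s] * p_H[s][h] * p_N[s][n]

# Because every local table is normalized, the factorized joint sums to 1.
total = sum(joint(*assign) for assign in product([True, False], repeat=5))
```

The point of the check is that normalization of each local table is enough to normalize the whole joint, which is what makes the DAG factorization a distribution.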
4 Bayesian networks
Probabilistic interpretation: a Bayesian network represents a distribution P when each variable is independent of its non-descendants conditional on its parents in the DAG.
Causal interpretation: there is a directed edge from A to B (relative to V) when A is a direct cause of B.
5 Markov Networks
Random vector X = (X_1, ..., X_p)
Graph G = (V, E) with p nodes represents conditional independence relationships between nodes; useful for exploring associations between measured variables.
(a, b) ∉ E  ⟺  X_a ⊥ X_b | X_{¬ab},   where ¬ab := V \ {a, b}
i.e. P[X_a | X_b, X_{¬ab}] = P[X_a | X_{¬ab}]
(Koller and Friedman, 2009)
6-7 Two Common Markov Networks
Gaussian Markov network: X ~ N(µ, Σ)
p(x) ∝ exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) )
The precision matrix Ω = Σ^{-1} encodes both the parameters and the graph structure.
Discrete Markov network (Ising model): X ∈ {-1, 1}^p
p(x; Θ) ∝ exp( Σ_{a ∈ V} θ_aa x_a + Σ_{(a,b) ∈ V×V} θ_ab x_a x_b )
Θ = (θ_ab)_{ab} encodes the conditional independence relationships.
(Lauritzen, 1996; Koller and Friedman, 2009)
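A quick way to see how Ω = Σ^{-1} encodes the graph (my own illustrative sketch, not from the talk): for the AR(1)-style covariance Σ_ij = ρ^|i-j|, the precision matrix is exactly tridiagonal, so the corresponding Gaussian Markov network is a chain.

```python
import numpy as np

# For a Gaussian with covariance Sigma_ij = rho^|i-j| (an AR(1) structure),
# the precision matrix Omega = Sigma^{-1} is tridiagonal: each variable is
# conditionally independent of all non-neighbors given its two neighbors,
# so the Markov network is a chain graph.
rho, p = 0.5, 6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Omega = np.linalg.inv(Sigma)

# Entries more than one step off the diagonal vanish (no edge in the graph).
far = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) > 1
max_far_entry = np.max(np.abs(Omega[far]))
```

Reading the graph off the sparsity pattern of Ω, rather than off Σ (which is dense here), is exactly the point of the slide.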
8-10 Structure Learning Problem
Given an i.i.d. sample D_n = {x_i}_{i=1}^n from a distribution P ∈ 𝒫, learn the set of conditional independence relationships: Ĝ = Ĝ(D_n)
Gaussian Markov networks (Drton and Perlman, 2007):
- Form the maximum likelihood estimator for the covariance matrix
- Test for zeros in the precision matrix
Discrete Markov networks (Chickering, 1996):
- Hard to learn structure, since the log partition function cannot be evaluated efficiently
11 Structure Learning in High Dimensions
(Some) existing work:
- Gaussian graphical models: GLasso (Yuan and Lin, 2007), CLIME (Cai et al., 2011), neighborhood selection (Meinshausen and Bühlmann, 2006)
- Ising model: neighborhood selection (Ravikumar et al., 2009), composite likelihood (Xue et al., 2012)
- Exponential family graphical models: exponential (Yang et al., 2012, 2015), Poisson (Yang et al., 2013), mixed (Yang et al., 2014), ...
Recent overview: Drton and Maathuis (2016)
12-20 Neighborhood Selection
Local structure estimation:
θ̂_a = argmax_{θ_a ∈ R^p} l(θ_a; D_n) - λ ‖θ_a‖_1
Estimated neighborhood:
N̂_a = {b ∈ V : θ̂_ab ≠ 0}
[Figure, built up over several slides: a running example in which the penalty zeroes out most candidate edge parameters for node 1, leaving θ̂_12, θ̂_13, θ̂_15 nonzero, so that N̂_1 = {2, 3, 5}.]
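As a concrete sketch of the slide's idea (my own illustration, in the spirit of Meinshausen and Bühlmann, 2006, not code from the talk): for a Gaussian graphical model, estimating the neighborhood of node a reduces to an ℓ1-penalized regression of X_a on the remaining variables. The helper `lasso_cd`, the chain-graph data, and all constants are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a chain Gaussian (Sigma_ij = 0.5^|i-j|): the only Markov-blanket
# neighbors of a middle node a are a-1 and a+1.
p, n, rho = 5, 2000, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def lasso_cd(A, y, lam, n_sweeps=200):
    """Plain coordinate-descent lasso for (1/2n)||y - A theta||^2 + lam*||theta||_1."""
    m, d = A.shape
    theta = np.zeros(d)
    col_sq = (A ** 2).sum(axis=0) / m
    for _ in range(n_sweeps):
        for j in range(d):
            r = y - A @ theta + A[:, j] * theta[j]       # partial residual
            z = A[:, j] @ r / m
            theta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return theta

a = 2                                   # estimate the neighborhood of node 2
others = [j for j in range(p) if j != a]
theta = lasso_cd(X[:, others], X[:, a], lam=0.1)
neighborhood = {others[j] for j in range(p - 1) if theta[j] != 0}
```

With enough data and a moderate λ, the soft threshold kills the spurious coordinates while the true neighbors survive, recovering N̂_2 = {1, 3} for the chain.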
21 Implications for Science
Some questions remain unanswered:
- How can we quantify the uncertainty of an estimated graph structure?
- How certain are we that there is an edge between nodes a and b?
- How do we construct honest, robust tests about edge parameters?
22 Quantifying uncertainty
For Gaussian graphical models, inference on values of the precision matrix Ω using an asymptotically normal estimator (Ren et al., 2015)
- covariate adjusted (Chen et al., 2015)
- time-varying extension (Wang and Kolar, 2014)
Exponential family graphical models (Wang and Kolar, 2016; Yu, Gupta, and Kolar, 2016)
Quantile graphical models (Belloni et al., 2016)
23 Transelliptical Graphical Models
24 Background: Nonparanormal Model / Gaussian Copula
Nonparanormal distribution: X ~ NPN_p(Σ; f_1, ..., f_p) if
(f_1(X_1), ..., f_p(X_p))^T ~ N(0, Σ)
(Liu et al., 2009)
25 Background: Transelliptical Distribution
Transelliptical distribution: X ~ TE_p(Σ, ξ; f_1, ..., f_p) if
(f_1(X_1), ..., f_p(X_p))^T ~ EC_p(0, Σ, ξ)
where Σ = [σ_ab]_{a,b} ∈ R^{p×p} is a correlation matrix and P[ξ = 0] = 0.
Elliptical distribution: Z ~ EC_p(µ, Σ, ξ) if
Z = µ + ξ Σ^{1/2} U
where ξ is a random radius and U is a random unit vector.
(Liu et al., 2012b)
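The stochastic representation Z = µ + ξ Σ^{1/2} U can be simulated directly. The sketch below is a toy of my own: taking µ = 0, Σ = I, and a degenerate radius ξ ≡ 1 makes the check exact, because every sample must land on the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of Z = mu + xi * Sigma^{1/2} U with mu = 0, Sigma = I, xi = 1:
# U is uniform on the unit sphere, so every Z has norm exactly 1.
# (Choosing xi ~ chi_p instead would recover the standard Gaussian N(0, I).)
p, n = 3, 500
G = rng.standard_normal((n, p))
U = G / np.linalg.norm(G, axis=1, keepdims=True)   # uniform on the sphere
xi = np.ones(n)                                    # degenerate radius, for illustration
Z = xi[:, None] * U                                # mu = 0, Sigma^{1/2} = I

norms = np.linalg.norm(Z, axis=1)
```

Varying the law of ξ is what moves the family between light and heavy tails while keeping the same "shape" matrix Σ.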
26 Tail dependence
Elliptical and transelliptical distributions allow for heavy tail dependence between variables.
Example: (X_1, X_2) multivariate t-distribution with d degrees of freedom.
Tail correlations: Corr( 1I{X_1 ≥ q_α^{X_1}}, 1I{X_2 ≥ q_α^{X_2}} )
[Figure: tail correlation as a function of the quantile level α, for several degrees of freedom (d = 0.1, 1, 10, ..., and the Gaussian limit); smaller d, i.e. heavier tails, gives larger tail correlation.]
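An empirical version of the tail correlation above can be computed directly (my own sketch; the sample size, correlation, quantile level, and degrees of freedom are arbitrary choices for illustration). A bivariate t shows markedly stronger dependence in the tail than a Gaussian with the same correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical Corr(1{X_1 > q_alpha}, 1{X_2 > q_alpha}) for a bivariate t_3 versus
# a Gaussian, both with correlation 0.5 (parameters chosen for illustration).
n, rho, alpha, d = 200_000, 0.5, 0.95, 3.0
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
Z = rng.standard_normal((n, 2)) @ L.T                      # correlated Gaussian pair
T = Z / np.sqrt(rng.chisquare(d, size=n) / d)[:, None]     # bivariate t_3 pair

def tail_corr(X, alpha):
    """Correlation of the indicators that each coordinate exceeds its alpha-quantile."""
    q = np.quantile(X, alpha, axis=0)
    return np.corrcoef(X[:, 0] > q[0], X[:, 1] > q[1])[0, 1]

gauss_tc = tail_corr(Z, alpha)
t_tc = tail_corr(T, alpha)
```

The gap between `t_tc` and `gauss_tc` is the phenomenon the figure on this slide illustrates: extreme values co-occur far more often under heavy tails.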
27 Robust Graphical Modeling
Data: X_1, ..., X_n ~ TE_p(Σ, ξ; f_1, ..., f_p)
Underlying graph: edge (a, b) ∈ E if ω_ab ≠ 0, where Ω = Σ^{-1} = [ω_kl].
Construct Σ̂ = [σ̂_ab], where σ̂_ab = sin( (π/2) τ̂_ab ) and
τ̂_ab = (n choose 2)^{-1} Σ_{i < i'} sign( (x_ia - x_i'a)(x_ib - x_i'b) )
is Kendall's tau.
Plug into, for example, the GLasso objective:
Ω̂ = argmax_{Ω ≻ 0} log det Ω - tr(Σ̂ Ω) - λ ‖Ω‖_1
(Liu et al., 2012a)
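The key robustness property of σ̂_ab = sin((π/2) τ̂_ab) is that Kendall's tau depends only on ranks, so the estimate is unchanged by the unknown monotone marginal transforms f_a. A small sketch of my own (an O(n²) tau, fine at this sample size):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian pair with true correlation 0.6.
n, rho = 400, 0.6
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
Z = rng.standard_normal((n, 2)) @ L.T

def kendall_tau(x, y):
    """O(n^2) Kendall's tau: average sign of concordance over all ordered pairs."""
    dx = np.sign(np.subtract.outer(x, x))
    dy = np.sign(np.subtract.outer(y, y))
    m = len(x)
    return (dx * dy).sum() / (m * (m - 1))

sigma_hat = np.sin(np.pi / 2 * kendall_tau(Z[:, 0], Z[:, 1]))
# Apply strictly increasing marginal transforms: tau, hence sigma_hat, is unchanged.
sigma_hat_tf = np.sin(np.pi / 2 * kendall_tau(np.exp(Z[:, 0]), Z[:, 1] ** 3))
```

`sigma_hat` estimates the latent correlation 0.6 even though `exp` and the cube would badly distort a plain Pearson correlation; the two estimates agree exactly because the ranks are identical.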
28 ROCKET: Robust Confidence Intervals via Kendall's Tau
Idea: let J = {a, b} and I = V \ J, and define
Θ^{ab} = [ θ_aa  θ_ab ; θ_ba  θ_bb ] = (Ω_JJ)^{-1} = Cov(ε_a, ε_b).
Then
θ_ab = E[ε_a ε_b] = E[ (Y_a - Y_I^T γ_a)(Y_b - Y_I^T γ_b) ]
     = E[Y_a Y_b] + γ_a^T E[Y_I Y_I^T] γ_b - E[Y_a Y_I^T] γ_b - E[Y_b Y_I^T] γ_a
where γ_a = Σ_II^{-1} Σ_Ia and γ_b = Σ_II^{-1} Σ_Ib.
Our procedure constructs γ̂_a and γ̂_b and sets
θ̂_ab = Σ̂_ab + γ̂_a^T Σ̂_II γ̂_b - Σ̂_aI γ̂_b - Σ̂_bI γ̂_a
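The identity behind the slide can be verified numerically (a sanity-check sketch of my own, using the exact Σ rather than an estimate): with J = {a, b} and γ_a = Σ_II^{-1} Σ_Ia, the θ_ab formula reproduces the (a, b) entry of (Ω_JJ)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Check that theta_ab computed from gamma_a, gamma_b equals the corresponding
# entry of (Omega_JJ)^{-1}, i.e. the residual covariance Cov(eps_a, eps_b).
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # a generic positive-definite "covariance"
Omega = np.linalg.inv(Sigma)

a, b = 0, 1
I = [j for j in range(p) if j not in (a, b)]
gamma_a = np.linalg.solve(Sigma[np.ix_(I, I)], Sigma[I, a])
gamma_b = np.linalg.solve(Sigma[np.ix_(I, I)], Sigma[I, b])

theta_ab = (Sigma[a, b] + gamma_a @ Sigma[np.ix_(I, I)] @ gamma_b
            - Sigma[a, I] @ gamma_b - Sigma[b, I] @ gamma_a)

target = np.linalg.inv(Omega[np.ix_([a, b], [a, b])])[0, 1]
```

The agreement is just the Schur-complement identity Σ_JJ - Σ_JI Σ_II^{-1} Σ_IJ = (Ω_JJ)^{-1}; ROCKET's contribution is making the plug-in version of this formula work with the rank-based Σ̂.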
29 Main results
Estimation consistency: if n ≳ k_n² log(p), ‖γ_a‖_1 ≲ √k_n ‖γ_a‖_2, λ_max(Σ)/λ_min(Σ) ≤ C_cov,
‖γ̂_a - γ_a‖_2 ≲ √(k_n log(p_n)/n), and ‖γ̂_a - γ_a‖_1 ≲ k_n √(log(p_n)/n),
then
‖Θ̂ - Θ̃‖ ≲ k_n log(p_n)/n
where Θ̃ is an oracle estimator that knows γ_c exactly.
Asymptotic normality:
sup_{t ∈ R} | P{ √n (ω̂_ab - ω_ab)/Ŝ_ab ≤ t } - Φ(t) | ≤ C_n → 0
30 How To Estimate γ_a?
Lasso:
γ̂_a = argmin_{γ : ‖γ‖_1 ≤ R} { (1/2) γ^T Σ̂_II γ - γ^T Σ̂_Ia + λ ‖γ‖_1 }
This is a non-convex problem, since Σ̂_II need not be positive semidefinite; however, the theory of Loh and Wainwright (2015) applies. The side constraint requires choosing R so that ‖γ_a‖_1 ≤ R.
Dantzig selector:
γ̂_a = argmin { ‖γ‖_1  s.t.  ‖Σ̂_II γ - Σ̂_Ia‖_∞ ≤ λ }
31 Minimax optimality
G_0(M, k_n) = { Ω = (Ω_ab)_{a,b ∈ [p]} : max_{a ∈ [p]} Σ_{b ≠ a} 1I{Ω_ab ≠ 0} ≤ k_n, and M^{-1} ≤ λ_min(Ω) ≤ λ_max(Ω) ≤ M }
where M is a constant greater than one.
Theorem 1 in Ren et al. (2015) states that
inf_{ω̂_ab} sup_{Ω ∈ G_0(M, k_n)} P{ |ω̂_ab - ω_ab| ≥ ε_0 ( n^{-1} k_n log(p_n) + n^{-1/2} ) } ≥ ε_0.
32 Simulations
Data generated from a grid graph, sample size n = 400.
Ω_aa = 1, Ω_ab = 0.24 for edges and 0 for non-edges; X ~ EC(0, Ω^{-1}, t_5), an elliptical distribution with t_5 tails.
33 Simulations
Check if the estimator is asymptotically normal (over 1000 trials).
[Figure: QQ plots of the statistics Ť_(2,2),(2,3), Ť_(2,2),(3,3), and Ť_(2,2),(10,10) against standard normal quantiles, for the ROCKET, Pearson, and nonparanormal estimators.]
34 Simulations
[Table: empirical coverage and confidence-interval width for the ROCKET, Pearson, and nonparanormal estimators, evaluated at a true edge, a near non-edge, and a far non-edge; numeric entries not recoverable.]
35 Simulations
Results for Gaussian data with the same Ω (grid graph):
[Table: coverage and width for the ROCKET, Pearson, and nonparanormal estimators at a true edge, a near non-edge, and a far non-edge; numeric entries not recoverable.]
- All methods have 95% coverage
- ROCKET confidence intervals are only slightly wider
36 The ROCKET method
- Theoretical guarantees for asymptotic normality over the transelliptical family
- Confidence intervals have the right coverage
- Practical recommendation: use the transelliptical family in practice
Code:
Preprint: arxiv: , with Rina Foygel Barber
Extension to dynamic model: arxiv: , with Junwei Lu and Han Liu
37 No Kool-Aid
Assumptions: the literature often requires a lot of assumptions.
- pros: makes the math work out
- cons: hard to verify in practice
Negative result from Wasserman, Kolar, and Rinaldo (2014):
inf_{C_n} sup_{P ∈ P_{n,p}} E[W_n²(C_n)] ≥ C(α)
See also: Cai and Guo (2015)
38 Technical Difficulties
Many machine learning methods come with tuning knobs.
- When the goal is prediction, we can use cross-validation to select them.
- When the goal is inference, it is not clear that cross-validation yields valid confidence intervals.
So how do we choose these tuning parameters? From my personal experience:
- tuning parameters strongly affect finite-sample properties
- many procedures are asymptotically normal; however, higher-order biases may significantly affect finite-sample properties
39 Extensions / Ideas
- Other tree ensembles: gradient boosted trees (Friedman, 2001)
- Uniform convergence results
- Finite-sample results
- Adaptive estimation of nuisance parameters
- Running competitions for estimating causal effects (Jennifer Hill)
40 Thank you!
References
A. Belloni, M. Chen, and V. Chernozhukov. Quantile graphical models: Prediction and conditional independence with applications to financial risk management. ArXiv e-prints, 2016.
T. T. Cai and Z. Guo. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. ArXiv e-prints, 2015.
T. T. Cai, W. Liu, and X. Luo. A constrained l1 minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc., 106(494), 2011.
M. Chen, Z. Ren, H. Zhao, and H. H. Zhou. Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. J. Am. Stat. Assoc., 2015.
D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag, 1996.
M. Drton and M. H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 3, 2016.
M. Drton and M. D. Perlman. Multiple testing and error control in Gaussian graphical model selection. Statistical Science, 22(3), 2007.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Statist., 29(5), 2001.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
S. L. Lauritzen. Graphical Models (Oxford Statistical Science Series). Oxford University Press, 1996.
H. Liu, J. D. Lafferty, and L. A. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res., 10, 2009.
H. Liu, F. Han, M. Yuan, J. D. Lafferty, and L. A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat., 40(4), 2012a.
H. Liu, F. Han, and C.-H. Zhang. Transelliptical graphical models. In Proc. of NIPS, 2012b.
P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. J. Mach. Learn. Res., 16, 2015.
N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3), 2006.
P. Ravikumar, M. J. Wainwright, and J. D. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Ann. Stat., 2009.
Z. Ren, T. Sun, C.-H. Zhang, and H. H. Zhou. Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Stat., 43(3), 2015.
J. Wang and M. Kolar. Inference for sparse conditional precision matrices. ArXiv e-prints, 2014.
J. Wang and M. Kolar. Inference for high-dimensional exponential family graphical models. In Proc. of AISTATS, volume 51, 2016.
L. A. Wasserman, M. Kolar, and A. Rinaldo. Berry-Esseen bounds for estimating undirected graphs. Electron. J. Stat., 8, 2014.
L. Xue, H. Zou, and T. Cai. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. Ann. Stat., 40(3), 2012.
E. Yang, G. I. Allen, Z. Liu, and P. Ravikumar. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012.
E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. On Poisson graphical models. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.
E. Yang, Y. Baker, P. Ravikumar, G. I. Allen, and Z. Liu. Mixed graphical models via exponential families. In Proc. 17th Int. Conf. Artif. Intel. Stat., 2014.
E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. On graphical models via univariate exponential family distributions. J. Mach. Learn. Res., 16, 2015.
M. Yu, V. Gupta, and M. Kolar. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 2007.
More informationDoes Better Inference mean Better Learning?
Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract
More informationStructure Learning of Mixed Graphical Models
Jason D. Lee Institute of Computational and Mathematical Engineering Stanford University Trevor J. Hastie Department of Statistics Stanford University Abstract We consider the problem of learning the structure
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationNon-Asymptotic Analysis for Relational Learning with One Network
Peng He Department of Automation Tsinghua University Changshui Zhang Department of Automation Tsinghua University Abstract This theoretical paper is concerned with a rigorous non-asymptotic analysis of
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More informationRandom Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.
Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average
More informationDivide-and-combine Strategies in Statistical Modeling for Massive Data
Divide-and-combine Strategies in Statistical Modeling for Massive Data Liqun Yu Washington University in St. Louis March 30, 2017 Liqun Yu (WUSTL) D&C Statistical Modeling for Massive Data March 30, 2017
More informationHigh-dimensional regression with unknown variance
High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f
More informationHigh-Dimensional Learning of Linear Causal Networks via Inverse Covariance Estimation
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 10-014 High-Dimensional Learning of Linear Causal Networks via Inverse Covariance Estimation Po-Ling Loh University
More informationEfficient Information Planning in Graphical Models
Efficient Information Planning in Graphical Models computational complexity considerations John Fisher & Giorgos Papachristoudis, MIT VITALITE Annual Review 2013 September 9, 2013 J. Fisher (VITALITE Annual
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationTowards an extension of the PC algorithm to local context-specific independencies detection
Towards an extension of the PC algorithm to local context-specific independencies detection Feb-09-2016 Outline Background: Bayesian Networks The PC algorithm Context-specific independence: from DAGs to
More informationStability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models Han Liu Kathryn Roeder Larry Wasserman Carnegie Mellon University Pittsburgh, PA 15213 Abstract A challenging
More informationAn iterative hard thresholding estimator for low rank matrix recovery
An iterative hard thresholding estimator for low rank matrix recovery Alexandra Carpentier - based on a joint work with Arlene K.Y. Kim Statistical Laboratory, Department of Pure Mathematics and Mathematical
More informationGraphical models. Christophe Giraud. Lecture Notes on High-Dimensional Statistics : Université Paris-Sud and Ecole Polytechnique Maths Department
Graphical Modeling Université Paris-Sud and Ecole Polytechnique Maths Department Lecture Notes on High-Dimensional Statistics : http://www.cmap.polytechnique.fr/ giraud/msv/lecturenotes.pdf 1/73 Please
More informationarxiv: v1 [stat.ap] 19 Oct 2015
Submitted to the Annals of Applied Statistics STRUCTURE ESTIMATION FOR MIXED GRAPHICAL MODELS IN HIGH-DIMENSIONAL DATA arxiv:1510.05677v1 [stat.ap] 19 Oct 2015 By Jonas M. B. Haslbeck Utrecht University
More informationSparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results
Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results David Prince Biostat 572 dprince3@uw.edu April 19, 2012 David Prince (UW) SPICE April 19, 2012 1 / 11 Electronic
More informationarxiv: v1 [math.st] 13 Feb 2012
Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch
More informationHigh dimensional ising model selection using l 1 -regularized logistic regression
High dimensional ising model selection using l 1 -regularized logistic regression 1 Department of Statistics Pennsylvania State University 597 Presentation 2016 1/29 Outline Introduction 1 Introduction
More informationHigh dimensional Ising model selection
High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Kyu-Baek Hwang and Byoung-Tak Zhang Biointelligence Lab School of Computer Science and Engineering Seoul National University Seoul 151-742 Korea E-mail: kbhwang@bi.snu.ac.kr
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationEstimation of Graphical Models with Shape Restriction
Estimation of Graphical Models with Shape Restriction BY KHAI X. CHIONG USC Dornsife INE, Department of Economics, University of Southern California, Los Angeles, California 989, U.S.A. kchiong@usc.edu
More informationLearning With Bayesian Networks. Markus Kalisch ETH Zürich
Learning With Bayesian Networks Markus Kalisch ETH Zürich Inference in BNs - Review P(Burglary JohnCalls=TRUE, MaryCalls=TRUE) Exact Inference: P(b j,m) = c Sum e Sum a P(b)P(e)P(a b,e)p(j a)p(m a) Deal
More informationJunction Tree, BP and Variational Methods
Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,
More informationSparse Covariance Matrix Estimation with Eigenvalue Constraints
Sparse Covariance Matrix Estimation with Eigenvalue Constraints Han Liu and Lie Wang 2 and Tuo Zhao 3 Department of Operations Research and Financial Engineering, Princeton University 2 Department of Mathematics,
More informationMultivariate Bernoulli Distribution 1
DEPARTMENT OF STATISTICS University of Wisconsin 1300 University Ave. Madison, WI 53706 TECHNICAL REPORT NO. 1170 June 6, 2012 Multivariate Bernoulli Distribution 1 Bin Dai 2 Department of Statistics University
More informationDirected and Undirected Graphical Models
Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed
More informationBayesian Networks Inference with Probabilistic Graphical Models
4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning
More informationCopula PC Algorithm for Causal Discovery from Mixed Data
Copula PC Algorithm for Causal Discovery from Mixed Data Ruifei Cui ( ), Perry Groot, and Tom Heskes Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands {R.Cui,
More informationHigh-dimensional statistics: Some progress and challenges ahead
High-dimensional statistics: Some progress and challenges ahead Martin Wainwright UC Berkeley Departments of Statistics, and EECS University College, London Master Class: Lecture Joint work with: Alekh
More informationShrinkage Tuning Parameter Selection in Precision Matrices Estimation
arxiv:0909.1123v1 [stat.me] 7 Sep 2009 Shrinkage Tuning Parameter Selection in Precision Matrices Estimation Heng Lian Division of Mathematical Sciences School of Physical and Mathematical Sciences Nanyang
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large
More informationOn Semiparametric Exponential Family Graphical Models
On Semiparametric Exponential Family Graphical Models arxiv:4.8697v [stat.ml] 5 Oct 05 Zhuoran Yang Yang Ning Han Liu Abstract We propose a new class of semiparametric exponential family graphical models
More informationGraphical Models and Independence Models
Graphical Models and Independence Models Yunshu Liu ASPITRG Research Group 2014-03-04 References: [1]. Steffen Lauritzen, Graphical Models, Oxford University Press, 1996 [2]. Christopher M. Bishop, Pattern
More informationGenetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig
Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.
More informationGraphical Model Selection
May 6, 2013 Trevor Hastie, Stanford Statistics 1 Graphical Model Selection Trevor Hastie Stanford University joint work with Jerome Friedman, Rob Tibshirani, Rahul Mazumder and Jason Lee May 6, 2013 Trevor
More information