Graphical models. Christophe Giraud. Lecture Notes on High-Dimensional Statistics : Université Paris-Sud and Ecole Polytechnique Maths Department


1 Graphical Modeling. Université Paris-Sud and Ecole Polytechnique Maths Department. Lecture Notes on High-Dimensional Statistics: giraud/msv/lecturenotes.pdf 1/73

2 Please ask questions!

3 Topic of the lecture: graphical modeling is a convenient representation of conditional dependences among random variables. It is a powerful tool for exploring direct effects between variables and for fast computations in complex models. It is popular in many fields, including bioinformatics, computer vision, speech recognition, environmental statistics, economics, the social sciences, etc.

4 Warning. Two important topics: learning graphical models; learning with graphical models.


6 Example 1. Figure: Learning gene-gene regulatory networks from microarrays. Seminal reference: Friedman [6].



9 Example 2. Figure: Learning brain connectivity networks.

10 Example 3. Figure: Learning direct effects in the social sciences.

11 Conditional independence

12 The concept of conditional dependence is better suited than the concept of dependence for capturing direct dependences between variables. Traffic jams and snowmen are correlated; but conditionally on the snow falls, the size of the traffic jams and the number of snowmen are independent. Figure: Difference between dependence and conditional dependence.
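The snow example can be checked numerically. Below is a small simulation with hypothetical coefficients: X and Y are both driven by Z, so they are strongly correlated; regressing out Z (which, for jointly Gaussian variables, amounts to conditioning on it) leaves uncorrelated residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)             # snow falls
x = 2.0 * z + rng.normal(size=n)   # size of the traffic jams (hypothetical model)
y = 1.5 * z + rng.normal(size=n)   # number of snowmen (hypothetical model)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# X and Y are strongly correlated through Z...
print(abs(corr(x, y)) > 0.5)   # True
# ...but the residuals after regressing out Z are uncorrelated,
# reflecting X independent of Y given Z.
rx = x - np.polyfit(z, x, 1)[0] * z
ry = y - np.polyfit(z, y, 1)[0] * z
print(abs(corr(rx, ry)) < 0.05)  # True
```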

13 Conditional independence. Random variables X and Y are independent conditionally on a variable Z (we write X ⊥ Y | Z) if
law((X, Y) | Z) = law(X | Z) ⊗ law(Y | Z).
Characterisation: when the distribution of (X, Y, Z) has a positive density f, then
X ⊥ Y | Z ⇔ f(x, y | z) = f(x | z) f(y | z) ⇔ f(x, y, z) = f(x, z) f(y, z) / f(z) ⇔ f(x | y, z) = f(x | z).

14 Directed acyclic model

15 Terminology.
Directed graph: g = a set of nodes and arrows.
Acyclic: no sequence of arrows forms a loop in the graph.
Parents: the parents of a are the set pa(a) of nodes b such that b → a.
Descendants: the descendants of a are the set de(a) of nodes that can be reached from a by following some sequence of arrows.

16 X_a ⊥ {X_b : b ∉ de(a)} conditionally on {X_c : c ∈ pa(a)}.
Directed acyclic graphical model: the law of the random variable X = (X_1, ..., X_p) is a graphical model according to the directed acyclic graph g if for all a,
X_a ⊥ {X_b : b ∉ de(a)} | {X_c : c ∈ pa(a)}.
We write L(X) ~ g.
Remark: if g ⊂ g' and L(X) ~ g, then L(X) ~ g'.

17 Warning. There is no unique minimal graph in general! Be careful with the interpretation of directed graphical models!
Example: X_{i+1} = α X_i + ε_i with ε_i independent of X_1, ..., X_i. Then the two chain graphs 1 → 2 → ... → p and 1 ← 2 ← ... ← p are both minimal graphs for this model.

18 The issue of estimating the minimal g is ill-posed in this context. Yet:
1. it is very useful for defining / computing laws (next slides);
2. it can be used for exploring causal effects (last part of the talk).

19 Bayesian networks / DAG models. Here, we assume that g is known (from expert knowledge).
Factorization formula: if L(X) ~ g, we have
f(x_1, ..., x_p) = ∏_{b=1}^p f(x_b | x_{pa(b)}).
Proof: for a leaf p,
f(x_1, ..., x_p) = f(x_p | x_1, ..., x_{p-1}) f(x_1, ..., x_{p-1}) = f(x_p | x_{pa(p)}) f(x_1, ..., x_{p-1}),
and we iterate on f(x_1, ..., x_{p-1}).
Very useful for defining / computing f! Sampling with the Gibbs sampler.
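The factorization formula can be illustrated on a minimal example. The sketch below assumes a hypothetical 3-node chain DAG 1 → 2 → 3 with Gaussian conditionals, and checks that the product of the conditional densities equals the joint N(0, Σ) density.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical chain DAG 1 -> 2 -> 3 with
# X1 ~ N(0,1), X2 | X1 ~ N(a*X1, 1), X3 | X2 ~ N(b*X2, 1).
a, b = 0.8, -0.5

def f_factorized(x1, x2, x3):
    # f(x1, x2, x3) = f(x1) * f(x2 | x_pa(2)) * f(x3 | x_pa(3))
    return norm.pdf(x1) * norm.pdf(x2, loc=a * x1) * norm.pdf(x3, loc=b * x2)

# The joint covariance implied by the structural equations above:
Sigma = np.array([
    [1.0,   a,              a * b],
    [a,     a**2 + 1,       b * (a**2 + 1)],
    [a * b, b * (a**2 + 1), b**2 * (a**2 + 1) + 1],
])
joint = multivariate_normal(mean=np.zeros(3), cov=Sigma)

x = np.array([0.3, -1.0, 0.7])
print(np.isclose(f_factorized(*x), joint.pdf(x)))  # True: the two densities agree
```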

20 Learning with graphical models. Examples of applications: speech recognition, computer vision, ecological monitoring, decision making, diagnosis, environmental statistics, etc.

21 Examples.
Seals monitoring (Ver Hoef and Frost [17]).
Ice streams (Berliner et al.): sses/collab ice.html

22 Non-directed model

23 Terminology.
Non-directed graph: a set of nodes and edges; the nodes are labelled by 1, ..., p.
Neighbors: the neighbors of a are the nodes in ne(a) = { b : b ~ a in g }.
Class of a: we set cl(a) = ne(a) ∪ {a}.

24 X_a independent from {X_b : b ≁ a} conditionally on {X_c : c ~ a}.
Non-directed graphical model: the law of the random variable X = (X_1, ..., X_p) is a graphical model according to the non-directed graph g if for all a:
X_a ⊥ {X_b : b ∉ cl(a)} | {X_c : c ∈ ne(a)}.
We write L(X) ~ g.
Remark: if g ⊂ g' and L(X) ~ g, then L(X) ~ g'.

25 Minimal graph. When X has a positive density, there exists a unique minimal graph g* such that L(X) ~ g*. In the following, we simply call the graph of X the minimal graph g* such that L(X) ~ g*. Our main goal will be to learn g* from data.

26 Connection with directed acyclic graphs.
Moral graph: the moral graph g^m associated to a directed acyclic graph g is obtained by setting an edge between the parents of each node, and replacing arrows by edges.
Proposition: L(X) ~ g ⇒ L(X) ~ g^m.
Proof:
1. From the factorization formula, there exist g_1, g_2 such that
f(x) = g_1(x_a, x_{ne^m(a)}) g_2(x_{nn^m(a)}, x_{ne^m(a)}), where nn^m(a) = {1, ..., p} \ cl^m(a).
2. This ensures X_a ⊥ X_{nn^m(a)} given X_{ne^m(a)}.

27 Factorization in undirected graphical models.
Hammersley-Clifford formula: for a random variable X with positive density f,
L(X) ~ g ⇔ f(x) ∝ exp( Σ_{c ∈ cliques(g)} g_c(x_c) ).
Proof: based on the Möbius inversion formula.

28 Questions so far?

29 Gaussian graphical models. In the remainder, X ~ N(0, Σ) with Σ non-singular.

30 Reminder on Gaussian distribution (1/2).
Proposition 1: Gaussian conditioning. We consider two sets A = {1, ..., k} and B = {1, ..., p} \ A, and a Gaussian random vector X = (X_A, X_B) ∈ R^p with N(0, Σ) distribution. We write
K = [ K_AA  K_AB ; K_BA  K_BB ]
for Σ^{-1}, and K_AA^{-1} for the inverse (K_AA)^{-1} of K_AA. Then we have
Law(X_A | X_B) = N( −K_AA^{-1} K_AB X_B , K_AA^{-1} ),
which means X_A = −K_AA^{-1} K_AB X_B + ε_A, with ε_A ~ N(0, K_AA^{-1}) independent of X_B.
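Proposition 1 can be verified numerically against the classical Schur-complement formulas for Gaussian conditioning. The sketch below uses an arbitrary positive-definite Σ (hypothetical values) with A = {0, 1}, B = {2, 3} in 0-based indexing.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
Sigma = M @ M.T + 4 * np.eye(4)   # arbitrary positive-definite covariance
K = np.linalg.inv(Sigma)          # precision matrix

A, B = [0, 1], [2, 3]
K_AA, K_AB = K[np.ix_(A, A)], K[np.ix_(A, B)]
S_AA, S_AB, S_BB = Sigma[np.ix_(A, A)], Sigma[np.ix_(A, B)], Sigma[np.ix_(B, B)]

# Schur-complement form of Law(X_A | X_B) ...
mean_coef = S_AB @ np.linalg.inv(S_BB)                  # E[X_A | X_B] = mean_coef @ X_B
cond_cov = S_AA - S_AB @ np.linalg.inv(S_BB) @ S_AB.T
# ... agrees with the precision-matrix form of Proposition 1:
print(np.allclose(mean_coef, -np.linalg.inv(K_AA) @ K_AB))  # True
print(np.allclose(cond_cov, np.linalg.inv(K_AA)))           # True
```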

31 Proof: explicit computation. We have, for some function f(x_B) that we do not need to make explicit,
g(x_A | x_B) = g(x_A, x_B) / g(x_B) = exp( −(1/2) x_A^T K_AA x_A − x_A^T K_AB x_B ) f(x_B).
As a consequence,
g(x_A | x_B) ∝ exp( −(1/2) (x_A + K_AA^{-1} K_AB x_B)^T K_AA (x_A + K_AA^{-1} K_AB x_B) ),
where the factor of proportionality does not depend on x_A. We recognize the density of the N( −K_AA^{-1} K_AB x_B , K_AA^{-1} ) law.
Ref: Lauritzen [11]

32 Reminder on Gaussian distribution (2/2).
Partial correlation: for any a, b ∈ {1, ..., p}, we have
cor(X_a, X_b | X_c : c ≠ a, b) = −K_ab / (K_aa K_bb)^{1/2}.
Proof: the previous proposition with A = {a, b} and B = A^c gives
cov(X_A | X_B) = [ K_aa  K_ab ; K_ab  K_bb ]^{-1} = 1/(K_aa K_bb − K_ab²) [ K_bb  −K_ab ; −K_ab  K_aa ].
Plugging this formula into the definition of the partial correlation gives the result.
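A quick numerical sanity check of this identity, with a hypothetical Σ: the partial correlation computed from the conditional covariance (Schur complement) matches the precision-matrix formula.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
Sigma = M @ M.T + 4 * np.eye(4)   # arbitrary positive-definite covariance
K = np.linalg.inv(Sigma)

a, b = 0, 1
rest = [2, 3]
# cov(X_{a,b} | X_rest) via the Schur complement
S = Sigma[np.ix_([a, b], [a, b])] - (
    Sigma[np.ix_([a, b], rest)]
    @ np.linalg.inv(Sigma[np.ix_(rest, rest)])
    @ Sigma[np.ix_(rest, [a, b])]
)
rho_from_cond = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
rho_from_K = -K[a, b] / np.sqrt(K[a, a] * K[b, b])
print(np.allclose(rho_from_cond, rho_from_K))  # True
```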

33 Reading the graph g on K.
From K to g: we set K = Σ^{-1} and define the graph g by
a ~ b ⇔ K_ab ≠ 0.   (1)
GGM and precision matrix: for the graph g defined by (1), we have:
1. L(X) ~ g and g is minimal;
2. there exists ε_a ~ N(0, K_aa^{-1}) independent of {X_b : b ≠ a} such that
X_a = −Σ_{b ∈ ne(a)} (K_ab / K_aa) X_b + ε_a.

34 Proof. We apply Proposition 1:
1. We set A = {a} ∪ nn(a) and B = ne(a), where nn(a) = {1, ..., p} \ cl(a). The precision matrix restricted to A is
K_AA = [ K_aa  0 ; 0  K_{nn(a) nn(a)} ],
so its inverse is
(K_AA)^{-1} = [ K_aa^{-1}  0 ; 0  (K_{nn(a) nn(a)})^{-1} ].
Proposition 1 ensures that the law of X_{{a} ∪ nn(a)} given X_{ne(a)} is Gaussian with covariance matrix (K_AA)^{-1}, so X_a and X_{nn(a)} are independent conditionally on X_{ne(a)}.
2. The second point is obtained with A = {a} and B = A^c.

35 Estimation strategies.
Goal: from an n-sample X_1, ..., X_n i.i.d. with distribution N(0, Σ), we want to estimate the (minimal) graph g such that L(X) ~ g.
The above results suggest 3 estimation strategies:
1. by estimating the partial correlations + multiple testing;
2. by a sparse estimation of K;
3. by a regression approach.

36 Estimation with partial correlation (1/3).
Reminder 1: a ~ b ⇔ ρ_{a,b} := cor(X_a, X_b | X_c : c ≠ a, b) ≠ 0.
Reminder 2: ρ_{a,b} = −K_ab / (K_aa K_bb)^{1/2}.
Partial correlation estimation: for n > p, we estimate ρ_{a,b} by
ρ̂_ab = −[Σ̂^{-1}]_ab / ( [Σ̂^{-1}]_aa [Σ̂^{-1}]_bb )^{1/2},
where Σ̂ is the empirical covariance matrix.

37 Estimation with partial correlation (2/3).
Under the null hypothesis, i.e. when ρ_{a,b} = 0, and for n > p + 2, we have
t_{a,b} := (n − p − 2)^{1/2} ρ̂_ab / (1 − ρ̂²_ab)^{1/2} ~ Student(n − p − 2).
Estimation procedure: 1. compute the t_{a,b}; 2. apply a multiple testing thresholding.
Weakness: when p > n the procedure cannot be applied; when n > p but n − p is small, t_{a,b} has a large variance and the procedure is powerless.
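A minimal sketch of the resulting test, using the statistic above with n − p − 2 degrees of freedom as reconstructed from the slide (conventions for the degrees of freedom vary); the data, sample size, and tested pair are hypothetical.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))         # independent coordinates: true rho_ab = 0
K_hat = np.linalg.inv(np.cov(X.T))  # empirical precision matrix (n > p)

a, b = 0, 1
rho_hat = -K_hat[a, b] / np.sqrt(K_hat[a, a] * K_hat[b, b])
df = n - p - 2
t_ab = np.sqrt(df) * rho_hat / np.sqrt(1 - rho_hat**2)
p_value = 2 * student_t.sf(abs(t_ab), df)   # two-sided p-value for H0: rho_ab = 0
```

In the multiple-testing step, one such p-value is computed for every pair (a, b) and thresholded jointly (e.g. with a Bonferroni or FDR correction).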

38 Estimation with partial correlation (3/3).
Solution 1: shrinking the conditioning. Work with ĉor(X_a, X_b | X_c : c ∈ S) with S small.
Ref: Wille and Bühlmann [19], Castelo and Roverato [2], Spirtes et al. [16] or Kalisch and Bühlmann [8].
+ ĉor(X_a, X_b | X_c : c ∈ S) is stable when S is small;
− it is unclear what we estimate at the end (in general).
Solution 2: sparse estimation of K. The instability for large p comes from the instability of Σ̂^{-1} as an estimator of K. ⇒ Build a more stable estimator of K by capitalizing on its sparsity.

39 Sparse estimation of K (1/2).
The likelihood of a p × p positive symmetric matrix K ∈ S_p^+ is
Likelihood(K) = ∏_{i=1}^n det(K)^{1/2} (2π)^{−p/2} exp( −(1/2) X_i^T K X_i ).
Negative log-likelihood: the map
K ↦ −(n/2) log(det(K)) + (n/2) ⟨K, Σ̂⟩_F
is convex.
Graphical Lasso: sparse estimation of K:
K̂_λ = argmin_{K ∈ S_p^+} { −(n/2) log(det(K)) + (n/2) ⟨K, Σ̂⟩_F + λ Σ_{a≠b} |K_ab| }.
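A sketch of the graphical lasso in practice, using scikit-learn's GraphicalLassoCV as one readily available implementation (not the specific algorithms of [5, 1]); the chain-shaped true K and the sample size are hypothetical.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(3)
p = 5
# Hypothetical sparse truth: a tridiagonal (chain) precision matrix.
K_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K_true), size=2000)

model = GraphicalLassoCV().fit(X)   # cross-validated choice of the penalty lambda
K_hat = model.precision_            # sparse estimate of K
edges = np.abs(K_hat) > 1e-3        # estimated edge set: a ~ b iff K_hat[a, b] != 0
```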

40 Sparse estimation of K (2/2).
+ Efficient optimization algorithms. Ref: Friedman et al. [5], Banerjee et al. [1].
− Poor empirical results reported by Villers et al. [18].
− Theoretical guarantees only under some compatibility conditions that are hard to check/interpret (Ravikumar et al. [14]).
⇒ Keep the (good) idea of exploiting the sparsity, but move to the more classical regression framework.

41 Regression approach (1/4).
Definitions: Θ = the set of p × p matrices with zero diagonal; θ = the matrix in Θ defined by θ_ab = −K_ab / K_bb for a ≠ b.
Characterization:
E[X_a | X_b : b ≠ a] = Σ_{b ≠ a} θ_ba X_b and θ = argmin_{θ' ∈ Θ} ||Σ^{1/2} (I − θ')||_F².
Proof: since X_a = Σ_{b ≠ a} θ_ba X_b + ε_a, we have
θ = argmin_{θ' ∈ Θ} Σ_{a=1}^p E[ (X_a − Σ_{b ≠ a} θ'_ba X_b)² ] = argmin_{θ' ∈ Θ} E[ ||X − θ'^T X||² ] = argmin_{θ' ∈ Θ} ||Σ^{1/2} (I − θ')||_F².
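The relation θ_ab = −K_ab / K_bb can be checked at the population level: the regression coefficients of X_a on the other coordinates coincide with −K_{a,·} / K_{aa}. A numerical sketch with an arbitrary (hypothetical) Σ:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
Sigma = M @ M.T + 4 * np.eye(4)   # arbitrary positive-definite covariance
K = np.linalg.inv(Sigma)

a = 0
rest = [1, 2, 3]
# Population regression coefficients of X_a on the other coordinates:
beta = np.linalg.inv(Sigma[np.ix_(rest, rest)]) @ Sigma[np.ix_(rest, [a])]
# They match -K[a, b] / K[a, a], i.e. the theta coefficients above:
print(np.allclose(beta.ravel(), -K[a, rest] / K[a, a]))  # True
```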

42 Regression approach (2/4).
Replacing Σ by Σ̂, we obtain
⟨(I − θ), Σ̂ (I − θ)⟩_F = (1/n) ||X (I − θ)||_F².
Estimation procedure:
θ̂_λ = argmin_{θ ∈ Θ} { (1/n) ||X − X θ||_F² + λ Ω(θ) },
with Ω(θ) enforcing coordinate sparsity. Examples:
1. ℓ1 penalty: Ω(θ) = Σ_{a ≠ b} |θ_ab|;
2. ℓ1/ℓ2 penalty: Ω(θ) = Σ_{a < b} (θ_ab² + θ_ba²)^{1/2}.

43 Regression approach (3/4).
With the ℓ1 penalty (Meinshausen and Bühlmann [13]):
+ we can split the minimization into p problems in R^{p−1}:
[θ̂^{ℓ1}_ba]_{b ≠ a} = argmin_{β ∈ R^{p−1}} { (1/n) ||X_a − Σ_{b ≠ a} β_b X_b||² + λ ||β||_{ℓ1} };
+ very efficient algorithms by coordinate descent;
− no constraint enforces that θ̂^{ℓ1}_ab ≠ 0 when θ̂^{ℓ1}_ba ≠ 0 ⇒ choose an arbitrary decision rule to build ĝ from θ̂^{ℓ1}. Examples:
1. set an edge a ~ b in ĝ when either θ̂^{ℓ1}_ab ≠ 0 or θ̂^{ℓ1}_ba ≠ 0;
2. set an edge a ~ b in ĝ when both θ̂^{ℓ1}_ab ≠ 0 and θ̂^{ℓ1}_ba ≠ 0.
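A sketch of this nodewise approach with an off-the-shelf lasso solver (scikit-learn): one lasso regression per node, combined with the "or" rule of Example 1. The penalty level alpha and the simulated chain graph are hypothetical; in practice λ is tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso_graph(X, alpha=0.1):
    """Meinshausen-Buhlmann-style graph estimate with the 'or' rule."""
    n, p = X.shape
    theta = np.zeros((p, p))
    for a in range(p):
        rest = [b for b in range(p) if b != a]
        fit = Lasso(alpha=alpha).fit(X[:, rest], X[:, a])
        theta[rest, a] = fit.coef_   # column a: lasso coefficients of X_a's regression
    support = theta != 0
    return support | support.T       # "or" rule: edge a ~ b if either coefficient is nonzero

rng = np.random.default_rng(5)
p = 5
K = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # hypothetical chain graph
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K), size=1000)
adj = nodewise_lasso_graph(X)
```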

44 Regression approach (4/4).
With the ℓ1/ℓ2 penalty:
+ symmetric zeros ⇒ no ambiguity in defining ĝ from θ̂^{ℓ1/ℓ2};
− computational cost: the minimization problem cannot be split into p subproblems, and it is less easy to minimize in large dimensions.
Algorithm: iterate on couples (a, b) until convergence:
1. set (Δ_ab, Δ_ba) with Δ_ab = (1/n) X_a^T (X_b − Σ_{k ≠ a,b} θ_kb X_k);
2. set (θ_ab, θ_ba) = ( 1 − λ / (Δ_ab² + Δ_ba²)^{1/2} )_+ (Δ_ab, Δ_ba).

45 Bayesian approach. A series of papers [20, 4, 15] investigates the Bayesian approach.
Issues: 1. design of sensible priors; 2. efficient posterior sampling.
To the best of my knowledge, it cannot handle large-dimensional problems.

46 Conclusion(?). We have the choice between multiple procedures, and for each procedure there is at least one (non-scale-free) tuning parameter to choose ⇒ we need a selection criterion.

47 Classical selection criteria. We have a collection G of graphs.
Unbiased risk estimation: AIC = −2 log(L(g)) + 2 |g|.
Bayesian criterion: −2 log(P(g | X)) ≈_{n→∞} BIC = −2 log(L(g)) + |g| log(n) − 2 log(P(g)).
Only mathematically grounded in the asymptotic setting: p fixed and n → ∞.

48 Resampling criteria.
Cross-validation schemes. Figure: recursive data splitting for 5-fold cross-validation (in each of the 5 rounds, four blocks are used for training and one for testing).
No guarantee in high-dimensional settings: p ≈ n or p ≫ n.

49 GGMselect. GGMselect: an R package (available online) which:
1. generates a collection Ĝ of candidate graphs according to the above procedures (+ some variants);
2. selects the best graph among Ĝ.

50 GGMselect.
Quality criterion: for a graph g,
MSEP(g) = mean square error of prediction related to g = bias(g) + variance(g),
where bias(g) quantifies how important the missing edges are, and variance(g) is roughly proportional to the number of edges of g divided by n.
Why MSEP? It is a way to quantify the importance of each edge.

51 GGMselect.
Ideal selection: g* = argmin{ MSEP(g) : g ∈ Ĝ } — but g* is unknown!
Selection criterion: select the ĝ which minimizes some penalized empirical MSEP, where the penalty term: roughly penalizes each node of ĝ according to its degree (number of edges); is based on quantiles of Fisher random variables.

52 GGMselect.
Theorem: oracle-like inequality for GGMselect. If
max_{g ∈ Ĝ} deg(g) ≤ ρ n / (2 log p), for some ρ < 1,
then the estimated graph ĝ fulfills
MSEP(ĝ) ≤ c_ρ E[ inf_{g ∈ Ĝ} { bias(g) + log(p) var(g) } ] + R_n,
where R_n = O( Tr(Σ) e^{−c'_ρ n} + CVar(Σ) log(p)/n ) with CVar(Σ) = Σ_a 1/(Σ^{-1})_aa.
Ref: Giraud et al. [7]


54 GGMselect. Optimality?
Optimal selection criterion? Minimal size of the penalty to avoid overfitting; minimax estimation rates when Ĝ contains good graphs.
What about the condition on the degree? deg(g) ≲ n / (2 log p) is unavoidable; otherwise the estimation rate gets worse.

55 Practical issues: Gaussianity.
Gaussianity? Recall the Hammersley-Clifford formula: for a random variable X with positive density f,
L(X) ~ g ⇔ f(x) ∝ exp( Σ_{c ∈ cliques(g)} g_c(x_c) ).

56 Practical issues: Gaussianity.
Data transformation: f_j(X_j) := Φ^{-1}(F_j(X_j)) ~ N(0, 1).
Assumption: (f_1(X_1), ..., f_p(X_p)) ~ N(0, Σ).
Key point: graph(X_1, ..., X_p) = graph(f_1(X_1), ..., f_p(X_p)).
Estimation: work with f̂_j(X_j) = Φ^{-1}(F̂_j(X_j)) for some estimator F̂_j.
Ref: data transformations proposed by Lafferty et al. [10]
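A minimal sketch of the estimation step, taking F̂_j to be a rescaled empirical distribution function, i.e. rank-based normal scores; this is one simple choice of estimator, not necessarily the one used in [10].

```python
import numpy as np
from scipy.stats import norm

def normal_scores(X):
    """Nonparanormal-style transform: column-wise Phi^{-1}(F_hat_j)."""
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1   # ranks in 1..n
        F_hat = ranks / (n + 1)                        # rescaled so values stay in (0, 1)
        Z[:, j] = norm.ppf(F_hat)                      # normal scores
    return Z

rng = np.random.default_rng(6)
X = np.exp(rng.normal(size=(500, 3)))   # heavily non-Gaussian (log-normal) data
Z = normal_scores(X)                     # each column is now approximately N(0, 1)
```

The transform is strictly increasing in each coordinate, so it preserves the ordering of each variable, which is the key point behind graph(X) = graph(f(X)).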

57 Practical issues: hidden variables.
We may only observe part of the relevant variables:
X = (X_O, X_H) ~ N( 0, [ Σ_OO  Σ_OH ; Σ_HO  Σ_HH ] ),
with X_O observed and X_H unobserved. We only have access to
(Σ_OO)^{-1} = K_OO − K_OH (K_HH)^{-1} K_HO.
Ref: Chandrasekaran et al. [3] propose a sparse + low-rank estimation to recover K_OO when H is small.

58 Back to directed models

59 Reminder. X_a ⊥ {X_b : b ∉ de(a)} conditionally on {X_c : c ∈ pa(a)}.

60 Causal inference.
Setting: we have p covariables X_1, ..., X_p and a variable of interest Y; [X, Y] is a Gaussian graphical model according to a DAG g; there are no arrows from Y to X_1, ..., X_p.
Example: Y is the end product of a metabolic network and X_1, ..., X_p are protein abundances.

61 Causal inference. Direct versus causal effect.
Direct effect: given by θ_a in E[Y | X_1, ..., X_p] = Σ_b θ_b X_b.
Causal effect (relative to g): given by β_a in E[Y | X_a, X_b : b ∈ pa(a)] = β_a X_a + Σ_{b ∈ pa(a)} β_b X_b.

62 Causal inference.
Main issue: causal effects are defined relative to g, and there is no unique minimal directed graph...
⇒ find all the DAGs g^(1), ..., g^(m) such that L(X) ~ g^(k), and compute a lower bound on the causal effect: β = min{ β^(1), ..., β^(m) }.

63 PCalg. PCalg: an R package (available online) which estimates the DAGs g^(1), ..., g^(m) from the data, and estimates β^(1), ..., β^(m) and β.
Ref: Kalisch et al. [9], Maathuis et al. [12]

64 PC algorithm. Principle of the PC algorithm:
Init: g = complete graph.
Iterate:
for a = 1, ..., p, for b ∈ ne(a): remove a ~ b if |ĉor(X_a, X_b)| < t_0;
for a = 1, ..., p, for b ∈ ne(a): remove a ~ b if there exists c_1 ∈ ne(a) such that |ĉor(X_a, X_b | X_{c_1})| < t_1;
for a = 1, ..., p, for b ∈ ne(a): remove a ~ b if there exist c_1, c_2 ∈ ne(a) such that |ĉor(X_a, X_b | X_{c_1}, X_{c_2})| < t_2;
...
Output: skeleton of the DAGs.
Last step: compute g^(1), ..., g^(m) from the skeleton.
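A heavily simplified sketch of the skeleton search: fixed hypothetical thresholds t_k instead of calibrated statistical tests, conditioning sets drawn from the current neighbors of a, and no orientation step.

```python
import numpy as np
from itertools import combinations

def partial_corr(S, a, b, cond):
    """Partial correlation of (a, b) given cond, from a covariance matrix S."""
    idx = [a, b] + list(cond)
    K = np.linalg.inv(S[np.ix_(idx, idx)])   # precision of the sub-vector
    return -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

def pc_skeleton(X, thresholds=(0.1, 0.1, 0.1)):
    S = np.cov(X.T)
    p = S.shape[0]
    adj = ~np.eye(p, dtype=bool)             # init: complete graph
    for k, t in enumerate(thresholds):       # conditioning sets of size k = 0, 1, 2, ...
        for a in range(p):
            for b in range(p):
                if not adj[a, b]:
                    continue
                nbrs = [c for c in range(p) if adj[a, c] and c != b]
                for cond in combinations(nbrs, k):
                    if abs(partial_corr(S, a, b, cond)) < t:
                        adj[a, b] = adj[b, a] = False   # remove edge a ~ b
                        break
    return adj

rng = np.random.default_rng(8)
p = 4
K_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))  # hypothetical chain 0-1-2-3
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(K_true), size=5000)
adj = pc_skeleton(X)   # recovers the chain skeleton on this easy example
```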

65 Riboflavin prediction. Figure: important genes for riboflavin production.
Ref: Kalisch et al. [9]

66 Warning. Be aware of over-interpretation: we cannot reliably infer causal networks from i.i.d. observational data, and the approach relies on uncheckable assumptions. But it seems promising for ranking covariables.

67 Statistics in the high-dimensional setting. Despite the theorems, do not trust too much the statistical inferences in a high-dimensional setting n ≪ p (e.g. after gene pre-selection, metagenes, etc.). It is not a validation tool, but rather a tool for providing good hints; it requires experimental validation.
References on high-dimensional statistics:
Lecture Notes on High-Dimensional Statistics: giraud/msv/lecturenotes.pdf
The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman: www-stat.stanford.edu/ tibs/elemstatlearn/

68 Thank you!

69 References I
[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9.
[2] R. Castelo and A. Roverato. A robust procedure for Gaussian graphical model search from microarray data with p larger than n. J. Mach. Learn. Res., 7.
[3] V. Chandrasekaran, P. Parrilo, and A. Willsky. Latent variable graphical model selection via convex optimization. Annals of Statistics.

70 References II
[4] P. Dellaportas, P. Giudici, and G. Roberts. Bayesian inference for nondecomposable graphical Gaussian models. Sankhyā, 65(1):43-55.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3).
[6] Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659).
[7] C. Giraud, S. Huet, and N. Verzelen. Graph selection with GGMselect. Stat. Appl. Genet. Mol. Biol., 11(3):1-50.

71 References III
[8] M. Kalisch and P. Bühlmann. Robustification of the PC-algorithm for directed acyclic graphs. J. Comput. Graph. Statist., 17(4).
[9] Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1-26.
[10] John Lafferty, Han Liu, and Larry Wasserman. Sparse nonparametric graphical models. Statistical Science, 27.
[11] Steffen L. Lauritzen. Graphical Models. Oxford University Press.

72 References IV
[12] Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects from observational data. Ann. Statist., 37(6A).
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3).
[14] Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5.

73 References V
[15] J. G. Scott and C. M. Carvalho. Feature-inclusion stochastic search for Gaussian graphical models. J. Comp. Graph. Statist., 17.
[16] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edition.
[17] J. Ver Hoef and K. Frost. A Bayesian hierarchical model for monitoring harbor seal changes in Prince William Sound, Alaska. Environmental and Ecological Statistics, 10.

74 References VI
[18] Fanny Villers, Brigitte Schaeffer, Caroline Bertin, and Sylvie Huet. Assessing the validity domains of graphical Gaussian models in order to infer relationships among components of complex biological systems. Statistical Applications in Genetics and Molecular Biology, 7.
[19] A. Wille and P. Bühlmann. Low-order conditional independence graphs for inferring genetic networks. Stat. Appl. Genet. Mol. Biol., 5:Art. 1, 34 pp. (electronic).
[20] F. Wong, C. K. Carter, and R. Kohn. Efficient estimation of covariance selection models. Biometrika, 90(4).


High dimensional sparse covariance estimation via directed acyclic graphs Electronic Journal of Statistics Vol. 3 (009) 1133 1160 ISSN: 1935-754 DOI: 10.114/09-EJS534 High dimensional sparse covariance estimation via directed acyclic graphs Philipp Rütimann and Peter Bühlmann

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

Structure estimation for Gaussian graphical models

Structure estimation for Gaussian graphical models Faculty of Science Structure estimation for Gaussian graphical models Steffen Lauritzen, University of Copenhagen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 3 Slide 1/48 Overview of

More information

Causal inference (with statistical uncertainty) based on invariance: exploiting the power of heterogeneous data

Causal inference (with statistical uncertainty) based on invariance: exploiting the power of heterogeneous data Causal inference (with statistical uncertainty) based on invariance: exploiting the power of heterogeneous data Peter Bühlmann joint work with Jonas Peters Nicolai Meinshausen ... and designing new perturbation

More information

Multivariate Bernoulli Distribution 1

Multivariate Bernoulli Distribution 1 DEPARTMENT OF STATISTICS University of Wisconsin 1300 University Ave. Madison, WI 53706 TECHNICAL REPORT NO. 1170 June 6, 2012 Multivariate Bernoulli Distribution 1 Bin Dai 2 Department of Statistics University

More information

Causality in Econometrics (3)

Causality in Econometrics (3) Graphical Causal Models References Causality in Econometrics (3) Alessio Moneta Max Planck Institute of Economics Jena moneta@econ.mpg.de 26 April 2011 GSBC Lecture Friedrich-Schiller-Universität Jena

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Proximity-Based Anomaly Detection using Sparse Structure Learning

Proximity-Based Anomaly Detection using Sparse Structure Learning Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /

More information

An Introduction to Graphical Lasso

An Introduction to Graphical Lasso An Introduction to Graphical Lasso Bo Chang Graphical Models Reading Group May 15, 2015 Bo Chang (UBC) Graphical Lasso May 15, 2015 1 / 16 Undirected Graphical Models An undirected graph, each vertex represents

More information

Learning With Bayesian Networks. Markus Kalisch ETH Zürich

Learning With Bayesian Networks. Markus Kalisch ETH Zürich Learning With Bayesian Networks Markus Kalisch ETH Zürich Inference in BNs - Review P(Burglary JohnCalls=TRUE, MaryCalls=TRUE) Exact Inference: P(b j,m) = c Sum e Sum a P(b)P(e)P(a b,e)p(j a)p(m a) Deal

More information

Latent Variable Graphical Model Selection Via Convex Optimization

Latent Variable Graphical Model Selection Via Convex Optimization Latent Variable Graphical Model Selection Via Convex Optimization The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Directed acyclic graphs and the use of linear mixed models

Directed acyclic graphs and the use of linear mixed models Directed acyclic graphs and the use of linear mixed models Siem H. Heisterkamp 1,2 1 Groningen Bioinformatics Centre, University of Groningen 2 Biostatistics and Research Decision Sciences (BARDS), MSD,

More information

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges

Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges Supplementary material to Structure Learning of Linear Gaussian Structural Equation Models with Weak Edges 1 PRELIMINARIES Two vertices X i and X j are adjacent if there is an edge between them. A path

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

High dimensional Ising model selection

High dimensional Ising model selection High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse

More information

This paper has been submitted for consideration for publication in Biometrics

This paper has been submitted for consideration for publication in Biometrics Biometrics 63, 1?? March 2014 DOI: 10.1111/j.1541-0420.2005.00454.x PenPC: A Two-step Approach to Estimate the Skeletons of High Dimensional Directed Acyclic Graphs Min Jin Ha Department of Biostatistics,

More information

Learning Gaussian Graphical Models with Unknown Group Sparsity

Learning Gaussian Graphical Models with Unknown Group Sparsity Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation

More information

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department

More information

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

arxiv: v3 [stat.me] 10 Mar 2016

arxiv: v3 [stat.me] 10 Mar 2016 Submitted to the Annals of Statistics ESTIMATING THE EFFECT OF JOINT INTERVENTIONS FROM OBSERVATIONAL DATA IN SPARSE HIGH-DIMENSIONAL SETTINGS arxiv:1407.2451v3 [stat.me] 10 Mar 2016 By Preetam Nandy,,

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix

More information

High-dimensional learning of linear causal networks via inverse covariance estimation

High-dimensional learning of linear causal networks via inverse covariance estimation High-dimensional learning of linear causal networks via inverse covariance estimation Po-Ling Loh Department of Statistics, University of California, Berkeley, CA 94720, USA Peter Bühlmann Seminar für

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Learning Quadratic Variance Function (QVF) DAG Models via OverDispersion Scoring (ODS)

Learning Quadratic Variance Function (QVF) DAG Models via OverDispersion Scoring (ODS) Journal of Machine Learning Research 18 2018 1-44 Submitted 4/17; Revised 12/17; Published 4/18 Learning Quadratic Variance Function QVF DAG Models via OverDispersion Scoring ODS Gunwoong Park Department

More information

Robustification of the PC-algorithm for Directed Acyclic Graphs

Robustification of the PC-algorithm for Directed Acyclic Graphs Robustification of the PC-algorithm for Directed Acyclic Graphs Markus Kalisch, Peter Bühlmann May 26, 2008 Abstract The PC-algorithm ([13]) was shown to be a powerful method for estimating the equivalence

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Interpreting and using CPDAGs with background knowledge

Interpreting and using CPDAGs with background knowledge Interpreting and using CPDAGs with background knowledge Emilija Perković Seminar for Statistics ETH Zurich, Switzerland perkovic@stat.math.ethz.ch Markus Kalisch Seminar for Statistics ETH Zurich, Switzerland

More information

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA Min Jin Ha A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results

Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results Sparse Permutation Invariant Covariance Estimation: Motivation, Background and Key Results David Prince Biostat 572 dprince3@uw.edu April 19, 2012 David Prince (UW) SPICE April 19, 2012 1 / 11 Electronic

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Comment on Article by Scutari

Comment on Article by Scutari Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Likelihood Analysis of Gaussian Graphical Models

Likelihood Analysis of Gaussian Graphical Models Faculty of Science Likelihood Analysis of Gaussian Graphical Models Ste en Lauritzen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 2 Slide 1/43 Overview of lectures Lecture 1 Markov Properties

More information

Statistical Machine Learning for Structured and High Dimensional Data

Statistical Machine Learning for Structured and High Dimensional Data Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,

More information

Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure

Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure Jérôme Thai 1 Timothy Hunter 1 Anayo Akametalu 1 Claire Tomlin 1 Alex Bayen 1,2 1 Department of Electrical Engineering

More information

Arrowhead completeness from minimal conditional independencies

Arrowhead completeness from minimal conditional independencies Arrowhead completeness from minimal conditional independencies Tom Claassen, Tom Heskes Radboud University Nijmegen The Netherlands {tomc,tomh}@cs.ru.nl Abstract We present two inference rules, based on

More information

Inferring Brain Networks through Graphical Models with Hidden Variables

Inferring Brain Networks through Graphical Models with Hidden Variables Inferring Brain Networks through Graphical Models with Hidden Variables Justin Dauwels, Hang Yu, Xueou Wang +, Francois Vialatte, Charles Latchoumane ++, Jaeseung Jeong, and Andrzej Cichocki School of

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Kyu-Baek Hwang and Byoung-Tak Zhang Biointelligence Lab School of Computer Science and Engineering Seoul National University Seoul 151-742 Korea E-mail: kbhwang@bi.snu.ac.kr

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables

Big & Quic: Sparse Inverse Covariance Estimation for a Million Variables for a Million Variables Cho-Jui Hsieh The University of Texas at Austin NIPS Lake Tahoe, Nevada Dec 8, 2013 Joint work with M. Sustik, I. Dhillon, P. Ravikumar and R. Poldrack FMRI Brain Analysis Goal:

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example

Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example Nicholas Cornia & Joris M. Mooij Informatics Institute University of Amsterdam,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models David Sontag New York University Lecture 4, February 16, 2012 David Sontag (NYU) Graphical Models Lecture 4, February 16, 2012 1 / 27 Undirected graphical models Reminder

More information

arxiv: v1 [stat.ml] 23 Oct 2016

arxiv: v1 [stat.ml] 23 Oct 2016 Formulas for counting the sizes of Markov Equivalence Classes Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs arxiv:1610.07921v1 [stat.ml] 23 Oct 2016 Yangbo He

More information