Sparse Graph Learning via Markov Random Fields

Xin Sui, Shao Tang
Sep 23, 2016

Outline
1. Introduction to graph learning
2. Markov Random Fields
3. Gaussian MRF
4. Extensions to the nonparanormal family
5. Ising models
6. Challenges in Poisson models
7. Mixed models

Introduction
Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ be $n$ samples of $p$ random variables $X = \{X_1, X_2, \dots, X_p\}$ with probability measure $P(X)$. The goal is to reconstruct the underlying dependencies among these variables, e.g., in social network analysis, reverse-engineering of gene networks, and discovering functional brain connectivity patterns.
By capturing these dependencies in a graph, a probabilistic graphical model provides a convenient tool for visualization and inference. See Chapter 9 of Hastie et al. (2015) and Chapter 8 of Rish and Grabarnik (2014).

Graph
Formally, a graph is a pair $G = (V, E)$:
$V$: a finite set of vertices.
$E \subseteq V \times V$: a set of edges connecting pairs of vertices $(i, j)$.
In our setting, $V = \{1, 2, \dots, p\}$ corresponds to the set of variables, while the edge set $E$ captures their dependencies.
Undirected graphical models (for any $i, j \in V$, $(i, j) \in E \iff (j, i) \in E$) are easier to handle, although they cannot capture causal dependencies. We focus on learning Markov Random Fields, which capture conditional dependencies through graphs.

Markov Random Fields
A Markov Random Field (MRF) is an undirected graphical model $(X, P(X), G)$ in which the undirected graph $G$ satisfies the global Markov property: for every set $S \subseteq V$ that separates the graph into disconnected components $A, B \subseteq V$,
$$X_A \perp X_B \mid X_S.$$
Here $X_A$ denotes the subset of random variables associated with the vertex subset $A$. The sets $A$ and $B$ are separated by $S$ iff for any $s \in A$, $t \in B$, $(s, t) \notin E$, where $A$, $B$, $S$ form a partition of $V$.
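
As an illustrative aside (not from the slides), graph separation can be checked mechanically: the sketch below uses networkx to test whether removing a candidate separator $S$ disconnects every vertex in $A$ from every vertex in $B$; the chain graph and the sets are made-up examples.
```python
import networkx as nx

def separates(G, A, B, S):
    """Return True if removing S disconnects every vertex in A from every vertex in B."""
    H = G.copy()
    H.remove_nodes_from(S)
    return not any(H.has_node(a) and H.has_node(b) and nx.has_path(H, a, b)
                   for a in A for b in B)

# Chain graph 1 - 2 - 3 - 4 - 5: the middle vertex {3} separates {1, 2} from {4, 5}.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5)])
print(separates(G, A={1, 2}, B={4, 5}, S={3}))  # True
print(separates(G, A={1}, B={4}, S={5}))        # False: removing 5 leaves the path 1-2-3-4
```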

Pairwise Markov property
The pairwise Markov property reveals pairwise conditional independencies (given all other variables) in any MRF and is thus important in our development. In an MRF $(X, P(X), G)$ with $G = (V, E)$, for any $s, t \in V$ with $s \neq t$,
$$(s, t) \notin E \implies X_s \perp X_t \mid X_{V \setminus \{s,t\}}.$$
Here $X_{V \setminus \{s,t\}}$ is the set of all random variables except $X_s$ and $X_t$.
Proof. Let $A = \{s\}$, $B = \{t\}$, $S = V \setminus \{s, t\}$. When $(s, t) \notin E$, $S$ separates $A$ and $B$, and the global Markov property gives $X_s \perp X_t \mid X_{V \setminus \{s,t\}}$.

Examples of MRFs
Two-dimensional grids, useful in computer vision, with each vertex denoting a pixel.
The pairwise MRF model
$$P_\theta(X) = \exp\left\{ \sum_{s=1}^p \theta_s X_s + \sum_{s=1}^p h_\theta(X_s) + \sum_{s=1}^p \sum_{t: t \neq s} \theta_{st} X_s X_t - A(\theta) \right\},$$
for some base function $h_\theta(X_s)$. Here $A(\theta)$ is the log-partition function. This family includes the Gaussian MRF, Ising models, etc.; these specific cases are our focus.

Hammersley-Clifford Theorem
Hammersley-Clifford Theorem: for any strictly positive distribution, the global Markov property holds iff the factorization property holds. Given a graph $G$, it states the necessary and sufficient condition on the joint distribution $P(X)$ such that $(X, P(X), G)$ is an MRF.
Factorization property: $P(X)$ factorizes over $G$ if it has the decomposition
$$P(X_1, \dots, X_p) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(X_C).$$
$Z$: normalization factor (partition function).
$\mathcal{C}$: the set of all maximal cliques in the graph.
$\psi_C$: positive real-valued function (compatibility function).
A clique $C \subseteq V$ is a fully connected subset of the vertex set. A clique is maximal if it is not strictly contained in any other clique.

Example
In the graph $G$ below, the sets $A = \{1,2,3\}$, $B = \{3,4,5\}$, $C = \{4,6\}$, $D = \{5,7\}$ are the maximal cliques. The joint distribution $P$ must have the form
$$P(X) = \frac{1}{Z}\, \psi_{123}(X_1, X_2, X_3)\, \psi_{345}(X_3, X_4, X_5)\, \psi_{46}(X_4, X_6)\, \psi_{57}(X_5, X_7)$$
such that $(X, P(X), G)$ is an MRF.

Minimal I-map
Since the factorization of a joint distribution is not unique, if $G = (V, E)$ is a graph such that the triplet $(X, P(X), G)$ is an MRF, then for any $\tilde{G} = (V, \tilde{E})$ with $\tilde{E} \supseteq E$, $(X, P(X), \tilde{G})$ is also an MRF.
Define the minimal I-map $G^* = (V, E^*)$ of a joint distribution $P(X)$ to be the sparsest graph such that $(X, P(X), G^*)$ is an MRF. Our sole interest is to estimate the minimal I-map, because it is the most informative among all such graphs.

Pairwise MRF models revisited
The pairwise MRF model is
$$P_{\theta^*}(X) = \exp\left\{ \sum_{s=1}^p \theta^*_s X_s + \sum_{s=1}^p h_{\theta^*}(X_s) + \sum_{s=1}^p \sum_{t: t \neq s} \theta^*_{st} X_s X_t - A(\theta^*) \right\}.$$
In the pairwise MRF $(X, P_{\theta^*}(X), G^*)$, for $s \neq t$:
From the model definition: $\theta^*_{st} = 0 \iff X_s \perp X_t \mid X_{V \setminus \{s,t\}}$.
From the pairwise Markov property: $(s, t) \notin E^* \implies X_s \perp X_t \mid X_{V \setminus \{s,t\}}$.
From the definition of the minimal I-map: $X_s \perp X_t \mid X_{V \setminus \{s,t\}} \implies (s, t) \notin E^*$.
Therefore $\theta^*_{st} = 0 \iff (s, t) \notin E^*$. It suffices to estimate $\theta^*_{st}$.

Example of Gaussian MRF on stock data
We first show an example of a Gaussian MRF. The dataset contains the closing prices of 452 stocks over 1258 trading days. We use the first 20 stocks to plot their dependencies. Clusters of stocks show dependencies within them.

Gaussian MRFs
When $X \sim N(0, \Sigma^*)$ (assume $\mu = 0$ for simplicity), the probability density function (PDF) $f(X)$ takes the form
$$f_{\Theta^*}(X) = \exp\left\{ -\frac{1}{2} \sum_{s,t=1}^p \Theta^*_{st} X_s X_t - A(\Theta^*) \right\}.$$
$\Theta^* = (\Sigma^*)^{-1}$: precision matrix; $A(\Theta^*) = -\frac{1}{2} \log\det[\Theta^*/(2\pi)]$: log-partition function.
$f_{\Theta^*}(X)$ is exactly the pairwise model $P_{\theta^*}(X)$ with $\theta^*_s = 0$, $h_{\theta^*}(X_s) = -\Theta^*_{ss} X_s^2/2$, and $\theta^*_{st} = -\Theta^*_{st}/2$.
$(s, t) \notin E^* \iff \Theta^*_{st} = 0$. It suffices to estimate $\Theta^*$!
In estimation, sparsity-inducing penalties on the precision matrix are preferred to enforce sparsity on the graph. The $\ell_1$-penalized optimization problem is called the graphical lasso.
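
To make the correspondence between zeros of $\Theta^*$ and missing edges concrete, here is a small numpy sketch (an illustration, not part of the slides): we build the precision matrix of a chain graph, sample from the implied Gaussian, and observe that the empirical precision entries for non-adjacent pairs are close to zero.
```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

# Precision matrix of a chain graph 1-2-3-4-5: nonzeros only on the first off-diagonal.
Theta = np.eye(p) + np.diag([-0.4] * (p - 1), 1) + np.diag([-0.4] * (p - 1), -1)
Sigma = np.linalg.inv(Theta)

# Sample n observations and form the empirical precision matrix.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=20000)
Theta_hat = np.linalg.inv(np.cov(X, rowvar=False))

# Entries for missing edges (e.g., the pair (1, 3), i.e., indices (0, 2)) are near 0.
print(np.round(Theta_hat, 2))
```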

Optimization criterion for graphical lasso
Optimization criterion (penalized maximum likelihood):
$$\min_{\Theta \in S^p_{++}} f(\Theta) := -\log\det\Theta + \mathrm{tr}(S\Theta) + \lambda \|\Theta\|_1$$
$S^p_{++}$: set of positive definite symmetric matrices of dimension $p \times p$.
$S$: sample covariance matrix (with the mean given).
$\lambda$: tuning parameter for the graphical lasso.
The problem is strictly convex (strictly convex objective function over a convex domain). The dual problem is
$$\max_{\|W - S\|_\infty \le \rho} \log\det W + p.$$
It is concave and smooth, and its solution gives an estimate of the covariance matrix.
Estimation methods: neighborhood selection, glasso, projected gradient, greedy coordinate ascent, to name only a few.
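
As a hedged illustration of solving this criterion in practice (not part of the slides), scikit-learn's GraphicalLasso estimator fits the $\ell_1$-penalized Gaussian likelihood; the data below are synthetic, and alpha plays the role of $\lambda$ above.
```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Synthetic data from a sparse chain-structured precision matrix.
p = 10
Theta_true = np.eye(p) + np.diag([-0.4] * (p - 1), 1) + np.diag([-0.4] * (p - 1), -1)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta_true), size=500)

model = GraphicalLasso(alpha=0.1)   # alpha is the l1 tuning parameter
model.fit(X)

# Estimated graph: nonzero off-diagonal entries of the estimated precision matrix.
Theta_hat = model.precision_
edges = [(s, t) for s in range(p) for t in range(s + 1, p) if abs(Theta_hat[s, t]) > 1e-8]
print(edges)
```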

Neighborhood selection in Gaussian graph learning
Neighborhood selection is an approximate optimization method; see Meinshausen and Bühlmann (2006) for reference.
For every random variable $X_s$, regress it on all the other random variables $X_{\setminus s}$ with the model $X_s \approx \sum_{t: t \neq s} \beta^*_{st} X_t$. Under the assumption $\mu = 0$,
$$X_s \mid X_{\setminus s} \sim N(Y_s, 1/\Theta^*_{ss}), \quad \text{where } Y_s = -\sum_{t: t \neq s} (\Theta^*_{st}/\Theta^*_{ss}) X_t.$$
Let $\hat\Theta_{st} = 0$ iff $\hat\beta_{st} = 0$.
Simple and scalable. Also, the estimation of $\mathrm{sign}(\Theta^*)$ is consistent, where $\mathrm{sign}(\cdot)$ operates component-wise. The estimation of $\Theta^*$ itself is not necessarily consistent, and it may violate the symmetry and positive-definiteness constraints.

Algorithm 1 Neighborhood-based graph selection for graphical lasso
1: for each vertex $s = 1, 2, \dots, p$ do
2:   Solve the neighborhood prediction problem
     $$\hat\beta^s = \arg\min_{\beta} \frac{1}{2n} \|X_{\cdot,s} - X_{\cdot,\setminus s}\,\beta\|_2^2 + \lambda \|\beta\|_1$$
3:   Compute the estimate $\hat N(s) = J(\hat\beta^s)$ of the neighborhood set $N(s)$.
4: end for
5: Combine the neighborhood estimates $\{\hat N(s),\, s \in V\}$ via the AND or OR rule to form a graph estimate $\hat G = (V, \hat E)$.
$X_{\cdot,s}$ is the $s$-th column of $\mathbf{X}$, and $X_{\cdot,\setminus s}$ is $\mathbf{X}$ without this column. $J(\hat\beta^s)$ is the support of $\hat\beta^s$.
Neighborhood set: $N(s) := \{t \in V : (s, t) \in E^*\}$.
AND/OR rule: $(s, t) \in \hat E$ iff $\hat\beta_{st} \neq 0$ AND/OR $\hat\beta_{ts} \neq 0$.
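
A minimal Python sketch of Algorithm 1 (illustrative; it assumes an $n \times p$ numpy array X and uses scikit-learn's Lasso, whose objective is exactly the $\frac{1}{2n}$-scaled squared loss above, so alpha corresponds to $\lambda$):
```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam, rule="AND"):
    """Estimate the edge set by l1-regularized regression of each column on the others."""
    n, p = X.shape
    B = np.zeros((p, p))                        # B[s, t] holds beta_st
    for s in range(p):
        others = [t for t in range(p) if t != s]
        lasso = Lasso(alpha=lam, fit_intercept=False)
        lasso.fit(X[:, others], X[:, s])
        B[s, others] = lasso.coef_
    # Combine per-node supports with the AND or OR rule.
    if rule == "AND":
        adjacency = (B != 0) & (B.T != 0)
    else:
        adjacency = (B != 0) | (B.T != 0)
    return {(s, t) for s in range(p) for t in range(s + 1, p) if adjacency[s, t]}
```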

Projected gradient approach
Recall the dual problem
$$\max_{\|W - S\|_\infty \le \rho} \log\det W + p.$$
In the dual problem, iterate between (1) and (2):
1. a gradient update $W \leftarrow W + \alpha W^{-1}$, where $\alpha > 0$;
2. a projection $W \leftarrow \arg\min_Z \{\|Z - W\|_2 : \|Z - S\|_\infty \le \rho\}$.
Same order of time complexity as glasso, $O(p^3)$, but empirically outperforms it by a factor of about 2.
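
A bare-bones sketch of these two steps (illustrative, not from the slides; the fixed step size and the lack of line search or positive-definiteness safeguards are simplifying assumptions, and the Euclidean projection onto $\{\|Z - S\|_\infty \le \rho\}$ reduces to elementwise clipping):
```python
import numpy as np

def projected_gradient_dual(S, rho, alpha=0.1, n_iter=500):
    """Maximize log det W over {W : |W - S|_inf <= rho} by projected gradient ascent."""
    W = S + rho * np.eye(S.shape[0])      # feasible, positive definite starting point
    for _ in range(n_iter):
        W = W + alpha * np.linalg.inv(W)  # gradient of log det W is W^{-1}
        W = np.clip(W, S - rho, S + rho)  # Euclidean projection = elementwise clipping
        W = (W + W.T) / 2                 # keep the iterate symmetric
    Theta_hat = np.linalg.inv(W)          # primal estimate of the precision matrix
    return W, Theta_hat
```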

Greedy coordinate ascent on the primal problem
Recall the primal problem
$$\min_{\Theta \in S^p_{++}} f(\Theta) := -\log\det\Theta + \mathrm{tr}(S\Theta) + \lambda \|\Theta\|_1.$$
At each iteration $t$, the update is
$$\Theta^{(t+1)} \leftarrow \Theta^{(t)} + \hat\theta^{(t)} (e_r e_s^T + e_s e_r^T), \quad \text{where } (\hat\theta^{(t)}, r, s) = \arg\min_{\theta, i, j} f\big(\Theta^{(t)} + \theta (e_i e_j^T + e_j e_i^T)\big)$$
($e_j \in \mathbb{R}^p$ has only its $j$-th entry nonzero, with value 1).
Advantages:
1. Time complexity $O(p^2)$: better scaling with respect to $p$ than glasso.
2. Naturally preserves the sparsity of the solution.
3. Fixing a small $\lambda$, it can directly produce a greedy solution path.
4. Massive parallelization is straightforward.

Nonparanormal family
Gaussian graph learning techniques extend naturally to the case where the joint distribution is nonparanormal.
Definition. A random vector $X = (X_1, X_2, \dots, X_p)^T$ has a nonparanormal distribution if there exist functions $(f_1, f_2, \dots, f_p)$ such that $(f_1(X_1), f_2(X_2), \dots, f_p(X_p)) \sim N(0, \Sigma_0)$ with $\mathrm{diag}(\Sigma_0) = 1$.
Property: $f_j = \Phi^{-1} \circ F_j$ for all $j$. Here $\Phi(\cdot)$ is the CDF of the standard normal distribution, and $F_j$ is the marginal CDF of $X_j$.
Limitation: does not work for discrete random variables.
Connection. If each $f_j$ is differentiable, then $X_i \perp X_j \mid X_{\setminus\{i,j\}} \iff (\Theta_0)_{ij} = 0$ (Liu et al., 2009). It suffices to estimate $\Sigma_0$, or $\Theta_0$, as in the Gaussian case.
However, estimating the sample covariance matrix $S_0$ is not straightforward: we first need to estimate the $f_j$'s. This leads to the copula transform method.

Estimation of S
Copula transform method. Estimate $F_j$ by its (Winsorized) empirical CDF, then use $\hat f_j = \Phi^{-1} \circ \hat F_j$ for all $j$ to compute $S$.
Alternatively, Xue et al. (2012b) and Liu et al. (2012) proposed a rank-based method, noticing that each $f_j(x)$ is monotone increasing in $x$:
1. For each $X_{ij}$, compute its rank statistic $\hat r_{ij}$ based on its rank within variable $X_j$.
2. Compute $\hat\rho$, the estimated correlation matrix of the rank statistics.
3. Approximate $S$ by $2\sin(\pi\hat\rho/6)$, where $\sin(\cdot)$ is applied component-wise.
With rank-based methods there is no need to tune the Winsorization parameter; they trade some estimation efficiency for robustness.
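
A short sketch of the rank-based pipeline (illustrative; it assumes scipy's Spearman correlation and the graphical_lasso solver used earlier, and simply plugs the transformed correlation matrix in as the covariance input):
```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.covariance import graphical_lasso

def nonparanormal_graph(X, lam):
    """Rank-based nonparanormal graph estimate: Spearman rho -> 2*sin(pi*rho/6) -> graphical lasso."""
    rho, _ = spearmanr(X)                    # p x p matrix of Spearman rank correlations
    S_hat = 2 * np.sin(np.pi * rho / 6)      # component-wise transform of the rank correlations
    np.fill_diagonal(S_hat, 1.0)             # enforce unit diagonal
    # Note: in small samples S_hat may need projecting to the positive semidefinite cone first.
    covariance, precision = graphical_lasso(S_hat, alpha=lam)
    return precision
```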

Ising model
The Ising model is a pairwise MRF in which each random variable $X_s$ takes values in $\{-1, +1\}$ for all $s \in V$:
$$P_{\theta^*}(X) = \exp\left\{ \sum_{s \in V} \theta^*_s X_s + \sum_{(s,t) \in E} \theta^*_{st} X_s X_t - A(\theta^*) \right\}.$$
Originated from studies of ferromagnetism in physics.
Related to the Restricted Boltzmann Machine (RBM) in the machine learning literature. In fact, if we partition $V$ into $V_1$ and $V_2$, then when $\theta^*_{st} = 0$ for all $s, t \in V_k$ ($k = 1, 2$), and the $X_j$'s are unobserved for all $j \in V_2$, the Ising model reduces to an RBM.
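
Data from an Ising model can be generated by Gibbs sampling, cycling through the variables and resampling each from its conditional $P(X_s = +1 \mid X_{\setminus s}) = \sigma(2Y_s)$ with $Y_s = \theta_s + \sum_{t \neq s}\theta_{st}X_t$ (a standard technique sketched here for illustration; the interface and defaults are made up):
```python
import numpy as np

def gibbs_sample_ising(theta, Theta, n_samples, n_burn=200, rng=None):
    """Gibbs sampler for a pairwise Ising model with node parameters theta (length p)
    and symmetric edge parameters Theta (p x p, zero diagonal)."""
    rng = rng or np.random.default_rng(0)
    p = len(theta)
    x = rng.choice([-1, 1], size=p)
    samples = []
    for it in range(n_burn + n_samples):
        for s in range(p):
            y_s = theta[s] + Theta[s] @ x - Theta[s, s] * x[s]   # exclude any self term
            prob_plus = 1.0 / (1.0 + np.exp(-2.0 * y_s))          # P(X_s = +1 | rest)
            x[s] = 1 if rng.random() < prob_plus else -1
        if it >= n_burn:
            samples.append(x.copy())
    return np.array(samples)
```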

Example: image de-noising (Bishop, 2006)

Example: politician networks
Example from Hastie et al. (2015). Politician networks estimated from voting records of the U.S. Senate (2004-2006), with Democratic/Republican/Independent senators coded as blue/red/yellow nodes, respectively. Panel (b) is a smaller subgraph of the same social network. The subgraph shows a strong bipartite tendency, with clustering along party lines; a few senators show cross-party connections.

Limitation
In this slide only, let $P(X)$ be the true joint distribution of $X$ rather than the distribution in the pairwise MRF model.
Recall that if $X$ is multivariate Gaussian, then $P(X)$ can be represented exactly as a pairwise MRF distribution. However, when the $X_s$ are Bernoulli random variables, $P(X)$ may not fit into the pairwise MRF framework! E.g.,
$$P(X_1, X_2, X_3) = \frac{1}{Z} e^{X_1 X_2 X_3}$$
for some normalizing constant $Z$.
The Ising model may be extended by adding higher-order interaction terms.

Neighborhood selection in Ising models
$A(\theta^*)$ is a summation over $2^p$ terms and is computationally intractable!
$$A(\theta^*) = \log \sum_{x \in \{-1,+1\}^p} \exp\left\{ \sum_{s \in V} \theta^*_s x_s + \sum_{(s,t) \in E} \theta^*_{st} x_s x_t \right\}$$
Again, neighborhood selection can be used to approximate a solution; see Ravikumar et al. (2010) for reference. It can be shown that
$$P(X_s \mid X_{\setminus s}) = \frac{e^{2 X_s Y_s}}{1 + e^{2 X_s Y_s}}, \quad \text{where } Y_s = \theta^*_s + \sum_{t: t \neq s} \theta^*_{st} X_t.$$
In logistic regression with $X_s$ as the response and $X_{\setminus s}$ as the predictors, the conditional probability has the same form. In the algorithm, we only need to change the squared-error loss of the Gaussian case to the logistic loss. Again, only sign-consistency results are established in the literature (Xue et al., 2012a).

Algorithm 2 Neighborhood-based graph selection for Ising models
1: for each vertex $s = 1, 2, \dots, p$ do
2:   Solve the neighborhood prediction problem
     $$(\hat\theta_s, \hat\theta_{s,\setminus s}) = \arg\min_{\theta_s, \theta_{s,\setminus s}} -\frac{1}{n} \sum_{i=1}^n \log P(X_{is} \mid X_{i,\setminus s}; \theta_s, \theta_{s,\setminus s}) + \lambda \|\theta_{s,\setminus s}\|_1$$
3:   Compute the estimate $\hat N(s) = J(\hat\theta_{s,\setminus s})$ of the neighborhood set $N(s)$.
4: end for
5: Combine the neighborhood estimates $\{\hat N(s),\, s \in V\}$ via the AND or OR rule to form a graph estimate $\hat G = (V, \hat E)$.
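
A sketch of Algorithm 2 using scikit-learn's $\ell_1$-penalized logistic regression for each node (illustrative; X is assumed to be an $n \times p$ array with entries in $\{-1, +1\}$, and sklearn's C is an inverse regularization strength, so it only roughly corresponds to $1/(n\lambda)$):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_neighborhood_selection(X, C=0.1, rule="AND"):
    """Estimate the Ising graph by l1-penalized logistic regression of each node on the rest.
    Fitted coefficients are proportional to 2*theta_st; only their support matters here."""
    n, p = X.shape
    W = np.zeros((p, p))
    for s in range(p):
        others = [t for t in range(p) if t != s]
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X[:, others], X[:, s])
        W[s, others] = clf.coef_.ravel()
    if rule == "AND":
        adjacency = (W != 0) & (W.T != 0)
    else:
        adjacency = (W != 0) | (W.T != 0)
    return {(s, t) for s in range(p) for t in range(s + 1, p) if adjacency[s, t]}
```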

Pseudo-likelihood approach
Another approach is to use the pseudo-likelihood
$$P(X_1, X_2, \dots, X_p) \approx \prod_{i=1}^p P(X_i \mid X_{\setminus i}) =: L(\theta).$$
The approximation is exact when all the $X_i$'s are independent. In general, the sparser the graph, the better the approximation.
Optimization criterion: $\min_\theta\; -\log L(\theta) + \lambda \|\theta\|_1$.
Optimization methods.
Höfling and Tibshirani (2009): at each iteration $t$, approximate $-\log L$ by its second-order Taylor expansion at $\theta^{(t)}$, keeping only the diagonal elements of the Hessian (denote this by $Q(\theta; \theta^{(t)})$). Then $\theta^{(t+1)} \leftarrow \arg\min_\theta Q(\theta; \theta^{(t)}) + \lambda \|\theta\|_1$.
Xue et al. (2012a) used coordinate descent to update each parameter iteratively. At each iteration, a quadratic function is used to majorize $-\log L$. It differs from Höfling and Tibshirani (2009) in that (a) it updates only one parameter at each iteration, and (b) it bounds $\partial^2 (-\log L)/\partial\theta_{st}^2$ instead of computing its value for every parameter $\theta_{st}$.
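
For concreteness, here is a sketch of the (negative) Ising pseudo-log-likelihood that such methods minimize, using the conditional form $P(X_s \mid X_{\setminus s}) = e^{2X_sY_s}/(1 + e^{2X_sY_s})$ from the previous slides (illustrative; theta is assumed to be a length-$p$ vector and Theta a symmetric $p \times p$ matrix with zero diagonal):
```python
import numpy as np

def neg_log_pseudolikelihood(theta, Theta, X):
    """Average negative log pseudo-likelihood of an Ising model over the sample X (n x p, entries +/-1)."""
    Y = theta + X @ Theta               # Y[i, s] = theta_s + sum_t Theta_st * X[i, t]
    # log P(x_s | rest) = 2*x_s*Y_s - log(1 + exp(2*x_s*Y_s)); use logaddexp for stability
    Z = 2.0 * X * Y
    log_cond = Z - np.logaddexp(0.0, Z)
    return -log_cond.sum(axis=1).mean()
```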

Comparison between NS and PL
Both are approximation methods: the intractability of $A(\theta)$ is avoided by working with conditional likelihoods.
Generally, $\hat\beta_{ij} \neq \hat\beta_{ji}$ in the neighborhood selection method.
Theoretically, $\hat\beta^{NS}$ (the neighborhood selection estimate of $\beta$) is sign consistent under some conditions (Ravikumar et al., 2010). Under similar conditions, $\hat\beta^{PL}$ (the pseudo-likelihood estimate of $\beta$) is not only sign consistent, but its estimation error is also $O(\lambda |E^*|)$ (Xue et al., 2012a), where $|E^*|$ is the number of edges in the true model.

Comparison in experiments
Computational time. Both methods are fast in practice. Neighborhood selection is faster on large datasets, solely because each parameter is updated only once, whereas in pseudo-likelihood approaches it must be updated until a convergence criterion is met.
Accuracy. When estimating the edges there is hardly any difference, but pseudo-likelihood approaches are slightly better at estimating the joint distribution.
We show some results from Höfling and Tibshirani (2009).

Computational time
(Table of timings from Höfling and Tibshirani (2009); P: number of variables, N: number of observations, Neigh: average number of neighbors per node.)
This may be misleading! Neighborhood selection is faster on larger datasets: in my own experiment, PL took 136.219 s while NS took 13.814 s on a dataset of size (n, p) = (1000, 50).

Edge recovery

Estimation of the joint distribution

Challenges in Poisson models
Karlis (2003) studies the case $X_s = Y_0 + Y_s$, where $Y_j$, $j = 0, 1, \dots, p$, are independent Poisson random variables. This construction can only model positive correlations.
The pairwise Poisson graphical model is
$$P_{\theta^*}(X) = \exp\left\{ \sum_{s \in V} (\theta^*_s X_s - \log X_s!) + \sum_{(s,t) \in E} \theta^*_{st} X_s X_t - A(\theta^*) \right\}.$$
$A(\theta^*) < +\infty$ only if $\theta^*_{st} \le 0$ for all $(s, t) \in E$, so this model can only capture negative correlations.
These issues may be addressed by truncating the Poisson distribution or modifying the loss function (see, e.g., Yang et al. (2013)), but such techniques are ad hoc and have limited applicability.

Mixed models
We would like to model dependencies among variables that come from different distributions. We use a different notation in this slide: let $X = (X_1, X_2, \dots, X_p)$ be $p$ continuous random variables and $Y = (Y_1, Y_2, \dots, Y_q)$ be $q$ discrete ones, with $Y_j$ having $L_j$ possible states. In the pairwise model, $P(X, Y)$ is proportional to
$$\exp\left\{ \sum_{s=1}^p \gamma_s X_s - \frac{1}{2} \sum_{s=1}^p \sum_{t=1}^p \theta_{st} X_s X_t + \sum_{s=1}^p \sum_{j=1}^q \rho_{sj}[Y_j] X_s + \sum_{j=1}^q \sum_{r=1}^q \psi_{jr}[Y_j, Y_r] \right\}.$$
Each $\rho_{sj}$ is a vector of $L_j$ parameters, and each $\psi_{jr}$ is a matrix with $L_j \times L_r$ elements.
$P(X_s \mid X_{\setminus s}, Y)$ is Gaussian, while $P(Y_j \mid X, Y_{\setminus j})$ is multinomial, so neighborhood selection still applies.
The Gaussian approximation for continuous random variables may not be appropriate, and the model cannot handle Poisson data.

Bibliography I
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(Apr):883-906, 2009.
Dimitris Karlis. An EM algorithm for multivariate Poisson distribution and related models. Journal of Applied Statistics, 30(1):63-77, 2003.
Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295-2328, 2009.
Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, pages 2293-2326, 2012.
Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436-1462, 2006.

Bibliography II
Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.
Irina Rish and Genady Grabarnik. Sparse Modeling: Theory, Algorithms, and Applications. CRC Press, 2014.
Lingzhou Xue, Hui Zou, and Tianxi Cai. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics, pages 1403-1429, 2012a.
Lingzhou Xue, Hui Zou, et al. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541-2571, 2012b.
Eunho Yang, Pradeep K. Ravikumar, Genevera I. Allen, and Zhandong Liu. On Poisson graphical models. In Advances in Neural Information Processing Systems, pages 1718-1726, 2013.