Chapter 17: Undirected Graphical Models


The Elements of Statistical Learning, Chapter 17
Biaobin Jiang, Department of Biological Sciences, Purdue University (bjiang@purdue.edu)
October 30, 2014

Overview

1. Introduction
   - Probabilistic Graphical Models
   - Review: Multivariate Statistics
   - Review: Matrix Operations
2. Undirected Graphical Models for Continuous Variables
   - Connection with Multiple Linear Regression
   - Estimation of Parameters with Known Structure
   - Estimation of Graph Structure
3. Undirected Graphical Models for Discrete Variables

Introduction / Probabilistic Graphical Models
What are probabilistic graphical models?

A graph consists of a set of vertices (nodes), along with a set of edges joining some pairs of the vertices. In graphical models, each vertex represents a random variable, and the graph gives a visual way of understanding the joint distribution of the entire set of random variables.

Introduction / Probabilistic Graphical Models
How it works

Categories of PGMs:
- Directed graphical models, a.k.a. Bayesian networks
- Undirected graphical models, a.k.a. Markov random fields

Computational tasks for PGMs:
- Structuring: choosing the structure of the graph;
- Learning: estimating the edge parameters from data; and
- Inference: computing marginal vertex probabilities and expectations from the joint distribution.

Introduction / Review: Multivariate Statistics
The multivariate normal (MVN) distribution

The MVN distribution is a generalization of the univariate normal distribution, whose density function (p.d.f.) is

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\},

where \mu is the mean and \sigma^2 the variance of the distribution. In p dimensions the density becomes

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},

where \mu is a p-dimensional mean vector and \Sigma is a symmetric covariance matrix.
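
To make the formula concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the p-dimensional density; the function name mvn_pdf and the test values are purely illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N_p(mu, Sigma) at x, following the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative values only.
x = np.array([0.5, -1.0])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(x, mu, Sigma))
```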

Introduction / Review: Multivariate Statistics
Conditional distribution of the MVN

Let X = (X_1, X_2) be a partitioned MVN random p-vector, with mean

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
\quad\text{and covariance matrix}\quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.

The conditional distribution of X_2 given X_1 = x_1 is MVN with

E(X_2 \mid X_1 = x_1) = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1),
\qquad
Cov(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.
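
A small sketch of these conditioning formulas, assuming NumPy; the helper mvn_conditional and the example numbers are made up for illustration.

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, x1):
    """Mean and covariance of X2 | X1 = x1 for a partitioned MVN,
    using the formulas above. idx1/idx2 are integer index arrays for the two blocks."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov

# Condition X_2 = (X_1, X_2 coordinates 1 and 2) on X_0 = 1.5 (illustrative values).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
m, C = mvn_conditional(mu, Sigma, np.array([0]), np.array([1, 2]), np.array([1.5]))
print(m, C, sep="\n")
```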

Introduction / Review: Matrix Operations
Matrix trace

In linear algebra, the trace of an n-by-n square matrix A is defined to be the sum of the elements on its main diagonal:

\operatorname{tr}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}.

The matrix trace has several basic properties:

\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B),
\qquad
\operatorname{tr}(AB) = \operatorname{tr}(BA).
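
Both properties are easy to spot-check numerically; the snippet below is an illustration only, using random 4x4 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # linearity
assert np.isclose(np.trace(A @ B), np.trace(B @ A))            # cyclic property
```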

Undirected Graphical Models for Continuous Variables
Estimation of Parameters with Known Graph Structure

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
What is parameter estimation?

Given the empirical covariance matrix S, find the estimate \hat\Sigma = W and its inverse \hat\Sigma^{-1} = \Theta. In particular, if the ij-th component of \Theta is zero, then variables i and j are conditionally independent, given the other variables. In other words, there is no edge between vertices i and j.

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
Conditional mean and multiple linear regression

Suppose we partition X = (Z, Y), where Z = (X_1, \ldots, X_{p-1}) and Y = X_p. Then the conditional distribution of Y given Z is (Eq. 17.6)

Y \mid Z = z \;\sim\; N\!\left( \mu_Y + (z - \mu_Z)^T \Sigma_{ZZ}^{-1} \sigma_{ZY},\;\; \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} \right),

where we have partitioned \Sigma as (Eq. 17.7)

\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \sigma_{ZY} \\ \sigma_{ZY}^T & \sigma_{YY} \end{pmatrix}.

The conditional mean in (17.6) has exactly the same form as the population multiple linear regression of Y on Z, with regression coefficient \beta = \Sigma_{ZZ}^{-1} \sigma_{ZY}. (Proof on the next slide.)

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
Proof

Assume, without loss of generality, that Z and Y have zero means. By Eq. (2.9), the expected prediction error is

EPE(f) = E(y - f(z))^2 = E(y - z^T\beta)^2 = E\left[ y^2 - 2 y z^T \beta + \beta^T z z^T \beta \right].

Differentiating with respect to \beta and setting the derivative to zero,

\frac{d\,EPE(f)}{d\beta} = E\left[ -2 y z + 2 z z^T \beta \right] = 0,

we obtain

\beta = E(z z^T)^{-1} E(y z) = \Sigma_{ZZ}^{-1} \sigma_{ZY}.

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
How to solve for the inverse \Theta

The standard formulas for partitioned inverses give \Sigma\Theta = I, i.e.,

\begin{pmatrix} \Sigma_{ZZ} & \sigma_{ZY} \\ \sigma_{ZY}^T & \sigma_{YY} \end{pmatrix}
\begin{pmatrix} \Theta_{ZZ} & \theta_{ZY} \\ \theta_{ZY}^T & \theta_{YY} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix}.

From the last column we obtain

\Sigma_{ZZ}\,\theta_{ZY} + \sigma_{ZY}\,\theta_{YY} = 0,
\qquad
\sigma_{ZY}^T\,\theta_{ZY} + \sigma_{YY}\,\theta_{YY} = 1.

Solving these two equations gives Eq. (17.8),

\theta_{ZY} = -\theta_{YY}\,\Sigma_{ZZ}^{-1}\sigma_{ZY},
\quad\text{where}\quad
1/\theta_{YY} = \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} > 0.

Hence we have Eq. (17.9),

\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY} = -\theta_{ZY}/\theta_{YY}.
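
A quick numerical sanity check (not in the slides) of Eqs. (17.8)-(17.9): the regression coefficient computed from the covariance partition should equal -theta_ZY/theta_YY computed from the precision matrix. The random covariance here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.normal(size=(p, p))
Sigma = M @ M.T + p * np.eye(p)          # a random positive-definite covariance
Theta = np.linalg.inv(Sigma)

# Treat the last variable as Y and the rest as Z.
Szz, szy = Sigma[:-1, :-1], Sigma[:-1, -1]
beta_from_cov = np.linalg.solve(Szz, szy)            # beta = Sigma_ZZ^{-1} sigma_ZY
beta_from_prec = -Theta[:-1, -1] / Theta[-1, -1]     # beta = -theta_ZY / theta_YY (Eq. 17.9)
print(np.allclose(beta_from_cov, beta_from_prec))    # True
```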

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
What we have learned

The dependence of Y on Z in (17.6) is in the mean term alone. Here we see exactly that zero elements in \beta, and hence in \theta_{ZY}, mean that the corresponding elements of Z are conditionally independent of Y, given the rest. We can therefore learn about this dependence structure through multiple linear regression.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Maximum likelihood estimation for the MVN

Let X^T = (x_1, \ldots, x_N) be sampled from N_p(\mu, \Sigma). The MLEs of \mu and \Sigma are the sample mean and the empirical covariance matrix (Eq. 17.10):

\hat\mu = \bar x = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\hat\Sigma = S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar x)(x_i - \bar x)^T.

The likelihood is a function of the parameters \mu and \Sigma given the data X:

L(\mu, \Sigma \mid X) = \prod_{i=1}^{N} f(x_i \mid \mu, \Sigma)
= (2\pi)^{-Np/2}\,|\Sigma|^{-N/2} \exp\left\{ -\tfrac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right\}.
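
For concreteness, a short sketch (assuming NumPy, with made-up parameter values) computing the MLEs in Eq. (17.10) from simulated data; note the 1/N divisor rather than 1/(N-1).

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0, 0.5])
Sigma_true = np.array([[1.0, 0.4, 0.0],
                       [0.4, 2.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)   # N x p data matrix

mu_hat = X.mean(axis=0)                              # sample mean (Eq. 17.10)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]       # MLE covariance, divisor 1/N
print(mu_hat, Sigma_hat, sep="\n")
```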

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Log-likelihood

Minus twice the log-likelihood of the data can be written, up to a constant, as

-2 \log L(\mu, \Sigma \mid X) = N \log|\Sigma| + \sum_{i=1}^{N} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + C.

Dividing by N, setting \mu = \bar x and writing \Theta = \Sigma^{-1}, maximizing the likelihood is equivalent to maximizing Eq. (17.11),

\ell(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta),

since \log\det\Theta = -\log|\Sigma| and

\operatorname{tr}(S\Theta)
= \operatorname{tr}\!\left( \tfrac{1}{N} \sum_i (x_i - \bar x)(x_i - \bar x)^T \,\Theta \right)
= \tfrac{1}{N} \sum_i \operatorname{tr}\!\left( (x_i - \bar x)^T \Theta (x_i - \bar x) \right)
= \tfrac{1}{N} \sum_i (x_i - \bar x)^T \Sigma^{-1} (x_i - \bar x).
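
The trace identity used above is easy to verify numerically; the throwaway check below (illustrative values only) confirms that tr(SΘ) equals the averaged quadratic form and that log det Θ = -log|Σ|.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
xbar = X.mean(axis=0)
C = X - xbar
S = C.T @ C / X.shape[0]                      # empirical covariance (1/N)

M = rng.normal(size=(4, 4))
Sigma = M @ M.T + np.eye(4)                   # an arbitrary positive-definite Sigma
Theta = np.linalg.inv(Sigma)

lhs = np.trace(S @ Theta)
rhs = np.mean([c @ Theta @ c for c in C])     # (1/N) sum_i (x_i - xbar)^T Theta (x_i - xbar)
print(np.isclose(lhs, rhs))                                                        # True
print(np.isclose(np.log(np.linalg.det(Theta)), -np.log(np.linalg.det(Sigma))))     # True
```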

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Missing edges: equality constraints

Now we would like to maximize the log-likelihood under the constraint that some pre-defined subset of the parameters is zero:

\text{maximize}_{\Theta}\;\; \ell_C(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta)
\quad\text{subject to}\quad \theta_{jk} = 0 \;\;\text{for all } (j,k) \notin E.

Adding Lagrange multipliers, we obtain Eq. (17.12),

\text{maximize}_{\Theta}\;\; \ell_C(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta) - \sum_{(j,k)\notin E} \gamma_{jk}\,\theta_{jk}.

Setting the derivative to zero gives Eq. (17.13),

\Theta^{-1} - S - \Gamma = 0,

where \Gamma is a matrix of Lagrange parameters with nonzero values for all missing edges.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Solving (17.13) by multiple linear regression

Step 1: Partition W; the upper-right block of (17.13) gives Eq. (17.14):

w_{12} - s_{12} - \gamma_{12} = 0.

Step 2: Connect w_{12} with \beta. As in Eq. (17.16),

\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}
\begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix},

which implies Eq. (17.17):

w_{12} = -W_{11}\,\theta_{12}/\theta_{22} = W_{11}\,\beta.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Solving (17.13) by multiple linear regression (cont.)

Step 3: Use simple subset regression to solve Eq. (17.18),

W_{11}\,\beta - s_{12} - \gamma_{12} = 0.

If \gamma_j \ne 0 (i.e., the corresponding edge is missing), we remove the j-th row and column and the j-th element of \beta, giving the reduced system of equations, Eq. (17.19),

W_{11}^{*}\,\beta^{*} - s_{12}^{*} = 0.

Step 4: Update \theta_{22} and \theta_{12} (Eq. 17.20):

1/\theta_{22} = s_{22} - w_{12}^T \hat\beta,
\qquad
\theta_{12} = -\hat\beta\,\theta_{22}.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Summary: Algorithm 17.1

[Algorithm box: ESL Algorithm 17.1, "A Modified Regression Algorithm for Estimation of an Undirected Gaussian Graphical Model with Known Structure".]
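
Below is a minimal, non-authoritative sketch of the modified-regression procedure described on the preceding slides (cf. Algorithm 17.1). The function name ggm_known_structure, the convergence test, and the boolean edge-matrix encoding are choices made for this illustration, not prescriptions from the book.

```python
import numpy as np

def ggm_known_structure(S, edges, max_cycles=100, tol=1e-8):
    """Sketch of the modified regression algorithm (cf. ESL Algorithm 17.1).
    S: empirical covariance (p x p); edges: boolean p x p adjacency matrix with
    a False diagonal. Returns (W, Theta) with W = Sigma-hat and Theta = W^{-1}."""
    p = S.shape[0]
    W = S.copy()
    betas = [np.zeros(p - 1) for _ in range(p)]
    for _ in range(max_cycles):
        W_prev = W.copy()
        for j in range(p):
            rest = np.arange(p) != j
            W11, s12 = W[np.ix_(rest, rest)], S[rest, j]
            keep = edges[rest, j]                     # edges present between j and the rest
            beta = np.zeros(p - 1)
            if keep.any():
                # reduced system W11* beta* = s12* (Eq. 17.19)
                beta[keep] = np.linalg.solve(W11[np.ix_(keep, keep)], s12[keep])
            W[rest, j] = W[j, rest] = W11 @ beta      # update w12 = W11 beta (Eq. 17.17)
            betas[j] = beta
        if np.abs(W - W_prev).max() < tol:
            break
    Theta = np.zeros_like(W)
    for j in range(p):                                # recover Theta column by column (Eq. 17.20)
        rest = np.arange(p) != j
        theta_jj = 1.0 / (S[j, j] - W[rest, j] @ betas[j])
        Theta[j, j] = theta_jj
        Theta[rest, j] = Theta[j, rest] = -betas[j] * theta_jj
    return W, Theta
```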

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
A case study: Figure 17.4

[Figure 17.4 from ESL: a simple example graph with its empirical covariance matrix, used to illustrate the algorithm.]

Undirected Graphical Models for Continuous Variables
Estimation of the Graph Structure: the Graphical Lasso

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Graphical lasso

The graphical lasso fits a lasso regression using each variable as the response and the others as predictors. Consider maximizing the penalized log-likelihood, Eq. (17.21),

\log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \|\Theta\|_1,

where \|\Theta\|_1 is the L_1 norm, i.e., the sum of the absolute values of the elements of \Theta. Taking the (sub)gradient as before, we reach the analog of Eq. (17.18), namely Eq. (17.23):

W_{11}\,\beta - s_{12} + \lambda\,\operatorname{Sign}(\beta) = 0.
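
In practice one rarely codes this from scratch; for example, scikit-learn ships a GraphicalLasso estimator. The sketch below (made-up data, arbitrary penalty value) shows the typical call pattern; the alpha parameter plays the role of \lambda above.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
Sigma_true = np.array([[1.0, 0.5, 0.0, 0.0],
                       [0.5, 1.0, 0.4, 0.0],
                       [0.0, 0.4, 1.0, 0.3],
                       [0.0, 0.0, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(4), Sigma_true, size=1000)

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha is the L1 penalty strength
Theta_hat = model.precision_                # sparse estimate of Sigma^{-1}
print(np.round(Theta_hat, 2))
```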

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Cyclical coordinate descent algorithm

Let us rewrite the equation

W_{11}\,\beta - s_{12} + \lambda\,\operatorname{Sign}(\beta) = 0

as a (p-1) \times (p-1) linear system in A, x and b:

A x - b + \lambda\,\operatorname{Sign}(x) = 0.

For i = 1, 2, \ldots, p-1, 1, 2, \ldots, p-1, \ldots, we cyclically update (Eq. 17.26)

x_i \leftarrow \operatorname{St}\!\left( b_i - \sum_{k \ne i} A_{ki} x_k,\; \lambda \right) \big/ A_{ii},

where \operatorname{St}(x, t) is the soft-threshold operator (Eq. 17.27),

\operatorname{St}(x, t) = \operatorname{sign}(x)\,(|x| - t)_{+}.
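
A direct transcription of Eqs. (17.26)-(17.27) as a small coordinate-descent routine; this is a sketch, not an optimized implementation. In the graphical lasso inner loop one would call it with A = W_11 and b = s_12.

```python
import numpy as np

def soft_threshold(x, t):
    """St(x, t) = sign(x) * (|x| - t)_+  (Eq. 17.27)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(A, b, lam, n_sweeps=200, tol=1e-10):
    """Solve A x - b + lam * Sign(x) = 0 by cyclical coordinate descent (Eq. 17.26).
    A is assumed symmetric positive definite (so A_ii > 0)."""
    p = len(b)
    x = np.zeros(p)
    for _ in range(n_sweeps):
        x_old = x.copy()
        for i in range(p):
            r = b[i] - A[i] @ x + A[i, i] * x[i]   # b_i - sum_{k != i} A_ki x_k
            x[i] = soft_threshold(r, lam) / A[i, i]
        if np.abs(x - x_old).max() < tol:
            break
    return x
```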

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Summary: graphical lasso algorithm

[Algorithm box: ESL Algorithm 17.2, the graphical lasso.]

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
A case study: flow-cytometry data

[Figure: graphical lasso estimates for the flow-cytometry protein data at several values of the penalty parameter \lambda (see ESL).]

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Missing or hidden node values: EM

Note that the values at some of the nodes in a graphical model can be unobserved, i.e., missing or hidden. The EM algorithm can be used to impute the missing values.

E step (Eq. 17.43): impute the missing values from the current estimates of \mu and \Sigma,

\hat x_{i, m_i} = E(x_{i, m_i} \mid x_{i, o_i}, \theta)
= \hat\mu_{m_i} + \hat\Sigma_{m_i, o_i}\,\hat\Sigma_{o_i, o_i}^{-1} (x_{i, o_i} - \hat\mu_{o_i}).

M step (Eq. 17.44): re-estimate \mu and \Sigma from the empirical mean and the (modified) covariance of the imputed data,

\hat\mu_j = \frac{1}{N} \sum_{i=1}^{N} \hat x_{ij},
\qquad
\hat\Sigma_{jj'} = \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat x_{ij} - \hat\mu_j)(\hat x_{ij'} - \hat\mu_{j'}) + c_{i,jj'} \right],

where c_{i,jj'} is the conditional covariance of the imputed pair (zero if either value is observed).
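
A sketch of the E step (Eq. 17.43) for a single row, given a boolean mask of observed entries; the function name e_step_impute is illustrative, and the M step and the conditional-covariance correction c are omitted here.

```python
import numpy as np

def e_step_impute(x, observed, mu, Sigma):
    """E step (Eq. 17.43): replace the missing coordinates of one row x with
    their conditional expectation given the observed coordinates.
    `observed` is a boolean mask; missing entries of x may hold anything."""
    miss = ~observed
    if not miss.any():
        return x.copy()
    S_oo = Sigma[np.ix_(observed, observed)]
    S_mo = Sigma[np.ix_(miss, observed)]
    x_imp = x.copy()
    x_imp[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[observed] - mu[observed])
    return x_imp
```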

Undirected Graphical Models for Discrete Variables
Ising models / Boltzmann machines

Pairwise Markov networks with binary variables are called Ising models in statistical mechanics, and Boltzmann machines in machine learning. (In the general Markov-network notation the joint density factorizes over clique potentials \psi_C(x_C); for the Ising model the cliques are the edges.) The joint probabilities of the Ising model are given by Eqs. (17.28)-(17.29):

P(X, \Theta) = \exp\!\left[ \sum_{(j,k)\in E} \theta_{jk} X_j X_k - \Phi(\Theta) \right],
\qquad
\Phi(\Theta) = \log \sum_{x \in \mathcal{X}} \exp\!\left[ \sum_{(j,k)\in E} \theta_{jk} x_j x_k \right],

where \Phi(\Theta) is the log partition function. The Ising model implies a logistic form for each node conditional on the others (Eq. 17.30):

P(X_j = 1 \mid X_{-j} = x_{-j}) = \frac{1}{1 + \exp\!\left( -\theta_{j0} - \sum_{(j,k)\in E} \theta_{jk} x_k \right)},

where X_{-j} denotes all of the nodes except j.
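
Eq. (17.30) in code: a small helper (illustrative names, assuming a symmetric parameter matrix with zero diagonal) that evaluates the logistic node-conditional probability.

```python
import numpy as np

def ising_conditional(j, x, theta, theta0):
    """P(X_j = 1 | X_{-j} = x_{-j}) for a binary pairwise model (Eq. 17.30).
    theta: symmetric p x p matrix of pairwise parameters (zero diagonal assumed);
    theta0: vector of node 'intercept' parameters theta_{j0}."""
    eta = theta0[j] + theta[j] @ x - theta[j, j] * x[j]   # exclude any self term defensively
    return 1.0 / (1.0 + np.exp(-eta))
```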

Undirected Graphical Models for Discrete Variables
Estimation of parameters with known graph structure

Given X, find \Theta. The log-likelihood is Eq. (17.31):

\ell(\Theta) = \sum_{i=1}^{N} \log P_\Theta(X_i = x_i)
= \sum_{i=1}^{N} \left[ \sum_{(j,k)\in E} \theta_{jk}\,x_{ij} x_{ik} - \Phi(\Theta) \right].

The gradient of the log-likelihood is Eqs. (17.32)-(17.34):

\frac{\partial \ell(\Theta)}{\partial \theta_{jk}}
= \sum_{i=1}^{N} x_{ij} x_{ik} - N \sum_{x \in \mathcal{X}} x_j x_k\, p(x, \Theta),

and dividing by N and setting it to zero gives the moment-matching condition

\hat E(X_j X_k) - E_\Theta(X_j X_k) = 0,

where \hat E denotes the expectation under the empirical distribution of the data.
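
The model expectation E_\Theta(X_j X_k) has no closed form in general; for a small graph it can be computed by brute force over all 2^p states, which makes the gradient in Eqs. (17.32)-(17.34) easy to check. The sketch below is illustrative only, with the intercept terms \theta_{j0} omitted for simplicity.

```python
import numpy as np
from itertools import product

def ising_moment(theta, j, k):
    """E_Theta(X_j X_k) for a small binary pairwise model by enumerating all
    2^p states (feasible only for small p). theta is symmetric with zero diagonal."""
    p = theta.shape[0]
    states = np.array(list(product([0, 1], repeat=p)))
    # 0.5 * x^T theta x = sum_{j<k} theta_jk x_j x_k for each enumerated state
    energies = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    probs = np.exp(energies - energies.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return np.sum(probs * states[:, j] * states[:, k])

# Gradient per Eqs. (17.32)-(17.34): d l / d theta_jk = sum_i x_ij x_ik - N * E_Theta(X_j X_k)
theta = np.zeros((3, 3))
theta[0, 1] = theta[1, 0] = 1.2                    # a single strong edge (illustrative)
print(ising_moment(theta, 0, 1))
```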

References

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

The End