Chapter 17: Undirected Graphical Models


The Elements of Statistical Learning, Chapter 17
Biaobin Jiang, Department of Biological Sciences, Purdue University (bjiang@purdue.edu)
October 30, 2014

Overview

1. Introduction
   - Probabilistic Graphical Models
   - Review: Multivariate Statistics
   - Review: Matrix Operations
2. Undirected Graphical Models for Continuous Variables
   - Connection with Multiple Linear Regression
   - Estimation of Parameters with Known Structure
   - Estimation of Graph Structure
3. Undirected Graphical Models for Discrete Variables

Introduction / Probabilistic Graphical Models
What are probabilistic graphical models?

A graph consists of a set of vertices (nodes), along with a set of edges joining some pairs of the vertices. In graphical models, each vertex represents a random variable, and the graph gives a visual way of understanding the joint distribution of the entire set of random variables.

Introduction / Probabilistic Graphical Models
How it works

Categories of PGMs:
- Directed graphical models, a.k.a. Bayesian networks
- Undirected graphical models, a.k.a. Markov random fields

Computational tasks for PGMs:
- Structuring: choosing the structure of the graph;
- Learning: estimating the edge parameters from data; and
- Inference: computing marginal vertex probabilities and expectations from the joint distribution.

Introduction / Review: Multivariate Statistics
The multivariate normal (MVN) distribution

The MVN distribution is a generalization of the univariate normal distribution, whose density function (p.d.f.) is

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\},

where \mu is the mean and \sigma^2 the variance of the distribution. In p dimensions the density becomes

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},

where \mu is a p-dimensional mean vector and \Sigma is a symmetric covariance matrix.
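
To make the formula concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the p-dimensional density; the function name mvn_pdf and the test values are purely illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N_p(mu, Sigma) at x, following the formula above."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Illustrative values only.
x = np.array([0.5, -1.0])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(x, mu, Sigma))
```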

Introduction / Review: Multivariate Statistics
Conditional distribution of the MVN

Let X = (X_1, X_2) be a partitioned MVN random p-vector, with mean

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
\quad\text{and covariance matrix}\quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.

The conditional distribution of X_2 given X_1 = x_1 is MVN with

E(X_2 \mid X_1 = x_1) = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1),
\qquad
Cov(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.
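
A small sketch of these conditioning formulas, assuming NumPy; the helper mvn_conditional and the example numbers are made up for illustration.

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, x1):
    """Mean and covariance of X2 | X1 = x1 for a partitioned MVN,
    using the formulas above. idx1/idx2 are integer index arrays for the two blocks."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    cond_mean = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
    return cond_mean, cond_cov

# Condition X_2 = (X_1, X_2 coordinates 1 and 2) on X_0 = 1.5 (illustrative values).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
m, C = mvn_conditional(mu, Sigma, np.array([0]), np.array([1, 2]), np.array([1.5]))
print(m, C, sep="\n")
```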

Introduction / Review: Matrix Operations
Matrix trace

In linear algebra, the trace of an n-by-n square matrix A is defined to be the sum of the elements on its main diagonal:

\operatorname{tr}(A) = a_{11} + a_{22} + \cdots + a_{nn} = \sum_{i=1}^{n} a_{ii}.

The matrix trace has several basic properties:

\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B),
\qquad
\operatorname{tr}(AB) = \operatorname{tr}(BA).
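
Both properties are easy to spot-check numerically; the snippet below is an illustration only, using random 4x4 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # linearity
assert np.isclose(np.trace(A @ B), np.trace(B @ A))            # cyclic property
```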

Undirected Graphical Models for Continuous Variables
Estimation of Parameters with Known Graph Structure

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
What is parameter estimation?

Given the empirical covariance matrix S, find the estimate \hat\Sigma = W and its inverse \hat\Sigma^{-1} = \Theta. In particular, if the ij-th component of \Theta is zero, then variables i and j are conditionally independent, given the other variables. In other words, there is no edge between vertices i and j.

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
Conditional mean and multiple linear regression

Suppose we partition X = (Z, Y), where Z = (X_1, \ldots, X_{p-1}) and Y = X_p. Then the conditional distribution of Y given Z is (Eq. 17.6)

Y \mid Z = z \;\sim\; N\!\left( \mu_Y + (z - \mu_Z)^T \Sigma_{ZZ}^{-1} \sigma_{ZY},\;\; \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} \right),

where we have partitioned \Sigma as (Eq. 17.7)

\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \sigma_{ZY} \\ \sigma_{ZY}^T & \sigma_{YY} \end{pmatrix}.

The conditional mean in (17.6) has exactly the same form as the population multiple linear regression of Y on Z, with regression coefficient \beta = \Sigma_{ZZ}^{-1} \sigma_{ZY}. (Proof on the next slide.)

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
Proof

Assume, without loss of generality, that Z and Y have zero means. By Eq. (2.9), the expected prediction error is

EPE(f) = E(y - f(z))^2 = E(y - z^T\beta)^2 = E\left[ y^2 - 2 y z^T \beta + \beta^T z z^T \beta \right].

Differentiating with respect to \beta and setting the derivative to zero,

\frac{d\,EPE(f)}{d\beta} = E\left[ -2 y z + 2 z z^T \beta \right] = 0,

we obtain

\beta = E(z z^T)^{-1} E(y z) = \Sigma_{ZZ}^{-1} \sigma_{ZY}.

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
How to solve for the inverse \Theta

The standard formulas for partitioned inverses give \Sigma\Theta = I, i.e.,

\begin{pmatrix} \Sigma_{ZZ} & \sigma_{ZY} \\ \sigma_{ZY}^T & \sigma_{YY} \end{pmatrix}
\begin{pmatrix} \Theta_{ZZ} & \theta_{ZY} \\ \theta_{ZY}^T & \theta_{YY} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix}.

From the last column we obtain

\Sigma_{ZZ}\,\theta_{ZY} + \sigma_{ZY}\,\theta_{YY} = 0,
\qquad
\sigma_{ZY}^T\,\theta_{ZY} + \sigma_{YY}\,\theta_{YY} = 1.

Solving these two equations gives Eq. (17.8),

\theta_{ZY} = -\theta_{YY}\,\Sigma_{ZZ}^{-1}\sigma_{ZY},
\quad\text{where}\quad
1/\theta_{YY} = \sigma_{YY} - \sigma_{ZY}^T \Sigma_{ZZ}^{-1} \sigma_{ZY} > 0.

Hence we have Eq. (17.9),

\beta = \Sigma_{ZZ}^{-1} \sigma_{ZY} = -\theta_{ZY}/\theta_{YY}.
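
A quick numerical sanity check (not in the slides) of Eqs. (17.8)-(17.9): the regression coefficient computed from the covariance partition should equal -theta_ZY/theta_YY computed from the precision matrix. The random covariance here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
M = rng.normal(size=(p, p))
Sigma = M @ M.T + p * np.eye(p)          # a random positive-definite covariance
Theta = np.linalg.inv(Sigma)

# Treat the last variable as Y and the rest as Z.
Szz, szy = Sigma[:-1, :-1], Sigma[:-1, -1]
beta_from_cov = np.linalg.solve(Szz, szy)            # beta = Sigma_ZZ^{-1} sigma_ZY
beta_from_prec = -Theta[:-1, -1] / Theta[-1, -1]     # beta = -theta_ZY / theta_YY (Eq. 17.9)
print(np.allclose(beta_from_cov, beta_from_prec))    # True
```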

Undirected Graphical Models for Continuous Variables / Connection with Multiple Linear Regression
What we have learned

The dependence of Y on Z in (17.6) is in the mean term alone. Here we see exactly that zero elements in \beta, and hence in \theta_{ZY}, mean that the corresponding elements of Z are conditionally independent of Y, given the rest. We can therefore learn about this dependence structure through multiple linear regression.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Maximum likelihood estimation for the MVN

Let X^T = (x_1, \ldots, x_N) be sampled from N_p(\mu, \Sigma). The MLEs of \mu and \Sigma are the sample mean and the empirical covariance matrix (Eq. 17.10):

\hat\mu = \bar x = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\hat\Sigma = S = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar x)(x_i - \bar x)^T.

The likelihood is a function of the parameters \mu and \Sigma given the data X:

L(\mu, \Sigma \mid X) = \prod_{i=1}^{N} f(x_i \mid \mu, \Sigma)
= (2\pi)^{-Np/2}\,|\Sigma|^{-N/2} \exp\left\{ -\tfrac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right\}.
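
For concreteness, a short sketch (assuming NumPy, with made-up parameter values) computing the MLEs in Eq. (17.10) from simulated data; note the 1/N divisor rather than 1/(N-1).

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0, 0.5])
Sigma_true = np.array([[1.0, 0.4, 0.0],
                       [0.4, 2.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)   # N x p data matrix

mu_hat = X.mean(axis=0)                              # sample mean (Eq. 17.10)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]       # MLE covariance, divisor 1/N
print(mu_hat, Sigma_hat, sep="\n")
```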

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Log-likelihood

Minus twice the log-likelihood of the data can be written, up to a constant, as

-2 \log L(\mu, \Sigma \mid X) = N \log|\Sigma| + \sum_{i=1}^{N} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) + C.

Dividing by N, setting \mu = \bar x and writing \Theta = \Sigma^{-1}, maximizing the likelihood is equivalent to maximizing Eq. (17.11),

\ell(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta),

since \log\det\Theta = -\log|\Sigma| and

\operatorname{tr}(S\Theta)
= \operatorname{tr}\!\left( \tfrac{1}{N} \sum_i (x_i - \bar x)(x_i - \bar x)^T \,\Theta \right)
= \tfrac{1}{N} \sum_i \operatorname{tr}\!\left( (x_i - \bar x)^T \Theta (x_i - \bar x) \right)
= \tfrac{1}{N} \sum_i (x_i - \bar x)^T \Sigma^{-1} (x_i - \bar x).
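
The trace identity used above is easy to verify numerically; the throwaway check below (illustrative values only) confirms that tr(SΘ) equals the averaged quadratic form and that log det Θ = -log|Σ|.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
xbar = X.mean(axis=0)
C = X - xbar
S = C.T @ C / X.shape[0]                      # empirical covariance (1/N)

M = rng.normal(size=(4, 4))
Sigma = M @ M.T + np.eye(4)                   # an arbitrary positive-definite Sigma
Theta = np.linalg.inv(Sigma)

lhs = np.trace(S @ Theta)
rhs = np.mean([c @ Theta @ c for c in C])     # (1/N) sum_i (x_i - xbar)^T Theta (x_i - xbar)
print(np.isclose(lhs, rhs))                                                        # True
print(np.isclose(np.log(np.linalg.det(Theta)), -np.log(np.linalg.det(Sigma))))     # True
```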

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Missing edges: equality constraints

Now we would like to maximize the log-likelihood under the constraint that some pre-defined subset of the parameters is zero:

\text{maximize}_{\Theta}\;\; \ell_C(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta)
\quad\text{subject to}\quad \theta_{jk} = 0 \;\;\text{for all } (j,k) \notin E.

Adding Lagrange multipliers, we obtain Eq. (17.12),

\text{maximize}_{\Theta}\;\; \ell_C(\Theta) = \log\det\Theta - \operatorname{tr}(S\Theta) - \sum_{(j,k)\notin E} \gamma_{jk}\,\theta_{jk}.

Setting the derivative to zero gives Eq. (17.13),

\Theta^{-1} - S - \Gamma = 0,

where \Gamma is a matrix of Lagrange parameters with nonzero values for all missing edges.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Solving (17.13) by multiple linear regression

Step 1: Partition W; the upper-right block of (17.13) gives Eq. (17.14):

w_{12} - s_{12} - \gamma_{12} = 0.

Step 2: Connect w_{12} with \beta. As in Eq. (17.16),

\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}
\begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix},

which implies Eq. (17.17):

w_{12} = -W_{11}\,\theta_{12}/\theta_{22} = W_{11}\,\beta.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Solving (17.13) by multiple linear regression (cont.)

Step 3: Use simple subset regression to solve Eq. (17.18),

W_{11}\,\beta - s_{12} - \gamma_{12} = 0.

If \gamma_j \ne 0 (i.e., the corresponding edge is missing), we remove the j-th row and column and the j-th element of \beta, giving the reduced system of equations, Eq. (17.19),

W_{11}^{*}\,\beta^{*} - s_{12}^{*} = 0.

Step 4: Update \theta_{22} and \theta_{12} (Eq. 17.20):

1/\theta_{22} = s_{22} - w_{12}^T \hat\beta,
\qquad
\theta_{12} = -\hat\beta\,\theta_{22}.

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
Summary: Algorithm 17.1

[Algorithm box: ESL Algorithm 17.1, "A Modified Regression Algorithm for Estimation of an Undirected Gaussian Graphical Model with Known Structure".]
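
Below is a minimal, non-authoritative sketch of the modified-regression procedure described on the preceding slides (cf. Algorithm 17.1). The function name ggm_known_structure, the convergence test, and the boolean edge-matrix encoding are choices made for this illustration, not prescriptions from the book.

```python
import numpy as np

def ggm_known_structure(S, edges, max_cycles=100, tol=1e-8):
    """Sketch of the modified regression algorithm (cf. ESL Algorithm 17.1).
    S: empirical covariance (p x p); edges: boolean p x p adjacency matrix with
    a False diagonal. Returns (W, Theta) with W = Sigma-hat and Theta = W^{-1}."""
    p = S.shape[0]
    W = S.copy()
    betas = [np.zeros(p - 1) for _ in range(p)]
    for _ in range(max_cycles):
        W_prev = W.copy()
        for j in range(p):
            rest = np.arange(p) != j
            W11, s12 = W[np.ix_(rest, rest)], S[rest, j]
            keep = edges[rest, j]                     # edges present between j and the rest
            beta = np.zeros(p - 1)
            if keep.any():
                # reduced system W11* beta* = s12* (Eq. 17.19)
                beta[keep] = np.linalg.solve(W11[np.ix_(keep, keep)], s12[keep])
            W[rest, j] = W[j, rest] = W11 @ beta      # update w12 = W11 beta (Eq. 17.17)
            betas[j] = beta
        if np.abs(W - W_prev).max() < tol:
            break
    Theta = np.zeros_like(W)
    for j in range(p):                                # recover Theta column by column (Eq. 17.20)
        rest = np.arange(p) != j
        theta_jj = 1.0 / (S[j, j] - W[rest, j] @ betas[j])
        Theta[j, j] = theta_jj
        Theta[rest, j] = Theta[j, rest] = -betas[j] * theta_jj
    return W, Theta
```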

Undirected Graphical Models for Continuous Variables / Estimation of Parameters with Known Structure
A case study: Figure 17.4

[Figure 17.4 from ESL: a simple example graph with its empirical covariance matrix, used to illustrate the algorithm.]

Undirected Graphical Models for Continuous Variables
Estimation of the Graph Structure: the Graphical Lasso

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Graphical lasso

The graphical lasso fits a lasso regression using each variable as the response and the others as predictors. Consider maximizing the penalized log-likelihood, Eq. (17.21),

\log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \|\Theta\|_1,

where \|\Theta\|_1 is the L_1 norm, i.e., the sum of the absolute values of the elements of \Theta. Taking the (sub)gradient as before, we reach the analog of Eq. (17.18), namely Eq. (17.23):

W_{11}\,\beta - s_{12} + \lambda\,\operatorname{Sign}(\beta) = 0.
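
In practice one rarely codes this from scratch; for example, scikit-learn ships a GraphicalLasso estimator. The sketch below (made-up data, arbitrary penalty value) shows the typical call pattern; the alpha parameter plays the role of \lambda above.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
Sigma_true = np.array([[1.0, 0.5, 0.0, 0.0],
                       [0.5, 1.0, 0.4, 0.0],
                       [0.0, 0.4, 1.0, 0.3],
                       [0.0, 0.0, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(4), Sigma_true, size=1000)

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha is the L1 penalty strength
Theta_hat = model.precision_                # sparse estimate of Sigma^{-1}
print(np.round(Theta_hat, 2))
```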

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Cyclical coordinate descent algorithm

Let us rewrite the equation

W_{11}\,\beta - s_{12} + \lambda\,\operatorname{Sign}(\beta) = 0

as a (p-1) \times (p-1) linear system in A, x and b:

A x - b + \lambda\,\operatorname{Sign}(x) = 0.

For i = 1, 2, \ldots, p-1, 1, 2, \ldots, p-1, \ldots, we cyclically update (Eq. 17.26)

x_i \leftarrow \operatorname{St}\!\left( b_i - \sum_{k \ne i} A_{ki} x_k,\; \lambda \right) \big/ A_{ii},

where \operatorname{St}(x, t) is the soft-threshold operator (Eq. 17.27),

\operatorname{St}(x, t) = \operatorname{sign}(x)\,(|x| - t)_{+}.
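
A direct transcription of Eqs. (17.26)-(17.27) as a small coordinate-descent routine; this is a sketch, not an optimized implementation. In the graphical lasso inner loop one would call it with A = W_11 and b = s_12.

```python
import numpy as np

def soft_threshold(x, t):
    """St(x, t) = sign(x) * (|x| - t)_+  (Eq. 17.27)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(A, b, lam, n_sweeps=200, tol=1e-10):
    """Solve A x - b + lam * Sign(x) = 0 by cyclical coordinate descent (Eq. 17.26).
    A is assumed symmetric positive definite (so A_ii > 0)."""
    p = len(b)
    x = np.zeros(p)
    for _ in range(n_sweeps):
        x_old = x.copy()
        for i in range(p):
            r = b[i] - A[i] @ x + A[i, i] * x[i]   # b_i - sum_{k != i} A_ki x_k
            x[i] = soft_threshold(r, lam) / A[i, i]
        if np.abs(x - x_old).max() < tol:
            break
    return x
```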

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Summary: graphical lasso algorithm

[Algorithm box: ESL Algorithm 17.2, the graphical lasso.]

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
A case study: flow-cytometry data

[Figure: graphical lasso estimates for the flow-cytometry protein data at several values of the penalty parameter \lambda (see ESL).]

Undirected Graphical Models for Continuous Variables / Estimation of Graph Structure
Missing or hidden node values: EM

Note that the values at some of the nodes in a graphical model can be unobserved, i.e., missing or hidden. The EM algorithm can be used to impute the missing values.

E step (Eq. 17.43): impute the missing values from the current estimates of \mu and \Sigma,

\hat x_{i, m_i} = E(x_{i, m_i} \mid x_{i, o_i}, \theta)
= \hat\mu_{m_i} + \hat\Sigma_{m_i, o_i}\,\hat\Sigma_{o_i, o_i}^{-1} (x_{i, o_i} - \hat\mu_{o_i}).

M step (Eq. 17.44): re-estimate \mu and \Sigma from the empirical mean and the (modified) covariance of the imputed data,

\hat\mu_j = \frac{1}{N} \sum_{i=1}^{N} \hat x_{ij},
\qquad
\hat\Sigma_{jj'} = \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat x_{ij} - \hat\mu_j)(\hat x_{ij'} - \hat\mu_{j'}) + c_{i,jj'} \right],

where c_{i,jj'} is the conditional covariance of the imputed pair (zero if either value is observed).
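
A sketch of the E step (Eq. 17.43) for a single row, given a boolean mask of observed entries; the function name e_step_impute is illustrative, and the M step and the conditional-covariance correction c are omitted here.

```python
import numpy as np

def e_step_impute(x, observed, mu, Sigma):
    """E step (Eq. 17.43): replace the missing coordinates of one row x with
    their conditional expectation given the observed coordinates.
    `observed` is a boolean mask; missing entries of x may hold anything."""
    miss = ~observed
    if not miss.any():
        return x.copy()
    S_oo = Sigma[np.ix_(observed, observed)]
    S_mo = Sigma[np.ix_(miss, observed)]
    x_imp = x.copy()
    x_imp[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[observed] - mu[observed])
    return x_imp
```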

Undirected Graphical Models for Discrete Variables
Ising models / Boltzmann machines

Pairwise Markov networks with binary variables are called Ising models in statistical mechanics, and Boltzmann machines in machine learning. (In the general Markov-network notation the joint density factorizes over clique potentials \psi_C(x_C); for the Ising model the cliques are the edges.) The joint probabilities of the Ising model are given by Eqs. (17.28)-(17.29):

P(X, \Theta) = \exp\!\left[ \sum_{(j,k)\in E} \theta_{jk} X_j X_k - \Phi(\Theta) \right],
\qquad
\Phi(\Theta) = \log \sum_{x \in \mathcal{X}} \exp\!\left[ \sum_{(j,k)\in E} \theta_{jk} x_j x_k \right],

where \Phi(\Theta) is the log partition function. The Ising model implies a logistic form for each node conditional on the others (Eq. 17.30):

P(X_j = 1 \mid X_{-j} = x_{-j}) = \frac{1}{1 + \exp\!\left( -\theta_{j0} - \sum_{(j,k)\in E} \theta_{jk} x_k \right)},

where X_{-j} denotes all of the nodes except j.
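
Eq. (17.30) in code: a small helper (illustrative names, assuming a symmetric parameter matrix with zero diagonal) that evaluates the logistic node-conditional probability.

```python
import numpy as np

def ising_conditional(j, x, theta, theta0):
    """P(X_j = 1 | X_{-j} = x_{-j}) for a binary pairwise model (Eq. 17.30).
    theta: symmetric p x p matrix of pairwise parameters (zero diagonal assumed);
    theta0: vector of node 'intercept' parameters theta_{j0}."""
    eta = theta0[j] + theta[j] @ x - theta[j, j] * x[j]   # exclude any self term defensively
    return 1.0 / (1.0 + np.exp(-eta))
```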

Undirected Graphical Models for Discrete Variables
Estimation of parameters with known graph structure

Given X, find \Theta. The log-likelihood is Eq. (17.31):

\ell(\Theta) = \sum_{i=1}^{N} \log P_\Theta(X_i = x_i)
= \sum_{i=1}^{N} \left[ \sum_{(j,k)\in E} \theta_{jk}\,x_{ij} x_{ik} - \Phi(\Theta) \right].

The gradient of the log-likelihood is Eqs. (17.32)-(17.34):

\frac{\partial \ell(\Theta)}{\partial \theta_{jk}}
= \sum_{i=1}^{N} x_{ij} x_{ik} - N \sum_{x \in \mathcal{X}} x_j x_k\, p(x, \Theta),

and dividing by N and setting it to zero gives the moment-matching condition

\hat E(X_j X_k) - E_\Theta(X_j X_k) = 0,

where \hat E denotes the expectation under the empirical distribution of the data.
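
The model expectation E_\Theta(X_j X_k) has no closed form in general; for a small graph it can be computed by brute force over all 2^p states, which makes the gradient in Eqs. (17.32)-(17.34) easy to check. The sketch below is illustrative only, with the intercept terms \theta_{j0} omitted for simplicity.

```python
import numpy as np
from itertools import product

def ising_moment(theta, j, k):
    """E_Theta(X_j X_k) for a small binary pairwise model by enumerating all
    2^p states (feasible only for small p). theta is symmetric with zero diagonal."""
    p = theta.shape[0]
    states = np.array(list(product([0, 1], repeat=p)))
    # 0.5 * x^T theta x = sum_{j<k} theta_jk x_j x_k for each enumerated state
    energies = 0.5 * np.einsum('ij,jk,ik->i', states, theta, states)
    probs = np.exp(energies - energies.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return np.sum(probs * states[:, j] * states[:, k])

# Gradient per Eqs. (17.32)-(17.34): d l / d theta_jk = sum_i x_ij x_ik - N * E_Theta(X_j X_k)
theta = np.zeros((3, 3))
theta[0, 1] = theta[1, 0] = 1.2                    # a single strong edge (illustrative)
print(ising_moment(theta, 0, 1))
```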

References

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

The End