Rapid Introduction to Machine Learning / Deep Learning
Hyeong In Choi
Seoul National University
Lecture 5b: Markov random field (MRF)
November 13, 2015
Table of contents
1. Objectives of Lecture 5b
2. Markov random field (MRF)
   2.1. Basics of MRF
   2.2. Boltzmann machine
1. Objectives of Lecture 5b

Objective 1: Learn the minimal MRF formalism necessary for understanding deep neural network pretraining with restricted Boltzmann machines.

Objective 2: Learn how the probability structure is encoded in an MRF, especially the energy-based formalism of the Boltzmann machine.

Objective 3: Learn some basic formalism of the restricted Boltzmann machine.
2. Markov random field (MRF)
2.1. Basics of MRF

Terminology
G: undirected graph (not necessarily a tree) in which each node represents a random variable.
Let X_i be the random variable represented by node i, and let x_i be the value of X_i [we frequently conflate node i with x_i].
x = (x_1, ..., x_n): the list of the values of all random variables.
The joint probability is denoted by P(x) = P(x_1, ..., x_n).
For each node x_i, let N(x_i) be the neighborhood of x_i, i.e. N(x_i) is the set of nodes connected to x_i.
Definition of MRF
We say P(x) satisfies the Markov property if
P(X_i = x_i | X_j = x_j for all j ≠ i) = P(X_i = x_i | X_j = x_j for all x_j ∈ N(x_i)).
G together with a P(x) satisfying the Markov property is called a Markov random field (MRF).

Proposition
Let G be an MRF. Let A, B, C be mutually disjoint sets of nodes of G. Assume A and B are separated by C, meaning that every path from a node in A to a node in B passes through some node in C. Then
P(A, B | C) = P(A | C) P(B | C),
i.e. A and B are conditionally independent given C. (The converse direction is immediate: the Markov property is the special case A = {x_i}, C = N(x_i), and B = the remaining nodes.)

Example: [graph figure illustrating separation omitted]
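A tiny numerical sketch (not from the original slides; the chain graph and potentials below are made up for illustration) that checks the proposition by brute force. On a 3-node chain x1 - x2 - x3, the end nodes are separated by the middle node, and enumeration confirms P(x1, x3 | x2) = P(x1 | x2) P(x3 | x2).

    import itertools
    import numpy as np

    # Chain graph x1 -- x2 -- x3; hypothetical pairwise potentials, chosen arbitrarily.
    def psi12(a, b): return np.exp(0.8 * a * b)    # potential on clique {x1, x2}
    def psi23(b, c): return np.exp(-0.5 * b * c)   # potential on clique {x2, x3}

    states = [-1, 1]
    # Unnormalized joint probability as a product of clique potentials.
    joint = {(a, b, c): psi12(a, b) * psi23(b, c)
             for a, b, c in itertools.product(states, repeat=3)}
    Z = sum(joint.values())

    def prob(cond):
        # Probability of a partial assignment, e.g. {0: +1, 1: -1} fixes x1 and x2.
        return sum(v for k, v in joint.items()
                   if all(k[i] == s for i, s in cond.items())) / Z

    # Check P(x1, x3 | x2) = P(x1 | x2) * P(x3 | x2) for every assignment.
    for a, b, c in itertools.product(states, repeat=3):
        lhs = prob({0: a, 1: b, 2: c}) / prob({1: b})
        rhs = (prob({0: a, 1: b}) / prob({1: b})) * (prob({1: b, 2: c}) / prob({1: b}))
        assert np.isclose(lhs, rhs)
    print("x1 and x3 are conditionally independent given x2")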
Gibbs distributions
Definition
A clique is a set of nodes each of which is connected to every other node in the set.
A probability distribution P(x) is called a Gibbs distribution if it is of the form
P(x) = ∏_{c ∈ C} ψ_c(x_c),
where C is the set of maximal cliques, ψ_c is a non-negative function, and x_c is the list of variables in the clique c.
Example
Maximal cliques: c_1 = {x_1, x_2, x_3}, c_2 = {x_2, x_3, x_4}, c_3 = {x_3, x_5}
P(x) = ψ_1(x_1, x_2, x_3) ψ_2(x_2, x_3, x_4) ψ_3(x_3, x_5)

Theorem (Hammersley-Clifford)
Assume P(x) > 0 for all x. If P(x) is a Gibbs distribution, then G is an MRF; conversely, if G with P(x) is an MRF, then P(x) is a Gibbs distribution.
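A minimal sketch of the factorization in the example above (the clique potentials are hypothetical choices, so the product is normalized explicitly here):

    import itertools
    import numpy as np

    # Hypothetical non-negative potentials on the maximal cliques of the example:
    # c1 = {x1, x2, x3}, c2 = {x2, x3, x4}, c3 = {x3, x5}.  Values are made up.
    def psi1(x1, x2, x3): return np.exp(0.3 * x1 * x2 + 0.2 * x2 * x3)
    def psi2(x2, x3, x4): return np.exp(-0.4 * x3 * x4 + 0.1 * x2)
    def psi3(x3, x5): return np.exp(0.7 * x3 * x5)

    states = [-1, 1]
    unnormalized = {}
    for x in itertools.product(states, repeat=5):
        x1, x2, x3, x4, x5 = x
        unnormalized[x] = psi1(x1, x2, x3) * psi2(x2, x3, x4) * psi3(x3, x5)

    Z = sum(unnormalized.values())             # normalizing constant
    P = {x: v / Z for x, v in unnormalized.items()}
    print(sum(P.values()))                     # 1.0 up to rounding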
2.2. Boltzmann machine

G: a graph
x_i ∈ {-1, 1} or x_i ∈ {0, 1}
E: Energy
E(x) = -∑_{i~j} w_{ij} x_i x_j - ∑_i b_i x_i,
where i~j means the sum runs over adjacent nodes i and j with i < j.

P: Probability
P(x) = (1/Z) exp(-λ E(x)),
where Z is the partition function
Z = ∑_x exp(-λ E(x)).
[We usually set λ = 1.]
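A small sketch of this energy-based probability (λ = 1, ±1 units, and an arbitrary made-up 4-node graph): enumerate all configurations, compute E(x), and normalize by the partition function Z.

    import itertools
    import numpy as np

    # Hypothetical symmetric weights for a 4-node graph (w[i, j] = 0 when i, j are not adjacent).
    w = np.array([[0.0,  1.0,  0.0,  0.5],
                  [1.0,  0.0, -0.8,  0.0],
                  [0.0, -0.8,  0.0,  0.3],
                  [0.5,  0.0,  0.3,  0.0]])
    b = np.array([0.1, -0.2, 0.0, 0.4])

    def energy(x):
        x = np.asarray(x, dtype=float)
        # E(x) = -sum_{i<j} w_ij x_i x_j - sum_i b_i x_i; the 0.5 undoes double counting.
        return -0.5 * x @ w @ x - b @ x

    configs = list(itertools.product([-1, 1], repeat=4))
    Z = sum(np.exp(-energy(x)) for x in configs)          # partition function (lambda = 1)
    P = {x: np.exp(-energy(x)) / Z for x in configs}
    print(max(P, key=P.get))    # the lowest-energy configuration has the highest probability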
Restricted Boltzmann machine (RBM): a Boltzmann machine whose graph is bipartite, with visible units x and hidden units h.

Notation
x = (x_1, ..., x_d): the visible units
h = (h_1, ..., h_n): the hidden units
(x, h) = (x_1, ..., x_d, h_1, ..., h_n)
Energy
E(x, h) = -∑_{i,j} w_{ij} h_i x_j - ∑_j b_j x_j - ∑_i c_i h_i

Probability
P(x, h) = (1/Z) exp(-E(x, h)),
where
Z = ∑_{x,h} exp(-E(x, h)).

Note
The lower the energy, the higher the probability.
If w_{ij} > 0, it is more likely that x_j and h_i have the same sign.
If w_{ij} < 0, it is more likely that x_j and h_i have opposite signs.
If b_j > 0, it is more likely that x_j > 0, and so on.
Probabilities of RBM
Write
E(x, h) = -h^T W x - b^T x - c^T h,
where W = (w_{ij}), h = [h_1, ..., h_n]^T, x = [x_1, ..., x_d]^T.

P(x, h) = (1/Z) exp(h^T W x + b^T x + c^T h)
P(x) = ∑_h P(x, h)
P(h) = ∑_x P(x, h)
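A brute-force sketch of these formulas (toy sizes d = 3, n = 2, binary 0/1 units, randomly drawn W, b, c): compute P(x, h) and the marginals P(x), P(h) by direct enumeration.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 2                              # toy numbers of visible and hidden units
    W = rng.normal(size=(n, d))              # hypothetical weights; W[i] is the i-th row W_i
    b = rng.normal(size=d)                   # visible biases
    c = rng.normal(size=n)                   # hidden biases

    def energy(x, h):
        return -(h @ W @ x + b @ x + c @ h)  # E(x, h) = -h^T W x - b^T x - c^T h

    xs = [np.array(x) for x in itertools.product([0, 1], repeat=d)]
    hs = [np.array(h) for h in itertools.product([0, 1], repeat=n)]
    Z = sum(np.exp(-energy(x, h)) for x in xs for h in hs)

    def P_joint(x, h): return np.exp(-energy(x, h)) / Z
    def P_x(x): return sum(P_joint(x, h) for h in hs)     # P(x) = sum_h P(x, h)
    def P_h(h): return sum(P_joint(x, h) for x in xs)     # P(h) = sum_x P(x, h)

    print(sum(P_x(x) for x in xs), sum(P_h(h) for h in hs))   # both should print 1.0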
P(h | x)
Given x, any two hidden nodes h_i and h_j (i ≠ j) are separated by the visible nodes (every path between them passes through some x_k), hence conditionally independent. Thus
P(h | x) = P(h_1, ..., h_n | x) = ∏_i P(h_i | x)
Remark
This fact can also be proved directly, as follows. Let W_i denote the i-th row of W. Then
h^T W x = ∑_i h_i W_i x.
Thus
P(h | x) = exp(h^T W x + b^T x + c^T h) / ∑_{h'} exp(h'^T W x + b^T x + c^T h')
         = ∏_i exp(h_i W_i x + c_i h_i) / ∑_{h'_1, ..., h'_n} ∏_i exp(h'_i W_i x + c_i h'_i)
         = ∏_i exp(h_i W_i x + c_i h_i) / ∏_i ∑_{h'_i} exp(h'_i W_i x + c_i h'_i)
         = ∏_i [ exp(h_i W_i x + c_i h_i) / ∑_{h'_i} exp(h'_i W_i x + c_i h'_i) ].
Multiplying the numerator and denominator of the i-th factor by (e^{b^T x} / Z) ∏_{j ≠ i} ∑_{h_j} exp(h_j W_j x + c_j h_j) turns the numerator into P(x, h_i) and the denominator into ∑_{h_i} P(x, h_i) = P(x), so each factor equals P(x, h_i) / P(x) = P(h_i | x). Hence
P(h | x) = ∏_i P(h_i | x)
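A numerical check of this factorization (with hypothetical toy parameters and binary 0/1 hidden units): for a fixed x, compare P(h | x) computed by enumerating over all h with the product of per-unit conditionals.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 2
    W = rng.normal(size=(n, d))              # hypothetical toy parameters
    b = rng.normal(size=d)
    c = rng.normal(size=n)

    def unnorm(x, h):
        # exp(-E(x, h)) = exp(h^T W x + b^T x + c^T h), i.e. Z * P(x, h)
        return np.exp(h @ W @ x + b @ x + c @ h)

    hs = [np.array(h) for h in itertools.product([0, 1], repeat=n)]
    x = np.array([1, 0, 1])                  # any fixed visible vector

    a = W @ x + c                            # per-unit activations W_i x + c_i
    for h in hs:
        # Left side: P(h | x) with the normalizer obtained by summing over all h'.
        lhs = unnorm(x, h) / sum(unnorm(x, hp) for hp in hs)
        # Right side: product over units of exp(h_i W_i x + c_i h_i) / sum over h_i'.
        rhs = np.prod([np.exp(h[i] * a[i]) / sum(np.exp(v * a[i]) for v in (0, 1))
                       for i in range(n)])
        assert np.isclose(lhs, rhs)
    print("P(h | x) factorizes over the hidden units")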
Special case: binary neurons
Assume x_j ∈ {0, 1} and h_i ∈ {0, 1}. Then
P(h_i | x) = exp(h_i W_i x + c_i h_i) / ∑_{h_i ∈ {0,1}} exp(h_i W_i x + c_i h_i)
Thus
P(h_i = 1 | x) = exp(W_i x + c_i) / (1 + exp(W_i x + c_i)) = sigm(W_i x + c_i).
By symmetry,
P(x_j = 1 | h) = sigm(W_{·j}^T h + b_j),
where W_{·j} denotes the j-th column of W. Now
P(x) = ∑_h P(x, h)
     = (e^{b^T x} / Z) ∏_i ∑_{h_i} exp(h_i W_i x + c_i h_i)
     = (e^{b^T x} / Z) ∏_i [1 + exp(W_i x + c_i)]
     = (e^{b^T x} / Z) exp( ∑_i log(1 + exp(W_i x + c_i)) )
     = (1/Z) exp( b^T x + ∑_i log(1 + exp(W_i x + c_i)) )
     = (1/Z) exp( b^T x + ∑_i softplus(W_i x + c_i) ),
where softplus(t) = log(1 + e^t).
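A sketch verifying these closed forms (toy parameters drawn at random, binary 0/1 units): P(h_i = 1 | x) = sigm(W_i x + c_i), and the unnormalized log-probability b^T x + ∑_i softplus(W_i x + c_i) agrees exactly with log ∑_h exp(-E(x, h)).

    import itertools
    import numpy as np

    def sigm(t): return 1.0 / (1.0 + np.exp(-t))
    def softplus(t): return np.log1p(np.exp(t))

    rng = np.random.default_rng(1)
    d, n = 4, 3
    W = rng.normal(size=(n, d))              # hypothetical toy parameters
    b = rng.normal(size=d)
    c = rng.normal(size=n)

    x = np.array([1, 0, 1, 1])

    # Closed form: P(h_i = 1 | x) = sigm(W_i x + c_i), all hidden units at once.
    p_hidden = sigm(W @ x + c)

    # Unnormalized log P(x): b^T x + sum_i softplus(W_i x + c_i).
    log_px_unnorm = b @ x + softplus(W @ x + c).sum()

    # Brute-force comparison: log sum_h exp(-E(x, h)) should agree exactly.
    hs = [np.array(h) for h in itertools.product([0, 1], repeat=n)]
    brute = np.log(sum(np.exp(h @ W @ x + b @ x + c @ h) for h in hs))
    print(np.isclose(log_px_unnorm, brute))  # True
    print(p_hidden)                          # per-unit conditional probabilities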
Example: Ising model
x_i ∈ {-1, 1}
Configuration: x = (x_1, ..., x_i, ..., x_n). There are 2^n configurations.
Hamiltonian (energy):
H = -∑_{i~j} h_{ij} x_i x_j - ∑_i b_i x_i,
where i~j means the sum runs over adjacent nodes i and j with i < j.
Probability of configuration x:
P(x) ∝ exp(-λH)
λ = 1 / (k_B T)
k_B: Boltzmann constant (usually set to 1)
T: temperature
If most x_i are aligned in the same direction, the energy (Hamiltonian) tends to be lower (assuming ferromagnetic couplings h_{ij} > 0), hence the probability higher. The Ising model is an idealized model of a magnet.
Partition function:
Z = ∑_x exp(-λH(x)),
so that
P(x) = (1/Z) exp(-λH)
Due to the large number of configurations (2^n), it is impractical to compute Z exactly when n is large.
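A brute-force sketch (a 1D Ising chain with constant nearest-neighbor coupling and field, λ = 1, all values made up) that computes Z by enumerating all 2^n configurations; this enumeration is exactly the sum that becomes intractable as n grows.

    import itertools
    import numpy as np

    n = 10                # toy chain length; the sum below has 2**n = 1024 terms
    coupling = 1.0        # nearest-neighbor coupling h_ij, assumed constant (made-up value)
    field = 0.2           # external field b_i, assumed constant (made-up value)
    lam = 1.0             # lambda = 1 / (k_B T) with k_B = 1

    def hamiltonian(x):
        x = np.asarray(x, dtype=float)
        pairs = -coupling * np.sum(x[:-1] * x[1:])   # adjacent pairs on a 1D chain
        return pairs - field * np.sum(x)

    Z = sum(np.exp(-lam * hamiltonian(x))
            for x in itertools.product([-1, 1], repeat=n))
    print(Z)

    # Probability of the all-up configuration; aligned spins get low energy, high probability.
    x_up = np.ones(n)
    print(np.exp(-lam * hamiltonian(x_up)) / Z)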