Introduction to Restricted Boltzmann Machines

1 Introduction to Restricted Boltzmann Machines
Ilija Bogunovic and Edo Collins, EPFL, October 13, 2014

2 Introduction
Ingredients:
1. Probabilistic graphical models (undirected, also known as Markov random fields): an RBM is a parametrized generative model representing a probability distribution.
2. Sampling (MCMC): the goal is to approximate the likelihood and its gradient.
3. Learning an RBM: given the training data, adjust the RBM parameters so that the probability distribution represented by the RBM fits the training data as well as possible.

3 Model
Let $V$ and $H$ be the sets of all visible and hidden units respectively, with $m$ visible units $v_j$ and $n$ hidden units $h_i$.
Parameters: $\theta = (w, b, c)$. Entries $w_{ij}$ are pairwise weights between $h_i$ and $v_j$. Vectors $b$ and $c$ are bias terms for $V$ and $H$ respectively.
Energy function:
$$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$$
Joint distribution of $(V, H)$:
$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \quad \text{where } Z = \sum_{v, h} e^{-E(v, h)}.$$
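For a very small RBM the energy, the partition function and the joint distribution can be evaluated by brute force. The following is a minimal sketch under that assumption; the function names and the toy parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import math

def all_states(k):
    """All binary vectors of length k, as tuples."""
    return list(itertools.product((0, 1), repeat=k))

def energy(v, h, W, b, c):
    """E(v, h) = -sum_ij w_ij h_i v_j - sum_j b_j v_j - sum_i c_i h_i."""
    pair = sum(W[i][j] * h[i] * v[j]
               for i in range(len(h)) for j in range(len(v)))
    vis = sum(b[j] * v[j] for j in range(len(v)))
    hid = sum(c[i] * h[i] for i in range(len(h)))
    return -pair - vis - hid

def partition_function(W, b, c):
    """Z = sum over all (v, h) of exp(-E(v, h)); exponential in n + m."""
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in all_states(len(b)) for h in all_states(len(c)))

def joint_probability(v, h, W, b, c):
    """p(v, h) = exp(-E(v, h)) / Z."""
    return math.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)

# Hypothetical toy parameters: n = 2 hidden units, m = 3 visible units.
# W[i][j] couples hidden unit i with visible unit j.
W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
b = [0.1, -0.1, 0.2]   # visible biases
c = [0.0, 0.3]         # hidden biases
```

The enumeration in `partition_function` visits 2^(n+m) states, which is exactly why it is only usable for toy models.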

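4 Model
The marginal distribution of $V$:
$$p(v) = \sum_{h} p(v, h) = \frac{1}{Z} \sum_{h} e^{-E(v, h)} = \frac{1}{Z} e^{-F(v)},$$
where $F(v) = -\ln \sum_{h} e^{-E(v, h)}$ is the free energy.
Given a single training example $v$, the log-likelihood is:
$$\ln p(v \mid \theta) = \ln \frac{1}{Z} \sum_{h} e^{-E(v, h)} = \ln \sum_{h} e^{-E(v, h)} - \ln \sum_{v, h} e^{-E(v, h)}$$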
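For binary hidden units the sum over h factorizes, so the free energy has a closed form: F(v) = -sum_j b_j v_j - sum_i ln(1 + exp(c_i + sum_j w_ij v_j)). The sketch below assumes binary units and a tiny model, so that ln Z can still be obtained by enumerating the visible states. All names and parameter values are hypothetical.

```python
import itertools
import math

def free_energy(v, W, b, c):
    """Closed-form F(v) for binary hidden units."""
    vis = sum(bj * vj for bj, vj in zip(b, v))
    hid = sum(
        math.log1p(math.exp(ci + sum(wij * vj for wij, vj in zip(Wi, v))))
        for Wi, ci in zip(W, c)
    )
    return -vis - hid

def free_energy_bruteforce(v, W, b, c):
    """F(v) = -ln sum_h exp(-E(v, h)), by enumerating all hidden states."""
    total = 0.0
    for h in itertools.product((0, 1), repeat=len(c)):
        e = -sum(W[i][j] * h[i] * v[j]
                 for i in range(len(h)) for j in range(len(v)))
        e -= sum(b[j] * v[j] for j in range(len(v)))
        e -= sum(c[i] * h[i] for i in range(len(h)))
        total += math.exp(-e)
    return -math.log(total)

def log_likelihood(v, W, b, c):
    """ln p(v) = -F(v) - ln Z, with Z = sum over visible states of exp(-F)."""
    log_z = math.log(sum(math.exp(-free_energy(u, W, b, c))
                         for u in itertools.product((0, 1), repeat=len(b))))
    return -free_energy(v, W, b, c) - log_z

# Hypothetical toy parameters: 2 hidden units, 3 visible units.
W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
b = [0.1, -0.1, 0.2]
c = [0.0, 0.3]
```

With the closed form, the sum over hidden states disappears and only the sum over visible states inside ln Z remains exponential.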

5 Gradient of the log-likelihood
By taking the gradient:
$$\frac{\partial \ln p(v \mid \theta)}{\partial \theta} = -\sum_{h} p(h \mid v) \frac{\partial E(v, h)}{\partial \theta} + \sum_{v, h} p(v, h) \frac{\partial E(v, h)}{\partial \theta}.$$
This is a difference of two expectations of the energy gradient:
1. under the model distribution $p(v, h)$ (the second sum), and
2. under the conditional distribution of the hidden variables given the training example, $p(h \mid v)$ (the first sum).
Directly calculating the second sum, which runs over all joint states $(v, h)$, is exponential in the number of variables!
Solution: approximate the expectation with samples drawn from the corresponding distribution using MCMC techniques.
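Specializing the gradient to the individual RBM parameters makes the two expectations concrete. The derivation below is a sketch assuming binary units and the energy function of the Model slide. The symbol $v'$ is introduced here only as a summation variable, to keep it distinct from the training example $v$.

```latex
\begin{align*}
\frac{\partial E(v, h)}{\partial w_{ij}} &= -h_i v_j, \qquad
\frac{\partial E(v, h)}{\partial b_j} = -v_j, \qquad
\frac{\partial E(v, h)}{\partial c_i} = -h_i \\[4pt]
\frac{\partial \ln p(v \mid \theta)}{\partial w_{ij}}
  &= \sum_{h} p(h \mid v)\, h_i v_j - \sum_{v', h} p(v', h)\, h_i v'_j \\
  &= p(H_i = 1 \mid v)\, v_j - \sum_{v'} p(v')\, p(H_i = 1 \mid v')\, v'_j \\[4pt]
\frac{\partial \ln p(v \mid \theta)}{\partial b_j}
  &= v_j - \sum_{v'} p(v')\, v'_j \\[4pt]
\frac{\partial \ln p(v \mid \theta)}{\partial c_i}
  &= p(H_i = 1 \mid v) - \sum_{v'} p(v')\, p(H_i = 1 \mid v')
\end{align*}
```

In each case the first term depends only on the training example and is cheap to compute, while the second term still requires a sum over all visible states.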

6 Quick detour: Markov Chains
Random variables $X = \{X^{(k)} \mid k \in \mathbb{N}_0\}$ which take values in $\Omega$.
Markov property:
$$p_{ij}^{(k)} := P(X^{(k+1)} = j \mid X^{(k)} = i, X^{(k-1)} = i_{k-1}, \ldots, X^{(0)} = i_0) = P(X^{(k+1)} = j \mid X^{(k)} = i).$$
Homogeneous Markov chain: for every $k$ the transition matrix $P = (p_{ij})_{i,j \in \Omega}$ is the same.
Starting distribution (the probability distribution of $X^{(0)}$): $\mu^{(0)} = (\mu^{(0)}(i))_{i \in \Omega}$ with $\mu^{(0)}(i) = P(X^{(0)} = i)$.
Distribution at time $k$: $\mu^{(k)T} = \mu^{(0)T} P^k$.
Stationary distribution: $\pi^T = \pi^T P$.
For an arbitrary starting distribution $\mu$, an aperiodic and irreducible Markov chain on a finite state space $\Omega$ is guaranteed to converge to its stationary distribution.
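The convergence statement can be checked numerically on a small example. The sketch below uses a hypothetical two-state chain whose stationary distribution is (5/6, 1/6), and applies the update for the distribution at time k repeatedly from two different starting distributions.

```python
def step(mu, P):
    """One update mu^T <- mu^T P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_at(mu0, P, k):
    """Distribution after k steps: mu0^T P^k."""
    mu = list(mu0)
    for _ in range(k):
        mu = step(mu, P)
    return mu

# Hypothetical two-state chain; it is aperiodic and irreducible
# because every entry of P is strictly positive.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Stationary distribution solves pi^T = pi^T P; here pi = (5/6, 1/6).
pi = [5.0 / 6.0, 1.0 / 6.0]

from_state_0 = distribution_at([1.0, 0.0], P, 50)
from_state_1 = distribution_at([0.0, 1.0], P, 50)
```

Both `from_state_0` and `from_state_1` end up at `pi` up to floating-point error, independent of the starting distribution.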

7 Gibbs Sampling
Uses full conditional probabilities to construct a Markov chain whose stationary distribution is the joint distribution.
At a given time step we draw a value for the random variable $X_i$. For states $x, y$, let the transition probability be:
$$P_{xy} = \begin{cases} \pi(x_i \mid \{x_j\}_{j \neq i}) & \text{if } x = y \\ \pi(y_i \mid \{x_j\}_{j \neq i}) & \text{if } x_i \neq y_i \text{ and } x_j = y_j \ \forall j \neq i \\ 0 & \text{else} \end{cases}$$
The joint distribution $\pi$ is the stationary distribution of the Markov chain defined by the transition matrix $P$.
$P$ also satisfies that the chain is aperiodic and irreducible.
Once converged, Gibbs sampling produces an unbiased sample from the joint distribution.
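The stationarity claim can be verified directly on a small joint distribution. The following sketch builds the transition matrix for resampling a single coordinate i from a hypothetical joint table over two binary variables, and applies it to the joint distribution. The table values and function names are assumptions for illustration.

```python
def conditional(pi, x, i, value):
    """pi(X_i = value | all other coordinates of x), from a joint table."""
    num = pi[x[:i] + (value,) + x[i + 1:]]
    den = sum(pi[x[:i] + (u,) + x[i + 1:]] for u in (0, 1))
    return num / den

def gibbs_transition(pi, i):
    """Transition matrix P[x][y] for a step that resamples coordinate i."""
    states = sorted(pi)
    P = {x: {y: 0.0 for y in states} for x in states}
    for x in states:
        for value in (0, 1):
            # y agrees with x everywhere except possibly at coordinate i.
            y = x[:i] + (value,) + x[i + 1:]
            P[x][y] = conditional(pi, x, i, value)
    return P

def apply_transition(pi, P):
    """Return the distribution pi^T P."""
    return {y: sum(pi[x] * P[x][y] for x in pi) for y in pi}

# Hypothetical strictly positive joint distribution over (X_0, X_1).
pi = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

after_update_0 = apply_transition(pi, gibbs_transition(pi, 0))
after_update_1 = apply_transition(pi, gibbs_transition(pi, 1))
```

Applying the single-coordinate update for either coordinate leaves `pi` unchanged, which is the stationarity property stated above.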

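8 Gibbs Sampling
Due to the RBM's bipartite structure, the visible units are conditionally independent given the hidden units, and vice versa:
$$P(h \mid v) = \prod_{i=1}^{n} p(h_i \mid v), \qquad P(v \mid h) = \prod_{j=1}^{m} p(v_j \mid h)$$
Computing these quantities is easy:
$$P(H_i = 1 \mid v) = \sigma\Big(\sum_{j=1}^{m} w_{ij} v_j + c_i\Big), \qquad P(V_j = 1 \mid h) = \sigma\Big(\sum_{i=1}^{n} w_{ij} h_i + b_j\Big),$$
where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid.
Gibbs sampling is performed by iteratively sampling $h^{(t)}$ conditioned on $v^{(t)}$, and $v^{(t+1)}$ conditioned on $h^{(t)}$.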
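Because each layer factorizes given the other, one Gibbs step can sample a whole layer at once. A minimal sketch of this block Gibbs step, assuming binary units and plain Python lists for the parameters, is shown below. The function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v, W, c):
    """P(H_i = 1 | v) = sigmoid(sum_j w_ij v_j + c_i) for every hidden unit i."""
    return [sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i])
            for i in range(len(c))]

def p_v_given_h(h, W, b):
    """P(V_j = 1 | h) = sigmoid(sum_i w_ij h_i + b_j) for every visible unit j."""
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + b[j])
            for j in range(len(b))]

def sample_bernoulli(probs, rng):
    """Draw independent binary values with the given success probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(v, W, b, c, rng):
    """Sample h ~ P(h | v), then the next visible state ~ P(v | h)."""
    h = sample_bernoulli(p_h_given_v(v, W, c), rng)
    v_next = sample_bernoulli(p_v_given_h(h, W, b), rng)
    return v_next, h

# Usage with hypothetical parameters: 2 hidden units, 3 visible units.
rng = random.Random(0)
W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
b = [0.1, -0.1, 0.2]
c = [0.0, 0.3]
v = [1, 0, 1]
for _ in range(10):
    v, h = gibbs_step(v, W, b, c, rng)
```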

9 Contrastive Divergence
Gibbs sampling may require many iterations to converge.
Contrastive divergence (CD-k) runs the Markov chain for only $k$ steps, starting from a training example.
Since $k$ is usually small, the resulting sample is biased, but in many cases training works anyway.
When training does not work, this is attributed to the Gibbs chain not converging quickly enough, causing the likelihood to diverge.
Convergence may be sped up with weight decay, but the weight-decay parameter $\lambda$ is difficult to tune.
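A single CD-k parameter update can be sketched as follows. This is an illustrative version for one training example with binary units, without weight decay or mini-batches. The function names and the learning-rate parameter `lr` are assumptions and are not taken from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v, W, c):
    return [sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i])
            for i in range(len(c))]

def p_v_given_h(h, W, b):
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + b[j])
            for j in range(len(b))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_update(v0, W, b, c, k, lr, rng):
    """Update W, b, c in place from one example v0; return the sample v_k."""
    # Negative phase: run the Gibbs chain for k steps, starting at the data.
    vk = list(v0)
    for _ in range(k):
        h = sample_bernoulli(p_h_given_v(vk, W, c), rng)
        vk = sample_bernoulli(p_v_given_h(h, W, b), rng)
    # Hidden probabilities at the data and at the k-step sample.
    ph0 = p_h_given_v(v0, W, c)
    phk = p_h_given_v(vk, W, c)
    # Gradient estimate: data term minus the k-step sample term.
    for i in range(len(c)):
        for j in range(len(b)):
            W[i][j] += lr * (ph0[i] * v0[j] - phk[i] * vk[j])
        c[i] += lr * (ph0[i] - phk[i])
    for j in range(len(b)):
        b[j] += lr * (v0[j] - vk[j])
    return vk
```

The model expectation in the exact gradient is replaced here by a single sample `vk`, which is what makes the estimate cheap but biased for small `k`.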

10 Persistent Contrastive Divergence
Instead of restarting Gibbs sampling from the training input $v$ at every update, continue sampling from $v^{(k)}$, i.e. from the final state of the previous chain. In other words, the parameters $\theta$ are updated while sampling continues.
This relies on the assumption that when the learning rate is small, the model distribution does not change much after a single update.
Multiple Gibbs chains are maintained at the same time to get a better estimate of the expected energy.
PCD approximates the likelihood better than CD, but may still diverge.
Another algorithm, Parallel Tempering, seems to perform better. It maintains parallel Gibbs chains, each running on a smoother version of the original distribution. It stochastically swaps particles between adjacent chains, including the one running on the original distribution, which is the one used for the gradient update.
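The difference from CD is only where the negative-phase chain starts. A minimal sketch of one PCD update is given below, assuming binary units, a single training example per update, and a list of persistent chain states (here called `particles`) that is kept between updates. All names are hypothetical, and Parallel Tempering is not covered.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v, W, c):
    return [sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i])
            for i in range(len(c))]

def p_v_given_h(h, W, b):
    return [sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + b[j])
            for j in range(len(b))]

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def pcd_update(v_data, particles, W, b, c, lr, rng, k=1):
    """Update W, b, c and the persistent chains in place."""
    # Advance every persistent chain by k Gibbs steps from its previous state.
    for p in range(len(particles)):
        v = particles[p]
        for _ in range(k):
            h = sample_bernoulli(p_h_given_v(v, W, c), rng)
            v = sample_bernoulli(p_v_given_h(h, W, b), rng)
        particles[p] = v
    n_chains = len(particles)
    ph_data = p_h_given_v(v_data, W, c)
    ph_neg = [p_h_given_v(v, W, c) for v in particles]
    # Data term minus the average over all persistent chains.
    for i in range(len(c)):
        for j in range(len(b)):
            neg = sum(ph_neg[p][i] * particles[p][j]
                      for p in range(n_chains)) / n_chains
            W[i][j] += lr * (ph_data[i] * v_data[j] - neg)
        c[i] += lr * (ph_data[i] - sum(q[i] for q in ph_neg) / n_chains)
    for j in range(len(b)):
        b[j] += lr * (v_data[j] - sum(v[j] for v in particles) / n_chains)

# Usage: the same particles list is passed to every update.
rng = random.Random(0)
W = [[0.0] * 3 for _ in range(2)]
b = [0.0] * 3
c = [0.0] * 2
particles = [[0, 0, 0], [1, 1, 1]]
for _ in range(100):
    pcd_update([1, 1, 0], particles, W, b, c, lr=0.05, rng=rng)
```

Since the chains are never reset to the data, they keep mixing across updates, which is only reasonable when `lr` is small enough that the model changes slowly.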

11 Extensions
Some varieties and extensions of RBMs:
- Handling real values, multinomial values, etc.
- Boltzmann machines and semi-restricted Boltzmann machines: allow hidden-to-hidden and/or visible-to-visible connections.
- Deep RBMs, convolutional RBMs.

12 The End

13 References
Fischer, Asja and Igel, Christian (2012). An introduction to restricted Boltzmann machines. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications,
