Deep Boltzmann Machines
Ruslan Salakhutdinov and Geoffrey E. Hinton

Presented by: Amish Goel, University of Illinois Urbana-Champaign (agoel10@illinois.edu)
December 2, 2016
Overview

1. Introduction
   - Representation of the model
2. Learning in Boltzmann Machines
   - Variational lower bound: mean-field approximation
   - Stochastic approximation procedure: persistent Markov chains
3. Additional tricks for DBMs
   - Greedy pretraining of the model
   - Discriminative finetuning
4. Simulation results
Introduction

A Boltzmann machine is a pairwise Markov random field. Some of the random variables are treated as latent, i.e. hidden (h), and the others as visible (v). For binary random variables the probability distribution is given by

$$P_\theta(v, h) = \frac{1}{Z_\theta}\, e^{-E_\theta(v, h)}, \qquad \theta = \{L, J, W\},$$

$$E_\theta(v, h) = -\frac{1}{2} v^T L v - \frac{1}{2} h^T J h - v^T W h.$$

Figure: Model for Boltzmann machines
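To make the definitions concrete, here is a minimal NumPy sketch (illustrative, not from the slides) that evaluates $E_\theta(v, h)$ and, for a toy-sized model, the normalized probability by brute-force enumeration of $Z_\theta$; all sizes and parameter values are made up.

```python
import itertools
import numpy as np

def energy(v, h, L, J, W):
    """E_theta(v, h) = -1/2 v^T L v - 1/2 h^T J h - v^T W h."""
    return -0.5 * v @ L @ v - 0.5 * h @ J @ h - v @ W @ h

def partition_function(L, J, W, n_v, n_h):
    """Brute-force Z_theta; only feasible for tiny n_v, n_h."""
    Z = 0.0
    for v in itertools.product([0, 1], repeat=n_v):
        for h in itertools.product([0, 1], repeat=n_h):
            Z += np.exp(-energy(np.array(v), np.array(h), L, J, W))
    return Z

# Toy example: 3 visible and 2 hidden units with random symmetric couplings.
rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
L = rng.normal(scale=0.1, size=(n_v, n_v))
L = (L + L.T) / 2
np.fill_diagonal(L, 0)          # no self-connections
J = rng.normal(scale=0.1, size=(n_h, n_h))
J = (J + J.T) / 2
np.fill_diagonal(J, 0)

Z = partition_function(L, J, W, n_v, n_h)
v, h = np.array([1, 0, 1]), np.array([1, 0])
print("P(v, h) =", np.exp(-energy(v, h, L, J, W)) / Z)
```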
Representation

While the Boltzmann machine is a powerful model of the data, it is computationally expensive to learn, so one considers several restricted variants.

Figure: Boltzmann machines vs. RBMs

A Deep Boltzmann Machine arranges the hidden units in several layers, where a layer is a set of units with no direct connections among themselves.
Learning in Boltzmann Machines

The model can be trained by maximum likelihood. The gradient of the log-likelihood takes the following form:

$$\ln \mathcal{L}_\theta(v) = \ln p_\theta(v) = \ln \sum_h p_\theta(v, h) = \ln \sum_h \exp(-E_\theta(v, h)) - \ln \sum_{v, h} \exp(-E_\theta(v, h)); \tag{1}$$

$$\frac{\partial \ln \mathcal{L}_\theta(v)}{\partial \theta} = \underbrace{-\sum_h p(h \mid v)\, \frac{\partial E_\theta(v, h)}{\partial \theta}}_{\text{data-dependent expectation}} + \underbrace{\sum_{v, h} p(v, h)\, \frac{\partial E_\theta(v, h)}{\partial \theta}}_{\text{model-dependent expectation}}.$$
Learning in Boltzmann Machines

Substituting $E_\theta(v, h)$ into the gradient above and applying gradient ascent, one obtains the updates for the respective parameters:

$$\Delta W = \alpha\big(\mathbb{E}_{P_{\text{data}}}[v h^T] - \mathbb{E}_{P_{\text{model}}}[v h^T]\big), \quad \Delta L = \alpha\big(\mathbb{E}_{P_{\text{data}}}[v v^T] - \mathbb{E}_{P_{\text{model}}}[v v^T]\big), \quad \Delta J = \alpha\big(\mathbb{E}_{P_{\text{data}}}[h h^T] - \mathbb{E}_{P_{\text{model}}}[h h^T]\big),$$

$$\Delta b = \alpha\big(\mathbb{E}_{P_{\text{data}}}[v] - \mathbb{E}_{P_{\text{model}}}[v]\big), \quad \Delta c = \alpha\big(\mathbb{E}_{P_{\text{data}}}[h] - \mathbb{E}_{P_{\text{model}}}[h]\big). \tag{2}$$

These maximum-likelihood updates are very costly: computing either expectation exactly requires summing over an exponential number of configurations. One needs approximations.
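As a concrete (illustrative) rendering of the updates in (2), not code from the paper, the helper below performs one gradient-ascent step given samples from the two distributions; in practice the data-phase samples would have v clamped to training data with h drawn from p(h | v), and the model-phase "fantasy" samples would come from MCMC.

```python
import numpy as np

def boltzmann_update(W, L, J, b, c, v_data, h_data, v_model, h_model, alpha=0.01):
    """One gradient-ascent step implementing the updates in (2).

    Rows of v_data/h_data are samples with v clamped to training data and
    h ~ p(h | v); rows of v_model/h_model are 'fantasy' samples from the
    joint p(v, h). Expectations are replaced by sample averages.
    """
    n, m = len(v_data), len(v_model)
    W = W + alpha * (v_data.T @ h_data / n - v_model.T @ h_model / m)
    L = L + alpha * (v_data.T @ v_data / n - v_model.T @ v_model / m)
    J = J + alpha * (h_data.T @ h_data / n - h_model.T @ h_model / m)
    b = b + alpha * (v_data.mean(axis=0) - v_model.mean(axis=0))
    c = c + alpha * (h_data.mean(axis=0) - h_model.mean(axis=0))
    return W, L, J, b, c
```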
Approximate Maximum Likelihood Learning in Boltzmann Machines

One approximation is to use a variational lower bound on the log-likelihood:

$$\ln p_\theta(v) = \ln \sum_h p_\theta(v, h) = \ln \sum_h q_\mu(h \mid v)\, \frac{p_\theta(v, h)}{q_\mu(h \mid v)} \;\geq\; \sum_h q_\mu(h \mid v) \ln p_\theta(v, h) + H_e(q_\mu) =: \mathcal{L}(q_\mu, \theta), \tag{3}$$

where $q_\mu(h \mid v)$ is an approximate (variational) posterior distribution and $H_e(\cdot)$ is the entropy with natural logarithm. One then tries to find the tightest lower bound on the log-likelihood by optimizing over both the distribution $q_\mu$ and the parameters $\theta$.
Variational Learning for Boltzmann Machines

For Boltzmann machines, the lower bound can be rewritten as (ignoring the bias terms)

$$\mathcal{L}(q_\mu, \theta) = \sum_h q_\mu(h \mid v)\,\big(-E_\theta(v, h)\big) - \log Z_\theta + H_e(q_\mu). \tag{4}$$

Using the mean-field approximation, $q_\mu(h \mid v) = \prod_{j=1}^M q(h_j \mid v)$, and one assumes that $q(h_j = 1) = \mu_j$ (M is the number of hidden units). Then

$$\mathcal{L}(q_\mu, \theta) = \sum_h \prod_{j=1}^M q(h_j \mid v) \left( \frac{1}{2} v^T L v + \frac{1}{2} h^T J h + v^T W h \right) - \log Z_\theta + H_e(q_\mu) = \frac{1}{2} v^T L v + \frac{1}{2} \mu^T J \mu + v^T W \mu - \log Z_\theta + \sum_{j=1}^M H_e(\mu_j). \tag{5}$$
Variational EM Learning for Boltzmann Machines

Maximize the lower bound by alternating maximization over the variational parameters $\mu$ and the model parameters $\theta$: the typical EM learning idea.

E-step:

$$\sup_\mu \mathcal{L}(q_\mu, \theta) = \sup_\mu \left[ \frac{1}{2} v^T L v + \frac{1}{2} \mu^T J \mu + v^T W \mu - \log Z_\theta + \sum_{j=1}^M H_e(\mu_j) \right].$$

Alternating maximization over each coordinate gives the update

$$\mu_j \leftarrow \sigma\Big( \sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj}\, \mu_m \Big),$$

where $\sigma(\cdot)$ denotes the sigmoid function. Running these updates to convergence yields $\hat\mu$.
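A short sketch of this fixed-point iteration (illustrative: the slides update one coordinate at a time, while this version updates all $\mu_j$ in parallel for brevity, which typically behaves similarly when the couplings are weak):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W, J, n_iters=50):
    """Mean-field E-step: mu_j <- sigma(sum_i W_ij v_i + sum_{m != j} J_mj mu_m).

    J is assumed symmetric with a zero diagonal, so the matrix product
    J @ mu already excludes the m == j term.
    """
    mu = np.full(W.shape[1], 0.5)      # uninformative initialization
    for _ in range(n_iters):
        mu = sigmoid(v @ W + J @ mu)   # parallel sweep over all hidden units
    return mu
```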
Stochastic Approximation / Persistent Markov Chains

M-step:

$$\sup_\theta \mathcal{L}(q_{\hat\mu}, \theta) = \sup_\theta \left[ \frac{1}{2} v^T L v + \frac{1}{2} \hat\mu^T J \hat\mu + v^T W \hat\mu - \log Z_\theta + \sum_{j=1}^M H_e(\hat\mu_j) \right].$$

MCMC sampling with persistent Markov chains is used to approximate the gradient of the log-partition function $\log Z_\theta$. With $N$ persistent chains $(\tilde v_i, \tilde h_i)$, the parameter updates for one training example can be written as

$$\Delta W = \alpha_t \Big( v \hat\mu^T - \frac{1}{N} \sum_{i=1}^N \tilde v_i \tilde h_i^T \Big), \quad \Delta L = \alpha_t \Big( v v^T - \frac{1}{N} \sum_{i=1}^N \tilde v_i \tilde v_i^T \Big), \quad \Delta J = \alpha_t \Big( \hat\mu \hat\mu^T - \frac{1}{N} \sum_{i=1}^N \tilde h_i \tilde h_i^T \Big). \tag{6}$$
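One Gibbs sweep for a persistent chain might look as follows (a hedged sketch: in a general Boltzmann machine, units coupled by L or J should be resampled one at a time; the fully parallel form below is exact only when L = J = 0, i.e. for an RBM, and is shown for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h, W, L, J, rng):
    """Resample a persistent chain state (parallel approximation, see note above)."""
    p_h = sigmoid(v @ W + h @ J)                      # conditional of each h_j
    h = (rng.random(h.shape) < p_h).astype(float)
    p_v = sigmoid(W @ h + L @ v)                      # conditional of each v_i
    v = (rng.random(v.shape) < p_v).astype(float)
    return v, h
```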
Overall Algorithm for Training Boltzmann Machines

Data: a training set $S_N$ of N binary data vectors v, and M, the number of persistent Markov chains.
Initialize the parameter vector $\theta^0$ and the M chain states $\{\tilde v_{0,1}, \tilde h_{0,1}\}, \ldots, \{\tilde v_{0,M}, \tilde h_{0,M}\}$;
for t = 0 to T (number of iterations) do
    for each v in $S_N$ do
        Randomly initialize $\mu$ and run the updates $\mu_j \leftarrow \sigma\big( \sum_i W_{ij} v_i + \sum_{m \neq j} J_{mj} \mu_m \big)$ until convergence to $\hat\mu$
    end
    for m = 1 to M (number of persistent Markov chains) do
        Sample $(\tilde v_{t+1,m}, \tilde h_{t+1,m})$ given $(\tilde v_{t,m}, \tilde h_{t,m})$ by running the Gibbs sampler
    end
    Update $\theta$ using equation (6) (adjusting for batch data) and decrease the learning rate $\alpha_t$
end
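Putting the pieces together, a compact skeleton of the algorithm above (an illustrative sketch that reuses the mean_field and gibbs_sweep helpers from the earlier snippets; hyperparameters and initialization scales are made up):

```python
import numpy as np

def train(data, n_h, n_chains=100, n_epochs=10, alpha=0.005, rng=None):
    """Variational + persistent-chain training loop (sketch of the algorithm above)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, n_v = data.shape
    W = 0.01 * rng.standard_normal((n_v, n_h))
    L = np.zeros((n_v, n_v))                # lateral weights start at zero
    J = np.zeros((n_h, n_h))
    V = rng.integers(0, 2, (n_chains, n_v)).astype(float)   # persistent chain states
    H = rng.integers(0, 2, (n_chains, n_h)).astype(float)
    for epoch in range(n_epochs):
        # E-step: mean-field posteriors for every training vector.
        Mu = np.stack([mean_field(v, W, J) for v in data])
        # Persistent chains: one Gibbs sweep per chain.
        for m in range(n_chains):
            V[m], H[m] = gibbs_sweep(V[m], H[m], W, L, J, rng)
        # M-step: batch version of the updates in (6).
        W += alpha * (data.T @ Mu / n - V.T @ H / n_chains)
        L += alpha * (data.T @ data / n - V.T @ V / n_chains)
        J += alpha * (Mu.T @ Mu / n - H.T @ H / n_chains)
        np.fill_diagonal(L, 0)
        np.fill_diagonal(J, 0)
        alpha *= 0.95                       # decaying learning rate
    return W, L, J
```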
Learning for Deep Boltzmann Machines

For Deep Boltzmann Machines, L = 0 and J has many zero blocks, because hidden units interact only across adjacent layers; this simplifies several computations. In particular, the Gibbs sampling procedure is simplified, as all units in one layer can be sampled in parallel given the neighboring layers (see the sketch below). Learning was still observed to be slow, however, and greedy pretraining can result in faster convergence of the parameters.
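For a DBM with two hidden layers, the layered structure makes block Gibbs sampling exact: given h1, the variables v and h2 are conditionally independent, so entire layers can be sampled in parallel. A minimal sketch (biases omitted; W1 and W2 are the two weight matrices, both illustrative names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One Gibbs sweep in a 2-hidden-layer DBM (illustrative sketch).

    Because J is block-structured (no within-layer edges), each layer is
    conditionally independent given its neighbors, so whole layers are
    sampled at once: h1 given (v, h2), then v and h2 given h1.
    """
    p_h1 = sigmoid(v @ W1 + h2 @ W2.T)     # h1 gets bottom-up and top-down input
    h1 = (rng.random(h1.shape) < p_h1).astype(float)
    p_v = sigmoid(W1 @ h1)                 # v depends only on h1
    v = (rng.random(v.shape) < p_v).astype(float)
    p_h2 = sigmoid(h1 @ W2)                # h2 depends only on h1
    h2 = (rng.random(h2.shape) < p_h2).astype(float)
    return v, h1, h2
```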
Pretraining in Deep Boltzmann Machines

Each layer is pretrained as a separate RBM, with some rescaling of the weights when the RBMs are composed into a single DBM.

Figure: Greedy layerwise pretraining for DBM
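A hedged sketch of the greedy procedure, using plain CD-1 as a stand-in for the RBM trainer. Note that the paper's actual recipe also rescales/doubles certain weights for the bottom and top RBMs to compensate for missing top-down input; that correction is omitted here for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(x, n_hidden, rng, n_epochs=5, alpha=0.05):
    """Plain CD-1 RBM training (illustrative stand-in; biases omitted)."""
    W = 0.01 * rng.standard_normal((x.shape[1], n_hidden))
    for _ in range(n_epochs):
        p_h = sigmoid(x @ W)                               # positive phase
        h = (rng.random(p_h.shape) < p_h).astype(float)
        v_neg = (rng.random(x.shape) < sigmoid(h @ W.T)).astype(float)
        p_h_neg = sigmoid(v_neg @ W)                       # negative phase
        W += alpha * (x.T @ p_h - v_neg.T @ p_h_neg) / x.shape[0]
    return W

def greedy_pretrain(data, layer_sizes, rng):
    """Greedy layerwise stacking: each RBM is trained on the (mean)
    activations of the layer below. The paper's weight-rescaling
    correction for the bottom and top RBMs is NOT included here."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden, rng)
        weights.append(W)
        x = sigmoid(x @ W)             # propagate mean activations upward
    return weights
```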
Discriminative Finetuning in Deep Boltzmann Machines

An additional finetuning step is used to further improve performance. For a DBM with two hidden layers, for example, the approximate posterior over the top hidden layer is used as an augmented input to a neural network whose weights are initialized from the DBM parameters; the network is then trained discriminatively.

Figure: Finetuning the parameters of DBM
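A sketch of how the augmented input could be formed (illustrative and hedged: the mean-field recursion follows the description above, but details such as the iteration count are made up, and the classifier head and its training loop are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augmented_input(v, W1, W2, n_iters=20):
    """Run mean-field in a 2-hidden-layer DBM to get q(h2 | v), then return
    [v, mu2] as the augmented feature vector; the finetuned network's
    weights would be initialized from W1 and W2."""
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)   # layer 1 sees v and layer 2
        mu2 = sigmoid(mu1 @ W2)              # layer 2 sees layer 1
    return np.concatenate([v, mu2])
```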
Some Experimental Results and Observations

A DBM was trained to model handwritten digits from the MNIST dataset, using 60,000 training examples.

Figure: An example of a DBM used for MNIST data generation. (a) DBM model used for training; (b) examples of handwritten digits.

Some interesting observations:
- Without greedy pretraining, the models did not produce good results.
- With discriminative finetuning, the DBM achieved 99.5% accuracy, the best recognition result on the MNIST dataset at that time.
Thank You