Modeling Documents with a Deep Boltzmann Machine


1 Modeling Documents with a Deep Boltzmann Machine
Nitish Srivastava, Ruslan Salakhutdinov & Geoffrey Hinton, UAI 2013
Presented by Zhe Gan, Duke University, November 14

2 Outline
- Replicated Softmax Model (Hinton and Salakhutdinov, 2009): an undirected topic model.
- Over-Replicated Softmax Model (Srivastava et al., 2013): has only one more parameter than the Replicated Softmax.
- Related work: DocNADE (Larochelle and Lauly, 2012).
- Experiments: 20 Newsgroups and Reuters Corpus Volume I.

3 Replicated Softmax Model
Let $K$ be the vocabulary size, $N$ the number of words in a document, and $F$ the number of hidden topic features.
A document can be represented as a binary matrix $V \in \{0,1\}^{N \times K}$, with $v_{ik} = 1$ if visible unit $i$ takes the $k$-th value. The hidden units are $h \in \{0,1\}^{F}$.
The energy function is defined as
\[ E(V, h; \theta) = -\sum_{i=1}^{N} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ijk} h_j v_{ik} - \sum_{i=1}^{N} \sum_{k=1}^{K} v_{ik} b_{ik} - N \sum_{j=1}^{F} h_j a_j \]
Key assumption: the softmax units share the same set of weights, which gives
\[ E(V, h; \theta) = -\sum_{j=1}^{F} \sum_{k=1}^{K} W_{jk} h_j \hat{v}_k - \sum_{k=1}^{K} \hat{v}_k b_k - N \sum_{j=1}^{F} h_j a_j \tag{1} \]
where $\hat{v}_k = \sum_{i=1}^{N} v_{ik}$ denotes the count for the $k$-th word.
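A minimal Python sketch of the weight-sharing step in Equation (1): once the weights no longer depend on the position $i$, the energy depends on the document only through its word counts. The function names, toy sizes, and parameter values are illustrative assumptions and do not come from the slides.

```python
import random

def word_counts(doc, K):
    """Count vector v_hat for a document given as a list of word indices in [0, K)."""
    v_hat = [0] * K
    for k in doc:
        v_hat[k] += 1
    return v_hat

def energy_per_position(doc, h, W, b, a):
    """Energy summed position by position, with W and b shared across positions."""
    N, F = len(doc), len(h)
    e = 0.0
    for k in doc:  # v_ik = 1 only for the word k observed at position i
        e -= sum(W[j][k] * h[j] for j in range(F))
        e -= b[k]
    e -= N * sum(h[j] * a[j] for j in range(F))
    return e

def energy_from_counts(v_hat, h, W, b, a):
    """Equation (1): the same energy written in terms of the counts v_hat."""
    N, F, K = sum(v_hat), len(h), len(v_hat)
    e = -sum(W[j][k] * h[j] * v_hat[k] for j in range(F) for k in range(K))
    e -= sum(v_hat[k] * b[k] for k in range(K))
    e -= N * sum(h[j] * a[j] for j in range(F))
    return e

# Toy setup (assumed values): K = 5 words, F = 3 hidden units.
rng = random.Random(0)
K, F = 5, 3
W = [[rng.gauss(0, 0.1) for _ in range(K)] for _ in range(F)]
b = [rng.gauss(0, 0.1) for _ in range(K)]
a = [rng.gauss(0, 0.1) for _ in range(F)]
doc = [0, 2, 2, 4, 1, 2]
h = [1, 0, 1]
v_hat = word_counts(doc, K)
```

Both functions should return the same value for any word order in `doc`, which is the sense in which the model treats a document as a bag of words.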

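4 Replicated Softmax Model
Conditional distributions:
\[ p(h_j = 1 \mid V) = \sigma\left( \sum_{k=1}^{K} W_{jk} \hat{v}_k + N a_j \right) \tag{2} \]
\[ p(v_{ik} = 1 \mid h) = \frac{\exp\left( \sum_{j=1}^{F} W_{jk} h_j + b_k \right)}{\sum_{k'=1}^{K} \exp\left( \sum_{j=1}^{F} W_{jk'} h_j + b_{k'} \right)} \tag{3} \]
Figure: The Replicated Softmax model.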
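The two conditionals in Equations (2) and (3) can be sketched in plain Python as follows. This is an illustration under assumed toy parameters, not the authors' implementation; the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def p_hidden_given_counts(v_hat, W, a):
    """Equation (2): p(h_j = 1 | V); the hidden bias is scaled by document length N."""
    N, K = sum(v_hat), len(v_hat)
    return [sigmoid(sum(W[j][k] * v_hat[k] for k in range(K)) + N * a[j])
            for j in range(len(W))]

def p_word_given_hidden(h, W, b):
    """Equation (3): distribution over the K words, identical for every position i."""
    F, K = len(h), len(b)
    scores = [sum(W[j][k] * h[j] for j in range(F)) + b[k] for k in range(K)]
    return softmax(scores)

# Toy parameters (assumed): F = 2 hidden units, K = 4 words.
W = [[0.2, -0.1, 0.0, 0.3], [-0.3, 0.1, 0.4, 0.0]]
b = [0.0, 0.1, -0.1, 0.0]
a = [0.05, -0.05]
v_hat = [2, 0, 1, 1]
p_h = p_hidden_given_counts(v_hat, W, a)
p_v = p_word_given_hidden([1, 0], W, b)
```

Because the softmax units share weights, `p_word_given_hidden` needs to be evaluated only once per hidden state rather than once per word position.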

5 Over-Replicated Softmax Model
Now add a second hidden layer consisting of $M$ softmax units, i.e. $H^{(2)} \in \{0,1\}^{M \times K}$.
The first- and second-layer weights are tied to be the same, so no additional parameters are introduced.
In essence, the second layer can be regarded as a set of missing (pseudo) words.
Figure: The Over-Replicated Softmax model.

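6 Over-Replicated Softmax Model
The energy is now defined as
\[ E(V, h^{(1)}, H^{(2)}; \theta) = -\sum_{j=1}^{F} \sum_{k=1}^{K} W_{jk} h_j^{(1)} \left( \hat{v}_k + \hat{h}_k^{(2)} \right) - \sum_{k=1}^{K} \left( \hat{v}_k + \hat{h}_k^{(2)} \right) b_k - (M + N) \sum_{j=1}^{F} h_j^{(1)} a_j \]
where $\hat{h}_k^{(2)} = \sum_{i=1}^{M} h_{ik}^{(2)}$ is the pseudo count and $M$ is the length of the pseudo document.
Given $L$ documents, the gradient of the log-likelihood is
\[ \frac{1}{L} \sum_{l=1}^{L} \frac{\partial \log P(V_l)}{\partial W_{jk}} = \mathbb{E}_{P_{\text{data}}}\left[ \left( \hat{v}_k + \hat{h}_k^{(2)} \right) h_j^{(1)} \right] - \mathbb{E}_{P_{\text{model}}}\left[ \left( \hat{v}_k + \hat{h}_k^{(2)} \right) h_j^{(1)} \right] \]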
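The "pseudo words" reading of the second layer can be checked numerically with a small sketch: the Over-Replicated Softmax energy of a document with counts v_hat and pseudo counts h2_hat should equal the Replicated Softmax energy of a document whose counts are their sum. The function names and toy values below are assumptions made for illustration.

```python
def replicated_softmax_energy(v_hat, h, W, b, a):
    """Replicated Softmax energy written in terms of word counts."""
    F, K = len(h), len(v_hat)
    N = sum(v_hat)
    e = -sum(W[j][k] * h[j] * v_hat[k] for j in range(F) for k in range(K))
    e -= sum(v_hat[k] * b[k] for k in range(K))
    e -= N * sum(h[j] * a[j] for j in range(F))
    return e

def over_replicated_energy(v_hat, h2_hat, h1, W, b, a):
    """Over-Replicated Softmax energy with tied first- and second-layer weights."""
    F, K = len(h1), len(v_hat)
    N, M = sum(v_hat), sum(h2_hat)
    total = [v_hat[k] + h2_hat[k] for k in range(K)]  # observed plus pseudo counts
    e = -sum(W[j][k] * h1[j] * total[k] for j in range(F) for k in range(K))
    e -= sum(total[k] * b[k] for k in range(K))
    e -= (M + N) * sum(h1[j] * a[j] for j in range(F))
    return e

# Toy values (assumed): K = 3 words, F = 2 hidden units.
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
b = [0.1, 0.0, -0.1]
a = [0.05, -0.05]
v_hat = [2, 0, 1]    # N = 3 observed words
h2_hat = [1, 1, 0]   # M = 2 pseudo words
h1 = [1, 1]
e_over = over_replicated_energy(v_hat, h2_hat, h1, W, b, a)
e_combined = replicated_softmax_energy([3, 1, 1], h1, W, b, a)
```

With all pseudo counts set to zero, the same function reduces to the Replicated Softmax energy.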

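7 Learning and Inference
Since $P_{\text{data}}(h, V) = P(h \mid V; \theta) P_{\text{data}}(V)$, we use mean-field inference to estimate the data-dependent expectations:
\[ Q^{MF}(h \mid V; \mu) = \prod_{j=1}^{F} q(h_j^{(1)} \mid V) \prod_{i=1}^{M} q(h_i^{(2)} \mid V) \tag{4} \]
Specifically, let $q(h_j^{(1)} = 1) = \mu_j^{(1)}$ and $q(h_{ik}^{(2)} = 1) = \mu_k^{(2)}$. Then
\[ \mu_j^{(1)} = \sigma\left( \sum_{k=1}^{K} W_{jk} \left( \hat{v}_k + M \mu_k^{(2)} \right) \right) \tag{5} \]
\[ \mu_k^{(2)} = \frac{\exp\left( \sum_{j=1}^{F} W_{jk} \mu_j^{(1)} \right)}{\sum_{k'=1}^{K} \exp\left( \sum_{j=1}^{F} W_{jk'} \mu_j^{(1)} \right)} \tag{6} \]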
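The fixed-point updates in Equations (5) and (6) can be iterated as in the sketch below. The uniform initialisation of the second-layer parameters, the fixed number of iterations, and the function names are assumptions for illustration; biases are omitted, as in the equations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field(v_hat, W, M, n_iters=20):
    """Alternate Equations (5) and (6) for a fixed number of sweeps."""
    F, K = len(W), len(v_hat)
    mu2 = [1.0 / K] * K  # assumed initialisation: uniform over the vocabulary
    mu1 = [0.5] * F
    for _ in range(n_iters):
        # Equation (5): first-layer units see observed counts plus M expected pseudo words.
        mu1 = [sigmoid(sum(W[j][k] * (v_hat[k] + M * mu2[k]) for k in range(K)))
               for j in range(F)]
        # Equation (6): all M second-layer softmax units share one distribution over words.
        mu2 = softmax([sum(W[j][k] * mu1[j] for j in range(F)) for k in range(K)])
    return mu1, mu2

# Toy parameters (assumed): F = 2 hidden units, K = 4 words, M = 5 pseudo words.
W = [[0.2, -0.1, 0.0, 0.3], [-0.3, 0.1, 0.4, 0.0]]
v_hat = [2, 0, 1, 1]
mu1, mu2 = mean_field(v_hat, W, M=5)
```

In practice one would more likely stop when the parameters change by less than a tolerance; the fixed sweep count keeps the sketch short.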

8 Learning and Inference
Use an MCMC-based stochastic approximation procedure to approximate the model expectation, i.e. Contrastive Divergence.
A typical update: $V_t \to \{h_t^{(1)}, H_t^{(2)}\} \to V_{t+1} \to \{h_{t+1}^{(1)}, H_{t+1}^{(2)}\}$. A point estimate at $t + 1$ is used to approximate the model distribution.
Pre-training: train an RBM with scaled weights,
\[ P(h_j^{(1)} = 1 \mid V) = \sigma\left( \left( 1 + \frac{M}{N} \right) \sum_{k=1}^{K} W_{jk} \hat{v}_k \right) \tag{7} \]
Learning: already given in Equation (5).
Choosing $M$: it needs to be chosen using a validation set.
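One way to read the scaling in Equation (7), offered as an interpretation and not as something the slides state: if the second-layer mean-field parameters in Equation (5) are fixed to the empirical word distribution of the document, $\mu_k^{(2)} = \hat{v}_k / N$, then Equation (5) becomes exactly Equation (7). The sketch below checks this numerically; the function names and toy values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pretrain_hidden_prob(v_hat, W, M):
    """Equation (7): RBM hidden probabilities with weights scaled by (1 + M/N)."""
    N, K = sum(v_hat), len(v_hat)
    scale = 1.0 + M / N
    return [sigmoid(scale * sum(W[j][k] * v_hat[k] for k in range(K)))
            for j in range(len(W))]

def mean_field_hidden(v_hat, mu2, W, M):
    """Equation (5): first-layer mean-field update for given second-layer parameters."""
    K = len(v_hat)
    return [sigmoid(sum(W[j][k] * (v_hat[k] + M * mu2[k]) for k in range(K)))
            for j in range(len(W))]

# Toy values (assumed): F = 2 hidden units, K = 4 words, M = 10 pseudo words.
W = [[0.2, -0.1, 0.0, 0.3], [-0.3, 0.1, 0.4, 0.0]]
v_hat = [3, 0, 1, 2]
M = 10
N = sum(v_hat)
empirical = [count / N for count in v_hat]  # assumed choice: mu2_k = v_hat_k / N
p_scaled = pretrain_hidden_prob(v_hat, W, M)
p_mean_field = mean_field_hidden(v_hat, empirical, W, M)
```

Under this reading, pre-training treats the pseudo document as M extra words drawn from the same distribution as the observed ones.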

9 Related Work
NADE (Neural Autoregressive Distribution Estimation): a generative model over a binary vector $v \in \{0,1\}^{D}$.
\[ p(v_i = 1 \mid v_{<i}) = \sigma\left( b_i + V_{i,:} h_i \right) \tag{8} \]
\[ h_i(v_{<i}) = \sigma\left( c + W_{:,<i} v_{<i} \right) \tag{9} \]
where $W \in \mathbb{R}^{H \times D}$ and $V \in \mathbb{R}^{D \times H}$. Each conditional is modeled using a neural network.
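Equations (8) and (9) can be turned into a small log-likelihood routine, sketched below with assumed toy parameters. Because the model is a product of conditionals, the probabilities of all $2^D$ binary vectors should sum to one, which the last line computes for $D = 3$.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nade_log_prob(v, W, V, b, c):
    """log p(v) = sum_i log p(v_i | v_<i), with W of shape H x D and V of shape D x H."""
    D, H = len(v), len(c)
    log_p = 0.0
    for i in range(D):
        # Equation (9): the hidden vector for position i sees only v_<i.
        h = [sigmoid(c[m] + sum(W[m][d] * v[d] for d in range(i))) for m in range(H)]
        # Equation (8): probability that v_i = 1 given the earlier dimensions.
        p1 = sigmoid(b[i] + sum(V[i][m] * h[m] for m in range(H)))
        log_p += math.log(p1 if v[i] == 1 else 1.0 - p1)
    return log_p

# Toy parameters (assumed): D = 3 visible units, H = 2 hidden units.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
V = [[0.2, -0.1], [0.7, 0.3], [-0.5, 0.9]]
b = [0.1, -0.2, 0.0]
c = [0.0, 0.1]
total = sum(math.exp(nade_log_prob(v, W, V, b, c))
            for v in itertools.product([0, 1], repeat=3))
```

This direct version recomputes each hidden vector from scratch; the pre-activation could instead be accumulated incrementally across positions.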

10 Related Work
Document NADE:
\[ p(v_i = w \mid v_{<i}) = \frac{\exp\left( b_w + V_{w,:} h_i \right)}{\sum_{w'} \exp\left( b_{w'} + V_{w',:} h_i \right)} \tag{10} \]
\[ h_i(v_{<i}) = \sigma\left( c + \sum_{k<i} W_{:,v_k} \right) \tag{11} \]
The large softmax over words is replaced by a probabilistic tree in which each path from the root to a leaf corresponds to a word.
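A sketch of Equations (10) and (11) follows. Note that it uses the full softmax of Equation (10) and does not implement the probabilistic tree mentioned above; the running accumulation of the hidden pre-activation, the function name, and the toy parameters are assumptions for illustration.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def docnade_log_prob(doc, W, V, b, c):
    """log p(doc) for a list of word indices; W has shape H x K, V has shape K x H."""
    H, K = len(c), len(b)
    acc = list(c)  # running pre-activation: c + sum over k < i of W[:, v_k]
    log_p = 0.0
    for w in doc:
        h = [sigmoid(x) for x in acc]  # Equation (11)
        scores = [b[u] + sum(V[u][m] * h[m] for m in range(H)) for u in range(K)]
        mx = max(scores)
        log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
        log_p += scores[w] - log_z     # Equation (10), full softmax over the vocabulary
        for m in range(H):
            acc[m] += W[m][w]          # add the column of the word just observed
    return log_p

# Toy parameters (assumed): K = 3 words, H = 2 hidden units.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
V = [[0.2, -0.1], [0.7, 0.3], [-0.5, 0.9]]
b = [0.1, -0.2, 0.0]
c = [0.0, 0.1]
total = sum(math.exp(docnade_log_prob(list(d), W, V, b, c))
            for d in itertools.product(range(3), repeat=2))
```

The full softmax costs time proportional to the vocabulary size at every position, which is the cost the tree-structured output is meant to avoid.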

11 Experiments
Mini-batch size: 128; M = 100; number of topics: 128.
The model was implemented on GPUs; pretraining took 3-4 hours, and the proper training took hours.

12 Experiments - Document Retrieval

13 Experiments
The Over-Replicated Softmax model works well on short documents.


15 References
G. E. Hinton and R. Salakhutdinov. Replicated softmax: an undirected topic model. NIPS, 2009.
H. Larochelle and S. Lauly. A neural autoregressive topic model. NIPS, 2012.
N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton. Modeling documents with deep Boltzmann machines. UAI, 2013.

Fast Inference and Learning for Modeling Documents with a Deep Boltzmann Machine
Nitish Srivastava nitish@cs.toronto.edu, Ruslan Salakhutdinov rsalahu@cs.toronto.edu, Geoffrey Hinton hinton@cs.toronto.edu