Replicated Softmax: an Undirected Topic Model
Stephen Turner
1. Introduction
2. Replicated Softmax: A Generative Model of Word Counts
3. Evaluating Replicated Softmax as a Generative Model
4. Experimental Results
5. Conclusions and Extensions
Introduction
Probabilistic models can be used to analyze and extract semantic topics from large text collections. Documents are represented as a mixture of topics. Issue: the word distribution a document requires can be sharper than any individual topic, but a mixture's predictions can never be sharper than its component topics. Example: Television, Politics, and Real Estate are all broad topics, but a document combining all three would assign a high probability to the word "Trump". RBMs have been used with some success, but struggle to handle documents of different lengths.
Introduction
A possible solution to these issues is the Replicated Softmax model. It can be efficiently trained using Contrastive Divergence, it can better handle documents of different lengths, and computing the posterior distribution over topics is relatively easy.
Replicated Softmax: A Generative Model of Word Counts
Begin with a restricted Boltzmann machine with visible units v, where K is the dictionary size and D is the document size. h ϵ {0,1}^F are the binary stochastic hidden topic features. V is a K x D observed binary matrix, with v_i^k = 1 if visible unit i takes on the k-th value. The energy is:
E(V, h) = −Σ_{i,j,k} W_ijk h_j v_i^k − Σ_{i,k} b_i^k v_i^k − Σ_j a_j h_j
Replicated Softmax: A Generative Model of Word Counts
The probability the model assigns to the matrix V is:
P(V) = (1/Z) Σ_h exp(−E(V, h))
Z is the partition function, or normalizing constant: Z = Σ_V Σ_h exp(−E(V, h))
Replicated Softmax: A Generative Model of Word Counts
The conditional distributions are given by the softmax function (3) and the logistic function (4):
p(v_i^k = 1 | h) = exp(b_i^k + Σ_j h_j W_ijk) / Σ_{q=1}^K exp(b_i^q + Σ_j h_j W_ijq)   (3)
p(h_j = 1 | V) = σ(a_j + Σ_{i,k} v_i^k W_ijk)   (4)
The logistic function is: σ(x) = 1/(1+exp(-x))
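The two conditionals can be sketched in a few lines of NumPy. This uses the tied-weight form (introduced on the next slide), where conditioning on the word-count vector suffices; the sizes and random parameters W, a, b are illustrative, not from the paper.

```python
import numpy as np

# Illustrative sizes: K-word dictionary, F hidden topic features.
rng = np.random.default_rng(0)
K, F = 5, 3
W = rng.normal(size=(K, F))   # shared weights
b = rng.normal(size=K)        # word (visible) biases
a = rng.normal(size=F)        # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_word_given_h(h):
    """Softmax over the K words given hidden state h, as in (3)."""
    logits = b + W @ h
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

def p_hidden_given_counts(v_hat):
    """Logistic activation of each hidden unit given word counts, as in (4)."""
    return sigmoid(a + W.T @ v_hat)

h = np.array([1.0, 0.0, 1.0])
pv = p_word_given_h(h)                              # distribution over K words
ph = p_hidden_given_counts(np.array([2.0, 0.0, 1.0, 0.0, 3.0]))
```

Because the weights are shared across positions, the word distribution is identical for every softmax unit, which is what makes the replication trick on the next slide possible.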
Replicated Softmax: A Generative Model of Word Counts
Now, for each document, create a separate RBM with as many softmax units as there are words in the document. Each unit shares the same set of weights. If the document contains D words, the energy of the state is:
E(V, h) = −Σ_{j,k} W_jk h_j v̂^k − Σ_k b^k v̂^k − D Σ_j a_j h_j, where v̂^k = Σ_i v_i^k is the count of word k.
The hidden bias terms are scaled up by the length of the document, which allows this machine to handle documents of different lengths.
Replicated Softmax: A Generative Model of Word Counts
If there are N documents, the derivative of the log-likelihood with respect to the parameters W is:
(1/N) Σ_n ∂log P(V_n)/∂W_jk = E_Pdata[v̂^k h_j] − E_Pmodel[v̂^k h_j]
Exact maximum likelihood learning is intractable because computing E_Pmodel takes time exponential in min{D, F}, i.e. the number of visible or hidden units. So we use Contrastive Divergence, which replaces E_Pmodel with E_PT, the distribution obtained by running a Gibbs chain for T full steps starting at the data.
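A hedged sketch of one CD-1 update for this model, operating directly on a document's word-count vector v_hat. Note the D·a bias scaling from the document-length-scaled energy; all sizes and names (cd1_update, lr) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K, F = 6, 4                        # dictionary size, number of hidden units
W = 0.01 * rng.normal(size=(K, F))
b = np.zeros(K)                    # visible (word) biases
a = np.zeros(F)                    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cd1_update(v_hat, lr=0.1):
    """One CD-1 step for a single document with D = v_hat.sum() words."""
    global W, a, b
    D = int(v_hat.sum())
    # Positive phase: hidden probabilities given the data (hidden biases
    # scaled by document length D, as in the replicated energy).
    h_prob = sigmoid(W.T @ v_hat + D * a)
    h_samp = (rng.random(F) < h_prob).astype(float)
    # Negative phase: reconstruct all D words with one multinomial draw,
    # then recompute the hidden probabilities.
    v_neg = rng.multinomial(D, softmax(W @ h_samp + b)).astype(float)
    h_neg = sigmoid(W.T @ v_neg + D * a)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v_hat, h_prob) - np.outer(v_neg, h_neg))
    b += lr * (v_hat - v_neg)
    a += lr * D * (h_prob - h_neg)

doc = np.array([3.0, 0.0, 1.0, 0.0, 2.0, 0.0])   # toy word-count vector
for _ in range(5):
    cd1_update(doc)
```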
Replicated Softmax: A Generative Model of Word Counts
Weights can be shared by the whole family of different-sized RBMs created for documents of different lengths. Using D softmax units with identical weights is the same as sampling a single multinomial unit D times.
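A quick numerical check of this equivalence, with an illustrative softmax distribution p over K words: D independent draws of a single softmax unit, accumulated into counts, are distributed Multinomial(D, p), exactly the distribution over the D tied softmax units.

```python
import numpy as np

rng = np.random.default_rng(4)
K, D = 4, 10_000
p = np.array([0.1, 0.2, 0.3, 0.4])        # a fixed softmax distribution

draws = rng.choice(K, size=D, p=p)        # sample one softmax unit D times
counts = np.bincount(draws, minlength=K)  # aggregate into word counts
freqs = counts / D                        # should approach p for large D
```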
Evaluating Replicated Softmax as a Generative Model
A Monte Carlo-based method, Annealed Importance Sampling (AIS), can be used to efficiently estimate the partition function of an RBM.
Evaluating Replicated Softmax as a Generative Model
Take two distributions: p_A(x) = p*_A(x)/Z_A and p_B(x) = p*_B(x)/Z_B. Typically p_A(x) is a simple distribution with a known Z_A, and p_B is the target distribution of interest. We may estimate the ratio of normalizing constants by using simple importance sampling:
Z_B/Z_A ≈ (1/M) Σ_i p*_B(x^(i)) / p*_A(x^(i)), where each x^(i) is drawn from p_A.
If p_A and p_B are not close enough, this estimate will be very poor (the importance weights have high variance).
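A minimal sketch of this estimator, assuming two 1-D Gaussian examples where the answer is known: p*_A(x) = exp(-x²/2) and p*_B(x) = exp(-(x - 0.5)²/2). Both normalizers equal sqrt(2π), so the true ratio Z_B/Z_A is 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p_star_A(x):
    return -0.5 * x**2            # unnormalized log density of p_A

def log_p_star_B(x):
    return -0.5 * (x - 0.5)**2    # unnormalized log density of p_B

x = rng.normal(size=200_000)                   # exact samples from p_A
w = np.exp(log_p_star_B(x) - log_p_star_A(x))  # importance weights
ratio = w.mean()                               # estimates Z_B / Z_A
```

Moving the means further apart makes p_A and p_B less similar: the weight variance grows rapidly and the same estimator becomes unreliable, which is the failure mode noted above.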
Evaluating Replicated Softmax as a Generative Model
AIS can be viewed as simple importance sampling defined on a much higher-dimensional state space. It uses many auxiliary variables to make p_A and p_B closer together. AIS starts by defining a sequence of probability distributions p_0, ..., p_S, where p_0 = p_A and p_S = p_B. The intermediate distributions can be defined as:
p_s(x) ∝ p*_A(x)^(1−β_s) p*_B(x)^(β_s)
Inverse temperatures 0 = β_0 < β_1 < ... < β_S = 1 are chosen by the user.
Evaluating Replicated Softmax as a Generative Model
Using the bipartite structure of RBMs, a better AIS scheme can be devised. Reconsider the Replicated Softmax model with D words; the joint distribution is:
P(V, h) = (1/Z) exp(−E(V, h))
The sequence of intermediate distributions is then defined by the inverse temperatures β_s, which scale the energy:
p_s(V) ∝ Σ_h exp(−β_s E(V, h))
Evaluating Replicated Softmax as a Generative Model
The AIS algorithm starts at distribution p_0 and moves through the intermediate distributions toward the target distribution p_S. T_s is a Markov chain transition operator that must be defined for each intermediate distribution and must leave p_s invariant. Finally, after M runs, the importance weights w^(i) can be used to find our model's partition function:
Z_B/Z_A ≈ (1/M) Σ_i w^(i)
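The full AIS procedure can be sketched on a toy problem, assuming 1-D Gaussians rather than RBMs: p_A = N(0, 1) and target p_B = N(3, 1). Both normalizers equal sqrt(2π), so the true log(Z_B/Z_A) is 0, which lets us check the estimate. All names and settings here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_p_star(x, beta):
    # Geometric path: (1 - beta) * log p*_A(x) + beta * log p*_B(x)
    return (1.0 - beta) * (-0.5 * x**2) + beta * (-0.5 * (x - 3.0)**2)

betas = np.linspace(0.0, 1.0, 500)   # inverse temperatures beta_0 .. beta_S
M = 1000                             # number of independent AIS runs
log_w = np.zeros(M)                  # log importance weights, one per run
x = rng.normal(size=M)               # exact samples from p_0 = p_A
for b_prev, b_next in zip(betas[:-1], betas[1:]):
    # Accumulate the weight update for moving from p_{b_prev} to p_{b_next}.
    log_w += log_p_star(x, b_next) - log_p_star(x, b_prev)
    # One Metropolis transition leaving p_{b_next} invariant, per chain.
    prop = x + rng.normal(scale=0.5, size=M)
    accept = np.log(rng.random(M)) < log_p_star(prop, b_next) - log_p_star(x, b_next)
    x = np.where(accept, prop, x)
# The log of the mean importance weight estimates log(Z_B / Z_A).
log_ratio = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```

Simple importance sampling would fail badly here because N(0, 1) and N(3, 1) barely overlap; the chain of intermediate distributions is what keeps each step's weights well behaved.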
Experimental Results
Three data sets were used: NIPS proceedings papers, 20-newsgroups, and Reuters Corpus Volume I. The Replicated Softmax was compared to Latent Dirichlet Allocation (a Bayesian mixture model).
Experimental Results
Average test perplexity per word on the 50 held-out documents under each training condition.
Experimental Results
Precision-recall curves
Conclusions and Extensions
Learning with this model is easy and stable, and it can model documents of different lengths. Scaling this method is not particularly difficult. The model also generalizes much better than LDA in terms of both log-probability on held-out documents and retrieval accuracy. By incorporating other variables into slightly more complex versions of this model (such as document-specific metadata: authors, references, publishers), we could greatly increase retrieval accuracy.