Replicated Softmax: an Undirected Topic Model. Stephen Turner

1. Introduction
2. Replicated Softmax: A Generative Model of Word Counts
3. Evaluating Replicated Softmax as a Generative Model
4. Experimental Results
5. Conclusions and Extensions

Introduction

- Probabilistic models can be used to analyze and extract semantic topics from large text collections.
- Documents are represented as a mixture of topics.
- Issue: the distribution over words in a real document can be much sharper than the distribution defined by any individual topic. Example: Television, Politics, and Real Estate are all broad topics, but a document combining all three would give a high probability to the word "Trump".
- RBMs have been used with some success, but they struggle to handle documents of different lengths.

Introduction

- A possible solution to these issues is the use of a Replicated Softmax model.
- It can be efficiently trained using Contrastive Divergence.
- It can better handle documents of different lengths.
- Computing the posterior distribution is relatively easy.

Replicated Softmax: A Generative Model of Word Counts

- Begin with a restricted Boltzmann machine with visible units v ∈ {1, ..., K}^D, where K is the dictionary size and D is the document size.
- h ∈ {0,1}^F are the binary stochastic hidden topic features.
- V is a K x D observed binary matrix, with v_i^k = 1 if visible unit i takes on the k-th value.
- The energy of the state {V, h} is given below.
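
Restated from the original Salakhutdinov and Hinton paper, whose notation the slides follow, the energy of the state {V, h} is:

    E(V, h) = - \sum_{i=1}^{D} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ijk} h_j v_i^k
              - \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^k b_i^k
              - \sum_{j=1}^{F} h_j a_j

where W_{ijk} is the symmetric interaction between visible unit i taking value k and hidden feature j, b_i^k is the bias of visible unit i for value k, and a_j is the bias of hidden feature j.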

Replicated Softmax: A Generative Model of Word Counts

- The probability the model assigns to the matrix V is given below; Z is the partition function, or normalizing constant.
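
Written out, again following the paper, with the hidden units summed over and the partition function Z summing over all visible and hidden configurations:

    P(V) = \frac{1}{Z} \sum_h \exp(-E(V, h)),
    \qquad
    Z = \sum_{V'} \sum_h \exp(-E(V', h))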

Replicated Softmax: A Generative Model of Word Counts

- The conditional distributions are given by the softmax function (Eq. 3) and the logistic function (Eq. 4), shown below.
- The logistic function is σ(x) = 1/(1 + exp(-x)).
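
The two conditionals, in the paper's notation (the equation numbers 3 and 4 refer to the paper):

    p(v_i^k = 1 \mid h) = \frac{\exp\big(b_i^k + \sum_{j=1}^{F} h_j W_{ijk}\big)}
                               {\sum_{q=1}^{K} \exp\big(b_i^q + \sum_{j=1}^{F} h_j W_{ijq}\big)}    (3)

    p(h_j = 1 \mid V) = \sigma\Big(a_j + \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^k W_{ijk}\Big)    (4)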

Replicated Softmax: A Generative Model of Word Counts

- Now, for each document, create a separate RBM with as many softmax units as there are words in the document.
- All of these softmax units share the same set of weights.
- If the document contains D words, the energy of the state is as shown below.
- The bias terms of the hidden units are scaled up by the length of the document, which allows this machine to handle documents of different lengths.
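
With the weights and visible biases shared across positions (W_{ijk} = W_{jk}, b_i^k = b^k), the paper writes the energy for a document of D words as:

    E(V, h) = - \sum_{j=1}^{F} \sum_{k=1}^{K} W_{jk} h_j \hat{v}^k
              - \sum_{k=1}^{K} b^k \hat{v}^k
              - D \sum_{j=1}^{F} a_j h_j

where \hat{v}^k = \sum_{i=1}^{D} v_i^k is the count of the k-th dictionary word in the document; note the factor D on the hidden biases.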

Replicated Softmax: A Generative Model of Word Counts

- If there are N documents, the derivative of the log-likelihood with respect to the parameters W is as given below.
- Exact maximum likelihood learning is intractable because computing the model's expectation takes time that is exponential in min{D, F}, the number of visible or hidden units.
- So we use Contrastive Divergence instead.
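
The gradient and its Contrastive Divergence approximation, as in the paper:

    \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log P(V_n)}{\partial W_{jk}}
        = E_{P_{data}}\big[\hat{v}^k h_j\big] - E_{P_{model}}\big[\hat{v}^k h_j\big]

Contrastive Divergence replaces the intractable model expectation E_{P_{model}} with E_{P_T}, the expectation under the distribution obtained by running a Gibbs chain, initialized at the data, for T full steps.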

Replicated Softmax: A Generative Model of Word Counts

- The weights can be shared by the whole family of different-sized RBMs created for documents of different lengths.
- Using D softmax units with tied weights is the same as using a single multinomial visible unit sampled D times.
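
As a concrete illustration of the learning procedure above, here is a minimal NumPy sketch of one CD-1 update for a single document, operating directly on the word-count vector \hat{v}. The function names, learning rate, and initialization are illustrative choices rather than details from the paper; a full implementation would also use mini-batches, momentum, and weight decay.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cd1_update(v_hat, W, a, b, lr=0.01):
    """One CD-1 step for a single document.
    v_hat: length-K vector of word counts (sums to D)
    W: F x K shared weights, a: F hidden biases, b: K visible biases."""
    D = v_hat.sum()
    # Positive phase: hidden probabilities given the observed counts,
    # with the hidden biases scaled by the document length D.
    h_data = sigmoid(W @ v_hat + D * a)
    h_sample = (rng.random(h_data.shape) < h_data).astype(float)
    # Negative phase: sampling D tied softmax units given h is the same as
    # drawing D words from a single multinomial over the dictionary.
    p_words = softmax(W.T @ h_sample + b)
    v_recon = rng.multinomial(int(D), p_words).astype(float)
    h_recon = sigmoid(W @ v_recon + D * a)
    # Gradient step: data-dependent statistics minus reconstruction statistics.
    W += lr * (np.outer(h_data, v_hat) - np.outer(h_recon, v_recon))
    b += lr * (v_hat - v_recon)
    a += lr * D * (h_data - h_recon)

# Example usage with random data: F=50 topics, K=1000 dictionary words, D=200 words.
F, K = 50, 1000
W = 0.01 * rng.standard_normal((F, K))
a, b = np.zeros(F), np.zeros(K)
v_hat = rng.multinomial(200, np.full(K, 1.0 / K)).astype(float)
cd1_update(v_hat, W, a, b)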

Evaluating Replicated Softmax as a Generative Model

- A Monte Carlo based method, Annealed Importance Sampling (AIS), can be used to efficiently estimate the partition function of an RBM.

Evaluating Replicated Softmax as a Generative Model

- Take two distributions, P_A(x) = P_A*(x)/Z_A and P_B(x) = P_B*(x)/Z_B. Typically P_A(x) is a simple distribution with a known Z_A, and P_B is the target distribution of interest.
- We may estimate the ratio of their normalizing constants using simple importance sampling, as shown below, with the samples x^(i) drawn from P_A.
- If P_A and P_B are not close enough, this estimate will be very poor.
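
The simple importance sampling estimator of the ratio of normalizing constants, written with the unnormalized densities p_A^* and p_B^*:

    \frac{Z_B}{Z_A} = \int \frac{p_B^*(x)}{p_A^*(x)}\, p_A(x)\, dx
                    \approx \frac{1}{M} \sum_{i=1}^{M} \frac{p_B^*(x^{(i)})}{p_A^*(x^{(i)})},
    \qquad x^{(i)} \sim p_A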

Evaluating Replicated Softmax as a Generative Model

- AIS can be viewed as simple importance sampling defined on a much higher-dimensional state space.
- It uses many auxiliary variables to make P_A and P_B closer together.
- AIS starts by defining a sequence of probability distributions p_0, ..., p_S, where p_0 = P_A and p_S = P_B.
- The sequence can be defined using inverse temperatures 0 = β_0 < β_1 < ... < β_S = 1 chosen by the user, as shown below.
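
The standard choice of intermediate distributions is the geometric path between the two unnormalized densities:

    p_k^*(x) = p_A^*(x)^{\,1-\beta_k}\; p_B^*(x)^{\,\beta_k},
    \qquad 0 = \beta_0 < \beta_1 < \dots < \beta_S = 1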

Evaluating Replicated Softmax as a Generative Model

- Using the bipartite structure of RBMs, a better AIS scheme can be devised.
- Reconsider our Replicated Softmax model for a document of D words; its joint distribution is given below.
- The sequence of intermediate distributions, defined by the inverse temperatures β, can then be written in closed form.
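
The following is a reconstruction of the two expressions from the replicated energy above; the explicit form of the intermediate distributions assumes the base distribution p_0 is uniform (i.e. the energy is simply scaled by the inverse temperature β_s), which may differ in detail from the paper's choice of base distribution:

    P(V, h) = \frac{1}{Z} \exp\Big( \sum_{j,k} W_{jk} h_j \hat{v}^k
              + \sum_{k} b^k \hat{v}^k + D \sum_{j} a_j h_j \Big)

    p_s^*(V) = \sum_h \exp\big( -\beta_s E(V, h) \big)
             = \exp\Big( \beta_s \sum_k b^k \hat{v}^k \Big)
               \prod_{j=1}^{F} \Big( 1 + \exp\big( \beta_s \big( D a_j + \sum_k W_{jk} \hat{v}^k \big) \big) \Big)

The second line uses the bipartite structure: the hidden units can be summed out analytically, so each intermediate distribution is available in closed form up to its normalizing constant.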

Evaluating Replicated Softmax as a Generative Model

- The AIS algorithm starts at the distribution p_0 and moves through the intermediate distributions toward the target distribution p_S.
- T_k is a Markov chain transition operator that must be defined for each intermediate distribution p_k.
- Finally, after M runs, the importance weights can be used to find our model's partition function, as shown below.
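
In the standard AIS formulation, a single run produces the importance weight below, and averaging the weights over M independent runs estimates the ratio of partition functions:

    w^{(i)} = \prod_{k=1}^{S} \frac{p_k^*(x_k)}{p_{k-1}^*(x_k)},
    \qquad
    \frac{Z_B}{Z_A} \approx \frac{1}{M} \sum_{i=1}^{M} w^{(i)}

where x_1 is drawn from p_0 and each subsequent x_{k+1} is obtained from x_k by applying the transition operator T_k, which leaves p_k invariant.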

Experimental Results

- Three data sets were used: NIPS proceedings papers, 20-newsgroups, and the Reuters Corpus Volume I.
- The Replicated Softmax was compared to Latent Dirichlet Allocation (a directed Bayesian mixture model).

Experimental Results

- This is the average test perplexity per word over the 50 held-out documents under each training condition.
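
For reference, the average per-word test perplexity reported in the paper is computed as:

    \text{perplexity} = \exp\Big( -\frac{1}{N} \sum_{n=1}^{N} \frac{1}{D_n} \log p(\mathbf{v}_n) \Big)

where N is the number of held-out documents and D_n is the number of words in document n; lower is better.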

Experimental Results

- Precision-recall curves.

Conclusions and Extensions

- Learning with this model is easy and stable.
- It can model documents of different lengths.
- Scaling this method to large problems is not particularly difficult.
- The model also generalizes much better than LDA in terms of both log-probability and retrieval accuracy.
- It is possible that slightly more complex versions of this model, which use other variables such as document-specific metadata (authors, references, publishers), could greatly increase retrieval accuracy.