Latent Dirichlet Allocation

Advanced Artificial Intelligence, October 1, 2009

Outline
Part I: Theoretical Background
1. Motive (Previous Research, Exchangeability)
2. Notation and Terminology
3. Comparison (Geometric Interpretation)
4. Inference (Variational Inference, Parameter Estimation)

Outline (continued)
Part II: Application and Results
5. Example
6. Empirical Results (Document Modeling, Document Classification, Collaborative Filtering)
7. Summary

Part I: Theoretical Background

Why did the research start?
People want to learn many things about various documents.
Large collections of discrete data are hard to handle.
We do not want to lose the essential statistical relationships.

Latent Semantic Indexing
Deerwester et al. proposed LSI.
Uses tools from linear algebra (Singular Value Decomposition) on the term-document occurrence matrix:
- relation between terms and concepts
- relation between concepts and documents
Provides dimensionality reduction and captures basic linguistic notions.

Generative Probabilistic Model of Text Corpora
Developed to substantiate the claims made for LSI.
But then there is no need to use LSI; just use Bayesian methods.

Probabilistic LSI
Models each word in a document as a sample from a mixture of topics.
Has a solid foundation in statistics.
However:
- No probabilistic model at the document level
- Overfitting, since the number of parameters grows linearly with the corpus size
- Not a proper model for unseen documents

de Finetti's Theorem
Bag-of-words: word order is ignored, so words are treated as exchangeable.
Any collection of exchangeable random variables has a representation as a mixture distribution.
This motivates mixture models that capture the exchangeability of both words and documents.

Notation
word - the basic unit of discrete data
document - a sequence of words
corpus - a collection of documents

What is LDA?
A generative probabilistic model of a corpus.
It formalizes our view of document generation.

Generative Process
1. Choose N ∼ Poisson(ξ)
2. Choose θ ∼ Dir(α)
3. For each of the N words w_n:
   a. Choose a topic z_n ∼ Multinomial(θ)
   b. Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
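As a concrete illustration, here is a minimal Python sketch of this generative process. The vocabulary, the number of topics k, and the values of α, β and ξ below are made-up placeholders, not the settings used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) dimensions and parameters
vocab = ["film", "music", "tax", "budget", "school", "students"]
k, V = 2, len(vocab)                 # number of topics, vocabulary size
alpha = np.full(k, 0.5)              # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), k)  # beta[i, j] = p(word j | topic i)
xi = 8                               # mean document length

def generate_document():
    N = rng.poisson(xi)               # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)      # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)    # 3a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])  # 3b. w_n ~ p(w | z_n, beta)
        words.append(vocab[w])
    return words

print(generate_document())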

Graphical Model Representation
α, β - corpus-level parameters
θ_d - document-level variable
z_dn, w_dn - word-level variables

Unigram Model
The words of every document are drawn independently from a single multinomial distribution:
p(w) = ∏_{n=1}^N p(w_n)

Mixture of Unigrams
Each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial p(w | z):
p(w) = ∑_z p(z) ∏_{n=1}^N p(w_n | z)

Probabilistic Latent Semantic Indexing
A document label d and a word w_n are conditionally independent given an unobserved topic z:
p(d, w_n) = p(d) ∑_z p(w_n | z) p(z | d)

Geometric Interpretation
(figure)

Inference
Given α, β and a document, calculate the posterior distribution of the hidden variables:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
where
p(w | α, β) = (Γ(∑_i α_i) / ∏_i Γ(α_i)) ∫ (∏_{i=1}^k θ_i^{α_i − 1}) (∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_ij)^{w_n^j}) dθ
This is intractable due to the coupling between θ and β, so an approximation is used:
- Laplace approximation
- Variational approximation
- Markov chain Monte Carlo

Graphical Model Representation
Modify the original graphical model so that the coupling between θ and β disappears.
(Figure: LDA) (Figure: Variational distribution)

Variational Distribution
The approximation:
q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^N q(z_n | φ_n)
Using the Kullback-Leibler divergence between the variational distribution and the true posterior as the dissimilarity function:
(γ*, φ*) = argmin_{(γ,φ)} D(q(θ, z | γ, φ) || p(θ, z | w, α, β))

Variational Inference Algorithm
initialize φ^0_ni := 1/k for all i and n
initialize γ_i := α_i + N/k for all i
repeat
    for n = 1 to N
        for i = 1 to k
            φ^{t+1}_ni := β_{i,w_n} exp(Ψ(γ^t_i))
        normalize φ^{t+1}_n to sum to 1
    γ^{t+1} := α + ∑_{n=1}^N φ^{t+1}_n
until convergence
O(N²k) complexity
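A minimal NumPy sketch of these per-document updates is given below. The function name, the convergence test on γ, and the iteration cap are my own illustrative choices; β and α are assumed to be given.

import numpy as np
from scipy.special import digamma

def variational_inference(doc_word_ids, alpha, beta, max_iter=100, tol=1e-6):
    """Per-document variational updates for LDA.
    doc_word_ids: array of word indices w_1..w_N for one document
    alpha: (k,) Dirichlet prior; beta: (k, V) topic-word probabilities
    Returns gamma of shape (k,) and phi of shape (N, k)."""
    k = alpha.shape[0]
    N = len(doc_word_ids)
    phi = np.full((N, k), 1.0 / k)        # phi_ni := 1/k
    gamma = alpha + N / k                 # gamma_i := alpha_i + N/k
    for _ in range(max_iter):
        gamma_old = gamma.copy()
        # phi_ni proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        phi = beta[:, doc_word_ids].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)   # normalize each phi_n to sum to 1
        gamma = alpha + phi.sum(axis=0)         # gamma := alpha + sum_n phi_n
        if np.max(np.abs(gamma - gamma_old)) < tol:
            break
    return gamma, phi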

Variational EM Algorithm
Given a corpus of documents D, we wish to find parameters α and β that maximize the (marginal) log likelihood of the data.
Use the variational EM algorithm:
1. (E-step) For each document, find the optimizing values of γ_d and φ_d.
2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to α and β.

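A rough sketch of this outer loop, reusing the variational_inference routine above. The β update shown is the standard expected-count update; the α update, which the paper performs with a Newton-Raphson step, is omitted and α is held fixed here, and the function name and iteration count are illustrative.

import numpy as np

def variational_em(docs, k, V, alpha, n_em_iters=20):
    """docs: list of arrays of word indices; alpha: fixed (k,) Dirichlet prior.
    Returns the estimated topic-word matrix beta of shape (k, V)."""
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(V), k)        # random initial topics
    for _ in range(n_em_iters):
        counts = np.zeros((k, V))              # expected topic-word counts
        for doc in docs:
            _, phi = variational_inference(doc, alpha, beta)  # E-step
            np.add.at(counts.T, doc, phi)      # counts[i, w] += phi_ni
        beta = counts / counts.sum(axis=1, keepdims=True)     # M-step for beta
    return beta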

Part II: Application and Results

TREC AP Corpus
Four example topics and their top words:
Arts: NEW, FILM, SHOW, MUSIC, MOVIE, PLAY, MUSICAL, BEST, ACTOR, FIRST, YORK, OPERA, THEATER, ACTRESS, LOVE
Budgets: MILLION, TAX, PROGRAM, BUDGET, BILLION, FEDERAL, YEAR, SPENDING, NEW, STATE, PLAN, MONEY, PROGRAMS, GOVERNMENT, CONGRESS
Children: CHILDREN, WOMEN, PEOPLE, CHILD, YEARS, FAMILIES, WORK, PARENTS, SAYS, FAMILY, WELFARE, MEN, PERCENT, CARE, LIFE
Education: SCHOOL, STUDENTS, SCHOOLS, EDUCATION, TEACHERS, HIGH, PUBLIC, TEACHER, BENNETT, MANIGAT, NAMPHY, STATE, PRESIDENT, ELEMENTARY, HAITI

Inference on an Unseen Document
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too.

Document Modeling
Compare the generalization performance of the models.
Perplexity is used to evaluate the models:
perplexity(D_test) = exp{ − (∑_{d=1}^M log p(w_d)) / (∑_{d=1}^M N_d) }
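For concreteness, perplexity can be computed from per-document log likelihoods as follows; the numbers below are made up purely for illustration.

import numpy as np

# Made-up per-document log likelihoods and lengths, for illustration only
log_p_w = np.array([-3200.0, -4100.0, -2750.0])  # log p(w_d) for each test document
doc_lengths = np.array([450, 580, 390])           # N_d

perplexity = np.exp(-log_p_w.sum() / doc_lengths.sum())
print(perplexity)  # lower perplexity indicates better generalization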

Perplexity Results
(figure)

Document Classification
Binary classification experiment.
Low-dimensional representation by LDA vs. all word features.
SVM used for training.
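A hedged sketch of this kind of experiment using scikit-learn; the corpus, the number of topics, and the classifier settings below are placeholders, not the setup used in the paper.

from sklearn.datasets import fetch_20newsgroups  # stand-in corpus, not the one in the paper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Binary problem built from two newsgroups (placeholder data)
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)

# Baseline: SVM on all word features
print(cross_val_score(LinearSVC(), counts, data.target, cv=3).mean())

# SVM on low-dimensional LDA features (50 topics, an arbitrary choice here)
lda_features = LatentDirichletAllocation(n_components=50, random_state=0).fit_transform(counts)
print(cross_val_score(LinearSVC(), lda_features, data.target, cv=3).mean())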

Classification Result
Almost no reduction in classification performance, while the feature space is reduced by 99.6 percent.

Collaborative Filtering
Hold out the last word of each test document and measure how well the model predicts it given the remaining words:
predictive-perplexity(D_test) = exp{ − (∑_{d=1}^M log p(w_{d,N_d} | w_{d,1:N_d−1})) / M }

Perplexity Result
(figure)

Summary
A generative probabilistic model for collections of discrete data.
Based on a simple exchangeability assumption.
Exact inference is intractable, so approximate inference algorithms are used.

Correctness
It is hard to evaluate the actual correctness of the model:
- The experiments do not report how many times they were repeated.
- No statistical significance tests are reported.
- Classification is evaluated only for the binary case.

Assessment
Big impact in the field.
Referenced many times.
Many applications of LDA.

Thank you!