Modeling Environment


Modeling Environment. What does it mean to understand/model your environment? Ability to predict. Two approaches to modeling an environment of words and text: Latent Semantic Analysis (LSA) and the Topic Model.

LSA: The set up. D documents, W distinct words, F = W×D co-occurrence matrix, f_wd = frequency of word w in document d. Transforming the co-occurrence matrix: f_wd / f_w is the probability that a randomly chosen instance of w in the corpus comes from document d. H_w = a value in 0-1, where 0 = word appears in only one doc and 1 = word is spread evenly across all documents. (1 - H_w) = specificity: 0 = word tells you nothing about the document; 1 = word tells you a lot about the document.
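The slide does not spell out H_w; one common choice consistent with the description above is the normalized entropy of the word's distribution over documents:

```latex
% Assumed form of the word entropy used for the specificity weight.
% p_{wd} = f_{wd} / f_w is the probability that an instance of w comes from document d.
H_w = -\frac{1}{\log D} \sum_{d=1}^{D} p_{wd} \log p_{wd},
\qquad p_{wd} = \frac{f_{wd}}{f_w}
```

With this normalization, H_w = 0 when the word occurs in a single document and H_w = 1 when it is spread uniformly over all D documents, matching the 0-1 description above.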

G = W×D normalized co-occurrence matrix: a log transform is common for word frequency analysis, the +1 ensures no log(0), and each entry is weighted by specificity (1 - H_w). Representation of word i: row i of G. Problem: this is high dimensional. Problem: it doesn't capture the similarity structure of documents. Dimensionality reduction via SVD: G = A D B, [W×D] = [W×R] [R×R] [R×D]. If R = min(W,D) the reconstruction is perfect; if R < min(W,D) it is the least-squares reconstruction, i.e., it captures whatever structure there is in the matrix with a reduced number of parameters. Reduced representation of word i: row i of (AD). Can use the reduced representation to determine semantic relationships.
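A minimal numpy sketch of this pipeline, assuming the entry transform g_wd = (1 - H_w) log(f_wd + 1) implied above (the slide's exact transform may differ):

```python
import numpy as np

def lsa(F, R):
    """LSA sketch. F: W x D word-by-document count matrix (every word occurs at least once). R: reduced rank."""
    W, D = F.shape
    f_w = F.sum(axis=1, keepdims=True)              # total count of each word
    p = F / f_w                                     # p_wd = f_wd / f_w
    logp = np.zeros_like(p)
    np.log(p, out=logp, where=p > 0)
    H = -(p * logp).sum(axis=1) / np.log(D)         # normalized entropy, 0..1
    G = (1.0 - H)[:, None] * np.log(F + 1.0)        # specificity-weighted, log-transformed matrix
    A, s, B = np.linalg.svd(G, full_matrices=False) # G = A diag(s) B
    word_vecs = A[:, :R] * s[:R]                    # row i of (A D): reduced word representation
    doc_vecs = B[:R, :].T * s[:R]                   # reduced document representation
    return word_vecs, doc_vecs

# Usage: cosine similarity between reduced word vectors approximates semantic relatedness.
# wv, dv = lsa(1 + np.random.poisson(0.3, size=(500, 100)), R=50)
```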

Topic Model (Hofmann, 1999). A probabilistic model of the way language is produced. Generative model: select a document with probability P(D); select a (latent) topic with probability P(Z|D); generate a word with probability P(W|Z); produce pair <d_i, w_i> on draw i. P(D, W, Z) = P(D) P(Z|D) P(W|Z); P(D, W) = sum_z P(D) P(z|D) P(W|z). [Graphical model: D -> Z -> W] Learning: minimize the cross entropy (difference between distributions) of the data and the model, -sum_x Q(x) log P(x), which amounts to maximizing sum_{w,d} n(d,w) log P(d,w).
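A toy sketch of this generative process; the particular distributions below are made-up placeholders, just to make the sampling order concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy distributions for 2 documents, 2 topics, 4 words.
P_D = np.array([0.5, 0.5])                       # P(D)
P_Z_given_D = np.array([[0.9, 0.1],              # P(Z | D): one row per document
                        [0.2, 0.8]])
P_W_given_Z = np.array([[0.4, 0.4, 0.1, 0.1],    # P(W | Z): one row per topic
                        [0.1, 0.1, 0.4, 0.4]])

def draw_pair():
    """Generate one <d_i, w_i> pair: document, then latent topic, then word."""
    d = rng.choice(len(P_D), p=P_D)
    z = rng.choice(P_Z_given_D.shape[1], p=P_Z_given_D[d])
    w = rng.choice(P_W_given_Z.shape[1], p=P_W_given_Z[z])
    return d, w

pairs = [draw_pair() for _ in range(10)]
```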

Topic Model (Griffiths & Steyvers): Notation. P(w_i) is the same as P(W=w_i | D=d_i). P(z_i=j) is the same as P(Z=j | D=d_i), which is θ_j^(d_i). P(w_i | z_i=j) is the same as P(W=w_i | Z=j), which is φ_{w_i}^(j). Thus, the equation P(w_i) = sum_j P(w_i | z_i=j) P(z_i=j) means P(W|D) = sum_j P(W|D, Z=j) P(Z=j|D) = sum_j P(W|Z=j) P(Z=j|D). [Graphical model: D -> Z -> W]

Topic Model (Griffiths & Steyvers): Doing the Bayesian Thing. The two conditional distributions are over discrete alternatives: P(Z=j | D=d_i) and P(W=w_i | Z=j), i.e., θ_j^(d_i) and φ_{w_i}^(j). With n alternatives, a distribution can be represented by n parameters. But suppose you don't represent the distribution directly, but rather you do the Bayesian thing of representing a whole bunch of models, a distribution over distributions...

Intuitive Example. Coin with unknown bias, ρ = probability of heads. Sequence of observations: H T T H T T T H. Maximum likelihood approach: ρ = 3/8. Bayesian approach: a set of models M = { m_ρ }, where the probability of heads associated with model m_ρ is ρ, e.g., M = { m_0.0, m_0.1, m_0.2, ..., m_1.0 }.

Bayesian Model Updates. Bayes rule (posterior = likelihood × prior / evidence): p(m|d) = p(d|m) p(m) / p(d). Likelihood: p(head | m_ρ) = ρ, p(tail | m_ρ) = 1 - ρ. Priors: uniform over the model set, p(m_ρ) = 1/|M|.
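A small sketch of this update applied to the discrete model set of the previous slide; the 11-model grid and uniform prior are assumptions for illustration:

```python
import numpy as np

rho = np.linspace(0.0, 1.0, 11)               # candidate models m_rho: rho = 0.0, 0.1, ..., 1.0
posterior = np.full_like(rho, 1 / len(rho))   # uniform prior p(m_rho)

for flip in "HTTHTTT":                        # the coin flip sequence from the figures
    likelihood = rho if flip == "H" else 1.0 - rho   # p(head|m_rho) = rho, p(tail|m_rho) = 1 - rho
    posterior = likelihood * posterior
    posterior /= posterior.sum()              # divide by p(d): renormalize
    # posterior now reflects all flips seen so far, as in the trial-by-trial figure

print(posterior)                              # concentrates near rho = 2/7 after H T T H T T T
```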

Coin Flip Sequence: H T T H T T T. [Figure: bar plots of the posterior over the discrete model set, one panel per step: priors, trial 1: H, trial 2: T, trial 3: T, trial 4: H, trial 5: T, trial 6: T, trial 7: T.]

Infinite Model Spaces. This all sounds great if you have just a few models, but what if you have infinitely many models? e.g., ρ continuous in [0, 1]. If you limit the form of the probability distributions, you can often do so efficiently, e.g., a beta distribution to represent the priors and posterior in the coin flip example. It requires only two parameters to update: one representing the count of heads, one representing the count of tails.
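A sketch of that two-parameter update, assuming a uniform beta(1,1) starting prior:

```python
from scipy.stats import beta

a, b = 1.0, 1.0                      # beta(a, b) prior; a tracks heads, b tracks tails (plus prior pseudo-counts)
for flip in "HTTHTTT":
    if flip == "H":
        a += 1                       # observing a head increments the head parameter
    else:
        b += 1                       # observing a tail increments the tail parameter

posterior = beta(a, b)               # beta(3, 6) after H T T H T T T with a beta(1,1) prior
print(posterior.mean())              # posterior mean = a / (a + b) = 3/9
```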

Coin Flip Sequence: H T T H T T T. [Figure: beta-distribution posteriors, one panel per step; starting from a uniform beta(1,1) prior, the updates proceed beta(2,1), beta(2,2), beta(2,3), beta(3,3), beta(3,4), beta(3,5), beta(3,6) over trials 1-7.]

Effect of Prior Knowledge. [Figure: the same trial-by-trial updates under two different priors, one with a low head-probability bias and one with a high head-probability bias; panels show the priors and trials 1-7 of H T T H T T T.]

Dirichlet Distribution. The Dirichlet distribution is a generalization of the beta distribution. The beta distribution is a conjugate prior for a binomial RV; the Dirichlet is a conjugate prior for a multinomial RV. Rather than representing uncertainty in the probabilities over two alternatives, a Dirichlet represents uncertainty in the probabilities over n alternatives. You can think of the uncertainty space as n probabilities constrained such that P(x) = 0 if (sum_i x_i) != 1 or if any x_i < 0... or the representational space as n-1 probabilities constrained such that P(x) = 0 if (sum_i x_i) > 1 or if any x_i < 0. A Dirichlet for a multinomial RV with n alternatives has n parameters (the beta has 2). Each parameter is a count of the number of occurrences.
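A sketch of the conjugate Dirichlet-multinomial update for three alternatives; the prior counts and observations are made-up for illustration:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet parameters: one pseudo-count per alternative
observations = [0, 2, 2, 1, 2, 0, 2]     # outcomes of a 3-sided "die" (multinomial draws)

counts = np.bincount(observations, minlength=len(alpha))
posterior_alpha = alpha + counts          # conjugacy: posterior is Dirichlet(alpha + counts)

posterior_mean = posterior_alpha / posterior_alpha.sum()
samples = np.random.default_rng(0).dirichlet(posterior_alpha, size=5)  # sampled probability vectors
```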

Back to the Topic Model (Griffiths & Steyvers). To learn P(Z|D) and P(W|Z), we need to estimate the latent variable Z. Computing P(Z|D,W): P(D, W, Z) = P(D) P(Z|D) P(W|Z); P(D, W) = sum_z P(D) P(z|D) P(W|z); P(Z|D,W) = P(D, W, Z) / P(D, W) = P(Z|D) P(W|Z) / [sum_z P(z|D) P(W|z)]. Doing the Bayesian thing [graphical model: D -> Z -> W]: treat θ and φ as random variables with a Dirichlet distribution, i.e., the numerator becomes P(Z|D) P(W|Z) = ∫_{θ,φ} P(Z|D,θ) P(W|Z,φ) P(θ) P(φ) dθ dφ, and similarly for the denominator. So you don't need to represent θ and φ explicitly, but instead just the parameters of the Dirichlet. These parameters are counts of occurrence.
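For reference, integrating φ out of the word likelihood gives the closed form reported by Griffiths and Steyvers (2004), with W the vocabulary size, T the number of topics, and n_j^{(w)} the number of times word w is assigned to topic j:

```latex
P(\mathbf{w} \mid \mathbf{z})
  = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T}
    \prod_{j=1}^{T}
    \frac{\prod_{w} \Gamma\!\left(n_j^{(w)} + \beta\right)}
         {\Gamma\!\left(n_j^{(\cdot)} + W\beta\right)}
```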

The Equation (reproduced below). Ignore α and β for the moment. First term: the proportion of topic-j draws in which w_i was picked. Second term: the proportion of words in document d_i assigned to topic j. This formula integrates out the Dirichlet uncertainty over the multinomial probabilities! What are α and β? A uniform ("symmetric") prior over the multinomial alternatives; larger values of α and β mean you trust the prior more, i.e., they control how heavily the distributions are smoothed.
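The update referred to above, in the standard form from Griffiths and Steyvers (2004); n_{-i} denotes counts computed with the current assignment z_i excluded, W is the vocabulary size, and T is the number of topics:

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta}
  \;\cdot\;
  \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}
```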

Procedure. MCMC: a procedure for obtaining samples from a complicated distribution, e.g., from P(Z|D,W). 1. Randomly assign each <d_i, w_i> pair a z_i value. 2. For each i, resample z_i according to the equation above (one iteration). 3. Repeat for a burn-in period of many iterations. 4. Use the current z's as a sample: an assignment of each <d_i, w_i> pair to a topic z_i. 5. Run for another block of iterations. 6. Repeat steps 4 and 5 several times. 7. Repeat steps 1-6 several times (fresh random initializations) -> many samples of the z's. Use the sampled assignments to determine the n counts in the equation.
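A compact sketch of this procedure (collapsed Gibbs sampling); the corpus, number of topics, hyperparameters, and iteration counts below are illustrative choices, not the ones used in the original work:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, burn_in=500, n_samples=8, lag=100):
    """docs: list of lists of word ids (each token is one <d_i, w_i> pair). Returns sampled z assignments."""
    d_ids = [d for d, doc in enumerate(docs) for _ in doc]
    w_ids = [w for doc in docs for w in doc]
    N, D = len(w_ids), len(docs)
    z = rng.integers(T, size=N)                       # step 1: random topic for every token
    n_wt = np.zeros((W, T))                           # n^{(w)}_j: word-by-topic counts
    n_dt = np.zeros((D, T))                           # n^{(d)}_j: document-by-topic counts
    for i in range(N):
        n_wt[w_ids[i], z[i]] += 1
        n_dt[d_ids[i], z[i]] += 1

    samples = []
    for it in range(burn_in + n_samples * lag):
        for i in range(N):                            # step 2: resample each z_i in turn
            w, d, j = w_ids[i], d_ids[i], z[i]
            n_wt[w, j] -= 1                           # exclude the current assignment (the "-i" counts)
            n_dt[d, j] -= 1
            p = ((n_wt[w] + beta) / (n_wt.sum(axis=0) + W * beta)) * (n_dt[d] + alpha)
            z[i] = rng.choice(T, p=p / p.sum())
            n_wt[w, z[i]] += 1
            n_dt[d, z[i]] += 1
        if it >= burn_in and (it - burn_in) % lag == 0:   # steps 3-6: burn in, then keep spaced samples
            samples.append(z.copy())
    return samples

# Usage: samples = gibbs_lda(docs=[[0, 1, 2], [2, 3, 3]], W=4, T=2)
```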

Results. [Table: predicting word association norms, e.g., "the" -> ?, "dog" -> ?] [Figure: median rank of the k-th associate.] Combining syntax and semantics in a more complex generative model: an HMM generates tokens from different syntactic categories, and one category produces words from the topic model. [Table: details of this work.] Found the optimal dimensionality for LSA; used the same dimensionality for the Topic Model.