# Study Notes on the Latent Dirichlet Allocation

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection of documents: ={ (),, () }. {,} Dirichlet mixture prior Dirichlet prior Poisson prior ξ Word Word frequency belief (-D) Graphical representation of two-level model Document ={ 1,, } Poisson prior ξ Topic Word Document ={ 1,, } Topic belief (-D) Discrete distribution Dirichlet priors Graphical representation of LDA In two-level generative model, the probability of seeing an -words document is parameterized by hyper parameters {,} of the Dirichlet mixture prior, where contains mixture parameters and contains -dimensional pseudo-count vectors. The probability can be evaluated via conditioning on the word frequency belief as follows (,) = (,,)(,) = ( )(,) = ( ) (,) Γ( = ), (1) Γ( ) Γ( ()) Γ( ) 1

2 which is an explicit formula and where is the -th mixture parameter, is the pseudo-count of the -th keyword of the -th component, =, () is the count of the -th keyword, and Γ stands for gamma function. The two-level model does not explicitly consider the topic issue. Each Dirichlet component may represent a mix of topics. Latent Dirichlet allocation (LDA) aggressively considers the topic issue by incorporating the distribution of the word over topic, that is (, ), where the matrix =[ ] parameterizes this distribution and stands for the probability of seeing the -th keyword under the -th topic. LDA is a three-level generative model in which there is a topic level between the word level and the belief level and in LDA, becomes topic belief. LDA assumes that whenever a key word is observed, there is an associated topic, hence an -words document is associated with a topic sequence of length. By conditioning on and conditional independence, the parameterized probability (,) of seeing can be written as (,) = (,,)(,) = (, )( ). (2) By chain rule and conditional independence, we have (, ) =(,,,,,) =(,,,,,, )(,,,,,) =(, )(,,,,,) = =(,) (,). (3) Plugging (3) into (2) yields (,) = (,) ( ). (4) By conditioning on and conditional independence, we have ( ) =,( ) = ( ) = ( ). (5) Plugging (5) into (4) and exchanging the order of integration and summation yields 2

3 (,) = (,) ( ) Note that = (,) ( ) = (,) ( ) = (,) ( ). (6) (,) = (,) (,) = Plugging (7) into (6) yields the final simplified form (,) = (,) ( ),, = (,). (7) = (,) ( ), (8) which unfortunately does not have explicit formula (the integral cannot be removed because is latent). Note that (,) = (, ), we have (,) = (,) ( ), (9) which can be understood as a form of two level model. But in LDA, there is no direct arc from to, it has to go through the intermediate latent topic level and apply the distribution of word over topic. The derivation of the probability (8) shows that each document can exhibit multiple topics and each word can also associate with multiple topics. This flexibility gives LDA strong power to model a large collection of documents, as compared with some old models like unigram and mixture of unigram. Compared with the plsi, LDA has natural way of assigning probability to a previously unseen document. To estimate, from a corpus ={ (),, () }, one needs to maximize the log likelihood of data: ln (,) = ln (),, (10) 3

4 which is computationally intractable. However, the variational inference method can provide a computationally tractable lower bound. Given a set of estimated parameters {,}, one way to test it is to calculate the perplexity of a test set of documents = { (),, () }. The perplexity has the form perplexity ( ) = exp (),, (11) where is the total number of keywords in -th document. In order to evaluate (11), one has to compute ln (), for each. Let () = (, ). By (8), we have ln (,) =ln()( ) = ln (()). An estimator is ln (() ), where all () are drawn from the Dirichlet distribution Dir(). 2. Variational Inference The key of the variational inference method is to apply the Jensen s inequality to obtain a lower bound of ln (,). Let (, ) be a joint probability density function of,, applying the idea of importance sampling yields ln (,) =ln,,, =ln,,, (, ) (,) (, )ln,,, =(,), (12) (,) where the last step is by Jensen s inequality. Note that (,) = (, )ln,,, (,) Hence, = (, )ln,,,(,) (,) = (, )ln,,, +ln(,) = (, )ln (,) (,),,, + (, )ln(, ) = (, ),,, + ln(, ). 4

5 ln (,) =(,) +(, ),,,, (13) which says ln (,) is the sum of the variational lower bound and the KL-distance between the variational posterior and the true posterior. In variational method, ones wants to find a (,) that is as much close to ln (,) as possible. And this is equivalent to find a (parameterized) (, ) that has as small KL-distance to,,, as possible. Using the fact (factorization by conditional independence) that,,, =,,,, we have =,,, =,,,,, =(, ), (14) (,) = (, )ln (,) (,) = (ln (,)) + ln + ln ln (, ). (15) Now, let (, ) be parameterized: () Graphical representation of variational distribution (,), =, (),, (), =, (),, (), (),, (), = (),, () =,, (),, () =,,, (),, (),, (),, () 5

6 = (),, (),, () = = (), (16) where is Dirichlet parameter and () is multinomial parameter. Applying this variational distribution to (15) yields the expansion of each term as (ln (, )) = (, )ln(, ) = () ln(, ) = () ln(, ) = () ln(, ) = () (ln (,) ) = () ( ln (,) ) = () ln, = () ln, + +ln, = ln, () + + ln, () = () ln, \ () = () ln, = () ln, = () ln,, (17) ln = (, )ln = () ln = () ln = () ln = () ln 6

7 = () ln + +ln = ln () = ln () () = ln () () = ln = () ln. = () ln = () Ψ( ) Ψ. (18) ln = (, )ln = () ln = ln () = ln \ = (ln Γ( ) ln Γ( ) + ( 1)ln ) =lnγ( ) ln Γ( ) + ( 1) ln =lnγ( ) ln Γ( ) + ( 1)Ψ( ) Ψ, (19) ln (, ) = ln () = ln + ln (), ln = ln Γ( ) ln Γ( ) + ( 1)Ψ( ) Ψ, (20) ln () = (, )ln () = () ln () = () ln () = () ln () 7

8 = () ln () () = () ln () = () ln () = () () ln, (21) where Ψ is digamma function. Therefore, (,) =; (),, (), ;, = () ln, + () Ψ( ) Ψ \ + ln Γ( ) ln Γ( ) + ( 1)Ψ( ) Ψ lnγ( )+ ln Γ( ) ( 1)Ψ( ) Ψ () () ln. (22) 3 EM Algorithm The purpose of the EM algorithm is to maximize (10) by maximizing its variational lower bound. Since ln (), () ; (,),, (,), () ;,, we have ln (,) = ln (), () ; (,),, (,), () ;, =(, ;,), (23) which is overall variational lower bound and where = (,) is a tensor and =[ () ] is a matrix. The idea of the EM algorithm is to start from an initial (,), and iteratively improve the estimate via the following alternating E-step and M-step: 8

9 E-step: Given (,), find (, ) to maximize (, ;,), M-step: Given (, ) found from the E-step, find (,) to maximize (, ;,). The two steps are repeated until the value of (, ;,) converges. The update equations can be derived (using Lagrange multipliers) from the overall variational lower bound (, ;,) = () ; (,),, (,), () ;, = (,) ln, () + (,) Ψ () () Ψ + ln Γ( ) ln Γ( ) + ( 1)Ψ () () Ψ () ln Γ( ) ln Γ( () ) + ( () 1)Ψ () () Ψ (,) (,) ln (24) (,) and the constraints = 1, =1,,; =1, =1,,, =1,..,. In E-step, the update equations for and are (,), ()exp Ψ () () Ψ, (25) () = +. (26) (,) And in order for the variational bound converges, it requires alternating between these two equations. Once and are obtained in E-step, the update equation for can be determined by using Lagrange multipliers, it s (,) ( () =), (27) where is indicator function. Finally, there is no analytical form of the update equation for in the M-step, the Newton s method can be employed for obtaining updated by maximizing = ln Γ( ) ln Γ( ) + ( 1)Ψ () () Ψ. (28) Once the EM algorithm converges, it can return two target distributions:, which is word distribution over topic and /( ), which is topic distribution. 9

10 4 Gibbs Sampling Given the data ={ (),, () }, let ={ (),, () } be the corresponding latent variables such that () is associated with (). The central quantity of the Gibbs sampling methods for learning a LDA model is the full conditional posterior distribution ( () \ (),). Note that (,) () \ (), = (). (29) (\,) And most Gibbs sampling methods put a prior on so that (29) is parameterized as (,,) () \ (),,, =. (30) () (\,,) The nominator of right hand side of (30) is (,,) =(,,)(,) =(, )( ). (31) To evaluate (31), one may refer to the conditional independence represented by the following graph: Dirichlet prior Poisson prior Topic ξ Word () () () Topic belief (-D) Discrete distribution Dirichlet priors Document () ={ () () 1,, } Graphical representation of LDA for corpus By the conditional independence for and, we have ( ) = (, )( ) ( ={ (),, () }) = ( )( ) = ( () () ) (() ) () () 10

11 = ( () () ) (() ) () () = (() () ) () = () () (() ) () = () () ( ) = (), (32) () where () is count of topic in in document, and Β is Beta function, (, ) = (,, )(, ) = (, )( ) = ( () (),) ( ) = () (), ( ) = ( ) = (() () ) () () = (() () )() = () () () () = (), (33) () where is count of topic to word in all documents. Plugging (32) and (33) into (31) yields (,,) = () (). (34) () () For the denominator of the right hand side of (30), note that 11

12 \ (),, = (),\ (),\ (), = () \ (),\ (),,\ (),\ (), = () () (),( () ) \ (),\ (), \ (),\ (), = () () () (), (35) where and () respectively correspond to and (),with count associated with () excluded. Let = () and = (), then (30) is reduced to () =\ (), () =,\ (),, () () () () = () () () () () () () () = () () = () () () () () () () () () () () () (). (36) Given this full conditional posterior, the Gibbs sampling procedure is as straightforward as follows: Step 0: Initialize,, Step 1: Iteratively update by drawing () according to (36) Step 2: Update, by maximizing the joint likelihood function (34). Go to step (1) The procedure above is generic. The step 1 usually requires huge number of iterations. The objective function in step 2 is tractable compared with (10) and there is no need for a variational lower bound. 12

13 One can use Newton s method to maximize the log of this likelihood function. Once the burn-out phase is reached, the two target distributions can be estimated as (() )= () + / +, which represents the word distribution over topic and (() )=+ () () / +, which represents the topic distribution in document. () As a special case, is fixed, then the full conditional posterior (36) can be reduced to () =\ (), () =,\ (),, + () 1. (37) In fact, by reducing (34) and (35), we have (,,) =(,,)(,) =(, )( ) () () = \ (),, = (),\ (),\ (),, (38) = () \ (),\ (),,\ (),\ (), = () () (),( () ) \ (),\ (), \ (),\ (), = and dividing (38) by (39) yields (37). () (), (39) Compared with the variational inference method, which is deterministic, the Gibbs sampling is stochastic (in step 1), hence, it s easier for Gibbs sampling s iterations to jump out of local optima. Moreover, it s easier for the Gibbs sampling algorithm to incorporate the uncertainty on. The variational inference method (together with its EM algorithm) introduced in sections 2, 3 assumes that is fixed, whereas it s easy for the Gibbs sampling algorithm to relax this assumption (as the full conditional posterior formula (36) has shown). And as is fixed, the full conditional posterior formula has simpler form. Another advantage is that the Gibbs sampling algorithm relaxes the assumption that documents are independent. In variational inference method, the likelihood of seeing the corpus is the product of the likelihoods of seeing individual documents, however no such assumption is made in the inference based on the Gibbs sampling method. Given the hyperparameters,, the step 1 of the Gibbs sampling procedure can be applied to a corpus = { (),, () } to calculate the perplexity. The burn-out phases of multiple runs can be averaged to 13

14 obtain the estimates of and ={ (),, () }. Given,, the joint likelihood of seeing is simply expressed as (, ) = (,, )(, ) = (, )( ) () = () (), () () = (), () () = (), () () () () (), () () / (), () () () (),\ () () (), () () = () () (), () \ () = (), () () =, () () \ () = (), () () () / (), () () () / (), () () () = (), (40) () where () is number of -th keyword in document. This joint likelihood of corpus leads to a new perplexity formula as perplexity () = exp () Particularly, () () (). (41) =( (), () ). Although this perplexity formula assumes the independence between documents, it does not deal with the integral over. However, the estimation method is still based on sampling. To obtain the expected perplexity, the multiple runs of drawing () are required for obtaining the expected, corresponding to the target corpus. There is a very elegant toy example for testing any implementation of an LDA learning algorithm. Suppose there are 10 topics and 25 words, and the ground truth topic-to-word distribution (10 25 matrix) is represented by the following figure 14

15 The ground truth of (10 25 matrix) Also suppose =1, which means each topic is expected to have equal chance to appear. After generating 2000 documents with each has length 100, it is expected to recover the ground truth of from the artificially generated data. The following figure shows that it s well recovered by the Gibbs sampling procedure. Gibbs sampling results for learning 15

16 5 Application 5.1 Classification For the classification purpose, suppose an LDA model (), () is trained from a corpus that is labeled, then for an unlabeled document, the posterior probability that it belongs to class is ( ) = ( )() = (), () () ( )() (), () (), (42) where (), () is the -th conditional likelihood that can be evaluated according to (8), and () is the portion of documents that is labeled. Practically, due to the small scale of the joint probability of the keywords sequence of a document, only ln (), () is available for each. Let =ln (), () +ln() and = max, then ( ) = ( ) ( ), (43) which is an usual formula for avoiding numerical overflow. Since the size of the dictionary could be very large, the problem of feature selection arises. An ingenious idea that David Blei proposed is to take all the labeled and unlabelled documents as a single corpus and apply the variational inference method for the purpose of feature selection. Note that the objective function is (, ;,) = () ; (,),, (,), () ;,. Upon the termination of the algorithm, we will have () for each. And () is a document-specific parameter vector of the Dirichlet posterior of the topic belief. Since () has much lower dimension than (), we can use () as the new representation of the document. Compared with the usual feature selection methods like mutual information and statistics, which are basically greedy selection methods that represent each feature with a single word, each feature generated by LDA's variational inference process is represented by a topic that is related to many words. This implies that the LDA's feature vector of dimensions carry more information than selecting words. Without the worry about which key-word is informative for whether assigning a document to a class, the LDA features are theoretically superior. Moreover, this feature selection idea can also be implemented when the Gibbs sampling algorithm is used. That is, the Gibbs sampling algorithm can be applied on all the labeled and unlabelled documents. consequently, for each document, we can estimate the expected topic composition (() ) as () =+ () () / +, where () is counted from () that is generated by the Gibbs sampling. We can use () as the new dimension-reduced representation of the document for the purpose of document classification. Sometimes it's interesting to estimate ( ) where is a topic hidden in unlabelled documents. By conditioning on unlabelled document, we have ( ) = (,)( ) 16

17 = ( )( ) = ( )( ) () () = ( )( ). (44) () Practically there are only limited number of unlabelled documents, say ={ (),, () }. But one estimator of (44) can be built as ( ) = () () () = () ( () ) ( () ) = () () ( ) ( () ) = () (),, (45) ( () ) where () can be computed using (43) and ( () ) can be computed as ( () )= (), () (), () (). One may need special care for computing = in order to ( () ) avoid numerical overflow. Let =ln (), = ln (), and =ln( () ), then = ( ) =exp( ( ) ). Another method for computing (45) is to apply the Gibbs sampling to infer the latent topics sequence () for (). Given (), () can be estimated as the portion of the topic in (). And finally () can be estimated as the portion of the topic in (),..., (). Since () = ( )() =( ), one estimator of () can be (), which yields ( ) = () () () () () = (), (46) which is the weighted average of the ( ) over (),, (). It may need the model averaging in burn-out phase of the Gibbs sampling procedure to obtain averaged estimate of (). 5.2 Similarity The similarity between two documents () and () can be measured by the distance between (() ) and (() ), each of which is the expected topic composition. Given the LDA hyper- 17

18 parameters {,}, they are evaluated as () =+ () () / + and () = + () () / +, where () and () are counted from () and () that are generated from Gibbs sampling. One can calculate the similarity measure of () and (). If the variational inference method is used, then one can calculate the similarity measure of () / () and () () /. 5.3 Supervised LDA Dirichlet prior, response value Topic belief (-D) Poisson prior ξ Topic Discrete distribution Word Dirichlet priors Document ={ 1,, } Graphical representation of slda,,,, =,,,,,,,, =,, (, )( ) =,, (, ) =,, (, ) =,, (,). (47),, ~,, (48) where is the counts vector of.,,,, =,,,,. (49) (,) 18

19 It's not quite clear how this method (by David Blei and Jon McAliffe, 2007) is superior to the method of first finding the LDA feature representations and then do regression. Also, compared with (8), the summation Σ in (47) cannot be simplified. However, the experimental results reported in the 2007 paper shows better results than the regression using LDA features and Lasso. (to be continued ) 19

### Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

### Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

### CS Lecture 18. Topic Models and LDA

CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and

### VIBES: A Variational Inference Engine for Bayesian Networks

VIBES: A Variational Inference Engine for Bayesian Networks Christopher M. Bishop Microsoft Research Cambridge, CB3 0FB, U.K. research.microsoft.com/ cmbishop David Spiegelhalter MRC Biostatistics Unit

### Lecture 8: Graphical models for Text

Lecture 8: Graphical models for Text 4F13: Machine Learning Joaquin Quiñonero-Candela and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

### Lecture 21: Spectral Learning for Graphical Models

10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

### Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

### Basic math for biology

Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

### Latent Variable View of EM. Sargur Srihari

Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time

### Latent Dirichlet Allocation

Latent Dirichlet Allocation 1 Directed Graphical Models William W. Cohen Machine Learning 10-601 2 DGMs: The Burglar Alarm example Node ~ random variable Burglar Earthquake Arcs define form of probability

### Modeling User Rating Profiles For Collaborative Filtering

Modeling User Rating Profiles For Collaborative Filtering Benjamin Marlin Department of Computer Science University of Toronto Toronto, ON, M5S 3H5, CANADA marlin@cs.toronto.edu Abstract In this paper

### Variational Principal Components

Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

THIS IS A DRAFT VERSION. FINAL VERSION TO BE PUBLISHED AT NIPS 06 A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee Whye Teh School of Computing National University

### Exponential Families

Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Expectation Maximization (EM)

Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

### Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

### ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### Statistical Pattern Recognition

Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

### STATS 306B: Unsupervised Learning Spring Lecture 3 April 7th

STATS 306B: Unsupervised Learning Spring 2014 Lecture 3 April 7th Lecturer: Lester Mackey Scribe: Jordan Bryan, Dangna Li 3.1 Recap: Gaussian Mixture Modeling In the last lecture, we discussed the Gaussian

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

### CSE 150. Assignment 6 Summer Maximum likelihood estimation. Out: Thu Jul 14 Due: Tue Jul 19

SE 150. Assignment 6 Summer 2016 Out: Thu Jul 14 ue: Tue Jul 19 6.1 Maximum likelihood estimation A (a) omplete data onsider a complete data set of i.i.d. examples {a t, b t, c t, d t } T t=1 drawn from

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

### METHODS FOR IDENTIFYING PUBLIC HEALTH TRENDS. Mark Dredze Department of Computer Science Johns Hopkins University

METHODS FOR IDENTIFYING PUBLIC HEALTH TRENDS Mark Dredze Department of Computer Science Johns Hopkins University disease surveillance self medicating vaccination PUBLIC HEALTH The prevention of disease,

### Introduction to Machine Learning Midterm Exam Solutions

10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

### Today. Calculus. Linear Regression. Lagrange Multipliers

Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject

### COM336: Neural Computing

COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

### Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

### Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

### Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability

Parallelized Variational EM for Latent Dirichlet Allocation: An Experimental Evaluation of Speed and Scalability Ramesh Nallapati, William Cohen and John Lafferty Machine Learning Department Carnegie Mellon

### Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign sun22@illinois.edu Hongbo Deng Department of Computer Science University

### STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

### Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Kyu-Baek Hwang and Byoung-Tak Zhang Biointelligence Lab School of Computer Science and Engineering Seoul National University Seoul 151-742 Korea E-mail: kbhwang@bi.snu.ac.kr

### PROBABILISTIC LATENT SEMANTIC ANALYSIS

PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications

### Generative Models for Discrete Data

Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

### Introduction to Graphical Models

Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

### Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### Notes on Markov Networks

Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

### bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

### The Expectation Maximization Algorithm

The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

### Contrastive Divergence

Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive

### Bayesian time series classification

Bayesian time series classification Peter Sykacek Department of Engineering Science University of Oxford Oxford, OX 3PJ, UK psyk@robots.ox.ac.uk Stephen Roberts Department of Engineering Science University

### arxiv: v1 [cs.lg] 16 Sep 2014

Journal of Machine Learning Research (2000) -48 Submitted 4/00; Published 0/00 Collapsed Variational Bayes Inference of Infinite Relational Model arxiv:409.4757v [cs.lg] 6 Sep 204 Katsuhiko Ishiguro NTT

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

### Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

### Pattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM

Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### Hierarchical Dirichlet Processes

Hierarchical Dirichlet Processes Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David M. Blei Computer Science Div., Dept. of Statistics Dept. of Computer Science University of California at Berkeley

### Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

### GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

### Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

### Markov Networks.

Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function

### Measuring Topic Quality in Latent Dirichlet Allocation

Measuring Topic Quality in Sergei Koltsov Olessia Koltsova Steklov Institute of Mathematics at St. Petersburg Laboratory for Internet Studies, National Research University Higher School of Economics, St.

### Clustering, K-Means, EM Tutorial

Clustering, K-Means, EM Tutorial Kamyar Ghasemipour Parts taken from Shikhar Sharma, Wenjie Luo, and Boris Ivanovic s tutorial slides, as well as lecture notes Organization: Clustering Motivation K-Means

Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### arxiv: v6 [stat.ml] 11 Apr 2017

Improved Gibbs Sampling Parameter Estimators for LDA Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA arxiv:1505.02065v6 [stat.ml] 11 Apr 2017 Yannis Papanikolaou

### Review and Motivation

Review and Motivation We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into N clusters. Once we know which

### PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY. Arto Klami

PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY Arto Klami 1 PROBABILISTIC PROGRAMMING Probabilistic programming is to probabilistic modelling as deep learning is to neural networks (Antti Honkela,

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013

Conditional Random Fields and beyond DANIEL KHASHABI CS 546 UIUC, 2013 Outline Modeling Inference Training Applications Outline Modeling Problem definition Discriminative vs. Generative Chain CRF General

### Does the Wake-sleep Algorithm Produce Good Density Estimators?

Does the Wake-sleep Algorithm Produce Good Density Estimators? Brendan J. Frey, Geoffrey E. Hinton Peter Dayan Department of Computer Science Department of Brain and Cognitive Sciences University of Toronto

### PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY

PROBABILISTIC PROGRAMMING: BAYESIAN MODELLING MADE EASY Arto Klami Adapted from my talk in AIHelsinki seminar Dec 15, 2016 1 MOTIVATING INTRODUCTION Most of the artificial intelligence success stories

### σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

### Maximum Likelihood Estimation of Dirichlet Distribution Parameters

Maximum Lielihood Estimation of Dirichlet Distribution Parameters Jonathan Huang Abstract. Dirichlet distributions are commonly used as priors over proportional data. In this paper, I will introduce this

### Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

### LEARNING WITH BAYESIAN NETWORKS

LEARNING WITH BAYESIAN NETWORKS Author: David Heckerman Presented by: Dilan Kiley Adapted from slides by: Yan Zhang - 2006, Jeremy Gould 2013, Chip Galusha -2014 Jeremy Gould 2013Chip Galus May 6th, 2016

### Source Separation Tutorial Mini-Series III: Extensions and Interpretations to Non-Negative Matrix Factorization

Source Separation Tutorial Mini-Series III: Extensions and Interpretations to Non-Negative Matrix Factorization Nicholas Bryan Dennis Sun Center for Computer Research in Music and Acoustics, Stanford University

### Bayesian linear regression

Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Petr Volf. Model for Difference of Two Series of Poisson-like Count Data

Petr Volf Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodárenskou věží 4, 182 8 Praha 8 e-mail: volf@utia.cas.cz Model for Difference of Two Series of Poisson-like

### Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

### 19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

### Haupthseminar: Machine Learning. Chinese Restaurant Process, Indian Buffet Process

Haupthseminar: Machine Learning Chinese Restaurant Process, Indian Buffet Process Agenda Motivation Chinese Restaurant Process- CRP Dirichlet Process Interlude on CRP Infinite and CRP mixture model Estimation

### Introduction to Gaussian Process

Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

Advanced topics from statistics Anders Ringgaard Kristensen Advanced Herd Management Slide 1 Outline Covariance and correlation Random vectors and multivariate distributions The multinomial distribution

### IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 589 Gamma Markov Random Fields for Audio Source Modeling Onur Dikmen, Member, IEEE, and A. Taylan Cemgil, Member,

### Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

### The classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know

The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is

### Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

### Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations

Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations Rafael Rumí, Antonio Salmerón Department of Statistics and Applied Mathematics University of Almería,

### Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

### Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Dan Oneaţă 1 Introduction Probabilistic Latent Semantic Analysis (plsa) is a technique from the category of topic models. Its main goal is to model cooccurrence information

### Based on slides by Richard Zemel

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

### CS 188: Artificial Intelligence

CS 188: Artificial Intelligence Hidden Markov Models Instructor: Anca Dragan --- University of California, Berkeley [These slides were created by Dan Klein, Pieter Abbeel, and Anca. http://ai.berkeley.edu.]

### Priors for Random Count Matrices with Random or Fixed Row Sums

Priors for Random Count Matrices with Random or Fixed Row Sums Mingyuan Zhou Joint work with Oscar Madrid and James Scott IROM Department, McCombs School of Business Department of Statistics and Data Sciences