CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu
Logistics Assignment 3 is due on 3/30. 4/13: course project presentation. 4/20: final exam.
What we learned last time: sequential labeling models, Hidden Markov Models, Maximum-entropy Markov models, Conditional Random Fields
Sample Markov Model for POS [figure: state-transition diagram over the states start, Det, Noun, PropNoun, Verb, and stop, with transition probabilities labeling the arcs]
The Markov Assumption: the next state depends only on the current state, $P(s_t \mid s_1, \ldots, s_{t-1}) = P(s_t \mid s_{t-1})$ (first-order).
Hidden Markov Models (HMMs) [figure: observed words emitted by a hidden sequence of part-of-speech tags]
Formally: an HMM consists of a set of hidden states with transition probabilities $P(t_i \mid t_{i-1})$ and emission probabilities $P(w_i \mid t_i)$, so the joint probability of a tag and word sequence factorizes as $P(w_{1:T}, t_{1:T}) = \prod_i P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$.
Viterbi Backtrace [figure: trellis over states $s_0, s_1, s_2, \ldots, s_N, s_F$ and time steps $t_1, \ldots, t_T$; backpointers from $s_F$ recover the most likely sequence: $s_0\, s_N\, s_1\, s_2\, s_2\, s_F$]
Log-Linear Models
Using Log-Linear Models
Conditional Random Fields (CRFs)
Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation [Some slides are borrowed from Christopher Bishop and David Sontag]
K-means Algorithm Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype (mean). Initialize prototypes, then iterate between two phases: Step 1: assign each data point to the nearest prototype. Step 2: update prototypes to be the cluster means. Simplest version is based on Euclidean distance.
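A minimal NumPy sketch of this two-phase loop; the random-data-point initialization and the empty-cluster fallback are my own assumptions, not prescribed by the slide:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means with Euclidean distance. X is an (N, D) array."""
    rng = np.random.default_rng(seed)
    # Initialize prototypes by picking K distinct data points at random.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        # (keep the old prototype if a cluster ends up empty).
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return mu, z
```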
[Figures illustrating successive K-means iterations, alternating assignment and mean-update steps; slides from Christopher M. Bishop, BCS Summer School, Exeter, 2003]
The Gaussian Distribution Multivariate Gaussian: $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}$, with mean $\mu$ and covariance $\Sigma$.
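As a quick sanity check, the density above can be evaluated directly and compared against scipy.stats.multivariate_normal; the mean, covariance, and query point below are made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                    # made-up mean
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # made-up covariance
x = np.array([0.5, 0.5])

# Density evaluated directly from the formula above.
D = len(mu)
diff = x - mu
direct = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
         np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
print(direct, multivariate_normal(mu, Sigma).pdf(x))  # the two agree
```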
Gaussian Mixtures Linear superposition of Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Normalization and positivity require $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$. Can interpret the mixing coefficients $\pi_k$ as prior probabilities.
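A minimal sketch of evaluating such a superposition in Python; the mixing coefficients, means, and covariances below are made-up toy values:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-D mixture with K = 3 components.
pis = np.array([0.5, 0.3, 0.2])          # mixing coefficients: nonnegative, sum to 1
mus = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

def mixture_pdf(x):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal(mu, S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))
```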
Example: Mixture of 3 Gaussians
Contours of Probability Distribution
Sampling from the Gaussian To generate a data point: first pick one of the components with probability $\pi_k$, then draw a sample from that component. Repeat these two steps for each new data point.
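A short NumPy sketch of this two-step (ancestral) sampling procedure, using made-up toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])                       # made-up mixing coefficients
mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # made-up component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])])

# Step 1: pick a component k with probability pi_k (for 500 points at once).
ks = rng.choice(3, size=500, p=pis)
# Step 2: draw each sample from its chosen component.
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
```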
Synthetic Data Set
Synthetic Data Set Without Labels
Fitting the Gaussian Mixture
We wish to invert this process: given the data set, find the corresponding parameters: mixing coefficients, means, and covariances.
If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster.
Problem: the data set is unlabelled.
We shall refer to the labels as latent (= hidden) variables.
Synthetic Data Set Without Labels
Posterior Probabilities We can think of the mixing coefficients as prior probabilities for the components. For a given value of $x$ we can evaluate the corresponding posterior probabilities, called responsibilities. These are given by Bayes' theorem: $\gamma_k(x) \equiv p(k \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$
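A small sketch of computing these responsibilities for a whole data set at once; the parameter names are placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    # Numerator for every point n and component k, shape (N, K).
    weighted = np.column_stack([
        pi * multivariate_normal(mu, S).pdf(X)
        for pi, mu, S in zip(pis, mus, Sigmas)])
    # Normalize each row so responsibilities sum to 1 over components.
    return weighted / weighted.sum(axis=1, keepdims=True)
```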
Posterior Probabilities (colour coded)
Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation
[Figures illustrating EM for a Gaussian mixture: successive E and M steps refining the fit; slides from Christopher M. Bishop, BCS Summer School, Exeter, 2003]
EM in General
Consider an arbitrary distribution $q(Z)$ over the latent variables ($p$ denotes the true posterior $p(Z \mid X, \theta)$).
The following decomposition always holds: $\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$, where
$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}$ and $\mathrm{KL}(q \,\|\, p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}$.
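To see why the decomposition holds, one can take an expectation of $\ln p(X \mid \theta)$ under $q$ (which changes nothing, since it does not depend on $Z$) and split the logarithm; a standard derivation, shown here in LaTeX:

```latex
\ln p(X \mid \theta)
  = \sum_Z q(Z) \ln p(X \mid \theta)
  = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta)}
  = \underbrace{\sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q,\,\theta)}
  + \underbrace{\sum_Z q(Z) \ln \frac{q(Z)}{p(Z \mid X, \theta)}}_{\mathrm{KL}(q \,\|\, p)\;\ge\;0}
```

Since the KL term is nonnegative, $\mathcal{L}(q, \theta)$ is a lower bound on $\ln p(X \mid \theta)$.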
Decomposition [figure: $\ln p(X \mid \theta)$ split into the bound $\mathcal{L}(q, \theta)$ and the gap $\mathrm{KL}(q \,\|\, p) \ge 0$]
Optimizing the Bound
E-step: maximize $\mathcal{L}(q, \theta)$ with respect to $q(Z)$; equivalent to minimizing the KL divergence; sets $q(Z)$ equal to the posterior distribution $p(Z \mid X, \theta^{\text{old}})$.
M-step: maximize the bound with respect to $\theta$; equivalent to maximizing the expected complete-data log likelihood.
Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum.
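The E/M cycle above, specialized to a Gaussian mixture, can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration rather than the course's exact code; the initialization scheme and the small diagonal jitter are my own assumptions for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: alternate the E-step and M-step above."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                         # uniform mixing weights
    mus = X[rng.choice(N, size=K, replace=False)]     # random data points as means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] via Bayes' theorem.
        gamma = np.column_stack([
            pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
            for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected (soft) assignments.
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                        + 1e-6 * np.eye(D)  # jitter keeps covariances invertible
    return pis, mus, Sigmas
```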
E-step [figure: setting $q(Z) = p(Z \mid X, \theta^{\text{old}})$ makes the KL term vanish, so the bound touches $\ln p(X \mid \theta^{\text{old}})$]
M-step [figure: maximizing $\mathcal{L}(q, \theta)$ over $\theta$ raises the bound, and with it the incomplete-data log likelihood]
Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation [Slides are based on David Blei's ICML 2012 tutorial]
Generative model for a document in LDA Each topic $\beta_k$ is a distribution over the vocabulary. For each document $d$: draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$; then for each word position $n$: draw a topic assignment $z_{d,n} \sim \mathrm{Mult}(\theta_d)$ and draw the word $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$.
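A toy NumPy sketch of this generative process; the topic count, vocabulary size, document length, and Dirichlet parameters below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 1000, 50      # made-up: topics, vocabulary size, words per doc
alpha = np.full(K, 0.1)          # Dirichlet prior over per-document topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=K)  # K topics: distributions over words

def generate_document():
    theta = rng.dirichlet(alpha)                 # theta_d ~ Dir(alpha)
    zs = rng.choice(K, size=doc_len, p=theta)    # z_{d,n} ~ Mult(theta_d)
    ws = [rng.choice(V, p=beta[z]) for z in zs]  # w_{d,n} ~ Mult(beta_{z_{d,n}})
    return ws, zs
```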
Comparison of mixture and admixture models: in a plain mixture model each document is generated by a single topic, whereas in an admixture model such as LDA each document mixes all topics in its own proportions.
Usage of LDA: e.g., discovering the themes that run through a corpus, organizing and browsing large document collections, and deriving low-dimensional document representations for downstream tasks.
EM for mixture models: the E-step computes the posterior over the latent component for each observation; the M-step re-estimates the mixture parameters from these soft assignments.
What We Learned Today: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation
Homework
Reading: Murphy 11.1-11.2, 11.4.1-11.4.4, 27.1-27.3
More about EM: http://cs229.stanford.edu/notes/cs229-notes7b.pdf and http://cs229.stanford.edu/notes/cs229-notes8.pdf
More about LDA: http://menome.com/wp/wp-content/uploads/2014/12/Blei2011.pdf and http://obphio.us/pdfs/lda_tutorial.pdf