CSC2515 Winter 2015: Introduction to Machine Learning
Lecture 5: Clustering, mixture models, and EM
All lecture slides will be available as .pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/CSC2515_Winter15.html
Overview
- Clustering with K-means, and a proof of convergence
- Clustering with K-medoids
- Clustering with a mixture of Gaussians
- The EM algorithm, and a proof of convergence
- Variational inference
Unsupervised Learning
- Supervised learning algorithms have a clear goal: produce desired outputs for given inputs.
- The goal of unsupervised learning algorithms (no explicit feedback on whether the system's outputs are correct) is less clear:
  - Reduce dimensionality
  - Find clusters
  - Model data density
  - Find hidden causes
- Key utility:
  - Compress data
  - Detect outliers
  - Facilitate other learning
Major types
Primary problems and approaches in unsupervised learning fall into three types:
1. Dimensionality reduction: represent each input case using a small number of variables (e.g., principal components analysis, factor analysis, independent components analysis)
2. Clustering: represent each input case using a prototype example (e.g., k-means, mixture models)
3. Density estimation: estimate the probability distribution over the data space
Clustering
- Grouping N examples into K clusters is one of the canonical problems in unsupervised learning.
- Motivations: prediction; lossy compression; outlier detection.
- We assume that the data was generated from a number of different classes. The aim is to cluster data from the same class together.
  - How many classes?
  - Why not put each datapoint into a separate class?
- What is the objective function that is optimized by sensible clusterings?
The K-means algorithm
- Assume the data lives in a Euclidean space.
- Assume we want K classes.
- Initialization: randomly located cluster centers.
- The algorithm alternates between two steps:
  - Assignment step: assign each datapoint to the closest cluster.
  - Refitting step: move each cluster center to the center of gravity of the data assigned to it.
K-means algorithm
- Initialization: set K means {m_k} to random values.
- Assignment: each datapoint n is assigned to the nearest mean; responsibilities:
    \hat{k}^{(n)} = \arg\min_k d(m_k, x^{(n)}),   r_k^{(n)} = 1 if \hat{k}^{(n)} = k, else 0
- Update: the model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for:
    m_k = \sum_n r_k^{(n)} x^{(n)} / \sum_n r_k^{(n)}
- Repeat the assignment and update steps until the assignments do not change.
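The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration under the slide's assumptions (squared Euclidean distance, means initialized at random datapoints); the function name `kmeans` and its arguments are our own, not part of the course code.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K random datapoints as the initial means.
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest mean
        # (squared Euclidean distance).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Update step: move each mean to the center of gravity of
        # the points assigned to it (keep old mean if cluster is empty).
        new_means = np.array([X[assign == k].mean(axis=0)
                              if (assign == k).any() else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):  # nothing moved -> converged
            break
        means = new_means
    return means, assign
```

Note the empty-cluster guard: with random initialization a mean can end up responsible for no points, and the slide's update is then undefined.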
Questions about K-means
- Why does the update set m_k to the mean of the assigned points?
- Where does the distance d come from? What if we used a different distance measure? How can we choose the best distance?
- How do we choose K? How can we choose between alternative clusterings?
- Will it converge?
- Hard cases: unequal spreads, non-circular spreads, in-between points.
Why K-means converges
- Whenever an assignment is changed, the sum of squared distances of datapoints from their assigned cluster centers is reduced.
- Whenever a cluster center is moved, the sum of squared distances of the datapoints from their currently assigned cluster centers is reduced.
- Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).
K-means: Details
- Objective: minimize the sum of squared distances of datapoints to their assigned cluster centers:
    E(\{m\}, \{r\}) = \sum_n \sum_k r_k^{(n)} \| m_k - x^{(n)} \|^2
    s.t. \sum_k r_k^{(n)} = 1 for all n;  r_k^{(n)} \in \{0, 1\} for all k, n
- The optimization method is a form of coordinate descent ("block coordinate descent"):
  - Fix centers, optimize assignments (choose the cluster whose mean is closest).
  - Fix assignments, optimize means (average of assigned datapoints).
Local minima
- There is nothing to prevent K-means from getting stuck at local minima. (Figure: a bad local optimum.)
- We could try many random starting points.
- We could try non-local split-and-merge moves: simultaneously merge two nearby clusters and split a big cluster into two.
Application of K-Means Clustering
K-medoids
- K-means: choose the number of clusters K; the algorithm's primary aim is to strategically position these K means.
- Alternative: allow each datapoint to potentially act as a cluster representative; the algorithm assigns each point to one of these representatives (exemplars).
K-medoids algorithm
- Initialization: set a random subset of K datapoints as the medoids.
- Assignment: each datapoint n is assigned to the nearest medoid:
    \hat{k}^{(n)} = \arg\min_k d(m_k, x^{(n)}),   r_k^{(n)} = 1 if \hat{k}^{(n)} = k, else 0
- Update: for each medoid k and each non-medoid datapoint n, swap k and n and compute the total cost J; select the configuration with the lowest cost:
    J(\{m\}, \{r\}) = \sum_n \sum_k r_k^{(n)} d(x^{(n)}, m_k)
- Repeat the assignment and update steps until the assignments do not change.
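A minimal sketch of this swap-based update, assuming Euclidean distance (the algorithm works with any distance function). The pairwise-distance matrix is pre-computed once, as the next slide points out; `kmedoids` and its arguments are illustrative names of our own.

```python
import numpy as np

def kmedoids(X, K, n_iters=50, seed=0):
    """K-medoids with swap-based updates; cluster centers are datapoints."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Pre-compute all pairwise distances once (here: Euclidean).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    medoids = list(rng.choice(N, size=K, replace=False))

    def cost(meds):
        # Total distance of every point to its nearest medoid.
        return D[:, meds].min(axis=1).sum()

    for _ in range(n_iters):
        improved = False
        for ki in range(K):
            for n in range(N):
                if n in medoids:
                    continue
                trial = medoids.copy()
                trial[ki] = n          # try swapping medoid ki with point n
                if cost(trial) < cost(medoids):
                    medoids = trial
                    improved = True
        if not improved:               # no swap lowers J -> local optimum
            break
    assign = D[:, medoids].argmin(axis=1)
    return medoids, assign
```

Because the medoids are indices into the data, only the distance matrix is ever needed, never the coordinates themselves.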
K-medoids vs. K-means
- Both partition the data into K partitions.
- Both can use various distance functions, but K-medoids can pre-compute all pairwise distances.
- K-medoids chooses datapoints as centers (medoids/exemplars), while K-means allows the means to be arbitrary locations: discrete vs. continuous optimization.
- K-medoids is more robust to noise and outliers.
Soft K-means
- Instead of making hard assignments of datapoints to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a datapoint and another may have a responsibility of .3.
- This allows a cluster to use more information about the data in the refitting step.
- What happens to our convergence guarantee?
- How do we decide on the soft assignments?
Soft K-means algorithm
- Initialization: set K means {m_k} to random values.
- Assignment: each datapoint n is given a soft degree of assignment to each cluster mean k, based on responsibilities:
    r_k^{(n)} = \exp(-\beta d(m_k, x^{(n)})) / \sum_{k'} \exp(-\beta d(m_{k'}, x^{(n)}))
- Update: the model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for:
    m_k = \sum_n r_k^{(n)} x^{(n)} / \sum_n r_k^{(n)}
- Repeat the assignment and update steps until the assignments do not change.
Questions about soft K-means
- How do we set \beta?
- What about problems with elongated clusters?
- What about clusters with unequal weight and width?
Latent variable models
- Adopt a different view of clustering, in terms of a model with latent variables: variables that are always unobserved.
- We may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly; this can actually simplify the model.
- A form of divide-and-conquer: use simple parts to build complex models (e.g., multimodal densities, or piecewise-linear regression).
Mixture models
- The most common example is a mixture model: the most basic form of latent variable model, with a single discrete latent variable.
- We have defined the hidden cause to be discrete: a multinomial random variable. The observation is Gaussian.
- The model allows for other distributions, e.g., Bernoulli observations.
- Another example: a continuous hidden (latent) variable (see next lecture).
- (Graphical model: hidden cause → visible effect.)
- But these are only the simplest models: we can add many hidden and visible nodes and layers.
Learning is harder with latent variables
- In fully observed settings, the probability model is a product, so the log-likelihood is a sum whose terms decouple:
    \ell(\theta; D) = \log p(x, y | \theta) = \log p(y | \theta_y) + \log p(x | y, \theta_x)
- With latent variables, the probability contains a sum, so the log-likelihood has all parameters coupled together:
    \ell(\theta; D) = \log \sum_z p(x, z | \theta) = \log \sum_z p(z | \theta_z) \, p(x | z, \theta_x)
Direct learning in mixtures of Gaussians
- We can treat the likelihood as an objective function and try to optimize it as a function of \theta by taking gradients (as we did before, for example in neural networks).
- If you work through the gradients, you'll find that
    \partial \ell / \partial \theta_k = \sum_n r_k^{(n)} \, \partial \log p(x^{(n)} | \theta_k) / \partial \theta_k
  where r_k^{(n)} = \alpha_k p(x^{(n)} | \theta_k) / \sum_j \alpha_j p(x^{(n)} | \theta_j) is the responsibility of component k for datapoint n.
- In a mixture of Gaussians, for example:
    \partial \ell / \partial \mu_k = \sum_n r_k^{(n)} (x^{(n)} - \mu_k) / \sigma_k^2
- To use optimization methods (e.g., conjugate gradient), we have to ensure that the parameters respect their constraints: reparametrize in terms of unconstrained values.
EM: An alternative learning approach
- Use the posterior weightings to softly label the data.
- Then solve for the parameters given these current weightings, and recalculate the posteriors (weights) given the new parameters.
- Expectation-Maximization is a form of bound optimization, as opposed to gradient descent.
- With respect to the latent variables, guessing their values makes the learning fully observed:
    \ell(\theta; D) = \log p(x, z | \theta) = \log p(z | \theta_z) + \log p(x | z, \theta_x)
  whereas with latent variables the probability contains a sum, so the log-likelihood has all parameters coupled together:
    \ell(\theta; D) = \log \sum_z p(z | \theta_z) \, p(x | z, \theta_x)
- Note: EM is not a cost function, such as cross-entropy; and EM is not a model, such as a mixture of Gaussians.
Graphical model view of mixture models
- Each node is a random variable.
- Blue node: observed variable (the data, aka visibles).
- Red node: hidden variable (the cluster assignment).
- The model defines a probability distribution over all the nodes.
- The model generates data by picking a state for the hidden node based on the prior; the distribution over the leaf node (the data) is defined by its parent(s).
- (Graphical model: hidden cause → visible effect.)
The mixture of Gaussians generative model
- First pick one of the K Gaussians with a probability that is called its mixing proportion.
- Then generate a random point from the chosen Gaussian.
- The probability of generating the exact data we observed is zero, but we can still try to maximize the probability density:
  - Adjust the means of the Gaussians.
  - Adjust the variances of the Gaussians on each dimension.
  - Adjust the mixing proportions of the Gaussians.
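This two-stage generative process (pick a component, then sample from it) can be sketched directly; a minimal ancestral-sampling illustration for an axis-aligned mixture, with our own function and parameter names:

```python
import numpy as np

def sample_mog(pi, mus, sigmas, n, seed=0):
    """Ancestral sampling from a mixture of axis-aligned Gaussians.
    pi: (K,) mixing proportions; mus, sigmas: (K, D) per-component
    means and standard deviations."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the hidden cause z with probability pi[z].
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: generate the visible effect x from the chosen Gaussian.
    x = rng.normal(loc=np.asarray(mus)[z], scale=np.asarray(sigmas)[z])
    return z, x
```

The returned `z` is exactly the latent variable of the previous slides: observed when we sample, hidden when we only see `x`.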
Fitting a mixture of Gaussians
Optimization uses the Expectation-Maximization algorithm, which alternates between two steps:
- E-step: compute the posterior probability that each Gaussian generates each datapoint.
- M-step: assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
The E-step: Computing responsibilities
- In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint?
- We cannot be sure, so it's a distribution over all possibilities.
- Use Bayes' theorem to get the posterior probabilities:
    p(i | x^{(n)}) = p(i) \, p(x^{(n)} | i) / p(x^{(n)})        (posterior for Gaussian i)
    p(x^{(n)}) = \sum_j p(j) \, p(x^{(n)} | j)
    p(i) = \pi_i                                                (prior: the mixing proportion of Gaussian i)
    p(x^{(n)} | i) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_{i,d}} \, e^{-(x_d^{(n)} - \mu_{i,d})^2 / (2\sigma_{i,d}^2)}   (product over all data dimensions)
The M-step: Computing the new mixing proportions
- Each Gaussian gets a certain amount of posterior probability for each datapoint.
- The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for:
    \pi_i^{new} = \frac{1}{N} \sum_{n=1}^{N} p(i | x^{(n)})
  where n indexes the training cases and N is the number of training cases.
More M-step: Computing the new means
- We just take the center of gravity of the data that the Gaussian is responsible for.
- Just like in K-means, except the data is weighted by the posterior probability of the Gaussian.
- Guaranteed to lie in the convex hull of the data (though the initial jump could be big):
    \mu_i^{new} = \frac{\sum_n p(i | x^{(n)}) \, x^{(n)}}{\sum_n p(i | x^{(n)})}
More M-step: Computing the new variances
- We fit the variance of each Gaussian i, on each dimension d, to the posterior-weighted data:
    \sigma_{i,d}^2 = \frac{\sum_n p(i | x^{(n)}) \, (x_d^{(n)} - \mu_{i,d}^{new})^2}{\sum_n p(i | x^{(n)})}
- It's more complicated if we use a full-covariance Gaussian that is not aligned with the axes.
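The three M-step updates from the last three slides (mixing proportions, means, variances) share the same posterior-weighted sums, so they fit naturally in one function. A sketch for the axis-aligned case, with our own names:

```python
import numpy as np

def m_step(X, post):
    """Closed-form M-step for an axis-aligned mixture of Gaussians.
    X: (N, D) data; post: (N, K) posterior responsibilities p(i | x^(n))."""
    Nk = post.sum(axis=0)               # effective count per Gaussian
    pi = Nk / len(X)                    # new mixing proportions
    mu = (post.T @ X) / Nk[:, None]     # posterior-weighted means
    # Posterior-weighted variance on each dimension, around the new means
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2
    sigma2 = (post[:, :, None] * diff2).sum(axis=0) / Nk[:, None]
    return pi, mu, sigma2
```

With 0/1 responsibilities this reduces to per-cluster sample statistics, matching the "0.7 of an observation" interpretation later in the lecture.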
Visualizing a Mixture of Gaussians
Mixture of Gaussians vs. K-means
- EM for mixtures of Gaussians is just like a soft version of K-means, with fixed priors and covariance.
- Instead of hard assignments in the E-step, we make soft assignments based on the softmax of the squared distance from each point to each cluster.
- Each center is moved by the weighted means of the data, with weights given by the soft assignments.
- In K-means, the weights are 0 or 1.
How do we know that the updates improve things?
- Updating each Gaussian definitely improves the probability of generating the data if we generate it from the same Gaussians after the parameter updates.
- But we know that the posterior will change after updating the parameters.
- A good way to show that this is OK is to show that there is a single function that is improved by both the E-step and the M-step.
- The function we need is called Free Energy.
Deriving variational free energy
We can derive variational free energy as the objective function that is minimized by both steps of the Expectation-Maximization (EM) algorithm.
Why EM converges
- Free energy F is a cost function that is reduced by both the E-step and the M-step:
    Cost = F = expected energy - entropy
- The expected energy term measures how difficult it is to generate each datapoint from the Gaussians it is assigned to. It would be happiest assigning each datapoint to the Gaussian that generates it most easily (as in K-means).
- The entropy term encourages soft assignments. It would be happiest spreading the assignment probabilities for each datapoint equally between all the Gaussians.
The expected energy of a datapoint
- The expected energy of datapoint x^{(n)} is the average negative log probability of generating the datapoint:
    \sum_i q(i | x^{(n)}) \left( -\log \pi_i - \log p(x^{(n)} | \mu_i, \sigma_i^2) \right)
  where q(i | x^{(n)}) is the probability of assigning the datapoint to Gaussian i, and (\mu_i, \sigma_i^2) are the parameters of Gaussian i.
- The average is taken using the probabilities of assigning the datapoint to each Gaussian.
- We can use any probabilities we like: some distribution q (more on this soon).
The entropy term
- This term wants the assignment probabilities to be as uniform as possible (maximum entropy). It fights the expected energy term:
    entropy = -\sum_i q(i | x^{(n)}) \log q(i | x^{(n)})
- (Log probabilities are always negative, so the entropy is non-negative.)
The E-step chooses assignment probabilities that minimize F (with the parameters of the Gaussians fixed)
- How do we find assignment probabilities for a datapoint that minimize the cost and sum to 1?
- The optimal solution to the trade-off between expected energy and entropy is to make the probabilities proportional to the exponentiated negative energies:
    energy(n, i) = -\log \pi_i - \log p(x^{(n)} | \mu_i, \sigma_i^2)
    optimal q(i | x^{(n)}) \propto \exp(-energy(n, i)) = \pi_i \, p(x^{(n)} | i)
- So using the posterior probabilities as assignment probabilities minimizes the cost function!
The M-step chooses parameters that minimize F (with the assignment probabilities held fixed)
- This is easy. We just fit each Gaussian to the data weighted by the assignment probabilities that the Gaussian has for the data.
- The entropy term is unaffected (since it only depends on the assignment probabilities).
- When you fit a Gaussian to data you are maximizing the log probability of the data given the Gaussian. This is the same as minimizing the energies of the datapoints that the Gaussian is responsible for.
- If a Gaussian is assigned a probability of 0.7 for a datapoint, the fitting treats it as 0.7 of an observation.
- Since both the E-step and the M-step decrease the same cost function, EM converges.
Summary: EM is coordinate descent in Free Energy
    F(x^{(n)}) = \sum_i q(i | x^{(n)}) \left( -\log \pi_i - \log p(x^{(n)} | i) \right) + \sum_i q(i | x^{(n)}) \log q(i | x^{(n)})
- Think of each different setting of the hidden and visible variables as a configuration. The energy of the configuration has two terms: the log prob of generating the hidden values, and the log prob of generating the visible values from the hidden ones.
- The E-step minimizes F by finding the best distribution over hidden configurations for each datapoint.
- The M-step holds the distribution fixed and minimizes F by changing the parameters that determine the energy of a configuration.
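The free energy above is easy to evaluate numerically, and doing so checks the E-step claim directly: when q is the exact posterior, F collapses to the negative log-likelihood of the data. A sketch for an axis-aligned mixture, with our own function names:

```python
import numpy as np

def log_lik_matrix(X, mu, sigma2):
    # log N(x^(n) | mu_i, sigma_i^2) for every datapoint/component pair
    return -0.5 * (np.log(2 * np.pi * sigma2)[None, :, :]
                   + (X[:, None, :] - mu[None, :, :]) ** 2
                   / sigma2[None, :, :]).sum(axis=2)          # (N, K)

def free_energy(X, q, pi, mu, sigma2):
    """F = expected energy - entropy, summed over all datapoints.
    q: (N, K) assignment probabilities (any distribution over components)."""
    energy = -(np.log(pi)[None, :] + log_lik_matrix(X, mu, sigma2))
    expected_energy = (q * energy).sum()
    entropy = -(q * np.log(q + 1e-300)).sum()   # guard against q == 0
    return expected_energy - entropy
```

Plugging in the posterior q(i | x^(n)) = pi_i p(x^(n)|i) / p(x^(n)) gives F = -sum_n log p(x^(n)); any other q gives a larger F, which is the bound that EM descends.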
Recap: EM algorithm
- A way of maximizing the likelihood for latent variable models.
- EM is a general algorithm: it finds ML parameters when the original hard problem can be broken up into two easier pieces:
  1. Infer the distribution over the hidden variables (given the current parameters).
  2. Using this "complete" data, find the maximum likelihood parameter estimates.
- Allows constraints to be enforced easily (versus Lagrange multipliers in gradient descent).
- Works fine if the distribution over the hidden variables is easy to compute.
The advantage of using F to understand EM
- There is clearly no need to use the optimal distribution over hidden configurations. We can use any distribution that is convenient, so long as:
  - we always update the distribution in a way that improves F, and
  - we change the parameters to improve F given the current distribution.
- This is very liberating. It allows us to justify all sorts of weird algorithms.
A trade-off between how well the model fits the data and the accuracy of inference
- A new objective function:
    F(q, \theta) = \sum_d \left( -\log p(d | \theta) + KL\big(q(d) \,\|\, p(d)\big) \right)
  where the sum is over datapoints d, q(d) is the approximating posterior distribution, and p(d) is the true posterior distribution over the hidden variables given d. The first term measures how well the model fits the data; the second measures the inaccuracy of inference.
- This makes it feasible to fit very complicated models, but the approximations that are tractable may be poor.