CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 5: Clustering, mixture models, and EM

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 5: Clustering, mixture models, and EM. All lecture slides will be available as .pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/CSC2515_Winter15.html

Overview: Clustering with K-means, and a proof of convergence; Clustering with K-medoids; Clustering with a mixture of Gaussians; The EM algorithm, and a proof of convergence; Variational inference.

Unsupervised Learning. Supervised learning algorithms have a clear goal: produce desired outputs for given inputs. The goal of unsupervised learning algorithms (no explicit feedback on whether the outputs of the system are correct) is less clear: reduce dimensionality; find clusters; model data density; find hidden causes. Key utility: compress data; detect outliers; facilitate other learning.

Major types. Primary problems and approaches in unsupervised learning fall into three types: 1. Dimensionality reduction: represent each input case using a small number of variables (e.g., principal components analysis, factor analysis, independent components analysis). 2. Clustering: represent each input case using a prototype example (e.g., k-means, mixture models). 3. Density estimation: estimating the probability distribution over the data space.

Clustering. Grouping N examples into K clusters is one of the canonical problems in unsupervised learning. Motivations: prediction; lossy compression; outlier detection. We assume that the data was generated from a number of different classes. The aim is to cluster data from the same class together. How many classes? Why not put each datapoint into a separate class? What is the objective function that is optimized by sensible clusterings?

The K-means algorithm. Assume the data lives in a Euclidean space. Assume we want K classes. Initialization: randomly located cluster centers. The algorithm alternates between two steps: Assignment step: assign each datapoint to the closest cluster. Refitting step: move each cluster center to the center of gravity of the data assigned to it.

K-means algorithm. Initialization: set the K means {m_k} to random values. Assignment: each datapoint n is assigned to the nearest mean; responsibilities: $\hat{k}^{(n)} = \arg\min_k \{ d(m_k, x^{(n)}) \}$, with $r_k^{(n)} = 1$ if $\hat{k}^{(n)} = k$ (and 0 otherwise). Update: the model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for: $m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}$. Repeat the assignment and update steps until the assignments do not change.
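
A minimal NumPy sketch of this loop, assuming Euclidean distance for d and the data stored as an N x D array X (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K random datapoints as the initial means m_k.
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    r = None
    for _ in range(n_iters):
        # Assignment step: each datapoint goes to its nearest mean.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        new_r = d2.argmin(axis=1)
        if r is not None and np.array_equal(new_r, r):
            break  # assignments unchanged -> converged (to a local minimum)
        r = new_r
        # Update step: move each mean to the centre of gravity of its assigned points.
        for k in range(K):
            if np.any(r == k):
                m[k] = X[r == k].mean(axis=0)
    return m, r
```

For example, `m, r = kmeans(X, K=3)` returns the three centers and the hard assignment of each datapoint.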

Questions about K-means. Why does the update set m_k to the mean of the assigned points? Where does the distance d come from? What if we used a different distance measure? How can we choose the best distance? How to choose K? How can we choose between alternative clusterings? Will it converge? Hard cases: unequal spreads, non-circular spreads, in-between points.

Why K-means converges. Whenever an assignment is changed, the sum of squared distances of datapoints from their assigned cluster centers is reduced. Whenever a cluster center is moved, the sum of squared distances of the datapoints from their currently assigned cluster centers is reduced. Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).

K-means: Details. Objective: minimize the sum of squared distances of datapoints to their assigned cluster centers, $E(\{m\},\{r\}) = \sum_n \sum_k r_k^{(n)} \| m_k - x^{(n)} \|^2$ subject to $\sum_k r_k^{(n)} = 1\ \forall n$ and $r_k^{(n)} \in \{0,1\}\ \forall k, n$. The optimization method is a form of coordinate descent ("block coordinate descent"): fix the centers, optimize the assignments (choose the cluster whose mean is closest); fix the assignments, optimize the means (average of the assigned datapoints).
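
As a sanity check on the convergence argument, one can evaluate this objective after every assignment and update step and verify that it never increases. A small sketch under the same assumptions as the K-means code above (names illustrative):

```python
import numpy as np

def kmeans_objective(X, m, r, K):
    """E({m},{r}) = sum_n sum_k r_k^(n) ||m_k - x^(n)||^2 with hard assignments r."""
    E = 0.0
    for k in range(K):
        diff = X[r == k] - m[k]      # points currently assigned to cluster k
        E += (diff ** 2).sum()       # their squared distances to m_k
    return E
```

Logging this value inside the assignment/update loop gives a monotonically non-increasing sequence, which is exactly the block-coordinate-descent argument for convergence.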

Local minima. There is nothing to prevent k-means getting stuck at local minima (a bad local optimum). We could try many random starting points. We could try non-local split-and-merge moves: simultaneously merge two nearby clusters and split a big cluster into two.

Application of K-Means Clustering

K-medoids. K-means: choose the number of clusters K; the algorithm's primary aim is to strategically position these K means. Alternative: allow each datapoint to potentially act as a cluster representative; the algorithm assigns each point to one of these representatives (exemplars).

K-medoids algorithm. Initialization: set a random set of K datapoints as the medoids. Assignment: each datapoint c is assigned to the nearest medoid: $\hat{k}^{(c)} = \arg\min_k \{ d(x_k, x^{(c)}) \}$, with $r_k^{(c)} = 1$ if $\hat{k}^{(c)} = k$. Update: for each medoid k and each datapoint c, swap k and c and compute the total cost $J(\{x\},\{r\}) = \sum_c \sum_k r_k^{(c)}\, d(x^{(c)}, x_k)$; select the configuration with the lowest cost. Repeat the assignment and update steps until the assignments do not change.
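
A minimal sketch of this swap-based update, assuming the pairwise distances have already been precomputed into an N x N matrix D (function and variable names are illustrative):

```python
import numpy as np

def kmedoids(D, K, n_iters=100, seed=0):
    """K-medoids on a precomputed N x N distance matrix D (any distance function)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest medoid.
        assign = D[:, medoids].argmin(axis=1)
        cost = D[np.arange(len(D)), medoids[assign]].sum()
        # Update: try swapping each medoid with each datapoint; keep the cheapest configuration.
        best = (cost, medoids.copy())
        for k in range(K):
            for c in range(len(D)):
                trial = medoids.copy()
                trial[k] = c
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < best[0]:
                    best = (trial_cost, trial)
        if best[0] >= cost:
            break  # no swap lowers the cost J -> converged
        medoids = best[1]
    return medoids, D[:, medoids].argmin(axis=1)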

K-medoids vs. K-means. Both partition the data into K partitions. Both can utilize various distance functions, but K-medoids can pre-compute the pairwise distances. K-medoids chooses datapoints as centers (medoids/exemplars), while K-means allows the means to be arbitrary locations: discrete vs. continuous optimization. K-medoids is more robust to noise and outliers.

Soft k-means. Instead of making hard assignments of datapoints to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a datapoint and another may have a responsibility of .3. This allows a cluster to use more information about the data in the refitting step. What happens to our convergence guarantee? How do we decide on the soft assignments?

Soft K-means algorithm. Initialization: set the K means {m_k} to random values. Assignment: each datapoint n is given a soft degree of assignment to each cluster mean k, based on the responsibilities $r_k^{(n)} = \frac{\exp[-\beta\, d(m_k, x^{(n)})]}{\sum_{k'} \exp[-\beta\, d(m_{k'}, x^{(n)})]}$. Update: the model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for: $m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}$. Repeat the assignment and update steps until the assignments do not change.
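
A minimal sketch of soft K-means with stiffness parameter beta, assuming squared Euclidean distance (names illustrative):

```python
import numpy as np

def soft_kmeans(X, K, beta=1.0, n_iters=100, seed=0):
    """Soft K-means: softmax responsibilities, then responsibility-weighted means."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Soft assignment: r_k^(n) proportional to exp(-beta * d(m_k, x^(n))).
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)  # N x K
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)               # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)                         # rows sum to 1
        # Update: responsibility-weighted mean of the data.
        m = (r.T @ X) / r.sum(axis=0)[:, None]
    return m, r
```

As beta grows large the responsibilities become one-hot and this reduces to ordinary K-means; a small beta gives nearly uniform assignments.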

Questions about soft K-means. How to set β? What about problems with elongated clusters? Clusters with unequal weight and width?

Latent variable models. Adopt a different view of clustering, in terms of a model with latent variables: variables that are always unobserved. We may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly; this can actually simplify the model. It is a form of divide-and-conquer: use simple parts to build complex models (e.g., multimodal densities, or piecewise-linear regression).

Mixture models. The most common example is a mixture model: the most basic form of latent variable model, with a single discrete latent variable. We have defined the hidden cause to be discrete: a multinomial random variable. And the observation is Gaussian. The model allows for other distributions. Example: Bernoulli observations. Another example: a continuous hidden (latent) variable (see next lecture). Diagram: hidden cause → visible effect. But these are only the simplest models: we can add many hidden and visible nodes, and layers.

Learning is harder with latent variables. In fully observed settings, the probability model is a product, so the log-likelihood is a sum and the terms decouple: $\ell(\theta; D) = \log p(x, y \mid \theta) = \log p(y \mid \theta_y) + \log p(x \mid y, \theta_x)$. With latent variables, the probability contains a sum, so the log-likelihood has all parameters coupled together: $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$.

Direct learning in mixtures of Gaussians. We can treat the likelihood as an objective function and try to optimize it as a function of θ by taking gradients (as we did before, for example in neural networks). If you work through the gradients, you'll find that $\frac{\partial \ell}{\partial \theta_k} = \sum_n r_k^{(n)}\, \frac{\partial \log p(x^{(n)} \mid \theta_k)}{\partial \theta_k}$, where $r_k^{(n)} = \frac{\alpha_k\, p(x^{(n)} \mid \theta_k)}{\sum_j \alpha_j\, p(x^{(n)} \mid \theta_j)}$ are the responsibilities. In a mixture of Gaussians, for example, $\frac{\partial \ell}{\partial \mu_k} = \sum_n r_k^{(n)}\, (x^{(n)} - \mu_k)/\sigma_k^2$. To use optimization methods (e.g., conjugate gradient), we have to ensure that the parameters respect constraints: reparametrize in terms of unconstrained values.
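
A small sketch of that gradient for the means, assuming spherical Gaussians and responsibilities r already computed; all names are illustrative and the reparameterization of the mixing proportions is omitted:

```python
import numpy as np

def grad_means(X, r, mu, sigma2):
    """d l / d mu_k = sum_n r_k^(n) (x^(n) - mu_k) / sigma_k^2, for each component k."""
    # X: N x D data, r: N x K responsibilities, mu: K x D means, sigma2: length-K variances.
    diff = X[:, None, :] - mu[None, :, :]                          # N x K x D
    return (r[:, :, None] * diff).sum(axis=0) / sigma2[:, None]    # K x D gradient
```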

EM: An alternative learning approach. Use the posterior weightings to softly label the data. Then solve for the parameters given these current weightings, and recalculate the posteriors (weights) given the new parameters. Expectation-Maximization is a form of bound optimization, as opposed to gradient descent. With respect to the latent variables, guessing their values makes the learning fully observed: $\ell(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$, whereas with latent variables the probability contains a sum, so the log-likelihood has all parameters coupled together: $\ell(\theta; D) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$. Note: EM is not a cost function, such as cross-entropy; and EM is not a model, such as a mixture of Gaussians.

Graphical model view of mixture models. Each node is a random variable. Blue node: observed variable (data, aka visibles). Red node: hidden variable (cluster assignment). The model defines a probability distribution over all the nodes. The model generates data by picking a state for the hidden node based on the prior. The distribution over the leaf node (data) is defined by its parent(s). Diagram: hidden cause → visible effect.

The mixture of Gaussians generative model. First pick one of the k Gaussians with a probability that is called its mixing proportion. Then generate a random point from the chosen Gaussian. The probability of generating the exact data we observed is zero, but we can still try to maximize the probability density. Adjust the means of the Gaussians. Adjust the variances of the Gaussians on each dimension. Adjust the mixing proportions of the Gaussians.
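
A minimal sketch of this generative process (ancestral sampling) for axis-aligned Gaussians; the parameter names are illustrative:

```python
import numpy as np

def sample_mog(pi, mu, sigma2, n, seed=0):
    """Draw n points from a mixture of axis-aligned Gaussians.

    pi: length-K mixing proportions, mu: K x D means, sigma2: K x D per-dimension variances.
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)                  # pick a Gaussian per point (mixing proportions)
    x = rng.normal(loc=mu[z], scale=np.sqrt(sigma2[z]))    # then sample from the chosen Gaussian
    return x, z
```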

Fitting a mixture of Gaussians. Optimization uses the Expectation Maximization algorithm, which alternates between two steps: E-step: compute the posterior probability that each Gaussian generates each datapoint (e.g., responsibilities like .95/.05 or .5/.5 for two Gaussians). M-step: assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.

The E-step: Computing responsibilities. In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint? We cannot be sure, so it's a distribution over all possibilities. Use Bayes' theorem to get the posterior probabilities: the posterior for Gaussian i is $p(i \mid x^{(c)}) = \frac{p(i)\, p(x^{(c)} \mid i)}{p(x^{(c)})}$, with $p(x^{(c)}) = \sum_j p(j)\, p(x^{(c)} \mid j)$, prior (mixing proportion) $p(i) = \pi_i$, and $p(x^{(c)} \mid i) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_{i,d}}\, e^{-(x_d^{(c)} - \mu_{i,d})^2 / (2\sigma_{i,d}^2)}$, a product over all data dimensions.
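
A minimal sketch of this E-step for axis-aligned Gaussians, done in log space for numerical stability (names are illustrative):

```python
import numpy as np

def e_step(X, pi, mu, sigma2):
    """Posterior responsibilities p(i | x^(c)) for every datapoint and component."""
    # X: N x D, pi: length-K, mu: K x D, sigma2: K x D per-dimension variances.
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2                                    # N x K x D
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2)[None] + diff2 / sigma2[None]).sum(axis=2)
    log_post = np.log(pi)[None] + log_lik                       # unnormalized log posterior
    log_post -= log_post.max(axis=1, keepdims=True)             # stabilize the normalization
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)               # N x K, rows sum to 1
```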

The M-step: Computing new mixing proportions. Each Gaussian gets a certain amount of posterior probability for each datapoint. The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for: $\pi_i^{\text{new}} = \frac{\sum_{c=1}^{N} p(i \mid x^{(c)})}{N}$, where $x^{(c)}$ is the data for training case c and N is the number of training cases.

More M-step: Computing the new means. We just take the center of gravity of the data that the Gaussian is responsible for. Just like in K-means, except the data is weighted by the posterior probability of the Gaussian. It is guaranteed to lie in the convex hull of the data (there could be a big initial jump): $\mu_i^{\text{new}} = \frac{\sum_c p(i \mid x^{(c)})\, x^{(c)}}{\sum_c p(i \mid x^{(c)})}$.

More M-step: Computing the new variances. We fit the variance of each Gaussian i, on each dimension d, to the posterior-weighted data: $\sigma_{i,d}^2 = \frac{\sum_c p(i \mid x^{(c)})\, (x_d^{(c)} - \mu_{i,d}^{\text{new}})^2}{\sum_c p(i \mid x^{(c)})}$. It is more complicated if we use a full-covariance Gaussian that is not aligned with the axes.
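
A minimal sketch of these three M-step updates, taking the responsibilities produced by the E-step sketch above (names illustrative, axis-aligned Gaussians assumed):

```python
import numpy as np

def m_step(X, post):
    """Update pi, mu, sigma2 from the N x K posterior responsibilities `post`."""
    Nk = post.sum(axis=0)                                            # effective count per Gaussian
    pi = Nk / len(X)                                                 # new mixing proportions
    mu = (post.T @ X) / Nk[:, None]                                  # responsibility-weighted means
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2                    # N x K x D
    sigma2 = (post[:, :, None] * diff2).sum(axis=0) / Nk[:, None]    # per-dimension variances
    return pi, mu, sigma2
```

Alternating `e_step` and `m_step` until the log-likelihood stops improving is the full EM loop for this model.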

Visualizing a Mixture of Gaussians

Mixture of Gaussians vs. K-means. EM for mixtures of Gaussians is just like a soft version of K-means, with fixed priors and covariance. Instead of hard assignments in the E-step, we do soft assignments based on the softmax of the squared distance from each point to each cluster. Each center is moved by the weighted mean of the data, with weights given by the soft assignments. In K-means, the weights are 0 or 1.

How do we know that the updates improve things? Updating each Gaussian definitely improves the probability of generating the data if we generate it from the same Gaussians after the parameter updates. But we know that the posterior will change after updating the parameters. A good way to show that this is OK is to show that there is a single function that is improved by both the E-step and the M-step. The function we need is called Free Energy.

Deriving variational free energy. We can derive the variational free energy as the objective function that is minimized by both steps of the Expectation Maximization (EM) algorithm.

Why EM converges. Free energy F is a cost function that is reduced by both the E-step and the M-step: Cost = F = expected energy - entropy. The expected energy term measures how difficult it is to generate each datapoint from the Gaussians it is assigned to. It would be happiest assigning each datapoint to the Gaussian that generates it most easily (as in K-means). The entropy term encourages soft assignments. It would be happiest spreading the assignment probabilities for each datapoint equally between all the Gaussians.

The expected energy of a datapoint. The expected energy of datapoint c is the average negative log probability of generating the datapoint. The average is taken using the probabilities of assigning the datapoint to each Gaussian; we can use any probabilities we like, i.e. some distribution q (more on this soon): $\sum_i q(i \mid x^{(c)})\, \big( -\log \pi_i - \log p(x^{(c)} \mid \mu_i, \sigma_i^2) \big)$, where $q(i \mid x^{(c)})$ is the probability of assigning datapoint $x^{(c)}$ to Gaussian i, and $\mu_i, \sigma_i^2$ are the parameters (location) of Gaussian i.

The entropy term. This term wants the assignment probabilities to be as uniform as possible (maximum entropy); it fights the expected energy term: $\text{entropy} = -\sum_i q(i \mid x^{(c)}) \log q(i \mid x^{(c)})$ (log probabilities are always negative).

The E-step chooses assignment probabilities that minimize F (with the parameters of the Gaussians fixed). How do we find assignment probabilities for a datapoint that minimize the cost and sum to 1? The optimal solution to the trade-off between expected energy and entropy is to make the probabilities proportional to the exponentiated negative energies: the energy of assigning c to i is $-\log \pi_i - \log p(x^{(c)} \mid \mu_i, \sigma_i^2)$, so the optimal value of $q(i \mid x^{(c)})$ is $\propto \exp(-\text{energy}(c, i)) = \pi_i\, p(x^{(c)} \mid i)$. So using the posterior probabilities as assignment probabilities minimizes the cost function!

The M-step chooses parameters that minimize F (with the assignment probabilities held fixed). This is easy. We just fit each Gaussian to the data weighted by the assignment probabilities that the Gaussian has for the data. The entropy term is unaffected (since it only depends on the assignment probabilities). When you fit a Gaussian to data you are maximizing the log probability of the data given the Gaussian. This is the same as minimizing the energies of the datapoints that the Gaussian is responsible for. If a Gaussian is assigned a probability of 0.7 for a datapoint, the fitting treats it as 0.7 of an observation. Since both the E-step and the M-step decrease the same cost function, EM converges.

Summary: EM is coordinate descent in Free Energy: $F(x^{(c)}) = \sum_i q(i \mid x^{(c)}) \big( -\log \pi_i - \log p(x^{(c)} \mid i) \big) - \sum_i q(i \mid x^{(c)}) \big( -\log q(i \mid x^{(c)}) \big)$. Think of each different setting of the hidden and visible variables as a configuration. The energy of the configuration has two terms: the log prob of generating the hidden values, and the log prob of generating the visible values from the hidden ones. The E-step minimizes F by finding the best distribution over hidden configurations for each data point. The M-step holds the distribution fixed and minimizes F by changing the parameters that determine the energy of a configuration.
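
A small sketch that evaluates this free energy (summed over the datapoints) for the axis-aligned mixture used in the earlier sketches; logging it across alternating E-steps and M-steps should give a non-increasing sequence (names illustrative):

```python
import numpy as np

def free_energy(X, q, pi, mu, sigma2):
    """F = expected energy - entropy, summed over datapoints, for responsibilities q (N x K)."""
    diff2 = (X[:, None, :] - mu[None, :, :]) ** 2                                    # N x K x D
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2)[None] + diff2 / sigma2[None]).sum(axis=2)
    energy = -(np.log(pi)[None] + log_lik)                   # -log pi_i - log p(x^(c) | i)
    expected_energy = (q * energy).sum()
    entropy = -(q * np.log(q + 1e-12)).sum()                 # small constant avoids log(0)
    return expected_energy - entropy
```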

Recap: EM algorithm. A way of maximizing the likelihood for latent variable models. EM is a general algorithm: it finds ML parameters when the original hard problem can be broken up into two easier pieces: infer the distribution over the hidden variables (given the current parameters); then, using this complete data, find the maximum likelihood parameter estimates. It allows constraints to be enforced easily (versus Lagrange multipliers in gradient descent). It works fine if the distribution over the hidden variables is easy to compute.

The advantage of using F to understand EM. There is clearly no need to use the optimal distribution over hidden configurations. We can use any distribution that is convenient, so long as we always update the distribution in a way that improves F, and we change the parameters to improve F given the current distribution. This is very liberating. It allows us to justify all sorts of weird algorithms.

A trade-off between how well the model fits the data and the accuracy of inference. The new objective function is $F(q,\theta) = \sum_d \log p(d \mid \theta) - \sum_d \mathrm{KL}\big( q(d)\, \|\, p(d) \big)$, where $\theta$ are the parameters, d ranges over the data, q(d) is the approximating posterior distribution, and p(d) is the true posterior distribution. The first term measures how well the model fits the data; the second measures the inaccuracy of inference. This makes it feasible to fit very complicated models, but the approximations that are tractable may be poor.
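
For completeness, here is the standard one-datapoint derivation (not spelled out on the slide) of why this decomposition holds for an observation x with hidden variable z, for any distribution q(z):

```latex
\begin{align*}
\log p(x \mid \theta)
  &= \sum_z q(z)\,\log p(x \mid \theta)
   = \sum_z q(z)\,\log \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)} \\
  &= \underbrace{\sum_z q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)}}_{\text{entropy}\;-\;\text{expected energy}}
   \;+\; \underbrace{\sum_z q(z)\,\log \frac{q(z)}{p(z \mid x, \theta)}}_{\mathrm{KL}\left(q(z)\,\|\,p(z \mid x, \theta)\right)\;\ge\;0}
\end{align*}
```

The first term is the negative of the per-datapoint cost F = expected energy - entropy used earlier, so the F(q, θ) written on this slide is that cost with its sign flipped: the E-step closes the KL gap by setting q to the true posterior, and the M-step raises a lower bound on the log-likelihood; both describe the same coordinate descent.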