Lecture 8: Clustering & Mixture Models


C4B Machine Learning, Hilary 2011, A. Zisserman

Outline:
- K-means algorithm
- Gaussian mixture models (GMMs) and the EM algorithm
- pLSA clustering

K-means algorithm

Partition the data into K sets.
Initialize: choose K centres (at random).
Repeat:
  1. Assign each point to the nearest centre.
  2. Set each new centre to the mean of the points assigned to it.
Until there is no change in the assignments.
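A minimal NumPy sketch of the loop above; the data matrix X, the number of clusters K and the random initialisation are illustrative choices, not from the lecture:

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Basic K-means: X is an (N, d) array, K the number of clusters."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=K, replace=False)]   # random initial centres
        for _ in range(n_iters):
            # Assignment step: index of the nearest centre for every point
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centre becomes the mean of the points assigned to it
            new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centres[k] for k in range(K)])
            if np.allclose(new_centres, centres):   # no change: converged
                break
            centres = new_centres
        return centres, labels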

Cost function

K-means minimizes a measure of distortion for a set of vectors {x_i}, i = 1, ..., N:

    D = \sum_{i=1}^{N} \| x_i^k - c_k \|^2

where {x_i^k} denotes the subset of vectors assigned to cluster k. The objective is to find the set of centres {c_k}, k = 1, ..., K, that minimizes the distortion:

    \min_{\{c_k\}} \sum_{i=1}^{N} \| x_i^k - c_k \|^2

Introducing binary assignment variables r_ik, the distortion can be written as

    D = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \| x_i - c_k \|^2

where, if x_i is assigned to cluster k,

    r_{ij} = \begin{cases} 1 & j = k \\ 0 & j \neq k \end{cases}
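As a small sketch (the argument names are placeholders), the distortion D for a given labelling can be computed directly from this definition; evaluating it after each iteration of the K-means sketch above reproduces the monotonically decreasing cost curve discussed on the next slides:

    import numpy as np

    def distortion(X, centres, labels):
        """D = sum_i ||x_i - c_{k(i)}||^2 for the current assignment."""
        return float(np.sum(np.linalg.norm(X - centres[labels], axis=1) ** 2))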

Minimizing the cost function

We want to determine

    \min_{\{c_k\}, \{r_{ik}\}} D = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \| x_i - c_k \|^2

Step 1: minimize over the assignments r_ik. Each term in x_i can be minimized independently by assigning x_i to the closest centre c_k.

Step 2: minimize over the centres c_k. Setting the derivative to zero,

    \frac{d}{d c_k} \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \, \| x_i - c_k \|^2 = -2 \sum_{i=1}^{N} r_{ik} (x_i - c_k) = 0

hence

    c_k = \frac{\sum_{i=1}^{N} r_{ik} \, x_i}{\sum_{i=1}^{N} r_{ik}}

i.e. c_k is the mean (centroid) of the vectors assigned to it.

Note: since both steps decrease the cost D, the algorithm will converge, but it can converge to a local rather than a global minimum.

[Figure: the distortion cost D decreases with the iteration number.]
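A quick numerical sanity check of the centroid result (the data here is made up purely for illustration): perturbing the mean of a cluster in any direction can only increase the within-cluster cost.

    import numpy as np

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(50, 2))                     # points assigned to one cluster
    cost = lambda c: np.sum(np.linalg.norm(pts - c, axis=1) ** 2)

    centroid = pts.mean(axis=0)
    for delta in rng.normal(scale=0.1, size=(5, 2)):   # a few random perturbations
        assert cost(centroid) <= cost(centroid + delta)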

Sensitive to initialization

[Figures: two runs of K-means on the same data from different initializations converge to different clusterings, with different final distortion costs D (plotted against the iteration number t).]

Practicalities

- Always run the algorithm several times with different initializations and keep the run with the lowest cost.
- Choice of K.
- Suppose we have data for which a distance is defined, but which is non-vectorial (so the points can't be added, and hence no mean can be computed). Which step needs to change?
- Many other clustering methods exist: hierarchical K-means, K-medoids, agglomerative clustering.

Example application 1: vector quantization

[Figures: the 2D example points before and after quantization to their cluster centres.]

- All vectors in a cluster are considered equivalent: they can be represented by a single vector, the cluster centre.
- Applications in compression, segmentation, noise reduction.

Example: image segmentation with K-means (see the sketch below)
- Cluster all pixels using their colour vectors (3D).
- Assign pixels to their clusters.
- Colour each pixel by its cluster assignment.
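A minimal sketch of the colour-based segmentation described above, assuming an RGB image loaded as a NumPy array; the file name, the image loader (imageio) and K = 5 are placeholder choices:

    import numpy as np
    from sklearn.cluster import KMeans
    from imageio.v3 import imread   # any image loader would do

    img = imread("example.jpg")                      # (H, W, 3) uint8 RGB image
    pixels = img.reshape(-1, 3).astype(float)        # one 3D colour vector per pixel

    K = 5
    km = KMeans(n_clusters=K, n_init=10).fit(pixels) # cluster the colour vectors
    labels = km.labels_.reshape(img.shape[:2])       # cluster index per pixel

    # Recolour each pixel with its cluster centre (vector quantization of colours)
    segmented = km.cluster_centers_[labels].astype(np.uint8)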

Example application 2: face clustering

Goal: determine the principal cast of a feature film.
Approach: view this as a clustering problem on faces.

Algorithm outline (a sketch follows below):
1. Detect faces in every fifth frame of the movie.
2. Describe each face by a vector of pixel intensities.
3. Cluster the vectors using the K-means algorithm.

Example: Groundhog Day, 2000 frames.
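A rough sketch of such a pipeline using OpenCV's stock Haar-cascade face detector; the video path, frame step, face size and K = 4 are illustrative assumptions, not the lecture's actual setup:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    faces = []
    cap = cv2.VideoCapture("groundhog_day.mp4")       # placeholder path
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % 5 == 0:                        # every fifth frame
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
                face = cv2.resize(gray[y:y + h, x:x + w], (32, 32))
                faces.append(face.flatten().astype(float))   # intensity vector
        frame_idx += 1
    cap.release()

    labels = KMeans(n_clusters=4, n_init=10).fit_predict(np.array(faces))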

[Figures: a subset of the detected faces in temporal order, and the resulting clusters for K = 4.]

Gaussian Mixture Models

Hard vs soft assignments

- In K-means there is a hard assignment of vectors to clusters.
- However, for vectors near a cluster boundary this may be a poor representation.
- Instead, we can consider a soft assignment, where the strength of the assignment depends on distance.

Gaussian Mixture Model (GMM)

Combine simple models into a complex model:

    p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where the N(x | μ_k, Σ_k) are the components and the π_k are the mixing coefficients.

[Figure: an example mixture of K = 3 Gaussian components.]

[Figures: a single Gaussian vs a mixture of two Gaussians fitted to the same data.]

Cost function for fitting a GMM

For a point x_i,

    p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

The likelihood of the GMM for N points (assuming independence) is

    \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

and the negative log-likelihood is

    L(\theta) = - \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

where θ are the parameters we wish to estimate (i.e. μ_k and Σ_k in this case).
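A sketch of this cost in NumPy/SciPy; the argument names are placeholders, and the parameters are assumed to be passed in as lists/arrays of the right shapes:

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_neg_log_likelihood(X, pis, mus, Sigmas):
        """L(theta) = -sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
        # densities[i, k] = N(x_i | mu_k, Sigma_k)
        densities = np.column_stack(
            [multivariate_normal.pdf(X, mean=m, cov=S) for m, S in zip(mus, Sigmas)])
        return -np.sum(np.log(densities @ np.asarray(pis)))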

To minimize L(θ), differentiate first with respect to μ_k:

    \frac{dL(\theta)}{d\mu_k} = - \sum_{i=1}^{N} \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_i - \mu_k)

Defining

    \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

setting the derivative to zero and rearranging gives

    \sum_{i=1}^{N} \gamma_{ik} \, \mu_k = \sum_{i=1}^{N} \gamma_{ik} \, x_i

and hence the weighted mean

    \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad N_k = \sum_{i=1}^{N} \gamma_{ik}

The γ_ik are the responsibilities of mixture component k for vector x_i, and N_k is the effective number of vectors assigned to component k. The γ_ik play a similar role to the assignment variables r_ik in K-means, but γ_ik is not binary: 0 ≤ γ_ik ≤ 1.

Differentiating with respect to Σ_k gives the weighted covariance

    \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}

and with respect to π_k (enforcing the constraint \sum_k \pi_k = 1 with a Lagrange multiplier) gives

    \pi_k = \frac{N_k}{N}

which is the average responsibility for the component.

Now we need an algorithm for minimizing this cost function.
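The Lagrange-multiplier step for π_k is only stated above; a brief sketch of the intermediate algebra, using the definitions already introduced, is:

    \frac{\partial}{\partial \pi_k} \Big[ L(\theta) + \lambda \big( \sum_{k} \pi_k - 1 \big) \Big]
        = - \sum_{i=1}^{N} \frac{\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} + \lambda = 0

Multiplying through by π_k gives N_k = λ π_k; summing over k and using \sum_k \pi_k = 1 and \sum_k N_k = N gives λ = N, and hence π_k = N_k / N.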

Expectation Maximization (EM) algorithm

Step 1, Expectation (assignment): compute the responsibilities using the current parameters μ_k, Σ_k:

    \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

Step 2, Maximization: re-estimate the parameters using the computed responsibilities:

    \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i
    \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}
    \pi_k = \frac{N_k}{N}, \qquad \text{where } N_k = \sum_{i=1}^{N} \gamma_{ik}

Repeat until convergence.

Example in 1D

Data: points on the real line (shown on the interval from -4 to 5).
Objective: fit a mixture-of-Gaussians model with K = 2 components,

    p(x \mid \theta) = \pi_1 \, \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \pi_2 \, \mathcal{N}(x \mid \mu_2, \sigma_2^2)

Parameters: the mixing coefficients and variances are kept fixed, i.e. only the means μ_1 and μ_2 are estimated.
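A compact NumPy sketch of these two steps for the general multivariate case; the initialisation, the small covariance regularizer and the fixed iteration count are simplistic placeholder choices:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=100, seed=0):
        """Fit a K-component GMM to X (N, d) by expectation maximization."""
        N, d = X.shape
        rng = np.random.default_rng(seed)
        mus = X[rng.choice(N, size=K, replace=False)]           # initial means
        Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
        pis = np.full(K, 1.0 / K)
        for _ in range(n_iters):
            # E-step: responsibilities gamma[i, k]
            dens = np.column_stack(
                [pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                 for k in range(K)])
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # M-step: weighted means, weighted covariances and mixing coefficients
            Nk = gamma.sum(axis=0)
            mus = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mus[k]
                Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
            pis = Nk / N
        return pis, mus, Sigmas, gamma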

Intuition of EM

E-step: compute the soft assignment of the points, using the current parameters.
M-step: update the parameters using the current responsibilities.

[Figures: a sequence of alternating E and M steps on the 1D example.]

Likelihood function

The likelihood is a function of the parameters θ; the probability is a function of the random variable x.

E-step: what do we actually compute?

The responsibilities form an (n_components × n_points) matrix whose columns sum to 1: entry (k, i) is γ_ik, the responsibility of component k for point x_i.

[Figure: the responsibility matrix for the 1D example, with rows labelled Component 1 and Component 2 and columns labelled Point 1, Point 2, ..., Point 6.]

2D example: fitting means and covariances

[Figures: EM iterations fitting both the means and covariances of a GMM to 2D data.]

Practicalities

- Usually initialize the EM algorithm using K-means.
- Choice of K.
- EM can converge to a local rather than a global minimum.

Probabilistic Latent Semantic Analysis (pLSA), non-examinable

Unsupervised learning of topics:
- Given a large collection of text documents (e.g. a website, or a news archive).
- Discover the principal semantic topics in the collection.
- Documents can then be retrieved/organized according to topic.
- The method involves fitting a mixture model to a representation of the collection.

Document-Term Matrix (bag-of-words model)

D = document collection, W = lexicon/vocabulary.

[Figure: an example document d_i ("Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]") and its sparse vector of term counts over vocabulary entries such as "artifact", "artificial intelligence" and "interest"; the document-term matrix collects these counts, with documents d_1, ..., d_I as rows and words w_1, ..., w_J as columns.]

[Hofmann 99]: d documents, w words, z topics.

Model fitting: find topic vectors P(w|z), common to all documents, and mixture coefficients P(z|d), specific to each document.
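A small sketch of building such a document-term count matrix with scikit-learn; the toy documents below are made up for illustration:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "Texas Instruments developed a chip for artificial intelligence applications",
        "The coast guard responded to the oil spill from the tanker",
        "Investors traded junk bonds through the brokerage firm",
    ]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)        # sparse (n_documents, n_terms) matrix
    terms = vectorizer.get_feature_names_out()     # the vocabulary W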

[Hofmann 99]: d documents, w words, z topics; the P(w|z) are the latent "aspects".

This is a non-negative matrix factorization:

    P(w \mid d) = \sum_{z} P(w \mid z) \, P(z \mid d)

i.e. each document histogram is explained as a sum over topics.

Fitting the pLSA parameters

With n(w_i, d_j) the observed count of word i in document j, maximize the likelihood of the data using EM:

    \mathcal{L} = \prod_{i=1}^{M} \prod_{j=1}^{N} P(w_i \mid d_j)^{\, n(w_i, d_j)}

where M is the number of words and N the number of documents.

Expectation Maximization algorithm for pLSA

E-step: posterior probability of the latent variables ("topics"), i.e. the probability that the occurrence of term w in document d can be explained by topic z:

    P(z \mid d, w) = \frac{P(w \mid z) \, P(z \mid d)}{\sum_{z'} P(w \mid z') \, P(z' \mid d)}

M-step: parameter estimation based on the completed statistics:
- How often is term w associated with topic z?

    P(w \mid z) \propto \sum_{d} n(w, d) \, P(z \mid d, w)

- How often is document d associated with topic z?

    P(z \mid d) \propto \sum_{w} n(w, d) \, P(z \mid d, w)

Example (1): topics (3 of 100) extracted from Associated Press news, each shown as its top 20 terms with their weights.

Topic 1: securities 94.96324, firm 88.74591, drexel 78.33697, investment 75.51504, bonds 64.23486, sec 61.89292, bond 61.39895, junk 61.14784, milken 58.72266, firms 51.26381, investors 48.80564, lynch 44.91865, insider 44.88536, shearson 43.82692, boesky 43.74837, lambert 40.77679, merrill 40.14225, brokerage 39.66526, corporate 37.94985, burnham 36.86570

Topic 2: ship 109.41212, coast 93.70902, guard 82.11109, sea 77.45868, boat 75.97172, fishing 65.41328, vessel 64.25243, tanker 62.55056, spill 60.21822, exxon 58.35260, boats 54.92072, waters 53.55938, valdez 51.53405, alaska 48.63269, ships 46.95736, port 46.56804, hazelwood 44.81608, vessels 43.80310, ferry 42.79100, fishermen 41.65175

Topic 3: india 91.74842, singh 50.34063, militants 49.21986, gandhi 48.86809, sikh 47.12099, indian 44.29306, peru 43.00298, hindu 42.79652, lima 41.87559, kashmir 40.01138, tamilnadu 39.54702, killed 39.47202, india's 39.25983, punjab 39.22486, delhi 38.70990, temple 38.38197, shining 37.62768, menem 35.42235, hindus 34.88001, violence 33.87917
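A minimal NumPy sketch of these updates, operating on a dense count matrix n[w, d]; the random initialisation, fixed iteration count and dense (word × topic × document) intermediate array are simplistic placeholder choices:

    import numpy as np

    def plsa_em(n_wd, K, n_iters=100, seed=0):
        """pLSA by EM: n_wd is an (M_words, N_docs) count matrix, K the number of topics."""
        M, N = n_wd.shape
        rng = np.random.default_rng(seed)
        p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)      # P(w | z)
        p_z_d = rng.random((K, N)); p_z_d /= p_z_d.sum(axis=0)      # P(z | d)
        for _ in range(n_iters):
            # E-step: P(z | d, w) for every (word, topic, document) triple
            joint = p_w_z[:, :, None] * p_z_d[None, :, :]           # (M, K, N)
            p_z_dw = joint / joint.sum(axis=1, keepdims=True)
            # M-step: re-estimate P(w | z) and P(z | d) from the completed counts
            counts = n_wd[:, None, :] * p_z_dw                      # (M, K, N)
            p_w_z = counts.sum(axis=2)
            p_w_z /= p_w_z.sum(axis=0, keepdims=True)
            p_z_d = counts.sum(axis=0)
            p_z_d /= p_z_d.sum(axis=0, keepdims=True)
        return p_w_z, p_z_d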

Example (2): topics (10 of 128) extracted from 12K Science Magazine articles.

[Figure: the top words under P(w|z) for ten of the topics.]

Background reading

- Bishop, sections 9.1 to 9.3.2.
- Other topics you should know about: random forest classifiers and regressors, semi-supervised learning, collaborative filtering.
- More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml