Lecture 8: Clustering & Mixture Models

1 Lecture 8: Clustering & Mixture Models
C4B Machine Learning, Hilary 2011, A. Zisserman
K-means algorithm
GMM and the EM algorithm
pLSA clustering

2 K-means algorithm
Partition data into K sets.
Initialize: choose K centres (at random).
Repeat:
1. Assign points to the nearest centre.
2. New centre = mean of points assigned to it.
Until no change.
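The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the choice of initial centres as random data points, the iteration cap `n_iter`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means on an (N, d) array X; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K distinct data points as the starting centres (an assumption).
    centres = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centre (squared Euclidean distance).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Until no change: stop once the assignments are stable.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: move each centre to the mean of the points assigned to it.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[k] = members.mean(axis=0)
    return centres, labels

# Toy data: two well-separated blobs (hypothetical example).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centres, labels = kmeans(X, K=2)
```

On data this well separated the two blobs end up in different clusters; on harder data the result depends on the initialization.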

3 Example
Cost function
K-means minimizes a measure of distortion for a set of vectors $\{x_i\}$, $i = 1, \dots, N$:
$$D = \sum_{i=1}^{N} \| x_i^k - c_k \|^2$$
where the $x_i^k$ are the subset of vectors assigned to cluster $k$. The objective is to find the set of centres $\{c_k\}$, $k = 1, \dots, K$, that minimize the distortion:
$$\min_{c_k} \sum_{i=1}^{N} \| x_i^k - c_k \|^2$$
Introducing binary assignment variables $r_{ik}$, the distortion can be written as
$$D = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \| x_i - c_k \|^2$$
where, if $x_i$ is assigned to cluster $k$, then
$$r_{ij} = \begin{cases} 1 & j = k \\ 0 & j \neq k \end{cases}$$
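As a small check of the second form of the distortion, the sketch below builds the binary assignment matrix r from a label vector and evaluates D directly. The function name `distortion` and the three-point data set are hypothetical, chosen so the result is easy to verify by hand.

```python
import numpy as np

def distortion(X, centres, labels):
    """D = sum_i sum_k r_ik ||x_i - c_k||^2, with one-hot r built from labels."""
    N, K = len(X), len(centres)
    r = np.zeros((N, K))
    r[np.arange(N), labels] = 1.0  # r_ik = 1 iff x_i is assigned to cluster k
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return float((r * d2).sum())

# Hypothetical example: two points around (0, 1), one point exactly at (10, 0).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centres = np.array([[0.0, 1.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
D = distortion(X, centres, labels)  # 1 + 1 + 0 = 2.0
```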

4 Minimizing the Cost function
We want to determine
$$\min_{c_k,\, r_{ik}} D = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \| x_i - c_k \|^2$$
Step 1: minimize over assignments $r_{ik}$.
Each term in $x_i$ can be minimized independently by assigning $x_i$ to the closest centre $c_k$.
Step 2: minimize over centres $c_k$.
$$\frac{d}{dc_k} \sum_{i=1}^{N} \sum_{j=1}^{K} r_{ij} \| x_i - c_j \|^2 = -2 \sum_{i=1}^{N} r_{ik} (x_i - c_k) = 0$$
Hence
$$c_k = \frac{\sum_{i=1}^{N} r_{ik} x_i}{\sum_{i=1}^{N} r_{ik}}$$
i.e. $c_k$ is the mean (centroid) of the vectors assigned to it.
Note: since neither step can increase the cost D, the algorithm will converge, but it can converge to a local rather than global minimum.
[Figure: decrease in distortion cost D with iterations]
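The claim that neither step increases D can be checked numerically. The sketch below records D after every assignment step and every centre update; the function name `kmeans_cost_trace`, the random-data-point initialization, and the Gaussian toy data are assumptions for illustration only.

```python
import numpy as np

def kmeans_cost_trace(X, K, n_iter=20, seed=0):
    """Run K-means and record the distortion D after each half-step."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    trace = []
    for _ in range(n_iter):
        # Step 1: minimize over assignments with the centres held fixed.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        trace.append(d2[np.arange(len(X)), labels].sum())
        # Step 2: minimize over centres with the assignments held fixed.
        for k in range(K):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
        trace.append(((X - centres[labels]) ** 2).sum())
    return np.array(trace)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # hypothetical unstructured data
trace = kmeans_cost_trace(X, K=3)
```

Plotting `trace` against its index gives a curve of the same kind as the figure on this slide: D never goes up, and flattens once the assignments stop changing.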

5 Sensitive to initialization
[Figures not reproduced]

6 Practicalities
Always run the algorithm several times with different initializations and keep the run with the lowest cost.
Choice of K.
Suppose we have data for which a distance is defined, but it is non-vectorial (so cannot be added). Which step needs to change?
Many other clustering methods: hierarchical K-means, K-medoids, agglomerative clustering.
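One possible answer to the question above is sketched here as an alternating K-medoids scheme: the assignment step is unchanged, but the centre update can no longer take a mean, so each centre becomes the cluster member with the smallest total distance to the others. The function name `kmedoids` and the precomputed distance-matrix interface are assumptions; the numeric toy data is only a stand-in for objects that really cannot be averaged.

```python
import numpy as np

def kmedoids(dist, K, n_iter=100, seed=0):
    """Alternating K-medoids on a precomputed (N, N) distance matrix."""
    rng = np.random.default_rng(seed)
    N = dist.shape[0]
    medoids = rng.choice(N, size=K, replace=False)  # indices of the centre items
    for _ in range(n_iter):
        # Assignment step (unchanged): nearest centre under the given distance.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if len(members) == 0:
                continue
            # Update step (changed): no mean is available, so choose the member
            # with the smallest summed distance to the rest of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Stand-in data: only the pairwise distances are passed to the algorithm.
values = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dist = np.abs(values[:, None] - values[None, :])
medoids, labels = kmedoids(dist, K=2)
```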

7 Example application 1: vector quantization
All vectors in a cluster are considered equivalent.
They can be represented by a single vector: the cluster centre.
Applications in compression, segmentation, noise reduction.
Example: image segmentation
K-means cluster all pixels using their colour vectors (3D).
Assign pixels to their clusters.
Colour pixels by their cluster assignment.
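The quantization step itself is short once the centres are known. The sketch below assumes a codebook of colour centres has already been found (for instance by K-means) and replaces every pixel by its nearest centre; the function name `quantize`, the 2x2 image, and the two-colour codebook are hypothetical.

```python
import numpy as np

def quantize(X, codebook):
    """Map each row of X to the index and value of its nearest codebook vector."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = d2.argmin(axis=1)
    return codes, codebook[codes]

# Hypothetical 2x2 RGB image and a codebook assumed to come from K-means.
image = np.array([[[250.0, 5.0, 5.0], [240.0, 10.0, 0.0]],
                  [[0.0, 0.0, 250.0], [10.0, 5.0, 245.0]]])
codebook = np.array([[255.0, 0.0, 0.0], [0.0, 0.0, 255.0]])

pixels = image.reshape(-1, 3)                 # one colour vector (3D) per pixel
codes, approx = quantize(pixels, codebook)
segmentation = codes.reshape(image.shape[:2])  # cluster index per pixel
recoloured = approx.reshape(image.shape)       # pixels coloured by their cluster centre
```

Storing `segmentation` plus the codebook instead of the full image is the compression use mentioned above.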

8 Example application 2: face clustering
Determine the principal cast of a feature film.
Approach: view this as a clustering problem on faces.
Algorithm outline
1. Detect faces for every fifth frame in the movie.
2. Describe the face by a vector of intensities.
3. Cluster using a K-means algorithm.
Example: Groundhog Day, 2000 frames

9 [Figure: subset of detected faces in temporal order]
[Figure: clusters for K = 4]
Gaussian Mixture Models

10 Hard vs soft assignments
In K-means, there is a hard assignment of vectors to a cluster.
However, for vectors near the boundary this may be a poor representation.
Instead, we can consider a soft assignment, where the strength of the assignment depends on distance.
Gaussian Mixture Model (GMM)
Combine simple models into a complex model.
[Figure labels: component, mixing coefficient; K = 3]

11 [Figure panels: single Gaussian; mixture of two Gaussians]
Cost function for fitting a GMM
For a point $x_i$:
$$p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
The likelihood of the GMM for N points (assuming independence) is
$$\prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
and the (negative) log-likelihood is
$$L(\theta) = -\sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
where $\theta$ are the parameters we wish to estimate (i.e. $\mu_k$ and $\Sigma_k$ in this case).
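The negative log-likelihood above can be evaluated directly. The following is a minimal sketch with hypothetical helper names (`gaussian_pdf`, `neg_log_likelihood`) and toy parameters; it evaluates the densities naively, so it is only suitable for small, well-scaled examples.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse.
    maha = (diff * np.linalg.solve(Sigma, diff.T).T).sum(axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def neg_log_likelihood(X, pi, mu, Sigma):
    """L(theta) = -sum_i ln sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    mix = sum(pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(len(pi)))
    return float(-np.log(mix).sum())

# Hypothetical 1D example with two components.
X = np.array([[-2.0], [2.0]])
pi = np.array([0.5, 0.5])
mu = np.array([[-2.0], [2.0]])
Sigma = np.array([np.eye(1), np.eye(1)])
L = neg_log_likelihood(X, pi, mu, Sigma)
```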

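12 To minimize $L(\theta)$, differentiate first wrt $\mu_k$ and set the derivative to zero:
$$\frac{dL(\theta)}{d\mu_k} = -\sum_{i=1}^{N} \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_i - \mu_k) = 0$$
Writing the fraction as $\gamma_{ik}$ and rearranging,
$$\sum_{i=1}^{N} \gamma_{ik} \, \mu_k = \sum_{i=1}^{N} \gamma_{ik} \, x_i$$
and hence
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i \quad \text{(weighted mean)}$$
where
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, \qquad N_k = \sum_{i=1}^{N} \gamma_{ik}$$
and $\gamma_{ik}$ are the responsibilities of mixture component $k$ for vector $x_i$. $N_k$ is the effective number of vectors assigned to component $k$. The $\gamma_{ik}$ play a similar role to the assignment variables $r_{ik}$ in K-means, but $\gamma_{ik}$ is not binary: $0 \le \gamma_{ik} \le 1$.
Differentiating wrt $\Sigma_k$ gives
$$\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top \quad \text{(weighted covariance)}$$
and wrt $\pi_k$ (enforcing the constraint that $\sum_k \pi_k = 1$ with a Lagrange multiplier) gives
$$\pi_k = \frac{N_k}{N}$$
which is the average responsibility for the component.
Now, an algorithm for minimizing the cost function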

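13 Expectation Maximization (EM) Algorithm
Step 1, Expectation: compute responsibilities using current parameters $\mu_k$, $\Sigma_k$ (assignment)
$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
Step 2, Maximization: re-estimate parameters using computed responsibilities
$$\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$
where
$$N_k = \sum_{i=1}^{N} \gamma_{ik}$$
Repeat until convergence.
Example in 1D
Data: [...]
Objective: fit a mixture-of-Gaussians model $P(x \mid \theta)$ with K = 2 components.
Model and parameters: [...] (some parameters are kept fixed, i.e. only [...] is estimated)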
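The two steps can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names, the fixed iteration count in place of a convergence test, the hand-picked initial parameters, and the 1D toy data are all hypothetical, and there is no safeguard against collapsing covariances or numerical underflow.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    maha = (diff * np.linalg.solve(Sigma, diff.T).T).sum(axis=1)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))

def e_step(X, pi, mu, Sigma):
    """Responsibilities gamma[i, k], normalized over components k."""
    weighted = np.stack(
        [pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(len(pi))], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step(X, gamma):
    """Re-estimate pi, mu, Sigma from the responsibilities."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    mu = (gamma.T @ X) / Nk[:, None]            # weighted means
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # weighted covariance
    return Nk / N, mu, Sigma

def fit_gmm(X, pi, mu, Sigma, n_iter=50):
    # A fixed number of iterations stands in for "repeat until convergence".
    for _ in range(n_iter):
        gamma = e_step(X, pi, mu, Sigma)
        pi, mu, Sigma = m_step(X, gamma)
    return pi, mu, Sigma, e_step(X, pi, mu, Sigma)

# Hypothetical 1D data: two groups centred near -2 and +2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (100, 1)), rng.normal(2.0, 0.5, (100, 1))])
pi0 = np.array([0.5, 0.5])
mu0 = np.array([[-1.0], [1.0]])
Sigma0 = np.array([np.eye(1), np.eye(1)])
pi, mu, Sigma, gamma = fit_gmm(X, pi0, mu0, Sigma0)
```

Here `gamma` is stored with one row per point, so its rows (not its columns) sum to 1.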

14 Intuition of EM
E-step: compute soft assignment of the points, using current parameters.
M-step: update parameters using current responsibilities.
[Figure: alternating E and M steps]
Likelihood function
Likelihood is a function of the parameters $\theta$.
Probability is a function of the r.v. x.

15 E-step: What do we actually compute?
An n_components x n_points matrix (columns sum to 1).
[Figure: matrix with rows Component 1, Component 2 and columns Point 1, Point 2, ..., Point 6]
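To make the shape of this matrix concrete, here is a small sketch for two 1D components and six points. The point values, the parameters, and the helper name `normal_pdf` are made up for illustration and are not the data from the slide.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.array([-2.1, -1.9, -2.0, 1.8, 2.2, 0.0])  # six hypothetical points
pi = np.array([0.5, 0.5])                        # mixing coefficients
mu = np.array([-2.0, 2.0])                       # current means
sigma = np.array([1.0, 1.0])                     # current standard deviations

# Rows index components, columns index points, matching the layout on the slide.
weighted = pi[:, None] * normal_pdf(x[None, :], mu[:, None], sigma[:, None])
gamma = weighted / weighted.sum(axis=0, keepdims=True)  # each column sums to 1
```

The last point lies midway between the two means, so its column is split evenly between the components, while the points near -2 are assigned almost entirely to component 1.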

18 2D example: fitting means and covariances
Practicalities
Usually initialize the EM algorithm using K-means.
Choice of K.
Can converge to a local rather than global minimum.

19 Probabilistic Latent Semantic Analysis (pLSA) (non-examinable)
Unsupervised learning of topics
Given a large collection of text documents (e.g. a website, or news archive):
Discover the principal semantic topics in the collection.
Hence documents can be retrieved/organized according to topic.
The method involves fitting a mixture model to a representation of the collection.

20 Document-Term Matrix: bag-of-words model
D = document collection, with documents $d_1, \dots, d_i, \dots, d_I$
W = lexicon/vocabulary, with words $w_1, \dots, w_j, \dots, w_J$
Example document $d_i$: "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]"
[Figure: document $d_i$ mapped to its row of term counts in the Document-Term Matrix, for terms such as "artifact", "artificial", "intelligence", "interest"] [Hofmann 99]
d: documents, w: words, z: topics
Model fitting: find topic vectors $P(w \mid z)$ common to all documents, and mixture coefficients $P(z \mid d)$ specific to each document.
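A document-term matrix of this kind can be built with a few lines of code. The sketch below uses a deliberately crude tokenizer (lowercasing and whitespace splitting); the function name `document_term_matrix` and the two short documents are hypothetical.

```python
from collections import Counter

import numpy as np

def document_term_matrix(docs):
    """Return (counts, vocab): counts[i, j] = occurrences of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)), dtype=int)
    for i, tokens in enumerate(tokenized):
        for word, c in Counter(tokens).items():
            counts[i, index[word]] = c
    return counts, vocab

docs = ["artificial intelligence chip",
        "intelligence applications intelligence"]
counts, vocab = document_term_matrix(docs)
# vocab  -> ['applications', 'artificial', 'chip', 'intelligence']
# counts -> [[0, 1, 1, 1],
#            [1, 0, 0, 2]]
```

Word order is discarded, which is what "bag of words" refers to.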

21 [Hofmann 99]
d: documents, w: words, z: topics
$P(w \mid z)$ are the latent aspects.
Non-negative matrix factorization: each document histogram is explained as a sum over topics.
Fitting pLSA parameters
Observed counts of word i in document j [...]
Maximize the likelihood of the data using EM.
M: number of words; N: number of documents

22 Expectation Maximization Algorithm for pLSA
E step: posterior probability of latent variables ("topics")
Probability that the occurrence of term w in document d can be explained by topic z.
M step: parameter estimation based on "completed" statistics
How often is term w associated with topic z?
How often is document d associated with topic z?
Example (1)
Topics (3 of 100) extracted from Associated Press news
Topic 1: securities, firm, drexel, investment, bonds, sec, bond, junk, milken, firms, investors, lynch, insider, shearson, boesky, lambert, merrill, brokerage, corporate, burnham
Topic 2: ship, coast, guard, sea, boat, fishing, vessel, tanker, spill, exxon, boats, waters, valdez, alaska, ships, port, hazelwood, vessels, ferry, fishermen
Topic 3: india, singh, militants, gandhi, sikh, indian, peru, hindu, lima, kashmir, tamilnadu, killed, india's, punjab, delhi, temple, shining, menem, hindus, violence
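The slide's update equations did not survive extraction, so the sketch below uses the standard pLSA updates, which is an assumption about what the slide showed: the E step sets $P(z \mid d, w) \propto P(z \mid d)\,P(w \mid z)$, and the M step re-estimates $P(w \mid z)$ and $P(z \mid d)$ from the counts weighted by that posterior. The function name `plsa`, the random initialization, the fixed iteration count, and the toy count matrix are hypothetical.

```python
import numpy as np

def plsa(n_dw, Z, n_iter=100, seed=0):
    """Fit pLSA to a (documents x words) count matrix; returns P(w|z), P(z|d)."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_w_z = rng.random((Z, W))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # each topic is a distribution over words
    p_z_d = rng.random((D, Z))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # each document is a mixture over topics
    for _ in range(n_iter):
        # E step: posterior P(z | d, w), shape (D, W, Z).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M step: counts "completed" with the topic posterior.
        weighted = n_dw[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T          # how often word w goes with topic z
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(axis=1)            # how often document d goes with topic z
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_z, p_z_d

# Hypothetical counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
n_dw = np.array([[4, 3, 0, 0],
                 [2, 5, 0, 0],
                 [0, 0, 3, 4],
                 [0, 0, 6, 1]], dtype=float)
p_w_z, p_z_d = plsa(n_dw, Z=2)
```

Sorting each row of `p_w_z` in decreasing order gives ranked word lists of the kind shown in the topic examples.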

23 Example (2)
Topics (10 of 128) extracted from Science Magazine articles (12K)
[Figure: topic word lists ranked by $P(w \mid z)$]
Background reading
Bishop, chapter
Other topics you should know about:
random forest classifiers and regressors
semi-supervised learning
collaborative filtering
More on web page:
