Clustering and Gaussian Mixtures Oliver Schulte - CMPT 883
2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15 2 25 5 1 15 2 25 detected tures detected using using The n M. second The second is is variable and the and last the last ll ible possible occlusion occlusion l. shape s the shape and ocpearancthe appearance and and compasses many many of of and oc- obabilistic istic way, this way, this constrained objects figs from objects Fergus et al. Figure Figure 2: Output 2: Output of theof feature the feature detector detector ll a small covariance) covariance) cking but lacking geomettight, be tight, but thebut the the image the image and rescaled and rescaled to thetosize of sizea of small a small (typically (typically geometuld Once Once the regions the regions are identified, are identified, they are theycropped are cropped from from From the equations the equations11 11) 11 11) pixel patch. pixel patch. Thus, Thus, each patch each patch exists exists in a 121 in adimen- 121 dimen- tative p relative to a ref- which which has has In practice In practice this method this method gives gives stable stable identification identification of fea- of fea- to a ref-annitydensity thereafter and thereafter entropy the entropy decreases decreases theasscale scale increases. increases. K-Means re arts assumed are assumed to totures turesaover variety a variety of sizes of sizes and copes and copes well with well with intra-class intra-class ckground model model asin(within a range a range r). r). ant toant scaling, to scaling, although although experimental experimental tests show tests show that this thatis this is as-variability. Learning variability. The saliency The saliency with measure measure Latent is designed is designed Variables to betoinvari- invari- not entirely not entirely the case thedue casetodue aliasing to aliasing and other and other effects. effects. Note, Note, d p p, Ur f p ) dp r f only monochrome only monochrome information information is used is used to detect to detect and repre- and repre- er. e finder. 1 p(d θ) (N, f) p(d θ) In many sent features. sent settings features. not all variables are observed (labelled) in the training data: x i = (x i, h i ) 2.3. e.g. 2.3. Feature Speech Feature representation recognition: have speech signals, but not The phoneme feature The feature detector labels detector identifies identifies regions regions of interest of interest on each on each image. e.g. image. Object The coordinates Therecognition: coordinates of theof centre the have centre give object us give Xus and labels Xthe andsize (car, the size bicycle), of but theof not region thepart region gives labels gives S. Figure S. (wheel, Figure 2 illustrates 2door, illustrates this seat) on this two ontypical two typical images images from the frommotorbike the motorbike dataset. dataset. Unobserved variables are called latent variables
Latent Variables and Simplicity Latent variables can explain observed correlations with a simple model. Fewer parameters. Common in science: The heart, genes, energy, gravity,... 2 2 2 Smoking Diet Exercise 2 2 2 Smoking Diet Exercise 54 HeartDisease 6 6 6 Symptom 1 Symptom 2 Symptom 3 54 162 486 Symptom 1 Symptom 2 Symptom 3 (a) (b) Fig. Russell and Norvig 2.1
Latent Variable Models: Pros Statistically powerful, often good predictions. Many applications: Learning with missing data. Clustering: missing cluster label for data points. Principal Component Analysis: data points are generated in linear fashion from a small set of unobserved components. (more later) Matrix Factorization, Recommender Systems: Assign users to unobserved user types, assign items to unobserved item types. Use similarity between user type, item type to predict preference of user for item. Winner of $1M Netflix challenge. If latent variables have an intuitive interpretation (e.g., action movies, factors ), discovers new features.
Latent Variable Models: Cons From a user s point of view, like a black box if latent variables don t have an intuitive interpretation. Statistically, hard to guarantee convergence to a correct model with more data (the identifiability problem). Harder computationally, usually no closed form for maximum likelihood estimates.
Latent Learning Methods The Expectation-Maximization algorithm provides a general-purpose local search algorithm for learning parameter values in probabilistic models with latent variables. Deep Learning investigates methods for finding new latent variables.
Outline K-Means
Outline K-Means
Clustering 2 (a) Given a dataset {x 1,..., x N }, each 2 x i R D, partition the dataset into K clusters Intuitively, a cluster is a group of points, which are close together and far from others
Distortion Measure 2 (a) 2 2 (i) 2 Formally, introduce prototypes (or cluster centers) µ k R D Use binary r nk, 1 if point n is in cluster k, otherwise (1-of-K coding scheme again) Find {µ k }, {r nk } to minimize distortion measure: J = N n=1 k=1 K r nk x n µ k 2
Minimizing Distortion Measure Minimizing J directly is hard N K J = r nk x n µ k 2 n=1 k=1 However, two things are easy If we know µ k, minimizing J wrt r nk If we know r nk, minimizing J wrt µ k This suggests an iterative procedure Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Minimize J wrt µ k Rinse and repeat until convergence
Minimizing Distortion Measure Minimizing J directly is hard N K J = r nk x n µ k 2 n=1 k=1 However, two things are easy If we know µ k, minimizing J wrt r nk If we know r nk, minimizing J wrt µ k This suggests an iterative procedure Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Minimize J wrt µ k Rinse and repeat until convergence
Minimizing Distortion Measure Minimizing J directly is hard N K J = r nk x n µ k 2 n=1 k=1 However, two things are easy If we know µ k, minimizing J wrt r nk If we know r nk, minimizing J wrt µ k This suggests an iterative procedure Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Minimize J wrt µ k Rinse and repeat until convergence
2 (a) Determining Membership Variables Step 1 in an iteration of K-means is to minimize distortion measure J wrt cluster membership variables r nk 2 2 (b) 2 J = N n=1 k=1 K r nk x n µ k 2 Terms for different data points x n are independent, for each data point set r nk to minimize: K r nk x n µ k 2 How? k=1 Simply set r nk = 1 for the cluster center µ with smallest distance
2 (a) Determining Membership Variables Step 1 in an iteration of K-means is to minimize distortion measure J wrt cluster membership variables r nk 2 2 (b) 2 J = N n=1 k=1 K r nk x n µ k 2 Terms for different data points x n are independent, for each data point set r nk to minimize: K r nk x n µ k 2 How? k=1 Simply set r nk = 1 for the cluster center µ with smallest distance
2 (a) Determining Membership Variables Step 1 in an iteration of K-means is to minimize distortion measure J wrt cluster membership variables r nk 2 2 (b) 2 J = N n=1 k=1 K r nk x n µ k 2 Terms for different data points x n are independent, for each data point set r nk to minimize: K r nk x n µ k 2 How? k=1 Simply set r nk = 1 for the cluster center µ with smallest distance
Determining Cluster Centers Step 2: fix r nk, minimize J wrt the cluster centers µ k 2 (b) K N J = r nk x n µ k 2 switch order of sums k=1 n=1 2 2 (c) 2 Exercise: minimize wrt each µ k separately. Take derivative, set to zero: 2 N r nk (x n µ k ) = n=1 µ k = n r nkx n n r nk i.e. mean of datapoints x n assigned to cluster k
Determining Cluster Centers Step 2: fix r nk, minimize J wrt the cluster centers µ k 2 (b) K N J = r nk x n µ k 2 switch order of sums k=1 n=1 2 2 (c) 2 Exercise: minimize wrt each µ k separately. Take derivative, set to zero: 2 N r nk (x n µ k ) = n=1 µ k = n r nkx n n r nk i.e. mean of datapoints x n assigned to cluster k
K-means Algorithm Start with initial guess for µ k Iteration of two steps: Minimize J wrt r nk Assign points to nearest cluster center Minimize J wrt µ k Set cluster center as average of points in cluster Rinse and repeat until convergence
K-means example 2 (a) 2
K-means example 2 (b) 2
K-means example 2 (c) 2
K-means example 2 (d) 2
K-means example 2 (e) 2
K-means example 2 (f) 2
K-means example 2 (g) 2
K-means example 2 (h) 2
K-means example 2 (i) 2 Next step doesn t change membership stop
K-means Convergence Repeat steps until no change in cluster assignments For each step, value of J either goes down, or we stop Finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually Note it may be a local maximum rather than a global maximum to which we converge
K-means Example - Image Segmentation Original image K-means clustering on pixel colour values Pixels in a cluster are coloured by cluster mean Represent each pixel (e.g. 24-bit colour value) by a cluster number (e.g. 4 bits for K = 1), compressed version
Hierarchical Clustering See Powerpoint slides.
Outline K-Means
Hard Assignment vs. Soft Assignment 2 (i) In the K-means algorithm, a hard 2 assignment of points to clusters is made However, for points near the decision boundary, this may not be such a good idea Instead, we could think about making a soft assignment of points to clusters
Gaussian Mixture Model 1 (a) 1 (b).5.2.5.5.3.5 1.5 1 The Gaussian mixture model (or mixture of Gaussians MoG) models the data as a combination of Gaussians. a: constant density contours. b: marginal probability p(x). c: surface plot. Widely used general approximation for multi-modal distributions. (Dog Show)
Gaussian Mixture Model 1 (b) 1 (a).5.5.5 1.5 1 Above shows a dataset generated by drawing samples from three different Gaussians
Generative Model z 1 (c).5 x.5 1 The mixture of Gaussians is a generative model To generate a datapoint x n, we first generate a value for a discrete variable z n {1,..., K} We then generate a value x n N (x µ k, Σ k ) for the corresponding Gaussian.
Graphical Model π z n 1 (a).5 x n µ Σ N.5 1 Full graphical model using plate notation Note z n is a latent variable, unobserved Bayes net needs distributions p(z n ) and p(x n z n ) The one-of-k representation is helpful here: z nk {, 1}, z n = (z n1,..., z nk )
Graphical Model - Latent Component Variable π z n 1 (a).5 x n µ Σ N.5 1 Use a Bernoulli distribution for p(z n ) i.e. p(z nk = 1) = π k Parameters to this distribution {π k } Must have π k 1 and K k=1 π k = 1 p(z n ) = K k=1 πz nk k
Graphical Model - Observed Variable π z n 1 (a).5 x n µ Σ N.5 1 Use a Gaussian distribution for p(x n z n ) Parameters to this distribution {µ k, Σ k } p(x n z nk = 1) = N (x n µ k, Σ k ) K p(x n z n ) = N (x n µ k, Σ k ) z nk k=1
Graphical Model - Joint distribution π z n 1 (a).5 x n µ Σ N.5 1 The full joint distribution is given by: p(x, z) = = N p(z n )p(x n z n ) n=1 N K n=1 k=1 π z nk k N (x n µ k, Σ k ) z nk
MoG Marginal over Observed Variables The marginal distribution p(x n ) for this model is: p(x n ) = z n p(x n, z n ) = z n p(z n )p(x n z n ) = K π k N (x n µ k, Σ k ) k=1 A mixture of Gaussians
MoG Conditional over Latent Variable 1 (b) 1 (c).5.5.5 1.5 1 To apply EM, need the conditional distribution p(z nk = 1 x n, θ) where θ are the model parameters. It is denoted by γ(z nk ) and can be computed as: Exercise how? γ(z nk ) p(z nk = 1 x n ) = p(z nk = 1)p(x n z nk = 1) K j=1 p(z nj = 1)p(x n z nj = 1) = π k N (x n µ k, Σ k ) K j=1 π jn (x n µ j, Σ j ) γ(z nk ) is the responsibility of component k for datapoint n
MoG Conditional over Latent Variable 1 (b) 1 (c).5.5.5 1.5 1 To apply EM, need the conditional distribution p(z nk = 1 x n, θ) where θ are the model parameters. It is denoted by γ(z nk ) and can be computed as: Exercise how? γ(z nk ) p(z nk = 1 x n ) = p(z nk = 1)p(x n z nk = 1) K j=1 p(z nj = 1)p(x n z nj = 1) = π k N (x n µ k, Σ k ) K j=1 π jn (x n µ j, Σ j ) γ(z nk ) is the responsibility of component k for datapoint n
MoG Conditional over Latent Variable 1 (b) 1 (c).5.5.5 1.5 1 To apply EM, need the conditional distribution p(z nk = 1 x n, θ) where θ are the model parameters. It is denoted by γ(z nk ) and can be computed as: Exercise how? γ(z nk ) p(z nk = 1 x n ) = p(z nk = 1)p(x n z nk = 1) K j=1 p(z nj = 1)p(x n z nj = 1) = π k N (x n µ k, Σ k ) K j=1 π jn (x n µ j, Σ j ) γ(z nk ) is the responsibility of component k for datapoint n
Expectation-Maximization for Gaussian Mixtures Initialize parameters, then iterate: E step: Calculate responsibilities using current parameters γ(z nk ) = π kn (x n µ k, Σ k ) K j=1 π jn (x n µ j, Σ j ) M step: Re-estimate parameters using these γ(z nk ) N k N γ(z nk ) n=1 µ k = 1 N γ(z nk )x n N k n=1 Σ k = 1 N γ(z nk )(x n µ N k )(x n µ k ) T k n=1 π k = N k N Think of N k as effective number of points in component k.
Expectation-Maximization for Gaussian Mixtures Initialize parameters, then iterate: E step: Calculate responsibilities using current parameters γ(z nk ) = π kn (x n µ k, Σ k ) K j=1 π jn (x n µ j, Σ j ) M step: Re-estimate parameters using these γ(z nk ) N k N γ(z nk ) n=1 µ k = 1 N γ(z nk )x n N k n=1 Σ k = 1 N γ(z nk )(x n µ N k )(x n µ k ) T k n=1 π k = N k N Think of N k as effective number of points in component k.
MoG EM - Example 2 (a) 2 Same initialization as with K-means before Often, K-means is actually used to initialize EM
MoG EM - Example 2 (b) 2 Calculate responsibilities γ(z nk )
MoG EM - Example 2 (c) 2 Calculate model parameters {π k, µ k, Σ k } using these responsibilities
MoG EM - Example 2 (d) 2 Iteration 2
MoG EM - Example 2 (e) 2 Iteration 5
MoG EM - Example 2 (f) 2 Iteration 2 - converged
EM - Summary EM finds local maximum to likelihood p(x θ) = Z p(x, Z θ) Iterates two steps: E step calculates the distribution of the missing variables Z (Hard EM fills in the variables). M step maximizes expected complete log likelihood (expectation wrt E step distribution)
EM and Maximum Likelihood 1. The EM procedure is guaranteed to increase at each step, the data log-likelihood ln p(x θ) = Z ln p(x, Z θ). 2. Therefore converges to local log-likelihood maximum. More theoretical analysis in Bishop. Log-likelihood L 7 6 5 4 3 2 1-1 -2 5 1 15 2 Iteration number Fig. Russell and Norvig 2.12
Conclusion Readings: Bishop Ch. 9.1, 9.2, 9.4 K-means clustering Gaussian mixture model: latent variables as part of a model for how data are generated. Expectation-maximization, a general method for learning parameters of models when not all variables are observed