Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades posted today Out of 82 points Today: Review: PCA Start: unsupervised learning 2 1
Dimensionality Reduction PCA Machine Learning CSE4546 Sham Kakade University of Washington November 8, 2016 3 Linear projections, a review Project a point into a (lower dimensional) space: point: x = (x 1,,x d ) select a basis set of basis vectors (u 1,,u k ) we consider orthonormal basis: u i u i =1, and u i u j =0 for i¹ j select a center x, defines offset of space best coordinates in lower dimensional space defined by dot-products: (z 1,,z k ), z i = (x-x) u i Sham Kakade 4 2016 2
PCA finds projection that minimizes reconstruction error Given N data points: x i = (x i 1,,x i d ), i=1 N Will represent each point as a projection: where: and N N PCA: Given k<<d, find (u 1,,u k ) minimizing reconstruction error: N x 2 x 1 Sham Kakade 5 2016 Understanding the reconstruction error Note that x i can be represented exactly by d-dimensional projection: d Given k<<d, find (u 1,,u k ) minimizing reconstruction error: N Rewriting error: Sham Kakade 6 2016 3
Reconstruction error and covariance matrix N d N N Sham Kakade 7 2016 Minimizing reconstruction error and eigen vectors Minimizing reconstruction error equivalent to picking orthonormal basis (u 1,,u d ) minimizing: N d Eigen vector definition: Solution: use the eigenvectors from the SVD Sham Kakade 8 2016 4
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 9 Clustering images Set of Images [Goldberger et al.] 10 5
Clustering web search results 11 Some Data 12 6
K-means 1. Ask user how many clusters they d like. (e.g. k=5) 13 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 14 7
K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. (Thus each Center owns a set of datapoints) 15 K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. 4. Each Center finds the centroid of the points it owns 16 8
K-means 1. Ask user how many clusters they d like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it s closest to. 4. Each Center finds the centroid of the points it owns 5. and jumps there 6. Repeat until 2016 terminated! Sham Kakade 17 K-means Randomly initialize k centers µ (0) = µ 1 (0),, µ k (0) Classify: Assign each point jî {1, N} to nearest center: Recenter: µ i becomes centroid of its point: Equivalent to µ i average of its points! 18 9
What is K-means optimizing? Potential function F(µ,C) of centers µ and point allocations C: N Optimal K-means: min µ min C F(µ,C) 19 Does K-means converge??? Part 1 Optimize potential function: Fix µ, optimize C 20 10
Does K-means converge??? Part 2 Optimize potential function: Fix C, optimize µ 21 Coordinate descent algorithms Want: min a min b F(a,b) Coordinate descent: fix a, minimize b fix b, minimize a repeat Converges!!! if F is bounded to a (often good??) local optimum (For LASSO it converged to the global optimum, because of convexity) Some theory of quality of local opt K-means is a coordinate descent algorithm! 22 11
Mixtures of Gaussians Machine Learning CSE546 Sham Kakade University of Washington November 8, 2016 23 (One) bad case for k-means Clusters may overlap Some clusters may be wider than others 24 12
Density Estimation Estimate a density based on x 1,,x N 25 Density Estimation Contour Plot of Joint Density 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 26 13
Density as Mixture of Gaussians Approximate density with a mixture of Gaussians Mixture of 3 Gaussians Contour Plot of Joint Density 0.75 0.7 0.65 0.6 0.55 0.7 0.65 0.6 0.55 0.5 0.5 0.45 0.45 0.4 0.4 0.35 0.35 0.3 0.3 0.25 0 0.2 0.4 0.6 0.8 1 0.25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 27 Gaussians in d Dimensions 1 P(x) = (2π ) m/2 Σ exp # 1 1/2 $ % 2 x µ ( ) T Σ 1 ( x µ ) & ' ( 28 14
Density as Mixture of Gaussians Approximate density with a mixture of Gaussians 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 Mixture of 3 Gaussians 0 0.2 0.4 0.6 0.8 1 p(x i,µ, ) = 29 Density as Mixture of Gaussians Approximate with density with a mixture of Gaussians Mixture of 3 Gaussians Our actual observations 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0 0.2 0.4 0.6 0.8 1 1 0.5 0 (b) 0 0.5 1 30 15
Clustering our Observations Imagine we have an assignment of each x i to a Gaussian Our actual observations 1 (a) 1 (b) 0.5 0.5 0 0 0 0.5 1 Complete data labeled by true cluster assignments 0 0.5 1 31 Clustering our Observations Imagine we have an assignment of each x i to a Gaussian 1 (a) Introduce latent cluster indicator variable z i 0.5 Then we have p(x i z i,,µ, ) = 0 0 0.5 1 Complete data labeled by true cluster assignments 32 16
Clustering our Observations We must infer the cluster assignments from the observations 1 0.5 (c) Posterior probabilities of assignments to each cluster *given* model parameters: r ik = p(z i = k x i,,µ, ) = 0 0 0.5 1 Soft assignments to clusters 33 Unsupervised Learning: not as hard as it looks Sometimes easy Sometimes impossible and sometimes in between 34 17
Summary of GMM Concept Estimate a density based on x 1,,x N KX p(x i, µ, ) = z in (x i µ z i, z i) z i =1 1 (a) 0.5 0 0 0.5 1 Complete data labeled by true cluster assignments Surface Plot of Joint Density, Marginalizing Cluster Assignments 35 Summary of GMM Components Observations x i i 2 R d, i =1, 2,...,N Hidden cluster labels z i 2{1, 2,...,K}, i =1, 2,...,N Hidden mixture means µ k 2 R d, k =1, 2,...,K Hidden mixture covariances Hidden mixture probabilities k 2 R d d, k =1, 2,...,K KX k, k =1 k=1 Gaussian mixture marginal and conditional likelihood : KX p(x i, µ, ) = z i p(x i z i,µ, ) z i =1 p(x i z i,µ, ) =N (x i µ z i, z i) 36 18
Expectation Maximization Machine Learning CSE546 Sham Kakade University of Washington November 8, 2016 37 Next back to Density Estimation What if we want to do density estimation with multimodal or clumpy data? 38 19
But we don t see class labels!!! MLE: argmax Õ i P(z i,x i ) But we don t know z i Maximize marginal likelihood: argmax Õ i P(x i ) = argmax Õ i å k=1 K P(z i =k,x i ) 39 Special case: spherical Gaussians and hard assignments P(z i = k, x i 1 ) = (2π ) m/2 Σ k exp # 1 1/2 2 xi µ k $ % ( ) T Σ 1 k ( x i µ k ) & ' ( P(zi = k) If P(X z=k) is spherical, with same s for all classes: # P(x i z i = k) exp 1 x i 2& µ $ % 2σ 2 k ' ( If each x i belongs to one class C(i) (hard assignment), marginal likelihood: N K N % P(x i, z i = k) exp 1 & ' 2σ 2 i=1 k=1 i=1 x i µ C(i) ( ) * 2 Same as K-means!!! 40 20
Supervised Learning of Mixtures of Gaussians Mixtures of Gaussians: Prior class probabilities: P(z=k) Likelihood function per class: P(x z=k) Suppose, for each data point, we know location x and class z Learning is easy J For prior P(z) For likelihood function: 41 EM: Reducing Unsupervised Learning to Supervised Learning If we knew assignment of points to classes è Supervised Learning! Expectation-Maximization (EM) Guess assignment of points to classes In standard ( soft ) EM: each point associated with prob. of being in each class Recompute model parameters Iterate 42 21
Form of Likelihood Conditioned on class of point x i... p(x i z i,µ, ) = Marginalizing class assignment: p(x i,µ, ) = 43 Gaussian Mixture Model Most commonly used mixture model Observations: Parameters: Likelihood: z i Ex. = country of origin, = height of i th person k th mixture component = distribution of heights in country k x i 44 22
Example (Taken from Kevin Murphy s ML textbook) Data: gene expression levels Goal: cluster genes with similar expression trajectories 45 Mixture models are useful for Density estimation Allows for multimodal density Clustering Want membership information for each observation e.g., topic of current document Soft clustering: p(z i = k x i, )= Hard clustering: z i = arg max p(z i = k x i, )= k 46 23
Issues Label switching Color = label does not matter Can switch labels and likelihood is unchanged Log likelihood is not convex in the parameters Problem is simpler for complete data likelihood 47 ML Estimate of Mixture Model Params Log likelihood L x ( ), log p({x i } ) = X i log X z i p(x i,z i ) Want ML estimate ˆ ML = Neither convex nor concave and local optima 48 24
If complete data were observed Assume class labels z i were observed in addition to x i L x,z ( ) = X log p(x i,z i ) i Compute ML estimates Separates over clusters k! Example: mixture of Gaussians (MoG) = { k,µ k, k } K k=1 49 Iterative Algorithm Motivates a coordinate ascent-like algorithm: 1. Infer missing values z i given estimate of parameters ˆ 2. Optimize parameters to produce new ˆ given filled in data z i 3. Repeat Example: MoG (derivation soon + HW) 1. Infer responsibilities r ik = p(z i = k x i, ˆ (t 1) )= 2. Optimize parameters max w.r.t. k : max w.r.t. µ k, k : 50 25
E.M. Convergence EM is coordinate ascent on an interesting potential function Coord. ascent for bounded pot. func. è convergence to a local optimum guaranteed This algorithm is REALLY USED. And in high dimensional state spaces, too. E.G. Vector Quantization for Speech Data 51 Gaussian Mixture Example: Start 52 26
After first iteration 53 After 2nd iteration 54 27
After 3rd iteration 55 After 4th iteration 56 28
After 5th iteration 57 After 6th iteration 58 29
After 20th iteration 59 Some Bio Assay data 60 30
GMM clustering of the assay data 61 Resulting Density Estimator 62 31
Expectation Maximization (EM) Setup More broadly applicable than just to mixture models considered so far Model: x y observable incomplete data not (fully) observable complete data parameters Interested in maximizing (wrt ): p(x ) = X y p(x, y ) Special case: x = g(y) 63 Expectation Maximization (EM) Derivation Step 1 Rewrite desired likelihood in terms of complete data terms p(y ) =p(y x, )p(x ) Step 2 Assume estimate of parameters Take expectation with respect to ˆ p(y x, ˆ ) 64 32
Expectation Maximization (EM) Derivation Step 3 Consider log likelihood of data at any L x ( ) L x (ˆ ) relative to log likelihood at ˆ Aside: Gibbs Inequality Proof: E p [log p(x)] E p [log q(x)] 65 Motivates EM Algorithm Initial guess: Estimate at iteration t: E-Step Compute M-Step Compute 66 33
Expectation Maximization (EM) Derivation L x ( ) L x (ˆ ) =[U(, ˆ ) U(ˆ, ˆ )] [V (, ˆ ) V (ˆ, ˆ )] Step 4 Determine conditions under which log likelihood at Using Gibbs inequality: exceeds that at ˆ If Then L x ( ) L x (ˆ ) 67 Example Mixture Models E-Step Compute M-Step Compute U(, ˆ (t) )=E[log p(y ) x, ˆ (t) ] ˆ (t+1) = arg max U(, ˆ (t) ) Consider y i = {z i,x i } i.i.d. p(x i,z i ) = z ip(x i E qt [log p(y )] = X z i)= E qt [log p(x i,z i )] = i 68 34
Coordinate Ascent Behavior Bound log likelihood: L x ( ) = U(, ˆ (t) )+V(, ˆ (t) ) L x (ˆ (t) )= U(ˆ (t), ˆ (t) )+V(ˆ (t), ˆ (t) ) Figure from KM textbook 69 Comments on EM Since Gibbs inequality is satisfied with equality only if p=q, any step that changes should strictly increase likelihood In practice, can replace the M-Step with increasing U instead of maximizing it (Generalized EM) Under certain conditions (e.g., in exponential family), can show that EM converges to a stationary point of L x ( ) Often there is a natural choice for y has physical meaning If you want to choose any y, not necessarily x=g(y), replace p(y ) in U with p(y, x ) 70 35
Initialization In mixture model case where y i = {z i,x i } there are many ways to initialize the EM algorithm Examples: Choose K observations at random to define each cluster. Assign other observations to the nearest centriod to form initial parameter estimates Pick the centers sequentially to provide good coverage of data Grow mixture model by splitting (and sometimes removing) clusters until K clusters are formed Can be quite important to convergence rates in practice 71 What you should know K-means for clustering: algorithm converges because it s coordinate ascent EM for mixture of Gaussians: How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data Be happy with this kind of probabilistic analysis Remember, E.M. can get stuck in local minima, and empirically it DOES EM is coordinate ascent 72 36