Machine Learning
CUNY Graduate Center, Spring 2013
Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models)
Professor Liang Huang
huang@cs.qc.cuny.edu
http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
Roadmap
so far: (large-margin) supervised learning
- binary, multiclass, and structured classification
- online learning: avg perceptron/MIRA, convergence proof
- kernels and kernelized perceptron in dual
- SVMs: formulation, KKT, dual, convex optimization, slacks
- structured perceptron, HMM, and the Viterbi algorithm (e.g. tagging "the man bit the dog" as DT NN VBD DT NN)
what we left out: many classical algorithms
- nearest neighbors (instance-based), decision trees, logistic regression, ...
next up: unsupervised learning
- clustering: k-means, EM, mixture models, hierarchical
- dimensionality reduction: linear (PCA/ICA, MDS), nonlinear (isomap)
Sup => Unsup: Nearest Neighbor => k-means
let's look at a supervised learning method first: nearest neighbor
SVM, perceptron (in dual), and NN are all instance-based learning
instance-based learning: store a subset of the examples for classification
compression rate: SVM: very high; perceptron: medium-high; NN: 0 (stores every example)
k-Nearest Neighbor
voting among k > 1 neighbors is one way to prevent overfitting => more stable results
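A minimal k-NN sketch in Python/NumPy (function and variable names are my own, not from the slides): store all training examples and classify a query by majority vote among its k nearest neighbors; k = 1 recovers plain NN.

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        # distances from the query point to every stored training example
        dists = np.linalg.norm(X_train - x, axis=1)
        # labels of the k closest examples
        nearest = y_train[np.argsort(dists)[:k]]
        # majority vote among the k neighbors
        labels, counts = np.unique(nearest, return_counts=True)
        return labels[np.argmax(counts)]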
NN Voronoi in 2D and 3D [figure]
Voronoi for Euclidean and Manhattan distances [figure]
Unsupervised Learning
cost of supervised learning: labeled data is expensive to annotate!
but huge amounts of unlabeled data exist
unsupervised learning can only "hallucinate" the labels: it infers some internal structure of the data
still the compression view of learning: too much data => reduce it!
- clustering: reduce the # of examples
- dimensionality reduction: reduce the # of dimensions
Challenges in Unsupervised Learning
how to evaluate the results? there is no gold-standard data! internal metrics?
how to interpret the results? how to name the clusters?
how to initialize the model/guess? a bad initial guess can lead to very bad results; unsupervised learning is very sensitive to initialization (unlike supervised)
how to optimize? the objective is in general no longer convex!
k-means
(randomly) pick k points to be the initial centroids
repeat the two steps until convergence:
- assignment to centroids: Voronoi, like NN
- recomputation of centroids based on the new assignment
[figure, panels (a)-(i): alternating assignment and recomputation steps on 2-D data]
how to define convergence?
- after a fixed number of iterations, or
- assignments do not change, or
- centroids do not change (are these two equivalent?), or
- change in the objective function falls below a threshold
(the full loop is sketched in code below)
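A minimal sketch of this loop in Python/NumPy (names are my own; this simple version assumes no cluster goes empty, and uses "centroids do not change" as the convergence test):

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # (randomly) pick k data points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # assignment step: each point goes to its nearest centroid (Voronoi)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # recomputation step: each centroid moves to the mean of its points
            # (a robust version would re-seed any cluster that goes empty)
            new_centroids = np.array([X[assign == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):  # centroids no longer change
                break
            centroids = new_centroids
        return centroids, assign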
k-means objective function
residual sum of squares (RSS): the sum of squared distances from each point to its centroid,
J = \sum_j \| x_j - \mu_{c(j)} \|^2
guaranteed to decrease monotonically
convergence proof: monotone decrease + finite # of possible clusterings
[figure: J decreasing over the first few iterations]
k-means for image segmentation [figure]
Problems with k-means
problem: sensitive to initialization; the objective function is non-convex, with many local minima
why? k-means works well only when:
- clusters are spherical
- clusters are well separated
- clusters have similar volumes
- clusters have similar # of examples
Better ("soft") k-means?
random restarts -- definitely help
soft clusters => EM with a Gaussian Mixture Model
[figure: hard k-means clusters vs. mixture densities p(x)]
k-means, restated
randomize k initial centroids
repeat the two steps until convergence:
- E-step: assign each example to its nearest centroid (Voronoi)
- M-step: recompute the centroids based on the new assignment
EM for Gaussian Mixtures
randomize k means, covariances, mixing coefficients
repeat the two steps until convergence:
- E-step: evaluate the responsibilities using the current parameters (fractional assignments)
- M-step: re-estimate the parameters using the current responsibilities
[figure, panels (a)-(f): responsibilities and fitted Gaussians after L = 1, 2, 5, 20 iterations]
Convergence
EM converges much slower than k-means
can't use "assignments don't change" for convergence
use the log likelihood of the data instead: stop if the increase in log likelihood falls below a threshold, or a maximum # of iterations has been reached
\mathcal{L} = \log P(\text{data}) = \log \prod_j P(x_j) = \sum_j \log P(x_j) = \sum_j \log \sum_i P(c_i)\, P(x_j \mid c_i)
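A sketch of the whole loop in Python/NumPy, stopping on a small log-likelihood gain. To keep it short I assume spherical covariances (one variance per component); the figures in the slides use full covariances, and all names here are my own:

    import numpy as np

    def em_gmm(X, k, max_iters=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, size=k, replace=False)]   # means
        var = np.full(k, X.var())                      # one spherical variance per component
        pi = np.full(k, 1.0 / k)                       # mixing coefficients
        prev_ll = -np.inf
        for _ in range(max_iters):
            # E-step: responsibilities r[j,i] = P(c_i | x_j) under current parameters
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            dens = pi * (2 * np.pi * var) ** (-d / 2) * np.exp(-sq / (2 * var))
            ll = np.log(dens.sum(axis=1)).sum()        # log likelihood of the data
            if ll - prev_ll < tol:                     # stop on small likelihood gain
                break
            prev_ll = ll
            r = dens / dens.sum(axis=1, keepdims=True) # fractional assignments
            # M-step: re-estimate parameters from the responsibilities
            nk = r.sum(axis=0)
            mu = (r.T @ X) / nk[:, None]
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            var = (r * sq).sum(axis=0) / (nk * d)
            pi = nk / n
        return pi, mu, var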
EM: pros and cons (vs. k-means)
EM pros:
- doesn't need the clusters to be spherical
- doesn't need the clusters to be well separated
- doesn't need the clusters to have similar sizes/volumes
EM cons:
- converges much slower than k-means
- per-iteration computation is also slower (to speed up EM: use k-means to burn in)
- (same as k-means) local optima!
k-means is a special case of EM
k-means is "hard" EM, where:
- each covariance matrix is diagonal with equal entries -- i.e., spherical clusters
- the variances approach 0, so every responsibility collapses to 0 or 1
[figure: final k-means assignment (i) vs. GMM fit at L = 20 (f)]
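A toy numeric check of this limit (my own example: one point, two 1-D components with equal mixing weights): as the shared variance shrinks, the soft responsibilities collapse to the hard 0/1 assignments of k-means.

    import numpy as np

    x, mu = 1.2, np.array([1.0, 2.0])              # one point, two component means
    for var in [1.0, 0.1, 0.01]:
        dens = np.exp(-(x - mu) ** 2 / (2 * var)) # equal weights and normalizers cancel
        print(var, dens / dens.sum())             # responsibilities -> (1, 0) as var -> 0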
Why does EM increase p(data) iteratively?
decompose the log likelihood using any distribution q over the hidden assignments:
\ln p(x \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p(z \mid x, \theta))
\mathcal{L}(q, \theta) is a convex auxiliary function: since the KL-divergence is non-negative, it lower-bounds \ln p(x \mid \theta)
the E-step makes the bound tight (KL = 0); the M-step maximizes the bound, so each iteration can only increase \ln p(x \mid \theta) => converge to local maxima
How to maximize the auxiliary?
[figure: \ln p(x \mid \theta) and the lower bound \mathcal{L}(q, \theta); maximizing the bound moves \theta_{old} to \theta_{new}]
Machine Learning: A Geometric Approach
CUNY Graduate Center, Spring 2013
Lecture 13: Unsupervised Learning 2 (dimensionality reduction; linear: PCA, ICA, CCA, LDA1, LDA2; non-linear: kernel PCA, isomap, LLE, SDE)
Professor Liang Huang
huang@cs.qc.cuny.edu
http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
Dimensionality Reduction [figures]
Dimensionality Reduction: example from the Isomap paper (Tenenbaum, de Silva & Langford, Science 2000)
[figure: hand images laid out along two recovered dimensions, wrist rotation and fingers extension]
Algorithms
linear methods:
- PCA - principal component analysis
- ICA - independent component analysis
- CCA - canonical correlation analysis
- MDS - multidimensional scaling
- LDA1 - linear discriminant analysis
- LDA2 - latent dirichlet allocation
non-linear methods:
- kernelized PCA
- isomap
- LLE - locally linear embedding
- SDE - semidefinite embedding
- LEM - laplacian eigenmaps
all are spectral methods! -- i.e., using eigenvalues
PCA
greedily find d orthogonal axes onto which the variance under projection is maximal
the max-variance subspace formulation:
- 1st PC: direction of greatest variability in the data
- 2nd PC: the next uncorrelated max-variance direction (remove all variance along the 1st PC, redo max-var)
another, equivalent formulation: minimum reconstruction error -- find orthogonal vectors onto which the projection yields the minimum MSE reconstruction
PCA optimization: max-variance projection
- first translate the data to zero mean
- compute the covariance matrix
- find the top d eigenvalues and eigenvectors of the covariance matrix
- project the data onto those eigenvectors
(these steps are sketched in code below)
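The four steps in NumPy (a minimal sketch; the function name and the argument d for the target dimension are my own):

    import numpy as np

    def pca(X, d):
        Xc = X - X.mean(axis=0)                   # translate data to zero mean
        C = np.cov(Xc, rowvar=False)              # covariance matrix
        vals, vecs = np.linalg.eigh(C)            # eigh: symmetric matrix, ascending eigenvalues
        top = vecs[:, np.argsort(vals)[::-1][:d]] # top-d eigenvectors
        return Xc @ top                           # project data onto them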
PCA for k-means and whitening
rescaling to zero mean and unit variance is useful preprocessing (we did that in perceptron HW1/2 also!)
but PCA can do more: whitening (sphering), which also decorrelates the features -- see the sketch below
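A PCA-whitening sketch under the same assumptions as above: rotate onto all the eigenvectors, then rescale each direction by the inverse square root of its eigenvalue, giving (approximately) identity covariance; eps is a hypothetical guard against zero eigenvalues.

    import numpy as np

    def whiten(X, eps=1e-8):
        Xc = X - X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        return (Xc @ vecs) / np.sqrt(vals + eps)  # unit variance in every direction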
Eigendigits [figure]
Eigenfaces [figures]
LDA: Fisher's linear discriminant analysis
PCA finds a small representation of the data; LDA reduces dimensionality while preserving as much of the class-discrimination information as possible
it's a linear classification method (like the perceptron)
find a scalar projection that maximizes separability:
- maximize the separation between the projected class means
- minimize the variance within each projected class
(a two-class sketch follows)
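A two-class Fisher LDA sketch (my own minimal version): for classes 0 and 1, the maximizing direction is w proportional to S_W^{-1} (mu1 - mu0), where S_W is the within-class scatter matrix.

    import numpy as np

    def fisher_lda(X0, X1):
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        # within-class scatter: each class's spread around its own mean
        Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
        # direction maximizing between-class over within-class variance
        w = np.linalg.solve(Sw, mu1 - mu0)
        return w / np.linalg.norm(w)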
PCA vs LDA [figure]
Linear vs. non-linear: LLE or isomap [figure]
Linear vs. non-linear [figures comparing PCA and LLE embeddings of the same data]
PCA vs Kernel PCA [figure]
The brain does non-linear dimensionality reduction [figure]