Unsupervised Learning: K-Means & PCA


1 Unsupervised Learning: K-Means & PCA

2 Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y. But what if we don't have labels? No labels = unsupervised learning. Only some points are labeled = semi-supervised learning (labels may be expensive to obtain, so we only get a few). Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

3 K-Means Clustering Some material adapted from slides by Andrew Moore, CMU. Visit http:// for Andrew's repository of Data Mining tutorials.

4 Clustering Data

5 K-Means Clustering K-Means(k, X): Randomly choose k cluster center locations (centroids). Loop until convergence: assign each point to the cluster of the closest centroid; re-estimate the cluster centroids based on the data assigned to each cluster.

6 K-Means Clustering K-Means(k, X): Randomly choose k cluster center locations (centroids). Loop until convergence: assign each point to the cluster of the closest centroid; re-estimate the cluster centroids based on the data assigned to each cluster.

7 K-Means Clustering K-Means(k, X): Randomly choose k cluster center locations (centroids). Loop until convergence: assign each point to the cluster of the closest centroid; re-estimate the cluster centroids based on the data assigned to each cluster.
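The loop above translates almost line-for-line into NumPy. The following is a minimal sketch, not the slides' own implementation: the function name kmeans, the convergence test, and the choice of initial centroids as random data points are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Basic K-Means: X is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to the cluster of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```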

8 K-Means Animation Example generated by Andrew Moore using Dan Pelleg's superduper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.

9 K-Means Objective Function K-means finds a local optimum of the following objective function:
$$S^{*} = \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2$$
where $S = \{S_1, \ldots, S_k\}$ is a partitioning over $X = \{x_1, \ldots, x_n\}$ such that $X = \bigcup_{i=1}^{k} S_i$ and $\mu_i = \mathrm{mean}(S_i)$.
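For reference, this objective can be evaluated directly from the centroids and assignments produced by a K-means run; a short sketch (helper name is illustrative):

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    """Within-cluster sum of squared L2 distances to the assigned centroids."""
    diffs = X - centroids[labels]          # (n, d) residuals, one row per point
    return float(np.sum(diffs ** 2))
```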

10 Problems with K-Means Very sensitive to the initial points: do many runs of K-Means, each with different initial centroids, or seed the centroids using a better method than choosing them randomly (e.g., farthest-first sampling, sketched below). Must manually choose k: learn the optimal k for the clustering; note that this requires a performance measure.
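Farthest-first sampling, one of the seeding heuristics mentioned above, can be sketched as follows (a rough illustration rather than a tuned implementation):

```python
import numpy as np

def farthest_first_centroids(X, k, rng=None):
    """Pick the first centroid at random, then repeatedly pick the point
    farthest from all centroids chosen so far."""
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen centroid
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1)
        centroids.append(X[dists.argmax()])
    return np.array(centroids)
```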

11 Problems with K-Means How do you tell it which clustering you want? (e.g., k = 2) Constrained clustering techniques (semi-supervised): same-cluster constraint (must-link), different-cluster constraint (cannot-link).

12 Gaussian Mixture Models Recall the multivariate Gaussian distribution:
$$P(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$
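This density is straightforward to evaluate in NumPy; the sketch below mirrors the formula above and assumes Σ is positive definite (d is the dimensionality of x):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density P(x | mu, Sigma) for a single point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)  # -(1/2)(x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)
```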

13 The GMM assumption There are k components. The i-th component is called ω_i. Component ω_i has an associated mean vector µ_i. [Figure: component means µ_1, µ_2, µ_3] (Slides 13-32: copyright 2001, 2004, Andrew W. Moore.)

14 The GMM assumption There are k components. The i-th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe:

15 The GMM assumption There are k components. The i-th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i).

16 The GMM assumption There are k components. The i-th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i). 2. Datapoint ~ N(µ_i, σ²I).

17 The General GMM assumption There are k components. The i-th component is called ω_i. Component ω_i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i). 2. Datapoint ~ N(µ_i, Σ_i).
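The two-step generative recipe translates directly into a sampling routine. Below is a sketch for the general-covariance case, with the component priors P(ω_i), means, and covariances passed in as arrays (function and argument names are illustrative):

```python
import numpy as np

def sample_gmm(n, priors, means, covs, rng=None):
    """Draw n points from a GMM: pick a component, then sample from its Gaussian."""
    rng = np.random.default_rng(rng)
    k = len(priors)
    samples, components = [], []
    for _ in range(n):
        i = rng.choice(k, p=priors)                                   # step 1: pick component ω_i
        samples.append(rng.multivariate_normal(means[i], covs[i]))    # step 2: x ~ N(µ_i, Σ_i)
        components.append(i)
    return np.array(samples), np.array(components)
```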

18 Fitting a Gaussian Mixture Model (Optional)

19 Expectation-Maximization for GMMs Iterate until convergence. On the t-th iteration let our estimates be $\lambda_t = \{\mu_1(t), \mu_2(t), \ldots, \mu_c(t)\}$. E-step: compute the expected classes of all datapoints for each class (each likelihood term is just a Gaussian evaluated at $x_k$):
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \sigma^2 I)\, p_j(t)}$$
M-step: estimate $\mu$ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$

20 E.M. for General GMMs Iterate. On the t-th iteration let our estimates be $\lambda_t = \{\mu_1(t), \ldots, \mu_c(t), \Sigma_1(t), \ldots, \Sigma_c(t), p_1(t), \ldots, p_c(t)\}$, where $p_i(t)$ is shorthand for the estimate of $P(\omega_i)$ on the t-th iteration. E-step: compute the expected clusters of all datapoints (again, just evaluate a Gaussian at $x_k$):
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}$$
M-step: estimate $\mu$, $\Sigma$, and $p$ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)} \qquad \Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\,[x_k - \mu_i(t+1)][x_k - \mu_i(t+1)]^{\top}}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \quad \text{where } R = \#\text{records}$$
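A compact NumPy/SciPy sketch of these E- and M-steps. It initializes the means at random data points, adds a small ridge to the covariances for numerical stability, and omits safeguards such as log-space computation; these are implementation choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, rng=None):
    """EM for a general GMM on an (n, d) data matrix X."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    mus = X[rng.choice(n, size=k, replace=False)]                          # initial means µ_i
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(k)])
    ps = np.full(k, 1.0 / k)                                               # initial priors p_i
    for _ in range(n_iters):
        # E-step: responsibilities P(ω_i | x_k, λ_t), one column per component
        resp = np.column_stack([
            ps[i] * multivariate_normal(mus[i], Sigmas[i]).pdf(X) for i in range(k)])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances, and priors
        Nk = resp.sum(axis=0)
        mus = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            Sigmas[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        ps = Nk / n
    return mus, Sigmas, ps
```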

21 (End optional section)

22 Gaussian Mixture Example: Start Advance apologies: in Black and White this example will be incomprehensible

23 After first iteration

24 After 2nd iteration

25 After 3rd iteration

26 After 4th iteration

27 After 5th iteration

28 After 6th iteration

29 After 20th iteration

30 Some Bio Assay data

31 GMM clustering of the assay data

32 Resulting Density Estimator

33 Principal Components Analysis Based on slides by Barnabás Póczos, UAlberta

34 How Can We Visualize High-Dimensional Data? E.g., 53 blood and urine tests for 65 patients. [Table: instances (patients A1, A2, ...) as rows, features (H-WBC, H-RBC, H-Hgb, H-Hct, H-MCV, H-MCH, H-MCHC, ...) as columns] Difficult to see the correlations between the features...

35 Data Visualization Is there a representation better than the raw features? Is it really necessary to show all 53 dimensions? What if there are strong correlations between the features? Could we find the smallest subspace of the 53-D space that keeps the most information about the original data? One solution: Principal Component Analysis.

36 Principal Component Analysis Orthogonal projection of the data onto a lower-dimensional linear space that... maximizes the variance of the projected data (purple line in the figure) and minimizes the mean squared distance between each data point and its projection (sum of the blue lines).

37 The Principal Components Vectors originating from the center of mass. Principal component #1 points in the direction of the largest variance. Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

38 2D Gaussian Dataset

39 1st PCA axis

40 2nd PCA axis

41 Dimensionality Reduction Can ignore the components of lesser significance. [Bar chart: variance (%) explained by PC1 through PC10] You do lose some information, but if the eigenvalues are small, you don't lose much: choose only the first k eigenvectors, based on their eigenvalues; the final data set then has only k dimensions.
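One common way to make "choose only the first k eigenvectors" concrete is to pick the smallest k whose eigenvalues account for a target fraction of the total variance; a sketch (the 95% threshold is just an example):

```python
import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k whose top-k eigenvalues explain >= target fraction of the variance."""
    vals = np.sort(eigenvalues)[::-1]                 # largest eigenvalue first
    cumulative = np.cumsum(vals) / np.sum(vals)       # cumulative explained-variance ratio
    return int(np.searchsorted(cumulative, target) + 1)
```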

42 PCA Algorithm Given data {x_1, ..., x_n}, compute the covariance matrix Σ. X is the n × d data matrix. Compute the data mean (average over all rows of X). Subtract the mean from each row of X (centering the data). Compute the covariance matrix Σ = (1/n) XᵀX of the centered data. The PCA basis vectors are given by the eigenvectors of Σ: Q, Λ = numpy.linalg.eig(Σ), where {q_i, λ_i}, i = 1..d, are the eigenvectors/eigenvalues of Σ, sorted so that λ_1 ≥ λ_2 ≥ ... ≥ λ_d. Larger eigenvalues correspond to more important eigenvectors.
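A sketch of this algorithm in NumPy. It uses numpy.linalg.eigh (the variant for symmetric matrices) rather than eig and sorts the eigenvectors by decreasing eigenvalue; the function and variable names are illustrative:

```python
import numpy as np

def pca(X):
    """PCA of an (n, d) data matrix X: returns the mean, eigenvalues, and eigenvectors."""
    mean = X.mean(axis=0)                     # data mean (average over rows)
    Xc = X - mean                             # center the data
    Sigma = (Xc.T @ Xc) / len(X)              # d x d covariance matrix
    eigvals, Q = np.linalg.eigh(Sigma)        # eigh: eigendecomposition for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    return mean, eigvals[order], Q[:, order]  # columns of Q are the PCA basis vectors
```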

43 PCA [Figure: the data matrix X and the eigenvector matrix Q; the columns of Q are ordered by importance] Slide by Eric Eaton

44 PCA [Figure: to reduce dimensionality, keep only the first k columns of Q] Slide by Eric Eaton
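Keeping only the first k columns of Q amounts to projecting each centered data point onto a k-dimensional subspace; a sketch of the projection and the corresponding reconstruction, assuming the mean and sorted eigenvector matrix Q from a PCA step such as the one sketched above:

```python
import numpy as np

def project(X, mean, Q, k):
    """Project rows of X onto the first k principal components."""
    return (X - mean) @ Q[:, :k]              # (n, k) low-dimensional codes

def reconstruct(Z, mean, Q):
    """Map k-dimensional codes back into the original d-dimensional space."""
    k = Z.shape[1]
    return Z @ Q[:, :k].T + mean
```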

45 PCA Visualization of MNIST Digits

46 Challenge: Facial Recognition Want to identify a specific person based on a facial image, robust to glasses, lighting, ... Can't just use the given 256 × 256 pixels.

47 PCA applications - Eigenfaces Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces. Eigenfaces are the standardized face ingredients derived from the statistical analysis of many pictures of human faces. A human face may be considered to be a combination of these standard faces.

48 PCA applications - Eigenfaces To generate a set of eigenfaces: 1. A large set of digitized images of human faces is taken under the same lighting conditions. 2. The images are normalized to line up the eyes and mouths. 3. The eigenvectors of the covariance matrix of the statistical distribution of face image vectors are then extracted. 4. These eigenvectors are called eigenfaces.
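A hedged sketch of those four steps for a stack of already aligned, same-size grayscale face images. It uses the common trick of eigendecomposing the small n × n matrix XcXcᵀ instead of the huge pixel-by-pixel covariance; the array shapes and the number of components kept are assumptions.

```python
import numpy as np

def eigenfaces(face_images, n_components=15):
    """face_images: (n, h, w) array of aligned grayscale faces.
    Returns the mean face and the top eigenfaces as (n_components, h, w) images."""
    n, h, w = face_images.shape
    X = face_images.reshape(n, h * w).astype(float)   # each face becomes one long vector
    mean_face = X.mean(axis=0)
    Xc = X - mean_face
    # Eigenvectors of the small n x n matrix Xc Xc^T map to eigenvectors of the
    # (h*w) x (h*w) covariance after multiplying back by Xc^T.
    eigvals, V = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(eigvals)[::-1][:n_components]
    U = Xc.T @ V[:, order]                            # columns are the eigenfaces
    U /= np.linalg.norm(U, axis=0)                    # normalize each eigenface
    return mean_face.reshape(h, w), U.T.reshape(n_components, h, w)
```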

49 PCA applications - Eigenfaces The principal eigenface looks like a bland, androgynous average human face.

50 Eigenfaces Face Recognition When properly weighted, eigenfaces can be summed together to create an approximate grayscale rendering of a human face. Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces. Hence eigenfaces provide a means of applying data compression to faces for identification purposes. Similarly: Expert Object Recognition in Video.

51 Eigenfaces Experiment and Results Data used here are from the ORL database of faces. Facial images of 16 persons, each with 10 views, are used. - Training set contains 16 × 7 images. - Test set contains 16 × 3 images. First three eigenfaces:

52 Classification Using Nearest Neighbor Save the average coefficients for each person. Classify a new face as the person with the closest average. Recognition accuracy increases with the number of eigenfaces up to about 15; later eigenfaces do not help much with recognition. Best recognition rates: training set 99%, test set 89%.
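The nearest-neighbor scheme described here, sketched under the assumption that each face has already been projected to a vector of eigenface coefficients (function names are illustrative):

```python
import numpy as np

def train_nearest_mean(coeffs, labels):
    """coeffs: (n, k) eigenface coefficients; labels: person id for each row.
    Returns the list of people and each person's average coefficient vector."""
    people = np.unique(labels)
    averages = np.array([coeffs[labels == p].mean(axis=0) for p in people])
    return people, averages

def classify(query_coeffs, people, averages):
    """Assign the query face to the person whose average coefficients are closest."""
    dists = np.linalg.norm(averages - query_coeffs, axis=1)
    return people[dists.argmin()]
```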

53 Facial Expression Recognition: Happiness subspace

54 Disgust subspace

55 Image Compression

56 Original Image Divide the original 372 × 492 image into patches: each patch is an instance that contains 12 × 12 pixels on a grid. View each patch as a 144-D vector.
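A sketch of the patch-based compression used in the following slides: cut the image into non-overlapping 12 × 12 blocks, treat each block as a 144-D vector, run PCA on the blocks, and keep only the top-k coefficients per patch. The eigendecomposition details and the crop-to-a-multiple handling are implementation choices, not the slides' exact code.

```python
import numpy as np

def compress_image(img, patch=12, k=16):
    """PCA-compress a grayscale image by encoding each patch x patch block with k coefficients."""
    h, w = img.shape
    h, w = h - h % patch, w - w % patch                # crop to a multiple of the patch size
    blocks = (img[:h, :w]
              .reshape(h // patch, patch, w // patch, patch)
              .swapaxes(1, 2)
              .reshape(-1, patch * patch))             # one row per 144-D patch vector
    mean = blocks.mean(axis=0)
    Bc = blocks - mean
    eigvals, Q = np.linalg.eigh(Bc.T @ Bc)             # 144 x 144 eigendecomposition
    Q = Q[:, np.argsort(eigvals)[::-1][:k]]            # keep the top-k eigenvectors
    codes = Bc @ Q                                     # k numbers per patch
    recon = (codes @ Q.T + mean).reshape(h // patch, w // patch, patch, patch)
    return recon.swapaxes(1, 2).reshape(h, w)          # reassemble the compressed image
```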

57 L2 error vs. PCA dimension

58 PCA compression: 144D → 60D

59 PCA compression: 144D → 16D

60 16 most important eigenvectors

61 PCA compression: 144D → 6D

62 6 most important eigenvectors

63 PCA compression: 144D → 3D

64 3 most important eigenvectors

65 PCA compression: 144D → 1D

66 60 most important eigenvectors Looks like the discrete cosine basis of JPEG!

67 2D Discrete Cosine Basis

68 Noisy image

69 Denoised image using 15 PCA components
