Unsupervised Learning: K-Means & PCA



Unsupervised Learning
Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y. But what if we don't have labels? No labels = unsupervised learning. Only some points labeled = semi-supervised learning (labels may be expensive to obtain, so we only get a few). Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

K-Means Clustering
Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.

Clustering Data

K-Means Clustering
K-Means(k, X):
  Randomly choose k cluster center locations (centroids).
  Loop until convergence:
    Assign each point to the cluster of the closest centroid.
    Re-estimate the cluster centroids based on the data assigned to each cluster.

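A minimal NumPy sketch of this loop (illustrative names; centroids are initialized from randomly chosen data points, and a centroid-change test stands in for the convergence check):

    import numpy as np

    def kmeans(X, k, n_iters=100, rng=np.random.default_rng(0)):
        # Randomly choose k data points as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: each point joins the cluster of its closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: re-estimate each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                      else centroids[i] for i in range(k)])
            if np.allclose(new_centroids, centroids):
                break  # converged: centroids stopped moving
            centroids = new_centroids
        return centroids, labels

In practice several restarts with different random seeds are wrapped around a loop like this, for the reasons discussed below.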

K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's superduper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.

K-Means Objective Function
K-means finds a local optimum of the following objective function:

$$ S^* = \arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2 $$

where S = {S_1, ..., S_k} is a partitioning over X = {x_1, ..., x_n} such that X = S_1 ∪ ... ∪ S_k and μ_i = mean(S_i).
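For a given partition, this objective can be evaluated directly; a two-line illustrative helper:

    import numpy as np

    def kmeans_objective(X, labels, centroids):
        # Sum of squared Euclidean distances from each point to its assigned centroid
        return float(np.sum((X - centroids[labels]) ** 2))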

Problems with K-Means
Very sensitive to the initial points:
- Do many runs of K-Means, each with different initial centroids.
- Seed the centroids using a better method than choosing them at random, e.g., farthest-first sampling (sketched below).
Must manually choose k:
- Learn the optimal k for the clustering (note that this requires a performance measure).
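A sketch of farthest-first seeding, one common variant of such an initialization (the function name and the choice of picking the first centroid at random are illustrative assumptions):

    import numpy as np

    def farthest_first_init(X, k, rng=np.random.default_rng(0)):
        # Start from one random data point, then repeatedly add the point that is
        # farthest from all centroids chosen so far.
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            chosen = np.array(centroids)
            dists = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
            centroids.append(X[dists.argmax()])
        return np.array(centroids)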

Problems with K-Means
How do you tell it which clustering you want? (e.g., k = 2)
Constrained clustering techniques (semi-supervised):
- Same-cluster constraint (must-link)
- Different-cluster constraint (cannot-link)

Gaussian Mixture Models
Recall the Gaussian distribution:

$$ P(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert \Sigma \rvert}} \exp\!\left( -\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \right) $$
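As a sanity check on the formula, a direct NumPy evaluation of the density (a sketch; a production implementation would work with log-densities and a Cholesky factorization for numerical stability, and the names are illustrative):

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        # Multivariate Gaussian density at a single point x
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))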

The GMM assumption
There are k components; the i-th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
Assume each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ω_i).
2. Draw the datapoint from N(μ_i, σ²I).
(Slides copyright 2001, 2004, Andrew W. Moore.)

The General GMM assumption
There are k components; the i-th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i.
Assume each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ω_i).
2. Draw the datapoint from N(μ_i, Σ_i).

Fitting a Gaussian Mixture Model (Optional)

Expectation-Maximization for GMMs
Iterate until convergence. On the t-th iteration, let our estimates be λ_t = {μ_1(t), μ_2(t), ..., μ_c(t)}.

E-step: compute the expected classes of all datapoints for each class:

$$ P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \mu_j(t), \sigma^2 I)\, p_j(t)} $$

(Just evaluate a Gaussian at x_k.)

M-step: estimate μ given our data's class membership distributions:

$$ \mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)} $$

E.M. for General GMMs
Iterate. On the t-th iteration, let our estimates be λ_t = {μ_1(t), ..., μ_c(t), Σ_1(t), ..., Σ_c(t), p_1(t), ..., p_c(t)}, where p_i(t) is shorthand for the estimate of P(ω_i) on the t-th iteration.

E-step: compute the expected clusters of all datapoints:

$$ P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \mu_j(t), \Sigma_j(t))\, p_j(t)} $$

(Just evaluate a Gaussian at x_k.)

M-step: estimate μ, Σ given our data's class membership distributions:

$$ \mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)} $$

$$ \Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, \big(x_k - \mu_i(t+1)\big)\big(x_k - \mu_i(t+1)\big)^\top}{\sum_k P(\omega_i \mid x_k, \lambda_t)} $$

$$ p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \quad \text{where } R = \text{number of records} $$
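A compact NumPy/SciPy sketch of these updates for the full-covariance case (a fixed number of iterations replaces a proper convergence test, the initialization is one arbitrary choice among many, and the small ridge added to the covariances is a numerical safeguard rather than part of the derivation; names are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iters=50, rng=np.random.default_rng(0)):
        n, d = X.shape
        mus = X[rng.choice(n, size=k, replace=False)]             # initial means
        Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)   # initial covariances
        ps = np.full(k, 1.0 / k)                                  # initial mixing weights
        for _ in range(n_iters):
            # E-step: responsibilities P(w_i | x_k, lambda_t), one column per component
            resp = np.column_stack([ps[i] * multivariate_normal.pdf(X, mean=mus[i], cov=Sigmas[i])
                                    for i in range(k)])
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate means, covariances, and mixing weights
            Nk = resp.sum(axis=0)
            mus = (resp.T @ X) / Nk[:, None]
            for i in range(k):
                diff = X - mus[i]
                Sigmas[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
            ps = Nk / n
        return mus, Sigmas, ps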

(End optional section)

Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator

Principal Components Analysis Based on slides by Barnabás Póczos, UAlberta

How Can We Visualize High-Dimensional Data?
E.g., 53 blood and urine tests for 65 patients:

Instance   H-WBC    H-RBC    H-Hgb     H-Hct     H-MCV     H-MCH    H-MCHC
A1         8.0000   4.8200   14.1000   41.0000   85.0000   29.0000  34.0000
A2         7.3000   5.0200   14.7000   43.0000   86.0000   29.0000  34.0000
A3         4.3000   4.4800   14.1000   41.0000   91.0000   32.0000  35.0000
A4         7.5000   4.4700   14.9000   45.0000   101.0000  33.0000  33.0000
A5         7.3000   5.5200   15.4000   46.0000   84.0000   28.0000  33.0000
A6         6.9000   4.8600   16.0000   47.0000   97.0000   33.0000  34.0000
A7         7.8000   4.6800   14.7000   43.0000   92.0000   31.0000  34.0000
A8         8.6000   4.8200   15.8000   42.0000   88.0000   33.0000  37.0000
A9         5.1000   4.7100   14.0000   43.0000   92.0000   30.0000  32.0000

(Rows are instances, columns are features.) Difficult to see the correlations between the features...

Data Visualization
Is there a representation better than the raw features? Is it really necessary to show all 53 dimensions? What if there are strong correlations between the features? Could we find the smallest subspace of the 53-D space that keeps the most information about the original data? One solution: Principal Component Analysis.

Principal Component Analysis
Orthogonal projection of the data onto a lower-dimensional linear space that:
- maximizes the variance of the projected data, and equivalently
- minimizes the mean squared distance between each data point and its projection.

The Principal Components
Principal components are vectors originating from the center of mass of the data. Principal component #1 points in the direction of the largest variance. Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

2D Gaussian Dataset

1st PCA axis

2nd PCA axis
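These axes can be recovered numerically; a small self-contained sketch on a synthetic 2D Gaussian (the dataset and all names are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    # Correlated 2D Gaussian data
    X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

    Xc = X - X.mean(axis=0)                   # center the data
    Sigma = np.cov(Xc, rowvar=False)          # 2 x 2 sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric matrix, eigenvalues ascending
    order = np.argsort(eigvals)[::-1]
    print("1st PCA axis:", eigvecs[:, order[0]])  # direction of largest variance
    print("2nd PCA axis:", eigvecs[:, order[1]])  # orthogonal direction of residual variance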

Dimensionality Reduction
Can ignore the components of lesser significance. [Bar chart: percentage of variance explained by each of PC1 through PC10.] You do lose some information, but if the eigenvalues are small, you don't lose much: choose only the first k eigenvectors, based on their eigenvalues; the final data set then has only k dimensions.

PCA Algorithm
Given data {x_1, ..., x_n}, compute the covariance matrix Σ:
- X is the n × d data matrix.
- Compute the data mean (average over all rows of X).
- Subtract the mean from each row of X (centering the data).
- Compute the covariance matrix Σ = (1/n) XᵀX of the centered data.
The PCA basis vectors are given by the eigenvectors of Σ:
  Q, Λ = numpy.linalg.eig(Σ)
{q_i, λ_i}, i = 1..d, are the eigenvectors/eigenvalues of Σ, ordered so that λ_1 ≥ λ_2 ≥ ... ≥ λ_d. A larger eigenvalue means a more important eigenvector.
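A NumPy sketch that follows these steps with instances as rows (np.linalg.eigh is used in place of eig because the covariance matrix is symmetric; names are illustrative):

    import numpy as np

    def pca(X, k):
        # Center the data: subtract the mean of each feature (column)
        mu = X.mean(axis=0)
        Xc = X - mu
        # Covariance matrix of the centered data (d x d)
        Sigma = (Xc.T @ Xc) / len(X)
        # Eigendecomposition; eigh returns eigenvalues in ascending order
        eigvals, Q = np.linalg.eigh(Sigma)
        order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
        eigvals, Q = eigvals[order], Q[:, order]
        explained = eigvals / eigvals.sum()    # fraction of variance per component
        return Q[:, :k], eigvals[:k], explained, mu

The explained values correspond to the per-component variance percentages discussed in the dimensionality-reduction slide above.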

PCA
[Illustration: X is the data matrix with one instance per row; Q holds the eigenvectors of Σ as columns, ordered by importance (decreasing eigenvalue). To reduce dimensionality, keep only the first k columns of Q.]
Slides by Eric Eaton
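Keeping only the first k columns of Q amounts to a simple projection; a standalone sketch (Q_k is assumed to be the d × k matrix of the top-k eigenvectors and mu the feature-mean vector, both NumPy arrays; names illustrative):

    def pca_project(X, Q_k, mu):
        # Coordinates of each instance in the top-k eigenvector basis: shape (n, k)
        return (X - mu) @ Q_k

    def pca_reconstruct(Z, Q_k, mu):
        # Map k-D coordinates back to the original feature space (lossy when k < d)
        return Z @ Q_k.T + mu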

PCA Visualization of MNIST Digits

Challenge: Facial Recognition
Want to identify a specific person based on a facial image, robust to glasses, lighting, ... Can't just use the given 256 × 256 pixels.

PCA Applications: Eigenfaces
Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces. Eigenfaces are the standardized face ingredients derived from the statistical analysis of many pictures of human faces. A human face may be considered to be a combination of these standard faces.

PCA Applications: Eigenfaces
To generate a set of eigenfaces:
1. A large set of digitized images of human faces is taken under the same lighting conditions.
2. The images are normalized to line up the eyes and mouths.
3. The eigenvectors of the covariance matrix of the statistical distribution of face image vectors are then extracted.
4. These eigenvectors are called eigenfaces.
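In code, assuming the normalized, aligned face images have already been flattened into the rows of a hypothetical array faces, the recipe might look like the sketch below; it takes the SVD of the centered matrix, whose right singular vectors are the same eigenvectors as those of the covariance matrix but are cheaper to compute when there are far more pixels than images:

    import numpy as np

    def compute_eigenfaces(faces, k):
        # faces: (num_images, num_pixels) array of flattened, aligned face images
        mean_face = faces.mean(axis=0)
        centered = faces - mean_face
        # Right singular vectors of the centered matrix = eigenvectors of the covariance
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        eigenfaces = Vt[:k]   # each row is one eigenface (a direction in pixel space)
        return mean_face, eigenfaces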

PCA Applications: Eigenfaces
The principal eigenface looks like a bland, androgynous average human face. http://en.wikipedia.org/wiki/image:eigenfaces.png

Eigenfaces for Face Recognition
When properly weighted, eigenfaces can be summed together to create an approximate grayscale rendering of a human face. Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces, so eigenfaces provide a means of applying data compression to faces for identification purposes. Similar ideas are used for expert object recognition in video.

Eigenfaces Experiment and Results
The data used here are from the ORL database of faces: facial images of 16 persons, each with 10 views.
- Training set contains 16 × 7 images.
- Test set contains 16 × 3 images.
First three eigenfaces:

Classification Using Nearest Neighbor
Save the average coefficients for each person; classify a new face as the person with the closest average. Recognition accuracy increases with the number of eigenfaces up to 15; later eigenfaces do not help much with recognition. Best recognition rates: 99% on the training set, 89% on the test set.
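A sketch of that nearest-average classifier in coefficient space (it assumes each face has already been projected onto the eigenfaces to get a coefficient vector; all names are illustrative):

    import numpy as np

    def fit_person_averages(coeffs, labels):
        # coeffs: (num_images, k) eigenface coefficients; labels: person id per image
        people = np.unique(labels)
        averages = np.array([coeffs[labels == p].mean(axis=0) for p in people])
        return people, averages

    def classify(coeff, people, averages):
        # Assign a new face to the person whose average coefficient vector is closest
        dists = np.linalg.norm(averages - coeff, axis=1)
        return people[dists.argmin()]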

Facial Expression Recognition: Happiness subspace

Disgust subspace

Image Compression

Original Image
Divide the original 372 × 492 image into patches; each patch is an instance that contains 12 × 12 pixels on a grid. View each patch as a 144-D vector.
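Extracting those patch vectors could look like the sketch below (it assumes a 2-D grayscale array and simply drops any partial patches at the borders; names are illustrative):

    import numpy as np

    def image_to_patches(img, patch=12):
        # img: 2-D grayscale array; returns one row per non-overlapping patch
        h, w = img.shape
        rows = []
        for i in range(0, h - h % patch, patch):
            for j in range(0, w - w % patch, patch):
                rows.append(img[i:i + patch, j:j + patch].ravel())  # 144-D vector for 12x12
        return np.array(rows)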

L2 error and PCA dimension

PCA compression: 144D → 60D

PCA compression: 144D → 16D

16 most important eigenvectors

PCA compression: 144D → 6D

6 most important eigenvectors

PCA compression: 144D → 3D

3 most important eigenvectors

PCA compression: 144D → 1D

60 most important eigenvectors: looks like the discrete cosine bases of JPEG!
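The compressed images above correspond to reconstructing every patch from only its top-k PCA coefficients; a hedged sketch of that step (SVD on the centered patch matrix; names illustrative). Reconstructing with k = 15 components is also what produces the denoised image at the end of this section.

    import numpy as np

    def pca_compress_patches(patches, k):
        # patches: (num_patches, 144) matrix of patch vectors
        mu = patches.mean(axis=0)
        centered = patches - mu
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        basis = Vt[:k]                  # top-k principal directions
        coeffs = centered @ basis.T     # k numbers per patch instead of 144
        return coeffs @ basis + mu      # reconstruction back in patch space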

2D Discrete Cosine Basis
http://en.wikipedia.org/wiki/discrete_cosine_transform

Noisy image

Denoised image using 15 PCA components