
Advanced Introduction to Machine Learning (CMU-10715)
Principal Component Analysis
Barnabás Póczos

Contents: Motivation, PCA algorithms, Applications. Some of these slides are taken from the Karl Booksh research group, Tom Mitchell, and Ron Parr.

Motivation

PCA Applications: Data Visualization, Data Compression, Noise Reduction.

Data Visualization. Example: given 53 blood and urine measurements (features) for 65 people, how can we visualize the measurements?

Data Visualization. Matrix format (65 x 53): rows are instances (people), columns are features.

Instance   H-WBC    H-RBC    H-Hgb     H-Hct     H-MCV      H-MCH     H-MCHC
A1         8.0000   4.8200   14.1000   41.0000    85.0000   29.0000   34.0000
A2         7.3000   5.0200   14.7000   43.0000    86.0000   29.0000   34.0000
A3         4.3000   4.4800   14.1000   41.0000    91.0000   32.0000   35.0000
A4         7.5000   4.4700   14.9000   45.0000   101.0000   33.0000   33.0000
A5         7.3000   5.5200   15.4000   46.0000    84.0000   28.0000   33.0000
A6         6.9000   4.8600   16.0000   47.0000    97.0000   33.0000   34.0000
A7         7.8000   4.6800   14.7000   43.0000    92.0000   31.0000   34.0000
A8         8.6000   4.8200   15.8000   42.0000    88.0000   33.0000   37.0000
A9         5.1000   4.7100   14.0000   43.0000    92.0000   30.0000   32.0000

Difficult to see the correlations between the features...

Data Visualization. Spectral format (65 curves, one for each person): measurement value plotted against measurement index. Difficult to compare the different patients...

Data Visualization. Spectral format (53 plots, one for each feature): feature value (e.g. H-Bands) plotted against person index. Difficult to see the correlations between the features...

Data Visualization. Bi-variate plots (e.g. C-LDH vs. C-Triglycerides) and tri-variate plots (e.g. M-EPI vs. C-LDH vs. C-Triglycerides). How can we visualize the remaining variables? It becomes difficult to see anything in 4 or higher dimensional spaces...

Data Visualization. Is there a representation better than the coordinate axes? Is it really necessary to show all 53 dimensions? What if there are strong correlations between the features? How could we find the smallest subspace of the 53-D space that keeps the most information about the original data? A solution: Principal Component Analysis.

PCA Algorithms

Principal Component Analysis. PCA: an orthogonal projection of the data onto a lower-dimensional linear space that maximizes the variance of the projected data (purple line) and minimizes the mean squared distance between the data points and their projections (sum of the blue lines).
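These two criteria select the same projection. A short derivation (not on the original slide), assuming centered data and a unit vector w: by the Pythagorean theorem, the squared norm of each point splits into the variance captured by the projection and the squared residual, so maximizing the captured variance is the same as minimizing the mean squared reconstruction error:

\[
\|x_i\|^2 = \|w w^\top x_i\|^2 + \|x_i - w w^\top x_i\|^2
          = (w^\top x_i)^2 + \|x_i - w w^\top x_i\|^2 ,
\]
\[
\frac{1}{m}\sum_{i=1}^m \|x_i - w w^\top x_i\|^2
= \frac{1}{m}\sum_{i=1}^m \|x_i\|^2 \;-\; \frac{1}{m}\sum_{i=1}^m (w^\top x_i)^2 ,
\]

and the first term on the right does not depend on w.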

Principal Component Analysis. Idea: given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible. For example, find the best planar approximation of 3-D data, or the best 12-D approximation of 10^4-D data. In particular, choose the projection that minimizes the squared error in reconstructing the original data.

Principal Component Analysis. Properties: the PCA vectors originate from the center of mass. Principal component #1 points in the direction of the largest variance. Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

2D Gaussian dataset

1st PCA axis

2nd PCA axis

PCA algorithm I (sequential). Given the centered data {x_1, ..., x_m}, compute the principal vectors one at a time. To find w_1, maximize the variance of the projection of x:

\[
w_1 = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m} \big(w^\top x_i\big)^2 .
\]

To find w_2, maximize the variance of the projection in the residual subspace:

\[
w_2 = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m} \big[\, w^\top \big(x_i - w_1 w_1^\top x_i\big) \big]^2 ,
\]

where x' = w_1 (w_1^\top x) is the PCA reconstruction of x and x - x' is the residual.

PCA algorithm I (sequential). Given w_1, ..., w_{k-1}, we calculate the k-th principal vector as before, by maximizing the variance of the projection in the residual subspace:

\[
w_k = \arg\max_{\|w\|=1} \frac{1}{m}\sum_{i=1}^{m}
\Big[\, w^\top \Big( x_i - \sum_{j=1}^{k-1} w_j w_j^\top x_i \Big) \Big]^2 .
\]

The reconstruction from the first two components is x' = w_1 (w_1^\top x) + w_2 (w_2^\top x).
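As an illustration (my own sketch, not from the original slides), here is a minimal NumPy version of this sequential procedure. It uses the fact, justified later in the deck, that maximizing the projected variance of the current residual data amounts to taking the top eigenvector of the residual covariance:

import numpy as np

def sequential_pca(X, k):
    """Sequential PCA. X: m x d array (rows are data points).
    Returns a k x d array whose rows are the principal vectors w_1, ..., w_k."""
    residual = X - X.mean(axis=0)                        # center the data
    W = []
    for _ in range(k):
        cov = residual.T @ residual / len(residual)      # covariance of the residual data
        eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
        w = eigvecs[:, -1]                               # direction of largest variance
        W.append(w)
        residual = residual - np.outer(residual @ w, w)  # project out that direction
    return np.array(W)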

PCA algorithm II (sample covariance matrix). Given data {x_1, ..., x_m}, compute the covariance matrix

\[
\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^\top ,
\qquad
\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i .
\]

The PCA basis vectors are the eigenvectors of Σ; the larger the eigenvalue, the more important the eigenvector.

PCA algorithm II (sample covariance matrix).

PCA(X, k): return the top k eigenvalues/eigenvectors
  % X = N x m data matrix; each data point x_i is a column vector, i = 1..m
  x_bar <- (1/m) * sum_{i=1..m} x_i
  X <- subtract the mean x_bar from each column vector x_i in X
  Sigma <- X X^T                          % covariance matrix of X
  {lambda_i, u_i}_{i=1..N} <- eigenvalues/eigenvectors of Sigma, lambda_1 >= lambda_2 >= ... >= lambda_N
  return {lambda_i, u_i}_{i=1..k}         % top k PCA components
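A runnable NumPy version of this pseudocode might look as follows (a minimal sketch of my own; numpy.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

def pca(X, k):
    """PCA algorithm II. X: N x m data matrix whose columns are the data points.
    Returns (top k eigenvalues, N x k matrix of the corresponding eigenvectors)."""
    x_bar = X.mean(axis=1, keepdims=True)       # mean column vector
    Xc = X - x_bar                              # subtract the mean from each column
    Sigma = Xc @ Xc.T / X.shape[1]              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the top k eigenvalues
    return eigvals[order], eigvecs[:, order]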

Animation of the power iteration used to find the PCA vectors one by one:
  repeat:  v <- Sigma * v;  v <- v / sqrt(v' * v);        % converges to the 1st PCA vector v_PCA1
  lambda <- v_PCA1' * Sigma * v_PCA1;                     % the corresponding eigenvalue
  Sigma_2 <- Sigma - lambda * v_PCA1 * v_PCA1';           % deflate: remove the found component
  repeat:  v <- Sigma_2 * v;  v <- v / sqrt(v' * v);      % converges to the 2nd PCA vector v_PCA2
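The same idea as a short NumPy sketch (an illustration under my own naming, not code from the lecture): power iteration converges to the leading eigenvector of Sigma, and deflation removes that component so the next run finds the second one.

import numpy as np

def power_iteration(Sigma, n_iter=1000, seed=0):
    """Leading eigenvector and eigenvalue of a symmetric matrix Sigma."""
    v = np.random.default_rng(seed).normal(size=Sigma.shape[0])
    for _ in range(n_iter):
        v = Sigma @ v
        v = v / np.sqrt(v @ v)          # renormalize at every step
    return v, v @ Sigma @ v             # eigenvector and its Rayleigh quotient

def top_two_pca_vectors(Sigma):
    v1, lam1 = power_iteration(Sigma)
    Sigma2 = Sigma - lam1 * np.outer(v1, v1)   # deflation
    v2, _ = power_iteration(Sigma2)
    return v1, v2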


PCA algorithm III (SVD of the data matrix). Take the singular value decomposition of the centered data matrix X (features x samples):

\[
X = U S V^\top .
\]

The leading singular values/vectors capture the significant structure of the data; the trailing ones mostly capture noise.

PCA algorithm III. Columns of U: the principal vectors {u^(1), ..., u^(k)}; they are orthogonal and have unit norm, so U^T U = I, and the data can be reconstructed using linear combinations of {u^(1), ..., u^(k)}. Matrix S: diagonal; it shows the importance of each eigenvector. Columns of V^T: the coefficients for reconstructing the samples.
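In NumPy this corresponds to the following sketch (my example, not from the slides); note that numpy.linalg.svd returns V^T directly:

import numpy as np

def pca_svd(X, k):
    """PCA via SVD. X: N x m matrix whose columns are data points.
    Returns the top k principal vectors and the k x m reconstruction coefficients."""
    Xc = X - X.mean(axis=1, keepdims=True)            # center the columns
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_k = U[:, :k]                                    # principal vectors (columns of U)
    coeffs = S[:k, None] * Vt[:k, :]                  # coefficients for each sample
    return U_k, coeffs

# Reconstruction from the top k components: X_hat = U_k @ coeffs + column mean.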

Applications

Face Recognition. We want to identify a specific person based on a facial image, and to be robust to glasses, lighting, etc. We can't just use the given 256 x 256 pixel values directly.

Applying PCA: Eigenfaces. Method A: build a PCA subspace for each person and check which subspace reconstructs the test image best. Method B: build one PCA database for the whole dataset and then classify based on the weights.

Applying PCA: Eigenfaces. Example data set: images of faces, following the eigenface approach [Turk & Pentland], [Sirovich & Kirby]. Each face x_1, ..., x_m is a 256 x 256 array of luminance values, viewed as a vector x in R^(256·256), i.e. a 64K-dimensional vector. Form the centered data matrix X = [x_1, ..., x_m] (one column per face, m faces) and compute Σ = XX^T. Problem: Σ is 64K x 64K, which is huge!

Computational Complexity. Suppose we have m instances, each of size N; for the eigenfaces example, m = 500 faces, each of size N = 64K. Given the N x N covariance matrix Σ, we can compute all N eigenvectors/eigenvalues in O(N^3), or the first k eigenvectors/eigenvalues in O(k N^2). But if N = 64K, this is expensive!

A Clever Workaround. Note that m << 64K, so use the m x m matrix L = X^T X instead of Σ = XX^T, where X = [x_1, ..., x_m] is the matrix of the m faces (each a 256 x 256 array of real values). If v is an eigenvector of L, then Xv is an eigenvector of Σ. Proof:

\[
L v = \gamma v
\;\Rightarrow\; X^\top X v = \gamma v
\;\Rightarrow\; X (X^\top X v) = X (\gamma v) = \gamma X v
\;\Rightarrow\; (X X^\top)(X v) = \gamma (X v)
\;\Rightarrow\; \Sigma (X v) = \gamma (X v) .
\]
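A small NumPy sketch of this workaround (illustrative, with names of my own choosing): eigenvectors of the m x m matrix L are mapped through X to eigenvectors of the huge N x N covariance matrix and then normalized to unit length.

import numpy as np

def eigenfaces(X, k):
    """X: N x m matrix of centered face vectors (N pixels, m << N faces).
    Returns an N x k matrix whose columns are the top k eigenfaces."""
    L = X.T @ X                                 # small m x m matrix
    eigvals, V = np.linalg.eigh(L)              # eigenvectors of L, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    U = X @ V[:, order]                         # map to eigenvectors of X X^T
    return U / np.linalg.norm(U, axis=0)        # normalize each column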

Principal Components (Method B)

Shortcomings. The method requires carefully controlled data: all faces centered in the frame and of the same size, and there is some sensitivity to viewing angle. The method is completely knowledge-free (sometimes this is good!): it does not know that faces are wrapped around 3D objects (heads), and it makes no effort to preserve class distinctions.

Happiness subspace

Disgust subspace

Facial Expression Recognition Movies

Image Compression

Original Image. Divide the original 372 x 492 image into patches; each patch is an instance consisting of 12 x 12 pixels on a grid. Consider each patch as a 144-D vector.
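To make the setup concrete, here is a hedged NumPy sketch of this experiment (my own illustration; the patch size and image dimensions follow the slide, everything else is assumed): the image is cut into non-overlapping 12 x 12 patches, each patch becomes a 144-D vector, PCA compresses each vector to k coefficients, and the image is rebuilt from the reconstructions.

import numpy as np

def pca_compress_image(image, k, patch=12):
    """PCA-compress an image patchwise. image: 2-D array whose sides are divisible by `patch`."""
    h, w = image.shape
    # Cut into non-overlapping patch x patch blocks and flatten each block into a row vector.
    blocks = (image.reshape(h // patch, patch, w // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch * patch))
    mean = blocks.mean(axis=0)
    Xc = blocks - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # principal directions of the patches
    U_k = Vt[:k].T                                      # (patch*patch) x k basis
    codes = Xc @ U_k                                    # k coefficients per patch
    recon = codes @ U_k.T + mean                        # reconstruction of each patch
    blocks_rec = recon.reshape(h // patch, w // patch, patch, patch)
    return blocks_rec.transpose(0, 2, 1, 3).reshape(h, w)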

L2 error vs. PCA dimension

PCA compression: 144D → 60D

PCA compression: 144D → 16D

The 16 most important eigenvectors (each displayed as a 12 x 12 patch).

PCA compression: 144D → 6D

The 6 most important eigenvectors (each displayed as a 12 x 12 patch).

PCA compression: 144D → 3D

The 3 most important eigenvectors (each displayed as a 12 x 12 patch).

PCA compression: 144D → 1D

The 60 most important eigenvectors. They look like the discrete cosine bases used in JPEG!

2D Discrete Cosine Basis: http://en.wikipedia.org/wiki/discrete_cosine_transform

Noise Filtering

Noise Filtering. Project each (centered) data vector x onto the leading principal components and reconstruct it, x̃ = U U^T x, where the columns of U are the top eigenvectors; the discarded low-variance components mostly carry noise.
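A minimal NumPy sketch of this filtering step (an illustration under my own assumptions: the data matrix has one noisy sample, e.g. one image patch, per row; this is not code from the lecture):

import numpy as np

def pca_denoise(X, k):
    """Project each row of X onto its top k principal components and reconstruct."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_k = Vt[:k].T                    # d x k basis of leading principal directions
    return Xc @ U_k @ U_k.T + mean    # keeps the dominant structure, drops the rest

For the image example that follows, each patch of the noisy image would be denoised this way with k = 15 components.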

Noisy image

Denoised image using 15 PCA components

PCA Shortcomings

A Problematic Data Set for PCA: PCA doesn't know about class labels!

PCA vs. Fisher Linear Discriminant. PCA maximizes variance, independent of class (magenta line); FLD attempts to separate the classes (green line).

A Problematic Data Set for PCA: PCA cannot capture NON-LINEAR structure!

PCA Conclusions. PCA finds an orthonormal basis for the data, sorts the dimensions in order of importance, and lets us discard the low-significance dimensions. Applications: obtaining a compact description, removing noise, and (hopefully) improving classification. It is not magic: it does not know the class labels, it can only capture linear variations, and it is just one of many tricks to reduce dimensionality!

Kernel PCA

Kernel PCA: performing PCA in the feature space. Lemma: each principal direction in the feature space can be written as a linear combination of the mapped data points, v = Σ_i α_i φ(x_i), so PCA can be carried out using only the kernel matrix.

Kernel PCA: Lemma

Kernel PCA: Proof

Kernel PCA. How do we use α to calculate the projection of a new sample t? Where was I cheating? The data should be centered in the feature space, too, but this is manageable...
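For concreteness, here is a hedged NumPy sketch of kernel PCA with an RBF kernel (my own illustration with assumed names and a chosen kernel, not the lecture's code): the kernel matrix is double-centered to account for centering in feature space, its eigenvectors give the α coefficients, and a new sample t is projected using kernel evaluations against the training points.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_pca_fit(X, k, gamma=1.0):
    """Return the alpha coefficients (m x k) and the training kernel matrix."""
    m = len(X)
    K = rbf_kernel(X, X, gamma)
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one             # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])   # unit-norm directions in feature space
    return alphas, K

def kernel_pca_project(T, X, alphas, K, gamma=1.0):
    """Project new samples T (rows) onto the kernel principal components."""
    m = len(X)
    k_t = rbf_kernel(T, X, gamma)
    one_m = np.full((m, m), 1.0 / m)
    one_t = np.full((len(T), m), 1.0 / m)
    k_tc = k_t - one_t @ K - k_t @ one_m + one_t @ K @ one_m   # center the test kernel rows
    return k_tc @ alphas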

Input points before kernel PCA: http://en.wikipedia.org/wiki/kernel_principal_component_analysis

Output after kernel PCA: the three groups are distinguishable using the first component only.

PCA Theory

Justification of Algorithm II. Goal: show that the variance-maximizing directions of Algorithm I are exactly the eigenvectors of the sample covariance matrix Σ.

Justification of Algorithm II. Note that x is centered, so the variance of the projection onto a unit vector w is (1/m) Σ_i (w^T x_i)^2 = w^T Σ w.

Justification of Algorithm II. Goal: maximize w^T Σ w subject to ||w|| = 1; use Lagrange multipliers for the constraint.

Justification of Algorithm II
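As a compact sketch of the Lagrange-multiplier argument referred to above (my reconstruction of the standard derivation; the original slides gave it as equations):

\[
\mathcal{L}(w, \lambda) = w^\top \Sigma w - \lambda\,(w^\top w - 1),
\qquad
\nabla_w \mathcal{L} = 2\,\Sigma w - 2\,\lambda w = 0
\;\Rightarrow\; \Sigma w = \lambda w ,
\]

so every stationary point is an eigenvector of Σ, and at such a point the objective equals w^T Σ w = λ w^T w = λ, which is largest for the eigenvector with the largest eigenvalue. Repeating the argument in the residual subspace yields the subsequent eigenvectors, which is exactly what Algorithm II computes.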

Thanks for the Attention!