Principal Component Analysis (PCA)

Principal Component Analysis (PCA)
Salvador Dalí, Galatea of the Spheres
CSC411/2515: Machine Learning and Data Mining, Winter 2018
Michael Guerzhoy and Lisa Zhang
Some slides from Derek Hoiem and Alyosha Efros

Dimensionality Reduction
- Want to represent data in a new coordinate system with fewer dimensions. We cannot easily visualize n-d data, but we can plot 2D or 3D data.
- Want to extract features from the data, similar to extracting features with a ConvNet: it is easier to classify extracted features than the original data.
- Want to compress the data while preserving most of the information.
- Goal: preserve as much information as we can about the data in the new coordinate system. Preserve distances between data points; preserve variation between data points.

Original data    Transformed
(1, 1.2)         1.15
(2, 2)           2
(3, 3.3)         3.1

Principal Component Analysis
- Linear dimensionality reduction: the transformed data is a linear transformation of the original data.
- Find a subspace that the data lies in and project the data onto that subspace.

Principal Component Analysis
- In the 2D example: we want to transform the data so that it varies mostly along $u_1$. We can then just take the $u_1$ coordinate and keep most of the information about the data.
- Keep this diagram in mind for the rest of the lecture.
- We will use a dataset of faces as an example: same idea, but higher-dimensional data.

Principal Component Analysis
- If all the data lies in a subspace, we can represent the data using its coordinates in that subspace.
- Here a 1D subspace arguably suffices: we just need to keep information about where in the orange cloud the point is, and one dimension suffices for that.

The space of all face images
- When viewed as vectors of pixel values, face images are extremely high-dimensional: a 100x100 image => 10,000 dimensions.
- But very few 10,000-dimensional vectors are valid face images.
- We want to effectively model the subspace of face images; we need far fewer than 10,000 dimensions for that.

The space of faces
- Each image is a point in a high-dimensional space.

Rotating a Cloud to Be Axis-Aligned
- Consider the covariance matrix of all the points in a cloud:
  $\Sigma = \sum_i (x_i - \mu)(x_i - \mu)^T$
  (No need to divide the sum by n when computing the covariance in this context.)
- The Spectral Theorem says we can diagonalize Σ (not covered in detail):
  $R^T \Sigma R = D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$
  where the columns of R are the eigenvectors of Σ.
- Spectral theorem: $P^{-1} \Sigma P = D$. Set $P = R$, so $P^{-1} = R^T$.

Now:
$\sum_i \big(R^T (x_i - \mu)\big)\big(R^T (x_i - \mu)\big)^T = R^T \Big( \sum_i (x_i - \mu)(x_i - \mu)^T \Big) R = R^T \Sigma R = D$
So if we rotate the $x_i - \mu$ using $R^T$, the covariance matrix of the transformed data will be diagonal!

Intuition
- If the covariance of the data is diagonal, the data lies in an axis-aligned cloud.
- R, the matrix of eigenvectors of Σ, is the rotation matrix that needs to be applied (as $R^T$) to the data $x_i - \mu$ to make the cloud of datapoints axis-aligned, because we proved the covariance of the result will be diagonal.
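
A minimal NumPy sketch of the derivation above (the 2D cloud and all variable names are my own illustration): build Σ from the centred points, diagonalize it, rotate the centred data by $R^T$, and check that the covariance of the rotated data is diagonal.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[3.0, 1.0], cov=[[3.0, 2.0], [2.0, 2.0]], size=500)  # (500, 2) cloud

mu = X.mean(axis=0)
Xc = X - mu                               # centre the data
Sigma = Xc.T @ Xc                         # sum_i (x_i - mu)(x_i - mu)^T (no division by n, as on the slide)

lam, R = np.linalg.eigh(Sigma)            # columns of R are the eigenvectors of Sigma
Z = Xc @ R                                # same as applying R^T to each column vector x_i - mu

print(np.round(Z.T @ Z, 3))               # covariance of the rotated data: (approximately) diagonal
print(np.round(lam, 3))                   # matches the diagonal entries (the eigenvalues)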

Change of Basis (on the board)
Main points:
- Review of change of basis.
- A rotation around 0 can be interpreted as a change of basis.
- If the data is centred at 0, rotating it means changing its basis.
- We can make the data centred at 0 by subtracting the mean from it.
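
A tiny sketch (my own example) of the "rotation around 0 is a change of basis" point: the columns of a rotation matrix R form a new orthonormal basis, and applying $R^T$ gives the coordinates of a centred point in that basis.

import numpy as np

theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # columns of R form the new (rotated) basis

x = np.array([2.0, 1.0])                          # a point already centred at 0
coords_in_new_basis = R.T @ x                     # coordinates of x with respect to the columns of R
back = R @ coords_in_new_basis                    # changing the basis back recovers x
print(coords_in_new_basis, back)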

Reconstruction
- For a subspace with the orthonormal basis $V_k = \{v_0, v_1, v_2, \dots, v_k\}$, the best (i.e., closest) reconstruction of x in that subspace is:
  $\tilde{x}_k = (x \cdot v_0)v_0 + (x \cdot v_1)v_1 + \dots + (x \cdot v_k)v_k$
- If x is in the span of $V_k$, this is an exact reconstruction; if not, this is the projection of x onto $V_k$.
- Squared reconstruction error: $\|\tilde{x}_k - x\|^2$

Reconstruction (cont'd)
$\tilde{x}_k = (x \cdot v_0)v_0 + (x \cdot v_1)v_1 + \dots + (x \cdot v_k)v_k$
- Note: in $(x \cdot v_0)v_0$, the dot product $x \cdot v_0$ is a measure of how similar x is to $v_0$.
- The more similar x is to $v_0$, the larger the contribution from $v_0$ to the sum.
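
A small sketch of reconstruction in an orthonormal basis (the helper name and the toy basis are illustrative, not from the slides):

import numpy as np

def reconstruct(x, V):
    """Project x onto the subspace spanned by the orthonormal columns of V.

    Each column v_j contributes (x . v_j) v_j to the reconstruction.
    """
    coords = V.T @ x          # dot products x . v_j, one per basis vector
    return V @ coords         # sum_j (x . v_j) v_j

# Orthonormal basis of a 2D subspace of R^3 (the xy-plane).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([2.0, -1.0, 3.0])
x_hat = reconstruct(x, V)                 # -> [2., -1., 0.]
err = np.sum((x_hat - x) ** 2)            # squared reconstruction error -> 9.0
print(x_hat, err)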

Representation and reconstruction
- Face x in face space coordinates: the weights $w_1, w_2, w_3, w_4, \dots$
- Reconstruction: $\hat{x} = \mu + w_1 u_1 + w_2 u_2 + w_3 u_3 + w_4 u_4 + \dots$
(slide by Derek Hoiem)

Reconstruction
- Reconstructions with k = 4, k = 200, and k = 400, after computing eigenfaces using 400 face images from the ORL face database.
(slide by Derek Hoiem)

Principal Component Analysis (PCA)
- Suppose the columns of a matrix $X_{N \times K}$ are the datapoints (N is the size of each image, K is the size of the dataset), and we would like to obtain an orthonormal basis of size k that produces the smallest sum of squared reconstruction errors for all the columns of $X - \bar{X}$, where $\bar{X}$ is the average column of X.
- Answer: the basis we are looking for is the k eigenvectors of $(X - \bar{X})(X - \bar{X})^T$ that correspond to the k largest eigenvalues.

PCA (cont'd)
- If x is a datapoint (obtained after subtracting the mean) and V is an orthonormal basis, $V^T x$ is a column of the dot products of x with the elements (columns) of V.
- So the reconstruction for the centred x is $\tilde{x} = V(V^T x)$.
- PCA is the procedure of obtaining the k eigenvectors $V_k$.

NOTE: centering
If the image x is not centred (i.e., $\bar{X}$ was not subtracted from all the images), the reconstruction is:
$\tilde{x} = \bar{X} + V(V^T (x - \bar{X}))$
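
A sketch of this recipe in NumPy (the random stand-in data and all variable names are my own; real use would plug in flattened face images): centre the columns, take the top-k eigenvectors of $(X - \bar{X})(X - \bar{X})^T$, and reconstruct an uncentred image as above.

import numpy as np

# Columns of X are datapoints: X is N x K (N = pixels per image, K = number of images).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))        # stand-in data

x_bar = X.mean(axis=1, keepdims=True)     # average column of X
Xc = X - x_bar                            # subtract the mean from every column

k = 5
lam, vecs = np.linalg.eigh(Xc @ Xc.T)     # eigenvectors of (X - X_bar)(X - X_bar)^T
V = vecs[:, np.argsort(lam)[::-1][:k]]    # keep the k eigenvectors with the largest eigenvalues

x = X[:, [0]]                             # reconstruct one (uncentred) image
x_rec = x_bar + V @ (V.T @ (x - x_bar))   # x_tilde = X_bar + V V^T (x - X_bar)
print(np.sum((x_rec - x) ** 2))           # squared reconstruction error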

Proof that PCA produces the best reconstruction

Reminder: Intuition for PCA
- The subspace in which the reconstruction will be best is the major axis of the cloud.
- We've shown that we make the cloud axis-aligned by rotating it, so we know one of the basis elements will correspond to the best one-dimensional subspace.
- The major axis is the eigenvector that corresponds to the largest eigenvalue. We haven't shown that (but we could).

Obtaining the Principal Components
- $XX^T$ can be huge, but there are tricks to still compute the eigenvectors.
- Look up the Singular Value Decomposition (SVD) if you're interested.
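
One way the SVD route plays out (a sketch under my own assumptions, not the course's specific derivation): the left singular vectors of the centred data matrix are exactly the eigenvectors of $(X - \bar{X})(X - \bar{X})^T$, and the thin SVD delivers them without ever forming that huge N x N matrix.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 400))            # N x K: 10,000 pixels, 400 images (stand-in data)
Xc = X - X.mean(axis=1, keepdims=True)

# Thin SVD: Xc = U S Vt. The columns of U are eigenvectors of Xc Xc^T with eigenvalues S**2,
# and U is only N x K -- no 10,000 x 10,000 matrix is ever built.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 50
principal_components = U[:, :k]                  # top-k principal directions
eigenvalues = S[:k] ** 2                         # corresponding eigenvalues of Xc Xc^T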

EigenFace as dimensionality reduction
- The set of faces is a subspace of the set of images. Suppose it is K-dimensional.
- We can find the best subspace using PCA.
- This is like fitting a hyper-plane, spanned by vectors $v_1, v_2, \dots, v_K$, to the set of (centred) faces, so that any face can be (approximately) represented in it.

Eigenfaces example
- Top eigenvectors: $u_1, \dots, u_k$ (shown as images).
- Mean: μ.
(slide by Derek Hoiem)

Another Eigenface set

PCA: Applications
Convert x into $v_1, v_2$ coordinates.
What does the $v_2$ coordinate measure?
- Distance to the line; use it for classification (near 0 for the orange points).
- Possibly a good feature!
What does the $v_1$ coordinate measure?
- Position along the line; use it to specify which orange point it is.
- Use $v_1$ if you want to compress the data.

Two views of PCA
- View 1: minimize the squared distance between the original data and the reconstructed data, for a given dimensionality k of the subspace. (Minimize the red-green distance per data point: our approach so far.)
- View 2: maximize the variance of the projected data, for a given dimensionality k of the subspace. (Maximize the scatter, i.e., variance, of the green points; in general, maximize the sum of the variances along all the components.)

Two views of PCA
- $R^T \Sigma R = D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$: D is the covariance matrix of the transformed data.
- The variance of the projected data along the j-th component (coordinate) is $\lambda_j = \frac{1}{N} \sum_i \big(x_j^{(i)} - \bar{x}_j\big)^2$.
- For a given k, when maximizing the variance of the projected data, we are trying to maximize $\lambda_1 + \lambda_2 + \dots + \lambda_k$.
- We can show (but won't) that the variance-maximization formulation and the minimum squared reconstruction error formulation produce the same k-dimensional bases.
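
A quick numerical check (my own example, not from the slides) that the eigenvalues of the covariance matrix are the variances of the data along the principal components:

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0], [[4, 2, 0], [2, 3, 1], [0, 1, 2]], size=2000)

Xc = X - X.mean(axis=0)
N = Xc.shape[0]
Sigma = Xc.T @ Xc / N                      # covariance with the 1/N convention used for lambda_j

lam, R = np.linalg.eigh(Sigma)
Z = Xc @ R                                 # projected (rotated) data

variances = (Z ** 2).mean(axis=0)          # (1/N) sum_i (z_j^(i))^2; the mean is 0 after centring
print(np.round(lam, 3))
print(np.round(variances, 3))              # matches lam (up to numerical error)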

How to choose k?
- If you need to visualize the dataset: use k = 2 or k = 3.
- If you want to use PCA for feature extraction: transform the data to be k-dimensional, and use cross-validation to select the best k.

How to choose k?
- Pick based on the percentage of variance captured/lost. Variance captured: the variance of the projected data.
- Pick the smallest k that explains some percentage of the variance: $(\lambda_1 + \lambda_2 + \dots + \lambda_k)/(\lambda_1 + \lambda_2 + \dots + \lambda_N)$.
- Look for an elbow in the scree plot (a plot of explained variance or of the eigenvalues). In practice the plot is never very clean.
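
A short sketch of the "smallest k that explains some % of variance" rule (the 95% threshold, the helper name, and the toy eigenvalues below are illustrative, not from the slides):

import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k such that the top-k eigenvalues explain `target` of the total variance."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]       # largest first
    explained = np.cumsum(lam) / np.sum(lam)           # (lam_1 + ... + lam_k) / (lam_1 + ... + lam_N)
    return int(np.searchsorted(explained, target) + 1)

lam = [5.2, 2.1, 0.9, 0.4, 0.2, 0.1]
print(choose_k(lam, target=0.95))                      # -> 4 (5.2+2.1+0.9+0.4 = 8.6 of 8.9, about 0.966)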

Limitations of PCA
- Assumption: variance == information.
- Assumption: the data lies in a linear subspace; PCA can only find linear structure.

Autoencoders
- Architecture (from the figure): input -> 500 neurons -> 250 neurons -> 30-dimensional code -> 250 neurons -> 500 neurons -> reconstruction, with weight matrices $W^{(1)}, W^{(2)}, W^{(3)}, \dots$ and a squared-difference loss between the input and the reconstruction.
- Find the weights that produce as small a difference as possible between the input and the reconstruction. Train using backprop.
- The code layer is a low-dimensional summary of the input.
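
A hedged PyTorch sketch of the architecture above: the 500-250-30-250-500 layer sizes come from the figure, while the 784-dimensional input, ReLU activations, optimizer settings, and random stand-in batch are my own assumptions.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 500), nn.ReLU(),
            nn.Linear(500, 250), nn.ReLU(),
            nn.Linear(250, 30),                  # the 30-dimensional code
        )
        self.decoder = nn.Sequential(
            nn.Linear(30, 250), nn.ReLU(),
            nn.Linear(250, 500), nn.ReLU(),
            nn.Linear(500, d_in),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                         # stand-in batch of inputs

for _ in range(10):                              # train with backprop on the squared difference
    recon = model(x)
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()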

Autoencoders
- The encoder can be used to compress data, or as a feature extractor: if you have lots of unlabeled data and some labeled data, train an autoencoder on the unlabeled data, then a classifier on the features (semi-supervised learning).
- The decoder can be used to generate images given a new code.

Autoencoders and PCA
- PCA can be viewed as a special case of an autoencoder.
- We can tie the encoder and decoder weights to make sure the architecture works the same way PCA does. This is sometimes useful for regularization.
- Architecture (from the figure): input x; hidden layer $W^T x$ (weights $W^T$, linear activation); reconstruction $W W^T x$ (weights W). Just one set of weight parameters, transposed.
- Minimize $\sum_i \|W W^T x^{(i)} - x^{(i)}\|^2$, the squared reconstruction error, as in PCA.
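
A minimal sketch of this tied-weight linear autoencoder (assuming PyTorch; the dimensions, learning rate, and random stand-in data are placeholders of my own):

import torch

d, k = 784, 30
W = (0.01 * torch.randn(d, k)).requires_grad_(True)   # the single weight parameter, used twice
opt = torch.optim.SGD([W], lr=1e-3)

X = torch.randn(256, d)                          # stand-in data, assumed already centred

for _ in range(100):
    code = X @ W                                 # hidden layer: W^T x for each row x
    recon = code @ W.T                           # reconstruction: W W^T x
    loss = ((recon - X) ** 2).sum(dim=1).mean()  # squared reconstruction error, as in PCA
    opt.zero_grad()
    loss.backward()
    opt.step()

# At convergence the columns of W span the same subspace as the top-k principal components,
# though they need not be orthonormal without extra constraints.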