CSC 411 Lecture 12: Principal Component Analysis

Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla
University of Toronto

Overview

Today we'll cover the first unsupervised learning algorithm for this course: principal component analysis (PCA).

Dimensionality reduction: map the data to a lower-dimensional space.
- Save computation/memory
- Reduce overfitting
- Visualize in 2 dimensions

PCA is a linear model, with a closed-form solution. It's useful for understanding lots of other algorithms:
- Autoencoders
- Matrix factorizations (next lecture)

Today's lecture is very linear-algebra-heavy, especially orthogonal matrices and eigendecompositions. Don't worry if you don't get it immediately; the next few lectures won't build on it. Not on the midterm (which only covers up through L9).

Projection onto a Subspace

$$z = U^\top (x - \mu)$$

Here, the columns of $U$ form an orthonormal basis for a subspace $\mathcal{S}$. The projection of a point $x$ onto $\mathcal{S}$ is the point $\tilde{x} \in \mathcal{S}$ closest to $x$. In machine learning, $\tilde{x}$ is also called the reconstruction of $x$, and $z$ is its representation, or code.
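To make the notation concrete, here is a minimal NumPy sketch (not from the slides; the data and basis are random stand-ins) computing the codes $z = U^\top(x - \mu)$ and the corresponding reconstructions $\tilde{x} = Uz + \mu$:

```python
# Minimal sketch of projecting data onto a subspace (NumPy; toy data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N=100 points in D=5 dimensions
mu = X.mean(axis=0)                      # data mean

# An orthonormal basis U (D x K) for a K=2 subspace, here obtained from a QR factorization.
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))

Z = (X - mu) @ U                         # codes: z = U^T (x - mu), shape (N, K)
X_tilde = Z @ U.T + mu                   # reconstructions: x_tilde = U z + mu, shape (N, D)
```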

Projection onto a Subspace

If we have a $K$-dimensional subspace in a $D$-dimensional input space, then $x \in \mathbb{R}^D$ and $z \in \mathbb{R}^K$. If the data points $x$ all lie close to the subspace, then we can approximate distances, dot products, etc. in terms of these same operations on the code vectors $z$. If $K \ll D$, then it's much cheaper to work with $z$ than with $x$.

A mapping to a space that's easier to manipulate or visualize is called a representation, and learning such a mapping is representation learning. Mapping data to a low-dimensional space is called dimensionality reduction.

Learning a Subspace

How do we choose a good subspace $\mathcal{S}$? We need to choose a vector $\mu$ and a $D \times K$ matrix $U$ with orthonormal columns. Set $\mu$ to the mean of the data, $\mu = \frac{1}{N}\sum_{i=1}^N x^{(i)}$.

Two criteria:

Minimize the reconstruction error:
$$\min \; \frac{1}{N}\sum_{i=1}^N \|x^{(i)} - \tilde{x}^{(i)}\|^2$$

Maximize the variance of the code vectors:
$$\max \; \sum_j \mathrm{Var}(z_j) = \frac{1}{N}\sum_j \sum_i \big(z_j^{(i)} - \bar{z}_j\big)^2 = \frac{1}{N}\sum_i \|z^{(i)} - \bar{z}\|^2 = \frac{1}{N}\sum_i \|z^{(i)}\|^2$$

Exercise: show $\bar{z} = 0$. Note: here, $\bar{z}$ denotes the mean, not a derivative.

Learning a Subspace

These two criteria are equivalent! I.e., we'll show
$$\frac{1}{N}\sum_{i=1}^N \|x^{(i)} - \tilde{x}^{(i)}\|^2 = \text{const} - \frac{1}{N}\sum_{i=1}^N \|z^{(i)}\|^2$$

Observation: by unitarity,
$$\|\tilde{x}^{(i)} - \mu\| = \|U z^{(i)}\| = \|z^{(i)}\|$$

By the Pythagorean Theorem,
$$\underbrace{\frac{1}{N}\sum_{i=1}^N \|\tilde{x}^{(i)} - \mu\|^2}_{\text{projected variance}} \;+\; \underbrace{\frac{1}{N}\sum_{i=1}^N \|x^{(i)} - \tilde{x}^{(i)}\|^2}_{\text{reconstruction error}} \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^N \|x^{(i)} - \mu\|^2}_{\text{constant}}$$
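A small numerical check of this identity can be helpful. The following sketch (not from the slides; random data and a random orthonormal basis) verifies that projected variance plus reconstruction error equals the total variance about the mean:

```python
# Numerical check of: projected variance + reconstruction error = total variance (toy data).
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 5, 2
X = rng.normal(size=(N, D))
mu = X.mean(axis=0)
U, _ = np.linalg.qr(rng.normal(size=(D, K)))   # any orthonormal basis works for the identity

Z = (X - mu) @ U
X_tilde = Z @ U.T + mu

projected_var = np.mean(np.sum((X_tilde - mu) ** 2, axis=1))
recon_error   = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
total_var     = np.mean(np.sum((X - mu) ** 2, axis=1))

assert np.isclose(projected_var + recon_error, total_var)
```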

Principal Component Analysis

Choosing a subspace to maximize the projected variance, or minimize the reconstruction error, is called principal component analysis (PCA).

Recall the Spectral Decomposition: a symmetric matrix $A$ has a full set of eigenvectors, which can be chosen to be orthogonal. This gives a decomposition $A = Q \Lambda Q^\top$, where $Q$ is orthogonal and $\Lambda$ is diagonal. The columns of $Q$ are eigenvectors, and the diagonal entries $\lambda_j$ of $\Lambda$ are the corresponding eigenvalues. I.e., symmetric matrices are diagonal in some basis. A symmetric matrix $A$ is positive semidefinite iff each $\lambda_j \geq 0$.
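As a quick illustration (not from the slides; the matrix is a random stand-in), NumPy's `eigh` computes exactly this decomposition for symmetric matrices:

```python
# Spectral decomposition of a symmetric PSD matrix with NumPy.
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B @ B.T                      # symmetric and positive semidefinite by construction

eigvals, Q = np.linalg.eigh(A)   # eigh is for symmetric matrices; eigenvalues in ascending order
Lam = np.diag(eigvals)

assert np.allclose(A, Q @ Lam @ Q.T)        # A = Q Λ Q^T
assert np.allclose(Q.T @ Q, np.eye(4))      # Q is orthogonal
assert np.all(eigvals >= -1e-10)            # PSD: all eigenvalues nonnegative (up to round-off)
```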

Principal Component Analysis

Consider the empirical covariance matrix:
$$\Sigma = \frac{1}{N}\sum_{i=1}^N (x^{(i)} - \mu)(x^{(i)} - \mu)^\top$$

Recall: covariance matrices are symmetric and positive semidefinite.

The optimal PCA subspace is spanned by the top $K$ eigenvectors of $\Sigma$. More precisely, choose the first $K$ of any orthonormal eigenbasis for $\Sigma$. The general case is tricky, but we'll show this for $K = 1$.

These eigenvectors are called principal components, analogous to the principal axes of an ellipse.
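Putting the recipe together, here is a minimal sketch of PCA via the eigendecomposition of the empirical covariance (the `pca` helper and the toy data are illustrative, not from the slides):

```python
# Sketch of PCA: top-K eigenvectors of the empirical covariance matrix (NumPy; toy data).
import numpy as np

def pca(X, K):
    """Return the data mean and a D x K matrix whose columns are the top-K principal components."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # empirical covariance with 1/N normalization
    eigvals, Q = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
    U = Q[:, ::-1][:, :K]                        # reorder to take the top-K eigenvectors
    return mu, U

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
mu, U = pca(X, K=1)
Z = (X - mu) @ U               # 1-D codes
X_tilde = Z @ U.T + mu         # reconstructions lie on the first principal axis
```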

Deriving PCA

For $K = 1$, we are fitting a unit vector $u$, and the code is a scalar $z = u^\top(x - \mu)$.

$$
\begin{aligned}
\frac{1}{N}\sum_i \big[z^{(i)}\big]^2
&= \frac{1}{N}\sum_i \big(u^\top(x^{(i)} - \mu)\big)^2 \\
&= \frac{1}{N}\sum_{i=1}^N u^\top (x^{(i)} - \mu)(x^{(i)} - \mu)^\top u \\
&= u^\top \left[\frac{1}{N}\sum_{i=1}^N (x^{(i)} - \mu)(x^{(i)} - \mu)^\top\right] u \\
&= u^\top \Sigma u \\
&= u^\top Q \Lambda Q^\top u && \text{(Spectral Decomposition)} \\
&= a^\top \Lambda a && \text{for } a = Q^\top u \\
&= \sum_{j=1}^D \lambda_j a_j^2
\end{aligned}
$$

Deriving PCA

Maximize $a^\top \Lambda a = \sum_{j=1}^D \lambda_j a_j^2$ for $a = Q^\top u$. This is a change of basis to the eigenbasis of $\Sigma$.

Assume the $\lambda_j$ are in sorted (descending) order. For simplicity, assume they are all distinct.

Observation: since $u$ is a unit vector, then by unitarity, $a$ is also a unit vector, i.e., $\sum_j a_j^2 = 1$. By inspection, set $a_1 = \pm 1$ and $a_j = 0$ for $j \neq 1$. Hence, $u = Qa = q_1$ (the top eigenvector).

A similar argument shows that the $k$th principal component is the $k$th eigenvector of $\Sigma$. If you're interested, look up the Courant-Fischer Theorem.
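The conclusion is easy to sanity-check empirically. The sketch below (an assumed toy setup, not from the slides) compares $u^\top \Sigma u$ for the top eigenvector against many random unit vectors:

```python
# Empirical check: the top eigenvector of Sigma maximizes u^T Sigma u over unit vectors.
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 5))
Sigma = B @ B.T                           # stand-in for an empirical covariance matrix

eigvals, Q = np.linalg.eigh(Sigma)
q_top = Q[:, -1]                          # eigenvector with the largest eigenvalue

best_random = 0.0
for _ in range(10_000):
    u = rng.normal(size=5)
    u /= np.linalg.norm(u)                # random unit vector
    best_random = max(best_random, u @ Sigma @ u)

assert q_top @ Sigma @ q_top >= best_random            # no random direction beats q_1
assert np.isclose(q_top @ Sigma @ q_top, eigvals[-1])   # and the maximum value is lambda_1
```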

Decorrelation

Interesting fact: the dimensions of $z$ are decorrelated. For now, let $\mathrm{Cov}$ denote the empirical covariance.

$$
\begin{aligned}
\mathrm{Cov}(z) &= \mathrm{Cov}\big(U^\top(x - \mu)\big) \\
&= U^\top \mathrm{Cov}(x)\, U \\
&= U^\top \Sigma U \\
&= U^\top Q \Lambda Q^\top U \\
&= \begin{pmatrix} I & 0 \end{pmatrix} \Lambda \begin{pmatrix} I \\ 0 \end{pmatrix} && \text{by orthogonality} \\
&= \text{top left } K \times K \text{ block of } \Lambda
\end{aligned}
$$

If the covariance matrix is diagonal, this means the features are uncorrelated. This is why PCA was originally invented (in 1901!).
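A quick check of this fact on toy data (an assumed example, not from the slides): the empirical covariance of the codes is diagonal, with the top eigenvalues on the diagonal.

```python
# Check that PCA codes are decorrelated: Cov(z) is the top-left K x K block of Lambda.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # correlated data
mu = X.mean(axis=0)

Sigma = np.cov(X, rowvar=False, bias=True)
eigvals, Q = np.linalg.eigh(Sigma)
U = Q[:, ::-1][:, :2]                    # top-2 principal components

Z = (X - mu) @ U
Cov_z = np.cov(Z, rowvar=False, bias=True)

off_diag = Cov_z - np.diag(np.diag(Cov_z))
assert np.allclose(off_diag, 0.0, atol=1e-8)               # codes are uncorrelated
assert np.allclose(np.diag(Cov_z), eigvals[::-1][:2])      # variances equal the top eigenvalues
```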

Recap

Dimensionality reduction aims to find a low-dimensional representation of the data. PCA projects the data onto a subspace which maximizes the projected variance, or equivalently, minimizes the reconstruction error. The optimal subspace is given by the top eigenvectors of the empirical covariance matrix. PCA gives a set of decorrelated features.

Applying PCA to faces

Consider running PCA on 2429 19×19 grayscale images (CBCL data). We can get good reconstructions with only 3 components.

PCA for pre-processing: we can apply a classifier to the latent representation. For face recognition, PCA with 3 components obtains 79% accuracy on face/non-face discrimination on test data, vs. 76.8% for a Gaussian mixture model (GMM) with 84 states. (We'll cover GMMs later in the course.)

PCA can also be good for visualization.
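A sketch of this pre-processing pattern is shown below, assuming scikit-learn; the images and labels are random stand-ins (the CBCL data is not loaded here, so the accuracy figure from the slide will not be reproduced):

```python
# PCA as a pre-processing step before a classifier (scikit-learn pipeline; synthetic stand-in data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
X = rng.normal(size=(2429, 19 * 19))     # stand-in for flattened 19x19 grayscale images
y = rng.integers(0, 2, size=2429)        # stand-in face / non-face labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit PCA on the training inputs, then classify in the 3-dimensional code space.
model = make_pipeline(PCA(n_components=3), LogisticRegression())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```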

Applying PCA to faces: Learned basis

Principal components of face images ("eigenfaces").

Applying PCA to digits

Next

Next: two more interpretations of PCA, which have interesting generalizations.
1. Autoencoders
2. Matrix factorization (next lecture)

Autoencoders

An autoencoder is a feed-forward neural net whose job it is to take an input $x$ and predict $x$. To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input.

Linear Autoencoders

Why autoencoders?
- Map high-dimensional data to two dimensions for visualization
- Learn abstract features in an unsupervised way so you can apply them to a supervised task; unlabeled data can be much more plentiful than labeled data

Linear Autoencoders

The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss:
$$\mathcal{L}(x, \tilde{x}) = \|x - \tilde{x}\|^2$$

This network computes $\tilde{x} = W_2 W_1 x$, which is a linear function.

If $K \geq D$, we can choose $W_2$ and $W_1$ such that $W_2 W_1$ is the identity matrix. This isn't very interesting. But suppose $K < D$: then $W_1$ maps $x$ to a $K$-dimensional space, so it's doing dimensionality reduction.
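As a minimal sketch (assuming PyTorch; the layer sizes, data, and training schedule are illustrative, not from the slides), such a linear autoencoder can be written and trained as follows:

```python
# A minimal linear autoencoder: one hidden layer, linear activations, squared error loss (PyTorch).
import torch
import torch.nn as nn

D, K = 20, 3
encoder = nn.Linear(D, K, bias=False)    # W1: maps x to the K-dimensional code
decoder = nn.Linear(K, D, bias=False)    # W2: maps the code back to D dimensions
model = nn.Sequential(encoder, decoder)  # computes x_tilde = W2 W1 x

X = torch.randn(500, D) @ torch.randn(D, D)   # stand-in data
X = X - X.mean(dim=0)                         # center it, since there are no bias terms
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(2000):
    X_tilde = model(X)
    loss = ((X - X_tilde) ** 2).sum(dim=1).mean()   # squared error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```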

Linear Autoencoders

Observe that the output of the autoencoder must lie in a $K$-dimensional subspace spanned by the columns of $W_2$. We saw that the best possible $K$-dimensional subspace in terms of reconstruction error is the PCA subspace. The autoencoder can achieve this by setting $W_1 = U^\top$ and $W_2 = U$. Therefore, the optimal weights for a linear autoencoder are just the principal components!
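The following numerical check (an assumed toy setup) illustrates why this choice is optimal: encoding with $U^\top$ and decoding with $U$, where $U$ holds the top principal components, gives a reconstruction error no worse than any other orthonormal choice of subspace.

```python
# Check: the PCA subspace minimizes reconstruction error among K-dimensional subspaces (toy data).
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)                        # centered data

Sigma = np.cov(Xc, rowvar=False, bias=True)
eigvals, Q = np.linalg.eigh(Sigma)
U = Q[:, ::-1][:, :2]                          # top-2 principal components

def recon_error(V):
    """Mean squared reconstruction error when encoding with V^T and decoding with V."""
    return np.mean(np.sum((Xc - Xc @ V @ V.T) ** 2, axis=1))

pca_error = recon_error(U)                     # W1 = U^T, W2 = U
for _ in range(1000):
    V, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # a random 2-D orthonormal basis
    assert pca_error <= recon_error(V) + 1e-12
```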

Nonlinear Autoencoders

Deep nonlinear autoencoders learn to project the data, not onto a subspace, but onto a nonlinear manifold. This manifold is the image of the decoder. This is a kind of nonlinear dimensionality reduction.
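For comparison with the linear case, here is a minimal sketch of a deep nonlinear autoencoder (assuming PyTorch; the architecture and input size are illustrative, not the ones from the slides):

```python
# A small deep nonlinear autoencoder: nonlinear encoder/decoder around a 2-D bottleneck (PyTorch).
import torch
import torch.nn as nn

D, K = 784, 2   # e.g., flattened 28x28 images down to a 2-D code

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(),
                        nn.Linear(256, K))
decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(),
                        nn.Linear(256, D))

X = torch.rand(128, D)                     # stand-in batch of inputs
Z = encoder(X)                             # nonlinear 2-D codes
X_tilde = decoder(Z)                       # points on the decoder's image (a nonlinear manifold)
loss = ((X - X_tilde) ** 2).sum(dim=1).mean()
loss.backward()                            # training would repeat this step with an optimizer
```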

Nonlinear Autoencoders

Nonlinear autoencoders can learn more powerful codes for a given dimensionality, compared with linear autoencoders (PCA).

Nonlinear Autoencoders

Here's a 2-dimensional autoencoder representation of newsgroup articles. They're color-coded by topic, but the algorithm wasn't given the labels.