
LEC 2: Principal Component Analysis (PCA) A First Dimensionality Reduction Approach Dr. Guangliang Chen February 9, 2016

Outline: Introduction, Review of linear algebra, Matrix SVD, PCA

Motivation The digits are 784-dimensional, so it is very time consuming to perform even simple tasks like kNN search. We need a way to reduce the dimensionality of the data in order to increase speed. However, if we discard some dimensions, will that degrade performance? The answer can be no (as long as we do it carefully). In fact, it may even improve results in some cases.

How is this possible? [Figure: the mean intensity value at each of the 784 pixels, plotted against pixel index; the values drop to zero near the image boundary.] Boundary pixels tend to be zero; the number of degrees of freedom of each digit is much less than 784.

The main idea of PCA PCA reduces the dimensionality by discarding low-variance directions (under the assumption that useful information lies only along high-variance directions).

Other dimensionality reduction techniques PCA reduces the dimension of data by preserving as much variance as possible. Here are some alternatives:
Linear Discriminant Analysis (LDA): by preserving discriminatory information between the different training classes
Multidimensional Scaling (MDS): by preserving pairwise distances
Isomap: by preserving geodesic distances
Locally Linear Embedding (LLE): by preserving local geometry

Review of matrix algebra Special square matrices A ∈ R^{n×n}:
Symmetric: A^T = A
Diagonal: a_{ij} = 0 whenever i ≠ j
Orthogonal: A^{-1} = A^T (i.e., AA^T = A^T A = I_n)
Eigenvalues and eigenvectors: Av = λv, where λ satisfies det(A − λI) = 0 and v satisfies (A − λI)v = 0.
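
To make these definitions concrete, here is a minimal MATLAB sketch (the small matrices below are arbitrary examples chosen for illustration, not data from the lecture):

    A = [2 1; 1 3];                                % symmetric: A' equals A
    isequal(A, A')                                 % returns 1
    Q = [cos(0.3) -sin(0.3); sin(0.3) cos(0.3)];   % a rotation matrix, hence orthogonal
    norm(Q*Q' - eye(2))                            % ~0, so Q^{-1} = Q^T
    [V, D] = eig(A);                               % columns of V: eigenvectors; diag(D): eigenvalues
    norm(A*V(:,1) - D(1,1)*V(:,1))                 % ~0, i.e., Av = λv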

Eigenvalue decomposition of symmetric matrices Let A ∈ R^{n×n} be a symmetric matrix. Then there exist an orthogonal matrix Q = [q_1 ... q_n] and a diagonal matrix Λ = diag(λ_1, ..., λ_n), both real and square, such that A = QΛQ^T. This is also called the spectral decomposition of A. Note that the above equation is equivalent to Aq_i = λ_i q_i, i = 1, ..., n. Therefore, the λ_i's are the eigenvalues of A while the q_i's are the associated eigenvectors.

Remark 1: For convenience the diagonal elements of Λ are often sorted such that λ_1 ≥ λ_2 ≥ ... ≥ λ_n. Remark 2: For non-symmetric matrices, neither the orthogonality of Q nor the diagonality of Λ needs to hold. For an arbitrary matrix A ∈ R^{n×n}, we have the decomposition A = PJP^{-1}, where P is invertible and J is upper triangular. This is called the Jordan canonical form of A. When J can be made diagonal by a suitable choice of P (not always possible), we say that A is diagonalizable.

Positive (semi)definite matrices A symmetric matrix A is said to be positive (semi)definite if x^T A x > 0 (resp. ≥ 0) for all x ≠ 0. Theorem. A symmetric matrix A is positive (semi)definite if and only if its eigenvalues are positive (resp. nonnegative). Proof: This result can be proved by applying the spectral decomposition of A. Left to you as an exercise.
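
As a quick numerical illustration (a sketch, not part of the lecture): any matrix of the form X^T X is symmetric positive semidefinite, so its eigenvalues should all be nonnegative.

    X = randn(10, 4);
    C = X' * X;              % symmetric and positive semidefinite by construction
    ev = eig(C);
    all(ev > -1e-10)         % returns 1: all eigenvalues are nonnegative (up to round-off)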

Matlab command for eigenvalue decomposition D = eig(A) produces a column vector D containing the eigenvalues of a square matrix A. [V,D] = eig(A) produces a diagonal matrix D of eigenvalues and a full matrix V whose columns are the corresponding eigenvectors, so that A = V*D*inv(V).
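
For a symmetric matrix, eig returns real eigenvalues and orthonormal eigenvectors, so the spectral decomposition can be verified directly. A small sketch (the matrix is an arbitrary example; note that eig does not sort the eigenvalues in decreasing order, so we sort them ourselves):

    A = [4 1 0; 1 3 1; 0 1 2];              % symmetric
    [Q, L] = eig(A);
    [lam, idx] = sort(diag(L), 'descend');  % sort so that λ_1 ≥ ... ≥ λ_n
    Q = Q(:, idx);
    norm(A - Q*diag(lam)*Q')                % ~0, i.e., A = QΛQ^T
    norm(Q'*Q - eye(3))                     % ~0, i.e., Q is orthogonal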

Singular value decomposition (SVD) Let X ∈ R^{n×d} be a rectangular matrix. The SVD of X is defined as X_{n×d} = U_{n×n} Σ_{n×d} (V_{d×d})^T, where U = [u_1 ... u_n] and V = [v_1 ... v_d] are orthogonal and Σ is diagonal:
The u_i's are called the left singular vectors of X;
The v_j's are called the right singular vectors of X;
The diagonal entries of Σ are called the singular values of X.
Note. This is often called the full SVD, to distinguish it from other variations.

Matlab command for matrix SVD svd Singular Value Decomposition. [U,S,V] = svd(X) produces a diagonal matrix S, of the same dimension as X and with nonnegative diagonal elements in decreasing order, and orthogonal matrices U and V so that X = U*S*V'. s = svd(X) returns a vector containing the singular values.
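
A quick usage sketch (on a random matrix, purely for illustration):

    X = randn(5, 3);
    [U, S, V] = svd(X);
    norm(X - U*S*V')          % ~0: the factorization reconstructs X
    diag(S)'                  % nonnegative singular values in decreasing order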

Other versions of matrix SVD Let X ∈ R^{n×d}. We define
Economic SVD: X_{n×d} = U_{n×d} Σ_{d×d} (V_{d×d})^T if n > d (tall matrix), and X_{n×d} = U_{n×n} Σ_{n×n} (V_{d×n})^T if n < d (long matrix)
Compact SVD: X_{n×d} = U_{n×r} Σ_{r×r} (V_{d×r})^T, where r = rank(X)
Rank-1 decomposition: X = ∑_i σ_i u_i v_i^T

Matlab command for Economic SVD For tall matrices only: [U,S,V] = svd(X,0) produces the "economy size" decomposition. If X is m-by-n with m > n, then only the first n columns of U are computed and S is n-by-n. For m <= n, svd(X,0) is equivalent to svd(X). For both tall and long matrices: [U,S,V] = svd(X,'econ') also produces the "economy size" decomposition. If X is m-by-n with m >= n, then it is equivalent to svd(X,0). For m < n, only the first m columns of V are computed and S is m-by-m.
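
The sketch below (on a random tall matrix) compares the sizes of the full and economy-size factors and verifies the rank-1 decomposition from the previous slide:

    X = randn(100, 5);               % a tall matrix (n > d)
    [U,  S,  V ] = svd(X);           % full SVD:    U is 100x100, S is 100x5, V is 5x5
    [Ue, Se, Ve] = svd(X, 'econ');   % economy SVD: Ue is 100x5,  Se is 5x5,  Ve is 5x5
    Xsum = zeros(size(X));
    for i = 1:rank(X)                % rank-1 decomposition: X = ∑ σ_i u_i v_i^T
        Xsum = Xsum + Se(i,i) * Ue(:,i) * Ve(:,i)';
    end
    norm(X - Xsum)                   % ~0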

A brief mathematical derivation of SVD For any X ∈ R^{n×d}, form C = X^T X ∈ R^{d×d}. Observation: C is square, symmetric, and positive semidefinite. Therefore, C = VΛV^T for an orthogonal V ∈ R^{d×d} and Λ = diag(λ_1, ..., λ_d) with λ_1 ≥ ... ≥ λ_r > 0 = λ_{r+1} = ... = λ_d (where r = rank(X) ≤ d). Define σ_i = √λ_i for all i and u_i = (1/σ_i) X v_i for 1 ≤ i ≤ r. Claim: u_1, ..., u_r are orthonormal vectors. Extend them to an orthogonal matrix U = [u_1 ... u_r u_{r+1} ... u_n] and let Σ = diag(σ_1, ..., σ_d) ∈ R^{n×d}. Then from X v_i = σ_i u_i for all i (with X v_i = 0 = σ_i for i > r) we obtain XV = UΣ, or equivalently, X = UΣV^T.
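
A numerical sketch of this construction on a random matrix (assuming full column rank, so r = d and no extension of U is needed; this yields the economy-size SVD):

    X = randn(8, 3);
    C = X' * X;
    [V, L] = eig(C);
    [lam, idx] = sort(diag(L), 'descend');   % λ_1 ≥ ... ≥ λ_d
    V = V(:, idx);
    sigma = sqrt(lam);                       % σ_i = sqrt(λ_i)
    U = X * V * diag(1 ./ sigma);            % u_i = (1/σ_i) X v_i
    norm(U' * U - eye(3))                    % ~0: the u_i are orthonormal
    norm(X - U * diag(sigma) * V')           % ~0: X = UΣV^T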

Connection with eigenvalue decomposition Let X = UΣV^T be the full SVD of X ∈ R^{n×d}. Then
U consists of the eigenvectors of XX^T ∈ R^{n×n} (Gram matrix);
V consists of the eigenvectors of X^T X ∈ R^{d×d} (covariance matrix, if the rows of X have a mean of zero);
Σ consists of the square roots of the eigenvalues of either matrix (and zeros).
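
A quick check of this connection (sketch on a random matrix):

    X = randn(6, 4);
    s  = svd(X);                         % singular values of X
    ev = sort(eig(X' * X), 'descend');   % eigenvalues of X^T X
    norm(ev - s.^2)                      % ~0: the eigenvalues are the squared singular values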

Principal Component Analysis (PCA) Algorithm
Input: data set X = [x_1 ... x_n]^T ∈ R^{n×d} and integer s (with 0 < s < d)
Output: compressed data Y ∈ R^{n×s}
Steps:
1. Center the data: X̃ = [x_1 ... x_n]^T − [m ... m]^T, where m = (1/n) ∑ x_i
2. Perform SVD: X̃ = UΣV^T
3. Return: Y = X̃ V(:, 1:s) = U(:, 1:s) Σ(1:s, 1:s)
Terminology. We call the columns of V the principal directions of the data and the columns of Y its top s principal components.

Geometric meaning The principal components Y represent the coefficients of the projection of X onto the subspace through the point m = (1/n) ∑ x_i and spanned by the basis vectors v_1, ..., v_s.
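
To illustrate, the following sketch (on arbitrary random data, not the digits) computes the principal components and maps them back to points in the original space R^d, i.e., the projections of the data onto the s-dimensional subspace through m:

    X = randn(50, 10);   s = 3;
    n = size(X, 1);
    m = mean(X, 1);
    Xtilde = X - repmat(m, n, 1);                   % centered data
    [U, S, V] = svd(Xtilde, 'econ');
    Y = Xtilde * V(:, 1:s);                         % principal components (coefficients)
    Xproj = repmat(m, n, 1) + Y * V(:, 1:s)';       % projected points, back in R^d
    norm(Y - U(:, 1:s) * S(1:s, 1:s))               % ~0, consistent with the algorithm above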

MATLAB implementation of PCA MATLAB built-in: [V, US] = pca(X); % Rows of X are observations
Alternatively, you may want to code it yourself:
n = size(X,1);
center = mean(X,1);
Xtilde = X - repmat(center, n, 1);
[U,S,V] = svd(Xtilde, 'econ');
Y = Xtilde*V(:,1:k); % k is the reduced dimension
Note: The first three lines can be combined into one line: Xtilde = X - repmat(mean(X,1), size(X,1), 1);

Some properties of PCA Note that the principal coordinates Y = [X̃v_1 ... X̃v_s] = [σ_1 u_1 ... σ_s u_s]:
The projection of X̃ onto the ith principal direction v_i is X̃v_i = σ_i u_i.
The variance of the ith projection X̃v_i is σ_i^2.
Proof. First, 1^T X̃v_i = 0^T v_i = 0, which shows that X̃v_i has mean zero. Second, ||σ_i u_i||_2^2 = σ_i^2. Accordingly, the projection X̃v_i = σ_i u_i has a variance of σ_i^2. Therefore, the right singular vectors v_1, ..., v_s represent directions of decreasing variance (σ_1^2 ≥ ... ≥ σ_s^2).
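
A numerical check of these two properties (a sketch; "variance" here is the unnormalized squared norm, as on the slide):

    X = randn(40, 6);
    Xtilde = X - repmat(mean(X,1), size(X,1), 1);
    [U, S, V] = svd(Xtilde, 'econ');
    p1 = Xtilde * V(:,1);                % projection onto the first principal direction
    abs(mean(p1))                        % ~0: the projection has mean zero
    [norm(p1)^2, S(1,1)^2]               % the two values agree: variance = σ_1^2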

Is there an even better direction than v_1? The answer is no; that is, ±v_1 contains the maximum possible variance among all single directions. Mathematically, it is the solution of the following problem: max_{v ∈ R^d: ||v||_2 = 1} ||X̃v||_2^2. Proof. Consider the full SVD X̃ = UΣV^T. For any unit vector v ∈ R^d, write v = Vα for some unit vector α ∈ R^d. Then X̃v = X̃Vα = UΣα. Accordingly, ||X̃v||_2^2 = ||UΣα||_2^2 = ||Σα||_2^2 = ∑ σ_i^2 α_i^2 ≤ σ_1^2, where equality holds when α = ±e_1 and correspondingly v = ±Ve_1 = ±v_1. Remark: The maximum value is σ_1^2, which is consistent with before.
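
The sketch below compares ||X̃v||^2 over many random unit directions v against σ_1^2; no random direction should exceed it:

    X = randn(30, 5);
    Xtilde = X - repmat(mean(X,1), size(X,1), 1);
    [~, S, V] = svd(Xtilde, 'econ');
    best = 0;
    for t = 1:1000
        v = randn(5, 1);  v = v / norm(v);          % a random unit direction
        best = max(best, norm(Xtilde * v)^2);
    end
    [best, S(1,1)^2]                                % best stays below σ_1^2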

How to set the parameter s in principle? Generally, there are two ways to choose the reduced dimension s:
Set s = the number of dominant singular values
Choose s such that the top s principal directions explain a certain fraction p of the variance of the data: ∑_{i=1}^s σ_i^2 (explained variance) > p · ∑_i σ_i^2 (total variance)
Typically, p = 0.95, or 0.99 (more conservative), or 0.90 (more aggressive).
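
A sketch of the second criterion (on toy random data; for the digits, the homework below quotes s = 154 at the 95% level):

    X = randn(200, 20);
    Xtilde = X - repmat(mean(X,1), size(X,1), 1);
    p = 0.95;
    sig2 = svd(Xtilde) .^ 2;                 % squared singular values
    frac = cumsum(sig2) / sum(sig2);         % cumulative fraction of the total variance
    s = find(frac > p, 1)                    % smallest s whose explained variance exceeds p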

More interpretations of PCA PCA reduces the dimension of data by maximizing the variance of the projection (for a given dimension). It is also known that PCA, for a given dimension s,
preserves the pairwise distances between the data points as much as possible;
minimizes the total orthogonal fitting error by an s-dimensional subspace.

PCA for classification Note that in the classification setting there are two data sets: X_train and X_test. Perform PCA on the entire training set and then project the test data using the training basis: Y_test = (X_test − [m_train ... m_train]^T) V_train. Finally, select a classifier to work in the reduced space:
PCA + kNN
PCA + local kmeans
PCA + other classifiers
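
A sketch of this projection step on toy train/test matrices (the sizes and the reduced dimension s are arbitrary here):

    Xtrain = randn(100, 20);   Xtest = randn(30, 20);   s = 5;
    mtrain = mean(Xtrain, 1);
    Xt = Xtrain - repmat(mtrain, size(Xtrain,1), 1);
    [U, S, V] = svd(Xt, 'econ');
    Ytrain = Xt * V(:, 1:s);
    Ytest  = (Xtest - repmat(mtrain, size(Xtest,1), 1)) * V(:, 1:s);   % training mean and basis
    % Ytrain and Ytest can now be fed to kNN (or another classifier) in R^s.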

A final remark about the projection dimension s PCA is an unsupervised method, and the 95% (or 90%) criterion is only a conservative choice intended to avoid losing any useful direction. In the context of classification it is possible to go much lower than this threshold while maintaining, or even improving, the classification accuracy. The reason is that variance is not necessarily useful for classification. In practice, one may want to use cross validation to select the optimal s.
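
A hedged sketch of such a cross-validation search (toy two-class data; fitcknn, crossval, and kfoldLoss require the Statistics and Machine Learning Toolbox, and the candidate values of s are arbitrary here, not the 50/154 used for the digits):

    X = [randn(100, 20); randn(100, 20) + 1];    % toy two-class data
    y = [zeros(100, 1); ones(100, 1)];
    Xtilde = X - repmat(mean(X,1), size(X,1), 1);
    [~, ~, V] = svd(Xtilde, 'econ');
    candidates = [2 5 10 20];
    err = zeros(size(candidates));
    for j = 1:numel(candidates)
        Y = Xtilde * V(:, 1:candidates(j));
        mdl = fitcknn(Y, y, 'NumNeighbors', 5);
        err(j) = kfoldLoss(crossval(mdl, 'KFold', 6));
    end
    [~, jbest] = min(err);   s = candidates(jbest)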

HW2a (due in about two weeks) 1. Apply the plain kNN classifier with 6-fold cross validation to the projected digits with s = 154 (corresponding to 95% variance) and s = 50 (corresponding to 82.5% variance) separately to select the best k from the range 1:10 (for each s). Plot the curves of validation error versus k and compare them with that of no projection (i.e., HW1-Q1). For the three choices of s (50, 154, and 784), what are the respective best k? Overall, which (s, k) pair gives the smallest validation error? 2. For each of the three values of s above (with the corresponding optimal k), apply the plain kNN classifier to the test data and display their errors using a bar plot. Interpret the results. Is what you got in this question consistent with that in Question 1?

3. Apply the local kmeans classifier, with each k = 1, ..., 10, to the test set of handwritten digits after they are projected into R^50. How does your result compare with that of no projection (i.e., HW1-Q5)? What about speed? Note: HW2b will be available after we finish our next topic.

Midterm project 1: Instance-based classification Task: Concisely describe the classifiers we have learned thus far and summarize their corresponding results in a poster to be displayed in the classroom. You are also encouraged to try new ideas/options (e.g., normal weights for weighted kNN, other metrics such as cosine) and include your findings in the poster. Who can participate: One to two students from this class, subject to the instructor's approval. When to finish: In 2 to 3 weeks. How it will be graded: Based on clarity, completeness, correctness, and originality. You are welcome to consult with the instructor for ideas to try in this project.