Lecture 13: Principal Component Analysis
Brett Bernstein, CDS at NYU
April 25, 2017



Intro Question

Let $S \in \mathbb{R}^{n \times n}$ be symmetric.
1. How does $\operatorname{trace}(S)$ relate to the spectral decomposition $S = W \Lambda W^T$, where $W$ is orthogonal and $\Lambda$ is diagonal?
2. How do you solve $w^* = \arg\max_{\|w\|_2 = 1} w^T S w$? What is $(w^*)^T S w^*$?

Intro Solution

1. We use the following useful property of traces: $\operatorname{trace}(AB) = \operatorname{trace}(BA)$ for any matrices $A, B$ whose dimensions allow both products. Thus we have
$\operatorname{trace}(S) = \operatorname{trace}(W(\Lambda W^T)) = \operatorname{trace}((\Lambda W^T)W) = \operatorname{trace}(\Lambda),$
so the trace of $S$ is the sum of its eigenvalues.
2. $w^*$ is an eigenvector of $S$ corresponding to the largest eigenvalue, and $(w^*)^T S w^*$ is that largest eigenvalue.
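A minimal numpy sketch (not part of the lecture) to sanity-check both facts on a random symmetric matrix: the trace equals the sum of the eigenvalues, and no unit vector attains a larger value of $w^T S w$ than the top eigenvector.

```python
# Sanity check (illustrative, not from the slides): trace(S) = sum of eigenvalues,
# and the top eigenvector maximizes w^T S w over unit vectors.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = A @ A.T                              # a symmetric (in fact PSD) matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: ascending eigenvalues, orthonormal eigenvectors
assert np.isclose(np.trace(S), eigvals.sum())

w_star = eigvecs[:, -1]                  # eigenvector for the largest eigenvalue
for _ in range(1000):                    # random unit vectors never beat w_star
    w = rng.standard_normal(5)
    w /= np.linalg.norm(w)
    assert w @ S @ w <= w_star @ S @ w_star + 1e-9
print("trace(S) =", np.trace(S), "; max of w^T S w =", w_star @ S @ w_star)
```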

Unsupervised Learning

1. Where did the $y$'s go?
2. Try to find intrinsic structure in unlabeled data.
3. With PCA, we are looking for a low-dimensional affine subspace that approximates our data well.

Centered Data

1. Throughout this lecture we will work with centered data.
2. Suppose $X \in \mathbb{R}^{n \times d}$ is our data matrix. Define $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$.
3. Let $\bar{X} \in \mathbb{R}^{n \times d}$ be the matrix with $\bar{x}$ in every row.
4. Define the centered data: $\tilde{X} = X - \bar{X}$ and $\tilde{x}_i = x_i - \bar{x}$.
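A minimal sketch of the centering step on a small hypothetical data matrix (the numbers are illustrative, not from the lecture):

```python
# Centering a data matrix: subtract the per-feature sample mean from every row.
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])    # hypothetical 3 x 2 data matrix
x_bar = X.mean(axis=0)                                 # sample mean, one entry per feature
X_tilde = X - x_bar                                    # centered data: row i is x_i - x_bar
assert np.allclose(X_tilde.mean(axis=0), 0.0)          # each centered column averages to zero
```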

Variance Along A Direction

Definition. Let $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^d$ be the centered data. Fix a direction $w \in \mathbb{R}^d$ with $\|w\|_2 = 1$. The sample variance along $w$ is given by
$\frac{1}{n-1} \sum_{i=1}^n (\tilde{x}_i^T w)^2.$
This is the sample variance of the components $\tilde{x}_1^T w, \ldots, \tilde{x}_n^T w$.

1. This is also the sample variance of $x_1^T w, \ldots, x_n^T w$, using the uncentered data.
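The sketch below, reusing the same illustrative data as above, computes the sample variance along a unit direction $w$ from the definition and checks it against the variance of the uncentered projections; the particular direction chosen is an arbitrary assumption.

```python
# Sample variance along a unit direction w, computed two ways.
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
X_tilde = X - X.mean(axis=0)
n = X_tilde.shape[0]

w = np.array([1.0, 1.0]) / np.sqrt(2.0)                 # an arbitrary unit direction
proj = X_tilde @ w                                      # components x_tilde_i^T w
var_along_w = (proj ** 2).sum() / (n - 1)
assert np.isclose(var_along_w, np.var(X @ w, ddof=1))   # same as the variance of the uncentered projections
```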

Variance Along A Direction

[Figure: scatter plot of the centered data points $\tilde{x}_1, \ldots, \tilde{x}_7$ in the $(\tilde{x}_1, \tilde{x}_2)$-plane, together with a unit direction $w$.]

Variance Along A Direction

[Figure: the same points projected onto the line spanned by $w$, showing the $w^T \tilde{x}_i$ values.]

First Principal Component

1. Define the first loading vector $w^{(1)}$ to be the direction giving the highest variance:
$w^{(1)} = \arg\max_{\|w\|_2 = 1} \frac{1}{n-1} \sum_{i=1}^n (\tilde{x}_i^T w)^2.$
2. The maximizer is not unique, so we choose one.

Definition. The first principal component of $\tilde{x}_i$ is $\tilde{x}_i^T w^{(1)}$.

Principal Components

1. Define the $k$th loading vector $w^{(k)}$ to be the direction giving the highest variance among directions orthogonal to the first $k-1$ loading vectors:
$w^{(k)} = \arg\max_{\|w\|_2 = 1,\ w \perp w^{(1)}, \ldots, w^{(k-1)}} \frac{1}{n-1} \sum_{i=1}^n (\tilde{x}_i^T w)^2.$
2. The complete set of loading vectors $w^{(1)}, \ldots, w^{(d)}$ forms an orthonormal basis for $\mathbb{R}^d$.

Definition. The $k$th principal component of $\tilde{x}_i$ is $\tilde{x}_i^T w^{(k)}$.

Principal Components

1. Let $W$ denote the matrix with the $k$th loading vector $w^{(k)}$ as its $k$th column.
2. Then $W^T \tilde{x}_i$ gives the principal components of $\tilde{x}_i$ as a column vector.
3. $\tilde{X} W$ gives a new data matrix in terms of principal components.
4. If we compute the singular value decomposition (SVD) of $\tilde{X}$ we get
$\tilde{X} = V D W^T,$
where $D \in \mathbb{R}^{n \times d}$ is diagonal with non-negative entries, and $V \in \mathbb{R}^{n \times n}$, $W \in \mathbb{R}^{d \times d}$ are orthogonal.
5. Then $\tilde{X}^T \tilde{X} = W D^T D W^T$. Thus we can use the SVD of our data matrix to obtain the loading vectors $W$ and the eigenvalues $\Lambda = \frac{1}{n-1} D^T D$.
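A minimal numpy sketch (synthetic data, not the lecture's) of PCA via the SVD of the centered data matrix, checking that $\frac{1}{n-1} D^T D$ matches the eigenvalues of the sample covariance matrix:

```python
# PCA via the SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.1])   # hypothetical data, n=100, d=3
n, d = X.shape
X_tilde = X - X.mean(axis=0)

# Thin SVD: X_tilde = V @ diag(s) @ W.T, singular values s in descending order.
V, s, Wt = np.linalg.svd(X_tilde, full_matrices=False)
W = Wt.T                                    # columns are the loading vectors w^(1), ..., w^(d)

principal_components = X_tilde @ W          # row i holds the principal components of x_tilde_i
eigenvalues = s ** 2 / (n - 1)              # Lambda = D^T D / (n - 1)

S = X_tilde.T @ X_tilde / (n - 1)           # sample covariance matrix
assert np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(eigenvalues))
```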

Some Linear Algebra

Recall that $w^{(1)}$ is defined by
$w^{(1)} = \arg\max_{\|w\|_2 = 1} \frac{1}{n-1} \sum_{i=1}^n (\tilde{x}_i^T w)^2.$
We now perform some algebra to simplify this expression. Note that
$\sum_{i=1}^n (\tilde{x}_i^T w)^2 = \sum_{i=1}^n (\tilde{x}_i^T w)(\tilde{x}_i^T w) = \sum_{i=1}^n (w^T \tilde{x}_i)(\tilde{x}_i^T w) = w^T \left[ \sum_{i=1}^n \tilde{x}_i \tilde{x}_i^T \right] w = w^T \tilde{X}^T \tilde{X} w.$

Some Linear Algebra

1. This shows
$w^{(1)} = \arg\max_{\|w\|_2 = 1} \frac{1}{n-1} w^T \tilde{X}^T \tilde{X} w = \arg\max_{\|w\|_2 = 1} w^T S w,$
where $S = \frac{1}{n-1} \tilde{X}^T \tilde{X}$ is the sample covariance matrix.
2. By the introductory problem this implies $w^{(1)}$ is the eigenvector corresponding to the largest eigenvalue of $S$.
3. We also learn that the variance along $w^{(1)}$ is $\lambda_1$, the largest eigenvalue of $S$.
4. With a bit more work we can see that $w^{(k)}$ is the eigenvector corresponding to the $k$th largest eigenvalue, with $\lambda_k$ giving the associated variance.
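A small numerical check (again on assumed synthetic data) that the top eigenvector of $S$ is the first loading vector in the sense that the variance along it equals $\lambda_1$:

```python
# The first loading vector is the top eigenvector of S, with variance lambda_1 along it.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
X_tilde = X - X.mean(axis=0)
n = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (n - 1)

eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
w1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
lambda1 = eigvals[-1]

variance_along_w1 = ((X_tilde @ w1) ** 2).sum() / (n - 1)
assert np.isclose(variance_along_w1, lambda1)
```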

PCA Example

Example. A collection of people come to a testing site to have their heights measured twice. The two testers use different measuring devices, each of which introduces errors into the measurement process. Below we depict some of the measurements (already centered).

PCA Example

[Figure: scatter plot of the centered measurements, Tester 1 on the horizontal axis and Tester 2 on the vertical axis.]

1. Describe (vaguely) what you expect the sample covariance matrix to look like.
2. What do you think $w^{(1)}$ and $w^{(2)}$ are?

PCA Example: Solutions

1. We expect tester 2 to have a larger variance than tester 1, and the two testers' measurements to be nearly perfectly correlated. The sample covariance matrix is
$S = \begin{pmatrix} 40.5154 & 93.5069 \\ 93.5069 & 232.8653 \end{pmatrix}.$
2. We have $S = W \Lambda W^T$ with
$W = \begin{pmatrix} 0.3762 & -0.9265 \\ 0.9265 & 0.3762 \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} 270.8290 & 0 \\ 0 & 2.5518 \end{pmatrix}.$
Note that $\operatorname{trace}(\Lambda) = \operatorname{trace}(S)$. Since $\lambda_2$ is small, $w^{(2)}$ is almost in the null space of $S$. This suggests $-0.9265\,\tilde{x}_1 + 0.3762\,\tilde{x}_2 \approx 0$ for data points $(\tilde{x}_1, \tilde{x}_2)$. In other words, $\tilde{x}_2 \approx 2.46\,\tilde{x}_1$. Maybe tester 2 used centimeters and tester 1 used inches.
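A quick verification (not from the slides) that the stated eigendecomposition is consistent with this $S$; numpy may flip the signs of the eigenvectors, which does not change their meaning:

```python
# Verify the eigendecomposition of the two-tester sample covariance matrix.
import numpy as np

S = np.array([[40.5154, 93.5069],
              [93.5069, 232.8653]])
eigvals, eigvecs = np.linalg.eigh(S)            # ascending order
print(eigvals[::-1])                            # approx [270.83, 2.55]
print(eigvecs[:, ::-1])                         # columns approx +/-(0.3762, 0.9265) and +/-(-0.9265, 0.3762)
print(np.isclose(np.trace(S), eigvals.sum()))   # trace(S) = lambda_1 + lambda_2
```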

PCA Example: Plot In Terms of Principal Components

[Figure: the same centered measurements re-plotted in principal component coordinates, with $w^{(1)}$ and $w^{(2)}$ as axes; nearly all of the spread lies along $w^{(1)}$.]

Uses of PCA: Dimensionality Reduction

1. In our height example above, we can replace our two features with only a single feature, the first principal component.
2. This can be used as a preprocessing step in a supervised learning algorithm.
3. When performing dimensionality reduction, one must choose how many principal components to use. This is often done using a scree plot: a plot of the eigenvalues of $S$ in descending order.
4. Often people look for an "elbow" in the scree plot: a point where the plot becomes much less steep.
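A minimal sketch of a scree plot; the data is synthetic and matplotlib is an added dependency, both assumptions rather than anything from the lecture:

```python
# Scree plot: eigenvalues of the sample covariance matrix in descending order.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 10)) @ np.diag(np.linspace(3.0, 0.2, 10))   # synthetic data
X_tilde = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(X_tilde.T @ X_tilde / (X.shape[0] - 1))[::-1]  # descending

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("component number")
plt.ylabel("eigenvalue of S")
plt.title("Scree plot: look for an elbow")
plt.show()
```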

Scree Plot

[Figure: an example scree plot, from Jolliffe, Principal Component Analysis.]

Uses of PCA: Visualization

1. If we have high-dimensional data, it can be hard to plot it effectively. Sometimes plotting the first two principal components can reveal interesting geometric structure in the data.

[Figure: example two-dimensional PCA plot, from https://www.ncbi.nlm.nih.gov/pmc/articles/pmc2735096/]

Uses of PCA: Principal Component Regression

1. We want to build a linear model with a dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
2. We can choose some $k$ and replace each $x_i$ with its first $k$ principal components. Afterward we perform linear regression.
3. This is called principal component regression, and can be thought of as a discrete variant of ridge regression (see HTF 3.4.1).
4. Correlated features may be grouped together into a single principal component that averages their values (as with ridge regression). Think about the two-tester example from before.
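A rough numpy-only sketch of principal component regression under assumed synthetic data; the choice $k = 2$ and the helper `predict` are illustrative, not part of the lecture:

```python
# Principal component regression: project onto the first k PCs, then least squares.
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 5, 2
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

x_bar = X.mean(axis=0)
X_tilde = X - x_bar
_, _, Wt = np.linalg.svd(X_tilde, full_matrices=False)
Z = X_tilde @ Wt[:k].T                      # first k principal components of each example

# Ordinary least squares on the k principal components (plus an intercept column).
Z1 = np.hstack([Z, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)

def predict(x_new):
    """Map a new point to its first k principal components, then apply the fitted model."""
    z = (x_new - x_bar) @ Wt[:k].T
    return np.append(z, 1.0) @ coef

print(predict(X[0]), y[0])
```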

Standardization

1. What happens if you scale one of the features by a huge factor?
2. It will have a huge variance and become a dominant part of the first principal component.
3. To add scale-invariance to the process, people often standardize their data (center and normalize) before running PCA.
4. This is the same as using the correlation matrix in place of the covariance matrix.

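A small check (assumed synthetic data) that standardizing the features and then forming the sample covariance matrix reproduces the correlation matrix of the original data:

```python
# Standardizing before PCA is the same as using the correlation matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3)) * np.array([1.0, 10.0, 1000.0])   # wildly different scales

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize: center and normalize
S_z = Z.T @ Z / (X.shape[0] - 1)                    # covariance of the standardized data
assert np.allclose(S_z, np.corrcoef(X, rowvar=False))
```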

Dispersion Of The Data

1. One measure of the dispersion of our data is the following:
$\frac{1}{n-1} \sum_{i=1}^n \|x_i - \bar{x}\|_2^2 = \frac{1}{n-1} \sum_{i=1}^n \|\tilde{x}_i\|_2^2.$
2. A little algebra shows this is $\operatorname{trace}(S)$, where $S$ is the sample covariance matrix.
3. If we project onto the first $k$ principal components, the resulting data has dispersion $\lambda_1 + \cdots + \lambda_k$.
4. We can choose $k$ to account for a desired percentage of the total dispersion.
5. The subspace spanned by the first $k$ loading vectors maximizes the resulting dispersion over all possible $k$-dimensional subspaces.
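A sketch (synthetic data, illustrative 95% threshold) verifying that the total dispersion equals $\operatorname{trace}(S)$ and the sum of the eigenvalues, and choosing $k$ to capture a desired fraction of it:

```python
# Total dispersion = trace(S) = sum of eigenvalues; choose k by cumulative fraction.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 6)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.2, 0.1])
X_tilde = X - X.mean(axis=0)
n = X.shape[0]
S = X_tilde.T @ X_tilde / (n - 1)

dispersion = (np.linalg.norm(X_tilde, axis=1) ** 2).sum() / (n - 1)
eigvals = np.linalg.eigvalsh(S)[::-1]                       # descending
assert np.isclose(dispersion, np.trace(S))
assert np.isclose(dispersion, eigvals.sum())

fraction = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(fraction, 0.95) + 1)                # smallest k capturing 95%
print("choose k =", k, "components; cumulative fractions:", np.round(fraction, 3))
```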

Other Comments

1. The $k$-dimensional subspace $V$ spanned by $w^{(1)}, \ldots, w^{(k)}$ best fits the centered data in the least-squares sense. More precisely, it minimizes
$\sum_{i=1}^n \|\tilde{x}_i - P_V(\tilde{x}_i)\|_2^2$
over all $k$-dimensional subspaces, where $P_V$ orthogonally projects onto $V$.
2. Converting your data into principal components can sometimes hurt interpretability, since the new features are linear combinations (i.e., blends or baskets) of your old features.
3. The loading vectors corresponding to the smallest eigenvalues are nearly in the null space of $\tilde{X}$, and thus can reveal approximate linear dependencies in the centered data.
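A small illustration (synthetic data; the helper `residual` is my own) of the least-squares property in item 1: the span of the first $k$ loading vectors gives a smaller total squared residual than a random $k$-dimensional subspace:

```python
# The PCA subspace minimizes the sum of squared distances to the centered data.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((300, 5)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.1])
X_tilde = X - X.mean(axis=0)
k = 2

_, _, Wt = np.linalg.svd(X_tilde, full_matrices=False)
W_k = Wt[:k].T                                   # d x k matrix of the first k loading vectors

def residual(B):
    """Sum of squared distances from the rows of X_tilde to span(B), B having orthonormal columns."""
    P = B @ B.T                                  # orthogonal projector onto the column span of B
    return np.sum((X_tilde - X_tilde @ P) ** 2)

B_rand, _ = np.linalg.qr(rng.standard_normal((5, k)))   # a random k-dimensional orthonormal basis
assert residual(W_k) <= residual(B_rand) + 1e-9
print(residual(W_k), residual(B_rand))
```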

Principal Components Are Linear

Suppose we have the following labeled data.

[Figure: scatter plot of two colored clusters in the $(x_1, x_2)$-plane.]

How can we apply PCA and obtain a single principal component that distinguishes the colored clusters?

Principal Components Are Linear

1. In general, we can deal with non-linearity by adding features or using kernels.
2. Using kernels results in the technique called Kernel PCA.
3. Below we added the feature $\|x_i\|_2^2$ and took the first principal component.

[Figure: the data plotted along $w^{(1)}$ after adding the feature $\|x_i\|_2^2$; the first principal component now distinguishes the two clusters.]
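A synthetic sketch of the idea in item 3; the concentric-cluster data below is an assumption, not the slide's actual dataset:

```python
# Two concentric clusters are not separated by any linear PC of (x1, x2), but adding the
# feature ||x||_2^2 and taking the first PC of the augmented data distinguishes them.
import numpy as np

rng = np.random.default_rng(7)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(5.0, 0.1, 100)])  # inner / outer cluster
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels = np.array([0] * 100 + [1] * 100)

X_aug = np.column_stack([X, np.sum(X ** 2, axis=1)])    # add the feature ||x_i||_2^2
X_tilde = X_aug - X_aug.mean(axis=0)
_, _, Wt = np.linalg.svd(X_tilde, full_matrices=False)
pc1 = X_tilde @ Wt[0]                                   # first principal component of each point

# The first PC now separates the two clusters by a simple threshold.
print(pc1[labels == 0].mean(), pc1[labels == 1].mean())
```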