Principal Component Analysis (Lecture 11)



Eigenvectors and Eigenvalues: Consider the problem of spreading butter on a bread slice.

Eigenvectors and Eigenvalues: Now consider the problem of stretching cheese on a bread slice.

Eigenvectors and Eigenvalues: The butter-spreading problem corresponds to the linear transformation $A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$.

Eigenvectors and Eigenvalues: Find the eigenvalues of $A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$ by solving $\det(A - \lambda I) = 0$: $\begin{vmatrix} 2-\lambda & 0 \\ 0 & 1-\lambda \end{vmatrix} = 0 \Rightarrow (2-\lambda)(1-\lambda) = 0 \Rightarrow \lambda = 2, 1$.

Eigenvectors and Eigenvalues: Find the eigenvector for $\lambda = 2$. From $A\mathbf{v} = \lambda\mathbf{v}$, i.e. $(A - \lambda I)\mathbf{v} = 0$: $\begin{pmatrix} 2-2 & 0 \\ 0 & 1-2 \end{pmatrix}\mathbf{v} = 0 \Rightarrow 0 v_1 - 1 v_2 = 0 \Rightarrow v_2 = 0$, so $\mathbf{v} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ (the x-axis).

Eigenvectors and Eigenvalues: For $\lambda = 2$, $(A - 2I)\mathbf{v} = 0$ gives $v_2 = 0$ and $\mathbf{v} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ (the x-axis). For $\lambda = 1$, $(A - I)\mathbf{v} = 0$ gives $\begin{pmatrix} 2-1 & 0 \\ 0 & 1-1 \end{pmatrix}\mathbf{v} = 0 \Rightarrow 1 v_1 - 0 v_2 = 0 \Rightarrow v_1 = 0$, so $\mathbf{v} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ (the y-axis).

Eigenvectors and Eigenvalues: What do the eigenvalues and eigenvectors indicate? That the linear transformation scales by $\lambda$ along the eigenvector corresponding to $\lambda$. In the example above, the scaling is $\lambda = 2$ along the x-axis and $\lambda = 1$ along the y-axis (intuitive). Stated differently, each eigenvalue indicates the share of the total transformation (the sum of the eigenvalues) acting along its direction: $2/3 \approx 66.66\%$ along the x-axis and $1/3 \approx 33.33\%$ along the y-axis.

Eigenvectors and Eigenvalues: Let's consider another example, this time with non-zero off-diagonal elements: $A = \begin{pmatrix} 1.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$. [Figures: the data before the linear transformation, and before & after the linear transformation.]

Eigenvectors and Eigenvalues: Compute the eigenvalues and eigenvectors of $A = \begin{pmatrix} 1.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$. [Figures: the data before the linear transformation, and before & after the linear transformation.]

Eigenvectors and Eigenvalues: For $A = \begin{pmatrix} 1.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$ the eigenvalues and eigenvectors are $\lambda = 1$ with $\mathbf{v} = \begin{pmatrix} -0.7071 \\ 0.7071 \end{pmatrix}$ and $\lambda = 2$ with $\mathbf{v} = \begin{pmatrix} 0.7071 \\ 0.7071 \end{pmatrix}$. [Figure: the data before & after the linear transformation.]

Eigenvectors and Eigenvalues: What do the eigenvalues and eigenvectors indicate? That the linear transformation scales by $\lambda$ along the corresponding eigenvector: $\lambda = 2$ along $[0.7071, 0.7071]^T$ (eigenvector 1) and $\lambda = 1$ along $[-0.7071, 0.7071]^T$ (eigenvector 2). In the eigenvector coordinate system the transformation becomes $\Lambda = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$: even though we started with a non-diagonal transformation matrix $A$, computing the eigenvectors and projecting the data onto them diagonalizes the transformation. [Figure: the data plotted against eigenvector 1 ($\lambda = 2$) and eigenvector 2 ($\lambda = 1$).]
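A minimal NumPy sketch (not part of the original slides) reproducing the two worked examples: the eigen-decomposition of the diagonal "butter" matrix, and the diagonalization of the non-diagonal matrix by projecting onto its eigenvectors.

```python
# Sketch: eigen-decomposition of the two example transformations with NumPy.
import numpy as np

# Example 1: the "butter spreading" transformation.
A1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])
vals1, vecs1 = np.linalg.eig(A1)
print(vals1)            # [2. 1.]
print(vecs1)            # columns are the eigenvectors (x- and y-axis)

# Example 2: the non-diagonal transformation.
A2 = np.array([[1.5, 0.5],
               [0.5, 1.5]])
vals2, vecs2 = np.linalg.eigh(A2)       # eigh: symmetric matrix, ascending eigenvalues
print(vals2)            # [1. 2.]
print(vecs2)            # columns ~ [-0.7071, 0.7071] and [0.7071, 0.7071], up to sign

# Projecting onto the eigenvectors diagonalizes A2: V^T A2 V = diag(1, 2).
print(np.round(vecs2.T @ A2 @ vecs2, 6))
```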

Recap of Linear Algebra: Projection

Vector Space

Basis Vectors

Projection using Basis Vectors

Projection using Basis Vectors

Projection using Basis Vectors
Statistics Preliminaries: Mean. Let $X_1, X_2, \ldots, X_n$ be $n$ observations of a random variable $X$; then $\mu = \bar{X} = E[X] = \frac{1}{n}\sum_{i=1}^{n} X_i$. The mean is a measure of central tendency (others are the mode and the median). Standard deviation: a measure of variability (the square root of the variance), $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2}$.

Statistics Preliminaries: Covariance, a measure of how two variables change together: $\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_X)(Y_i - \mu_Y)$, or in matrix form $\Sigma_{XY} = E\!\left[(X - E[X])(Y - E[Y])^T\right]$. The covariance matrix is
$$\Sigma = \begin{pmatrix} \sigma_{X_1}^2 & \sigma_{X_1 X_2} & \cdots & \sigma_{X_1 X_d} \\ \sigma_{X_1 X_2} & \sigma_{X_2}^2 & \cdots & \sigma_{X_2 X_d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{X_1 X_d} & \sigma_{X_2 X_d} & \cdots & \sigma_{X_d}^2 \end{pmatrix}.$$

Statistics Preliminaries: Correlation, a normalized measure of how two variables change together: $\rho_{XY} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$.
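A small illustration (the data below are hypothetical, chosen only for this sketch) of the mean, standard deviation, covariance, and correlation with the $1/n$ convention used on these slides.

```python
# Sketch: sample statistics for two variables, using the 1/n convention.
import numpy as np

x = np.array([2.1, 2.9, 3.7, 4.2, 5.5, 6.1])    # hypothetical observations of X
y = np.array([1.8, 3.1, 3.4, 4.8, 5.2, 6.4])    # hypothetical observations of Y

mu_x, mu_y = x.mean(), y.mean()                  # means
sd_x, sd_y = x.std(), y.std()                    # standard deviations (1/n)
cov_xy = np.mean((x - mu_x) * (y - mu_y))        # sigma_XY
Sigma = np.cov(np.stack([x, y]), bias=True)      # 2x2 covariance matrix (1/n)
rho_xy = cov_xy / (sd_x * sd_y)                  # correlation coefficient

print(Sigma)
print(rho_xy, np.corrcoef(x, y)[0, 1])           # the two values agree
```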

Statistics Preliminaries, an example: samples with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$. [Figure: scatter plot of the samples in the $(x_1, x_2)$ plane.]

Statistics Preliminaries, an example: samples with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 3 \end{pmatrix}$. [Figure: scatter plot of the samples.]

Statistics Preliminaries, an example: samples with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}$. [Figure: scatter plot of the samples.]

Statistics Preliminaries, an example: samples with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$. [Figure: scatter plot of the samples.]

Statistics Preliminaries, an example: samples with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & -2 \\ -2 & 3 \end{pmatrix}$. [Figure: scatter plot of the samples.]
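The scatter plots above can be reproduced by sampling from the corresponding Gaussians; the sketch below (sample size is an assumption) draws points from each $N(\mu, \Sigma)$ and checks that the empirical covariance is close to the specified one.

```python
# Sketch: draw samples from each example Gaussian and check the empirical covariance.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 3.0])
covariances = [
    [[2.0,  0.0], [ 0.0, 3.0]],
    [[2.0,  0.5], [ 0.5, 3.0]],
    [[2.0,  1.0], [ 1.0, 3.0]],
    [[2.0,  2.0], [ 2.0, 3.0]],
    [[2.0, -2.0], [-2.0, 3.0]],
]
for Sigma in covariances:
    X = rng.multivariate_normal(mu, Sigma, size=1000)   # 1000 samples (assumed)
    print(np.round(np.cov(X, rowvar=False), 2))          # close to Sigma
```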

PCA, an example: with $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$, can you find a single vector that would approximate this 2-D space? [Figure: scatter plot of the samples.]

PCA, let's build some intuition: with $\mu$ and $\Sigma$ as above, can you find a vector that would approximate this 2-D space? The green vector, right? [Figure: scatter plot with the candidate direction shown in green.]

PCA, let's build some intuition: it would be nice to diagonalize the covariance matrix, because then you only have to think about the variance along each axis. [Figure: scatter plot of the samples.]

PCA, let's build some intuition: it would be nice to diagonalize the covariance matrix, because then you only have to think about the variance along each axis. Think: eigenvectors of the covariance matrix. [Figure: scatter plot of the samples.]

PCA, let's build some intuition: let's subtract the mean first (it makes things simpler), so $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$. We still want to diagonalize the covariance matrix; think eigenvectors. [Figure: mean-centered scatter plot.]

PCA, let's build some intuition: the eigenvalues and eigenvectors of $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$ are $\lambda = 0.4384$ with $\mathbf{v} = \begin{pmatrix} -0.7882 \\ 0.6154 \end{pmatrix}$ and $\lambda = 4.5616$ with $\mathbf{v} = \begin{pmatrix} 0.6154 \\ 0.7882 \end{pmatrix}$. [Figure: mean-centered scatter plot.]

PCA, let's build some intuition: the first eigenvector (the one with the largest eigenvalue) accounts for 91.23% of the variance in the data, since $4.5616 / (4.5616 + 0.4384) = 0.9123$. [Figure: mean-centered scatter plot.]

PCA, let's build some intuition: if you have to pick a single vector to approximate this 2-D space, it is the green vector, the first eigenvector, which accounts for 91.23% of the variance. [Figure: mean-centered scatter plot with the first eigenvector shown in green.]
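A short NumPy check (not on the slides) of the numbers quoted above: the eigen-decomposition of $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$ and the fraction of total variance captured by the leading eigenvector.

```python
# Sketch: verify the eigenvalues, eigenvectors, and explained-variance fraction.
import numpy as np

Sigma = np.array([[2.0, 2.0],
                  [2.0, 3.0]])
vals, vecs = np.linalg.eigh(Sigma)     # ascending eigenvalues
print(np.round(vals, 4))               # [0.4384 4.5616]
print(np.round(vecs, 4))               # columns match the slide's eigenvectors up to sign
print(round(vals[-1] / vals.sum(), 4)) # 0.9123 -> 91.23% of the variance
```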

PCA, an example: compute the principal components of the following two-dimensional dataset: $X = (x_1, x_2) = \{(1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)\}$. [Figure: scatter plot of the eight points.]

PCA, an example (continued): Step 1 is to determine the sample covariance matrix of the data.

PCA, an example (continued): Step 1 gives the sample covariance matrix $\Sigma = \begin{pmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{pmatrix}$. Step 2 is to find the eigenvalues and eigenvectors of the covariance matrix.

PCA, an example (continued): Step 2 gives $\lambda = 0.4081$ with $\mathbf{v} = \begin{pmatrix} 0.5883 \\ -0.8086 \end{pmatrix}$ and $\lambda = 9.3419$ with $\mathbf{v} = \begin{pmatrix} 0.8086 \\ 0.5883 \end{pmatrix}$; the eigenvector with the largest eigenvalue is the first principal component.
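A sketch of the worked example in NumPy, using the biased ($1/n$) covariance so the numbers match the slides; the final projection onto the first principal component is added here as the natural next step and is not shown on the slide.

```python
# Sketch: covariance, eigen-decomposition, and projection for the worked example.
import numpy as np

X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
              [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

# Step 1: sample covariance matrix (1/n, to match the slide's numbers).
Sigma = np.cov(X, rowvar=False, bias=True)
print(Sigma)                          # [[6.25 4.25] [4.25 3.5 ]]

# Step 2: eigenvalues and eigenvectors of the covariance matrix.
vals, vecs = np.linalg.eigh(Sigma)    # ascending order
print(np.round(vals, 4))              # [0.4081 9.3419]
print(np.round(vecs, 4))              # columns are the eigenvectors (up to sign)

# Step 3: project the centered data onto the first principal component.
pc1 = vecs[:, -1]                     # eigenvector of the largest eigenvalue
scores = (X - X.mean(axis=0)) @ pc1
print(np.round(scores, 3))
```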

Now the math part: How did we decide that the eigenvectors of the covariance matrix retain the maximum information? What are the assumptions behind PCA? First, that the data distribution is unimodal Gaussian and therefore fully described by the first two moments of the distribution, i.e. the mean and the covariance. Second, that the information is in the variance. PCA dimensionality reduction: the optimal* approximation of a random vector $\mathbf{x} \in \mathbb{R}^N$ by a linear combination of $M$ ($M < N$) independent vectors is obtained by projecting $\mathbf{x}$ onto the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$ of the covariance matrix $\Sigma_x$. (*Optimality is defined as minimizing the sum-squared magnitude of the approximation error.)

Let's do the derivation: the objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible. Let $\mathbf{x}$ be an $N$-dimensional random vector, represented as a linear combination of orthonormal basis vectors $[\varphi_1\ \varphi_2\ \ldots\ \varphi_N]$ as $\mathbf{x} = \sum_{i=1}^{N} y_i \varphi_i$, where $\varphi_i \perp \varphi_j$ for $i \neq j$, and $y_i$ is the weighting coefficient for basis vector $\varphi_i$, formed by taking the inner product of $\mathbf{x}$ with $\varphi_i$: $y_i = \mathbf{x}^T \varphi_i$.

Let's do the derivation (continued): as above, $\mathbf{x} = \sum_{i=1}^{N} y_i \varphi_i$ with $y_i = \mathbf{x}^T\varphi_i$. Example: with $\varphi_1 = [1\ 0]^T$ and $\varphi_2 = [0\ 1]^T$, the vector $\mathbf{x} = [1\ 1]^T$ has coefficients $y_1 = \mathbf{x}^T\varphi_1 = 1$ and $y_2 = \mathbf{x}^T\varphi_2 = 1$, so $\mathbf{x} = y_1\begin{pmatrix} 1 \\ 0 \end{pmatrix} + y_2\begin{pmatrix} 0 \\ 1 \end{pmatrix}$.

Let's do the derivation: suppose we choose to represent $\mathbf{x}$ with only $M$ ($M < N$) of the basis vectors. We can do this by replacing the components $[y_{M+1}, \ldots, y_N]^T$ with pre-selected constants $b_i$:
$$\hat{\mathbf{x}}(M) = \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i.$$
What is the resulting representation error?

Let's do the derivation: the representation error is
$$\Delta\mathbf{x}(M) = \mathbf{x} - \hat{\mathbf{x}}(M) = \sum_{i=1}^{N} y_i\varphi_i - \left(\sum_{i=1}^{M} y_i\varphi_i + \sum_{i=M+1}^{N} b_i\varphi_i\right) = \sum_{i=M+1}^{N} (y_i - b_i)\varphi_i.$$

Let's do the derivation: we measure the representation error by the mean-squared magnitude of $\Delta\mathbf{x}$ (empirically, $E[\|\Delta\mathbf{x}(M)\|^2] \approx \frac{1}{S}\sum_{s=1}^{S}\|\Delta\mathbf{x}_s(M)\|^2$ over $S$ samples). Our goal is to find the basis vectors $\varphi_i$ and constants $b_i$ that minimize this mean-square error:
$$E\!\left[\|\Delta\mathbf{x}(M)\|^2\right] = E\!\left[\left(\sum_{i=M+1}^{N}(y_i - b_i)\varphi_i\right)^{\!T}\!\left(\sum_{j=M+1}^{N}(y_j - b_j)\varphi_j\right)\right] = E\!\left[\sum_{i=M+1}^{N}\sum_{j=M+1}^{N}(y_i - b_i)(y_j - b_j)\,\varphi_i^T\varphi_j\right] = \sum_{i=M+1}^{N} E\!\left[(y_i - b_i)^2\right],$$
since orthonormality gives $\varphi_i^T\varphi_j = 1$ if $i = j$ and $0$ otherwise.

Let's do the derivation, find $b_i$: the optimal values of $b_i$ are found by setting the partial derivative of the objective function to zero,
$$\frac{\partial}{\partial b_i} E\!\left[(y_i - b_i)^2\right] = -2\left(E[y_i] - b_i\right) = 0 \;\Rightarrow\; b_i = E[y_i],$$
using the fact that expectation is a linear operator ($E[A - B] = E[A] - E[B]$). Intuitive: replace the discarded dimensions $y_i$ by their expected value (mean).

Let's do the derivation, find $b_i$ (continued): as we have done earlier in the course, setting the partial derivative of the objective to zero gives $b_i = E[y_i]$; intuitively, replace the discarded dimensions $y_i$ by their expected value. Example, reducing 2D to 1D with $\varphi_1 = [1\ 0]^T$: the points $\mathbf{x} = [1\ 1]^T$, $[2\ 0.5]^T$, and $[2.5\ 1]^T$ have $y_1 = 1, 2, 2.5$ and $y_2 = 1, 0.5, 1$, so the discarded coordinate is replaced by $b_2 = (1 + 0.5 + 1)/3 = 0.833$.
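A tiny numeric check of this example (my reading of the slide's figure): projecting onto $\varphi_1$ and replacing the discarded coordinate $y_2$ by a constant $b_2$, the mean-squared error is smallest when $b_2$ is the mean of the discarded coordinates, 0.833.

```python
# Sketch: the reconstruction MSE over candidate b_2 values is minimized at mean(y_2).
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.5], [2.5, 1.0]])
phi1, phi2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
y2 = X @ phi2                                      # discarded coordinates: [1, 0.5, 1]

def mse(b2):
    X_hat = np.outer(X @ phi1, phi1) + b2 * phi2   # keep y1, replace y2 by b2
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

for b2 in [0.0, 0.5, y2.mean(), 1.0]:
    print(round(b2, 3), round(mse(b2), 4))         # smallest MSE at b2 = 0.833
```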

Let's do the derivation: the mean-squared error can now be written as
$$\sigma^2(M) = \sum_{i=M+1}^{N} E\!\left[(y_i - b_i)^2\right] = \sum_{i=M+1}^{N} E\!\left[\left(y_i - E[y_i]\right)^2\right],$$
where $y_i = \mathbf{x}^T\varphi_i$.

Let's do the derivation: substituting $y_i = \mathbf{x}^T\varphi_i$ into the MSE expression gives
$$\sigma^2(M) = \sum_{i=M+1}^{N} E\!\left[\left(\mathbf{x}^T\varphi_i - E[\mathbf{x}^T\varphi_i]\right)^2\right] = \sum_{i=M+1}^{N} \varphi_i^T\, E\!\left[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^T\right]\varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i,$$
where $\Sigma_x$ is the covariance matrix of $\mathbf{x}$.

Let's do the derivation: we seek the solution that minimizes the MSE subject to the normality constraint $\varphi_i^T\varphi_i = 1$, which we incorporate using Lagrange multipliers $\lambda_i$:
$$\sigma^2(M) = \sum_{i=M+1}^{N} \varphi_i^T\Sigma_x\varphi_i + \sum_{i=M+1}^{N} \lambda_i\left(1 - \varphi_i^T\varphi_i\right).$$
Computing the partial derivative with respect to $\varphi_i$ (using $\frac{d}{d\mathbf{x}}\mathbf{x}^T A\mathbf{x} = (A + A^T)\mathbf{x} = 2A\mathbf{x}$ for symmetric $A$):
$$\frac{\partial}{\partial\varphi_i}\left[\sum_{i=M+1}^{N}\varphi_i^T\Sigma_x\varphi_i + \sum_{i=M+1}^{N}\lambda_i\left(1 - \varphi_i^T\varphi_i\right)\right] = 2\Sigma_x\varphi_i - 2\lambda_i\varphi_i = 0 \;\Rightarrow\; \Sigma_x\varphi_i = \lambda_i\varphi_i,$$
an eigenvalue problem. So the $\varphi_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of the covariance matrix $\Sigma_x$.

Let's do the derivation: we can now express the sum-squared error as
$$\sigma^2(M) = \sum_{i=M+1}^{N}\varphi_i^T\Sigma_x\varphi_i = \sum_{i=M+1}^{N}\varphi_i^T\lambda_i\varphi_i = \sum_{i=M+1}^{N}\lambda_i.$$
To minimize this measure, the discarded $\lambda_i$ must be the smallest eigenvalues. Therefore, to represent $\mathbf{x}$ with minimum sum-squared error, we choose the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$.
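A sketch with synthetic Gaussian data (the data, dimensions, and sample size are assumptions) verifying this result numerically: the mean-squared reconstruction error after keeping the top-$M$ eigenvectors equals the sum of the discarded eigenvalues.

```python
# Sketch: reconstruction MSE with M retained eigenvectors = sum of discarded eigenvalues.
import numpy as np

rng = np.random.default_rng(1)
N, M, n = 5, 2, 20000                      # dimensions, retained components, samples
A = rng.normal(size=(N, N))
X = rng.normal(size=(n, N)) @ A.T          # correlated synthetic data

Sigma = np.cov(X, rowvar=False, bias=True) # 1/n covariance of the data
vals, vecs = np.linalg.eigh(Sigma)         # ascending eigenvalues
Phi = vecs[:, -M:]                         # top-M eigenvectors

Xc = X - X.mean(axis=0)
X_hat = Xc @ Phi @ Phi.T                   # project onto the top-M subspace and reconstruct
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(round(mse, 4), round(vals[:-M].sum(), 4))   # the two numbers agree (up to float error)
```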

Scree plot: the eigenvalues (or the fraction of variance explained) plotted in decreasing order against component number, used to decide how many components $M$ to keep.
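The slide itself is a figure; below is a minimal matplotlib sketch of a scree plot, assuming `vals` holds the eigenvalues of a covariance matrix (here the ones from the worked example above).

```python
# Sketch: a scree plot of eigenvalues as fraction of variance explained.
import numpy as np
import matplotlib.pyplot as plt

vals = np.array([9.3419, 0.4081])          # e.g. eigenvalues from the worked example
explained = np.sort(vals)[::-1] / vals.sum()

plt.plot(np.arange(1, len(vals) + 1), explained, "o-")
plt.xlabel("Principal component")
plt.ylabel("Fraction of variance explained")
plt.title("Scree plot")
plt.show()
```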

Applications of PCA, spike sorting: extracellular recording (single electrode).

Applications of PCA, spike sorting: multi-unit recording. Notice that different colors (representing different neurons) show different patterns of extracellular activity [from Pouzat et al. 2002].

Applications of PCA, spike sorting: multi-unit recording projected onto PC1 and PC2. Different colors (representing different neurons) form distinct clusters of extracellular activity; clustering after PCA allows identification of events from a single source (neuron).
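An illustrative sketch only (synthetic sine-wave "spikes", not the recordings referenced on the slide) of the generic recipe the slide describes: project spike snippets onto the first two principal components and cluster the scores.

```python
# Sketch: PCA of spike snippets followed by k-means clustering of the PC scores.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 40)                                          # 40 samples per snippet
templates = [a * np.sin(2 * np.pi * t) for a in (1.0, 2.0, 3.0)]   # 3 synthetic "neurons"
snippets = np.vstack([tpl + 0.2 * rng.normal(size=t.size)
                      for tpl in templates for _ in range(100)])   # 300 noisy snippets

# PCA via eigen-decomposition of the snippet covariance matrix.
Xc = snippets - snippets.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ vecs[:, -2:]                                         # PC1/PC2 scores

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(labels))                                         # ~100 snippets per cluster
```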

Applications of PCA, neural coding: information is represented by the spiking patterns of neural ensembles; PCA enables visualization of high-dimensional neural activity.

Identity vs. intensity coding: PN ensemble activity traces concentration-specific trajectories on odor-specific manifolds (Stopfer et al., Neuron, 2003). [Figures: odor trajectories for 3 odors at multiple concentrations; trajectories for hexanol at five concentrations; octanol, geraniol, and hexanol manifolds. Visualization via LLE, Locally Linear Embedding (S. Roweis and L. Saul, Science, 2000).]

Another neural coding example (Kemere et al., 2008; Byron Yu, CMU, teaching notes).

Another neural coding example: neural representation after dimensionality reduction, with separate trajectories for Direction 1, Direction 2, and Direction 3 (Santhanam et al., 2009; Byron Yu, CMU, teaching notes).