Principal Component Analysis (Lecture 11)
Eigenvectors and Eigenvalues

- Consider the problem of spreading butter on a bread slice: the knife stretches the butter along one direction. This is a linear transformation:

$$A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$$
Eigenvectors and Eigenvalues

- Find the eigenvalues of $A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$:

$$\det(A - \lambda I) = \begin{vmatrix} 2-\lambda & 0 \\ 0 & 1-\lambda \end{vmatrix} = (2-\lambda)(1-\lambda) = 0 \;\Rightarrow\; \lambda = 2,\ 1$$
Eigenvectors and Eigenvalues

- Find the eigenvector for $\lambda = 2$: solve $Av = \lambda v$, i.e. $(A - \lambda I)v = 0$:

$$\begin{pmatrix} 2-2 & 0 \\ 0 & 1-2 \end{pmatrix} v = 0 \;\Rightarrow\; 0v_1 - 1v_2 = 0 \;\Rightarrow\; v_2 = 0 \;\Rightarrow\; v = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \text{ (x-axis)}$$

- Find the eigenvector for $\lambda = 1$:

$$\begin{pmatrix} 2-1 & 0 \\ 0 & 1-1 \end{pmatrix} v = 0 \;\Rightarrow\; 1v_1 + 0v_2 = 0 \;\Rightarrow\; v_1 = 0 \;\Rightarrow\; v = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \text{ (y-axis)}$$
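A quick numerical check of this example (a minimal NumPy sketch, not part of the original slides):

    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 1.0]])

    # eigh is appropriate here because A is symmetric;
    # it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(A)
    print(eigvals)    # [1. 2.]
    print(eigvecs)    # columns are the eigenvectors: [0, 1] and [1, 0]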
Eigenvectors and Eigenvalues

- What do the eigenvalues and eigenvectors indicate?
  - That the linear transformation scales the data by $\lambda$ along the eigenvector corresponding to $\lambda$.
- In the example above, $\lambda = 2$ along the x-axis and $\lambda = 1$ along the y-axis (intuitive).
- Stated differently, each eigenvalue indicates the relative share of the transformation along its direction:
  - $2/(2+1) \approx 66.7\%$ along the x-axis
  - $1/(2+1) \approx 33.3\%$ along the y-axis
Eigenvectors and Eigenvalues

- Let's consider another example, this time with non-zero off-diagonal elements:

$$A = \begin{pmatrix} 1.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$$

[Figures: scatter of points before the linear transformation, and before & after the linear transformation]
Eigenvectors and Eigenvalues

- Compute the eigenvalues and eigenvectors of $A = \begin{pmatrix} 1.5 & 0.5 \\ 0.5 & 1.5 \end{pmatrix}$:

$$\lambda = 1:\; v = \begin{pmatrix} -0.7071 \\ 0.7071 \end{pmatrix} \qquad \lambda = 2:\; v = \begin{pmatrix} 0.7071 \\ 0.7071 \end{pmatrix}$$

[Figure: points before & after the linear transformation]
Eigenvectors and Eigenvalues

- What do the eigenvalues and eigenvectors indicate?
  - That the linear transformation scales the data by $\lambda$ along the eigenvector corresponding to $\lambda$:
  - $\lambda = 2$ along eigenvector 1, $[0.7071,\ 0.7071]^T$
  - $\lambda = 1$ along eigenvector 2, $[-0.7071,\ 0.7071]^T$
- In the eigenvector basis the transformation becomes diagonal:

$$A' = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$$

- Even though we started with a non-diagonal transformation matrix $A$, computing the eigenvectors and projecting the data onto them allows us to diagonalize the transformation.
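The diagonalization can be verified numerically (a minimal sketch, not from the slides):

    import numpy as np

    A = np.array([[1.5, 0.5],
                  [0.5, 1.5]])

    eigvals, Phi = np.linalg.eigh(A)   # columns of Phi are orthonormal eigenvectors

    # Changing basis to the eigenvectors diagonalizes A: Phi^T A Phi = diag(lambda_i)
    A_diag = Phi.T @ A @ Phi
    print(np.round(A_diag, 10))        # diag(1, 2) up to numerical precision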
Recap Linear Algebra: Projection

- Vector Space [figure-only slide]
- Basis Vectors [figure-only slide]
- Projection using Basis Vectors [figure-only slides]
Statistics Preliminaries

- Mean
  - Let $X_1, X_2, \ldots, X_n$ be $n$ observations of a random variable $X$:

$$\mu = \bar{X} = E[X] = \frac{1}{n}\sum_{i=1}^{n} X_i$$

  - The mean is a measure of central tendency (others are the mode and the median).
- Standard deviation
  - A measure of variability (the square root of the variance):

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2}$$
Statistics Preliminaries

- Covariance
  - A measure of how two variables change together:

$$\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu_X\right)\left(Y_i - \mu_Y\right) = E\big[(X - E[X])(Y - E[Y])\big]$$

- Covariance matrix

$$\Sigma = \begin{pmatrix}
\sigma_{X_1}^2 & \sigma_{X_1 X_2} & \cdots & \sigma_{X_1 X_d} \\
\sigma_{X_1 X_2} & \sigma_{X_2}^2 & \cdots & \sigma_{X_2 X_d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{X_1 X_d} & \sigma_{X_2 X_d} & \cdots & \sigma_{X_d}^2
\end{pmatrix}$$
Statistics Preliminaries

- Correlation
  - A normalized measure of how two variables change together:

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
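These quantities are straightforward to compute; here is a minimal NumPy sketch, using the same eight points that appear in the worked PCA example later in the lecture:

    import numpy as np

    x = np.array([1.0, 3.0, 3.0, 5.0, 5.0, 6.0, 8.0, 9.0])
    y = np.array([2.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 8.0])

    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()        # population (1/n) standard deviations

    cov_xy = np.mean((x - mu_x) * (y - mu_y))  # covariance, 1/n convention
    rho_xy = cov_xy / (sigma_x * sigma_y)      # correlation coefficient

    print(cov_xy, rho_xy)                      # 4.25 and ~0.909 for this data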
Statistics Preliminaries: An Example

- Scatter plots of samples drawn with mean $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and increasingly correlated covariance matrices:

$$\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix},\quad
\begin{pmatrix} 2 & 0.5 \\ 0.5 & 3 \end{pmatrix},\quad
\begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix},\quad
\begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix},\quad
\begin{pmatrix} 2 & -2 \\ -2 & 3 \end{pmatrix}$$

- As the off-diagonal (covariance) term grows in magnitude, the point cloud tilts along a diagonal; a negative covariance tilts it the other way.

[Figures: scatter plots of x1 vs. x2 for each $\Sigma$]
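These scatter plots can be reproduced by sampling from a multivariate Gaussian; the sketch below is an illustration (sample size and plotting details are assumptions, not from the slides):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    mu = np.array([2.0, 3.0])

    # The covariance matrices shown above, from uncorrelated to strongly correlated
    covs = [[[2, 0], [0, 3]], [[2, 0.5], [0.5, 3]],
            [[2, 1], [1, 3]], [[2, 2], [2, 3]], [[2, -2], [-2, 3]]]

    for cov in covs:
        samples = rng.multivariate_normal(mu, cov, size=500)
        plt.figure()
        plt.scatter(samples[:, 0], samples[:, 1], s=5)
        plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()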
PCA: An Example

- $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$
- Can you find a single vector that would approximate this 2-D space? The green vector, right?

[Figure: scatter plot with a candidate direction vector in green]
PCA: Let's build some intuition

- $\mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$
- It would be nice to diagonalize the covariance matrix; then you only have to think about variances, not covariances.
- Think eigenvectors of the covariance matrix.
PCA: Let's build some intuition

- Let's subtract the mean first (it makes things simpler). Now $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$

[Figure: the mean-centered scatter plot]
PCA: Let's build some intuition

- Eigenvalues and eigenvectors of $\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 3 \end{pmatrix}$:

$$\lambda = 0.4384:\; v = \begin{pmatrix} -0.7882 \\ 0.6154 \end{pmatrix} \qquad \lambda = 4.5616:\; v = \begin{pmatrix} 0.6154 \\ 0.7882 \end{pmatrix}$$
PCA: Let's build some intuition

- The first eigenvector accounts for $4.5616 / (4.5616 + 0.4384) = 91.23\%$ of the total variance.
- So if we have to pick a single vector to approximate this 2-D space, we pick the green vector: the eigenvector with the largest eigenvalue.

[Figure: mean-centered scatter plot with the first eigenvector drawn in green]
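This intuition is easy to check numerically (a minimal sketch, not part of the slides):

    import numpy as np

    Sigma = np.array([[2.0, 2.0],
                      [2.0, 3.0]])

    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending: [0.4384, 4.5616]
    print(eigvals)

    # Fraction of total variance captured by the leading eigenvector
    print(eigvals[-1] / eigvals.sum())         # ~0.9123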
PCA: An Example

- Compute the principal components of the following two-dimensional dataset:
  - X = (x1, x2) = {(1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)}
- Step 1: Determine the sample covariance matrix (using the $1/n$ convention):

$$\Sigma = \begin{pmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{pmatrix}$$

- Step 2: Find the eigenvalues and eigenvectors of the covariance matrix:

$$\lambda = 0.4081:\; v = \begin{pmatrix} 0.5883 \\ -0.8086 \end{pmatrix} \qquad \lambda = 9.3419:\; v = \begin{pmatrix} 0.8086 \\ 0.5883 \end{pmatrix}$$

[Figure: scatter plot of the dataset]
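The whole example can be reproduced in a few lines of NumPy (a minimal sketch; np.cov is called with bias=True to match the 1/n convention used above):

    import numpy as np

    X = np.array([[1, 2], [3, 3], [3, 5], [5, 4],
                  [5, 6], [6, 5], [8, 7], [9, 8]], dtype=float)

    # Step 1: sample covariance matrix (1/n convention)
    Sigma = np.cov(X.T, bias=True)
    print(Sigma)                      # [[6.25, 4.25], [4.25, 3.5]]

    # Step 2: eigenvalues/eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    print(eigvals)                    # [0.4081, 9.3419]
    print(eigvecs)                    # columns are the eigenvectors

    # Project the mean-centered data onto the first principal component
    pc1 = eigvecs[:, -1]              # eigenvector with the largest eigenvalue
    scores = (X - X.mean(axis=0)) @ pc1
    print(scores)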
Now the Math Part

- How did we decide that the eigenvectors of the covariance matrix retain maximum information?
- What are the assumptions behind PCA?
  - First, that the data distribution is unimodal Gaussian in nature:
    - Fully explained by the first two moments of the distribution, i.e., the mean and the covariance.
  - Second, that the information is in the variance.
- PCA dimensionality reduction:
  - The optimal* approximation of a random vector $x \in \mathbb{R}^N$ by a linear combination of $M$ ($M < N$) independent vectors is obtained by projecting $x$ onto the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$ of the covariance matrix $\Sigma_x$.
  - (*Optimality is defined as the minimum of the sum-squared magnitude of the approximation error.)
Let's do the derivation

- The objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible.
- Let $x$ be an $N$-dimensional random vector, represented as a linear combination of orthonormal basis vectors $[\varphi_1\ \varphi_2\ \ldots\ \varphi_N]$:

$$x = \sum_{i=1}^{N} y_i \varphi_i \quad \text{where } \varphi_i \perp \varphi_j \text{ for } i \neq j$$

- where $y_i$, the weighting coefficient for basis vector $\varphi_i$, is formed by taking the inner product of $x$ with $\varphi_i$:

$$y_i = x^T \varphi_i$$

- Example in 2-D with the standard basis $\varphi_1 = [1\ 0]^T$, $\varphi_2 = [0\ 1]^T$ and $x = [1\ 1]^T$:

$$y_1 = x^T\varphi_1 = 1,\quad y_2 = x^T\varphi_2 = 1,\quad x = y_1\begin{pmatrix}1\\0\end{pmatrix} + y_2\begin{pmatrix}0\\1\end{pmatrix}$$
Let's do the derivation

- Suppose we choose to represent $x$ with only $M$ ($M < N$) of the basis vectors. We can do this by replacing the components $[y_{M+1}, \ldots, y_N]^T$ with some pre-selected constants $b_i$:

$$\hat{x}(M) = \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i$$

- The representation error is then:

$$\Delta x(M) = x - \hat{x}(M)
= \sum_{i=1}^{N} y_i \varphi_i - \left( \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i \right)
= \sum_{i=M+1}^{N} (y_i - b_i)\,\varphi_i$$
Let's do the derivation

- We can measure this representation error by the mean-squared magnitude of $\Delta x$ (in practice, estimated by averaging over the $S$ available samples). Our goal is to find the basis vectors $\varphi_i$ and constants $b_i$ that minimize this mean-square error:

$$E\big[\|\Delta x(M)\|^2\big]
= E\left[ \sum_{i=M+1}^{N} \sum_{j=M+1}^{N} (y_i - b_i)(y_j - b_j)\, \varphi_i^T \varphi_j \right]
= \sum_{i=M+1}^{N} E\big[(y_i - b_i)^2\big]$$

- where the cross terms vanish because the basis is orthonormal: $\varphi_i^T \varphi_j = 0$ for $i \neq j$ and $\varphi_i^T \varphi_i = 1$.
Let's do the derivation

- Find $b_i$: the optimal values can be found by computing the partial derivative of the objective function and equating it to zero:

$$\frac{\partial}{\partial b_i} E\big[(y_i - b_i)^2\big] = -2\big(E[y_i] - b_i\big) = 0 \;\Rightarrow\; b_i = E[y_i]$$

- (using the fact that expectation is a linear operator: $E[A - B] = E[A] - E[B]$)
- Intuitive: replace the discarded dimensions $y_i$ by their expected value (the mean).
- Example, reducing 2-D to 1-D along $\varphi_1 = [1\ 0]^T$: for the three points $x = [1\ 1]^T$, $x = [2\ 0.5]^T$, $x = [2.5\ 1]^T$, we keep $y_1 = 1, 2, 2.5$ and replace the discarded coordinate by its mean, $b_2 = (1 + 1 + 0.5)/3 = 0.833$.
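A tiny numerical check of this 2-D-to-1-D example (a sketch, not from the slides):

    import numpy as np

    # The three points from the slide's illustration
    X = np.array([[1.0, 1.0], [2.0, 0.5], [2.5, 1.0]])

    phi1 = np.array([1.0, 0.0])    # kept basis vector
    phi2 = np.array([0.0, 1.0])    # discarded basis vector

    y1 = X @ phi1                  # kept coefficients: [1, 2, 2.5]
    b2 = (X @ phi2).mean()         # optimal constant: 0.833...

    # Reconstructions replace each discarded y2 by its mean b2
    X_hat = np.outer(y1, phi1) + b2 * phi2
    print(b2, X_hat)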
Let's do the derivation

- The mean-squared error can now be written as:

$$\varepsilon^2(M) = \sum_{i=M+1}^{N} E\big[(y_i - b_i)^2\big] = \sum_{i=M+1}^{N} E\big[(y_i - E[y_i])^2\big]$$

- Substituting $y_i = x^T \varphi_i$:

$$\varepsilon^2(M)
= \sum_{i=M+1}^{N} E\big[(x^T\varphi_i - E[x^T\varphi_i])^2\big]
= \sum_{i=M+1}^{N} \varphi_i^T\, \underbrace{E\big[(x - E[x])(x - E[x])^T\big]}_{\text{data covariance}}\, \varphi_i
= \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i$$

- where $\Sigma_x$ is the covariance matrix of $x$.
Let's do the derivation

- We seek the solution that minimizes the MSE subject to the normality constraint $\varphi_i^T\varphi_i = 1$, which we incorporate using a set of Lagrange multipliers $\lambda_i$:

$$\varepsilon^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \big(1 - \varphi_i^T \varphi_i\big)$$

- Computing partial derivatives with respect to $\varphi_i$:

$$\frac{\partial}{\partial \varphi_i}\left[ \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \big(1 - \varphi_i^T \varphi_i\big) \right] = 2\big(\Sigma_x \varphi_i - \lambda_i \varphi_i\big) = 0 \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$$

- (Note: $\frac{d}{dx}\big(x^T A x\big) = (A + A^T)x = 2Ax$ for symmetric $A$.) This is an eigenvalue problem.
- So the $\varphi_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of the covariance matrix $\Sigma_x$.
Let's do the derivation

- We can now express the sum-squared error as:

$$\varepsilon^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \lambda_i \varphi_i = \sum_{i=M+1}^{N} \lambda_i$$

- To minimize this measure, the discarded $\lambda_i$ must be the smallest eigenvalues.
  - Therefore, to represent $x$ with minimum sum-squared error, we choose the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$.
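This result, that the mean-squared reconstruction error equals the sum of the discarded eigenvalues, can be verified empirically (a minimal sketch on synthetic data; the covariance matrix below is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0, 0, 0],
                                [[4, 1, 0], [1, 2, 0], [0, 0, 0.5]],
                                size=100000)

    Sigma = np.cov(X.T, bias=True)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order

    M = 2                                      # keep the 2 leading components
    kept = eigvecs[:, -M:]

    Xc = X - X.mean(axis=0)
    X_hat = Xc @ kept @ kept.T                 # project, then reconstruct
    mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

    print(mse, eigvals[:-M].sum())             # these two should closely agree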
Scree Plot

- A scree plot shows the eigenvalues in decreasing order; it is used to judge how many principal components to retain.
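A sketch of how one might draw a scree plot with Matplotlib (the eigenvalue spectrum below is a placeholder, not from the slides):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical eigenvalue spectrum, sorted in decreasing order
    eigvals = np.sort(np.array([9.34, 0.41, 0.20, 0.10, 0.05]))[::-1]

    plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
    plt.xlabel("Principal component")
    plt.ylabel("Eigenvalue (variance explained)")
    plt.title("Scree plot")
    plt.show()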
Applications of PCA: Spike Sorting

- Extracellular recording (single electrode)

[Figure: extracellular recording]
Applications of PCA: Spike Sorting

- Multi-unit recording

[Figure: different colors (representing different neurons) have different patterns of extracellular activity; from Pouzat et al. 2002]
Applications of PCA: Spike Sorting

- Multi-unit recording, projected onto the first two principal components (PC1 vs. PC2)

[Figure: different colors (representing different neurons) form different clusters in PC space]

- Clustering after PCA allows identification of events from a single source (neuron).
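A hedged sketch of this pipeline: PCA on spike waveform snippets followed by clustering. The waveform array is a random placeholder and the choice of k-means (via scikit-learn) is an assumption for illustration; the slides do not specify a clustering algorithm:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical data: one row per detected spike, one column per time sample
    rng = np.random.default_rng(2)
    waveforms = rng.normal(size=(1000, 48))    # placeholder for real snippets

    # PCA: eigenvectors of the covariance of the mean-centered waveforms
    Xc = waveforms - waveforms.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T, bias=True))
    scores = Xc @ eigvecs[:, -2:]              # project onto PC1 and PC2

    # Cluster in the low-dimensional PC space; each cluster ~ one putative neuron
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores)
    print(np.bincount(labels))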
Applications of PCA: Neural Coding

- Information is represented by the spiking patterns of neural ensembles.
- Visualization of high-dimensional neural activity.
Identity vs. Intensity Coding

- PN ensemble activity traces concentration-specific trajectories on odor-specific manifolds (Stopfer et al., Neuron, 2003).

[Figures: odor trajectories (3 odors at multiple concentrations); odor trajectories (hexanol at five concentrations); octanol, geraniol, and hexanol manifolds. Stopfer et al., Neuron, 2003]

- LLE: Locally Linear Embedding (S. Roweis and L. Saul, Science, 2000)
Another Neural Coding Example

[Figure: Kemere et al., 2008; Byron Yu, CMU, teaching notes]
Another Neural Coding Example Direction 3 Direction 2 Direction 1 Neural representation after dimensionality reduction Santhanam et al., 2009 Byron Yu, CMU, Teaching Notes 58