Lecture 5: Subspace Transformations - Eigendecompositions, Kernel PCA and CCA

Lecture 5: Subspace Transformations - Eigendecompositions, Kernel PCA and CCA

Pavel Laskov and Blaine Nelson
Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany
Advanced Topics in Machine Learning, May 22, 2012

Recall: Projections

The projection of a point $x$ onto a direction $w$ is computed as
$$\mathrm{proj}_w(x) = \frac{w^\top x}{\|w\|^2}\, w.$$
Directions in an RKHS are expressed as linear combinations of points: $w = \sum_{i=1}^N \alpha_i \phi(x_i)$.
The norm of the projection onto $w$ can thus be expressed as
$$\|\mathrm{proj}_w(\phi(x))\| = \frac{w^\top \phi(x)}{\|w\|} = \frac{\sum_{i=1}^N \alpha_i \kappa(x_i,x)}{\sqrt{\sum_{i,j=1}^N \alpha_i \alpha_j \kappa(x_i,x_j)}} = \sum_{i=1}^N \beta_i \kappa(x_i,x).$$
Thus, the size of the projection onto $w$ can be expressed as a linear combination of kernel evaluations with $x$.
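
A minimal sketch of this formula, assuming an RBF kernel and arbitrary toy coefficients $\alpha$ (both are illustrative choices, not part of the slide): the projection size only requires kernel evaluations.

```python
# Compute the size of the projection of phi(x) onto w = sum_i alpha_i phi(x_i)
# using only kernel evaluations. The RBF kernel and toy data are placeholders.
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def proj_size(x, X_train, alpha, gamma=1.0):
    k_x = np.array([rbf(xi, x, gamma) for xi in X_train])                     # kappa(x_i, x)
    K = np.array([[rbf(xi, xj, gamma) for xj in X_train] for xi in X_train])  # kappa(x_i, x_j)
    return alpha @ k_x / np.sqrt(alpha @ K @ alpha)                           # w^T phi(x) / ||w||

X_train = np.random.default_rng(9).standard_normal((10, 3))
alpha = np.random.default_rng(10).standard_normal(10)
print(proj_size(np.zeros(3), X_train, alpha))
```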

Recall: Fisher/Linear Discriminant Analysis (LDA)

In LDA, we chose a projection direction $w$ to maximize the cost function
$$J(w) = \frac{(\mu_w^+ - \mu_w^-)^2}{(\sigma_w^+)^2 + (\sigma_w^-)^2} = \frac{w^\top S_B w}{w^\top (S_W^+ + S_W^-) w},$$
where $\mu_w^+$ and $\mu_w^-$ are the averages of the two sets, $\sigma_w^+$ and $\sigma_w^-$ are their standard deviations, $S_B$ is the between-class scatter matrix, and $S_W^+$ and $S_W^-$ are the within-class scatter matrices.
The optimal solution $w$ is given by the first eigenvector of the matrix $(S_W^+ + S_W^-)^{-1} S_B$.

Recall: Kernel LDA

When the projection direction is in feature space, $w_\alpha = \sum_{i=1}^N \alpha_i \phi(x_i)$.
From this, the LDA objective can be expressed as
$$\max_\alpha J(\alpha) = \frac{\alpha^\top M \alpha}{\alpha^\top N \alpha},$$
where
$$M = (K^+ - K^-)\,\mathbf{1}_N \mathbf{1}_N^\top\,(K^+ - K^-), \qquad N = K^+\Big(I_{N^+} - \tfrac{1}{N^+}\mathbf{1}_{N^+}\mathbf{1}_{N^+}^\top\Big)K^{+\top} + K^-\Big(I_{N^-} - \tfrac{1}{N^-}\mathbf{1}_{N^-}\mathbf{1}_{N^-}^\top\Big)K^{-\top}.$$
Solutions $\alpha$ to the above generalized eigenvalue problem (as discussed later) allow us to project data onto this discriminant direction as
$$\mathrm{proj}_w(x) = \sum_{i=1}^N \alpha_i\, \kappa(x_i, x).$$

General Subspace Learning & Projections

Objective: find a subspace that captures an important aspect of the training data; we find $K$ axes that span this subspace.
General Problem: we will solve problems of the form
$$\max_{g(w)=1} f(w)$$
for a projection direction $w$; iteratively solving these problems will yield a subspace defined by $\{w_k\}_{k=1}^K$.
General Approach: find a center $\mu$ and a set of $K$ orthonormal directions $\{w_k\}_{k=1}^K$ used to project data into the subspace:
$$x \mapsto \big(w_k^\top (x - \mu)\big)_{k=1}^K.$$
This is a $K$-dimensional representation of the data (the coordinates in the space spanned by $\{w_k\}_{k=1}^K$), regardless of the original space's dimensionality. This projection will be centered at 0 (in feature space).

Subspace Learning

We want to find a subspace that captures important aspects of our data.

Overview

LDA found one direction for discriminating between two classes. In this lecture, we will see three subspace projection objectives/techniques:
- Find directions that maximize variance in X (PCA)
- Find directions that maximize covariance between X & Y (MCA)
- Find directions that maximize correlation between X & Y (CCA)
These techniques extract underlying structure from the data, allowing us to capture the fundamental structure of the data and to represent the data in low dimensions.
Each of these techniques can be kernelized to operate in a feature space, yielding kernelized projections onto $w$:
$$\mathrm{proj}_w(\phi(x)) = w^\top \phi(x) = \sum_{i=1}^N \alpha_i\, \kappa(x_i, x), \qquad (1)$$
where $\alpha$ is the vector of dual values defining $w$.

Part I: Principal Component Analysis

Motivation: Directions of Variance

We want to find a direction $w$ that maximizes the data's variance. Consider a random variable $x \sim P_X$ (assume zero mean). The variance of its projection onto a (normalized) direction $w$ is
$$\mathbb{E}_{x \sim P_X}\big[\mathrm{proj}_w(x)^2\big] = \mathbb{E}\big[w^\top x x^\top w\big] = w^\top \underbrace{\mathbb{E}\big[x x^\top\big]}_{C_{xx}} w = w^\top C_{xx} w.$$
In the input space X, the empirical covariance matrix (of centered data) is the $D \times D$ matrix $\hat{C}_{xx} = \frac{1}{N} X^\top X$.
How can we find directions that maximize $w^\top C_{xx} w$? How can we kernelize it?

Recall: Eigenvalues & Eigenvectors

Given an $N \times N$ matrix $A$, an eigenvector of $A$ is a non-trivial vector $v$ that satisfies $Av = \lambda v$; the corresponding value $\lambda$ is an eigenvalue.
Eigenvalue/eigenvector pairs satisfy Rayleigh quotients:
$$\lambda = \frac{v^\top A v}{v^\top v}, \qquad \lambda_1 = \max_{\|x\|=1} \frac{x^\top A x}{x^\top x}.$$
The eigenvectors and eigenvalues form an orthonormal matrix $V = [v_1\; v_2\; \cdots\; v_N]$ and a diagonal matrix $\Lambda = \mathrm{diag}\big(\lambda_1(A), \lambda_2(A), \ldots, \lambda_N(A)\big)$, which form the eigendecomposition of $A$: $A = V \Lambda V^\top$.
Deflation: for any eigenvalue/eigenvector pair $(\lambda, v)$ of $A$, the transform $\tilde{A} \leftarrow A - \lambda v v^\top$ deflates the matrix; i.e., $v$ is still an eigenvector of $\tilde{A}$ but with eigenvalue 0.
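
A minimal NumPy sketch of these facts on an arbitrary symmetric test matrix (the matrix itself is just an example): the Rayleigh quotient of an eigenvector recovers its eigenvalue, and deflation zeroes out the corresponding eigenvalue.

```python
# Eigendecomposition, Rayleigh quotient, and deflation of a symmetric matrix.
# np.linalg.eigh returns eigenvalues in ascending order, so the leading pair
# is read from the end.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T                           # symmetric test matrix

lam, V = np.linalg.eigh(A)            # A = V diag(lam) V^T
v1, lam1 = V[:, -1], lam[-1]          # leading eigenpair

# Rayleigh quotient of an eigenvector equals its eigenvalue
print(np.isclose(v1 @ A @ v1 / (v1 @ v1), lam1))

# Deflation: A_tilde = A - lam1 v1 v1^T keeps v1 as an eigenvector with eigenvalue 0
A_tilde = A - lam1 * np.outer(v1, v1)
print(np.isclose(A_tilde @ v1, 0).all())
```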

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an algorithm for finding the principal axes of a dataset. PCA finds the subspace spanned by $\{u_i\}$ that maximizes the data's variance:
$$u_1 = \arg\max_{\|w\|=1} w^\top C_{xx} w, \qquad C_{xx} = \frac{1}{N} X^\top X.$$
This is achieved by computing the eigenvectors of $C_{xx}$:
1. Compute the data's mean: $\mu = \frac{1}{N}\sum_{i=1}^N x_i = \frac{1}{N} X^\top \mathbf{1}_N$
2. Compute the data's covariance: $C_{xx} = \frac{1}{N}\sum_{i=1}^N (x_i - \mu)(x_i - \mu)^\top$
3. Find its principal axes: $[U, \Lambda] = \mathrm{eig}(C_{xx})$
4. Project the data $\{x_i\}$ onto the first $K$ eigenvectors: $x_i \mapsto U_{1:K}^\top (x_i - \mu)$
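
A minimal NumPy sketch of these four steps; the random data matrix is only a placeholder for an $N \times D$ dataset.

```python
# PCA via the eigendecomposition of the empirical covariance matrix.
import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)                        # step 1: mean
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]                 # step 2: covariance C_xx
    lam, U = np.linalg.eigh(C)                 # step 3: principal axes
    order = np.argsort(lam)[::-1]              # sort by decreasing variance
    lam, U = lam[order], U[:, order]
    Z = Xc @ U[:, :K]                          # step 4: project onto top K axes
    return Z, U[:, :K], lam[:K], mu

X = np.random.default_rng(1).standard_normal((200, 10))
Z, U, lam, mu = pca(X, K=2)
print(Z.shape, lam)                            # (200, 2) and the top-2 variances
```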

Properties of PCA

- Directions found by PCA are orthonormal: $u_i^\top u_j = \delta_{i,j}$.
- When projected onto the space spanned by $\{u_i\}$, the resulting data has a diagonal covariance matrix.
- The eigenvalue $\lambda_i$ is the amount of variance captured by the direction $u_i$; the variance captured by the first $K$ directions is $\sum_{i=1}^K \lambda_i(C_{xx})$.
- Using all directions, we can completely reconstruct the data in an alternative basis.
- Directions with low eigenvalues $\lambda_i \ll \lambda_1$ correspond to irrelevant aspects of the data; often we use the top $K$ directions to re-represent the data.

Applications of PCA

- Denoising/Compression: PCA removes the $(D-K)$-dimensional subspace with the least information; the PCA transform thus retains the most salient information about the data.
- Correction: reconstruction of data that has been damaged or has missing elements.
- Visualization: the PCA transform produces a low-dimensional projection of the data, which is convenient for visualizing high-dimensional datasets.
- Document Analysis: PCA can be used to find common themes in a set of documents.

Application: Eigenfaces for Face Recognition [1]

Part II: Kernel PCA

Kernelizing PCA

PCA works in the primal space, but not all data structure is well captured by these linear projections. How can we kernelize PCA?

Singular Value Decomposition I

Suppose $X$ is any $N \times D$ matrix. The eigendecompositions of the PSD matrices $C_{xx} = X^\top X$ and $K = X X^\top$ are
$$C_{xx} = U \Lambda_D U^\top, \qquad K = V \Lambda_N V^\top,$$
where $U$ and $V$ are orthogonal and $\Lambda_D$, $\Lambda_N$ contain the eigenvalues.
Consider any eigenpair $(\lambda, v)$ of $K$; then $X^\top v$ is an eigenvector of $C_{xx}$:
$$C_{xx} X^\top v = X^\top X X^\top v = X^\top K v = \lambda X^\top v,$$
and $\|X^\top v\| = \sqrt{\lambda}$. Thus there is an eigenvector of $C_{xx}$ such that $u = \frac{1}{\sqrt{\lambda}} X^\top v$. In fact, we have the following correspondences:
$$u = \lambda^{-1/2} X^\top v, \qquad v = \lambda^{-1/2} X u.$$

Singular Value Decomposition II

Further, let $t = \mathrm{rank}(X) \le \min(D, N)$. It can be shown that $\mathrm{rank}(C_{xx}) = \mathrm{rank}(K) = t$.
The singular value decomposition (SVD) of a non-square $X$ is
$$X = V \Sigma U^\top,$$
where $U$ is $D \times D$ and orthogonal, $V$ is $N \times N$ and orthogonal, and $\Sigma$ is $N \times D$ with diagonal given by the singular values $\sigma_i = \sqrt{\lambda_i}$.
The SVD is an analog of the eigendecomposition for non-square matrices; $X$ is non-singular iff all its singular values are non-zero. It yields a spectral decomposition:
$$X = \sum_i \sigma_i\, v_i u_i^\top.$$
The matrix-vector product $Xw$ can be viewed as first projecting $w$ onto the eigen-space $\{u_i\}$ of $X$, deforming according to the singular values $\sigma_i$, and reprojecting into $N$-space using $\{v_i\}$.
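
A small NumPy check of this duality on a random example: eigenvectors of $K = XX^\top$ map to eigenvectors of $C_{xx} = X^\top X$ via $u = \lambda^{-1/2} X^\top v$, and the SVD recovers $X$ as $\sum_i \sigma_i v_i u_i^\top$.

```python
# Duality between the eigendecompositions of X^T X and X X^T, and the SVD's
# spectral decomposition. The data matrix X is a random placeholder.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))                # N x D

C = X.T @ X
K = X @ X.T
lamK, V = np.linalg.eigh(K)
lam1, v1 = lamK[-1], V[:, -1]                  # leading eigenpair of K

u1 = X.T @ v1 / np.sqrt(lam1)                  # corresponding eigenvector of C
print(np.allclose(C @ u1, lam1 * u1))          # True: same eigenvalue

Vs, sig, Ut = np.linalg.svd(X)                 # X = Vs diag(sig) Ut (NumPy's convention)
X_rebuilt = sum(s * np.outer(Vs[:, i], Ut[i]) for i, s in enumerate(sig))
print(np.allclose(X, X_rebuilt))               # spectral decomposition recovers X
```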

Covariance & Kernel Matrix Duality

The SVD of $X$ showed a duality between the eigenvectors of $C_{xx}$ and $K$ that allows us to kernelize PCA. If $u_j$ is the $j$-th eigenvector of $C_{xx}$, then
$$u_j = \lambda_j^{-1/2} X^\top v_j = \lambda_j^{-1/2} \sum_{i=1}^N v_{j,i}\, X_{i,\cdot}^\top,$$
i.e., a linear combination of the data points. Replacing $X_{i,\cdot}$ with $\phi(x_i)$, the eigenvector $u_j$ in feature space is
$$u_j = \lambda_j^{-1/2} \sum_{i=1}^N v_{j,i}\, \phi(x_i) = \sum_{i=1}^N \alpha_{j,i}\, \phi(x_i), \qquad \alpha_j = \lambda_j^{-1/2} v_j,$$
with $\alpha_j$ acting as a dual vector defined by the eigenvector $v_j$ of the kernel matrix $K$.

Projections into Feature Space

Suppose $u_j = \sum_{i=1}^N \alpha_{j,i}\, \phi(x_i)$ is a normalized direction in the feature space. For any data point $x$, the projection of $\phi(x)$ onto $u_j$ is
$$\mathrm{proj}_{u_j}(\phi(x)) = u_j^\top \phi(x) = \sum_{i=1}^N \alpha_{j,i}\, \kappa(x_i, x),$$
which represents the value of $\phi(x)$ along the $j$-th axis. Thus, if we have a set of $K$ orthonormal basis vectors $\{u_j\}_{j=1}^K$, projecting $\phi(x)$ onto each produces a new $K$-vector
$$\tilde{x} = \big(\mathrm{proj}_{u_1}(\phi(x)),\, \mathrm{proj}_{u_2}(\phi(x)),\, \ldots,\, \mathrm{proj}_{u_K}(\phi(x))\big)^\top,$$
the representation of $\phi(x)$ in that basis. Thus, we can perform the PCA transform in feature space.

Kernel PCA

Performing PCA directly in feature space is not feasible since the covariance matrix is $D \times D$ (with $D$ possibly very large or infinite). However, the duality between $C_{xx}$ and $K$ allows us to perform PCA indirectly. Projecting the data onto the first $K$ directions yields a $K$-dimensional representation. The algorithm is thus:
1. Center the kernel matrix: $\hat{K} = K - \frac{1}{N}\mathbf{1}\mathbf{1}^\top K - \frac{1}{N} K \mathbf{1}\mathbf{1}^\top + \frac{1}{N^2}\mathbf{1}\mathbf{1}^\top K \mathbf{1}\mathbf{1}^\top$
2. Find its eigenvectors: $[V, \Lambda] = \mathrm{eig}(\hat{K})$
3. Find the dual vectors: $\alpha_j = \lambda_j^{-1/2} v_j$
4. Project the data onto the subspace: $x \mapsto \big(\sum_{i=1}^N \alpha_{j,i}\, \kappa(x_i, x)\big)_{j=1}^K$
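
A minimal sketch of these four steps, assuming an RBF kernel; the kernel width `gamma` and the toy data are illustrative choices, and only the projections of the training points are computed here.

```python
# Kernel PCA: center the kernel matrix, take its leading eigenvectors, rescale
# them into dual vectors, and project via kernel evaluations.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_pca(X, K_dims, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.ones((N, N)) / N
    K_hat = K - one @ K - K @ one + one @ K @ one      # step 1: center K
    lam, V = np.linalg.eigh(K_hat)                     # step 2: eigenvectors
    lam, V = lam[::-1][:K_dims], V[:, ::-1][:, :K_dims]
    alpha = V / np.sqrt(lam)                           # step 3: dual vectors
    return K_hat @ alpha                               # step 4: projections of training data

X = np.random.default_rng(3).standard_normal((100, 2))
Z = kernel_pca(X, K_dims=2, gamma=0.5)
print(Z.shape)                                         # (100, 2)
```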

Kernel PCA - Application

[Figure: the dataset in the original space (axes $x_1$, $x_2$).]

Kernel PCA - Application

Standard PCA fails to capture the data's two-ring structure: the rings are not separated in the first two components.

[Figure: left panel, "Original space" (axes $x_1$, $x_2$); right panel, "Projection by PCA" (axes: 1st principal component, 2nd component).]

Kernel PCA - Application

Kernel PCA (with an RBF kernel) does capture the data's two-ring structure, and the resulting projections separate the two rings.

[Figure: left panel, "Original space" (axes $x_1$, $x_2$); right panel, "Projection by KPCA" (axes: 1st principal component in the space induced by $\phi$, 2nd component).]

Part III: Maximum Covariance Analysis

Motivation: Directions that Capture Covariance

Suppose we have a pair of related variables: an input variable $x \sim P_X$ and an output variable $y \sim P_Y$ (paired data). We would like to find directions of high covariance, $w_x \in X$ and $w_y \in Y$, such that changes along direction $w_x$ yield changes along $w_y$.
Assuming mean-centered variables, the covariance of the projections onto (normalized) $w_x$ and $w_y$ is again
$$\mathbb{E}_{x,y}\big[w_x^\top x\; w_y^\top y\big] = w_x^\top \underbrace{\mathbb{E}\big[x y^\top\big]}_{C_{xy}} w_y = w_x^\top C_{xy} w_y.$$
The empirical covariance matrix (of centered data) is the $D_X \times D_Y$ matrix $\hat{C}_{xy} = \frac{1}{N} X^\top Y$.
How can we find directions that maximize $w_x^\top C_{xy} w_y$ for a non-square, non-symmetric matrix? How can we kernelize it in the space X?

Maximum Covariance Analysis (MCA)

PCA captures structure in the data X, but what if the data is paired, $(x, y)$? We would like to find correlated directions in X and Y. Suppose we project $x$ onto a direction $w_x$ and $y$ onto a direction $w_y$; the covariance of these random variables is
$$\mathbb{E}\big[w_x^\top x\; w_y^\top y\big] = w_x^\top \mathbb{E}\big[x y^\top\big] w_y = w_x^\top C_{xy} w_y.$$
The problem we want to solve can again be cast as
$$\max_{\|w_x\|=1,\, \|w_y\|=1} \frac{1}{N} w_x^\top X^\top Y w_y,$$
that is, finding a pair of directions that maximizes the covariance. The solution is simply the first pair of singular vectors $w_x = u_1$ and $w_y = v_1$ of the SVD $C_{xy} = U \Sigma V^\top$. Naturally, the singular vector pairs $(u_2, v_2), (u_3, v_3), \ldots$ capture additional covariance.
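
A minimal NumPy sketch of MCA: the leading covariance directions are the first singular vector pair of the empirical cross-covariance $\hat{C}_{xy} = \frac{1}{N} X^\top Y$. The paired data here is a random placeholder.

```python
# MCA via the SVD of the empirical cross-covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
N = 300
X = rng.standard_normal((N, 5)); X -= X.mean(0)
Y = X[:, :3] @ rng.standard_normal((3, 4)) + 0.1 * rng.standard_normal((N, 4))
Y -= Y.mean(0)

C_xy = X.T @ Y / N
U, S, Vt = np.linalg.svd(C_xy)
w_x, w_y = U[:, 0], Vt[0]                      # first pair of directions

# the covariance captured equals the leading singular value
print(np.isclose(w_x @ C_xy @ w_y, S[0]))
```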

Kernelized MCA

As with PCA, MCA can also be kernelized by mapping $x \mapsto \phi(x)$. Consider that eigen-analysis of $C_{xy} C_{xy}^\top$ gives us $U$, and eigen-analysis of $C_{xy}^\top C_{xy}$ gives us $V$, of the SVD of $C_{xy}$. In fact,
$$C_{xy}^\top C_{xy} = \frac{1}{N^2} Y^\top K_{xx} Y,$$
which has dimension $D_Y \times D_Y$, and eigen-analysis of this matrix yields the (kernelized) directions $v_k$. Then, in decomposing $C_{xy}^\top C_{xy}$, we again have a relationship between $u_k$ and $v_k$: $u_k = \frac{1}{\sigma_k} C_{xy} v_k$, allowing us to project onto $u_k$ when X is kernelized:
$$\mathrm{proj}_{u_k}(\phi(x)) = \sum_{i=1}^N \alpha_{k,i}\, \kappa(x_i, x), \qquad \alpha_k = \frac{1}{N \sigma_k} Y v_k.$$
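
A sketch of these relations, assuming a linear kernel on X (so that the correspondence can be checked against ordinary MCA) and toy paired data; the names and data are illustrative.

```python
# Kernelized MCA: eigen-analysis of (1/N^2) Y^T K_xx Y gives v_k and sigma_k^2,
# and alpha_k = Y v_k / (N sigma_k) yields kernelized projections onto u_k.
import numpy as np

rng = np.random.default_rng(8)
N = 200
X = rng.standard_normal((N, 6)); X -= X.mean(0)
Y = X[:, :2] @ rng.standard_normal((2, 3)) + 0.1 * rng.standard_normal((N, 3))
Y -= Y.mean(0)

K_xx = X @ X.T                                  # kernel matrix (linear kernel)
M = Y.T @ K_xx @ Y / N**2                       # = C_xy^T C_xy for this kernel
sig2, V = np.linalg.eigh(M)
sigma, v1 = np.sqrt(sig2[-1]), V[:, -1]         # leading singular value / direction

alpha_1 = Y @ v1 / (N * sigma)                  # dual vector for u_1
proj_train = K_xx @ alpha_1                     # projections of training points onto u_1
print(proj_train.shape)                         # (200,)
```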

Part IV: Generalized Eigenvalues & CCA

Motivation: Directions of Correlation

Suppose that instead of input and output variables, we have two variables that are different representations of the same data $x$: $x_a = \psi_a(x)$ and $x_b = \psi_b(x)$. We would like to find directions of high correlation in these spaces, $w_a \in X_a$ and $w_b \in X_b$, such that changes along direction $w_a$ yield changes along $w_b$.
Assuming mean-centered variables, the correlation of the projections onto (normalized) $w_a$ and $w_b$ is
$$\rho_{ab} = \frac{\mathbb{E}\big[w_a^\top x_a\; w_b^\top x_b\big]}{\sqrt{\mathbb{E}\big[w_a^\top x_a\, x_a^\top w_a\big]\,\mathbb{E}\big[w_b^\top x_b\, x_b^\top w_b\big]}} = \frac{w_a^\top C_{ab} w_b}{\sqrt{w_a^\top C_{aa} w_a}\,\sqrt{w_b^\top C_{bb} w_b}},$$
where $C_{ab}$, $C_{aa}$ and $C_{bb}$ are the covariance matrices of $x_a$ and $x_b$ (with the usual empirical versions).
How can we find directions that maximize $\rho_{ab}$? How can we kernelize it in the spaces $X_a$ and $X_b$?

Applications of CCA

- Climate Prediction: researchers have used CCA techniques to find correlations between sea level pressure and sea surface temperature.
- CCA is used with bilingual corpora (the same text in two languages), aiding in translation tasks.

Canonical Correlation Analysis (CCA) I

Our objective is to find directions of maximal correlation:
$$\max_{w_a, w_b} \rho_{ab}(w_a, w_b) = \frac{w_a^\top C_{ab} w_b}{\sqrt{w_a^\top C_{aa} w_a}\,\sqrt{w_b^\top C_{bb} w_b}}, \qquad (2)$$
a problem we call canonical correlation analysis (CCA). As with previous problems, this can be expressed as
$$\max_{w_a, w_b} w_a^\top C_{ab} w_b \qquad \text{such that } w_a^\top C_{aa} w_a = 1 \text{ and } w_b^\top C_{bb} w_b = 1. \qquad (3)$$

Canonical Correlation Analysis (CCA) II

The Lagrangian function for this optimization is
$$L(w_a, w_b, \lambda_a, \lambda_b) = w_a^\top C_{ab} w_b - \frac{\lambda_a}{2}\big(w_a^\top C_{aa} w_a - 1\big) - \frac{\lambda_b}{2}\big(w_b^\top C_{bb} w_b - 1\big).$$
Differentiating it w.r.t. $w_a$ and $w_b$ and setting the derivatives equal to 0 gives
$$C_{ab} w_b - \lambda_a C_{aa} w_a = 0, \qquad C_{ba} w_a - \lambda_b C_{bb} w_b = 0,$$
so that $\lambda_a\, w_a^\top C_{aa} w_a = \lambda_b\, w_b^\top C_{bb} w_b$, which implies that $\lambda_a = \lambda_b = \lambda$. These conditions on $w_a$ and $w_b$ can be written in matrix form as
$$\begin{bmatrix} 0 & C_{ab} \\ C_{ba} & 0 \end{bmatrix}\begin{bmatrix} w_a \\ w_b \end{bmatrix} = \lambda \begin{bmatrix} C_{aa} & 0 \\ 0 & C_{bb} \end{bmatrix}\begin{bmatrix} w_a \\ w_b \end{bmatrix}, \qquad A w = \lambda B w, \qquad (4)$$
a generalized eigenvalue problem for the primal problem.

Generalized Eigenvectors I

Suppose $A$ and $B$ are symmetric and $B \succ 0$. The generalized eigenvalue problem (GEP) is to find $(\lambda, w)$ such that
$$A w = \lambda B w, \qquad (5)$$
which is equivalent to
$$\max_w \frac{w^\top A w}{w^\top B w} \qquad \text{and} \qquad \max_{w^\top B w = 1} w^\top A w.$$
Note that ordinary eigenvalue problems are the special case $B = I$. Since $B \succ 0$, any GEP can be converted to an eigenvalue problem by inverting $B$: $B^{-1} A w = \lambda w$.

Generalized Eigenvectors II

However, to ensure symmetry, we can instead use $B \succ 0$ to decompose $B = B^{1/2} B^{1/2}$, where $B^{1/2}$ is a symmetric real matrix. Taking $w = B^{-1/2} v$ for some $v$, we obtain the (symmetric) eigenvalue problem
$$B^{-1/2} A B^{-1/2} v = \lambda v,$$
i.e., an eigenvalue problem for $C = B^{-1/2} A B^{-1/2}$, providing solutions to Eq. (5): $w_i = B^{-1/2} v_i$.
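
A sketch of this symmetric reduction on small random matrices ($B$ is made positive definite by construction): form $C = B^{-1/2} A B^{-1/2}$, solve the ordinary eigenproblem, and map the eigenvectors back via $w = B^{-1/2} v$.

```python
# Solve A w = lambda B w through the symmetric matrix B^{-1/2} A B^{-1/2}.
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n)); A = A + A.T
B = rng.standard_normal((n, n)); B = B @ B.T + n * np.eye(n)   # B positive definite

# B^{-1/2} via the eigendecomposition of B
d, Q = np.linalg.eigh(B)
B_inv_half = Q @ np.diag(1.0 / np.sqrt(d)) @ Q.T

C = B_inv_half @ A @ B_inv_half
lam, V = np.linalg.eigh(C)
W = B_inv_half @ V                                 # columns solve A w = lam B w

print(np.allclose(A @ W, B @ W @ np.diag(lam)))    # generalized eigenpairs
print(np.allclose(W.T @ B @ W, np.eye(n)))         # the w_i are B-conjugate
```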

Generalized Eigenvectors III

Proposition 1. Solutions to the GEP of Eq. (5) have the following properties: if the eigenvalues are distinct, then
$$w_i^\top B w_j = \delta_{i,j}, \qquad w_i^\top A w_j = \lambda_i \delta_{i,j};$$
that is, the vectors $w_i$ are orthonormal after applying the transformation $B^{1/2}$, i.e., they are conjugate with respect to $B$.

Generalized Eigenvectors IV

Theorem 2. If $(\lambda_i, w_i)$ are eigen-solutions to the GEP of Eq. (5), then $A$ can be decomposed as
$$A = \sum_{i=1}^N \lambda_i\, (B w_i)(B w_i)^\top.$$
This yields the generalized deflation of $A$:
$$\tilde{A} \leftarrow A - \lambda_i\, B w_i w_i^\top B,$$
while $B$ is unchanged.

Solving CCA as a GEP

As shown in Eq. (4), CCA is a GEP $Aw = \lambda Bw$ where
$$A = \begin{bmatrix} 0 & C_{ab} \\ C_{ba} & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} C_{aa} & 0 \\ 0 & C_{bb} \end{bmatrix}, \qquad w = \begin{bmatrix} w_a \\ w_b \end{bmatrix}.$$
Since this solves Eq. (2), the eigenvalues are correlations $\lambda \in [-1, +1]$. Further, the eigensolutions pair up: for each $\lambda_i > 0$ with eigenvector $\begin{bmatrix} w_a \\ w_b \end{bmatrix}$, there is a $\lambda_j = -\lambda_i$ with eigenvector $\begin{bmatrix} w_a \\ -w_b \end{bmatrix}$. Hence, we only need to consider the positive spectrum; larger eigenvalues correspond to the strongest correlations. Finally, the solutions are conjugate w.r.t. the matrix $B$, which reveals that for $i \neq j$
$$w_{a,j}^\top C_{aa} w_{a,i} = 0, \qquad w_{b,j}^\top C_{bb} w_{b,i} = 0.$$
However, the directions will not be orthogonal in the original input space.
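
A minimal linear CCA sketch built directly on this block GEP, using `scipy.linalg.eigh` for the symmetric-definite pair $(A, B)$. The small ridge `eps` keeps $B$ numerically invertible and is a convenience, not part of the derivation; the toy data shares a latent signal so the leading correlation should be close to 1.

```python
# Linear CCA as the generalized eigenvalue problem A w = lambda B w.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
N, Da, Db = 500, 4, 3
Z = rng.standard_normal((N, 2))                     # shared latent signal
Xa = Z @ rng.standard_normal((2, Da)) + 0.2 * rng.standard_normal((N, Da))
Xb = Z @ rng.standard_normal((2, Db)) + 0.2 * rng.standard_normal((N, Db))
Xa -= Xa.mean(0); Xb -= Xb.mean(0)

Caa, Cbb, Cab = Xa.T @ Xa / N, Xb.T @ Xb / N, Xa.T @ Xb / N
eps = 1e-8
A = np.block([[np.zeros((Da, Da)), Cab], [Cab.T, np.zeros((Db, Db))]])
B = np.block([[Caa + eps * np.eye(Da), np.zeros((Da, Db))],
              [np.zeros((Db, Da)), Cbb + eps * np.eye(Db)]])

lam, W = eigh(A, B)                                 # generalized eigenproblem
rho, w = lam[-1], W[:, -1]                          # largest correlation and its eigenvector
w_a, w_b = w[:Da], w[Da:]
print(round(rho, 3))                                # expected to be close to 1 here
```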

Dual Form of CCA I

Let's take the directions to be linear combinations of the data: $w_a = X_a^\top \alpha_a$ and $w_b = X_b^\top \alpha_b$. Substituting these directions into Eq. (3) gives
$$\max_{\alpha_a, \alpha_b} \alpha_a^\top K_a K_b \alpha_b \qquad \text{such that } \alpha_a^\top K_a^2 \alpha_a = 1 \text{ and } \alpha_b^\top K_b^2 \alpha_b = 1,$$
where $K_a = X_a X_a^\top$ and $K_b = X_b X_b^\top$.

Dual Form of CCA II

Differentiating the Lagrangian again yields the equations
$$K_a K_b \alpha_b - \lambda K_a^2 \alpha_a = 0, \qquad K_b K_a \alpha_a - \lambda K_b^2 \alpha_b = 0.$$
However, these equations reveal a problem: when the dimension of the feature space is large compared to the number of data points ($D_a \gg N$), the solutions will overfit the data. For the Gaussian kernel, the data will always be linearly independent in feature space and $K_a$ will be invertible. Hence, we have
$$\alpha_a = \frac{1}{\lambda} K_a^{-1} K_b \alpha_b, \qquad K_b^2 \alpha_b - \lambda^2 K_b^2 \alpha_b = 0,$$
but the latter holds for all $\alpha_b$ with perfect correlation $\lambda = 1$. The solution is overfit!

Regularized CCA I

To avoid overfitting, we can regularize the solutions $w_a$ and $w_b$ by controlling their norms. The regularized CCA problem is
$$\max_{w_a, w_b} \rho_{ab}(w_a, w_b) = \frac{w_a^\top C_{ab} w_b}{\sqrt{(1-\tau_a)\, w_a^\top C_{aa} w_a + \tau_a \|w_a\|^2}\, \sqrt{(1-\tau_b)\, w_b^\top C_{bb} w_b + \tau_b \|w_b\|^2}},$$
where $\tau_a \in [0,1]$ and $\tau_b \in [0,1]$ serve as regularization parameters. Again, this yields an optimization program for the dual variables:
$$\max_{\alpha_a, \alpha_b} \alpha_a^\top K_a K_b \alpha_b \qquad \text{such that } (1-\tau_a)\,\alpha_a^\top K_a^2 \alpha_a + \tau_a\, \alpha_a^\top K_a \alpha_a = 1 \text{ and } (1-\tau_b)\,\alpha_b^\top K_b^2 \alpha_b + \tau_b\, \alpha_b^\top K_b \alpha_b = 1.$$

Regularized CCA II

Using the Lagrangian technique, we again arrive at a GEP:
$$\begin{bmatrix} 0 & K_a K_b \\ K_b K_a & 0 \end{bmatrix}\begin{bmatrix} \alpha_a \\ \alpha_b \end{bmatrix} = \lambda \begin{bmatrix} (1-\tau_a)K_a^2 + \tau_a K_a & 0 \\ 0 & (1-\tau_b)K_b^2 + \tau_b K_b \end{bmatrix}\begin{bmatrix} \alpha_a \\ \alpha_b \end{bmatrix}.$$
Solutions $(\alpha_a, \alpha_b)$ can now be used as the usual projection directions of Eq. (1).
Solving CCA using the above GEP is impractical: the matrices required are $2N \times 2N$. Instead, the usual approach is to make an incomplete Cholesky decomposition of the kernel matrices:
$$K_a = R_a^\top R_a, \qquad K_b = R_b^\top R_b.$$
The resulting GEP can be solved more efficiently (see the book for algorithmic details).
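
A sketch of regularized kernel CCA via the $2N \times 2N$ dual GEP above. As the slide notes, this direct form is impractical for large $N$; here $N$ is tiny, so forming the full blocks is fine. The linear kernels, the $\tau$ values, and the small ridge `eps` (added so the right-hand matrix is numerically positive definite) are all illustrative choices.

```python
# Regularized kernel CCA as a 2N x 2N generalized eigenvalue problem.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(7)
N = 60
Z = rng.standard_normal((N, 2))
Xa = Z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((N, 5))
Xb = Z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((N, 4))

Ka, Kb = Xa @ Xa.T, Xb @ Xb.T                       # kernel matrices (linear kernel)
tau_a = tau_b = 0.5
eps = 1e-6 * np.eye(N)

A = np.block([[np.zeros((N, N)), Ka @ Kb], [Kb @ Ka, np.zeros((N, N))]])
B = np.block([[(1 - tau_a) * Ka @ Ka + tau_a * Ka + eps, np.zeros((N, N))],
              [np.zeros((N, N)), (1 - tau_b) * Kb @ Kb + tau_b * Kb + eps]])

lam, W = eigh(A, B)
alpha_a, alpha_b = W[:N, -1], W[N:, -1]             # dual directions, used as in Eq. (1)
print(round(lam[-1], 3))                            # leading (regularized) correlation
```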

Regularized CCA III

Finally, CCA can be extended to multiple representations of the data, which results in the following GEP:
$$\begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1k} \\ C_{21} & C_{22} & \cdots & C_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ C_{k1} & C_{k2} & \cdots & C_{kk} \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix} = \rho \begin{bmatrix} C_{11} & 0 & \cdots & 0 \\ 0 & C_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C_{kk} \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}.$$

LDA as a GEP

Note that the Fisher Discriminant Analysis problem can be expressed as
$$\max_\alpha J(\alpha) = \frac{\alpha^\top M \alpha}{\alpha^\top N \alpha},$$
which is a GEP. In fact, this is how solutions to LDA are obtained.

Summary

In this lecture, we saw how different objectives for projection directions yield different subspaces. We saw three different algorithms:
1. Principal Component Analysis
2. Maximum Covariance Analysis
3. Canonical Correlation Analysis
Each of these techniques can be solved using eigenvalue, singular value, or generalized eigenvalue decompositions, and each yields linear projections and thus can be kernelized.
In the next lecture, we will explore the general technique of minimizing a loss and how it allows us to develop a wide range of kernel algorithms. In particular, we will see the Support Vector Machine for classification tasks.

Bibliography

The majority of the material from this talk can be found in the lecture's accompanying book, Kernel Methods for Pattern Analysis.

[1] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586-591, 1991.