On the Eigenspectrum of the Gram Matrix and the Generalisation Error of Kernel PCA (Shawe-Taylor, et al. 2005) Ameet Talwalkar 02/13/07


Outline: Background / Motivation; PCA, MDS; Isomap; Kernel PCA; Generalisation Error of Kernel PCA.

Lossy Dimensionality Reduction: Motivation. Computational efficiency; visualization of data requires 2D or 3D representations. Curse of Dimensionality: learning algorithms require reasonably good sampling, so dimensionality reduction can turn an intractable learning problem into a tractable one. Dim. Red. -> Lossless? Manifold Learning assumes the existence of an intrinsic dimension, i.e., a reduced representation containing all independent variables.

Linear Dimensionality Reduction: assumes the input data is a linear function of the independent variables. Common methods: Principal Component Analysis (PCA), Multidimensional Scaling (MDS).

PCA Big Picture: linearly transform the input data in a way that maximizes signal (variance) and minimizes redundancy of signal (covariance).

PCA Simple Example: original data points, e.g. shoe size measured in both ft and cm; the first principal direction y1 provides a good approximation of the data.

PCA Simple Example (cont.): original data restored using only the first principal component.

PCA Covariance: covariance is a measure of how much two variables vary together. cov(x, y) = E[(x − E[x])(y − E[y])]; cov(x, x) = var(x). If x and y are independent, then cov(x, y) = 0.

PCA Covariance Matrix: stores the pairwise covariances of the variables; the diagonal entries are variances; symmetric, positive semi-definite. Start with m column-vector observations of n variables (X is n x m, assumed centered). The covariance is an n x n matrix: C_X = E[(X − E[X])(X − E[X])^T] = (1/m) X X^T.
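
A minimal numpy sketch (not from the slides; the sizes and names are illustrative) of building C_X from m column-vector observations of n variables and checking the stated properties:

```python
import numpy as np

# m column-vector observations of n variables: X is n x m
rng = np.random.default_rng(0)
n, m = 3, 500
X = rng.normal(size=(n, m))

# Center each variable, then C_X = (1/m) X X^T  (n x n)
Xc = X - X.mean(axis=1, keepdims=True)
C = (Xc @ Xc.T) / m

print(np.allclose(C, C.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # positive semi-definite (up to round-off)
```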

Eigendecomposition: eigenvectors v and eigenvalues λ of an n x n matrix A are pairs (v, λ) such that Av = λv. If A is a real symmetric matrix, it can be diagonalized as A = E D E^T, where E holds A's orthonormal eigenvectors and D is the diagonal matrix of A's eigenvalues. A positive semi-definite => eigenvalues non-negative.
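
The same facts in a small numpy sketch (A here is just a random symmetric PSD matrix built for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A = A @ A.T                                  # real symmetric (and PSD by construction)

evals, E = np.linalg.eigh(A)                 # columns of E are orthonormal eigenvectors
D = np.diag(evals)

print(np.allclose(A, E @ D @ E.T))           # A = E D E^T
print(np.allclose(E.T @ E, np.eye(4)))       # E is orthonormal
print(np.all(evals >= -1e-12))               # PSD => non-negative eigenvalues
```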

PCA Goal: linearly transform the input data in a way that maximizes signal (variance) and minimizes redundancy of signal (covariance). Algorithm: select the variance-maximizing direction in input space; find the next variance-maximizing direction that is orthogonal to all previously selected directions; repeat n − 1 times. Equivalently, find a transformation P such that Y = PX and C_Y is diagonalized. Solution: project the data onto the eigenvectors of C_X.

PCA Algorithm. Goal: find P where Y = PX s.t. C_Y is diagonalized. C_Y = (1/m) Y Y^T = (1/m) (PX)(PX)^T = P ((1/m) X X^T) P^T = P A P^T, where A = (1/m) X X^T = E D E^T (note: the eigenvectors in E are orthonormal). Select P = E^T, i.e., a matrix whose rows are the eigenvectors of C_X. Then C_Y = P A P^T = P (P^T D P) P^T = D (inverse = transpose for an orthonormal matrix), so C_Y is diagonalized. The PCs are the eigenvectors of C_X, and the i-th diagonal value of C_Y is the variance of X along p_i.
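
The algorithm above translates almost line-for-line into numpy. The following is a hedged sketch rather than the authors' code; the function name and test data are made up for illustration:

```python
import numpy as np

def pca(X, k):
    """PCA sketch: rows of P are the top-k eigenvectors of the covariance of X (n x m)."""
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    C = (Xc @ Xc.T) / m                      # covariance matrix C_X
    evals, E = np.linalg.eigh(C)             # ascending eigenvalues
    order = np.argsort(evals)[::-1]          # sort descending
    P = E[:, order[:k]].T                    # top-k eigenvectors as rows
    Y = P @ Xc                               # Y = P X; C_Y is (numerically) diagonal
    return Y, P, evals[order]

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200))
Y, P, variances = pca(X, k=2)
print(np.round((Y @ Y.T) / X.shape[1], 3))   # ~diagonal, entries = top-2 variances
```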

Gram Matrix (Kernel Matrix): given X, a collection of m column-vector observations of n variables, the Gram matrix is the m x m matrix of dot products of the inputs: K = X^T X. It is real, symmetric, and positive semi-definite (a similarity matrix).

Classical Multidimensional Scaling: given m objects and a dissimilarity δ_ij for each pair, find a space in which δ_ij equals the Euclidean distance. If δ_ij is a Euclidean distance, we can convert the dissimilarity matrix to a Gram matrix (or we can just start with a Gram matrix), and MDS yields the same answer as PCA.

Classical Multidimensional Scaling: convert the dissimilarity matrix to a Gram matrix K. Eigendecomposition of K: K = E D E^T = (E D^{1/2})(D^{1/2} E^T). Since K = X^T X, we can take X = D^{1/2} E^T. Reduce dimension: construct X from a subset of the eigenvectors/eigenvalues. Identical to PCA.
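
A sketch of classical MDS under the assumption that the dissimilarities are squared Euclidean distances, so that double centering recovers a Gram matrix (the function name and test data are illustrative):

```python
import numpy as np

def classical_mds(D2, k):
    """Classical MDS sketch: squared-dissimilarity matrix D2 (m x m) -> k-dim embedding."""
    m = D2.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    K = -0.5 * J @ D2 @ J                          # Gram matrix from squared distances
    evals, E = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:k]
    scale = np.sqrt(np.maximum(evals[order], 0))   # clip tiny negatives from round-off
    return E[:, order] * scale                     # rows = embedded points (X ~ E D^{1/2})

# With true Euclidean distances this recovers the PCA embedding up to rotation/sign.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(classical_mds(D2, k=2).shape)                # (50, 2)
```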

Limitations of Linear Methods: they cannot account for a non-linear relationship of the data in input space (two points can have a small Euclidean distance but a large geodesic distance). The data may still have a linear relationship in some feature space. Isomap: use geodesic distance to recover the manifold, where the geodesic distance is the length of the shortest curve on the manifold connecting two points of the manifold.

Local Estimation of Manifolds: small patches on a non-linear manifold look linear. Locally linear neighborhoods are defined in two ways. k-nearest neighbors: find the k nearest points to a given point. ε-ball: find all points that lie within ε of a given point.
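
Both neighborhood definitions are direct to express; a small illustrative sketch (the function names are mine):

```python
import numpy as np

def knn_neighbors(X, k):
    """Indices of the k nearest neighbors of each row of X (points x dims)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def eps_ball_neighbors(X, eps):
    """Indices of all points lying within eps of each row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return [np.flatnonzero(row <= eps) for row in D]
```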

Isomap Idea: create a weighted graph (vertices = datapoints; edges between neighbors, weighted by Euclidean distance). Distance matrix = pairwise shortest paths. Construct the d-dimensional embedding: perform MDS and eyeball the residual variance.
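
Putting the pieces together, here is a self-contained sketch of the pipeline just described (kNN graph, all-pairs shortest paths via Floyd-Warshall, then classical MDS). It assumes the neighborhood graph is connected and is meant as an illustration, not a reference implementation:

```python
import numpy as np

def isomap(X, k_neighbors, d):
    """Isomap sketch: kNN graph -> shortest-path (geodesic) distances -> classical MDS."""
    m = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Weighted neighborhood graph: keep Euclidean edges to the k nearest neighbors.
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    idx = np.argsort(D, axis=1)[:, 1:k_neighbors + 1]   # position 0 is the point itself
    for i in range(m):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                               # symmetrize
    # All-pairs shortest paths (Floyd-Warshall) approximate geodesic distances;
    # assumes the graph is connected (otherwise inf entries remain).
    for j in range(m):
        G = np.minimum(G, G[:, j:j + 1] + G[j:j + 1, :])
    # Classical MDS on squared graph distances gives the d-dimensional embedding.
    J = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * J @ (G ** 2) @ J
    evals, E = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:d]
    return E[:, order] * np.sqrt(np.maximum(evals[order], 0))
```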

Eyeballing Intrinsic Dimension.

Isomap Convergence: guaranteed to asymptotically recover convex Euclidean manifolds. For a sufficiently high density of data points, given arbitrarily small values λ1, λ2 and µ, then with probability at least 1 − µ: (1 − λ1) · geodesic distance ≤ graph distance ≤ (1 + λ2) · geodesic distance. The rate of convergence depends on the density of points and on properties of the underlying manifold (radius of curvature, branch separation).

Kernel Functions: a kernel function κ is a similarity measure between two vectors. Define a non-linear mapping from input space to a high-dimensional feature space, Φ: X → F, and define κ such that κ(x, y) = ⟨Φ(x), Φ(y)⟩. Efficiency: κ may be much more efficient to compute than the mapping plus the dot product in the high-dimensional space. Flexibility: κ can be chosen arbitrarily so long as it is positive definite symmetric.

Positive Definite Symmetric (PDS) Kernels: given m column-vector observations of n variables, the kernel matrix is the m x m matrix in which K_ij = κ(x_i, x_j). The kernel κ is PDS if K is symmetric and positive semi-definite. If K is positive semi-definite, then κ is the dot product in some dot-product space (the feature space).
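
A quick numerical check of the PDS property for one standard kernel (the Gaussian/RBF kernel, which is known to be PDS); gamma and the sample are arbitrary illustrative choices:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix K_ij = kappa(x_i, x_j) for the Gaussian/RBF kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
K = rbf_kernel_matrix(X)
print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # positive semi-definite (numerically)
```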

Kernel Trick: for any algorithm relying solely on dot products, we can replace the dot product with a positive-definite kernel. This allows for non-linearity. Example: PCA.

Kernel PCA. PCA: the eigenvectors of the covariance matrix are the principal components, and the eigenproblem can be rewritten solely with dot products. Kernel PCA: in feature space the covariance is C = (1/m) Σ_j Φ(x_j) Φ(x_j)^T and we solve λ v = C v. Every solution v lies in the span of {Φ(x_1), ..., Φ(x_m)}, so write v = Σ_i α_i Φ(x_i) and multiply both sides by Φ(x_k): λ ⟨Φ(x_k), v⟩ = ⟨Φ(x_k), C v⟩ for all k.

Kernel PCA (cont.): substituting v = Σ_i α_i Φ(x_i) and K_ij = ⟨Φ(x_i), Φ(x_j)⟩ into the previous equation gives m λ K α = K² α, whose relevant solutions satisfy m λ α = K α, where K is the kernel matrix.

Kernel PCA (cont.): K is the kernel (Gram) matrix. Use the eigendecomposition of K to find the eigenvectors α. Project test points in F onto a subset of the eigenvectors (dimensionality reduction).
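
A compact sketch of the procedure just described: center the kernel matrix, eigendecompose it, and embed the sample points using the eigenvectors. The centering step and the sqrt(mu) normalization follow the standard kernel PCA recipe and go slightly beyond what the slide spells out:

```python
import numpy as np

def kernel_pca(K, d):
    """Kernel PCA sketch: K_ij = kappa(x_i, x_j) on the sample -> d-dim embeddings (rows)."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                            # center the data in feature space
    mu, A = np.linalg.eigh(Kc)                # kernel-matrix eigenvalues mu = m * lambda_hat
    order = np.argsort(mu)[::-1][:d]
    mu, A = mu[order], A[:, order]
    # Normalizing so the feature-space eigenvectors have unit norm, the projection of
    # training point x_i onto eigenvector j is sqrt(mu_j) * A[i, j].
    return A * np.sqrt(np.maximum(mu, 0))

# With a linear kernel K = X X^T this reproduces ordinary PCA scores.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
print(kernel_pca(X @ X.T, d=2).shape)         # (40, 2)
```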

Theory behind dimensionality reduction? Dimensionality reduction has gained popularity since Isomap and LLE were published, but there is not much theory behind it (Isomap is an exception). Assuming the existence of an underlying manifold: do the various dim. red. algorithms converge to the correct manifold? What is the rate of convergence, i.e., given an input X of m points, how close is d_red(X) to the underlying manifold?

Why focus on KPCA? It is a generalization of dimensionality reduction: LLE and Isomap are forms of KPCA. Residual variance is an intuitive measurement of accuracy. The limit is clear and provable: given an underlying manifold of dimension k, as m approaches infinity the residual variance approaches 0. The paper also uses residual variance to measure dim. red. accuracy in the finite case.

What we're interested in: Residual Variance = Σ_{i>k} λ_i; Captured Variance = Σ_{i≤k} λ_i. This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues.

Empirical eigenvalues: perform (kernel) PCA on a sample S of m points; the empirical eigenproblem is the kernel-matrix problem K α = μ α. Note: the μ_i are the eigenvalues of the kernel matrix on S, and λ̂_i = μ_i / m are the eigenvalues of the sample covariance C_S.

Process eigenvalues: the empirical eigenproblem is (1/m) Σ_j κ(x_i, x_j) v(x_j) = (μ/m) v(x_i). As m approaches infinity, this becomes ∫_X κ(x, y) v(x) p(x) dx = λ v(y) for a given kernel function κ and density p on a space X. Thus μ_i / m is an estimate for the process eigenvalue λ_i.
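
The claim that μ_i / m estimates the process eigenvalue λ_i can be illustrated numerically: sample m points from a fixed density, build the kernel matrix, and watch μ_i / m stabilize as m grows (the kernel, density, and sizes below are arbitrary choices):

```python
import numpy as np

def empirical_process_estimates(m, d=5, gamma=1.0, seed=0):
    """Return mu_i / m for the top-d kernel-matrix eigenvalues on an m-point sample."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, 2))               # density p = standard normal on R^2
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                   # RBF kernel matrix
    mu = np.sort(np.linalg.eigvalsh(K))[::-1]
    return mu[:d] / m

# Estimates stabilize as m grows, illustrating mu_i / m -> lambda_i.
for m in (50, 200, 800):
    print(m, np.round(empirical_process_estimates(m), 4))
```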

Projections onto Subspaces: P_V denotes projection onto the subspace V; P_V^⊥ denotes projection onto the orthogonal complement of V. ||P_V^⊥(Φ(x))|| is the residual of the projection onto V: the distance between the original point and its projection.

Eigenvalues and Projections: λ₁(K_q) = max_{v ∈ F} E_q[ ||P_v(Φ(x))||² ], and the maximum is attained when v is the first eigenvector of K_q, so the first eigenvalue of the operator K_q equals the expected squared norm of the projection onto its first eigenvector. Equivalently, min_{v ∈ F} E_q[ ||P_v^⊥(Φ(x))||² ] = E_q[ ||Φ(x)||² ] − λ₁(K_q); intuition: the first eigenvector is the direction for which the expected square of the residual is minimal. Here q defines the distribution used in K_q, and the general formula applies to both the empirical and the process cases.

Empirical/Process Expectations onto Empirical/Process Subspaces: the first two equations follow from the last slide: E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i≤k} λ_i and Ê[ ||P_{V̂_k}(Φ(x))||² ] = (1/m) Σ_{i≤k} μ_i. E[ ||P_{V̂_k}(Φ(x))||² ]: the average, over the entire distribution, of the squared norm of the projection onto the first k empirical eigenvectors. Ê[ ||P_{V_k}(Φ(x))||² ]: the empirical average of the squared norm for points in S projected onto the first k process eigenvectors.

Two simple inequalities. V̂_k is the best solution for the empirical data S: Ê[ ||P_{V_k}(Φ(x))||² ] ≤ Ê[ ||P_{V̂_k}(Φ(x))||² ] = (1/m) Σ_{i≤k} μ_i. V_k is the best solution for the underlying process: E[ ||P_{V̂_k}(Φ(x))||² ] ≤ E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i≤k} λ_i. Goal of the paper: show that this chain of inequalities is tight, i.e., bound the difference between its first and last terms.

What we're interested in: Residual Variance = Σ_{i>k} λ_i = E[ ||P_{V_k}^⊥(Φ(x))||² ]; Captured Variance = Σ_{i≤k} λ_i = E[ ||P_{V_k}(Φ(x))||² ]. This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues.

And now a first Bound: if we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 − δ over random m-samples S, if new data is projected onto V̂_k, the sum of the k largest process eigenvalues (the captured variance) is bounded by: Σ_{i≤k} λ_i ≥ E[ ||P_{V̂_k}(Φ(x))||² ] ≥ max_{1≤l≤k} [ (1/m) Σ_{i≤l} μ_i(S) − ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ] − R² √( (19/m) ln(2(m+1)/δ) ), where the support of the distribution is in a ball of radius R in the feature space.

And now a first Bound (cont.). First term: max_{1≤l≤k} [ (1/m) Σ_{i≤l} μ_i(S) − ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ]. There is a tradeoff within this term: as l increases, the captured variance increases, but so does the ratio l/m. For well-behaved kernels (those for which the dot product is bounded), the square-root factor should be a constant. Second term: R² √( (19/m) ln(2(m+1)/δ) ); it carries the dependence on the confidence parameter δ and the distribution radius R.

The second bound: if we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 − δ over random m-samples S, if new data is projected onto V̂_k, the expected squared residual is bounded by: Σ_{i>k} λ_i ≤ E[ ||P_{V̂_k}^⊥(Φ(x))||² ] ≤ min_{1≤l≤k} [ (1/m) Σ_{i>l} μ_i(S) + ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ] + R² √( (18/m) ln(2(m+1)/δ) ), where the support of the distribution is in a ball of radius R in the feature space.
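
For a rough feel for these expressions, the sketch below plugs the empirical kernel-matrix eigenvalues into the two bounds exactly as reconstructed on the last few slides (the 18/19 constants and the log term follow that reconstruction; delta, the RBF kernel, and R = 1, valid for RBF since κ(x, x) = 1, are illustrative choices):

```python
import numpy as np

def kpca_bounds(K, k, delta=0.05, R=1.0):
    """Evaluate the two generalisation bounds from the empirical eigenvalues of K (m x m)."""
    m = K.shape[0]
    mu = np.sort(np.linalg.eigvalsh(K))[::-1]             # mu_1 >= mu_2 >= ...
    diag_term = np.sqrt(2.0 / m * np.sum(np.diag(K) ** 2))
    ls = np.arange(1, k + 1)
    captured_emp = np.cumsum(mu)[:k] / m                   # (1/m) sum_{i<=l} mu_i
    residual_emp = (np.sum(mu) - np.cumsum(mu)[:k]) / m    # (1/m) sum_{i>l} mu_i
    slack = (1 + np.sqrt(ls)) / np.sqrt(m) * diag_term
    conf = np.log(2 * (m + 1) / delta)
    lower = np.max(captured_emp - slack) - R**2 * np.sqrt(19.0 / m * conf)
    upper = np.min(residual_emp + slack) + R**2 * np.sqrt(18.0 / m * conf)
    return lower, upper   # lower bound on captured variance, upper bound on residual

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq)                                            # RBF kernel: kappa(x, x) = 1
print(kpca_bounds(K, k=3))
```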

Next steps: How tight are these bounds? Can we do better? Can we use these bounds to compare existing dimensionality reduction algorithms? Can we construct a kernel that maximizes the tightness of this bound?