On the Eigenspectrum of the Gram Matrix and the Generalization Error of Kernel PCA
Shawe-Taylor, et al., 2005
Ameet Talwalkar, 10/3/07
Outline
- Background / Motivation
- PCA, MDS
- Isomap
- Kernel PCA
- Generalization Error of Kernel PCA
Lossy Dimensionality Reduction: Motivation
- Computational efficiency
- Visualization of data requires 2D or 3D representations
- Curse of Dimensionality: learning algorithms require reasonably good sampling
  - An intractable learning problem --(dim. reduction)--> a tractable learning problem
- Lossless manifold learning assumes the existence of an intrinsic dimension, i.e., a reduced representation containing all independent variables
Linear Dimensionality Reduction
- Assumes input data is a linear function of the independent variables
- Common methods:
  - Principal Component Analysis (PCA)
  - Multidimensional Scaling (MDS)
PCA: Big Picture
- Linearly transform input data in a way that:
  - Maximizes signal (variance)
  - Minimizes redundancy of signal (covariance)
PCA: Simple Example
- Original data points: e.g., shoe size measured in both ft and cm
- [Figure: the first principal direction y provides a good approximation of the data]
PCA: Simple Example (cont.)
- [Figure: original data restored using only the first principal component]
PCA: Covariance
- Covariance is a measure of how much two variables vary together:
  cov(x, y) = E[(x - E[x])(y - E[y])]
  cov(x, x) = var(x)
- If x and y are independent, then cov(x, y) = 0
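A quick numerical sanity check of these identities (the data and variable names below are illustrative, not from the slides):

```python
import numpy as np

# Sketch: estimate cov(x, y) = E[(x - E[x])(y - E[y])] from samples.
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = 2.0 * x + rng.normal(size=10000)   # y depends linearly on x
z = rng.normal(size=10000)             # z is independent of x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # close to 2 * var(x) = 2
cov_xz = np.mean((x - x.mean()) * (z - z.mean()))  # close to 0 (independence)
cov_xx = np.mean((x - x.mean()) ** 2)              # equals var(x)
```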
PCA: Covariance Matrix
- Stores pairwise covariances of the variables
  - Diagonal entries are the variances
  - Symmetric, positive semi-definite
- Start with m column-vector observations of n variables
- The covariance matrix is the n x n matrix
  C_X = E[(X - E[X])(X - E[X])^T] = (1/m) X X^T   (for centered data)
Eigendecomposition
- Eigenvectors v and eigenvalues λ of an n x n matrix A are pairs (v, λ) such that Av = λv
- If A is a real symmetric matrix, it can be diagonalized as A = E D E^T
  - E: A's orthonormal eigenvectors
  - D: diagonal matrix of A's eigenvalues
- A positive semi-definite => eigenvalues non-negative
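These properties can be verified numerically; a minimal sketch (the matrix here is an arbitrary covariance-style example, not from the slides):

```python
import numpy as np

# A real symmetric PSD matrix diagonalizes as A = E D E^T with
# orthonormal eigenvectors and non-negative eigenvalues.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 50))
A = X @ X.T / 50                        # symmetric, positive semi-definite

evals, E = np.linalg.eigh(A)            # eigh is for symmetric matrices
D = np.diag(evals)

ok_decomp = np.allclose(A, E @ D @ E.T)          # A = E D E^T
ok_ortho = np.allclose(E.T @ E, np.eye(3))       # orthonormal eigenvectors
ok_psd = bool(np.all(evals >= -1e-12))           # non-negative eigenvalues
```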
PCA: Goal
- Linearly transform input data in a way that:
  - Maximizes signal (variance)
  - Minimizes redundancy of signal (covariance)
- Algorithm:
  - Select the variance-maximizing direction in input space
  - Find the next variance-maximizing direction that is orthogonal to all previously selected directions
  - Repeat m - 1 times
- Equivalently: find a transformation P such that Y = PX and C_Y is diagonalized
- Solution: project the data onto the eigenvectors of C_X
PCA: Algorithm
- Goal: find P where Y = PX s.t. C_Y is diagonalized
- C_Y = (1/m) Y Y^T = (1/m)(PX)(PX)^T = P [(1/m) X X^T] P^T = P A P^T,
  where A = (1/m) X X^T = E D E^T
  (note: the eigenvectors of A, the columns of E, are orthonormal)
- Select P = E^T, i.e., a matrix where each row is an eigenvector of C_X
- C_Y = P A P^T = P (P^T D P) P^T = D   (inverse = transpose for an orthonormal matrix)
- C_Y is diagonalized: the PCs are the eigenvectors of C_X, and the i-th diagonal value of C_Y is the variance of X along p_i
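The algorithm above fits in a few lines of numpy; a minimal sketch (variable names mirror the slide, the data is illustrative):

```python
import numpy as np

# PCA per the slide: rows of P are eigenvectors of C_X,
# so C_Y = P C_X P^T comes out diagonal (decorrelated signals).
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 200))
X[1] += 0.9 * X[0]                      # correlate the two variables
X -= X.mean(axis=1, keepdims=True)      # center each variable

C_X = X @ X.T / X.shape[1]
evals, E = np.linalg.eigh(C_X)          # eigenvalues in ascending order
P = E[:, ::-1].T                        # rows = eigenvectors, largest variance first

Y = P @ X
C_Y = Y @ Y.T / Y.shape[1]              # diagonal; entries are variances along PCs
```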
Gram Matrix (Kernel Matrix)
- Given X, a collection of m column-vector observations of n variables
- Gram matrix: the matrix of dot products of the inputs, K = X^T X
  - Real, symmetric, positive semi-definite
  - A similarity matrix
Classical Multidimensional Scaling
- Given m objects and a dissimilarity δ_ij for each pair, find a space in which δ_ij ~ Euclidean distance
- If δ_ij is Euclidean distance:
  - Can convert the dissimilarity matrix to a Gram matrix (or just start with a Gram matrix)
  - MDS yields the same answer as PCA
Classical Multidimensional Scaling
- Convert the dissimilarity matrix to a Gram matrix K
- Eigendecomposition of K:
  K = E D E^T = (E D^{1/2})(D^{1/2} E^T)
  K = X^T X  =>  X = D^{1/2} E^T
- Reduce dimension: construct X from a subset of eigenvectors/eigenvalues
- Identical to PCA
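A minimal sketch of the whole pipeline, assuming the standard double-centering step (K = -0.5 J D² J, with J the centering matrix) to convert squared distances to a Gram matrix; the data is illustrative:

```python
import numpy as np

# Classical MDS: squared distances -> Gram matrix -> X = D^{1/2} E^T.
rng = np.random.default_rng(3)
X = rng.normal(size=(2, 20))                              # 20 points in 2-D
D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # squared distances

m = D2.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
K = -0.5 * J @ D2 @ J                     # Gram matrix of the centered data

evals, E = np.linalg.eigh(K)
top = np.argsort(evals)[::-1][:2]         # keep the 2 largest eigenvalues
Y = np.diag(np.sqrt(evals[top])) @ E[:, top].T   # reconstructed coordinates

# Y reproduces the pairwise distances (up to rotation/reflection)
D2_rec = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
```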
Limitations of Linear Methods
- Cannot account for a non-linear relationship of the data in input space
- Data may still have a linear relationship in some feature space
- Isomap: use geodesic distance to recover the manifold
  - Geodesic distance: length of the shortest curve on a manifold connecting two points on the manifold
- [Figure: two points on a curved manifold with small Euclidean distance but large geodesic distance]
Local Estimation of Manifolds
- Small patches on a non-linear manifold look linear
- Locally linear neighborhoods are defined in two ways:
  - k-nearest neighbors: find the k nearest points to a given point
  - ε-ball: find all points that lie within ε of a given point
Isomap Idea
- Create a weighted graph (vertices = datapoints; edges between neighbors, weighted by Euclidean distance)
- Distance matrix = pairwise shortest paths
- Construct a d-dimensional embedding: perform MDS and eyeball the residual variance
Eyeballing Intrinsic Dimension
Isomap Convergence
- Guaranteed to asymptotically recover convex Euclidean manifolds
- For a sufficiently high density of data points, given arbitrarily small values λ1, λ2 and µ, with probability at least 1 - µ:
  (1 - λ1) * geodesic distance <= graph distance <= (1 + λ2) * geodesic distance
- Rate of convergence depends on the density of points and on properties of the underlying manifold (radius of curvature, branch separation)
Kernel Functions
- Kernel function κ: a similarity measure between two vectors
- Define a non-linear mapping from input space to a high-dimensional feature space, Φ: X -> F
- Define κ such that κ(x, y) = <Φ(x), Φ(y)>
- Efficiency: κ may be much more efficient to compute than the mapping plus dot product in the high-dimensional space
- Flexibility: κ can be chosen arbitrarily so long as it is positive definite symmetric
Positive Definite Symmetric (PDS) Kernels
- Given m column-vector observations of n variables
- Kernel matrix: the m x m matrix in which K_ij = κ(x_i, x_j)
- A kernel is PDS if K is symmetric and positive semi-definite
- If K is positive semi-definite, then κ is the dot product in some dot product space (the feature space)
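The PDS property is easy to check numerically for a concrete kernel; a small sketch using the RBF kernel exp(-||x - y||²/2) as the example (the data is arbitrary):

```python
import numpy as np

# Every kernel matrix of a PDS kernel is symmetric with non-negative eigenvalues.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))                         # 30 observations
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
K = np.exp(-0.5 * d2)                                # K_ij = kappa(x_i, x_j)

is_symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()                # >= 0 up to roundoff
```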
Kernel Trick
- For any algorithm relying solely on dot products, we can replace the dot product with a positive definite kernel
- Allows for non-linearity
- Example: PCA
Kernel PCA
- PCA: the eigenvectors of the covariance matrix are the principal components
- Can rewrite solely with dot products
- Kernel PCA: in feature space,
  C = (1/m) Σ_j Φ(x_j) Φ(x_j)^T,   λv = Cv
- Expand v in the span of the mapped data: v = Σ_i α_i Φ(x_i)
- Multiply by Φ(x_k): λ <Φ(x_k), v> = <Φ(x_k), Cv> for all k

Kernel PCA (cont.)
- Substituting v = Σ_i α_i Φ(x_i) and K_ij = <Φ(x_i), Φ(x_j)> gives
  m λ K α = K² α, which reduces to the eigenproblem
  m λ α = K α
- K is the kernel (Gram) matrix

Kernel PCA (cont.)
- Use eigendecomposition on K to find the eigenvectors
- Project test points in F onto a subset of the eigenvectors (dimension reduction)
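A minimal sketch of this procedure (my own toy version; for brevity it omits the centering of the data in feature space that a full kernel PCA implementation performs):

```python
import numpy as np

# Kernel PCA: solve m*lambda*alpha = K*alpha, then project points
# onto the leading feature-space eigenvectors v_j = sum_i alpha_ij Phi(x_i).
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2)                               # RBF kernel matrix

mu, alpha = np.linalg.eigh(K)                 # eigenpairs of K
order = np.argsort(mu)[::-1][:2]              # keep 2 leading eigenvectors
# normalize so each v_j has unit norm: ||v_j||^2 = alpha_j^T K alpha_j
alpha = alpha[:, order] / np.sqrt(mu[order])

Y = K @ alpha                                 # projections of training points
```

With this normalization the squared norms of the projected coordinates recover the kernel-matrix eigenvalues, which is a convenient correctness check.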
Theory behind dimensionality reduction?
- Dimensionality reduction has gained popularity since Isomap and LLE were published
- But there is not much theory behind it (Isomap is an exception)
- Assuming the existence of an underlying manifold:
  - Do the various dim. reduction algorithms converge to the correct manifold?
  - What is the rate of convergence, i.e., given an input X of m points, how close is d_red(X) to the underlying manifold?
Why focus on KPCA?
- Generalization of dimensionality reduction: LLE and Isomap are forms of KPCA
- Residual variance: an intuitive measurement of accuracy
- The limit is clear and provable: given an underlying manifold with dimension d, as m approaches infinity, the residual variance approaches 0
- The paper also uses residual variance to measure dim. reduction accuracy in the finite case
What we're interested in
- Residual variance: Σ_{i > k} λ_i
- Captured variance: Σ_{i <= k} λ_i
- This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues
Empirical eigenvalues
- Perform PCA on a sample, S, of m points in feature space:
  C_S = (1/m) Σ_j Φ(x_j) Φ(x_j)^T,   λ̂ v = C_S v
- As in kernel PCA, multiplying by Φ(x_k) and writing v = Σ_i α_i Φ(x_i) turns this into the kernel-matrix eigenproblem μ̂ α = K α, with λ̂_i = μ̂_i / m
- Note: the λ̂_i are the eigenvalues of C_S
Process eigenvalues
- Empirical eigenproblem: (1/m) Σ_j κ(x_j, y) v(x_j) = λ̂ v(y)
- As m approaches infinity, this becomes
  ∫_X κ(x, y) v(x) p(x) dx = λ v(y)
  for a given kernel function κ and density p on a space X
- μ̂_i / m = λ̂_i is an estimate for the process eigenvalue λ_i
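This convergence can be seen numerically; a sketch of my own (RBF kernel with a standard normal density as the example), showing that the scaled kernel-matrix eigenvalues μ̂_i / m stabilize as m grows:

```python
import numpy as np

# The scaled kernel-matrix eigenvalues mu_hat_i / m estimate the
# process eigenvalues of the kernel under the sampling density.
rng = np.random.default_rng(6)

def top_scaled_eigs(m, k=3):
    x = rng.normal(size=(m, 1))            # density p = standard normal
    d2 = (x - x.T) ** 2
    K = np.exp(-0.5 * d2)                  # RBF kernel matrix
    mu = np.linalg.eigvalsh(K)[::-1][:k]   # k largest eigenvalues, descending
    return mu / m                          # empirical estimates lambda_hat_i

small = top_scaled_eigs(100)
large = top_scaled_eigs(800)               # estimates stabilize as m grows
```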
Projections onto Subspaces
- P_V(x): projection of x onto the subspace V
- P_V^⊥(x): projection of x onto the orthogonal complement of V
- ||P_V^⊥(x)||: residual of the projection onto V, i.e., the distance between the original point and its projection
Eigenvalues and Projections
- λ_1(K_q) = E_q[ ||P_{v_1}(Φ(x))||² ] = max_{v in F} E_q[ ||P_v(Φ(x))||² ]
- The maximum is attained when v is the first eigenvector of K_q
- So the first eigenvalue of the operator K_q equals the expected squared norm of the projection onto its first eigenvector
- Intuition: the first eigenvector is the direction for which the expected square of the residual, E_q[ ||P_v^⊥(Φ(x))||² ], is minimal
- q defines the distribution of K_q: the general formula is applicable to both the empirical and process cases
Empirical/Process Expectations of Empirical/Process Subspaces
- The first two equations follow from the last slide:
  E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i <= k} λ_i
  Ê[ ||P_{V̂_k}(Φ(x))||² ] = Σ_{i <= k} λ̂_i
- E[ ||P_{V̂_k}(Φ(x))||² ]: average, over the entire distribution, of the squared norm of the projection onto the first k empirical eigenvectors (agreed?)
- Ê[ ||P_{V_k}(Φ(x))||² ]: empirical average of the squared norm for points in S projected onto the first k process eigenvectors
Two simple inequalities
- V̂_k is the best solution for the empirical data S:
  Ê[ ||P_{V̂_k}(Φ(x))||² ] = Σ_{i <= k} λ̂_i >= Ê[ ||P_{V_k}(Φ(x))||² ]
- V_k is the best solution for the underlying process:
  E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i <= k} λ_i >= E[ ||P_{V̂_k}(Φ(x))||² ]
- Goal of the paper: show that the chain of inequalities
  Ê[P_{V̂_k}] >= Ê[P_{V_k}] ~ E[P_{V_k}] >= E[P_{V̂_k}]
  is accurate, and bound the difference between the first and last terms
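The outer gap in this chain can be probed numerically. The sketch below is my own check, not from the paper: it fits an (uncentered) kernel PCA subspace V̂_k on a training sample, then compares the training average of ||P_{V̂_k} Φ(x)||² (which equals (1/m) Σ_{i<=k} μ̂_i) with the same average on fresh test data, which it typically exceeds since V̂_k is tuned to the sample:

```python
import numpy as np

rng = np.random.default_rng(7)
k = 2

def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

Xtr = rng.normal(size=(100, 2))
Xte = rng.normal(size=(2000, 2))

K = rbf(Xtr, Xtr)
mu, alpha = np.linalg.eigh(K)
order = np.argsort(mu)[::-1][:k]
alpha = alpha[:, order] / np.sqrt(mu[order])   # unit-norm feature-space v_j

train_captured = mu[order].sum() / 100         # (1/m) sum of top-k mu_hat
proj = rbf(Xte, Xtr) @ alpha                   # <Phi(x), v_j> for test x
test_captured = (proj ** 2).sum(axis=1).mean() # estimate of E[||P_{V_hat_k} Phi(x)||^2]
```

For this RBF kernel κ(x, x) = 1, so both averages are at most 1, and the total variance to capture is exactly 1.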
What we're interested in (revisited)
- Residual variance: Σ_{i > k} λ_i
- Captured variance: Σ_{i <= k} λ_i
- E[ ||Φ(x)||² ] = E[ ||P_{V_k}(Φ(x))||² ] + E[ ||P_{V_k}^⊥(Φ(x))||² ]
- This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues
And now... a first bound
- If we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 - δ over random m-samples S, if new data is projected onto V̂_k, the sum of the k largest process eigenvalues (the captured variance) is bounded by:
  Σ_{i <= k} λ_i >= E[ ||P_{V̂_k}(Φ(x))||² ]
    >= max_{1 <= l <= k} [ (1/m) Σ_{i <= l} μ̂_i(S) - ((1 + √l)/√m) sqrt( (2/m) Σ_i κ(x_i, x_i)² ) ]
       - R² sqrt( (19/m) ln(2(m+1)/δ) )
  where the support of the distribution is in a ball of radius R in the feature space
A first bound (cont.)
- First term: max_{1 <= l <= k} [ (1/m) Σ_{i <= l} μ̂_i(S) - ((1 + √l)/√m) sqrt( (2/m) Σ_i κ(x_i, x_i)² ) ]
  - Tradeoff between the terms within the term: as l increases, the captured variance increases, but so does the ratio l/m
  - For well-behaved kernels (those for which the dot product is bounded), the square-root term should be a constant
- Second term: R² sqrt( (19/m) ln(2(m+1)/δ) )
  - Includes the dependence on the confidence parameter δ and the distribution radius R
And now... the second bound
- If we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 - δ over random m-samples S, if new data is projected onto V̂_k, the expected squared residual is bounded by:
  Σ_{i > k} λ_i <= E[ ||P_{V̂_k}^⊥(Φ(x))||² ]
    <= min_{1 <= l <= k} [ (1/m) Σ_{i > l} μ̂_i(S) + ((1 + √l)/√m) sqrt( (2/m) Σ_i κ(x_i, x_i)² ) ]
       + R² sqrt( (18/m) ln(2(m+1)/δ) )
  where the support of the distribution is in a ball of radius R in the feature space
Next steps
- How tight are these bounds? Can we do better?
- Can we use these bounds to compare existing dimensionality reduction algorithms?
- Can we construct a kernel that maximizes the tightness of this bound?