On the Eigenspectrum of the Gram Matrix and the Generalisation Error of Kernel PCA (Shawe-Taylor, et al. 2005) Ameet Talwalkar 02/13/07


Outline: Background / Motivation; PCA, MDS; Isomap; Kernel PCA; Generalisation Error of Kernel PCA.

Lossy Dimensionality Reduction: Motivation. Computational efficiency; visualization of data requires 2D or 3D representations. Curse of Dimensionality: learning algorithms require reasonably good sampling, so dimensionality reduction can turn an intractable learning problem into a tractable one. Dim. Red. -> Lossless? Manifold Learning assumes the existence of an intrinsic dimension, i.e., a reduced representation containing all independent variables.

Linear Dimensionality Reduction: assumes the input data is a linear function of the independent variables. Common methods: Principal Component Analysis (PCA), Multidimensional Scaling (MDS).

PCA Big Picture: linearly transform the input data in a way that maximizes signal (variance) and minimizes redundancy of signal (covariance).

PCA Simple Example: original data points, e.g. shoe size measured in both ft and cm; the first principal direction y1 provides a good approximation of the data.

PCA Simple Example (cont.): original data restored using only the first principal component.

PCA Covariance: covariance is a measure of how much two variables vary together. cov(x, y) = E[(x − E[x])(y − E[y])]; cov(x, x) = var(x). If x and y are independent, then cov(x, y) = 0.

PCA Covariance Matrix: stores the pairwise covariances of the variables; the diagonal entries are variances; symmetric, positive semi-definite. Start with m column-vector observations of n variables (X is n x m, assumed centered). The covariance is an n x n matrix: C_X = E[(X − E[X])(X − E[X])^T] = (1/m) X X^T.
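
A minimal numpy sketch (not from the slides; the sizes and names are illustrative) of building C_X from m column-vector observations of n variables and checking the stated properties:

```python
import numpy as np

# m column-vector observations of n variables: X is n x m
rng = np.random.default_rng(0)
n, m = 3, 500
X = rng.normal(size=(n, m))

# Center each variable, then C_X = (1/m) X X^T  (n x n)
Xc = X - X.mean(axis=1, keepdims=True)
C = (Xc @ Xc.T) / m

print(np.allclose(C, C.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # positive semi-definite (up to round-off)
```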

Eigendecomposition: eigenvectors v and eigenvalues λ of an n x n matrix A are pairs (v, λ) such that Av = λv. If A is a real symmetric matrix, it can be diagonalized as A = E D E^T, where E holds A's orthonormal eigenvectors and D is the diagonal matrix of A's eigenvalues. A positive semi-definite => eigenvalues non-negative.
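
The same facts in a small numpy sketch (A here is just a random symmetric PSD matrix built for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A = A @ A.T                                  # real symmetric (and PSD by construction)

evals, E = np.linalg.eigh(A)                 # columns of E are orthonormal eigenvectors
D = np.diag(evals)

print(np.allclose(A, E @ D @ E.T))           # A = E D E^T
print(np.allclose(E.T @ E, np.eye(4)))       # E is orthonormal
print(np.all(evals >= -1e-12))               # PSD => non-negative eigenvalues
```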

PCA Goal: linearly transform the input data in a way that maximizes signal (variance) and minimizes redundancy of signal (covariance). Algorithm: select the variance-maximizing direction in input space; find the next variance-maximizing direction that is orthogonal to all previously selected directions; repeat n − 1 times. Equivalently, find a transformation P such that Y = PX and C_Y is diagonalized. Solution: project the data onto the eigenvectors of C_X.

PCA Algorithm. Goal: find P where Y = PX s.t. C_Y is diagonalized. C_Y = (1/m) Y Y^T = (1/m) (PX)(PX)^T = P ((1/m) X X^T) P^T = P A P^T, where A = (1/m) X X^T = E D E^T (note: the eigenvectors in E are orthonormal). Select P = E^T, i.e., a matrix whose rows are the eigenvectors of C_X. Then C_Y = P A P^T = P (P^T D P) P^T = D (inverse = transpose for an orthonormal matrix), so C_Y is diagonalized. The PCs are the eigenvectors of C_X, and the i-th diagonal value of C_Y is the variance of X along p_i.
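
The algorithm above translates almost line-for-line into numpy. The following is a hedged sketch rather than the authors' code; the function name and test data are made up for illustration:

```python
import numpy as np

def pca(X, k):
    """PCA sketch: rows of P are the top-k eigenvectors of the covariance of X (n x m)."""
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    C = (Xc @ Xc.T) / m                      # covariance matrix C_X
    evals, E = np.linalg.eigh(C)             # ascending eigenvalues
    order = np.argsort(evals)[::-1]          # sort descending
    P = E[:, order[:k]].T                    # top-k eigenvectors as rows
    Y = P @ Xc                               # Y = P X; C_Y is (numerically) diagonal
    return Y, P, evals[order]

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200))
Y, P, variances = pca(X, k=2)
print(np.round((Y @ Y.T) / X.shape[1], 3))   # ~diagonal, entries = top-2 variances
```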

Gram Matrix (Kernel Matrix): given X, a collection of m column-vector observations of n variables, the Gram matrix is the m x m matrix of dot products of the inputs: K = X^T X. It is real, symmetric, and positive semi-definite (a similarity matrix).

Classical Multidimensional Scaling: given m objects and a dissimilarity δ_ij for each pair, find a space in which δ_ij equals the Euclidean distance. If δ_ij is a Euclidean distance, we can convert the dissimilarity matrix to a Gram matrix (or we can just start with a Gram matrix), and MDS yields the same answer as PCA.

Classical Multidimensional Scaling: convert the dissimilarity matrix to a Gram matrix K. Eigendecomposition of K: K = E D E^T = (E D^{1/2})(D^{1/2} E^T). Since K = X^T X, we can take X = D^{1/2} E^T. Reduce dimension: construct X from a subset of the eigenvectors/eigenvalues. Identical to PCA.
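
A sketch of classical MDS under the assumption that the dissimilarities are squared Euclidean distances, so that double centering recovers a Gram matrix (the function name and test data are illustrative):

```python
import numpy as np

def classical_mds(D2, k):
    """Classical MDS sketch: squared-dissimilarity matrix D2 (m x m) -> k-dim embedding."""
    m = D2.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    K = -0.5 * J @ D2 @ J                          # Gram matrix from squared distances
    evals, E = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:k]
    scale = np.sqrt(np.maximum(evals[order], 0))   # clip tiny negatives from round-off
    return E[:, order] * scale                     # rows = embedded points (X ~ E D^{1/2})

# With true Euclidean distances this recovers the PCA embedding up to rotation/sign.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
print(classical_mds(D2, k=2).shape)                # (50, 2)
```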

Limitations of Linear Methods: they cannot account for a non-linear relationship of the data in input space (two points can have a small Euclidean distance but a large geodesic distance). The data may still have a linear relationship in some feature space. Isomap: use geodesic distance to recover the manifold, where the geodesic distance is the length of the shortest curve on the manifold connecting two points of the manifold.

Local Estimation of Manifolds: small patches on a non-linear manifold look linear. Locally linear neighborhoods are defined in two ways. k-nearest neighbors: find the k nearest points to a given point. ε-ball: find all points that lie within ε of a given point.
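
Both neighborhood definitions are direct to express; a small illustrative sketch (the function names are mine):

```python
import numpy as np

def knn_neighbors(X, k):
    """Indices of the k nearest neighbors of each row of X (points x dims)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)               # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def eps_ball_neighbors(X, eps):
    """Indices of all points lying within eps of each row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return [np.flatnonzero(row <= eps) for row in D]
```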

Isomap Idea: create a weighted graph (vertices = datapoints; edges between neighbors, weighted by Euclidean distance). Distance matrix = pairwise shortest paths. Construct the d-dimensional embedding: perform MDS and eyeball the residual variance.
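
Putting the pieces together, here is a self-contained sketch of the pipeline just described (kNN graph, all-pairs shortest paths via Floyd-Warshall, then classical MDS). It assumes the neighborhood graph is connected and is meant as an illustration, not a reference implementation:

```python
import numpy as np

def isomap(X, k_neighbors, d):
    """Isomap sketch: kNN graph -> shortest-path (geodesic) distances -> classical MDS."""
    m = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Weighted neighborhood graph: keep Euclidean edges to the k nearest neighbors.
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    idx = np.argsort(D, axis=1)[:, 1:k_neighbors + 1]   # position 0 is the point itself
    for i in range(m):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                               # symmetrize
    # All-pairs shortest paths (Floyd-Warshall) approximate geodesic distances;
    # assumes the graph is connected (otherwise inf entries remain).
    for j in range(m):
        G = np.minimum(G, G[:, j:j + 1] + G[j:j + 1, :])
    # Classical MDS on squared graph distances gives the d-dimensional embedding.
    J = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * J @ (G ** 2) @ J
    evals, E = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:d]
    return E[:, order] * np.sqrt(np.maximum(evals[order], 0))
```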

Eyeballing Intrinsic Dimension.

Isomap Convergence: guaranteed to asymptotically recover convex Euclidean manifolds. For a sufficiently high density of data points, given arbitrarily small values λ1, λ2 and µ, then with probability at least 1 − µ: (1 − λ1) · geodesic distance ≤ graph distance ≤ (1 + λ2) · geodesic distance. The rate of convergence depends on the density of points and on properties of the underlying manifold (radius of curvature, branch separation).

Kernel Functions: a kernel function κ is a similarity measure between two vectors. Define a non-linear mapping from input space to a high-dimensional feature space, Φ: X → F, and define κ such that κ(x, y) = ⟨Φ(x), Φ(y)⟩. Efficiency: κ may be much more efficient to compute than the mapping plus the dot product in the high-dimensional space. Flexibility: κ can be chosen arbitrarily so long as it is positive definite symmetric.

Positive Definite Symmetric (PDS) Kernels: given m column-vector observations of n variables, the kernel matrix is the m x m matrix in which K_ij = κ(x_i, x_j). The kernel κ is PDS if K is symmetric and positive semi-definite. If K is positive semi-definite, then κ is the dot product in some dot-product space (the feature space).
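
A quick numerical check of the PDS property for one standard kernel (the Gaussian/RBF kernel, which is known to be PDS); gamma and the sample are arbitrary illustrative choices:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix K_ij = kappa(x_i, x_j) for the Gaussian/RBF kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
K = rbf_kernel_matrix(X)
print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # positive semi-definite (numerically)
```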

Kernel Trick: for any algorithm relying solely on dot products, we can replace the dot product with a positive-definite kernel. This allows for non-linearity. Example: PCA.

Kernel PCA. PCA: the eigenvectors of the covariance matrix are the principal components, and the eigenproblem can be rewritten solely with dot products. Kernel PCA: in feature space the covariance is C = (1/m) Σ_j Φ(x_j) Φ(x_j)^T and we solve λ v = C v. Every solution v lies in the span of {Φ(x_1), ..., Φ(x_m)}, so write v = Σ_i α_i Φ(x_i) and multiply both sides by Φ(x_k): λ ⟨Φ(x_k), v⟩ = ⟨Φ(x_k), C v⟩ for all k.

Kernel PCA (cont.): substituting v = Σ_i α_i Φ(x_i) and K_ij = ⟨Φ(x_i), Φ(x_j)⟩ into the previous equation gives m λ K α = K² α, whose relevant solutions satisfy m λ α = K α, where K is the kernel matrix.

Kernel PCA (cont.): K is the kernel (Gram) matrix. Use the eigendecomposition of K to find the eigenvectors α. Project test points in F onto a subset of the eigenvectors (dimensionality reduction).
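
A compact sketch of the procedure just described: center the kernel matrix, eigendecompose it, and embed the sample points using the eigenvectors. The centering step and the sqrt(mu) normalization follow the standard kernel PCA recipe and go slightly beyond what the slide spells out:

```python
import numpy as np

def kernel_pca(K, d):
    """Kernel PCA sketch: K_ij = kappa(x_i, x_j) on the sample -> d-dim embeddings (rows)."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J                            # center the data in feature space
    mu, A = np.linalg.eigh(Kc)                # kernel-matrix eigenvalues mu = m * lambda_hat
    order = np.argsort(mu)[::-1][:d]
    mu, A = mu[order], A[:, order]
    # Normalizing so the feature-space eigenvectors have unit norm, the projection of
    # training point x_i onto eigenvector j is sqrt(mu_j) * A[i, j].
    return A * np.sqrt(np.maximum(mu, 0))

# With a linear kernel K = X X^T this reproduces ordinary PCA scores.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
print(kernel_pca(X @ X.T, d=2).shape)         # (40, 2)
```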

Theory behind dimensionality reduction? Dimensionality reduction has gained popularity since Isomap and LLE were published, but there is not much theory behind it (Isomap is an exception). Assuming the existence of an underlying manifold: do the various dim. red. algorithms converge to the correct manifold? What is the rate of convergence, i.e., given an input X of m points, how close is d_red(X) to the underlying manifold?

Why focus on KPCA? It is a generalization of dimensionality reduction: LLE and Isomap are forms of KPCA. Residual variance is an intuitive measurement of accuracy. The limit is clear and provable: given an underlying manifold of dimension k, as m approaches infinity the residual variance approaches 0. The paper also uses residual variance to measure dim. red. accuracy in the finite case.

What we're interested in: Residual Variance = Σ_{i>k} λ_i; Captured Variance = Σ_{i≤k} λ_i. This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues.

Empirical eigenvalues: perform (kernel) PCA on a sample S of m points; the empirical eigenproblem is the kernel-matrix problem K α = μ α. Note: the μ_i are the eigenvalues of the kernel matrix on S, and λ̂_i = μ_i / m are the eigenvalues of the sample covariance C_S.

Process eigenvalues: the empirical eigenproblem is (1/m) Σ_j κ(x_i, x_j) v(x_j) = (μ/m) v(x_i). As m approaches infinity, this becomes ∫_X κ(x, y) v(x) p(x) dx = λ v(y) for a given kernel function κ and density p on a space X. Thus μ_i / m is an estimate for the process eigenvalue λ_i.
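
The claim that μ_i / m estimates the process eigenvalue λ_i can be illustrated numerically: sample m points from a fixed density, build the kernel matrix, and watch μ_i / m stabilize as m grows (the kernel, density, and sizes below are arbitrary choices):

```python
import numpy as np

def empirical_process_estimates(m, d=5, gamma=1.0, seed=0):
    """Return mu_i / m for the top-d kernel-matrix eigenvalues on an m-point sample."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, 2))               # density p = standard normal on R^2
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                   # RBF kernel matrix
    mu = np.sort(np.linalg.eigvalsh(K))[::-1]
    return mu[:d] / m

# Estimates stabilize as m grows, illustrating mu_i / m -> lambda_i.
for m in (50, 200, 800):
    print(m, np.round(empirical_process_estimates(m), 4))
```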

Projections onto Subspaces: P_V denotes projection onto the subspace V; P_V^⊥ denotes projection onto the orthogonal complement of V. ||P_V^⊥(Φ(x))|| is the residual of the projection onto V: the distance between the original point and its projection.

Eigenvalues and Projections: λ₁(K_q) = max_{v ∈ F} E_q[ ||P_v(Φ(x))||² ], and the maximum is attained when v is the first eigenvector of K_q, so the first eigenvalue of the operator K_q equals the expected squared norm of the projection onto its first eigenvector. Equivalently, min_{v ∈ F} E_q[ ||P_v^⊥(Φ(x))||² ] = E_q[ ||Φ(x)||² ] − λ₁(K_q); intuition: the first eigenvector is the direction for which the expected square of the residual is minimal. Here q defines the distribution used in K_q, and the general formula applies to both the empirical and the process cases.

Empirical/Process Expectations onto Empirical/Process Subspaces: the first two equations follow from the last slide: E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i≤k} λ_i and Ê[ ||P_{V̂_k}(Φ(x))||² ] = (1/m) Σ_{i≤k} μ_i. E[ ||P_{V̂_k}(Φ(x))||² ]: the average, over the entire distribution, of the squared norm of the projection onto the first k empirical eigenvectors. Ê[ ||P_{V_k}(Φ(x))||² ]: the empirical average of the squared norm for points in S projected onto the first k process eigenvectors.

Two simple inequalities. V̂_k is the best solution for the empirical data S: Ê[ ||P_{V_k}(Φ(x))||² ] ≤ Ê[ ||P_{V̂_k}(Φ(x))||² ] = (1/m) Σ_{i≤k} μ_i. V_k is the best solution for the underlying process: E[ ||P_{V̂_k}(Φ(x))||² ] ≤ E[ ||P_{V_k}(Φ(x))||² ] = Σ_{i≤k} λ_i. Goal of the paper: show that this chain of inequalities is tight, i.e., bound the difference between its first and last terms.

What we're interested in: Residual Variance = Σ_{i>k} λ_i = E[ ||P_{V_k}^⊥(Φ(x))||² ]; Captured Variance = Σ_{i≤k} λ_i = E[ ||P_{V_k}(Φ(x))||² ]. This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues.

And now a first Bound: if we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 − δ over random m-samples S, if new data is projected onto V̂_k, the sum of the k largest process eigenvalues (the captured variance) is bounded by: Σ_{i≤k} λ_i ≥ E[ ||P_{V̂_k}(Φ(x))||² ] ≥ max_{1≤l≤k} [ (1/m) Σ_{i≤l} μ_i(S) − ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ] − R² √( (19/m) ln(2(m+1)/δ) ), where the support of the distribution is in a ball of radius R in the feature space.

And now a first Bound (cont.). First term: max_{1≤l≤k} [ (1/m) Σ_{i≤l} μ_i(S) − ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ]. There is a tradeoff within this term: as l increases, the captured variance increases, but so does the ratio l/m. For well-behaved kernels (those for which the dot product is bounded), the square-root factor should be a constant. Second term: R² √( (19/m) ln(2(m+1)/δ) ); it carries the dependence on the confidence parameter δ and the distribution radius R.

The second bound: if we perform PCA in the feature space defined by κ(x, y), then with probability greater than 1 − δ over random m-samples S, if new data is projected onto V̂_k, the expected squared residual is bounded by: Σ_{i>k} λ_i ≤ E[ ||P_{V̂_k}^⊥(Φ(x))||² ] ≤ min_{1≤l≤k} [ (1/m) Σ_{i>l} μ_i(S) + ((1 + √l)/√m) √( (2/m) Σ_i κ(x_i, x_i)² ) ] + R² √( (18/m) ln(2(m+1)/δ) ), where the support of the distribution is in a ball of radius R in the feature space.
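
For a rough feel for these expressions, the sketch below plugs the empirical kernel-matrix eigenvalues into the two bounds exactly as reconstructed on the last few slides (the 18/19 constants and the log term follow that reconstruction; delta, the RBF kernel, and R = 1, valid for RBF since κ(x, x) = 1, are illustrative choices):

```python
import numpy as np

def kpca_bounds(K, k, delta=0.05, R=1.0):
    """Evaluate the two generalisation bounds from the empirical eigenvalues of K (m x m)."""
    m = K.shape[0]
    mu = np.sort(np.linalg.eigvalsh(K))[::-1]             # mu_1 >= mu_2 >= ...
    diag_term = np.sqrt(2.0 / m * np.sum(np.diag(K) ** 2))
    ls = np.arange(1, k + 1)
    captured_emp = np.cumsum(mu)[:k] / m                   # (1/m) sum_{i<=l} mu_i
    residual_emp = (np.sum(mu) - np.cumsum(mu)[:k]) / m    # (1/m) sum_{i>l} mu_i
    slack = (1 + np.sqrt(ls)) / np.sqrt(m) * diag_term
    conf = np.log(2 * (m + 1) / delta)
    lower = np.max(captured_emp - slack) - R**2 * np.sqrt(19.0 / m * conf)
    upper = np.min(residual_emp + slack) + R**2 * np.sqrt(18.0 / m * conf)
    return lower, upper   # lower bound on captured variance, upper bound on residual

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq)                                            # RBF kernel: kappa(x, x) = 1
print(kpca_bounds(K, k=3))
```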

Next steps: How tight are these bounds? Can we do better? Can we use these bounds to compare existing dimensionality reduction algorithms? Can we construct a kernel that maximizes the tightness of this bound?