MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 04: Features and Kernels
Lorenzo Rosasco
Linear functions

Let $\mathcal{H}_{\mathrm{lin}}$ be the space of linear functions $f(x) = w^\top x$. Then:
- the map $f \mapsto w$ is one to one,
- inner product: $\langle f, \bar f \rangle_{\mathcal{H}} := w^\top \bar w$,
- norm/metric: $\| f - \bar f \|_{\mathcal{H}} := \| w - \bar w \|$.
An observation

The function norm controls point-wise convergence. Since
$$|f(x) - \bar f(x)| \le \|x\| \, \|w - \bar w\|, \quad \forall x \in X,$$
we have $w_j \to w \ \Rightarrow \ f_j(x) \to f(x)$, $\forall x \in X$.
ERM

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2, \quad \lambda \ge 0$$
- $\lambda = 0$: ordinary least squares (bias to minimal norm),
- $\lambda > 0$: ridge regression (stable).
Computations

Let $X_n \in \mathbb{R}^{n \times d}$ and $\hat Y \in \mathbb{R}^n$. The ridge regression solution is
$$\hat w_\lambda = (X_n^\top X_n + n\lambda I)^{-1} X_n^\top \hat Y \qquad \text{time } O(nd^2 \vee d^3), \ \text{mem. } O(nd \vee d^2),$$
but also
$$\hat w_\lambda = X_n^\top (X_n X_n^\top + n\lambda I)^{-1} \hat Y \qquad \text{time } O(dn^2 \vee n^3), \ \text{mem. } O(nd \vee n^2).$$
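A minimal NumPy sketch (not from the lecture; the data, sizes, and $\lambda$ are made up) verifying that the two formulas above return the same $\hat w_\lambda$:

```python
import numpy as np

n, d, lam = 200, 10, 0.1
rng = np.random.default_rng(0)
Xn = rng.standard_normal((n, d))   # data matrix, one row per point
Y = rng.standard_normal(n)         # labels

# d-by-d system: O(nd^2 + d^3) time, O(nd + d^2) memory
w1 = np.linalg.solve(Xn.T @ Xn + n * lam * np.eye(d), Xn.T @ Y)

# n-by-n system: O(dn^2 + n^3) time, O(nd + n^2) memory
w2 = Xn.T @ np.linalg.solve(Xn @ Xn.T + n * lam * np.eye(n), Y)

print(np.allclose(w1, w2))   # True: the two solutions agree
```

The first form is preferable when $d \ll n$, the second when $n \ll d$.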
Representer theorem in disguise

We noted that
$$\hat w_\lambda = X_n^\top c = \sum_{i=1}^n x_i c_i \quad \Rightarrow \quad \hat f_\lambda(x) = \sum_{i=1}^n x^\top x_i \, c_i,$$
with
$$c = (X_n X_n^\top + n\lambda I)^{-1} \hat Y, \qquad (X_n X_n^\top)_{ij} = x_i^\top x_j.$$
Limits of linear functions: Regression [figure]
Limits of linear functions: Classification [figure]
Nonlinear functions

Two main possibilities:
$$f(x) = w^\top \Phi(x) \qquad \text{or} \qquad f(x) = \Phi(w^\top x),$$
where $\Phi$ is a nonlinear map. The former choice leads to linear spaces of functions (the spaces are linear, NOT the functions!). The latter choice can be iterated:
$$f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \cdots \Phi(w_1^\top x))).$$
Features and feature maps

$$f(x) = w^\top \Phi(x), \qquad \Phi : X \to \mathbb{R}^p, \quad \Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x)),$$
with $\varphi_j : X \to \mathbb{R}$ for $j = 1, \ldots, p$. Note that $X$ need not be $\mathbb{R}^d$. We can also write
$$f(x) = \sum_{j=1}^p w_j \varphi_j(x).$$
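As a toy illustration (an assumed example, not from the slides), a monomial feature map on $X = \mathbb{R}$ turns a model that is linear in $w$ into a polynomial in $x$:

```python
import numpy as np

def Phi(x, p=4):
    """Hypothetical feature map on X = R: monomials (1, x, ..., x^{p-1})."""
    return np.array([x ** j for j in range(p)])

w = np.array([1.0, 0.0, -2.0, 0.5])   # an arbitrary weight vector
f = lambda x: w @ Phi(x)              # nonlinear in x, linear in w
print(f(1.5))                         # 1 - 2 * 1.5**2 + 0.5 * 1.5**3
```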
Geometric view

$f(x) = w^\top \Phi(x)$ [figure]
An example [figure]
More examples

The equation
$$f(x) = w^\top \Phi(x) = \sum_{j=1}^p w_j \varphi_j(x)$$
suggests thinking of features as some form of basis. Indeed we can consider the Fourier basis, wavelets and their variations, ... (see the sketch below).
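A minimal sketch of Fourier features on $[0, 1]$ (the interval, frequency cut-off, and normalization are assumptions for illustration):

```python
import numpy as np

def fourier_features(x, m=3):
    """Constant plus sin/cos pairs up to frequency m: 2m + 1 features."""
    feats = [np.ones_like(x)]
    for j in range(1, m + 1):
        feats += [np.sin(2 * np.pi * j * x), np.cos(2 * np.pi * j * x)]
    return np.stack(feats, axis=-1)

x = np.linspace(0, 1, 5)
print(fourier_features(x).shape)   # (5, 7): five points, seven features
```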
And even more examples

Any set of functions $\varphi_j : X \to \mathbb{R}$, $j = 1, \ldots, p$, can be considered. Feature design/engineering:
- vision: SIFT, HOG
- audio: MFCC
- ...
Nonlinear functions using features

Let $\mathcal{H}_\Phi$ be the space of linear functions $f(x) = w^\top \Phi(x)$. Then:
- the map $f \mapsto w$ is one to one, if the $(\varphi_j)_j$ are linearly independent,
- inner product: $\langle f, \bar f \rangle_{\mathcal{H}_\Phi} := w^\top \bar w$,
- norm/metric: $\| f - \bar f \|_{\mathcal{H}_\Phi} := \| w - \bar w \|$.
In this case
$$|f(x) - \bar f(x)| \le \|\Phi(x)\| \, \|w - \bar w\|, \quad \forall x \in X.$$
Back to ERM

$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n (y_i - w^\top \Phi(x_i))^2 + \lambda \|w\|^2, \quad \lambda \ge 0,$$
equivalent to
$$\min_{f \in \mathcal{H}_\Phi} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|^2_{\mathcal{H}_\Phi}, \quad \lambda \ge 0.$$
Computations using features

Let $\hat\Phi \in \mathbb{R}^{n \times p}$ with $(\hat\Phi)_{ij} = \varphi_j(x_i)$. The ridge regression solution is
$$\hat w_\lambda = (\hat\Phi^\top \hat\Phi + n\lambda I)^{-1} \hat\Phi^\top \hat Y \qquad \text{time } O(np^2 \vee p^3), \ \text{mem. } O(np \vee p^2),$$
but also
$$\hat w_\lambda = \hat\Phi^\top (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y \qquad \text{time } O(pn^2 \vee n^3), \ \text{mem. } O(np \vee n^2).$$
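Putting the previous slides together, a sketch of ridge regression in feature space (the monomial features, data, and $\lambda$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 4, 0.01
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

# (Phi_hat)_ij = phi_j(x_i), here monomial features x^0, ..., x^{p-1}
Phi_hat = np.stack([x ** j for j in range(p)], axis=1)
w = np.linalg.solve(Phi_hat.T @ Phi_hat + n * lam * np.eye(p), Phi_hat.T @ y)
f_hat = lambda t: np.stack([t ** j for j in range(p)], axis=-1) @ w
print(f_hat(np.array([0.0, 0.5])))   # predictions at two test points
```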
Representer theorem a little less in disguise

Analogously to before,
$$\hat w_\lambda = \hat\Phi^\top c = \sum_{i=1}^n \Phi(x_i) c_i \quad \Rightarrow \quad \hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) \, c_i,$$
with
$$c = (\hat\Phi \hat\Phi^\top + n\lambda I)^{-1} \hat Y, \qquad (\hat\Phi \hat\Phi^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j),$$
and
$$\Phi(x)^\top \Phi(\bar x) = \sum_{s=1}^p \varphi_s(x) \varphi_s(\bar x).$$
Unleash the features

Can we consider linearly dependent features? Can we consider $p = \infty$?
An observation

For $X = \mathbb{R}$ consider
$$\varphi_j(x) = x^{j-1} e^{-x^2/\gamma} \sqrt{\frac{(2/\gamma)^{j-1}}{(j-1)!}}, \quad j = 1, 2, \ldots$$
(so $\varphi_1(x) = e^{-x^2/\gamma}$). Then
$$\sum_{j=1}^\infty \varphi_j(x) \varphi_j(\bar x) = e^{-x^2/\gamma} e^{-\bar x^2/\gamma} \sum_{j=1}^\infty \frac{(2/\gamma)^{j-1}}{(j-1)!} (x \bar x)^{j-1} = e^{-x^2/\gamma} e^{-\bar x^2/\gamma} e^{2 x \bar x/\gamma} = e^{-(x - \bar x)^2/\gamma}.$$
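A numerical sanity check of this computation (a sketch, not from the slides; the evaluation points and $\gamma$ are arbitrary): truncating the series already matches the closed form.

```python
import numpy as np
from math import factorial

def phi(x, j, gamma):
    """j-th feature: x^{j-1} e^{-x^2/gamma} sqrt((2/gamma)^{j-1} / (j-1)!)."""
    return (x ** (j - 1) * np.exp(-x ** 2 / gamma)
            * np.sqrt((2 / gamma) ** (j - 1) / factorial(j - 1)))

x, xbar, gamma = 0.7, -0.3, 1.0
series = sum(phi(x, j, gamma) * phi(xbar, j, gamma) for j in range(1, 30))
closed = np.exp(-(x - xbar) ** 2 / gamma)
print(series, closed)   # the two values agree to machine precision
```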
From features to kernels

$$\Phi(x)^\top \Phi(\bar x) = \sum_{j=1}^\infty \varphi_j(x) \varphi_j(\bar x) = k(x, \bar x)$$
We might be able to compute the series in closed form. The function $k$ is called a kernel. Can we run ridge regression?
Kernel ridge regression

We have
$$\hat f_\lambda(x) = \sum_{i=1}^n \Phi(x)^\top \Phi(x_i) \, c_i = \sum_{i=1}^n k(x, x_i) \, c_i,$$
$$c = (\hat K + n\lambda I)^{-1} \hat Y, \qquad (\hat K)_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j).$$
$\hat K$ is the kernel matrix, the Gram (inner products) matrix of the data. This is the kernel trick (see the sketch below).
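A minimal kernel ridge regression sketch with the Gaussian kernel (the data, $\gamma$, and $\lambda$ are assumptions; the formulas mirror the slide above):

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    """k(x, xbar) = exp(-||x - xbar||^2 / gamma), all rows of A against B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma)

rng = np.random.default_rng(0)
n, lam = 100, 1e-3
X = rng.uniform(-2, 2, (n, 1))
Y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = gauss_kernel(X, X)                             # (K)_ij = k(x_i, x_j)
c = np.linalg.solve(K + n * lam * np.eye(n), Y)    # c = (K + n*lam*I)^{-1} Y
f_hat = lambda X_new: gauss_kernel(X_new, X) @ c   # f(x) = sum_i k(x, x_i) c_i
print(f_hat(np.array([[0.5]])))                    # prediction at a test point
```

Note the features never appear: only the kernel is needed.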
Kernels

Can we start from kernels instead of features? Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?
Positive definite kernels

A function $k : X \times X \to \mathbb{R}$ is called positive definite if the matrix $\hat K$, $(\hat K)_{ij} = k(x_i, x_j)$, is positive semidefinite for all choices of points $x_1, \ldots, x_n$, i.e.
$$a^\top \hat K a \ge 0, \quad \forall a \in \mathbb{R}^n.$$
Equivalently,
$$\sum_{i,j=1}^n k(x_i, x_j) a_i a_j \ge 0$$
for any $a_1, \ldots, a_n \in \mathbb{R}$, $x_1, \ldots, x_n \in X$.
Inner product kernels are pos. def.

Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and $k(x, \bar x) = \Phi(x)^\top \Phi(\bar x)$. Note that
$$\sum_{i,j=1}^n k(x_i, x_j) a_i a_j = \sum_{i,j=1}^n \Phi(x_i)^\top \Phi(x_j) a_i a_j = \Bigl\| \sum_{i=1}^n \Phi(x_i) a_i \Bigr\|^2 \ge 0.$$
Clearly $k$ is symmetric.
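A numeric illustration of this argument (a sketch with a random, assumed feature map): the kernel quadratic form equals the squared norm of $\sum_i \Phi(x_i) a_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
Phi_pts = rng.standard_normal((n, p))   # rows play the role of Phi(x_1..n)
a = rng.standard_normal(n)

K = Phi_pts @ Phi_pts.T                 # K_ij = Phi(x_i)^T Phi(x_j)
quad = a @ K @ a                        # sum_ij k(x_i, x_j) a_i a_j
norm2 = np.sum((Phi_pts.T @ a) ** 2)    # || sum_i Phi(x_i) a_i ||^2
print(np.isclose(quad, norm2), quad >= 0)   # True True
```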
But there are many pos. def. kernels

Classic examples:
- linear: $k(x, \bar x) = x^\top \bar x$
- polynomial: $k(x, \bar x) = (x^\top \bar x + 1)^s$
- Gaussian: $k(x, \bar x) = e^{-\|x - \bar x\|^2/\gamma}$
But one can also consider:
- kernels on probability distributions
- kernels on strings
- kernels on functions
- kernels on groups
- kernels on graphs
- ...
It is natural to think of a kernel as a measure of similarity. (The classic examples are sketched below.)
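The three classic kernels as one-liners (a sketch; the parameter defaults are arbitrary):

```python
import numpy as np

# classic kernels for vectors x, xb in R^d
linear   = lambda x, xb: x @ xb
poly     = lambda x, xb, s=3: (x @ xb + 1) ** s
gaussian = lambda x, xb, gamma=1.0: np.exp(-np.sum((x - xb) ** 2) / gamma)

x, xb = np.array([1.0, 0.0]), np.array([0.5, -0.5])
print(linear(x, xb), poly(x, xb), gaussian(x, xb))
```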
From pos. def. kernels to functions

Let $X$ be any set. Given a pos. def. kernel $k$, consider the space $\mathcal{H}_k$ of functions
$$f(x) = \sum_{i=1}^N k(x, x_i) a_i$$
for any $a_1, \ldots, a_N \in \mathbb{R}$, $x_1, \ldots, x_N \in X$ and any $N \in \mathbb{N}$. Define an inner product on $\mathcal{H}_k$ by
$$\langle f, \bar f \rangle_{\mathcal{H}_k} = \sum_{i=1}^N \sum_{j=1}^{\bar N} k(x_i, \bar x_j) a_i \bar a_j.$$
$\mathcal{H}_k$ can be completed to a Hilbert space.
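A sketch of this inner product for two concrete elements of $\mathcal{H}_k$ (Gaussian kernel, points, and coefficients all assumed for illustration):

```python
import numpy as np

k = lambda x, xb, gamma=1.0: np.exp(-np.sum((x - xb) ** 2) / gamma)

# f = a_1 k(., x_1) + a_2 k(., x_2) and fbar = abar_1 k(., xbar_1)
x_pts = np.array([[0.0], [1.0]]); a = np.array([1.0, -0.5])
xb_pts = np.array([[0.5]]);       ab = np.array([2.0])

inner = sum(a[i] * ab[j] * k(x_pts[i], xb_pts[j])
            for i in range(len(a)) for j in range(len(ab)))
print(inner)   # <f, fbar>_{H_k} as a finite double sum
```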
An illustration

[Figure: functions defined by Gaussian kernels with large and small widths.]

A key result

Theorem. Given a pos. def. $k$ there exists $\Phi$ s.t.
$$k(x, \bar x) = \langle \Phi(x), \Phi(\bar x) \rangle_{\mathcal{H}_k}$$
and $\mathcal{H}_\Phi \simeq \mathcal{H}_k$. Roughly speaking,
$$f(x) = w^\top \Phi(x) \quad \Leftrightarrow \quad f(x) = \sum_{i=1}^N k(x, x_i) a_i.$$
From features and kernels to RKHS and beyond

$\mathcal{H}_k$ and $\mathcal{H}_\Phi$ have many properties, characterizations, and connections:
- reproducing property
- reproducing kernel Hilbert spaces (RKHS)
- Mercer theorem (Karhunen-Loève expansion)
- Gaussian processes
- Cameron-Martin spaces
Reproducing property

Note that by definition of $\mathcal{H}_k$:
- $k_x = k(x, \cdot)$ is a function in $\mathcal{H}_k$,
- for all $f \in \mathcal{H}_k$, $x \in X$:
$$f(x) = \langle f, k_x \rangle_{\mathcal{H}_k},$$
called the reproducing property (see the sketch below). Note that
$$|f(x) - \bar f(x)| \le \|k_x\|_{\mathcal{H}_k} \, \|f - \bar f\|_{\mathcal{H}_k}, \quad \forall x \in X.$$
The above observations have a converse.
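Unpacking the definitions numerically (a sketch; the kernel, points, and coefficients are arbitrary): for $f = \sum_i a_i k(\cdot, x_i)$, the inner product with $k_{x_0}$ reduces, by the double-sum formula two slides back, to $\sum_i a_i k(x_i, x_0)$, which is exactly $f(x_0)$.

```python
import numpy as np

k = lambda x, xb: np.exp(-(x - xb) ** 2)   # Gaussian kernel on R, gamma = 1
x_pts = np.array([-1.0, 0.0, 2.0])
a = np.array([0.3, -1.2, 0.7])

f = lambda x: np.sum(a * k(x_pts, x))      # f = sum_i a_i k(., x_i)

# k_{x0} has a single point x0 with coefficient 1, so the H_k inner product
# <f, k_{x0}> is sum_i a_i k(x_i, x0), i.e. f evaluated at x0
x0 = 0.4
inner = np.sum(a * k(x_pts, x0))
print(np.isclose(f(x0), inner))   # True: the reproducing property
```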
RKHS

Definition. A RKHS $\mathcal{H}$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t. $k_x = k(x, \cdot) \in \mathcal{H}$ and $f(x) = \langle f, k_x \rangle_{\mathcal{H}}$.

Theorem. If $\mathcal{H}$ is a RKHS, then $k$ is pos. def.
Evaluation functionals in a RKHS

If $\mathcal{H}$ is a RKHS then the evaluation functionals $e_x(f) = f(x)$ are continuous, i.e.
$$|e_x(f) - e_x(\bar f)| \le \|k_x\|_{\mathcal{H}} \, \|f - \bar f\|_{\mathcal{H}}, \quad \forall x \in X,$$
since $e_x(f) = \langle f, k_x \rangle_{\mathcal{H}}$.
Note that $L^2(\mathbb{R}^d)$ or $C(\mathbb{R}^d)$ do not have this property!
Alternative RKHS definition

It turns out that the previous property also characterizes a RKHS.

Theorem. A Hilbert space of functions with continuous evaluation functionals is a RKHS.
Summing up

From linear to nonlinear functions:
- using features
- using kernels
plus:
- pos. def. functions
- reproducing property
- RKHS