Hilbert Space Methods in Learning

Guest lecturer: Risi Kondor. 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003.

1. A general formulation of the learning problem
   Empirical and true errors; overfitting
   Error bounds and what they tell us about the design of algorithms
2. Hilbert space methods
   Reproducing Kernel Hilbert Spaces
   Kernels
   Algorithms: SVM, Gaussian Processes, Kernel PCA

Tutorial online at http://www.cs.columbia.edu/risi/notes/tutorial6672.pdf

The Learning Problem

Regression
Learn a function $f : x \mapsto y$.
Linear functions, order-$p$ polynomials, splines, etc.
Examples: Boston housing problem, robot grasps, motorcycle data, etc.

Classification
Separate $+1$ labeled points from $-1$ labeled points.
Examples: face recognition, DNA splice site identification, document classification, call type classification.

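Supervised learning
Input space: $\mathcal X$, e.g. $\mathcal X = \mathbb R^n$.
Output space: $\mathcal Y$, with $\mathcal Y = \{-1, +1\}$ for classification and $\mathcal Y = \mathbb R$ for regression.
Training set: $S = \big((x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\big)$ with $x_i \in \mathcal X$, $y_i \in \mathcal Y$.
"Truth":
  Deterministic: $y = f_0(x)$
  Probabilistic: $y \sim p(y \mid x)$ (more general)
Goal: construct a hypothesis $f : \mathcal X \to \mathcal Y$ to predict $y$ given $x$.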

The Empirical Risk
Empirical risk (training error):
$$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$
where $L : \mathcal Y \times \mathcal Y \to \mathbb R$ is the loss function.
Zero-one loss for classification: $L(\hat y, y) = 1$ if $\hat y \neq y$, and $0$ otherwise.
Squared error loss for regression: $L(\hat y, y) = (y - \hat y)^2$.
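The empirical risk is just an average of per-example losses, which a few lines of Python make concrete. This is a minimal sketch; the toy sample `S` and the threshold classifier `sign_classifier` are invented for illustration and do not come from the lecture.

```python
from typing import Callable, Sequence, Tuple

def zero_one_loss(y_hat: float, y: float) -> float:
    """Zero-one loss: 1 if the prediction differs from the label, else 0."""
    return 1.0 if y_hat != y else 0.0

def squared_loss(y_hat: float, y: float) -> float:
    """Squared error loss (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(
    f: Callable[[float], float],
    sample: Sequence[Tuple[float, float]],
    loss: Callable[[float, float], float],
) -> float:
    """R_emp[f]: mean loss of hypothesis f over the training sample."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)

# Hypothetical 1-D training set: (x_i, y_i) pairs with labels in {-1, +1}.
S = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (-0.1, 1)]

def sign_classifier(x: float) -> int:
    """Hypothetical hypothesis: predict +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1

training_error = empirical_risk(sign_classifier, S, zero_one_loss)
```

Swapping `zero_one_loss` for `squared_loss` gives the regression version with no other change.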

A Bad Learning Algorithm (memorization algorithm)
Set
$$f(x) = \begin{cases} +1 & \text{when } x = x_i \text{ and } y_i = 1 \\ -1 & \text{otherwise.} \end{cases}$$
For zero-one loss this gives perfect performance on the training data:
$$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i) = 0.$$
Will it generalize well to testing examples? Why not?
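The failure of the memorization algorithm is easy to see numerically. The sketch below assumes a hypothetical ground truth `truth` (a threshold at 0) and uniformly drawn inputs; both are illustrative choices, not part of the lecture.

```python
import random

def make_memorizer(sample):
    """Return f with f(x) = +1 iff x is a training input labeled +1, else -1."""
    positives = {x for x, y in sample if y == 1}
    def f(x):
        return 1 if x in positives else -1
    return f

def truth(x):
    """Hypothetical deterministic truth f_0 (assumption for this demo)."""
    return 1 if x >= 0 else -1

def zero_one_error(f, sample):
    """Fraction of examples on which f disagrees with the label."""
    return sum(f(x) != y for x, y in sample) / len(sample)

rng = random.Random(0)
train = [(x, truth(x)) for x in (rng.uniform(-1, 1) for _ in range(50))]
test = [(x, truth(x)) for x in (rng.uniform(-1, 1) for _ in range(1000))]

memorizer = make_memorizer(train)
train_error = zero_one_error(memorizer, train)  # zero by construction
test_error = zero_one_error(memorizer, test)    # close to P(y = +1)
```

On fresh inputs the memorizer almost never sees a stored point, so it predicts $-1$ everywhere and errs on essentially every positive example.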

The True Risk
Assume some distribution on inputs: $p(x)$.
Distribution on $(x, y)$ examples:
$$p(x, y) = \delta(y - f_0(x))\, p(x) \quad\text{or}\quad p(x, y) = p(y \mid x)\, p(x).$$
True risk:
$$R[f] = E\,[L(f(x), y)] = \int_{\mathcal X \times \mathcal Y} p(x, y)\, L(f(x), y)\, dx\, dy.$$
This is what we really want to minimize in discriminative learning.

True Risk vs. Empirical Risk
$$R[f] = E\,[L(f(x), y)] \qquad R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$
Just minimizing $R_{\mathrm{emp}}$ is BAD (see previous algorithm).
Optimizing the training error at the expense of the testing error is called overfitting.
But we do not know $p(x, y)$!!! Can we still do anything?

Bounding the True Risk
$$R[f] = E\,[L(f(x), y)] \qquad R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)$$
For many practical learning algorithms there are uniform error bounds: for any distribution $D$, with probability $1 - \delta$ (over the choice of training set),
$$R[f] - R_{\mathrm{emp}}[f] \le \epsilon$$
for all hypotheses $f \in \mathcal F$ simultaneously.
PAC bound: "probably approximately correct".

KEY CONCEPT: Capacity Control
$$P\big[\, R[f] - R_{\mathrm{emp}}[f] \le \epsilon \,\big] \ge 1 - \delta$$
Generally, $\epsilon$ is a complicated function of $\delta$, depending crucially on the hypothesis class $\mathcal F$ that $f$ is chosen from.
Compromise:
  model flexibility (large $\mathcal F$, complexity)
  versus
  generalization performance (want small $\epsilon$, generality)

Capacity control
[Figure panels: "Too inflexible?", "Just right.", "Overfitting?"]
How do we quantify complexity of f?

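Uniform Error Bounds
$$P\big[\, R[f] - R_{\mathrm{emp}}[f] \le \epsilon \ \ \forall f \in \mathcal F \,\big] \ge 1 - \delta,$$
that is,
$$P\Big[\, \sup_{f \in \mathcal F} \big[ R[f] - R_{\mathrm{emp}}[f] \big] \le \epsilon \,\Big] \ge 1 - \delta.$$
Not equivalent to:
$$P\big[\, R[f] - R_{\mathrm{emp}}[f] \le \epsilon \,\big] \ge 1 - \delta \quad \forall f \in \mathcal F.$$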

Vapnik-Chervonenkis type bounds
With probability $1 - \delta$,
$$\sup_{f \in \mathcal F} \big[ R[f] - R_{\mathrm{emp}}[f] \big] \le \sqrt{\frac{h\,(\log(2m/h) + 1) - \log(\delta/4)}{m}}$$
where $h$ is the VC dimension of $\mathcal F$.
Linear discriminators in $\mathbb R^n$: $h = n + 1$.
... with margin $\gamma$ in a ball of radius $D$: $h = \min\big(n,\ D^2/\gamma^2\big) + 1$.
Large margin is good!
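To get a feel for the numbers, the VC confidence term can be evaluated directly. This is a small sketch assuming the standard square-root form of the bound, $\epsilon = \sqrt{(h(\log(2m/h) + 1) - \log(\delta/4))/m}$; the function names are illustrative.

```python
import math

def vc_epsilon(h: float, m: int, delta: float) -> float:
    """Confidence term of the VC bound for VC dimension h, sample size m."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(delta / 4)) / m)

def margin_vc_dimension(n: int, D: float, gamma: float) -> float:
    """VC dimension min(n, D^2 / gamma^2) + 1 of margin-gamma linear
    discriminators in a ball of radius D, as quoted on the slide."""
    return min(n, D ** 2 / gamma ** 2) + 1

# Linear discriminators in R^10 (h = 11) with 1000 training examples.
eps_small_sample = vc_epsilon(11, 1000, 0.05)
eps_large_sample = vc_epsilon(11, 100000, 0.05)

# In R^100, a margin of 0.5 inside a unit ball caps h at 5 instead of 101.
h_margin = margin_vc_dimension(100, 1.0, 0.5)
```

The second call shows why large margin is good: the bound depends on $h$, and a large margin makes $h$ independent of the ambient dimension.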

Covering number bounds
With probability $1 - \delta$,
$$\sup_{f \in \mathcal F} \big[ R[f] - R_{\mathrm{emp}}[f] \big] \le 16M \sqrt{\frac{\log\big(12m\, E\, N_1(S, \epsilon/8)\big) - \log \delta}{m}}$$
where $M$ is an upper bound on $L(f(x), y)$.
The covering number $N_1(S, \epsilon)$ is the number of vectors $v_1, v_2, \ldots, v_n$ needed to ensure that for any $f \in \mathcal F$ there is a $v_k$ such that $L(f(x_i), v_k) \le \epsilon$.

Stability-based bounds
If $f$ is the hypothesis returned by a $\beta$-stable algorithm, then with probability $1 - \delta$,
$$R[f] \le R_{\mathrm{emp}}[f] + \beta + 2\,(m\beta + M) \sqrt{\frac{-2 \log(\delta/2)}{m}}$$
where $M$ is an upper bound on $L(f(x), y)$.
An algorithm is $\beta$-stable if, for all training sets and any example $(x, y)$, $L(f(x), y)$ changes by at most $\beta$ when we replace any one of the training examples by any other example.

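Rademacher bounds
For $\mathrm{Ra}_r[f] = r$, with probability $1 - \delta$,
$$R[f] \le \inf_{\alpha > 0} \left[ (1 + \alpha)\, R_{\mathrm{emp}}[f] + \left(1 + \frac{1}{4\alpha}\right) \left( 31\, r \log\frac{2b}{r} + \frac{50\, b\, \epsilon}{n} \right) \right].$$
Rademacher average:
$$\mathrm{Ra}_r[f] = E_{S, \sigma} \Big[ \sup_{f \in \mathcal F \,:\, E L(f(x), y) \le r} \ \sum_i \sigma_i\, L(f(x_i), y_i) \Big]$$
where $P[\sigma_i = 1] = P[\sigma_i = -1] = 1/2$.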

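Structural Risk Minimization
If we have a bound of the form
$$P\Big[\, \sup_{f \in \mathcal F} \big[ R[f] - R_{\mathrm{emp}}[f] \big] \le \epsilon_{\mathcal F} \,\Big] \ge 1 - \delta$$
1. Fix $\delta$.
2. Compute $f_{\mathcal F} = \arg\min_{f \in \mathcal F} \big[ R_{\mathrm{emp}}[f] + \epsilon_{\mathcal F} \big]$ for a sequence of spaces $\mathcal F_1 \subset \mathcal F_2 \subset \ldots \subset \mathcal F_k$.
3. Return the $f_i$ with smallest $R_{\mathrm{emp}}[f_i] + \epsilon_{\mathcal F_i}$.
Does this work?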
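The three SRM steps can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the nested classes are histogram classifiers on $[0, 1)$ with $k = 1, 2, 4, \ldots$ equal-width bins (VC dimension $k$), the penalty is the square-root VC term, and the noisy data generator is invented.

```python
import math
import random

def vc_epsilon(h, m, delta):
    """VC-type confidence term (assumed square-root form)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(delta / 4)) / m)

def bin_index(x, k):
    """Index of the equal-width bin of [0, 1) that contains x."""
    return min(int(x * k), k - 1)

def fit_histogram(sample, k):
    """Empirical risk minimizer among classifiers constant on each of k bins:
    a majority vote of the training labels inside every bin."""
    votes = [0] * k
    for x, y in sample:
        votes[bin_index(x, k)] += y
    labels = [1 if v >= 0 else -1 for v in votes]
    return lambda x: labels[bin_index(x, k)]

def empirical_risk(f, sample):
    """Zero-one training error."""
    return sum(f(x) != y for x, y in sample) / len(sample)

def structural_risk_minimization(sample, ks, delta=0.05):
    """Steps 2 and 3: fit each class, return the best penalized fit."""
    m = len(sample)
    best = None
    for k in ks:
        f = fit_histogram(sample, k)
        score = empirical_risk(f, sample) + vc_epsilon(k, m, delta)
        if best is None or score < best[2]:
            best = (k, f, score)
    return best

rng = random.Random(0)

def noisy_label(x):
    """Hypothetical truth: +1 on [0.25, 0.75), flipped with probability 0.1."""
    clean = 1 if 0.25 <= x < 0.75 else -1
    return clean if rng.random() > 0.1 else -clean

sample = [(x, noisy_label(x)) for x in (rng.random() for _ in range(2000))]
best_k, best_f, best_score = structural_risk_minimization(sample, [1, 2, 4, 8, 16, 32])
```

Doubling the number of bins keeps the classes nested, since every $k$-bin classifier is also a $2k$-bin classifier. Finer classes fit the training set at least as well, but their penalty grows, so the selection settles on the coarsest class that captures the structure.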

The problem with error bounds
Most bounds are hopelessly loose. Typically, for $1 - \delta = 0.95$ we get $\epsilon = 3000$.
The main culprit is the uniformity requirement.
Can we still use them for anything, or are they just a weird sport?
The form of the bounds is important, even if their value is not. In particular, large margin is good.

Hilbert Space Methods

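SVMs: the old story
Kernel $k : \mathcal X \times \mathcal X \to \mathbb R$: positive definite similarity measure.
Feature map $\Phi : \mathcal X \to \mathcal F$ obeys $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
E.g. Gaussian kernel: $k(x, x') = e^{-\| x - x' \|^2 / (2\sigma^2)}$.
Find the maximum margin separating hyperplane in the high dimensional space!
$$f(x) = \mathrm{sgn}\Big[\, b + \sum_{i} \alpha_i\, k(x_i, x) \,\Big]$$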
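The kernel expansion of the decision function translates directly into code. In this minimal sketch the support points, coefficients `alphas` and offset `b` are made-up values rather than the output of an actual SVM solver, and ties at zero are broken toward $+1$ by convention.

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def svm_decision(x, support, alphas, b, kernel):
    """f(x) = sgn[b + sum_i alpha_i k(x_i, x)]."""
    s = b + sum(a * kernel(xi, x) for a, xi in zip(alphas, support))
    return 1 if s >= 0 else -1

# Hypothetical expansion: one negative and one positive support point.
support = [(0.0, 0.0), (2.0, 2.0)]
alphas = [-1.0, 1.0]
b = 0.0

K = gram_matrix(support, gaussian_kernel)
label_near_positive = svm_decision((1.9, 2.1), support, alphas, b, gaussian_kernel)
label_near_negative = svm_decision((0.1, -0.1), support, alphas, b, gaussian_kernel)
```

Note that the feature map $\Phi$ never appears: only kernel evaluations are needed.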

Want a more general story behind Hilbert space methods.
How do we tell what is a good kernel, anyway? We want large margin. What kernel will give us large margin?
Lessons so far:
  capacity control is crucial;
  large margin is good;
  pursue an abstract approach looking for a general $f : \mathcal X \to \mathcal Y$, and worry about the actual algorithm later.

Regularized Risk
Motivated by the form of the error bounds, minimize
$$R_{\mathrm{reg}}[f] = \underbrace{\frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i)}_{R_{\mathrm{emp}}[f]} + \underbrace{\Omega[f]}_{\text{regularizer}}$$
over some large space of functions $\mathcal H$.
$\Omega[f]$ is a penalty term penalizing hypotheses that are too complex. Effectively SRM.
See "Regularization networks" of Poggio & Girosi.

Regularized Spaces of Functions
Given $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, look for $f : \mathcal X \to \mathcal Y$ in some linear space of functions $\mathcal H$ minimizing $R_{\mathrm{reg}}[f]$:
$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i) + \| f \|_{\mathcal H}^2$$
$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i) + \langle f, f \rangle_{\mathcal H} \qquad \text{(Hilbert space)}$$
$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i) + \langle f, f \rangle_{\mathcal H} \qquad \text{(RKHS)}$$
The $k_x$ are prototypical functions such that $f(x) = \langle f, k_x \rangle$.

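Representer Theorem
The minimizer of
$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i) + \langle f, f \rangle_{\mathcal H}$$
will be in the span of $k_{x_1}, k_{x_2}, \ldots, k_{x_m}$!
The hypothesis can be written
$$f(x) = \langle f, k_x \rangle = \sum_{i=1}^{m} \alpha_i \langle k_{x_i}, k_x \rangle = \sum_{i=1}^{m} \alpha_i\, k(x, x_i)$$
where $k(x, x') = \langle k_x, k_{x'} \rangle$.
All we need to find are $\alpha_1, \alpha_2, \ldots, \alpha_m$.
How do we construct the RKHS?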
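For squared error loss the coefficients $\alpha_i$ have a closed form, which makes a convenient illustration of the theorem. Substituting $f = \sum_j \alpha_j k_{x_j}$ into $R_{\mathrm{reg}}$ gives $\frac{1}{m} \| K\alpha - y \|^2 + \alpha^\top K \alpha$, which is minimized when $(K + mI)\,\alpha = y$. The sketch below assumes this loss, a Gaussian kernel and toy 1-D data; the helper `solve` is a plain Gaussian elimination written out to keep the example self-contained.

```python
import math

def gaussian_kernel(x, xp, sigma=0.5):
    """1-D Gaussian kernel (sigma is an illustrative choice)."""
    return math.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            t = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= t * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def fit(xs, ys, kernel):
    """Coefficients alpha solving (K + m I) alpha = y."""
    m = len(xs)
    A = [[kernel(a, b) + (m if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    return solve(A, ys)

def predict(x, xs, alpha, kernel):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, xs))

def regularized_risk(alpha, xs, ys, kernel):
    """(1/m) sum_i (f(x_i) - y_i)^2 + <f, f>_H, with <f, f>_H = alpha^T K alpha."""
    m = len(xs)
    f = [predict(x, xs, alpha, kernel) for x in xs]
    emp = sum((fi - yi) ** 2 for fi, yi in zip(f, ys)) / m
    norm = sum(ai * fi for ai, fi in zip(alpha, f))
    return emp + norm

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
alpha = fit(xs, ys, gaussian_kernel)
```

With the regularizer weighted as on the slide (no trade-off constant), the fit is shrunk strongly toward zero; in practice a multiplier on $\langle f, f \rangle_{\mathcal H}$ is tuned.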

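Constructing the RKHS
$$f(x) = \langle f, k_x \rangle$$
Bootstrap everything from $k(x, x') = \langle k_x, k_{x'} \rangle$ for $x, x' \in \mathcal X$!
1. Anything outside $\mathrm{span}\{ k_x \mid x \in \mathcal X \}$ is uninteresting, so
   $$f = \int_{\mathcal X} \beta(x)\, k_x\, dx.$$
2. To evaluate $f(x')$ use
   $$f(x') = \langle f, k_{x'} \rangle = \int_{\mathcal X} \beta(x)\, \langle k_x, k_{x'} \rangle\, dx = \int_{\mathcal X} \beta(x)\, k(x, x')\, dx.$$
3. To compute $\langle f, f \rangle$ use
   $$\langle f, f \rangle = \int_{\mathcal X} \int_{\mathcal X} \beta(x)\, \beta(x')\, \langle k_x, k_{x'} \rangle\, dx\, dx' = \int_{\mathcal X} \int_{\mathcal X} \beta(x)\, \beta(x')\, k(x, x')\, dx\, dx'.$$
4. Note that $k_x(x') = \langle k_x, k_{x'} \rangle = k(x, x')$, so we simply have $k_x = k(x, \cdot)$.
5. $\mathcal H$ is a particular instance of a feature space $\mathcal F$ if we set $\Phi(x) = k_x$.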
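A finite analogue of this construction replaces the integrals by sums over a handful of centers, $f = \sum_j \beta_j k_{x_j}$. The sketch below uses hypothetical centers and weights and checks one consequence of $f(x) = \langle f, k_x \rangle$: by Cauchy-Schwarz, $|f(x)| \le \| f \|_{\mathcal H} \sqrt{k(x, x)}$.

```python
import math

def k(x, xp, sigma=1.0):
    """1-D Gaussian kernel, used here as the reproducing kernel."""
    return math.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

# Hypothetical finite expansion f = sum_j beta_j k_{x_j}.
centers = [-1.0, 0.0, 2.0]
beta = [0.5, -1.0, 2.0]

def evaluate(xp):
    """Step 2: f(x') = <f, k_x'> = sum_j beta_j k(x_j, x')."""
    return sum(b * k(c, xp) for b, c in zip(beta, centers))

def norm_squared():
    """Step 3: <f, f> = sum_i sum_j beta_i beta_j k(x_i, x_j)."""
    return sum(bi * bj * k(ci, cj)
               for bi, ci in zip(beta, centers)
               for bj, cj in zip(beta, centers))

def pointwise_bound(x):
    """Cauchy-Schwarz bound ||f|| * sqrt(k(x, x)) on |f(x)|."""
    return math.sqrt(norm_squared()) * math.sqrt(k(x, x))

grid = [-3.0, -1.0, 0.0, 0.7, 2.0, 4.0]
values = [evaluate(x) for x in grid]
bounds = [pointwise_bound(x) for x in grid]
```

The bound is what links the RKHS norm to smoothness and size of $f$: a small $\langle f, f \rangle_{\mathcal H}$ forces small function values everywhere.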

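Correspondence
$$R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i) + \langle f, f \rangle_{\mathcal H}$$
Kernel methods make sense from the Regularization Theory point of view if the kernel corresponds to a sensible operator $\Omega[f] = \langle f, f \rangle_{\mathcal H}$.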

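Fourier regularization
Fourier transform on $\mathbb R^n$:
$$\hat f(\omega) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb R^n} f(x)\, e^{-i \omega \cdot x}\, dx$$
Inverse transform:
$$f(x) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb R^n} \hat f(\omega)\, e^{i \omega \cdot x}\, d\omega$$
Fourier regularization:
$$\Omega[f] = \langle f, f \rangle_{\mathcal H} = \int e^{\sigma^2 \| \omega \|^2 / 2}\, | \hat f(\omega) |^2\, d\omega$$
Corresponding kernel:
$$k(x, x') = e^{-\| x - x' \|^2 / (2\sigma^2)}$$
The Gaussian kernel will heavily penalize non-smooth functions!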

Other kernels
Homogeneous polynomial: $k(x, x') = (x \cdot x')^p$
Non-homogeneous polynomial: $k(x, x') = (x \cdot x' + 1)^p$
tanh kernel: $k(x, x') = \tanh(\kappa\, (x \cdot x') + \delta)$
Triangular kernel: $k(x, x') = 1 - \| x - x' \| / d$
String kernels: $k(\text{string}_1, \text{string}_2)$
Kernels on distributions: Fisher, etc.
Diffusion kernels: $k(x, x') = \big[ e^{\beta H} \big]_{x, x'}$
Similarity measure $\leftrightarrow$ regularization
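The closed-form kernels in this list are one-liners. The sketch below implements them as written, with illustrative default parameters, and adds a helper `quadratic_form` that computes $\sum_{ij} c_i c_j k(x_i, x_j)$. For a positive definite kernel this quantity is never negative. The tanh kernel does not always qualify: with $\delta < 0$ it can even give $k(x, x) < 0$.

```python
import math
import random

def dot(x, xp):
    """Euclidean inner product of two tuples."""
    return sum(a * b for a, b in zip(x, xp))

def homogeneous_poly(x, xp, p=2):
    return dot(x, xp) ** p

def inhomogeneous_poly(x, xp, p=2):
    return (dot(x, xp) + 1) ** p

def tanh_kernel(x, xp, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, xp) + delta)

def triangular_kernel(x, xp, d=1.0):
    """1 - ||x - x'|| / d, implemented literally (no truncation at zero)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xp)))
    return 1 - dist / d

def quadratic_form(kernel, points, c):
    """sum_i sum_j c_i c_j k(x_i, x_j)."""
    return sum(ci * cj * kernel(p, q)
               for ci, p in zip(c, points)
               for cj, q in zip(c, points))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
coeffs = [[rng.uniform(-1, 1) for _ in points] for _ in range(20)]

poly_forms = [quadratic_form(homogeneous_poly, points, c) for c in coeffs]
inhom_forms = [quadratic_form(inhomogeneous_poly, points, c) for c in coeffs]

# A diagonal entry of the tanh "kernel" that is negative (kappa=1, delta=-1).
tanh_diagonal = tanh_kernel((0.1,), (0.1,), 1.0, -1.0)
```

Random quadratic forms can only refute positive definiteness, not prove it; for the polynomial kernels the guarantee comes from their explicit feature maps.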

Algorithms

Modularity of Hilbert space methods
$$f = \arg\min_{f \in \mathcal H}\ \underbrace{\frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i)}_{\text{determines algorithm}} + \underbrace{\langle f, f \rangle_{\mathcal H}}_{\text{determines kernel}}$$
The same algorithm (SVM) can be used in very different contexts by changing the kernel: "kernel engineering".
The regularization scheme can be studied independently of the application (classification, regression, etc.).
ANY kernel method can be formulated as one of these minimization problems.

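Soft margin SVMs
Relax the problem to learning continuous functions $f : \mathcal X \to \mathbb R$ with hinge loss
$$L(f(x), y) = C \max(0,\ 1 - y f(x)).$$
Then
$$f = \arg\min_{f \in \mathcal H} \Big[ \frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i) + \langle f, f \rangle_{\mathcal H} \Big]$$
reduces to the soft margin SVM
$$f = \arg\min_{f \in \mathcal H} \Big[ \langle f, f \rangle + C\, \frac{1}{m} \sum_{i=1}^{m} \xi_i \Big] \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
Probably the most popular algorithm for classification.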
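One way to see the regularized hinge-loss problem in action is to minimize it directly over the expansion coefficients $\alpha$, using $f = \sum_j \alpha_j k_{x_j}$ and $\langle f, f \rangle = \alpha^\top K \alpha$. The sketch below uses plain subgradient descent, which is a stand-in chosen for brevity and not the quadratic programming solver normally used for SVMs. The data, step size and $C$ are illustrative assumptions, and there is no offset $b$ because the RKHS formulation has none.

```python
import math
import random

def gaussian_kernel(x, xp, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def train_soft_margin_svm(xs, ys, kernel, C=10.0, steps=200, lr=0.01):
    """Subgradient descent on (C/m) sum_i max(0, 1 - y_i f(x_i)) + alpha^T K alpha."""
    m = len(xs)
    K = [[kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * m
    for _ in range(steps):
        f = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        # Gradient of alpha^T K alpha is 2 K alpha, i.e. 2 f at the sample.
        grad = [2.0 * fi for fi in f]
        for i in range(m):
            if ys[i] * f[i] < 1:  # hinge term is active for example i
                for j in range(m):
                    grad[j] -= (C / m) * ys[i] * K[i][j]
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha

def predict(x, xs, alpha, kernel):
    """Sign of f(x) = sum_j alpha_j k(x_j, x)."""
    s = sum(a * kernel(xj, x) for a, xj in zip(alpha, xs))
    return 1 if s >= 0 else -1

# Two hypothetical Gaussian blobs, 20 points each.
rng = random.Random(0)
xs = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(20)]
xs += [(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(20)]
ys = [1] * 20 + [-1] * 20

alpha = train_soft_margin_svm(xs, ys, gaussian_kernel)
train_accuracy = sum(predict(x, xs, alpha, gaussian_kernel) == y
                     for x, y in zip(xs, ys)) / len(xs)
```

Changing only `gaussian_kernel` changes the hypothesis space while the training loop stays the same, which is the modularity the slides emphasize.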

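Kernel Regression
If we set the loss to the $\epsilon$-insensitive loss
$$L(f(x), y) = C \max(0,\ | y - f(x) | - \epsilon)$$
then
$$f = \arg\min_{f \in \mathcal H} \Big[ \frac{1}{m} \sum_{i=1}^{m} L(\langle f, k_{x_i} \rangle, y_i) + \langle f, f \rangle_{\mathcal H} \Big]$$
reduces to soft kernel regression
$$f = \arg\min_{f \in \mathcal H} \Big[ \langle f, f \rangle + C\, \frac{1}{m} \sum_{i=1}^{m} (\xi_i + \xi_i^*) \Big]$$
subject to
$$y_i - f(x_i) \le \epsilon + \xi_i, \qquad y_i - f(x_i) \ge -\epsilon - \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0.$$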
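The two slack variables in the kernel regression constraints measure how far a prediction falls above or below a tube of half-width $\epsilon$ around the target. A short sketch, assuming the usual $\epsilon$-insensitive loss $\max(0, |y - f(x)| - \epsilon)$ (without the factor $C$), shows that the smallest feasible slacks add up to exactly this loss; the function names are illustrative.

```python
def eps_insensitive_loss(y_hat: float, y: float, eps: float) -> float:
    """max(0, |y - y_hat| - eps): errors inside the eps-tube cost nothing."""
    return max(0.0, abs(y - y_hat) - eps)

def minimal_slacks(y_hat: float, y: float, eps: float):
    """Smallest xi, xi_star >= 0 satisfying
    y - y_hat <= eps + xi  and  y - y_hat >= -eps - xi_star."""
    xi = max(0.0, (y - y_hat) - eps)
    xi_star = max(0.0, (y_hat - y) - eps)
    return xi, xi_star

# (prediction, target, eps): above the tube, below the tube, inside the tube.
cases = [(1.0, 1.5, 0.1), (2.0, 1.0, 0.25), (0.0, 0.05, 0.1)]
losses = [eps_insensitive_loss(*c) for c in cases]
slack_sums = [sum(minimal_slacks(*c)) for c in cases]
```

At most one of the two slacks is nonzero for any example, since a prediction cannot miss the tube on both sides at once.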