Learning the Kernel Function via Regularization

1 Learning the Kernel Function via Regularization
Authors: C. A. Micchelli and M. Pontil
Speaker: Yonggang Yao
April 9, 2007

2 Outline

3 Settings for a Class of Regularization Problems
1. Training data: $(y_i, x_i)$ with $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$, $i = 1, \dots, m$.
2. $f : \mathcal{X} \to \mathbb{R}$ and $f \in \mathcal{H}_K$, where $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) with kernel $K$.
3. Regularization functional to be minimized: $Q(I_x(f)) + \mu P(f)$, (1) where $I_x(f) = (f(x_1), \dots, f(x_m))$ and $\mu > 0$.
4. $Q$ is a measure of the empirical error.
5. $P(f) = \langle f, f \rangle$ is a measure of smoothness, where $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathcal{H}_K$.

4 Some Properties of the Regularization Problems
1. By the representer theorem, if $f \in \mathcal{H}_K$ is a minimizer of (1), it has the form $f(x) = \sum_{j=1}^{m} c_j K(x_j, x)$, $x \in \mathcal{X}$. (2)
2. If $Q$ is convex, the unique solution of (1), which attains $Q_\mu(K) := \inf\{Q(I_x(f)) + \mu P(f) : f \in \mathcal{H}_K\}$, can be found by substituting $f = \sum_{j=1}^{m} c_j K(x_j, \cdot)$ and then minimizing the functional with respect to the vector $c := (c_1, \dots, c_m)$.
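The reduction in item 2 can be sketched numerically. The following is a minimal illustration, assuming the square loss for $Q$ and a Gaussian kernel; both are illustrative choices that the slides do not fix, and the names `gaussian_kernel`, `objective`, `fit`, and `predict` are hypothetical helpers. With $f = \sum_j c_j K(x_j, \cdot)$, the functional (1) becomes a function of $c$ alone, since $I_x(f) = K_x c$ and $P(f) = c^\top K_x c$.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    """Hypothetical kernel choice: K(a, b) = exp(-(a - b)^2 / (2 * width^2))."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-(a - b) ** 2 / (2.0 * width ** 2))

def objective(c, Kx, y, mu):
    """Q(I_x(f)) + mu * P(f) for f = sum_j c_j K(x_j, .), with Q the square loss."""
    fx = Kx @ c                               # I_x(f) = K_x c
    return np.sum((y - fx) ** 2) + mu * c @ Kx @ c   # P(f) = c^T K_x c

def fit(x, y, mu):
    """Minimize the objective over c; for the square loss this is a linear solve."""
    Kx = gaussian_kernel(x, x)
    c = np.linalg.solve(Kx + mu * np.eye(len(x)), y)
    return c, Kx

def predict(c, x_train, x_new):
    """Evaluate f(x) = sum_j c_j K(x_j, x) at new points."""
    return gaussian_kernel(x_new, x_train) @ c

# Toy data (an assumption for illustration only).
x = np.linspace(0.0, 3.0, 8)
y = np.sin(x)
mu = 0.1
c, Kx = fit(x, y, mu)
best = objective(c, Kx, y, mu)
print("objective at the minimizer:", best)
print("f at new points:", predict(c, x, [0.5, 1.5]))
```

For other convex choices of $Q$ the linear solve in `fit` would be replaced by a generic convex optimizer over $c$; the finite-dimensional structure stays the same.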

5 Problems
1. Choice of the regularization parameter $\mu$. Typically, this is handled by cross validation (CV), generalized CV (GCV), or path methods.
2. Choice of the kernel $K$. The authors propose choosing the kernel by solving $Q_\mu(\mathcal{K}) := \inf\{Q_\mu(K) : K \in \mathcal{K}\}$, (3) where $Q_\mu(K) := \inf\{Q(I_x(f)) + \mu P(f) : f \in \mathcal{H}_K\}$ and $\mathcal{K}$ is a set of reproducing kernels.

6 Solution Existence
1. Lemma 1. If $Q : \mathbb{R}^m \to \mathbb{R}_+$ is continuous, there is an $f \in \mathcal{H}_K$ minimizing (1): $Q(I_x(f)) + \mu P(f)$. If $Q$ is convex, the solution is unique.
2. Lemma 2. If $\mathcal{K}$ is a compact and convex set such that $K_x$ is positive definite for any $K \in \mathcal{K}$, and $Q$ is continuous, then problem (3), $Q_\mu(\mathcal{K}) := \inf\{Q_\mu(K) : K \in \mathcal{K}\}$, has a solution. Here $K_x := [K(x_i, x_j)]$ is the $m \times m$ matrix associated with $x$.

7 Kernel Sets
1. The convex hull of a finite number of kernels $\{K_1, \dots, K_n\}$, with $K_{x,i}$ positive definite for $i = 1, \dots, n$.
2. A more general case: $\mathcal{K}(G) := \left\{ \int_\Omega G(\omega)\, dp(\omega) : p \in \mathcal{M}(\Omega) \right\}$, where $\Omega$ is a Hausdorff space, $G(\omega)$ is a kernel such that $G(\omega)_x$ is positive definite for any $\omega \in \Omega$, and $\mathcal{M}(\Omega)$ is the set of all probability measures on $\Omega$.

8 Properties of Kernel Sets (1)
1. If $a$ is an element of the convex hull of $A$ with $A \subseteq \mathbb{R}^d$, then $a$ can be represented as a convex combination of at most $d + 1$ elements of $A$.
2. Let $\mathcal{G} := \{G(\omega) : \omega \in \Omega\}$. For fixed $x$, any positive matrix $C$ in the convex hull of $\mathcal{G}_x := \{K_x : K \in \mathcal{G}\}$ can be represented as a convex combination of at most $m(m+1)$ elements of $\mathcal{G}_x$.

9 Properties of Kernel Sets (2)
1. Let $\mathcal{K}$ be the closure of the convex hull of $\mathcal{G}$, and let $\hat{K}$ be a solution of problem (3), that is, $Q_\mu(\hat{K}) = Q_\mu(\mathcal{K}) := \inf\{Q_\mu(K) : K \in \mathcal{K}\}$ for some $\hat{K} \in \mathcal{K}$. Then $\hat{K}$ can be represented as a convex combination of at most $m + 2$ elements of $\mathcal{G}$.
2. Main points of the proof: for any kernel $K$, the optimal value in (1) is completely determined by $(K_x c, c^\top K_x c) \in \mathbb{R}^{m+1}$, where $c$ is as defined in (2). For any fixed $c$, the set $\{(K_x c, c^\top K_x c) : K \in \mathcal{K}\} \subseteq \mathbb{R}^{m+1}$ is convex.

10 Properties of Kernel Sets (3)
1. Let $\mu > 0$ be a fixed constant and $x$ be a set of distinct points in $\mathcal{X}$. If $Q : \mathbb{R}^m \to \mathbb{R}_+$ is convex, then the function $K \mapsto Q_\mu(K)$ is convex and non-increasing over kernels $K$ with $K_x$ positive definite.
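A quick numerical sanity check of this property for the square loss, using the closed form $S_\mu(K) = \mu\,(y, (\mu I + K_x)^{-1} y)$ stated on slide 12. This is only a sketch: the random positive definite matrices stand in for kernel matrices $K_x$, and the helper names `s_mu` and `random_pd` are hypothetical. Convexity is checked along a segment between two matrices, and monotonicity by comparing $A$ with $A + B$, which dominates $A$ in the positive semidefinite order.

```python
import numpy as np

def s_mu(Kx, y, mu):
    """Square-loss value S_mu(K) = mu * (y, (mu I + K_x)^{-1} y)."""
    return mu * y @ np.linalg.solve(mu * np.eye(len(y)) + Kx, y)

def random_pd(m, rng):
    """A random positive definite matrix, standing in for a kernel matrix K_x."""
    B = rng.standard_normal((m, m))
    return B @ B.T + 0.1 * np.eye(m)

rng = np.random.default_rng(0)
m, mu = 5, 0.5
y = rng.standard_normal(m)
A, B = random_pd(m, rng), random_pd(m, rng)

# Convexity along the segment t*A + (1 - t)*B.
convex_ok = all(
    s_mu(t * A + (1 - t) * B, y, mu)
    <= t * s_mu(A, y, mu) + (1 - t) * s_mu(B, y, mu) + 1e-12
    for t in np.linspace(0.0, 1.0, 11)
)

# Monotonicity: A + B dominates A, so the value should not increase.
monotone_ok = s_mu(A + B, y, mu) <= s_mu(A, y, mu) + 1e-12

print("convex along segment:", convex_ok)
print("non-increasing:", monotone_ok)
```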

11 Problem Setting
1. Consider square-loss regularization, where the functional to be minimized is $\|y - I_x(f)\|^2 + \mu \|f\|_K^2$.
2. Similarly to $Q_\mu(K)$, define $S_\mu(K) := \inf\{\|y - I_x(f)\|^2 + \mu \|f\|_K^2 : f \in \mathcal{H}_K\}$.
3. Let $\mathcal{K}$ be a kernel set such that $K_x$ is positive definite for any $K \in \mathcal{K}$. Define $\rho(K) := \inf\{\|f\|_K^2 : I_x(f) = y,\ f \in \mathcal{H}_K\}$.

12 Results (1)
1. For any kernel $K$, inputs $x$, response $y$, and $\mu > 0$, we have $S_\mu(K) = \mu\,(y, (\mu I + K_x)^{-1} y)$.
2. If $f$ minimizes $\|f\|_K^2$ subject to $I_x(f) = y$ over $f \in \mathcal{H}_K$, then $f(x) = \sum_{j=1}^{m} c_j K(x_j, x)$, $x \in \mathcal{X}$, for some $c := (c_1, \dots, c_m) \in \mathbb{R}^m$, and $\|f\|_K^2 = (y, K_x^{-1} y)$.
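Both identities are easy to confirm numerically. The sketch below assumes a random positive definite matrix in place of a real kernel matrix $K_x$ (an illustrative stand-in, not data from the talk). It compares the regularized objective evaluated at its minimizer $c = (K_x + \mu I)^{-1} y$ with the closed form, and checks that the interpolating coefficients $c = K_x^{-1} y$ give squared norm $(y, K_x^{-1} y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu = 6, 0.3

# Hypothetical positive definite kernel matrix and response vector.
B = rng.standard_normal((m, m))
Kx = B @ B.T + 0.5 * np.eye(m)
y = rng.standard_normal(m)

# Item 1: regularized square-loss problem.
c_reg = np.linalg.solve(Kx + mu * np.eye(m), y)            # minimizer over c
direct = np.sum((y - Kx @ c_reg) ** 2) + mu * c_reg @ Kx @ c_reg
closed = mu * y @ np.linalg.solve(mu * np.eye(m) + Kx, y)  # mu * (y, (mu I + K_x)^{-1} y)

# Item 2: minimal-norm interpolation, I_x(f) = K_x c = y.
c_int = np.linalg.solve(Kx, y)
norm_sq = c_int @ Kx @ c_int                               # ||f||_K^2 = c^T K_x c
min_norm = y @ np.linalg.solve(Kx, y)                      # (y, K_x^{-1} y)

print("S_mu direct vs closed form:", direct, closed)
print("interpolant norm vs (y, K_x^{-1} y):", norm_sq, min_norm)
```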

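13 Results (2)
1. Define $\rho(\mathcal{K}) := \inf\{\rho(K) : K \in \mathcal{K}\}$.
2. If $\mathcal{K}$ is compact and convex with $K_x$ positive definite for every $K \in \mathcal{K}$, then the infimum $\rho(\mathcal{K})$ is attained.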

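14 Results (3)
1. Let $c(A) := \dfrac{A^{-1} y}{(y, A^{-1} y)}$, with $A$ being an $m \times m$ positive definite matrix.
2. Let $\mathcal{K}_n := \{K_1, \dots, K_n\}$ be a kernel set with $K_{x,i}$ positive definite for $i = 1, \dots, n$.
3. There exists a kernel $\hat{K} = \sum_{i \in J} \lambda_i K_i$ with $\sum_{i \in J} \lambda_i = 1$, $\lambda_i > 0$ for $i \in J \subseteq \{1, \dots, n\}$, and $|J| \le \min(m + 1, n)$, such that $\rho(\mathcal{K}_n) = \rho(\hat{K}) = (y, \hat{K}_x^{-1} y)$.
4. Let $\hat{c} := c(\hat{K}_x)$. Then, for any $i \in J$, we have $(\hat{c}, K_{x,i} \hat{c}) = \max\{(\hat{c}, K_{x,l} \hat{c}) : l = 1, \dots, n\}$.
5. For any $c \in \mathbb{R}^m$ with $(c, y) = 1$ and any $K$ in the convex hull of $\mathcal{K}_n$, the pair $(\hat{c}, \hat{K})$ has the saddle point property $(\hat{c}, K_x \hat{c}) \le (\hat{c}, \hat{K}_x \hat{c}) \le (c, \hat{K}_x c)$.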
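The saddle point property can be illustrated for the smallest nontrivial case, $n = 2$. The sketch below is an assumption-laden toy: two random positive definite matrices stand in for $K_{x,1}$ and $K_{x,2}$, and the optimal weight is found by bisection on the derivative of $\lambda \mapsto (y, K(\lambda)^{-1} y)$ with $K(\lambda) = \lambda K_1 + (1 - \lambda) K_2$, which is convex in $\lambda$. The helper names `c_of`, `slope`, and `best_weight` are hypothetical and this is not the authors' algorithm.

```python
import numpy as np

def c_of(A, y):
    """c(A) = A^{-1} y / (y, A^{-1} y)."""
    z = np.linalg.solve(A, y)
    return z / (y @ z)

def slope(lam, K1, K2, y):
    """Derivative in lam of (y, K(lam)^{-1} y), K(lam) = lam*K1 + (1 - lam)*K2."""
    z = np.linalg.solve(lam * K1 + (1.0 - lam) * K2, y)
    return -z @ (K1 - K2) @ z

def best_weight(K1, K2, y, iters=80):
    """Minimize the convex function lam -> (y, K(lam)^{-1} y) on [0, 1]."""
    if slope(0.0, K1, K2, y) >= 0.0:
        return 0.0
    if slope(1.0, K1, K2, y) <= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):          # bisection on the sign of the derivative
        mid = 0.5 * (lo + hi)
        if slope(mid, K1, K2, y) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
m = 5
B1, B2 = rng.standard_normal((m, m)), rng.standard_normal((m, m))
K1 = B1 @ B1.T + 0.1 * np.eye(m)    # stand-ins for K_{x,1}, K_{x,2}
K2 = B2 @ B2.T + 0.1 * np.eye(m)
y = rng.standard_normal(m)

lam = best_weight(K1, K2, y)
K_hat = lam * K1 + (1.0 - lam) * K2
c_hat = c_of(K_hat, y)
saddle = c_hat @ K_hat @ c_hat                 # (c_hat, K_hat c_hat)
left = [c_hat @ K @ c_hat for K in (K1, K2)]   # (c_hat, K_{x,l} c_hat)

print("optimal weight on K1:", lam)
print("left-hand values:", left, "saddle value:", saddle)
```

The saddle value equals $1 / (y, \hat{K}_x^{-1} y)$, so minimizing $(y, K_x^{-1} y)$ over kernels is the same as maximizing $\min\{(c, K_x c) : (c, y) = 1\}$.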

15 Related Works
1. Lanckriet et al. (2004) apply the following criteria to select the optimal kernel $K$ via its matrix $K_x$:
   (a) $\rho_{\text{hard}}(K) := \min\{\|f\|_K^2 : y_i f(x_i) \ge 1,\ f \in \mathcal{H}_K;\ i = 1, \dots, m\}$;
   (b) $\rho_{\text{soft}}(K) := \min\{\sum_{i=1}^{m} \xi_i + \|f\|_K^2 : y_i f(x_i) \ge 1 - \xi_i,\ f \in \mathcal{H}_K;\ i = 1, \dots, m\}$.
2. Ong et al. (2003) consider learning a kernel function rather than a kernel matrix. They choose the kernel from a kernel space generated by a hyper-kernel function. A hyper-kernel function is a function $H : \mathcal{X}^2 \times \mathcal{X}^2 \to \mathbb{R}$ such that $H((v, w), (\cdot, \cdot))$ is a kernel function.

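16 Method
1. An interior point method: define $F_v(\lambda)$ as $S_\mu\big(\sum_{l=1}^{n} \lambda_l K_l\big)$ combined with a barrier term in $\lambda_1, \dots, \lambda_n$ weighted by $v$, and find the optimal kernel via $\min\{F_v(\lambda) : \lambda \in \mathbb{R}^n,\ \sum_{l=1}^{n} \lambda_l = 1\}$.
2. As $v \to 0^+$, the method converges to the minimizer of $S_\mu(\cdot)$.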
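The exact barrier term is not reproduced here, so the sketch below swaps in a different, simpler technique: exponentiated gradient descent over the simplex, which also keeps every $\lambda_l$ positive and $\sum_l \lambda_l = 1$. It is not the interior point method of the talk. It uses the kernels $K_l(v, w) = \sin(lv)\sin(lw)$ from the results slide and the gradient $\partial S_\mu / \partial \lambda_l = -\mu\, z^\top K_{x,l} z$ with $z = (\mu I + \sum_l \lambda_l K_{x,l})^{-1} y$. The data generation, the number of kernels, the step size, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def s_mu(lam, kernels, y, mu):
    """S_mu of the combined kernel: mu * (y, (mu I + sum_l lam_l K_l)^{-1} y)."""
    M = mu * np.eye(len(y)) + np.tensordot(lam, kernels, axes=1)
    return mu * y @ np.linalg.solve(M, y)

def grad_s_mu(lam, kernels, y, mu):
    """Gradient in lam: dS/dlam_l = -mu * z^T K_l z, z = (mu I + sum lam K)^{-1} y."""
    M = mu * np.eye(len(y)) + np.tensordot(lam, kernels, axes=1)
    z = np.linalg.solve(M, y)
    return -mu * np.einsum("i,lij,j->l", z, kernels, z)

def learn_weights(kernels, y, mu, steps=300, eta=0.5):
    """Exponentiated gradient on the simplex; returns the best iterate seen."""
    n = len(kernels)
    logw = np.zeros(n)                       # log-weights, start at uniform
    lam = np.full(n, 1.0 / n)
    best_lam, best_val = lam.copy(), s_mu(lam, kernels, y, mu)
    for _ in range(steps):
        g = grad_s_mu(lam, kernels, y, mu)
        logw -= eta * g / (np.abs(g).max() + 1e-12)   # scale-free step
        logw -= logw.max()                   # guard against overflow in exp
        lam = np.exp(logw)
        lam /= lam.sum()
        val = s_mu(lam, kernels, y, mu)
        if val < best_val:
            best_lam, best_val = lam.copy(), val
    return best_lam, best_val

# Hypothetical simulated data: a target built from sin(2x) and sin(5x) plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, 30)
y = np.sin(2 * x) + np.sin(5 * x) + 0.1 * rng.standard_normal(30)

n, mu = 6, 0.1
kernels = np.stack([np.outer(np.sin(l * x), np.sin(l * x)) for l in range(1, n + 1)])

uniform_val = s_mu(np.full(n, 1.0 / n), kernels, y, mu)   # Method 2 style average
lam, val = learn_weights(kernels, y, mu)
print("weights:", np.round(lam, 3))
print("S_mu uniform vs learned:", uniform_val, val)
```

Tracking the best iterate guarantees the returned weights are never worse than the uniform average used as the starting point.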

17 Results
Table: Average mean square error for Methods 1 to 3. Columns: µ, Method 1, Method 2, Method 3.
Note: Method 1 is the interior point method; Method 2 uses $K = \frac{1}{n} \sum_{l=1}^{n} K_l$; Method 3 uses the ideal kernel $K = K_2 + K_5$, which generated the simulated data. Here $K_l(v, w) = \sin(lv) \sin(lw)$.

18 Important References
1. Lanckriet et al. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5: 27-72.
2. Lin and Zhang (2003). Component selection and smoothing in smoothing spline analysis of variance models: COSSO. Institute of Statistics Mimeo Series 2556, NCSU.

A DC-Programming Algorithm for Kernel Selection Andreas Argyriou a.argyriou@cs.ucl.ac.uk Department of Computer Science, University College London, Gower Street, London WC1E 6BT, UK Raphael Hauser raphael.hauser@comlab.ox.ac.uk Oxford University Computing