Designing Kernel Functions Using the Karhunen-Loève Expansion

Size: px

Start display at page:

Download "Designing Kernel Functions Using the Karhunen-Loève Expansion"

Gerald Waters
6 years ago
Views:

1 July 7, Designing Kernel Functions Using the Karhunen-Loève Expansion 2 1 Fraunhofer FIRST, Germany Tokyo Institute of Technology, Japan 1,2 2 Masashi Sugiyama and Hidemitsu Ogawa

2 Learning with Kernels 2 Kernel methods: Approximate unknown function f (x) by fˆ ( x) = n i= 1 α K( x, i x i ) α i K ( x, x ) : Parameters : Kernel function x : Training points i Kernel methods are known to generalize very well, given appropriate kernel function. Therefore, how to choose (or design) kernel function is critical in kernel methods.

3 Recent Development 3 in Kernel Design Recently, a lot of attention have been paid to designing kernel functions for non-vectorial structured data. e.g., strings, sequence, trees, graphs. In this talk, however, we discuss the problem of designing kernel functions for standard vectorial data.

4 Choice of Kernel Function 4 A kernel function is specified by A family of functions (Gaussian, polynomial, etc.) Kernel parameters (width, order, etc.) We usually focus on a particular family (say Gaussian), and optimize kernel parameters by, e.g., cross-validation. In principle, it is possible to optimize the family of kernels by CV. However, this does not seem so common because of too many degrees of freedom.

5 Goal of Our Research 5 We propose a method for finding optimal family of kernel functions using some prior knowledge on problem domain. We focus on Regression (squared-loss) Translation-invariant kernel K ( x, x ) = K( x x ) We do not assume kernel is positive semidefinite, since kernel trick is not needed in some regression methods (e.g. ridge).

6 Outline of The Talk 6 A general method for designing translation-invariant kernels. Example of kernel design for binary regression. Implication of the results.

7 Specialty of Learning with 7 Translation-Invariant Kernels Ordinary linear models: fˆ ( x) Kernel models: fˆ ( x) = = p i= 1 n i= 1 α ϕ ( x) i i α K( x i x i x i is center of kernels. All basis functions have same shape! ) α i ϕ (x) i : Parameters : Basis function K ( x x ) : Translationinvariant kernel

8 Local Approximation by Kernels 8 Intuitively, each kernel function is responsible for local approximation in the vicinity of each training input point. xi Therefore, we consider the problem of approximating a function locally by a single kernel function. x j

9 Set of Local Functions 9 and Function Space ψ (x) : A local function centered at x Ψ : Set of all local functions H : A functional Hilbert space which contains Ψ (i.e., space of local functions) Suppose ψ (x) is a probabilistic function. H ψ (x) ψ (x) x

10 Optimal Approximation to 10 Set of Local Functions We are looking for the optimal approximation to the set Ψ of local functions ψ (x). Since we are interested in optimizing the family of functions, scaling is not important. We search the optimal direction in H. φ opt = arg min φ H E ψ ψ φ 2 ψ φ opt H E : Expectation over ψ ψ ψ φ : Projection of onto φ φ ψ φ

11 Karhunen-Loève Expansion φ opt R : Correlation operator of local functions R ϕ = E, [ ϕ ψ ψ ] arg min φ H, : Inner product in H Optimal direction φ opt is given by the eigenfunction φ max associated with the largest eigenvalue of R. λ max E ψ ψ Similar to PCA, but E ψ. = Rφ = λ max max φ max [ ] 0 φ 2 ψ If is vector, R = E [ ] T ψψ φ max H 11

12 Principal Component Kernel 12 Using, we define the kernel function by φ opt x x K( x, x ) = φ opt c c : Width x : Center Since the above kernel consists of the principal component of the correlation operator, we call it the principal component (PC) kernel.

13 Example of Kernel Design: 13 Binary Regression Problem Learning target function is binary. 1 f (x) 0 The set of local functions is a set of rectangular functions with different width. 1 0 x i ψ (x)

14 14 Widths of Rectangular Functions We assume that the width of rectangular functions is bounded (and normalized). Since we do not have prior knowledge on the width, we should define its distribution in an unbiased manner. We use uniform distribution for the width since it is non-informative. ~, U θ l θ r (0,1) 1 0 θl θr

15 Eigenvalue Problem 15 We use L -space as a function space H. 2 Considering the symmetry, the eigenvalue problem Rφ = λφ is expressed as 1 0 r( x, y) φ ( y) dy = λφ( x) r( x, y) = 1 max( x, y) The principal component is given by φ max ( x) π = 2 cos x 2

16 16 PC Kernel for Binary Regression K( x, x ) = x x cos c 0 if x c x otherwise π 2 x : Center c : Width x = 0, c = 1

17 Implication of The Result 17 Binary classification is often solved as binary regression with squared-loss (e.g., regularization networks, least-squares SVMs). Although binary function is not smooth at all, smooth Gaussian kernel often works very well in practice. Why?

18 18 Implication of The Result (cont.) By proper scaling, it can be confirmed that the shape of the obtained PC kernel is similar to Gaussian kernel. Both kernels work similarly in experiments. Datasets Banana B.Cancer Diabetes F.Solar Heart Ringnorm Thyroid Titanic Twonorm Waveform PC kernel Gauss kernel

19 19 Implication of The Result (cont.) This implies that Gaussian-like bellshaped function approximates binary functions very well. This partially explains why smooth Gaussian kernel is suitable for nonsmooth classification tasks.

20 Conclusions 20 Optimizing the family of kernel functions is a difficult task because it has infinitely many degrees of freedom. We proposed a method for designing kernel functions in regression scenarios. The optimal kernel shape is given by the principal component of correlation operator of local functions. We can beneficially use prior knowledge on problem domain (e.g., binary)

Release from Active Learning / Model Selection Dilemma: Optimizing Sample Points and Models at the Same Time

IJCNN2002 May 12-17, 2002 Release from Active Learning / Model Selection Dilemma: Optimizing Sample Points and Models at the Same Time Department of Computer Science, Tokyo Institute of Technology, Tokyo,