Theory of Positive Definite Kernel and Reproducing Kernel Hilbert Space
Statistical Inference with Reproducing Kernel Hilbert Space

Kenji Fukumizu
Institute of Statistical Mathematics, ROIS
Department of Statistical Science, Graduate University for Advanced Studies
June 20, 2008 / Statistical Learning Theory II

Outline

1. Positive and negative definite kernels
2. [title missing in source]
3. [title missing in source]

Review: operations that preserve positive definiteness I

Proposition 1. If $k_i : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ ($i = 1, 2, \ldots$) are positive definite kernels, then so are the following:
1. (positive combination) $a k_1 + b k_2$ ($a, b \ge 0$).
2. (product) $k_1 k_2$, i.e. $(x, y) \mapsto k_1(x, y)\, k_2(x, y)$.
3. (limit) $\lim_{i \to \infty} k_i(x, y)$, assuming the limit exists.

Remark. Proposition 1 says that the set of all positive definite kernels is a convex cone that is closed (w.r.t. pointwise convergence) and stable under multiplication.

Example: If $k(x, y)$ is positive definite, then
$$e^{k(x,y)} = 1 + k + \tfrac{1}{2} k^2 + \tfrac{1}{3!} k^3 + \cdots$$
is also positive definite.

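The closure properties of Proposition 1 can be checked numerically on Gram matrices. The following is a minimal sketch, assuming NumPy, real-valued kernels, and an arbitrary random sample; the helper `is_psd` and the two kernels (linear and quadratic polynomial) are illustrative choices that do not come from the slides.

```python
import numpy as np

def is_psd(G, tol=1e-9):
    """True if the symmetric matrix G has no eigenvalue below -tol (relative to its largest)."""
    eig = np.linalg.eigvalsh(G)
    return bool(eig.min() >= -tol * max(1.0, np.abs(eig).max()))

rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(8, 3))      # 8 arbitrary sample points in R^3

G1 = X @ X.T                           # Gram matrix of the linear kernel x^T y
G2 = (X @ X.T + 1.0) ** 2              # Gram matrix of the polynomial kernel (x^T y + 1)^2

G_comb = 2.0 * G1 + 0.5 * G2           # positive combination a*k1 + b*k2
G_prod = G1 * G2                       # pointwise (elementwise) product k1*k2
G_exp = np.exp(G1)                     # e^{k1}, the limit of its power series

for name, G in [("combination", G_comb), ("product", G_prod), ("exp", G_exp)]:
    print(name, is_psd(G))
```

A finite sample can only fail to find a counterexample; it does not prove positive definiteness, which is what the proposition provides.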
Review: operations that preserve positive definiteness II

Proposition 2. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ be a positive definite kernel and $f : \mathcal{X} \to \mathbb{C}$ be an arbitrary function. Then
$$\tilde{k}(x, y) = f(x)\, k(x, y)\, \overline{f(y)}$$
is positive definite. In particular, $f(x)\overline{f(y)}$ is a positive definite kernel.

Review: operations that preserve positive definiteness III

Corollary 3 (Normalization). Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ be a positive definite kernel. If $k(x, x) > 0$ for any $x \in \mathcal{X}$, then
$$\tilde{k}(x, y) = \frac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}}$$
is positive definite. This is called the normalization of $k$. Note that $|\tilde{k}(x, y)| \le 1$ for any $x, y \in \mathcal{X}$.

Example: Polynomial kernel $k(x, y) = (x^T y + c)^d$ ($c > 0$):
$$\tilde{k}(x, y) = \frac{(x^T y + c)^d}{(x^T x + c)^{d/2}\, (y^T y + c)^{d/2}}.$$

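As an illustration of the normalization in Corollary 3, the sketch below normalizes the Gram matrix of a polynomial kernel. It assumes NumPy and real data; the function name `normalize_gram` and the parameter values `c = 1`, `d = 3` are hypothetical choices made for this example only.

```python
import numpy as np

def normalize_gram(G):
    """Return the Gram matrix of k(x, y) / sqrt(k(x, x) k(y, y)); needs a positive diagonal."""
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))            # 6 arbitrary sample points in R^4
c, deg = 1.0, 3                        # polynomial kernel parameters (c > 0)

G = (X @ X.T + c) ** deg               # k(x, y) = (x^T y + c)^d
G_tilde = normalize_gram(G)

print(np.diag(G_tilde))                # diagonal of the normalized kernel
print(np.abs(G_tilde).max())           # largest absolute entry
```

The normalized matrix has unit diagonal, and its entries are bounded by 1 in absolute value, in line with the note in the corollary.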
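Definition. A function $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is called a negative definite kernel if it is Hermitian, i.e. $\psi(y, x) = \overline{\psi(x, y)}$, and
$$\sum_{i,j=1}^n c_i \overline{c_j}\, \psi(x_i, x_j) \le 0$$
for any $x_1, \ldots, x_n$ ($n \ge 2$) in $\mathcal{X}$ and $c_1, \ldots, c_n \in \mathbb{C}$ with $\sum_{i=1}^n c_i = 0$.

Note: a negative definite kernel is not necessarily the negative of a positive definite kernel, because of the constraint $\sum_{i=1}^n c_i = 0$.
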
Properties of negative definite kernels

Proposition 4.
1. If $k$ is positive definite, $\psi = -k$ is negative definite.
2. Constant functions are negative definite.

Proof of 2. $\sum_{i,j=1}^n c_i \overline{c_j} = \sum_{i=1}^n c_i\, \overline{\sum_{j=1}^n c_j} = 0$.

Proposition 5. If $\psi_i : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ ($i = 1, 2, \ldots$) are negative definite kernels, then so are the following:
1. (positive combination) $a \psi_1 + b \psi_2$ ($a, b \ge 0$).
2. (limit) $\lim_{i \to \infty} \psi_i(x, y)$, assuming the limit exists.

The set of all negative definite kernels is a convex cone that is closed w.r.t. pointwise convergence.
Multiplication does not preserve negative definiteness.

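The last remark can be made concrete with a small counterexample. The sketch below assumes NumPy and uses an illustrative choice that is not in the slides: $\psi(x, y) = |x - y|^2$ on $\mathbb{R}$, which is negative definite by Proposition 6, together with three points and one zero-sum coefficient vector for which the product $\psi \cdot \psi$ gives a strictly positive quadratic form.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])              # three points on the real line
c = np.array([1.0, -2.0, 1.0])              # coefficients with c.sum() == 0

Psi = (xs[:, None] - xs[None, :]) ** 2      # psi(x_i, x_j) = |x_i - x_j|^2
q_psi = c @ Psi @ c                         # quadratic form of psi (should be <= 0)
q_prod = c @ (Psi * Psi) @ c                # quadratic form of psi * psi = |x - y|^4

print(q_psi, q_prod)
```

Since the quadratic form of $|x - y|^4$ is positive for an admissible coefficient vector, the product of two negative definite kernels is not negative definite in general.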
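Example of negative definite kernel

Proposition 6. Let $V$ be an inner product space, and $\phi : \mathcal{X} \to V$. Then
$$\psi(x, y) = \|\phi(x) - \phi(y)\|^2$$
is a negative definite kernel on $\mathcal{X}$.

Proof. Suppose $\sum_{i=1}^n c_i = 0$. Then
$$\begin{aligned}
\sum_{i,j=1}^n c_i \overline{c_j}\, \|\phi(x_i) - \phi(x_j)\|^2
&= \sum_{i,j=1}^n c_i \overline{c_j} \bigl\{ \|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - (\phi(x_i), \phi(x_j)) - (\phi(x_j), \phi(x_i)) \bigr\} \\
&= \sum_{i=1}^n c_i \|\phi(x_i)\|^2\, \overline{\sum_{j=1}^n c_j} + \sum_{j=1}^n \overline{c_j} \|\phi(x_j)\|^2 \sum_{i=1}^n c_i \\
&\quad - \Bigl( \sum_{i=1}^n c_i \phi(x_i), \sum_{j=1}^n c_j \phi(x_j) \Bigr) - \Bigl( \sum_{j=1}^n \overline{c_j} \phi(x_j), \sum_{i=1}^n \overline{c_i} \phi(x_i) \Bigr) \\
&= - \Bigl\| \sum_{i=1}^n c_i \phi(x_i) \Bigr\|^2 - \Bigl\| \sum_{i=1}^n \overline{c_i} \phi(x_i) \Bigr\|^2 \le 0.
\end{aligned}$$
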
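Relation between positive and negative definite kernels

Lemma 7. Let $\psi(x, y)$ be a Hermitian kernel on $\mathcal{X}$. Fix $x_0 \in \mathcal{X}$ and define
$$\varphi(x, y) = -\psi(x, y) + \psi(x, x_0) + \psi(x_0, y) - \psi(x_0, x_0).$$
Then, $\psi$ is negative definite if and only if $\varphi$ is positive definite.

Proof. The "if" part is easy (exercise). Suppose $\psi$ is negative definite. Take any $x_i \in \mathcal{X}$ and $c_i \in \mathbb{C}$ ($i = 1, \ldots, n$). Define $c_0 = -\sum_{i=1}^n c_i$, so that $\sum_{i=0}^n c_i = 0$. Then, applying negative definiteness to $x_0, x_1, \ldots, x_n$,
$$\begin{aligned}
0 &\ge \sum_{i,j=0}^n c_i \overline{c_j}\, \psi(x_i, x_j) \\
&= \sum_{i,j=1}^n c_i \overline{c_j}\, \psi(x_i, x_j) + \overline{c_0} \sum_{i=1}^n c_i\, \psi(x_i, x_0) + c_0 \sum_{j=1}^n \overline{c_j}\, \psi(x_0, x_j) + |c_0|^2 \psi(x_0, x_0) \\
&= \sum_{i,j=1}^n c_i \overline{c_j} \bigl\{ \psi(x_i, x_j) - \psi(x_i, x_0) - \psi(x_0, x_j) + \psi(x_0, x_0) \bigr\} \\
&= - \sum_{i,j=1}^n c_i \overline{c_j}\, \varphi(x_i, x_j).
\end{aligned}$$
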
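Lemma 7 can be illustrated with the squared-distance kernel of Proposition 6. The sketch below assumes NumPy, real data, and the identity feature map $\phi(x) = x$ on $\mathbb{R}^3$; the sample points and the base point `x0` are arbitrary. For this choice, $\varphi(x, y)$ reduces to $2 (x - x_0)^T (y - x_0)$, which the code also checks.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 3))                       # 7 arbitrary points in R^3
x0 = rng.normal(size=3)                           # fixed base point x_0

def sq_dist(A, B):
    """Matrix of squared Euclidean distances ||a_i - b_j||^2."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sum(diff ** 2, axis=-1)

Psi = sq_dist(X, X)                               # psi(x_i, x_j)
psi_x0 = sq_dist(X, x0[None, :])[:, 0]            # psi(x_i, x_0); note psi(x_0, x_0) = 0

# phi(x, y) = -psi(x, y) + psi(x, x_0) + psi(x_0, y) - psi(x_0, x_0)
Phi = -Psi + psi_x0[:, None] + psi_x0[None, :] - 0.0

# Negative definiteness of psi: quadratic forms over zero-sum coefficient vectors.
C = rng.normal(size=(100, 7))
C -= C.mean(axis=1, keepdims=True)                # each row now sums to zero
quad_psi = np.einsum("ki,ij,kj->k", C, Psi, C)

print(quad_psi.max(), np.linalg.eigvalsh(Phi).min())
```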
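Schoenberg's theorem

Theorem 8 (Schoenberg's theorem). Let $\mathcal{X}$ be a nonempty set, and $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ be a kernel. $\psi$ is negative definite if and only if $\exp(-t\psi)$ is positive definite for all $t > 0$.

Proof.
If part:
$$\psi(x, y) = \lim_{t \downarrow 0} \frac{1 - \exp(-t\psi(x, y))}{t}.$$
Each $(1 - \exp(-t\psi))/t$ is negative definite (Proposition 4), and so is the limit (Proposition 5).

Only if part: It suffices to prove the case $t = 1$, since $t\psi$ is negative definite for every $t > 0$. Take $x_0 \in \mathcal{X}$ and define
$$\varphi(x, y) = -\psi(x, y) + \psi(x, x_0) + \psi(x_0, y) - \psi(x_0, x_0).$$
$\varphi$ is positive definite (Lemma 7), and
$$e^{-\psi(x,y)} = e^{\varphi(x,y)}\, e^{-\psi(x,x_0)}\, \overline{e^{-\psi(y,x_0)}}\, e^{\psi(x_0,x_0)}.$$
This is also positive definite, by the exponential example after Proposition 1 and by Proposition 2.
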
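Both directions of Schoenberg's theorem can be observed numerically for a concrete negative definite kernel. The sketch below assumes NumPy and takes $\psi(x, y) = \|x - y\|^2$ on $\mathbb{R}^2$ (negative definite by Proposition 6); the sample points, the values of $t$, and the small step `t_small` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2))                       # arbitrary sample points in R^2
diff = X[:, None, :] - X[None, :, :]
Psi = np.sum(diff ** 2, axis=-1)                   # psi(x_i, x_j) = ||x_i - x_j||^2

# "Only if" direction: exp(-t * psi) should be positive semidefinite for every t > 0.
min_eigs = {t: float(np.linalg.eigvalsh(np.exp(-t * Psi)).min())
            for t in (0.1, 1.0, 10.0)}

# "If" direction: psi is recovered as the limit of (1 - exp(-t * psi)) / t as t -> 0.
t_small = 1e-6
Psi_limit = (1.0 - np.exp(-t_small * Psi)) / t_small

print(min_eigs)
print(np.abs(Psi_limit - Psi).max())
```

For this $\psi$, the matrices $\exp(-t\Psi)$ are Gram matrices of Gaussian kernels with different bandwidths.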
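More examples I

Proposition 9. If $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is negative definite and $\psi(x, x) \ge 0$, then for any $0 < p \le 1$, $\psi(x, y)^p$ is negative definite.

Proof. The case $p = 1$ is trivial. For $0 < p < 1$, use the following formula:
$$\psi(x, y)^p = \frac{p}{\Gamma(1 - p)} \int_0^\infty t^{-p-1} \bigl( 1 - e^{-t\psi(x,y)} \bigr)\, dt.$$
The integrand is negative definite for all $t > 0$.

Example. For any $0 < p \le 2$ and $\alpha > 0$,
$$\exp(-\alpha \|x - y\|^p)$$
is positive definite on $\mathbb{R}^n$. Indeed, $\|x - y\|^p = (\|x - y\|^2)^{p/2}$ is negative definite by Propositions 6 and 9, and Schoenberg's theorem applies.
$p = 2$: Gaussian kernel. $p = 1$: Laplacian kernel.
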
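The range $0 < p \le 2$ in this example cannot be extended, which a three-point check already shows. The sketch below assumes NumPy and scalar inputs; the points $0, 0.5, 1$, the value $\alpha = 1$, and the exponent $p = 4$ used as a counterexample are illustrative choices, not taken from the slides.

```python
import numpy as np

def power_exp_gram(xs, p, alpha=1.0):
    """Gram matrix of exp(-alpha * |x - y|^p) for scalar points xs."""
    D = np.abs(xs[:, None] - xs[None, :])
    return np.exp(-alpha * D ** p)

def min_eig(G):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(G).min())

xs = np.array([0.0, 0.5, 1.0])                    # three equally spaced points
results = {p: min_eig(power_exp_gram(xs, p)) for p in (0.5, 1.0, 2.0, 4.0)}

for p, value in results.items():
    print(f"p = {p}: smallest eigenvalue = {value:.4f}")
```

With these points the Gram matrix has a negative eigenvalue for $p = 4$, while it stays positive for $p = 0.5$, $1$, and $2$.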
More examples II

Proposition 10. If $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is negative definite and $\psi(x, x) \ge 0$, then
$$\log(1 + \psi(x, y))$$
is negative definite.

Proof.
$$\log(1 + \psi(x, y)) = \int_0^\infty \bigl( 1 - e^{-t\psi(x,y)} \bigr) \frac{e^{-t}}{t}\, dt.$$

More examples III

Corollary 11. If $\psi : \mathcal{X} \times \mathcal{X} \to (0, \infty)$ is negative definite, then
$$\log \psi(x, y)$$
is negative definite.

Proof. For any $c > 0$, $\log(\psi + 1/c) = \log(1 + c\psi) - \log c$ is negative definite. Take the limit $c \to \infty$.

Example. $\psi(x, y) = x + y$ is negative definite on $\mathbb{R}$. Hence $\psi(x, y) = \log(x + y)$ is negative definite on $(0, \infty)$.

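The first claim in the example, that $x + y$ is negative definite on $\mathbb{R}$, follows directly from the definition. A short worked derivation, using only the zero-sum constraint on the coefficients (the $x_i$ are real, so conjugation leaves them unchanged):

```latex
\begin{align*}
\sum_{i,j=1}^n c_i \overline{c_j}\,(x_i + x_j)
 &= \Bigl(\sum_{i=1}^n c_i x_i\Bigr)\,\overline{\Bigl(\sum_{j=1}^n c_j\Bigr)}
  + \Bigl(\sum_{i=1}^n c_i\Bigr)\,\overline{\Bigl(\sum_{j=1}^n c_j x_j\Bigr)} \\
 &= 0 \qquad \text{whenever } \sum_{i=1}^n c_i = 0 .
\end{align*}
```

The quadratic form vanishes, so the inequality $\le 0$ holds with equality. On $(0, \infty)$ the kernel $x + y$ is strictly positive, so Corollary 11 applies and gives the negative definiteness of $\log(x + y)$.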
More examples IV

Proposition 12. If $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{C}$ is negative definite and $\mathrm{Re}\, \psi(x, y) \ge 0$, then for any $a > 0$,
$$\frac{1}{\psi(x, y) + a}$$
is positive definite.

Proof.
$$\frac{1}{\psi(x, y) + a} = \int_0^\infty e^{-t(\psi(x,y) + a)}\, dt.$$
The integrand is positive definite for all $t > 0$.

Example. For any $0 < p \le 2$,
$$\frac{1}{1 + |x - y|^p}$$
is positive definite on $\mathbb{R}$.

Positive definite functions

Definition. Let $\phi : \mathbb{R}^n \to \mathbb{C}$ be a function. $\phi$ is called a positive definite function (or function of positive type) if $k(x, y) = \phi(x - y)$ is a positive definite kernel on $\mathbb{R}^n$, i.e.
$$\sum_{i,j=1}^n c_i \overline{c_j}\, \phi(x_i - x_j) \ge 0$$
for any $x_1, \ldots, x_n \in \mathbb{R}^n$ and $c_1, \ldots, c_n \in \mathbb{C}$.

A positive definite kernel of the form $\phi(x - y)$ is called shift invariant (or translation invariant).
Gaussian and Laplacian kernels are examples of shift-invariant positive definite kernels.

Bochner's theorem I

Bochner's theorem characterizes all the continuous shift-invariant positive definite kernels on $\mathbb{R}^n$.

Theorem 13 (Bochner). Let $\phi$ be a continuous function on $\mathbb{R}^n$. Then, $\phi$ is positive definite if and only if there is a finite non-negative Borel measure $\Lambda$ on $\mathbb{R}^n$ such that
$$\phi(x) = \int e^{\sqrt{-1}\, \omega^T x}\, d\Lambda(\omega).$$

$\phi$ is the inverse Fourier (or Fourier-Stieltjes) transform of $\Lambda$.
Roughly speaking, the continuous positive definite functions are exactly those with a non-negative Fourier transform.

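Bochner's theorem II

The Fourier kernel $e^{\sqrt{-1}\, x^T \omega}$ is a positive definite function for all $\omega \in \mathbb{R}^n$:
$$\exp(\sqrt{-1}\, (x - y)^T \omega) = \exp(\sqrt{-1}\, x^T \omega)\, \overline{\exp(\sqrt{-1}\, y^T \omega)}.$$

The set of all positive definite functions is a convex cone, which is closed under the pointwise-convergence topology.
The generators of the convex cone are the Fourier kernels $\{ e^{\sqrt{-1}\, x^T \omega} \mid \omega \in \mathbb{R}^n \}$.

Example on $\mathbb{R}$ (positive scales are neglected):

  $\phi(x)$                                  density of $\Lambda$ at $\omega$
  $\exp(-\frac{1}{2\sigma^2} x^2)$           $\exp(-\frac{\sigma^2}{2} \omega^2)$
  $\exp(-\alpha |x|)$                        $\frac{1}{\omega^2 + \alpha^2}$

Bochner's theorem is extended to topological groups and semigroups [BCR84].
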
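The two Fourier pairs in the example on $\mathbb{R}$ can be checked by Monte Carlo integration. The sketch below assumes NumPy, normalizes each spectral measure to a probability distribution (a normal law with standard deviation $1/\sigma$, and a Cauchy law with scale $\alpha$), and averages the Fourier kernel over samples. The values of `sigma`, `alpha`, the sample size, and the helper `empirical_phi` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
sigma, alpha = 1.5, 0.7                            # arbitrary kernel parameters
x = np.linspace(-3.0, 3.0, 13)                     # evaluation points on R

# Normalized spectral measures: density ~ exp(-sigma^2 w^2 / 2) and ~ 1 / (w^2 + alpha^2).
omega_gauss = rng.normal(scale=1.0 / sigma, size=n_samples)
omega_cauchy = alpha * rng.standard_cauchy(size=n_samples)

def empirical_phi(x, omega):
    """Monte Carlo estimate of the integral of exp(i * omega * x) over the sampled measure."""
    return np.exp(1j * np.outer(x, omega)).mean(axis=1)

err_gauss = np.max(np.abs(empirical_phi(x, omega_gauss)
                          - np.exp(-x ** 2 / (2.0 * sigma ** 2))))
err_laplace = np.max(np.abs(empirical_phi(x, omega_cauchy)
                            - np.exp(-alpha * np.abs(x))))

print(err_gauss, err_laplace)
```

Both errors shrink at the usual Monte Carlo rate of roughly $1/\sqrt{N}$ in the number of samples.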
Integral characterization of positive definite kernels I

$\Omega$: compact Hausdorff space. $\mu$: finite Borel measure on $\Omega$.

Proposition 14. Let $K(x, y)$ be a continuous function on $\Omega \times \Omega$. $K(x, y)$ is a positive definite kernel on $\Omega$ if and only if
$$\int_\Omega \int_\Omega K(x, y)\, f(x)\, \overline{f(y)}\, dx\, dy \ge 0$$
for each function $f \in L^2(\Omega, \mu)$.

cf. Definition of positive definiteness:
$$\sum_{i,j} K(x_i, x_j)\, c_i\, \overline{c_j} \ge 0.$$

Integral characterization of positive definite kernels II

Proof.
($\Rightarrow$). For a continuous function $f$, a Riemann sum satisfies
$$\sum_{i,j} K(x_i, x_j)\, f(x_i)\, \overline{f(x_j)}\, \mu(E_i)\, \mu(E_j) \ge 0.$$
The integral is the limit of such sums, thus non-negative. For $f \in L^2(\Omega, \mu)$, approximate it by a continuous function.

($\Leftarrow$). Suppose
$$\sum_{i,j=1}^n c_i \overline{c_j}\, K(x_i, x_j) = -\delta < 0.$$
By continuity of $K$, there is an open neighborhood $U_i$ of $x_i$ such that
$$\sum_{i,j=1}^n c_i \overline{c_j}\, K(z_i, z_j) \le -\delta/2$$
for all $z_i \in U_i$. We can approximate $\sum_i \frac{c_i}{\mu(U_i)} I_{U_i}$ by a continuous function $f$ with arbitrary accuracy, which makes the double integral negative, a contradiction.

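Integral Kernel

$(\Omega, \mathcal{B}, \mu)$: measure space.
$K(x, y)$: measurable function on $\Omega \times \Omega$ such that
$$\int_\Omega \int_\Omega |K(x, y)|^2\, dx\, dy < \infty. \qquad \text{(square integrability)}$$

Define an operator $T_K$ on $L^2(\Omega, \mu)$ by
$$(T_K f)(x) = \int_\Omega K(x, y)\, f(y)\, dy \qquad (f \in L^2(\Omega, \mu)).$$
$T_K$: integral operator with integral kernel $K$.

Fact: $T_K f \in L^2(\Omega, \mu)$.
Indeed, by the Cauchy-Schwarz inequality,
$$\int |T_K f(x)|^2\, dx = \int \Bigl| \int K(x, y) f(y)\, dy \Bigr|^2 dx \le \int \Bigl\{ \int |K(x, y)|^2\, dy \int |f(y)|^2\, dy \Bigr\} dx = \int\!\!\int |K(x, y)|^2\, dx\, dy\; \|f\|_{L^2}^2.$$
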
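Hilbert-Schmidt operator I

$H$: separable Hilbert space.

Definition. An operator $T$ on $H$ is called Hilbert-Schmidt if for a complete orthonormal system (CONS) $\{\varphi_i\}_{i=1}^\infty$,
$$\sum_{i=1}^\infty \|T \varphi_i\|^2 < \infty.$$
For a Hilbert-Schmidt operator $T$, the Hilbert-Schmidt norm $\|T\|_{HS}$ is defined by
$$\|T\|_{HS} = \Bigl( \sum_{i=1}^\infty \|T \varphi_i\|^2 \Bigr)^{1/2}.$$

$\|T\|_{HS}$ does not depend on the choice of a CONS.
Indeed, from Parseval's equality, for a CONS $\{\psi_j\}_{j=1}^\infty$,
$$\|T\|_{HS}^2 = \sum_{i=1}^\infty \|T \varphi_i\|^2 = \sum_{i=1}^\infty \sum_{j=1}^\infty |(\psi_j, T \varphi_i)|^2 = \sum_{j=1}^\infty \sum_{i=1}^\infty |(T^* \psi_j, \varphi_i)|^2 = \sum_{j=1}^\infty \|T^* \psi_j\|^2.$$
The right-hand side does not involve $\{\varphi_i\}$, so the value is the same for every CONS.
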
Hilbert-Schmidt operator II

Fact: $\|T\| \le \|T\|_{HS}$.

The Hilbert-Schmidt norm is an extension of the Frobenius norm of a matrix:
$$\|T\|_{HS}^2 = \sum_{i=1}^\infty \sum_{j=1}^\infty |(\psi_j, T \varphi_i)|^2.$$
$(\psi_j, T \varphi_i)$ is the component of the matrix expression of $T$ with the CONSs $\{\varphi_i\}$ and $\{\psi_j\}$.

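In finite dimensions these statements reduce to familiar matrix facts, which the following sketch checks. It assumes NumPy and a random complex $5 \times 5$ matrix standing in for $T$; the second orthonormal basis is taken from a QR factorization, which is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
T = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))     # operator on C^5

# A second orthonormal basis of C^5: the columns of a unitary matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))

# Hilbert-Schmidt norm from the definition, for two different orthonormal bases.
hs_standard = np.sqrt(sum(np.linalg.norm(T[:, i]) ** 2 for i in range(n)))
hs_rotated = np.sqrt(sum(np.linalg.norm(T @ Q[:, i]) ** 2 for i in range(n)))

frobenius = np.linalg.norm(T, "fro")       # Frobenius norm of the matrix
operator_norm = np.linalg.norm(T, 2)       # largest singular value

print(hs_standard, hs_rotated, frobenius, operator_norm)
```

The two basis-dependent sums agree with each other and with the Frobenius norm, and the operator norm is bounded by that common value.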
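Hilbert-Schmidt operator and integral kernel I

Recall
$$(T_K f)(x) = \int_\Omega K(x, y)\, f(y)\, dy \qquad (f \in L^2(\Omega, \mu))$$
with square integrable kernel $K$.

Theorem 15. Assume $L^2(\Omega, \mu)$ is separable. Then, $T_K$ is a Hilbert-Schmidt operator, and
$$\|T_K\|_{HS}^2 = \int\!\!\int |K(x, y)|^2\, dx\, dy.$$

Proof. Let $\{\varphi_i\}$ be a CONS. From Parseval's equality ($\{\overline{\varphi_i}\}$ is also a CONS),
$$\int |K(x, y)|^2\, dy = \sum_i \bigl| (K(x, \cdot), \overline{\varphi_i})_{L^2} \bigr|^2 = \sum_i \Bigl| \int K(x, y)\, \varphi_i(y)\, dy \Bigr|^2 = \sum_i |T_K \varphi_i(x)|^2.$$
Integrating w.r.t. $x$,
$$\int\!\!\int |K(x, y)|^2\, dx\, dy = \sum_i \|T_K \varphi_i\|^2 = \|T_K\|_{HS}^2.$$
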
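Theorem 15 can be illustrated by discretizing an integral operator. The sketch below assumes NumPy and a hypothetical example that is not in the slides: $\Omega = [0, 1]$ with Lebesgue measure and $K(x, y) = \min(x, y)$, for which $\int\!\!\int K^2\, dx\, dy = 1/6$. The operator is approximated on a midpoint grid of `n` cells, so that the matrix `T = K * h` represents $T_K$ in the orthonormal basis of normalized cell indicators.

```python
import numpy as np

n = 400                                        # number of grid cells (arbitrary)
h = 1.0 / n
x = (np.arange(n) + 0.5) * h                   # cell midpoints in [0, 1]

K = np.minimum(x[:, None], x[None, :])         # K(x_i, x_j) = min(x_i, x_j)
T = K * h                                      # matrix of the discretized operator T_K

hs_sq_discrete = float(np.sum(T ** 2))         # squared HS (Frobenius) norm of T
eigvals = np.linalg.eigvalsh(T)
hs_sq_from_eigs = float(np.sum(eigvals ** 2))  # same quantity via the eigenvalues
hs_sq_exact = 1.0 / 6.0                        # exact value of the double integral of K^2

print(hs_sq_discrete, hs_sq_from_eigs, hs_sq_exact)
```

The squared Frobenius norm of `T` is exactly the midpoint Riemann sum of $\int\!\!\int K^2$, which is why it approaches $1/6$ as the grid is refined.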
Hilbert-Schmidt operator and integral kernel II

The converse is true!

Theorem 16. Assume $L^2(\Omega, \mu)$ is separable. For any Hilbert-Schmidt operator $T$ on $L^2(\Omega, \mu)$, there is a square integrable kernel $K(x, y)$ such that
$$T \varphi = \int K(x, y)\, \varphi(y)\, dy.$$

Outline of the proof. Fix a CONS $\{\varphi_i\}$. Define
$$K_n(x, y) = \sum_{i=1}^n (T \varphi_i)(x)\, \overline{\varphi_i(y)} \qquad (n = 1, 2, 3, \ldots).$$
We can show that $\{K_n(x, y)\}$ is a Cauchy sequence in $L^2(\Omega \times \Omega, \mu \times \mu)$, and the limit works as $K$ in the statement.

Integral operator by positive definite kernel

$\Omega$: compact Hausdorff space. $\mu$: finite Borel measure on $\Omega$.
$K(x, y)$: continuous positive definite kernel on $\Omega$.
$$(T_K f)(x) = \int_\Omega K(x, y)\, f(y)\, dy \qquad (f \in L^2(\Omega, \mu))$$

Fact: From Proposition 14,
$$(T_K f, f)_{L^2(\Omega, \mu)} \ge 0 \qquad (\forall f \in L^2(\Omega, \mu)).$$
In particular, any eigenvalue of $T_K$ is non-negative.

Mercer's theorem

$K(x, y)$: continuous positive definite kernel on $\Omega$.
$\{\lambda_i\}_{i=1}^\infty$, $\{\varphi_i\}_{i=1}^\infty$: the positive eigenvalues and eigenfunctions of $T_K$.
$$\lambda_1 \ge \lambda_2 \ge \cdots > 0, \qquad \lim_{i \to \infty} \lambda_i = 0.$$
$$T_K \varphi_i = \lambda_i \varphi_i, \qquad \int K(x, y)\, \varphi_i(y)\, dy = \lambda_i \varphi_i(x).$$

Theorem 17 (Mercer).
$$K(x, y) = \sum_{i=1}^\infty \lambda_i\, \varphi_i(x)\, \overline{\varphi_i(y)},$$
where the convergence is absolute and uniform over $\Omega \times \Omega$.

Proof is omitted. See [RSN65], Section 98, or [Ito78], Chapter 13.

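A concrete instance may help to see the Mercer expansion at work. The sketch below assumes NumPy and a hypothetical example that is not in the slides: $K(x, y) = \min(x, y)$ on $\Omega = [0, 1]$ with Lebesgue measure, whose eigenpairs are $\lambda_i = ((i - \tfrac12)\pi)^{-2}$ and $\varphi_i(x) = \sqrt{2} \sin((i - \tfrac12)\pi x)$. The truncation level `m` and the evaluation grid are arbitrary.

```python
import numpy as np

m = 2000                                            # number of terms kept (arbitrary)
i = np.arange(1, m + 1)
freq = (i - 0.5) * np.pi
lam = 1.0 / freq ** 2                               # eigenvalues, decreasing to 0

x = np.linspace(0.0, 1.0, 41)                       # evaluation grid on [0, 1]
Phi = np.sqrt(2.0) * np.sin(np.outer(x, freq))      # Phi[a, i] = phi_i(x_a)

K_exact = np.minimum(x[:, None], x[None, :])        # K(x, y) = min(x, y)
K_series = (Phi * lam) @ Phi.T                      # truncated sum of lam_i phi_i(x) phi_i(y)
max_err = float(np.max(np.abs(K_exact - K_series)))

print(max_err)
```

The maximum error over the grid is controlled by the tail $\sum_{i > m} 2\lambda_i$, which is of order $1/m$ here, consistent with the uniform convergence stated in the theorem.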
References

[BCR84] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
[Ito78] Seizo Ito. Kansu-Kaiseki III (Iwanami kouza Kiso-suugaku). Iwanami Shoten, 1978.
[RSN65] Frigyes Riesz and Béla Sz.-Nagy. Functional Analysis (2nd ed.). Frederick Ungar Publishing Co, 1965.