The Kernel Trick. Robert M. Haralick. Computer Science, Graduate Center, City University of New York
1 The Kernel Trick
Robert M. Haralick, Computer Science, Graduate Center, City University of New York
2 Outline
3 SVM Classification
$\langle (x_1, c_1), \ldots, (x_Z, c_Z) \rangle$ is the training data, $c_1, \ldots, c_Z \in \{-1, +1\}$ specifies the corresponding classes, and $\alpha_1, \ldots, \alpha_Z$ are the Lagrange multipliers. Assign $x$ to class $+1$ when $w \cdot x > \theta$; else assign to class $-1$, where
$$w \cdot x = \left( \sum_{z \in S} \alpha_z c_z x_z \right) \cdot x = \sum_{z \in S} \alpha_z c_z (x_z \cdot x)$$
and $S$ is the index set of the support vectors.
4 Dual Form
Maximize
$$L(\alpha_1, \ldots, \alpha_Z) = \sum_{z=1}^{Z} \alpha_z - \frac{1}{2} \sum_{i=1}^{Z} \sum_{j=1}^{Z} \alpha_i \alpha_j c_i c_j\, (x_i \cdot x_j)$$
subject to
$$\alpha_z \geq 0, \quad z = 1, \ldots, Z, \qquad \sum_{z=1}^{Z} \alpha_z c_z = 0.$$
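To make the dual concrete, here is a minimal sketch (not part of the original slides) that hands it to SciPy's SLSQP solver; the toy data, variable names, and use of a general-purpose optimizer are illustrative assumptions, not how production SVMs are trained.

```python
# Solve the hard-margin SVM dual on toy 2-D data with SciPy's SLSQP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [1., 1.], [2., 0.], [3., 3.]])  # training points x_z
c = np.array([-1., -1., 1., 1.])                        # class labels c_z
Z = len(c)
G = (c[:, None] * c[None, :]) * (X @ X.T)               # G_ij = c_i c_j (x_i . x_j)

def neg_dual(alpha):
    # SLSQP minimizes, so negate the dual objective L(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(Z), method="SLSQP",
               bounds=[(0., None)] * Z,
               constraints=[{"type": "eq", "fun": lambda a: a @ c}])
alpha = res.x
w = (alpha * c) @ X   # w = sum_z alpha_z c_z x_z
print("alpha =", np.round(alpha, 4), " w =", np.round(w, 4))
```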
5 Making Non-Linear Boundaries
Example: change the input
$$x_{2 \times 1} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \quad \text{to} \quad \phi(x)_{5 \times 1} = \begin{pmatrix} x_1^2 \\ x_2^2 \\ x_1 x_2 \\ x_1 \\ x_2 \end{pmatrix}$$
Maximize
$$L(\alpha_1, \ldots, \alpha_Z) = \sum_{z=1}^{Z} \alpha_z - \frac{1}{2} \sum_{i=1}^{Z} \sum_{j=1}^{Z} \alpha_i \alpha_j c_i c_j\, (\phi(x_i) \cdot \phi(x_j))$$
subject to $\alpha_z \geq 0$, $z = 1, \ldots, Z$, and $\sum_{z=1}^{Z} \alpha_z c_z = 0$, with
$$w = \sum_{z \in S} \alpha_z c_z\, \phi(x_z).$$
6 Making Non-Linear Boundaries
Assign $x$ to class $+1$ when $w \cdot \phi(x) > \theta$:
$$w_1 x_1^2 + w_2 x_2^2 + w_3 x_1 x_2 + w_4 x_1 + w_5 x_2 > \theta$$
The linear problem in the 5-dimensional space is the same as the quadratic problem in the 2-dimensional space. The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary.
7 Making Non-Linear Boundaries
Maximize
$$L(\alpha_1, \ldots, \alpha_Z) = \sum_{z=1}^{Z} \alpha_z - \frac{1}{2} \sum_{i=1}^{Z} \sum_{j=1}^{Z} \alpha_i \alpha_j c_i c_j\, (\phi(x_i) \cdot \phi(x_j))$$
Set $K(x, y) = \phi(x) \cdot \phi(y)$ and maximize instead
$$L(\alpha_1, \ldots, \alpha_Z) = \sum_{z=1}^{Z} \alpha_z - \frac{1}{2} \sum_{i=1}^{Z} \sum_{j=1}^{Z} \alpha_i \alpha_j c_i c_j\, K(x_i, x_j)$$
Classification likewise needs only kernel evaluations:
$$w \cdot \phi(x) = \sum_{z \in S} \alpha_z c_z\, \phi(x_z) \cdot \phi(x) = \sum_{z \in S} \alpha_z c_z\, K(x_z, x).$$
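A minimal sketch (not from the slides) of this kernelized decision rule: the score $\sum_{z \in S} \alpha_z c_z K(x_z, x)$ is computed without ever forming $\phi$. The names X_sv, c_sv, and alpha_sv are assumed to come from an already-solved dual.

```python
import numpy as np

def quad_kernel(x, y):
    # K(x, y) = (x . y)^2, the quadratic kernel used on the next slide
    return float(np.dot(x, y)) ** 2

def decision(x, X_sv, c_sv, alpha_sv, theta=0.0):
    # w . phi(x) = sum over support vectors of alpha_z c_z K(x_z, x)
    score = sum(a * cz * quad_kernel(xz, x)
                for a, cz, xz in zip(alpha_sv, c_sv, X_sv))
    return +1 if score > theta else -1
```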
8 The Kernel Trick
What happens when $K(x, y) = (x \cdot y)^2$? Example: with
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad x \cdot y = x_1 y_1 + x_2 y_2,$$
$$(x \cdot y)^2 = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2 = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} \cdot \begin{pmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{pmatrix} = \phi(x) \cdot \phi(y)$$
The components of the weight vector can compensate for coefficients that are not 1.
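A quick numeric check (illustrative, not from the slides) that $(x \cdot y)^2 = \phi(x) \cdot \phi(y)$ for this $\phi$:

```python
import numpy as np

def phi(v):
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(np.dot(x, y) ** 2, np.dot(phi(x), phi(y)))
```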
9 Two-Class Data Set
Figure: raw data in 2D that is not separable by a hyperplane.
10 Two-Class Data Set Transformed
Figure: the transformed data, $(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
11 Brute Force: Quadratic Case
If $x$ has $N$ components, making it quadratic requires an additional $N(N-1)/2$ components:
$$x_{N \times 1} \longrightarrow \phi(x)_{(N + N(N-1)/2) \times 1}$$
$$w_{(N + N(N-1)/2) \times 1} = \sum_{z \in S} \alpha_z c_z\, \phi(x_z), \qquad w \cdot \phi(x) = \sum_{z \in S} \alpha_z c_z\, \phi(x_z) \cdot \phi(x)$$
It takes $N + N(N-1)/2$ operations to map $x \mapsto \phi(x)$ and $N + N(N-1)/2$ operations to compute $w \cdot \phi(x)$.
12 The Kernel Trick
For $x_{N \times 1}$: $x \cdot y$ takes $N$ operations, so $(x \cdot y)^2$ takes $N + 1$ operations. In
$$\sum_{z \in S} \alpha_z c_z\, K(x_z, x), \qquad K(x_z, x) = (x_z \cdot x)^2,$$
each kernel evaluation takes $N + 1$ operations, so computing $w \cdot \phi(x)$ this way takes $|S|(N + 1)$ operations, versus $N + N(N-1)/2$ operations for the brute-force dot product in feature space.
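A back-of-the-envelope comparison of the two counts (the dimension and support-vector count below are arbitrary assumptions):

```python
N = 1000                          # input dimension
S = 50                            # number of support vectors
kernel_ops = S * (N + 1)          # |S| evaluations of (x_z . x)^2
brute_ops = N + N * (N - 1) // 2  # one dot product in quadratic feature space
print(kernel_ops, brute_ops)      # 50050 vs 500500
```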
13 The Kernel Trick
What happens when $K(x, y) = (x \cdot y)^3$? Example: with $x \cdot y = x_1 y_1 + x_2 y_2$,
$$(x \cdot y)^3 = (x_1 y_1)^3 + 3 (x_1 y_1)^2 x_2 y_2 + 3 (x_1 y_1)(x_2 y_2)^2 + (x_2 y_2)^3 = \begin{pmatrix} x_1^3 \\ \sqrt{3}\, x_1^2 x_2 \\ \sqrt{3}\, x_1 x_2^2 \\ x_2^3 \end{pmatrix} \cdot \begin{pmatrix} y_1^3 \\ \sqrt{3}\, y_1^2 y_2 \\ \sqrt{3}\, y_1 y_2^2 \\ y_2^3 \end{pmatrix} = \phi(x) \cdot \phi(y)$$
14 The Kernel Trick
What kind of functions $K(x, y)$ can be represented in the form $K(x, y) = \phi(x) \cdot \phi(y)$?
15 Kernel Definition
Definition. A kernel is a function $K : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ such that there is a feature function $\phi : \mathbb{R}^M \to \mathbb{R}^N$ with
$$K(x, y) = \phi(x) \cdot \phi(y) \quad \text{for all } x, y \in \mathbb{R}^M.$$
Proposition. Kernels are symmetric functions.
16 Kernel
Let $x = (x_1, \ldots, x_M)$ and $y = (y_1, \ldots, y_M)$.
$$K(x, y) = (x \cdot y)^2 = \left( \sum_{m=1}^{M} x_m y_m \right)^2 = \sum_{i=1}^{M} x_i y_i \sum_{j=1}^{M} x_j y_j = \sum_{i=1}^{M} \sum_{j=1}^{M} x_i y_i x_j y_j = \sum_{i=1}^{M} \sum_{j=1}^{M} (x_i x_j)(y_i y_j)$$
$$= \left( (x_i x_j) \right)_{(i,j)=(1,1)}^{(M,M)} \cdot \left( (y_i y_j) \right)_{(i,j)=(1,1)}^{(M,M)}$$
so $\phi(x) = \left( (x_i x_j) \right)_{(i,j)=(1,1)}^{(M,M)}$, the vector of all $M^2$ products $x_i x_j$.
17 Kernel
Let $x = (x_1, \ldots, x_M)$ and $y = (y_1, \ldots, y_M)$.
$$K(x, y) = (x \cdot y + 1)^2 = \left( \sum_{i=1}^{M} x_i y_i + 1 \right) \left( \sum_{j=1}^{M} x_j y_j + 1 \right) = \sum_{i=1}^{M} \sum_{j=1}^{M} x_i x_j y_i y_j + 2 \sum_{i=1}^{M} x_i y_i + 1$$
$$= \sum_{i=1}^{M} \sum_{j=1}^{M} (x_i x_j)(y_i y_j) + \sum_{i=1}^{M} (\sqrt{2}\, x_i)(\sqrt{2}\, y_i) + (1)(1)$$
18 Common Kernels
For $x, y \in \mathbb{R}^N$:
- Polynomial: $K(x, y) = (x \cdot y + 1)^q$, about $N$ operations (plus a log and an exponentiate for the power)
- Radial: $K(x, y) = \exp\left( -\frac{\|x - y\|^2}{\sigma^2} \right)$, about $N$ operations (plus an exponentiate)
- Sigmoid: $K(x, y) = \tanh(\beta\, x \cdot y + \gamma)$, about $N$ operations (plus a hyperbolic tangent)

$K(x, y)$ measures the similarity between $x$ and $y$.
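Minimal sketches of these three kernels (the parameter names q, sigma, beta, gamma follow the slide; the defaults are arbitrary assumptions):

```python
import numpy as np

def polynomial(x, y, q=2):
    return (np.dot(x, y) + 1.0) ** q

def radial(x, y, sigma=1.0):
    d = x - y
    return np.exp(-np.dot(d, d) / sigma**2)

def sigmoid(x, y, beta=1.0, gamma=0.0):
    return np.tanh(beta * np.dot(x, y) + gamma)
```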
19 Kernels and Feature Spaces
Given a function $K : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$, how can we tell that there exists a feature function $\phi$ such that $K(x, y) = \phi(x) \cdot \phi(y)$, thereby making $K$ a kernel?
20 Mercer's Kernel Characterization
Theorem. Let $K : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ be a symmetric function. There exists $\phi : \mathbb{R}^M \to \mathbb{R}^N$ such that $K(x, y) = \phi(x) \cdot \phi(y)$ if and only if
$$\int_{x \in \mathbb{R}^M} \int_{y \in \mathbb{R}^M} K(x, y)\, g(x)\, g(y)\, dx\, dy \geq 0$$
for every function $g : \mathbb{R}^M \to \mathbb{R}$ satisfying $\int_{x \in \mathbb{R}^M} g(x)^2\, dx < \infty$.
21 Mercer's Kernel Characterization
Theorem. Let $K : \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ be a symmetric function. There exists $\phi : \mathbb{R}^M \to \mathbb{R}^N$ such that $K(x, y) = \phi(x) \cdot \phi(y)$ if and only if, for every $L$ and every set $\{x_1, \ldots, x_L\}$ of $L$ points from $\mathbb{R}^M$, the matrix
$$\mathbf{K} = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_L) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_L) \\ \vdots & & & \vdots \\ K(x_L, x_1) & K(x_L, x_2) & \cdots & K(x_L, x_L) \end{pmatrix}$$
is positive semidefinite.
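An empirical sketch of this finite-point test (the radial kernel, sample size, and tolerance are illustrative assumptions): build the Gram matrix on random points and confirm its eigenvalues are non-negative up to round-off.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    d = x - y
    return np.exp(-np.dot(d, d) / sigma**2)

rng = np.random.default_rng(1)
pts = rng.normal(size=(20, 3))                      # L = 20 points in R^3
K = np.array([[rbf(a, b) for b in pts] for a in pts])
eigs = np.linalg.eigvalsh(K)                        # symmetric eigensolver
print(eigs.min() >= -1e-10)                         # True: K is PSD
```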
22 Symmetric Positive Semi-definite
Definition. A symmetric matrix $B_{N \times N}$ is positive semi-definite if and only if $x' B x \geq 0$ for every $x \in \mathbb{R}^N$.
Theorem. A symmetric matrix $B$ is positive semi-definite if and only if all its eigenvalues are non-negative.
Theorem. If $B$ is a symmetric positive semi-definite matrix, then there exists a matrix $A$ satisfying $B = A' A$.
23 Positive Definite Kernel
Definition. A symmetric function $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is positive definite (pd) if and only if, for any $x_1, \ldots, x_L \in \mathbb{R}^N$ and any $c_1, \ldots, c_L \in \mathbb{R}$,
$$\sum_{i=1}^{L} \sum_{j=1}^{L} c_i\, K(x_i, x_j)\, c_j \geq 0.$$
An alternative way of writing $\sum_{i=1}^{L} \sum_{j=1}^{L} c_i K(x_i, x_j) c_j$ is $c' \mathbf{K} c$, where $c = (c_1, \ldots, c_L)$ and $\mathbf{K}_{ij} = K(x_i, x_j)$.
24 A Positive Definite Kernel
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be defined by $K(x, y) = x \cdot y$. Then $K$ is a positive definite kernel.
Proof. Let $x_1, \ldots, x_K \in \mathbb{R}^N$. Define
$$A = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_K' \end{pmatrix}, \qquad \mathbf{K} = A A'.$$
Let $c \in \mathbb{R}^K$. Then $c' \mathbf{K} c = c' A A' c = (A' c)'(A' c) \geq 0$.
25 Positive Definite Kernel Properties
Proposition. Suppose $K_i : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$, $i = 1, 2, \ldots$, are positive definite (pd) kernels. Then:
- $K(x, y) = \sum_{j=1}^{J} a_j K_j(x, y)$ is a pd kernel when $a_j \geq 0$
- $K(x, y) = \prod_{j=1}^{J} K_j(x, y)$ is a pd kernel
- $K(x, y) = \lim_{j \to \infty} K_j(x, y)$ is a pd kernel
- $K(x, y) = q(K_1(x, y))$ is a pd kernel, where $q$ is any polynomial with non-negative coefficients
- $K(x, y) = f(x)\, K_1(x, y)\, f(y)$ is a pd kernel for any $f : \mathbb{R}^N \to \mathbb{R}$
- $K(x, y) = f(x) f(y)$ is a pd kernel for any $f : \mathbb{R}^N \to \mathbb{R}$
- $K(x, y) = K_1(\phi(x), \phi(y))$ is a pd kernel for any feature map $\phi$
- $K(x, y) = x' B y$ is a pd kernel, where $B$ is a symmetric positive semidefinite matrix
26 Products of pd Kernels Are pd Kernels
Proposition. Let $\psi_1, \psi_2 : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be positive definite kernels. Define $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ by $\psi(x, y) = \psi_1(x, y)\, \psi_2(x, y)$. Then $\psi$ is a positive definite kernel.
Proof. Let $x_1, \ldots, x_M \in \mathbb{R}^N$ and let $c_1, \ldots, c_M \in \mathbb{R}$. Then
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, \psi(x_i, x_j)\, c_j = \sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, \psi_1(x_i, x_j)\, \psi_2(x_i, x_j)\, c_j.$$
Since $\psi_1$ and $\psi_2$ are both positive definite, there exist matrices $A_{M \times M}$ and $B_{M \times M}$ such that
$$\psi_1(x_i, x_j) = (A' A)_{ij} = \sum_{k=1}^{M} a_{ki} a_{kj}, \qquad \psi_2(x_i, x_j) = (B' B)_{ij} = \sum_{k=1}^{M} b_{ki} b_{kj}.$$
27 Products of pd Kernels Are pd Kernels
Proof (continued). Hence $\psi_1(x_i, x_j) = \sum_{k=1}^{M} a_{ki} a_{kj}$ and $\psi_2(x_i, x_j) = \sum_{l=1}^{M} b_{li} b_{lj}$. Therefore,
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, \psi_1(x_i, x_j)\, \psi_2(x_i, x_j)\, c_j = \sum_{i=1}^{M} \sum_{j=1}^{M} c_i \left( \sum_{k=1}^{M} a_{ki} a_{kj} \right) \left( \sum_{l=1}^{M} b_{li} b_{lj} \right) c_j$$
$$= \sum_{k=1}^{M} \sum_{l=1}^{M} \left( \sum_{i=1}^{M} c_i a_{ki} b_{li} \right) \left( \sum_{j=1}^{M} c_j a_{kj} b_{lj} \right) = \sum_{k=1}^{M} \sum_{l=1}^{M} \left( \sum_{i=1}^{M} c_i a_{ki} b_{li} \right)^2 \geq 0.$$
28 Exponential Kernel
Proposition. Define $q_j(x) = \sum_{n=0}^{j} \frac{x^n}{n!}$. Suppose $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is a positive definite kernel. Then
$$\exp(K(x, y)) = \lim_{j \to \infty} q_j(K(x, y))$$
is a positive definite kernel.
Proof. $q_j(x)$ is a polynomial with all positive coefficients. Since $K$ is a positive definite kernel, $q_j(K(x, y))$ is a positive definite kernel. Limits of positive definite kernels are positive definite kernels. Therefore $\exp(K(x, y))$ is a positive definite kernel.
29 Positive Definite Kernel Property
Proposition. Let $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a symmetric positive definite kernel and let $f : \mathbb{R}^N \to \mathbb{R}$ be any function. Then $K(x, y) = f(x)\, \psi(x, y)\, f(y)$ is a symmetric positive definite kernel.
Proof. Symmetry is immediate because $\psi$ is symmetric and multiplication is commutative. Let $x_1, \ldots, x_M \in \mathbb{R}^N$ and $c_1, \ldots, c_M \in \mathbb{R}$. Then
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, K(x_i, x_j)\, c_j = \sum_{i=1}^{M} \sum_{j=1}^{M} (c_i f(x_i))\, \psi(x_i, x_j)\, (f(x_j) c_j).$$
Let $d_i = c_i f(x_i)$. Then, since $\psi$ is positive definite,
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, K(x_i, x_j)\, c_j = \sum_{i=1}^{M} \sum_{j=1}^{M} d_i\, \psi(x_i, x_j)\, d_j \geq 0.$$
30 Positive Definite Kernel
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a positive definite kernel. Then $K(x, x)\, K(y, y) \geq K(x, y)^2$.
Proof. Let $x, y \in \mathbb{R}^N$. Then the matrix
$$\begin{pmatrix} K(x, x) & K(x, y) \\ K(y, x) & K(y, y) \end{pmatrix}$$
is positive semidefinite. Hence its determinant must be non-negative, so $K(x, x)\, K(y, y) - K(x, y)^2 \geq 0$. Therefore $K(x, x)\, K(y, y) \geq K(x, y)^2$.
31 Example
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be defined by
$$K(x, y) = \exp\left( -\frac{(x - y)'(x - y)}{\sigma^2} \right).$$
Then $K$ is a positive definite kernel.
Proof.
- $x \cdot y$ is a pd kernel.
- $\frac{2\, x \cdot y}{\sigma^2}$ is a pd kernel.
- $\exp\left( \frac{2\, x \cdot y}{\sigma^2} \right)$ is a pd kernel.
- $\exp\left( -\frac{x \cdot x}{\sigma^2} \right) \exp\left( \frac{2\, x \cdot y}{\sigma^2} \right) \exp\left( -\frac{y \cdot y}{\sigma^2} \right)$ is a pd kernel, taking $f(x) = \exp(-x \cdot x / \sigma^2)$ in the $f(x)\, \psi(x, y)\, f(y)$ property.
- $\exp\left( -\frac{(x - y)'(x - y)}{\sigma^2} \right)$ is a pd kernel, since $(x - y)'(x - y) = x \cdot x - 2\, x \cdot y + y \cdot y$.
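A numeric sanity check (illustrative) of the factorization the proof relies on, $\exp(-\|x - y\|^2 / \sigma^2) = f(x)\, \exp(2\, x \cdot y / \sigma^2)\, f(y)$ with $f(v) = \exp(-v \cdot v / \sigma^2)$:

```python
import numpy as np

sigma = 1.5
f = lambda v: np.exp(-np.dot(v, v) / sigma**2)
rng = np.random.default_rng(2)
x, y = rng.normal(size=3), rng.normal(size=3)
lhs = np.exp(-np.dot(x - y, x - y) / sigma**2)
rhs = f(x) * np.exp(2 * np.dot(x, y) / sigma**2) * f(y)
assert np.isclose(lhs, rhs)
```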
32 Positive Definite Kernel Properties
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a positive definite kernel. Then
$$K_1(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}}$$
is a positive definite kernel.
Proof.
$$K_1(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}} = \frac{1}{\sqrt{K(x, x)}}\, K(x, y)\, \frac{1}{\sqrt{K(y, y)}},$$
which has the form $f(x)\, K(x, y)\, f(y)$ with $f(x) = 1 / \sqrt{K(x, x)}$, hence is pd.
33 Positive Definite Kernel
Proposition. Let $h : \mathbb{R}^N \to \mathbb{R}$ be any function satisfying $h(x) \geq 0$, with minimum at $0$. Then
$$K(x, y) = h(x + y) - h(x - y)$$
is a positive definite kernel.
34 Symmetric Negative Definite
Definition. A symmetric function $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is negative definite if and only if, for every $x_1, \ldots, x_M \in \mathbb{R}^N$ and $c_1, \ldots, c_M \in \mathbb{R}$ satisfying $\sum_{m=1}^{M} c_m = 0$,
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, \psi(x_i, x_j)\, c_j \leq 0.$$
Proposition. Let $\Psi_{M \times M}$ be defined by $\Psi_{ij} = \psi(x_i, x_j)$. If $\Psi$ is negative semi-definite, then $\psi$ is negative definite.
35 Properties of Kernels
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a symmetric function. If $K$ is a positive definite kernel, then $-K$ is a negative definite kernel.
Proof. Let $x_1, \ldots, x_K \in \mathbb{R}^N$ and $c \in \mathbb{R}^K$. Suppose $K$ is a positive definite kernel; then $c' \mathbf{K} c \geq 0$, so $c' (-\mathbf{K}) c = -c' \mathbf{K} c \leq 0$. Since this is true for all $c$, it is certainly true for $c$ satisfying $\mathbf{1}' c = 0$, and this implies that $-K$ is a negative definite kernel.
36 A Negative Definite Kernel
Proposition. Let $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be defined by $\psi(x, y) = (x - y)'(x - y)$. Then $\psi$ is a negative definite kernel.
Proof. Let $x_1, \ldots, x_K \in \mathbb{R}^N$ and $c_1, \ldots, c_K \in \mathbb{R}$ satisfying $\sum_{k=1}^{K} c_k = 0$. Then
$$\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, (x_i - x_j)'(x_i - x_j)\, c_j = \sum_{i=1}^{K} c_i\, x_i' x_i \sum_{j=1}^{K} c_j - 2 \sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, x_i' x_j\, c_j + \sum_{j=1}^{K} c_j\, x_j' x_j \sum_{i=1}^{K} c_i$$
$$= -2 \sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, x_i' x_j\, c_j \leq 0,$$
since the zero-sum condition kills the first and last terms and $x' y$ is a positive definite kernel.
37 Negative Definite Kernel Properties
Proposition. Let $K_m : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$, $m = 1, 2, \ldots$, be symmetric and negative definite. Then $J_1, J_2, J_3, J_4$ and $J_5$ defined by
$$J_1(x, y) = \sum_{m} a_m K_m(x, y), \quad a_m \geq 0$$
$$J_2(x, y) = \lim_{m \to \infty} K_m(x, y)$$
$$J_3(x, y) = \log(1 + K(x, y))$$
$$J_4(x, y) = K(x, y)^{\alpha}, \quad 0 < \alpha < 1$$
$$J_5(x, y) = f(x) + f(y), \quad \text{for any } f : \mathbb{R}^N \to \mathbb{R}$$
are negative definite kernels (in $J_3$ and $J_4$, $K$ denotes a symmetric negative definite kernel with $K \geq 0$).
38 Negative Definite Kernel
Proposition. Let $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a symmetric negative definite kernel. Then
$$\psi(x, x) + \psi(y, y) \leq 2\, \psi(x, y).$$
Proof. Let $x, y \in \mathbb{R}^N$ and $c_1, c_2 \in \mathbb{R}$ satisfying $c_1 + c_2 = 0$. Since $\psi$ is negative definite,
$$0 \geq c_1 \psi(x, x) c_1 + c_1 \psi(x, y) c_2 + c_2 \psi(y, x) c_1 + c_2 \psi(y, y) c_2$$
$$= c_1 \psi(x, x) c_1 + c_1 \psi(x, y)(-c_1) + (-c_1) \psi(y, x) c_1 + (-c_1) \psi(y, y)(-c_1) = c_1^2 \left( \psi(x, x) - 2\, \psi(x, y) + \psi(y, y) \right).$$
Hence $2\, \psi(x, y) \geq \psi(x, x) + \psi(y, y)$.
39 Positive and Negative Definite Symmetric Functions
Theorem (Schoenberg, 1938). Let $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be symmetric. Then $\psi(x, y)$ is negative definite if and only if $\exp(-t\, \psi(x, y))$ is positive definite for all $t > 0$.
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a positive definite kernel satisfying $K(x, y) > 0$ for all $x, y \in \mathbb{R}^N$. Then $-\log K(x, y)$ is negative definite if and only if $K(x, y)^t$ is positive definite for all $t > 0$.
40 Positive and Negative Definite Symmetric Functions
Proposition. Let $\psi : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a symmetric function and let $z \in \mathbb{R}^N$. Define
$$\phi(x, y) = -\psi(x, y) + \psi(x, z) + \psi(z, y) - \psi(z, z).$$
Then $\psi$ is negative definite if and only if $\phi$ is positive definite.
Proof. Suppose $\psi$ is negative definite. Let $x_k \in \mathbb{R}^N$, $k = 0, \ldots, K$, with $x_0 = z$, and $c_k \in \mathbb{R}$, $k = 1, \ldots, K$, with $c_0 = -\sum_{k=1}^{K} c_k$, so that $\sum_{k=0}^{K} c_k = 0$. Then, since $\psi$ is negative definite,
$$0 \geq \sum_{i=0}^{K} \sum_{j=0}^{K} c_i\, \psi(x_i, x_j)\, c_j = \sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \psi(x_i, x_j)\, c_j + c_0 \sum_{j=1}^{K} \psi(z, x_j)\, c_j + c_0 \sum_{i=1}^{K} c_i\, \psi(x_i, z) + c_0^2\, \psi(z, z).$$
41 Positive and Negative Definite Symmetric Functions
Proof (continued). Substituting $c_0 = -\sum_{k=1}^{K} c_k$,
$$\sum_{i=0}^{K} \sum_{j=0}^{K} c_i\, \psi(x_i, x_j)\, c_j = \sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \psi(x_i, x_j)\, c_j - \sum_{j=1}^{K} c_j \sum_{i=1}^{K} c_i\, \psi(x_i, z) - \sum_{i=1}^{K} c_i \sum_{j=1}^{K} \psi(z, x_j)\, c_j + \sum_{i=1}^{K} c_i \sum_{j=1}^{K} c_j\, \psi(z, z)$$
$$= \sum_{i=1}^{K} \sum_{j=1}^{K} c_i \left( \psi(x_i, x_j) - \psi(x_i, z) - \psi(z, x_j) + \psi(z, z) \right) c_j = -\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \phi(x_i, x_j)\, c_j.$$
Hence $0 \geq -\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \phi(x_i, x_j)\, c_j$, i.e. $\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \phi(x_i, x_j)\, c_j \geq 0$. This makes $\phi$ positive definite.
42 Positive and Negative Definite Symmetric Functions
Proof (continued). Suppose $\phi$ is positive definite. Let $c_1, \ldots, c_K \in \mathbb{R}$ satisfy $\sum_{k=1}^{K} c_k = 0$. Then
$$\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \phi(x_i, x_j)\, c_j \geq 0.$$
Since $\phi(x, y) = -\psi(x, y) + \psi(x, z) + \psi(z, y) - \psi(z, z)$,
$$0 \leq \sum_{i=1}^{K} \sum_{j=1}^{K} c_i \left( -\psi(x_i, x_j) + \psi(x_i, z) + \psi(z, x_j) - \psi(z, z) \right) c_j$$
$$= -\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \psi(x_i, x_j)\, c_j + \sum_{i=1}^{K} c_i\, \psi(x_i, z) \sum_{j=1}^{K} c_j + \sum_{j=1}^{K} c_j\, \psi(z, x_j) \sum_{i=1}^{K} c_i - \psi(z, z) \sum_{i=1}^{K} c_i \sum_{j=1}^{K} c_j = -\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \psi(x_i, x_j)\, c_j.$$
Hence $\sum_{i=1}^{K} \sum_{j=1}^{K} c_i\, \psi(x_i, x_j)\, c_j \leq 0$. Therefore $\psi$ is negative definite.
43 Negative Definite From Positive Definite
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a positive definite symmetric function. Define $\phi(x, y) = K(x, x) + K(y, y) - 2 K(x, y)$. Then $\phi$ is negative definite.
Proof. Let $x_1, \ldots, x_M \in \mathbb{R}^N$ and $c_1, \ldots, c_M \in \mathbb{R}$ satisfying $\sum_{m=1}^{M} c_m = 0$. Then
$$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, \phi(x_i, x_j)\, c_j = \sum_{i=1}^{M} \sum_{j=1}^{M} c_i \left( K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j) \right) c_j$$
$$= \sum_{i=1}^{M} c_i\, K(x_i, x_i) \sum_{j=1}^{M} c_j + \sum_{j=1}^{M} c_j\, K(x_j, x_j) \sum_{i=1}^{M} c_i - 2 \sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, K(x_i, x_j)\, c_j = -2 \sum_{i=1}^{M} \sum_{j=1}^{M} c_i\, K(x_i, x_j)\, c_j \leq 0.$$
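An empirical sketch (illustrative kernel, data, and tolerance) of this kernel-induced squared distance: for any zero-sum coefficient vector, the quadratic form should come out non-positive.

```python
import numpy as np

K = lambda x, y: (np.dot(x, y) + 1.0) ** 2        # a pd kernel
rng = np.random.default_rng(3)
pts = rng.normal(size=(10, 2))
Phi = np.array([[K(a, a) + K(b, b) - 2 * K(a, b) for b in pts] for a in pts])
c = rng.normal(size=10)
c -= c.mean()                                     # enforce sum(c) = 0
print(c @ Phi @ c <= 1e-10)                       # True
```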
44 Kernel Properties
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a positive definite kernel satisfying $K(x, y) > 0$ for all $x, y \in \mathbb{R}^N$. Then $-\log K(x, y)$ is negative definite if and only if $K(x, y)^t$ is positive definite for all $t > 0$.
45 Kernel Properties
Proposition. Let $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ be a negative definite kernel that satisfies $K(x, y) > 0$. Then
$$\frac{1}{K(x, y)}$$
is a positive definite kernel.
46 Other Positive Definite Kernels
With $x = (x_1, \ldots, x_N)$ and $y = (y_1, \ldots, y_N)$:
- Laplacian: $K(x, y) = \exp\left( -\frac{\|x - y\|}{\sigma} \right)$
- ANOVA: $K(x, y) = \sum_{n=1}^{N} \left( \exp\left( -\sigma (x_n - y_n)^2 \right) \right)^d$
- Rational Quadratic: $K(x, y) = 1 - \frac{\|x - y\|^2}{\|x - y\|^2 + c}$
- Multiquadric: $K(x, y) = \sqrt{\|x - y\|^2 + c^2}$
- Inverse Multiquadric: $K(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}$
- Circular: $K(x, y) = \frac{2}{\pi} \arccos\left( -\frac{\|x - y\|}{\sigma} \right) - \frac{2}{\pi} \frac{\|x - y\|}{\sigma} \sqrt{1 - \left( \frac{\|x - y\|}{\sigma} \right)^2}$ if $\|x - y\| < \sigma$, zero otherwise
- Spherical: $K(x, y) = 1 - \frac{3}{2} \frac{\|x - y\|}{\sigma} + \frac{1}{2} \left( \frac{\|x - y\|}{\sigma} \right)^3$ if $\|x - y\| < \sigma$, zero otherwise
- Wave: $K(x, y) = \frac{\theta}{\|x - y\|} \sin\frac{\|x - y\|}{\theta}$
- Spline: $K(x, y) = \prod_{n=1}^{N} \left( 1 + x_n y_n + x_n y_n \min(x_n, y_n) - \frac{x_n + y_n}{2} \min(x_n, y_n)^2 + \frac{\min(x_n, y_n)^3}{3} \right)$
- Cauchy: $K(x, y) = \frac{1}{1 + \frac{\|x - y\|^2}{\sigma^2}}$
- Chi-Square: $K(x, y) = 1 - \sum_{n=1}^{N} \frac{(x_n - y_n)^2}{\frac{1}{2}(x_n + y_n)}$
- Histogram Intersection: $K(x, y) = \sum_{n=1}^{N} \min(x_n, y_n)$
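Sketches of two kernels from this list (parameter defaults are arbitrary assumptions):

```python
import numpy as np

def histogram_intersection(x, y):
    # sum_n min(x_n, y_n); commonly used on histogram features
    return np.minimum(x, y).sum()

def cauchy(x, y, sigma=1.0):
    d = x - y
    return 1.0 / (1.0 + np.dot(d, d) / sigma**2)
```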