CS281B/Stat241B (Spring 2008)  Statistical Learning Theory  Lecture: 8
Representer theorem and kernel examples
Lecturer: Peter Bartlett    Scribe: Howard Lei

1 Representer Theorem

Recall that the SVM optimization problem can be expressed as follows:

$$J(f^*) = \min_{f \in H} J(f), \qquad \text{where } J(f) = C \sum_{i=1}^n \mathrm{hingeloss}(f(x_i), y_i) + \|f\|_H^2$$

and $H$ is a Reproducing Kernel Hilbert Space (RKHS).

Theorem 1.1. Fix a kernel $k$, and let $H$ be the corresponding RKHS. Then, for a function $L \colon \mathbb{R}^n \to \mathbb{R}$ and non-decreasing $\Omega \colon \mathbb{R} \to \mathbb{R}$, if the optimization problem can be expressed as

$$J(f^*) = \min_{f \in H} J(f) = \min_{f \in H} \bigl( L(f(x_1), \ldots, f(x_n)) + \Omega(\|f\|_H) \bigr),$$

then a solution can be expressed as

$$f^* = \sum_{i=1}^n \alpha_i\, k(x_i, \cdot).$$

Furthermore, if $\Omega$ is strictly increasing, then all solutions have this form.

This shows that to solve the SVM optimization problem, we only need to solve for the $\alpha_i$, which agrees with the solution obtained via the Lagrangian formulation of the problem. Furthermore, the solution lies in the span of the kernel functions centered at the data points.

Proof. Suppose we project $f$ onto the subspace

$$\mathrm{span}\{k(x_i, \cdot) \colon 1 \le i \le n\},$$

obtaining $f_s$ (the component along the subspace) and $f^\perp$ (the component perpendicular to the subspace). We have

$$f = f_s + f^\perp, \qquad \|f\|_H^2 = \|f_s\|_H^2 + \|f^\perp\|_H^2 \ge \|f_s\|_H^2.$$

Since $\Omega$ is non-decreasing, $\Omega(\|f\|_H) \ge \Omega(\|f_s\|_H)$,
implying that $\Omega(\cdot)$ is minimized if $f$ lies in the subspace. Furthermore, since the kernel $k$ has the reproducing property, we have

$$f(x_i) = \langle f, k(x_i, \cdot) \rangle = \langle f_s, k(x_i, \cdot) \rangle + \langle f^\perp, k(x_i, \cdot) \rangle = \langle f_s, k(x_i, \cdot) \rangle = f_s(x_i),$$

implying that

$$L(f(x_1), \ldots, f(x_n)) = L(f_s(x_1), \ldots, f_s(x_n)).$$

Hence, $L(\cdot)$ depends only on the component of $f$ lying in the subspace $\mathrm{span}\{k(x_i, \cdot) \colon 1 \le i \le n\}$, and $\Omega(\cdot)$ is minimized if $f$ lies in that subspace. Hence, $J(f)$ is minimized if $f$ lies in that subspace, and we can express the minimizer as

$$f^*(\cdot) = \sum_{i=1}^n \alpha_i\, k(x_i, \cdot).$$

Note that if $\Omega(\cdot)$ is strictly increasing, then $f^\perp$ must necessarily be zero for $f$ to be a minimizer of $J(f)$, implying that every minimizer must lie in the subspace $\mathrm{span}\{k(x_i, \cdot) \colon 1 \le i \le n\}$.

2 Constructing Kernels

In this section, we discuss ways to construct new kernels from previously defined kernels. Suppose $k_1$ and $k_2$ are valid (symmetric, positive definite) kernels on $X$. Then the following are valid kernels:

1. $k(u,v) = \alpha k_1(u,v) + \beta k_2(u,v)$, for $\alpha, \beta \ge 0$.

Since $\alpha k_1(u,v) = \langle \sqrt{\alpha}\,\Phi_1(u), \sqrt{\alpha}\,\Phi_1(v) \rangle$ and $\beta k_2(u,v) = \langle \sqrt{\beta}\,\Phi_2(u), \sqrt{\beta}\,\Phi_2(v) \rangle$, we have

$$k(u,v) = \alpha k_1(u,v) + \beta k_2(u,v) \tag{1}$$
$$= \langle \sqrt{\alpha}\,\Phi_1(u), \sqrt{\alpha}\,\Phi_1(v) \rangle + \langle \sqrt{\beta}\,\Phi_2(u), \sqrt{\beta}\,\Phi_2(v) \rangle \tag{2}$$
$$= \bigl\langle [\sqrt{\alpha}\,\Phi_1(u);\ \sqrt{\beta}\,\Phi_2(u)],\ [\sqrt{\alpha}\,\Phi_1(v);\ \sqrt{\beta}\,\Phi_2(v)] \bigr\rangle \tag{3}$$

and we see that $k(u,v)$ can be expressed as an inner product.

2. $k(u,v) = k_1(u,v)\, k_2(u,v)$.

Note that the Gram matrix $K$ for $k$ is the Hadamard product (or element-by-element product) of $K_1$ and $K_2$ ($K = K_1 \circ K_2$). Suppose that $K_1$ and $K_2$ are the covariance matrices of independent zero-mean random vectors $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$ respectively. Then $K$ is simply the covariance matrix of $(X_1 Y_1, \ldots, X_n Y_n)$, implying that it is symmetric and positive semi-definite.

3. $k(u,v) = k_1(f(u), f(v))$, where $f \colon X \to X$.

Since $f$ is a transformation within the same domain, $k$ is simply a different kernel in that domain:

$$k(u,v) = k_1(f(u), f(v)) = \langle \Phi(f(u)), \Phi(f(v)) \rangle = \langle \Phi_f(u), \Phi_f(v) \rangle.$$
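Closure rules 1 and 2 are easy to check numerically: the Gram matrix of a non-negative combination, or of a Hadamard product, of two valid Gram matrices should again be positive semi-definite. Below is a minimal sketch in Python with NumPy; the data, kernel choices, and helper name are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 sample points in R^3

# Two valid base kernels: linear and Gaussian (RBF) Gram matrices.
K1 = X @ X.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K2 = np.exp(-sq_dists / 2.0)

def min_eig(K):
    # Smallest eigenvalue of a symmetric matrix; non-negative
    # (up to round-off) means the Gram matrix is PSD.
    return np.linalg.eigvalsh(K).min()

K_sum = 0.7 * K1 + 2.0 * K2   # rule 1: alpha*k1 + beta*k2, alpha, beta >= 0
K_prod = K1 * K2              # rule 2: Hadamard (element-by-element) product

print(min_eig(K_sum), min_eig(K_prod))  # both non-negative up to round-off
```

The Hadamard-product check is exactly the Schur product theorem in action: the elementwise product of two PSD Gram matrices stays PSD.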
4. $k(u,v) = g(u)\, g(v)$, for $g \colon X \to \mathbb{R}$.

We can express the Gram matrix $K$ as the outer product of the vector $\gamma = [g(x_1), \ldots, g(x_n)]^\top$, i.e. $K = \gamma\gamma^\top$. Hence, $K$ is symmetric and positive semi-definite with rank 1. (It is positive semi-definite because the only non-zero eigenvalue of $\gamma\gamma^\top$ is the trace of $\gamma\gamma^\top$, which equals the trace of $\gamma^\top\gamma$, which is simply $\gamma^\top\gamma \ge 0$.)

5. $k(u,v) = f(k_1(u,v))$, where $f$ is a polynomial with positive coefficients.

Since each polynomial term is a product of kernels with a positive coefficient, the proof follows by applying 1 and 2.

6. $k(u,v) = \exp(k_1(u,v))$.

The proof follows from 5 and the fact that

$$\exp(x) = \lim_{i \to \infty} \Bigl( 1 + x + \cdots + \frac{x^i}{i!} \Bigr),$$

so that

$$k(u,v) = \lim_{i \to \infty} k^{(i)}(u,v),$$

where each $k^{(i)}(u,v) = 1 + k_1(u,v) + \cdots + k_1(u,v)^i/i!$ is a kernel by 5, and a pointwise limit of kernels is a kernel.

7. $k(u,v) = \exp\bigl( -\|u - v\|^2/\sigma^2 \bigr)$ (the Gaussian kernel).

$$k(u,v) = \exp\Bigl( -\frac{\|u-v\|^2}{\sigma^2} \Bigr) \tag{4}$$
$$= \exp\Bigl( -\frac{\|u\|^2}{\sigma^2} \Bigr) \exp\Bigl( -\frac{\|v\|^2}{\sigma^2} \Bigr) \exp\Bigl( \frac{2\langle u, v\rangle}{\sigma^2} \Bigr) \tag{5}$$
$$= g(u)\, g(v) \exp(k_1(u,v)) \tag{6}$$

with $g(w) = \exp(-\|w\|^2/\sigma^2)$ and $k_1(u,v) = 2\langle u, v\rangle/\sigma^2$. Here $g(u)g(v)$ is a kernel according to 4, and $\exp(k_1(u,v))$ is a kernel according to 6. According to 2, the product of two kernels is a valid kernel.

Note that the Gaussian kernel is translation-invariant: $k(u,v)$ can be expressed as $f(u - v) = f(x)$.

Example: Translation-invariant kernels

Consider a function $f \colon [-\pi, \pi] \to \mathbb{R}$, and suppose that $f$ is continuous and even (i.e. $f(x) = f(-x)$). Then we can express $f$ via its Fourier expansion as

$$f(x) = \sum_{n=0}^{\infty} a_n \cos(nx),$$
where $a_n \ge 0$. If we let $x$ be the difference of $u$ and $v$, then we have

$$f(x) = f(u - v) = a_0 + \sum_{n=1}^{\infty} a_n \bigl( \sin(nu)\sin(nv) + \cos(nu)\cos(nv) \bigr) \tag{7}$$
$$= \sum_{i=0}^{\infty} \lambda_i \Psi_i(u)\, \Psi_i(v), \tag{8}$$

where $\{\Psi_i\} = \{\sin(n\,\cdot) : n \ge 1\} \cup \{\cos(n\,\cdot) : n \ge 0\}$. We see that $f(u - v)$ is a valid kernel that is translation-invariant. This example shows that we can choose the kernel by choosing the $a_n$ coefficients, which is equivalent to choosing a filter.

Example: Bag-of-words kernel

Suppose that $\Phi_w(d)$ is the number of times word $w$ appears in document $d$. If we want to classify documents by their word counts, we can use the kernel $k(d_1, d_2) = \langle \Phi(d_1), \Phi(d_2) \rangle$. (In practice, these counts are weighted to take into account the relative frequency of different words.)

Example: Marginalized kernel

Given a probability distribution $p(x, h)$ (and hence $p(h \mid x)$) and a kernel defined for $(x, h)$ pairs ($k((x,h), (x',h'))$), we can obtain a kernel on the $x$'s alone as follows:

$$k_m(x, x') = \sum_{h, h'} k((x,h), (x',h'))\, p(h \mid x)\, p(h' \mid x').$$

Exercise: Prove that this is a valid kernel!

Example: Convolution kernel (or string kernel)

Define $a_i$ to be a letter of the alphabet, $s = (s_1, \ldots, s_l)$ to be a string of letters, and $\Sigma^*$ to be the space of all possible letter sequences. We say that $s$ has $a = (a_1, \ldots, a_n)$ as a subsequence if there exists a sequence of indices $I = (i_1, \ldots, i_n)$, with $i_1 < i_2 < \cdots < i_n$, such that $s_{i_j} = a_j$ for $j = 1, \ldots, n$. Define the length of the set of indices $I = (i_1, \ldots, i_n)$ forming the subsequence as $l(I) = i_n - i_1 + 1$. For simplicity, we use the notation $s[I] = a$. Define, for fixed $n$, the feature map for a particular sequence $a$ and string $s$:

$$\Phi_a(s) = \sum_{I \colon s[I] = a} \lambda^{l(I)},$$

where $\lambda \in (0, 1)$. To compare two strings $s$ and $s'$, we can use the following kernel:

$$k(s, s') = \sum_{a \in \Sigma^n} \Phi_a(s)\, \Phi_a(s').$$
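For short strings, this kernel can be computed by brute force: summing $\lambda^{l(I)}\lambda^{l(I')}$ over all pairs of index tuples $I$ in $s$ and $I'$ in $s'$ that pick out the same length-$n$ subsequence. This is exponentially slow but useful as a sanity check. A minimal sketch (the function name is mine, not from the notes):

```python
from itertools import combinations

def subseq_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel k(s,t) = sum_a Phi_a(s) Phi_a(t),
    with Phi_a(s) = sum over index tuples I with s[I] = a of lam^(l(I)),
    where l(I) = i_n - i_1 + 1.  Computed by enumerating all pairs of
    matching index tuples (0-based indices)."""
    total = 0.0
    for I in combinations(range(len(s)), n):
        for J in combinations(range(len(t)), n):
            if all(s[i] == t[j] for i, j in zip(I, J)):
                total += lam ** (I[-1] - I[0] + 1) * lam ** (J[-1] - J[0] + 1)
    return total
```

For example, with $\lambda = 0.5$ and $n = 2$, the subsequence "ab" occurs in "abb" at indices (0, 1) (span length 2) and (0, 2) (span length 3), and in "ab" at (0, 1), so `subseq_kernel("abb", "ab", 2, 0.5)` gives $(\lambda^2 + \lambda^3)\lambda^2 = 0.09375$.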
We can also derive the above kernel via convolution. Define the following base kernel on (string, position) pairs:

$$k_0((s, i), (s', i')) = \mathbf{1}[s(i) = s'(i')].$$

Set

$$k_n((s, i), (s', i')) = k_0((s, i), (s', i'))\, (h \star k_{n-1})((s, i), (s', i')),$$

where $h(i - j) = \mathbf{1}[i - j > 0]\, \lambda^{i-j}$ and $\star$ is the convolution operator:

$$(h \star k_{n-1})((s, i), (s', i')) = \sum_{j, j'} h(i - j)\, h(i' - j')\, k_{n-1}((s, j), (s', j')).$$

Then

$$k(s, s') = \sum_{i, i'} k_n((s, i), (s', i')).$$
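The recursion can be implemented directly with memoization; a sketch follows (the function name is mine, and string positions are 0-based). Note that, read literally, $k_0$ itself requires a character match, so $k_n$ chains together $n+1$ matching positions weighted by the gaps between them; this is a valid kernel by the convolution construction, though its normalization differs slightly from the $\Phi_a$ definition above.

```python
from functools import lru_cache

def conv_string_kernel(s, t, n, lam):
    """k(s,t) = sum_{i,i'} k_n((s,i),(t,i')), where
    k_m = k_0 * (h conv k_{m-1}),  h(d) = 1[d > 0] * lam^d,
    k_0((s,i),(t,i')) = 1[s_i = t_{i'}]."""

    @lru_cache(maxsize=None)
    def k_m(m, i, ip):
        if s[i] != t[ip]:        # the k_0 factor: positions must match
            return 0.0
        if m == 0:
            return 1.0
        # (h conv k_{m-1}): sum over earlier positions j < i, j' < i',
        # weighted by lam^(i-j) * lam^(i'-j').
        acc = 0.0
        for j in range(i):
            for jp in range(ip):
                acc += lam ** (i - j) * lam ** (ip - jp) * k_m(m - 1, j, jp)
        return acc

    return sum(k_m(n, i, ip) for i in range(len(s)) for ip in range(len(t)))
```

For example, with $s = s' = {}$"ab", $n = 1$, $\lambda = 0.5$, the only non-zero chain matches position 0 to 0 and 1 to 1, giving $\lambda \cdot \lambda = 0.25$.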