Kernel methods
CSE 250B

Deviations from linear separability

Noise: find a separator that minimizes a convex loss function related to the number of mistakes, e.g. SVM, logistic regression.

Systematic deviation: what to do with this?

Adding new features

The actual boundary is something like x₁ = x₂² − 5. This is quadratic in x = (x₁, x₂), but it is linear in Φ(x) = (x₁, x₂, x₁², x₂², x₁x₂).

Basis expansion: embed the data in a higher-dimensional feature space. Then we can use a linear classifier!

Basis expansion for quadratic boundaries

How to deal with a quadratic boundary? Idea: augment the regular features x = (x₁, x₂, ..., x_d) with
  x₁², x₂², ..., x_d²
  x₁x₂, x₁x₃, ..., x_{d−1}x_d
Enhanced data vectors have the form:
  Φ(x) = (x₁, ..., x_d, x₁², ..., x_d², x₁x₂, ..., x_{d−1}x_d)
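A minimal sketch of the idea (NumPy assumed; the weight vector w and test points are illustrative): the quadratic curve x₁ = x₂² − 5 becomes a hyperplane in the expanded feature space.

```python
import numpy as np

def phi(x):
    # the quadratic feature map from the slide, for d = 2
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# The curve x1 = x2^2 - 5 is the hyperplane w . Phi(x) + b = 0
w = np.array([1.0, 0.0, 0.0, -1.0, 0.0])
b = 5.0

def side(x):
    return w @ phi(x) + b   # equals x1 - x2^2 + 5

# Points on opposite sides of the quadratic curve get opposite signs
# of a *linear* function of Phi(x).
```

So a classifier that is linear in Φ(x) realizes the quadratic boundary in the original coordinates.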
Quick question

Suppose x = (x₁, ..., x_d). What is the dimension of Φ(x)? (It is 2d + d(d−1)/2: the d original features, d squares, and d(d−1)/2 cross terms.)

Perceptron revisited

Learning in the higher-dimensional feature space:
  while some y(w · Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y

Perceptron with basis expansion: examples
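A runnable sketch of the perceptron update above, applied in the expanded feature space (NumPy assumed; the toy points, labeled by the slide's boundary x₁ = x₂² − 5, are illustrative):

```python
import numpy as np

def phi(x):
    # quadratic feature map for d = 2
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

def train_perceptron(X, y, epochs=5000):
    """Perceptron run on Phi(x) instead of x."""
    w = np.zeros(5)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ phi(xi) + b) <= 0:   # mistake: update
                w += yi * phi(xi)
                b += yi
                mistakes += 1
        if mistakes == 0:                     # converged: data separated
            break
    return w, b

# Toy data labeled by the quadratic boundary x1 = x2^2 - 5
X = [(0, 0), (-4, 0), (0, 2), (-2, 1), (-5, 1), (0, 3), (-3, 2), (2, 3)]
y = [1 if x1 - x2**2 + 5 > 0 else -1 for (x1, x2) in X]
w, b = train_perceptron(X, y)
errors = sum(yi * (w @ phi(xi) + b) <= 0 for xi, yi in zip(X, y))
```

Because the data are separable in Φ-space, the perceptron convergence theorem guarantees this loop terminates with zero training errors.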
Perceptron with basis expansion

Learning in the higher-dimensional feature space:
  while some y(w · Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y

The kernel trick

Problem: the number of features has now increased dramatically. For MNIST, with a quadratic boundary: from 784 to 308504.

The kernel trick: implement this without ever writing down a vector in the higher-dimensional space!

  while some y⁽ⁱ⁾(w · Φ(x⁽ⁱ⁾) + b) ≤ 0:
    w = w + y⁽ⁱ⁾ Φ(x⁽ⁱ⁾)
    b = b + y⁽ⁱ⁾

1. Represent w in dual form, α = (α₁, ..., αₙ):
     w = Σ_{j=1}^{n} α_j y⁽ʲ⁾ Φ(x⁽ʲ⁾)
2. Compute w · Φ(x) using the dual representation:
     w · Φ(x) = Σ_{j=1}^{n} α_j y⁽ʲ⁾ (Φ(x⁽ʲ⁾) · Φ(x))
3. Compute Φ(x) · Φ(z) without ever writing out Φ(x) or Φ(z).

Computing dot products

First, in 2-d. Suppose x = (x₁, x₂) and Φ(x) = (x₁, x₂, x₁², x₂², x₁x₂). Actually, tweak a little:
  Φ(x) = (1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂)
What is Φ(x) · Φ(z)?

Computing dot products

Suppose x = (x₁, x₂, ..., x_d) and
  Φ(x) = (1, √2 x₁, ..., √2 x_d, x₁², ..., x_d², √2 x₁x₂, ..., √2 x_{d−1}x_d)
Then
  Φ(x) · Φ(z) = (1, √2 x₁, ..., √2 x_d, x₁², ..., x_d², √2 x₁x₂, ..., √2 x_{d−1}x_d)
                · (1, √2 z₁, ..., √2 z_d, z₁², ..., z_d², √2 z₁z₂, ..., √2 z_{d−1}z_d)
              = 1 + 2 Σᵢ xᵢzᵢ + Σᵢ xᵢ²zᵢ² + 2 Σ_{i<j} xᵢxⱼzᵢzⱼ
              = (1 + x₁z₁ + ... + x_d z_d)²
              = (1 + x · z)²

For MNIST: we are computing dot products in 308504-dimensional space, but it takes time proportional to 784, the original dimension!
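A quick numerical check of the identity Φ(x) · Φ(z) = (1 + x · z)², using the tweaked 2-d map (NumPy assumed; the particular x and z are arbitrary):

```python
import numpy as np

def phi(x):
    # the tweaked quadratic map from the slide, for d = 2
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([1, r2 * x1, r2 * x2, x1**2, x2**2, r2 * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # dot product in 6-d feature space
kernel = (1 + x @ z) ** 2      # same number, computed in 2-d
# here x . z = 3 - 2 = 1, so both sides equal (1 + 1)^2 = 4
```

The √2 factors are exactly what makes the cross terms in the expansion of (1 + x · z)² match up with the feature-space dot product.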
Kernel Perceptron

Learning from data (x⁽¹⁾, y⁽¹⁾), ..., (x⁽ⁿ⁾, y⁽ⁿ⁾) ∈ X × {−1, +1}.

Primal form:
  while there is some i with y⁽ⁱ⁾(w · Φ(x⁽ⁱ⁾) + b) ≤ 0:
    w = w + y⁽ⁱ⁾ Φ(x⁽ⁱ⁾)
    b = b + y⁽ⁱ⁾

Dual form: w = Σⱼ αⱼ y⁽ʲ⁾ Φ(x⁽ʲ⁾), where α ∈ Rⁿ.
  α = 0 and b = 0
  while some i has y⁽ⁱ⁾ (Σⱼ αⱼ y⁽ʲ⁾ Φ(x⁽ʲ⁾) · Φ(x⁽ⁱ⁾) + b) ≤ 0:
    αᵢ = αᵢ + 1
    b = b + y⁽ⁱ⁾

To classify a new point x: sign(Σⱼ αⱼ y⁽ʲ⁾ Φ(x⁽ʲ⁾) · Φ(x) + b).

Does this work with SVMs?

(primal)
  min_{w ∈ R^d, b ∈ R, ξ ∈ Rⁿ}  ½‖w‖² + C Σᵢ ξᵢ
  s.t.: y⁽ⁱ⁾(w · x⁽ⁱ⁾ + b) ≥ 1 − ξᵢ for all i = 1, 2, ..., n
        ξ ≥ 0

(dual)
  max_{α ∈ Rⁿ}  Σᵢ αᵢ − ½ Σ_{i,j} αᵢ αⱼ y⁽ⁱ⁾ y⁽ʲ⁾ (x⁽ⁱ⁾ · x⁽ʲ⁾)
  s.t.: Σᵢ αᵢ y⁽ⁱ⁾ = 0
        0 ≤ αᵢ ≤ C

Solution: w = Σᵢ αᵢ y⁽ⁱ⁾ x⁽ⁱ⁾.

Kernel SVM

1. Basis expansion. Mapping x ↦ Φ(x).
2. Learning. Solve the dual problem:
     max_{α ∈ Rⁿ}  Σᵢ αᵢ − ½ Σ_{i,j} αᵢ αⱼ y⁽ⁱ⁾ y⁽ʲ⁾ (Φ(x⁽ⁱ⁾) · Φ(x⁽ʲ⁾))
     s.t.: Σᵢ αᵢ y⁽ⁱ⁾ = 0
           0 ≤ αᵢ ≤ C
   This yields w = Σᵢ αᵢ y⁽ⁱ⁾ Φ(x⁽ⁱ⁾). The offset b also follows.
3. Classification. Given a new point x, classify as
     sign(Σᵢ αᵢ y⁽ⁱ⁾ (Φ(x⁽ⁱ⁾) · Φ(x)) + b).

Kernel Perceptron vs. Kernel SVM: examples
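The dual-form perceptron above can be sketched directly in code, with the quadratic kernel standing in for the feature-space dot products (NumPy assumed; the toy data are illustrative):

```python
import numpy as np

def k(x, z):
    # quadratic kernel: equals Phi(x) . Phi(z) for the tweaked map
    return (1.0 + np.dot(x, z)) ** 2

def kernel_perceptron(X, y, epochs=5000):
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            # dual-form mistake condition
            if y[i] * ((alpha * y) @ K[:, i] + b) <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# Toy data labeled by the quadratic boundary x1 = x2^2 - 5
X = np.array([(0, 0), (-4, 0), (0, 2), (-2, 1),
              (-5, 1), (0, 3), (-3, 2), (2, 3)], dtype=float)
y = np.array([1 if x1 - x2**2 + 5 > 0 else -1 for (x1, x2) in X])
alpha, b = kernel_perceptron(X, y)

def predict(x):
    return np.sign(sum(alpha[j] * y[j] * k(X[j], x) for j in range(len(X))) + b)
```

Note that the training data never appear except through kernel evaluations, so the 6-dimensional (or, for MNIST, 308504-dimensional) feature vectors are never built.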
Polynomial decision boundaries

When the decision surface is a polynomial of order p:
Let Φ(x) consist of all terms of order p, such as x₁ x₂² x₃^{p−3}. (How many such terms are there, roughly?)
A degree-p polynomial in x is linear in Φ(x). The same trick works:
  Φ(x) · Φ(z) = (1 + x · z)^p
Kernel function: k(x, z) = (1 + x · z)^p.

String kernels

Sequence data:
  text documents
  speech signals
  protein sequences
Each data point is a sequence of arbitrary length. This yields input spaces like:
  X = {A, C, G, T}*
  X = {English words}*
What kind of embedding Φ(x) is suitable for variable-length sequences x? We will use an infinite-dimensional embedding!

String kernels, cont'd

For each substring s, define the feature
  Φ_s(x) = # of times substring s appears in x
and let Φ(x) be a vector with one coordinate for each string:
  Φ(x) = (Φ_s(x) : all strings s).
Example: the embedding of "aardvark" includes the features
  Φ_ar(aardvark) = 2, Φ_th(aardvark) = 0, ...
A linear classifier based on such features is very expressive.
To compute k(x, z) = Φ(x) · Φ(z):
  for each substring s of x: count how often s appears in z.
Using dynamic programming, this takes time O(|x| |z|).
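A naive version of the substring-count kernel can be written in a few lines (pure Python; this direct version is quadratic per string and is for illustration only, not the O(|x||z|) dynamic program the slide mentions):

```python
from collections import Counter

def substring_counts(x):
    """Counter over all contiguous substrings of x: the features Phi_s(x)."""
    c = Counter()
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            c[x[i:j]] += 1
    return c

def string_kernel(x, z):
    # k(x, z) = sum over substrings s of Phi_s(x) * Phi_s(z)
    cx, cz = substring_counts(x), substring_counts(z)
    return sum(cx[s] * cz[s] for s in cx)

counts = substring_counts("aardvark")
# e.g. counts["ar"] == 2 and counts["th"] == 0, as in the slide
```

Only substrings that occur in both x and z contribute, which is why the kernel is computable without ever materializing the infinite-dimensional Φ.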
The kernel function

We never explicitly construct the embedding Φ(x). What we actually use is the kernel function k(x, z) = Φ(x) · Φ(z).
Think of k(x, z) as a measure of similarity between x and z.
Rewrite the learning algorithm and the final classifier in terms of k.

Kernel Perceptron:
  α = 0 and b = 0
  while some i has y⁽ⁱ⁾ (Σⱼ αⱼ y⁽ʲ⁾ k(x⁽ʲ⁾, x⁽ⁱ⁾) + b) ≤ 0:
    αᵢ = αᵢ + 1
    b = b + y⁽ⁱ⁾
To classify a new point x: sign(Σⱼ αⱼ y⁽ʲ⁾ k(x⁽ʲ⁾, x) + b).

Kernel SVM, revisited

1. Kernel function. Define a similarity function k(x, z).
2. Learning. Solve the dual problem:
     max_{α ∈ Rⁿ}  Σᵢ αᵢ − ½ Σ_{i,j} αᵢ αⱼ y⁽ⁱ⁾ y⁽ʲ⁾ k(x⁽ⁱ⁾, x⁽ʲ⁾)
     s.t.: Σᵢ αᵢ y⁽ⁱ⁾ = 0
           0 ≤ αᵢ ≤ C
   This yields α. The offset b also follows.
3. Classification. Given a new point x, classify as
     sign(Σᵢ αᵢ y⁽ⁱ⁾ k(x⁽ⁱ⁾, x) + b).

Choosing the kernel function

The final classifier is a similarity-weighted vote,
  F(x) = α₁ y⁽¹⁾ k(x⁽¹⁾, x) + ... + αₙ y⁽ⁿ⁾ k(x⁽ⁿ⁾, x)
(plus an offset term, b).
Can we choose k to be any similarity function? Not quite: we need k(x, z) = Φ(x) · Φ(z) for some embedding Φ.
Mercer's condition: this is the same as requiring that for any finite set of points x⁽¹⁾, ..., x⁽ᵐ⁾, the m × m similarity matrix K given by K_ij = k(x⁽ⁱ⁾, x⁽ʲ⁾) is positive semidefinite.

The RBF kernel

A popular similarity function: the Gaussian kernel or RBF kernel
  k(x, z) = e^{−‖x − z‖² / s²},
where s is an adjustable scale parameter.
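Mercer's condition can be checked empirically for any candidate kernel: build the similarity matrix on a sample of points and look at its eigenvalues. A sketch for the RBF kernel (NumPy assumed; the sample points and scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 arbitrary points in R^3
s = 1.5

# RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / s^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / s**2)

# Mercer: K must be symmetric positive semidefinite
eigenvalues = np.linalg.eigvalsh(K)
```

All eigenvalues come out nonnegative (up to floating-point error), as Mercer's condition requires; a similarity function failing this check on some point set cannot be a valid kernel.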
RBF kernel: examples

The scale parameter

Recall the prediction function: F(x) = α₁ y⁽¹⁾ k(x⁽¹⁾, x) + ... + αₙ y⁽ⁿ⁾ k(x⁽ⁿ⁾, x).
For the RBF kernel, k(x, z) = e^{−‖x − z‖² / s²}:
1. How does this function behave as s → ∞?
2. How does this function behave as s → 0?
3. As we get more data, should we increase or decrease s?

Kernels: postscript

1. Customized kernels
   For different domains (NLP, biology, speech, ...)
   Over different structures (sequences, sets, graphs, ...)
2. Learning the kernel function
   Given a set of plausible kernels, find a linear combination of them that works well.
3. Speeding up learning and prediction
   The n × n kernel matrix k(x⁽ⁱ⁾, x⁽ʲ⁾) is a bottleneck for large n. One idea: go back to the primal space! Replace the embedding Φ by a low-dimensional mapping Φ̃ such that Φ̃(x) · Φ̃(z) ≈ Φ(x) · Φ(z). This can be done, for instance, by writing Φ in the Fourier basis and then randomly sampling features.
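The Fourier-sampling idea in point 3 can be sketched for the RBF kernel: sample random frequencies from the kernel's Fourier transform and use random cosine features (NumPy assumed; the dimensions, scale, and test points are illustrative, and the frequency scaling √2/s is what matches this slide's e^{−‖x−z‖²/s²} convention):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, s = 3, 5000, 1.0   # input dim, number of random features, RBF scale

# For k(x, z) = exp(-||x - z||^2 / s^2), the Fourier transform is a
# Gaussian, so sample frequencies omega ~ N(0, (2/s^2) I) and random phases.
W = rng.normal(scale=np.sqrt(2) / s, size=(D, d))
u = rng.uniform(0, 2 * np.pi, size=D)

def phi_tilde(x):
    # low-dimensional random feature map with E[phi(x) . phi(z)] = k(x, z)
    return np.sqrt(2.0 / D) * np.cos(W @ x + u)

x = rng.normal(size=d)
z = rng.normal(size=d)
approx = phi_tilde(x) @ phi_tilde(z)
exact = np.exp(-np.sum((x - z) ** 2) / s**2)
```

With D random features the approximation error shrinks like 1/√D, so a fixed-size D-dimensional primal problem replaces the n × n kernel matrix.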