Kernel methods
CSE 250B
Deviations from linear separability

Noise: find a separator that minimizes a convex loss function related to the number of mistakes, e.g. SVM or logistic regression.

Systematic deviation: [figure: scatter plot of labeled + and - points] What to do with this?
Adding new features

[figure: scatter plot of labeled points with a parabolic boundary]

The actual boundary is something like x_1 = x_2^2 + 5. This is quadratic in x = (x_1, x_2), but it is linear in Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).

Basis expansion: embed the data in a higher-dimensional feature space. Then we can use a linear classifier!
Basis expansion for quadratic boundaries

How to deal with a quadratic boundary? [figure: scatter plot of labeled points]

Idea: augment the regular features x = (x_1, x_2, ..., x_d) with
  x_1^2, x_2^2, ..., x_d^2
  x_1 x_2, x_1 x_3, ..., x_{d-1} x_d

Enhanced data vectors have the form
  Φ(x) = (x_1, ..., x_d, x_1^2, ..., x_d^2, x_1 x_2, ..., x_{d-1} x_d)
Quick question

Suppose x = (x_1, x_2, x_3). What is the dimension of Φ(x)?

More generally, suppose x = (x_1, ..., x_d). What is the dimension of Φ(x)?
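As a quick sanity check on this dimension count (not part of the original slides), here is a minimal Python sketch of the quadratic expansion; the helper name quadratic_features is mine:

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """Quadratic basis expansion Phi(x): original coordinates, squares, and all cross terms."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, x ** 2, cross])

print(len(quadratic_features([1.0, 2.0, 3.0])))   # 9, i.e. d(d + 3)/2 for d = 3
print(784 * (784 + 3) // 2)                       # 308504: the MNIST figure quoted later
```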
Perceptron revisited

Learning in the higher-dimensional feature space:
  w = 0 and b = 0
  while some (x, y) has y(w·Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y
Perceptron with basis expansion: examples

[figures: example decision boundaries learned by the Perceptron after basis expansion]
Perceptron with basis expansion

Learning in the higher-dimensional feature space:
  w = 0 and b = 0
  while some (x, y) has y(w·Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y

Problem: the number of features has increased dramatically. For MNIST, with a quadratic boundary: from 784 to 308,504.

The kernel trick: implement this without ever writing down a vector in the higher-dimensional space!
The kernel trick

  w = 0 and b = 0
  while some i has y^(i)(w·Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

1. Represent w in dual form α = (α_1, ..., α_n):
     w = Σ_{j=1}^n α_j y^(j) Φ(x^(j))

2. Compute w·Φ(x) using the dual representation:
     w·Φ(x) = Σ_{j=1}^n α_j y^(j) (Φ(x^(j))·Φ(x))

3. Compute Φ(x)·Φ(z) without ever writing out Φ(x) or Φ(z).
Computing dot products

First, in two dimensions. Suppose x = (x_1, x_2) and Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).

Actually, tweak it a little:
  Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2)

What is Φ(x)·Φ(z)?
Computing dot products

Suppose x = (x_1, x_2, ..., x_d) and
  Φ(x) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d-1} x_d).

Then
  Φ(x)·Φ(z) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d-1} x_d)
              · (1, √2 z_1, ..., √2 z_d, z_1^2, ..., z_d^2, √2 z_1 z_2, ..., √2 z_{d-1} z_d)
            = 1 + 2 Σ_i x_i z_i + Σ_i x_i^2 z_i^2 + 2 Σ_{i<j} x_i x_j z_i z_j
            = (1 + x_1 z_1 + ··· + x_d z_d)^2
            = (1 + x·z)^2

For MNIST: we are computing dot products in 308,504-dimensional space, but it takes time proportional to 784, the original dimension!
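This identity is easy to verify numerically. A small sketch (the embedding phi below is the tweaked Φ from the previous slide; the function name is mine):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """The tweaked embedding: (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.allclose(phi(x) @ phi(z), (1 + x @ z) ** 2))   # True
```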
Kernel Perceptron

Learning from data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ X × {-1, +1}.

Primal form:
  w = 0 and b = 0
  while there is some i with y^(i)(w·Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

Dual form: w = Σ_j α_j y^(j) Φ(x^(j)), where α ∈ R^n.
  α = 0 and b = 0
  while some i has y^(i)(Σ_j α_j y^(j) Φ(x^(j))·Φ(x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) Φ(x^(j))·Φ(x) + b).
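A minimal sketch of the dual-form update, assuming the quadratic kernel (1 + x·z)^2 from the previous slides; the epoch cap and the function names are my own additions, not from the slides:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual-form Perceptron: alpha[i] counts how many times example i was misclassified."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:      # every training point is now classified correctly
            break
    return alpha, b

def predict(x, X, y, alpha, b, kernel):
    s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X))) + b
    return np.sign(s)

quad_kernel = lambda u, v: (1 + np.dot(u, v)) ** 2   # the quadratic kernel from above
```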
Does this work with SVMs?

Primal:
  min_{w ∈ R^d, b ∈ R, ξ ∈ R^n}  ‖w‖^2 + C Σ_{i=1}^n ξ_i
  s.t.  y^(i)(w·x^(i) + b) ≥ 1 − ξ_i   for all i = 1, 2, ..., n
        ξ ≥ 0

Dual:
  max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (x^(i)·x^(j))
  s.t.  Σ_{i=1}^n α_i y^(i) = 0
        0 ≤ α_i ≤ C

Solution: w = Σ_i α_i y^(i) x^(i).
Kernel SVM

1. Basis expansion. Map x ↦ Φ(x).

2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (Φ(x^(i))·Φ(x^(j)))
     s.t.  Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields w = Σ_i α_i y^(i) Φ(x^(i)). The offset b also follows.

3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) (Φ(x^(i))·Φ(x)) + b).
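For illustration only (the slides do not prescribe a solver): scikit-learn's SVC solves this dual, and its polynomial kernel (gamma·x·z + coef0)^degree reduces to (1 + x·z)^2 with gamma = 1, coef0 = 1, degree = 2. The toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: the label depends quadratically on the inputs,
# so no linear separator exists in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > X[:, 1] ** 2 - 0.5, 1, -1)

# sklearn's polynomial kernel (gamma * x.z + coef0)^degree equals (1 + x.z)^2 here.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))          # training accuracy with the quadratic kernel
print(clf.dual_coef_.shape)     # alpha_i * y^(i), stored only for the support vectors
```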
Kernel Perceptron vs. kernel SVM: examples

[figures: decision boundaries found by the kernel Perceptron and by the kernel SVM on the same data]
Polynomial decision boundaries

When the decision surface is a polynomial of order p: [figure: scatter plot of labeled points]

Let Φ(x) consist of all terms of order ≤ p, such as x_1 x_2^2 x_3^{p−3}. (How many such terms are there, roughly?)

A degree-p polynomial in x is linear in Φ(x).

The same trick works: Φ(x)·Φ(z) = (1 + x·z)^p.

Kernel function: k(x, z) = (1 + x·z)^p.
String kernels

Sequence data:
  text documents
  speech signals
  protein sequences

Each data point is a sequence of arbitrary length. This yields input spaces like
  X = {A, C, G, T}*
  X = {English words}*

What kind of embedding Φ(x) is suitable for variable-length sequences x? We will use an infinite-dimensional embedding!
String kernels, cont'd

For each substring s, define the feature
  Φ_s(x) = number of times substring s appears in x
and let Φ(x) be a vector with one coordinate for each string:
  Φ(x) = (Φ_s(x) : all strings s).

Example: the embedding of "aardvark" includes the features Φ_ar(aardvark) = 2, Φ_th(aardvark) = 0, ...

A linear classifier based on such features is very expressive.

To compute k(x, z) = Φ(x)·Φ(z):
  for each substring s of x:
    count how often s appears in z

Using dynamic programming, this takes time O(|x| |z|).
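A direct, unoptimized sketch of this computation (it enumerates substrings explicitly rather than using the dynamic program mentioned above, so it does not achieve O(|x| |z|) time; the function names are mine):

```python
from collections import Counter

def substring_counts(x):
    """Counter of every contiguous, non-empty substring of x."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))

def string_kernel(x, z):
    """k(x, z) = sum over substrings s of Phi_s(x) * Phi_s(z)."""
    cx, cz = substring_counts(x), substring_counts(z)
    return sum(cx[s] * cz[s] for s in cx if s in cz)

print(substring_counts("aardvark")["ar"])   # 2, matching the example above
print(string_kernel("aardvark", "ark"))
```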
The kernel function

We never explicitly construct the embedding Φ(x). What we actually use is the kernel function k(x, z) = Φ(x)·Φ(z). Think of k(x, z) as a measure of similarity between x and z.

Rewrite the learning algorithm and the final classifier in terms of k.

Kernel Perceptron:
  α = 0 and b = 0
  while some i has y^(i)(Σ_j α_j y^(j) k(x^(j), x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) k(x^(j), x) + b).
Kernel SVM, revisited

1. Kernel function. Define a similarity function k(x, z).

2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) k(x^(i), x^(j))
     s.t.  Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields α. The offset b also follows.

3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) k(x^(i), x) + b).
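One way to plug an arbitrary similarity function into this recipe, shown only as an illustration: scikit-learn's SVC accepts a precomputed kernel (Gram) matrix. The particular kernel and the data below are made up:

```python
import numpy as np
from sklearn.svm import SVC

def gram(A, B, k):
    """Kernel matrix with entries K[i, j] = k(A[i], B[j])."""
    return np.array([[k(a, b) for b in B] for a in A])

k = lambda u, v: (1 + np.dot(u, v)) ** 3       # any Mercer kernel could go here

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(100, 2)), rng.normal(size=(20, 2))
y_train = np.where(X_train[:, 0] ** 3 > X_train[:, 1], 1, -1)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X_train, X_train, k), y_train)        # n x n Gram matrix of the training set
y_pred = clf.predict(gram(X_test, X_train, k))     # rows: test points, columns: training points
```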
Choosing the kernel function

The final classifier is a similarity-weighted vote,
  F(x) = α_1 y^(1) k(x^(1), x) + ··· + α_n y^(n) k(x^(n), x)
(plus an offset term b).

Can we choose k to be any similarity function? Not quite: we need k(x, z) = Φ(x)·Φ(z) for some embedding Φ.

Mercer's condition: this is the same as requiring that, for any finite set of points x^(1), ..., x^(m), the m × m similarity matrix K given by K_ij = k(x^(i), x^(j)) is positive semidefinite.
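Mercer's condition can be checked empirically on a sample of points by looking at the eigenvalues of the kernel matrix. A small sketch (the tolerance and the "bad" similarity function are my own choices):

```python
import numpy as np

def satisfies_mercer(K, tol=1e-8):
    """Empirical check of Mercer's condition: the kernel matrix on a finite sample
    must be (numerically) positive semidefinite."""
    K = (K + K.T) / 2                          # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

K_poly = (1 + X @ X.T) ** 3                                    # polynomial kernel
K_negdist = -np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # negative distance: not a kernel
print(satisfies_mercer(K_poly), satisfies_mercer(K_negdist))   # True False
```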
The RBF kernel

A popular similarity function: the Gaussian kernel, or RBF kernel,
  k(x, z) = exp(−‖x − z‖^2 / s^2),
where s is an adjustable scale parameter.
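A one-line implementation of this kernel in the slides' parameterization, with s^2 in the denominator (note that other libraries often parameterize it as gamma = 1/(2σ^2) instead):

```python
import numpy as np

def rbf_kernel(x, z, s=1.0):
    """RBF / Gaussian kernel: exp(-||x - z||^2 / s^2), with scale parameter s."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-np.sum((x - z) ** 2) / s ** 2)

print(rbf_kernel([0, 0], [0, 0]))    # 1.0: a point is maximally similar to itself
print(rbf_kernel([0, 0], [3, 4]))    # ~1.4e-11: far-apart points look dissimilar
```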
RBF kernel: examples

[figures: decision boundaries obtained with the RBF kernel]
The scale parameter

Recall the prediction function:
  F(x) = α_1 y^(1) k(x^(1), x) + ··· + α_n y^(n) k(x^(n), x).

For the RBF kernel, k(x, z) = exp(−‖x − z‖^2 / s^2):
1. How does this function behave as s → ∞?
2. How does this function behave as s → 0?
3. As we get more data, should we increase or decrease s?
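A quick numerical look at the two limiting cases in questions 1 and 2 (the particular points and scales are arbitrary choices of mine):

```python
import numpy as np

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for s in [0.01, 1.0, 100.0]:
    print(s, np.exp(-np.sum((x - z) ** 2) / s ** 2))
# small s: k(x, z) -> 0 for any pair of distinct points (each point is similar only to itself)
# large s: k(x, z) -> 1 for every pair (all points look equally similar)
```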
Kernels: postscript

1. Customized kernels
   For different domains (NLP, biology, speech, ...)
   Over different structures (sequences, sets, graphs, ...)

2. Learning the kernel function
   Given a set of plausible kernels, find a linear combination of them that works well.

3. Speeding up learning and prediction
   The n × n kernel matrix k(x^(i), x^(j)) is a bottleneck for large n. One idea: go back to the primal space! Replace the embedding Φ by a low-dimensional mapping Φ̃ such that Φ̃(x)·Φ̃(z) ≈ Φ(x)·Φ(z). This can be done, for instance, by writing Φ in the Fourier basis and then randomly sampling features, as in the sketch below.
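A minimal sketch of this random-features idea for the RBF kernel k(x, z) = exp(−‖x − z‖^2 / s^2) (random Fourier features; the dimension D, scale s, and function name are illustrative choices, not from the slides):

```python
import numpy as np

def random_fourier_features(X, D=2000, s=1.0, seed=0):
    """Low-dimensional map whose dot products approximate exp(-||x - z||^2 / s^2):
    sample random frequencies from the kernel's Fourier transform,
    then take random-phase cosines."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0) / s, size=(d, D))   # frequency scale matches bandwidth s
    b = rng.uniform(0.0, 2 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=20000, s=1.0)
K_exact = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print(np.max(np.abs(Z @ Z.T - K_exact)))   # small, and it shrinks as D grows
```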