Kernel methods
CSE 250B
Deviations from linear separability

Noise: find a separator that minimizes a convex loss function related to the number of mistakes, e.g. SVM or logistic regression.

Systematic deviation: [figure: scatter plot of labeled + and - points] What to do with this?
Adding new features

[figure: scatter plot of labeled points with a parabolic boundary]

The actual boundary is something like x_1 = x_2^2 + 5. This is quadratic in x = (x_1, x_2), but it is linear in Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).

Basis expansion: embed the data in a higher-dimensional feature space. Then we can use a linear classifier!
Basis expansion for quadratic boundaries

How to deal with a quadratic boundary? [figure: scatter plot of labeled points]

Idea: augment the regular features x = (x_1, x_2, ..., x_d) with
  x_1^2, x_2^2, ..., x_d^2
  x_1 x_2, x_1 x_3, ..., x_{d-1} x_d

Enhanced data vectors have the form
  Φ(x) = (x_1, ..., x_d, x_1^2, ..., x_d^2, x_1 x_2, ..., x_{d-1} x_d)
Quick question

Suppose x = (x_1, x_2, x_3). What is the dimension of Φ(x)?

More generally, suppose x = (x_1, ..., x_d). What is the dimension of Φ(x)?
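As a quick sanity check on this dimension count (not part of the original slides), here is a minimal Python sketch of the quadratic expansion; the helper name quadratic_features is mine:

```python
import numpy as np
from itertools import combinations

def quadratic_features(x):
    """Quadratic basis expansion Phi(x): original coordinates, squares, and all cross terms."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, x ** 2, cross])

print(len(quadratic_features([1.0, 2.0, 3.0])))   # 9, i.e. d(d + 3)/2 for d = 3
print(784 * (784 + 3) // 2)                       # 308504: the MNIST figure quoted later
```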
Perceptron revisited

Learning in the higher-dimensional feature space:
  w = 0 and b = 0
  while some (x, y) has y(w·Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y
Perceptron with basis expansion: examples

[figures: example decision boundaries learned by the Perceptron after basis expansion]
Perceptron with basis expansion

Learning in the higher-dimensional feature space:
  w = 0 and b = 0
  while some (x, y) has y(w·Φ(x) + b) ≤ 0:
    w = w + y Φ(x)
    b = b + y

Problem: the number of features has increased dramatically. For MNIST, with a quadratic boundary: from 784 to 308,504.

The kernel trick: implement this without ever writing down a vector in the higher-dimensional space!
The kernel trick

  w = 0 and b = 0
  while some i has y^(i)(w·Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

1. Represent w in dual form α = (α_1, ..., α_n):
     w = Σ_{j=1}^n α_j y^(j) Φ(x^(j))

2. Compute w·Φ(x) using the dual representation:
     w·Φ(x) = Σ_{j=1}^n α_j y^(j) (Φ(x^(j))·Φ(x))

3. Compute Φ(x)·Φ(z) without ever writing out Φ(x) or Φ(z).
Computing dot products

First, in two dimensions. Suppose x = (x_1, x_2) and Φ(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2).

Actually, tweak it a little:
  Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2)

What is Φ(x)·Φ(z)?
Computing dot products

Suppose x = (x_1, x_2, ..., x_d) and
  Φ(x) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d-1} x_d).

Then
  Φ(x)·Φ(z) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_{d-1} x_d)
              · (1, √2 z_1, ..., √2 z_d, z_1^2, ..., z_d^2, √2 z_1 z_2, ..., √2 z_{d-1} z_d)
            = 1 + 2 Σ_i x_i z_i + Σ_i x_i^2 z_i^2 + 2 Σ_{i<j} x_i x_j z_i z_j
            = (1 + x_1 z_1 + ··· + x_d z_d)^2
            = (1 + x·z)^2

For MNIST: we are computing dot products in 308,504-dimensional space, but it takes time proportional to 784, the original dimension!
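This identity is easy to verify numerically. A small sketch (the embedding phi below is the tweaked Φ from the previous slide; the function name is mine):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """The tweaked embedding: (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(np.allclose(phi(x) @ phi(z), (1 + x @ z) ** 2))   # True
```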
Kernel Perceptron

Learning from data (x^(1), y^(1)), ..., (x^(n), y^(n)) ∈ X × {-1, +1}.

Primal form:
  w = 0 and b = 0
  while there is some i with y^(i)(w·Φ(x^(i)) + b) ≤ 0:
    w = w + y^(i) Φ(x^(i))
    b = b + y^(i)

Dual form: w = Σ_j α_j y^(j) Φ(x^(j)), where α ∈ R^n.
  α = 0 and b = 0
  while some i has y^(i)(Σ_j α_j y^(j) Φ(x^(j))·Φ(x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) Φ(x^(j))·Φ(x) + b).
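A minimal sketch of the dual-form update, assuming the quadratic kernel (1 + x·z)^2 from the previous slides; the epoch cap and the function names are my own additions, not from the slides:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Dual-form Perceptron: alpha[i] counts how many times example i was misclassified."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += 1
                b += y[i]
                mistakes += 1
        if mistakes == 0:      # every training point is now classified correctly
            break
    return alpha, b

def predict(x, X, y, alpha, b, kernel):
    s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X))) + b
    return np.sign(s)

quad_kernel = lambda u, v: (1 + np.dot(u, v)) ** 2   # the quadratic kernel from above
```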
Does this work with SVMs?

Primal:
  min_{w ∈ R^d, b ∈ R, ξ ∈ R^n}  ‖w‖^2 + C Σ_{i=1}^n ξ_i
  s.t.  y^(i)(w·x^(i) + b) ≥ 1 − ξ_i   for all i = 1, 2, ..., n
        ξ ≥ 0

Dual:
  max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (x^(i)·x^(j))
  s.t.  Σ_{i=1}^n α_i y^(i) = 0
        0 ≤ α_i ≤ C

Solution: w = Σ_i α_i y^(i) x^(i).
Kernel SVM

1. Basis expansion. Map x ↦ Φ(x).

2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) (Φ(x^(i))·Φ(x^(j)))
     s.t.  Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields w = Σ_i α_i y^(i) Φ(x^(i)). The offset b also follows.

3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) (Φ(x^(i))·Φ(x)) + b).
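For illustration only (the slides do not prescribe a solver): scikit-learn's SVC solves this dual, and its polynomial kernel (gamma·x·z + coef0)^degree reduces to (1 + x·z)^2 with gamma = 1, coef0 = 1, degree = 2. The toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: the label depends quadratically on the inputs,
# so no linear separator exists in the original space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > X[:, 1] ** 2 - 0.5, 1, -1)

# sklearn's polynomial kernel (gamma * x.z + coef0)^degree equals (1 + x.z)^2 here.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))          # training accuracy with the quadratic kernel
print(clf.dual_coef_.shape)     # alpha_i * y^(i), stored only for the support vectors
```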
Kernel Perceptron vs. kernel SVM: examples

[figures: decision boundaries found by the kernel Perceptron and by the kernel SVM on the same data]
Polynomial decision boundaries

When the decision surface is a polynomial of order p: [figure: scatter plot of labeled points]

Let Φ(x) consist of all terms of order ≤ p, such as x_1 x_2^2 x_3^{p−3}. (How many such terms are there, roughly?)

A degree-p polynomial in x is linear in Φ(x).

The same trick works: Φ(x)·Φ(z) = (1 + x·z)^p.

Kernel function: k(x, z) = (1 + x·z)^p.
String kernels

Sequence data:
  text documents
  speech signals
  protein sequences

Each data point is a sequence of arbitrary length. This yields input spaces like
  X = {A, C, G, T}*
  X = {English words}*

What kind of embedding Φ(x) is suitable for variable-length sequences x? We will use an infinite-dimensional embedding!
String kernels, cont'd

For each substring s, define the feature
  Φ_s(x) = number of times substring s appears in x
and let Φ(x) be a vector with one coordinate for each string:
  Φ(x) = (Φ_s(x) : all strings s).

Example: the embedding of "aardvark" includes the features Φ_ar(aardvark) = 2, Φ_th(aardvark) = 0, ...

A linear classifier based on such features is very expressive.

To compute k(x, z) = Φ(x)·Φ(z):
  for each substring s of x:
    count how often s appears in z

Using dynamic programming, this takes time O(|x| |z|).
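A direct, unoptimized sketch of this computation (it enumerates substrings explicitly rather than using the dynamic program mentioned above, so it does not achieve O(|x| |z|) time; the function names are mine):

```python
from collections import Counter

def substring_counts(x):
    """Counter of every contiguous, non-empty substring of x."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))

def string_kernel(x, z):
    """k(x, z) = sum over substrings s of Phi_s(x) * Phi_s(z)."""
    cx, cz = substring_counts(x), substring_counts(z)
    return sum(cx[s] * cz[s] for s in cx if s in cz)

print(substring_counts("aardvark")["ar"])   # 2, matching the example above
print(string_kernel("aardvark", "ark"))
```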
The kernel function

We never explicitly construct the embedding Φ(x). What we actually use is the kernel function k(x, z) = Φ(x)·Φ(z). Think of k(x, z) as a measure of similarity between x and z.

Rewrite the learning algorithm and the final classifier in terms of k.

Kernel Perceptron:
  α = 0 and b = 0
  while some i has y^(i)(Σ_j α_j y^(j) k(x^(j), x^(i)) + b) ≤ 0:
    α_i = α_i + 1
    b = b + y^(i)

To classify a new point x: sign(Σ_j α_j y^(j) k(x^(j), x) + b).
Kernel SVM, revisited

1. Kernel function. Define a similarity function k(x, z).

2. Learning. Solve the dual problem:
     max_{α ∈ R^n}  Σ_{i=1}^n α_i − Σ_{i,j=1}^n α_i α_j y^(i) y^(j) k(x^(i), x^(j))
     s.t.  Σ_{i=1}^n α_i y^(i) = 0
           0 ≤ α_i ≤ C
   This yields α. The offset b also follows.

3. Classification. Given a new point x, classify as
     sign(Σ_i α_i y^(i) k(x^(i), x) + b).
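One way to plug an arbitrary similarity function into this recipe, shown only as an illustration: scikit-learn's SVC accepts a precomputed kernel (Gram) matrix. The particular kernel and the data below are made up:

```python
import numpy as np
from sklearn.svm import SVC

def gram(A, B, k):
    """Kernel matrix with entries K[i, j] = k(A[i], B[j])."""
    return np.array([[k(a, b) for b in B] for a in A])

k = lambda u, v: (1 + np.dot(u, v)) ** 3       # any Mercer kernel could go here

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(100, 2)), rng.normal(size=(20, 2))
y_train = np.where(X_train[:, 0] ** 3 > X_train[:, 1], 1, -1)

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X_train, X_train, k), y_train)        # n x n Gram matrix of the training set
y_pred = clf.predict(gram(X_test, X_train, k))     # rows: test points, columns: training points
```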
Choosing the kernel function

The final classifier is a similarity-weighted vote,
  F(x) = α_1 y^(1) k(x^(1), x) + ··· + α_n y^(n) k(x^(n), x)
(plus an offset term b).

Can we choose k to be any similarity function? Not quite: we need k(x, z) = Φ(x)·Φ(z) for some embedding Φ.

Mercer's condition: this is the same as requiring that, for any finite set of points x^(1), ..., x^(m), the m × m similarity matrix K given by K_ij = k(x^(i), x^(j)) is positive semidefinite.
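Mercer's condition can be checked empirically on a sample of points by looking at the eigenvalues of the kernel matrix. A small sketch (the tolerance and the "bad" similarity function are my own choices):

```python
import numpy as np

def satisfies_mercer(K, tol=1e-8):
    """Empirical check of Mercer's condition: the kernel matrix on a finite sample
    must be (numerically) positive semidefinite."""
    K = (K + K.T) / 2                          # symmetrize against round-off
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

K_poly = (1 + X @ X.T) ** 3                                    # polynomial kernel
K_negdist = -np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # negative distance: not a kernel
print(satisfies_mercer(K_poly), satisfies_mercer(K_negdist))   # True False
```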
The RBF kernel

A popular similarity function: the Gaussian kernel, or RBF kernel,
  k(x, z) = exp(−‖x − z‖^2 / s^2),
where s is an adjustable scale parameter.
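A one-line implementation of this kernel in the slides' parameterization, with s^2 in the denominator (note that other libraries often parameterize it as gamma = 1/(2σ^2) instead):

```python
import numpy as np

def rbf_kernel(x, z, s=1.0):
    """RBF / Gaussian kernel: exp(-||x - z||^2 / s^2), with scale parameter s."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-np.sum((x - z) ** 2) / s ** 2)

print(rbf_kernel([0, 0], [0, 0]))    # 1.0: a point is maximally similar to itself
print(rbf_kernel([0, 0], [3, 4]))    # ~1.4e-11: far-apart points look dissimilar
```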
RBF kernel: examples

[figures: decision boundaries obtained with the RBF kernel]
The scale parameter

Recall the prediction function:
  F(x) = α_1 y^(1) k(x^(1), x) + ··· + α_n y^(n) k(x^(n), x).

For the RBF kernel, k(x, z) = exp(−‖x − z‖^2 / s^2):
1. How does this function behave as s → ∞?
2. How does this function behave as s → 0?
3. As we get more data, should we increase or decrease s?
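A quick numerical look at the two limiting cases in questions 1 and 2 (the particular points and scales are arbitrary choices of mine):

```python
import numpy as np

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for s in [0.01, 1.0, 100.0]:
    print(s, np.exp(-np.sum((x - z) ** 2) / s ** 2))
# small s: k(x, z) -> 0 for any pair of distinct points (each point is similar only to itself)
# large s: k(x, z) -> 1 for every pair (all points look equally similar)
```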
Kernels: postscript

1. Customized kernels
   For different domains (NLP, biology, speech, ...)
   Over different structures (sequences, sets, graphs, ...)

2. Learning the kernel function
   Given a set of plausible kernels, find a linear combination of them that works well.

3. Speeding up learning and prediction
   The n × n kernel matrix k(x^(i), x^(j)) is a bottleneck for large n. One idea: go back to the primal space! Replace the embedding Φ by a low-dimensional mapping Φ̃ such that Φ̃(x)·Φ̃(z) ≈ Φ(x)·Φ(z). This can be done, for instance, by writing Φ in the Fourier basis and then randomly sampling features, as in the sketch below.
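A minimal sketch of this random-features idea for the RBF kernel k(x, z) = exp(−‖x − z‖^2 / s^2) (random Fourier features; the dimension D, scale s, and function name are illustrative choices, not from the slides):

```python
import numpy as np

def random_fourier_features(X, D=2000, s=1.0, seed=0):
    """Low-dimensional map whose dot products approximate exp(-||x - z||^2 / s^2):
    sample random frequencies from the kernel's Fourier transform,
    then take random-phase cosines."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0) / s, size=(d, D))   # frequency scale matches bandwidth s
    b = rng.uniform(0.0, 2 * np.pi, size=D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=20000, s=1.0)
K_exact = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
print(np.max(np.abs(Z @ Z.T - K_exact)))   # small, and it shrinks as D grows
```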