EE613 Machine Learning for Engineers
Kernel Methods and Support Vector Machines
Jean-Marc Odobez, 2015
Overview
- Kernel methods: introduction and main elements; defining kernels; kernelization of k-NN, K-means, PCA
- Support Vector Machines (SVMs): classification; regression
Introduction: high-dimensional spaces
Data points in high-dimensional spaces can be better separated.
Example: a linear classifier (e.g. the perceptron) has a linear decision function, so we map the features into a high-dimensional space. Here, with a polynomial kernel:
$\phi(x) = \phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
Questions:
- how to map data efficiently into high dimension?
- how does such a mapping affect existing methods/classifiers?
Introduction: comparing samples
We often think of distances in (Euclidean) metric spaces; distance <-> scalar product:
$\|x - x'\|^2 = (x - x') \cdot (x - x') = x \cdot x - 2\, x \cdot x' + x' \cdot x'$
This might not always be easy or relevant: how do we compare two strings, two text paragraphs, two sequences, two images?
However, we can often define a similarity measure between elements, e.g. for strings: Sim(s1, s2) = EditDistance(s1, s2) (see the sketch below). Note that the triangle inequality is often not respected.
How can we exploit such measures in classification algorithms? Which properties of these measures are useful?
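As an illustration of such a similarity measure, here is a minimal edit (Levenshtein) distance sketch; the function name and dynamic-programming formulation are standard, not taken from the slides:

```python
def edit_distance(s1: str, s2: str) -> int:
    # Classic dynamic program: d[i][j] = edit distance between s1[:i] and s2[:j].
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("kernel", "kernal"))  # 1
```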
Introduction: classifiers
Two types of classifiers:
- model-based (classification, regression), e.g. the linear classifier $h(x) = +1$ if $w \cdot x + b > 0$, $-1$ otherwise; the data are used to learn the model parameters and then discarded
- non-parametric approaches, where the training data points are kept in the classifier definition: k-NN, Parzen windows
  $P(x) = \frac{1}{n} \sum_i \frac{1}{h_n^d}\, K\!\left(\frac{x - x_i}{h_n}\right)$
  These are memory-based methods (fast at training, slow at testing).
Indeed, in many methods the solution can be written as a linear combination of kernel functions evaluated at the training data points, representing scalar products in high dimension. This linear combination is often referred to as the dual representation.
Illustration: the perceptron
Update rule at iteration $l$:
$w_{l+1} = w_l + y_l x_l$ if $y_l\, w_l \cdot x_l \le 0$, and $w_{l+1} = w_l$ otherwise.
In the (high-dimensional) projection space $x \rightarrow \phi(x)$:
$w_{l+1} = w_l + y_l \phi(x_l)$ if $y_l (w_l \cdot \phi(x_l)) \le 0$, and $w_{l+1} = w_l$ otherwise.
As a result, the weights are a linear combination of the training data:
$w = \sum_l \alpha_l y_l \phi(x_l)$
The decision function can be rewritten as
$w \cdot \phi(x) = \sum_l \alpha_l y_l\, \phi(x_l) \cdot \phi(x) = \sum_l \alpha_l y_l\, k(x_l, x)$
The data are thus used only through dot products in the projected space, i.e. implicitly, through a kernel $k(x, x') = \phi(x) \cdot \phi(x')$ (see the sketch below).
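A minimal kernelized-perceptron sketch along these lines; the counts `alpha` play the role of the coefficients $\alpha_l$ above, and the Gaussian kernel choice is an assumption for illustration:

```python
import numpy as np

def gaussian_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def train_kernel_perceptron(X, y, kernel, n_epochs=10):
    # alpha[l] counts how many times sample l triggered an update,
    # so that implicitly w = sum_l alpha[l] * y[l] * phi(x[l]).
    N = len(y)
    alpha = np.zeros(N)
    for _ in range(n_epochs):
        for i in range(N):
            score = sum(alpha[l] * y[l] * kernel(X[l], X[i]) for l in range(N))
            if y[i] * score <= 0:      # mistake: update the dual coefficient
                alpha[i] += 1
    return alpha

def predict(alpha, X, y, kernel, x_new):
    score = sum(alpha[l] * y[l] * kernel(X[l], x_new) for l in range(len(y)))
    return 1 if score > 0 else -1
```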
Kernels
Valid kernels: Mercer kernels. Consider a smooth symmetric function $k(\cdot,\cdot)$ over a compact set $C$, $k : C \times C \rightarrow \mathbb{R}$.
$k$ is a kernel if and only if it can be decomposed into
$k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x')$
equivalently, if and only if for every finite set $\{x_1, \ldots, x_p\} \subset C$, the Gram matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
Building kernels
Kernels can be constructed by combining kernels, e.g.:
- $k(x, x') = c_1 k_1(x, x') + c_2 k_2(x, x')$
- $k(x, x') = f(x)\, k_1(x, x')\, f(x')$
- $k(x, x') = q(k_1(x, x'))$
- $k(x, x') = \exp(k_1(x, x'))$
- $k(x, x') = k_1(x, x')\, k_2(x, x')$
- $k(x, x') = k_3(\phi(x), \phi(x'))$
- $k(x, x') = x^T A x'$
- $k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)$
- $k(x, x') = k_a(x_a, x'_a)\, k_b(x_b, x'_b)$
where the kernels on the right are valid kernels on their respective domains, $c_1 > 0$ and $c_2 > 0$, $A$ is a symmetric positive semi-definite matrix, $f$ is any function, $q$ is a polynomial with non-negative coefficients, and $x_a$ and $x_b$ are variables (not necessarily disjoint) with $x = (x_a, x_b)$.
These properties can be used to demonstrate that a proposed kernel is a Mercer kernel (see the numerical check sketched below).
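A quick numerical sanity check, a sketch only (it verifies a necessary condition on one sample set, not a proof of Mercer validity): build a combined kernel from the rules above and check that its Gram matrix is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # 20 points in R^3

def k1(x, xp):                          # linear kernel
    return x @ xp

def k2(x, xp, gamma=0.5):               # Gaussian kernel
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def k(x, xp):                           # combination: c1*k1 + c2*k2 + k1*k2
    return 2.0 * k1(x, xp) + 0.5 * k2(x, xp) + k1(x, xp) * k2(x, xp)

K = np.array([[k(a, b) for b in X] for a in X])
eigvals = np.linalg.eigvalsh(K)         # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)          # True: PSD up to numerical tolerance
```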
Notable kernels
Polynomial kernels: $k(x, x') = (u\, x \cdot x' + v)^p$, with $u, v \ge 0$, $p \in \mathbb{N}$.
Gaussian kernels: $k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$, $\sigma > 0$.
Note: it is not considered as a distribution here, so there is no need for a normalization constant. The implicit projection is into an infinite-dimensional space.
String kernel: $k(x, x') = \sum_{s \in A^*} w_s\, \phi_s(x)\, \phi_s(x')$, where $\phi_s(x)$ counts the number of times substring $s$ occurs in $x$.
Fisher kernel.
Kernelizing algorithms
Many algorithms can be "kernelized". It is straightforward for the perceptron; what about k-NN, K-means, PCA? How?
- express the result in the form of dot products
- use the kernel trick
k-NN requires distances between two examples:
$\|\phi(x) - \phi(x')\|^2 = \phi(x) \cdot \phi(x) - 2\, \phi(x) \cdot \phi(x') + \phi(x') \cdot \phi(x')$
i.e. $k(x, x) - 2\, k(x, x') + k(x', x')$: easy to kernelize (see below).
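The corresponding squared distance in feature space, written with kernel evaluations only (a tiny sketch; the function name is ours):

```python
def kernel_dist2(k, x, xp):
    # ||phi(x) - phi(x')||^2 expressed purely through the kernel
    return k(x, x) - 2 * k(x, xp) + k(xp, xp)
```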
Kernel K-means
Apply K-means in the projected space. Assume $\mu_i$ denotes the means/centroids in this space. As the projected space can be infinite-dimensional, we keep the means in their dual form $\{\alpha^i_1, \alpha^i_2, \ldots\}$, i.e. as a weighted sum of the samples:
$\mu_i = \sum_j \alpha^i_j\, \phi(x_j)$
For each data sample, we need to find the closest mean:
$\|\phi(x) - \mu_i\|^2 = \sum_{j,k} \alpha^i_j \alpha^i_k\, k(x_j, x_k) - 2 \sum_j \alpha^i_j\, k(x_j, x) + \text{const}$
(a sketch of the resulting algorithm follows).
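A minimal kernel K-means sketch under the usual choice $\alpha^i_j = 1/n_i$ for the samples currently assigned to cluster $i$ (hard assignments); it works from a precomputed Gram matrix, and the random initialization and empty-cluster handling are simplifying assumptions:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=20, seed=0):
    # K: precomputed N x N Gram matrix, K[j, k] = k(x_j, x_k)
    N = K.shape[0]
    rng = np.random.default_rng(seed)
    z = rng.integers(n_clusters, size=N)        # random initial assignments
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((N, n_clusters), np.inf)
        for c in range(n_clusters):
            members = np.flatnonzero(z == c)
            if members.size == 0:               # skip empty clusters
                continue
            n_c = members.size
            # ||phi(x) - mu_c||^2 = k(x,x) - (2/n_c) sum_j k(x, x_j)
            #                        + (1/n_c^2) sum_{j,k} k(x_j, x_k)
            dist[:, c] = (diag
                          - 2.0 * K[:, members].sum(axis=1) / n_c
                          + K[np.ix_(members, members)].sum() / n_c**2)
        z = dist.argmin(axis=1)                 # reassign to the closest mean
    return z
```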
Kernel PCA: standard PCA
A way to remove correlation between points and reduce dimensions through linear projection. Data-driven: from the training samples,
- compute the mean and covariance
- find the largest eigenvalues of the covariance matrix
- sort the eigenvectors $u_i$ by decreasing order of eigenvalues and form the matrix $U = (u_1, \ldots, u_M)$
The lower-dimensional representation of a data point is given by $y_n = U^T (x_n - \bar{x})$, with approximate reconstruction $x_n \simeq \bar{x} + U y_n$ (see the sketch below).
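A compact standard-PCA sketch via eigendecomposition of the covariance matrix (all names are ours):

```python
import numpy as np

def pca(X, M):
    # X: N x D data matrix (one sample per row); M: target dimension
    x_bar = X.mean(axis=0)
    C = np.cov(X - x_bar, rowvar=False)     # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # ascending eigenvalues
    U = eigvecs[:, ::-1][:, :M]             # top-M eigenvectors
    Y = (X - x_bar) @ U                     # y_n = U^T (x_n - x_bar)
    X_rec = x_bar + Y @ U.T                 # x_n ~= x_bar + U y_n
    return Y, X_rec
```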
Kernel PCA: intuition
Apply normal PCA in the high-dimensional projected space. (Straight) lines of constant projection in the projected space correspond to nonlinear projections in the original space.
Kernel PCA
Assume the projected data are centered (have zero mean): $\sum_i \phi(x_i) = 0$.
Covariance matrix in the projected space:
$C = \frac{1}{N} \sum_i \phi(x_i)\, \phi(x_i)^T = \frac{1}{N} X X^T$
where $X$ is the design matrix, with column $i$ defined by $\phi(x_i)$.
PCA computes the eigenvalues/eigenvectors of $C$. How can we compute them (or the quantities involving them) in terms of $K_{kl} = k(x_k, x_l) = \phi(x_k)^T \phi(x_l)$? Note that $K = X^T X$.
Kernel PCA
By definition, we have $C v_i = \lambda_i v_i$. Substituting the covariance definition leads to
$\frac{1}{N} \sum_{l=1}^N \phi(x_l)\, \phi(x_l)^T v_i = \lambda_i v_i$
Consequence: each eigenvector can be expressed as a linear combination of the projected samples,
$v_i = \sum_{l=1}^N a_{il}\, \phi(x_l)$, with $a_{il} = \frac{1}{\lambda_i N}\, \phi(x_l)^T v_i$.
Then, how can we actually determine the $a$ coefficients, involving only the kernel function $k(\cdot,\cdot)$?
Kernel PCA
In matrix form, the eigenvectors can thus be written as $v_i = X a_i$. Introducing this decomposition into the eigenvalue problem $C v_i = \lambda_i v_i$, with $C = \frac{1}{N} X X^T$, leads to
$\frac{1}{N} X X^T X a_i = \lambda_i X a_i$
$X^T X X^T X a_i = N \lambda_i X^T X a_i$, i.e. $K^2 a_i = \lambda_i N K a_i$
Thus, we can find solutions for $a_i$ by solving the eigenvalue problem $K a_i = \lambda_i N a_i$.
Kernel PCA
We need to normalize the coefficients $a_i$: impose that the eigenvectors in the projected space have norm 1,
$1 = v_i^T v_i = (X a_i)^T (X a_i) = a_i^T X^T X a_i = a_i^T K a_i = \lambda_i N\, a_i^T a_i$
We also need to center the data in the projected space. We cannot compute the mean there directly, since we want to avoid working in the projection space, so the centering must be formulated purely in terms of the kernel function:
$\tilde{\phi}(x_j) = \phi(x_j) - \frac{1}{N} \sum_{l=1}^N \phi(x_l) \;\Rightarrow\; \tilde{X} = X - \frac{1}{N} X \mathbf{1}\mathbf{1}^T$
$\tilde{K} = \tilde{X}^T \tilde{X} = K - \frac{1}{N} K \mathbf{1}\mathbf{1}^T - \frac{1}{N} \mathbf{1}\mathbf{1}^T K + \frac{1}{N^2} \mathbf{1}\mathbf{1}^T K \mathbf{1}\mathbf{1}^T$
Projection (coordinate) of a point on eigenvector $i$:
$y_i(x) = v_i^T \phi(x) = \Big(\sum_{l=1}^N a_{il}\, \phi(x_l)\Big)^T \phi(x) = \sum_{l=1}^N a_{il}\, k(x, x_l)$
Kernel PCA: illustration (Schölkopf et al., 1998)
Kernel PCA with a Gaussian kernel; the first 8 eigenvalues are shown. Contour lines are points with equal projection on the corresponding eigenvector. The first two eigenvectors separate the 3 main clusters; the following eigenvectors split the clusters into halves, and the next 3 as well (along orthogonal directions).
Kernel PCA: summary
Given a set of data points, stacked as $X$:
- compute $K$, and then $\tilde{K} = K - \frac{1}{N} K \mathbf{1}\mathbf{1}^T - \frac{1}{N} \mathbf{1}\mathbf{1}^T K + \frac{1}{N^2} \mathbf{1}\mathbf{1}^T K \mathbf{1}\mathbf{1}^T$
- compute the eigenvectors and eigenvalues: $\tilde{K} a_i = \lambda_i a_i$
- normalize them properly: $\lambda_i\, a_i^T a_i = 1$
- projection of a new data point onto the principal components: $y_i(x) = \sum_{l=1}^N a_{il}\, k(x, x_l)$
(a full sketch follows).
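Putting the summary together, a minimal kernel-PCA sketch in numpy (the centering of test-point kernel vectors is simplified, as noted in the comments):

```python
import numpy as np

def kernel_pca(K, M):
    # K: N x N Gram matrix on the training data; M: number of components
    N = K.shape[0]
    one = np.ones((N, N)) / N
    K_tilde = K - one @ K - K @ one + one @ K @ one   # double centering
    eigvals, eigvecs = np.linalg.eigh(K_tilde)        # ascending order
    eigvals, eigvecs = eigvals[::-1][:M], eigvecs[:, ::-1][:, :M]
    # rescale unit-norm eigenvectors so that lambda_i * a_i^T a_i = 1
    A = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    return A, eigvals

def project(A, k_new):
    # k_new: vector of kernel evaluations k(x, x_l) against the training set.
    # (Strictly, this vector should be centered consistently with K_tilde;
    # omitted here for brevity.)
    return k_new @ A      # y_i(x) = sum_l a_il k(x, x_l)
```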
Overview
- Kernel methods: introduction and main elements; defining kernels; kernelization of k-NN, K-means, PCA
- Support Vector Machines (SVMs): classification; regression
Support Vector Machines (SVM): principle
For separable data, several classifiers are available; which one is the best? (The perceptron's classifier depends on the initialization and on the order in which data points are visited.)
- H1 does not separate the classes
- H2 separates the classes, but with a small margin
- H3 achieves the maximum margin
Margin: distance from the closest data point to the decision boundary. With a large margin, classification is more immune to small perturbations of the data points.
SVM: margin geometry
(figure: linear decision function, showing the signed distance of a point to the decision boundary and the offset $-b$)
SVM: max margin
Dataset $D = \{(x_i, t_i)\}$, $t_i \in \{-1, +1\}$, $i = 1, \ldots, N$; linear classifier $y(x) = w^T \phi(x) + b$: if $y(x) > 0$ then $t = 1$, otherwise $t = -1$.
Distance to the decision surface: $\frac{t_i\, y(x_i)}{\|w\|}$
Max-margin solution (maximum of the minimum distance to the decision surface):
$\arg\max_{w,b} \frac{1}{\|w\|} \min_i \left[ t_i (w^T \phi(x_i) + b) \right]$
SVM: max margin
Max-margin solution: $\arg\max_{w,b} \frac{1}{\|w\|} \min_i \left[ t_i (w^T \phi(x_i) + b) \right]$
Note: rescaling $w$ and $b$ by $s$ does not change the solution; use this to constrain the problem. Set the closest point (which exists) to the decision surface to satisfy $t_i (w^T \phi(x_i) + b) = 1$; all other points are further away.
Max-margin problem:
$\arg\min \frac{1}{2} \|w\|^2$ subject to $t_i (w^T \phi(x_i) + b) \ge 1, \; \forall i = 1, \ldots, N$
This is a quadratic programming problem: minimizing a quadratic function subject to (linear) constraints.
SVM: Lagrangian duality
Primal optimization problem:
$\min_w f(w)$ subject to $g_i(w) \le 0,\; i = 1, \ldots, k$ and $h_i(w) = 0,\; i = 1, \ldots, l$
Introduce the generalized Lagrangian:
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^k \alpha_i g_i(w) + \sum_{i=1}^l \beta_i h_i(w)$
Primal problem: $\theta_P(w) = \max_{\alpha, \beta :\, \alpha \ge 0} L(w, \alpha, \beta)$, and $\min_w \theta_P(w) = \min_w \max_{\alpha, \beta :\, \alpha \ge 0} L(w, \alpha, \beta)$.
Dual optimization problem: $\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)$, and
$\max_{\alpha, \beta :\, \alpha \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta :\, \alpha \ge 0} \min_w L(w, \alpha, \beta)$
Under certain conditions ($f$ and the $g_i$ convex, the $h_i$ affine, constraints feasible), the dual problem leads to the same solution as the primal one, and the solution satisfies the Karush-Kuhn-Tucker (KKT) conditions (necessary and sufficient):
$\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial w_i} = 0, \; i = 1, \ldots, n$
$\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta_i} = 0, \; i = 1, \ldots, l$
$\alpha_i^*\, g_i(w^*) = 0, \; i = 1, \ldots, k$
$g_i(w^*) \le 0, \; i = 1, \ldots, k$
$\alpha_i^* \ge 0, \; i = 1, \ldots, k$
SVM: dual form
Primal problem:
$\arg\min \frac{1}{2} \|w\|^2$ subject to $t_i (w^T \phi(x_i) + b) \ge 1, \; \forall i = 1, \ldots, N$ (note: the constraint is positive)
Lagrangian:
$L(w, b; a) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^N a_i \left[ t_i (w^T \phi(x_i) + b) - 1 \right]$
Dual problem: given $a$, minimize w.r.t. the weights and bias, i.e. set the derivatives to zero:
$\frac{\partial L}{\partial w}(w, b; a) = w - \sum_i a_i t_i \phi(x_i) = 0 \;\Rightarrow\; w = \sum_{i=1}^N a_i t_i \phi(x_i)$
$\frac{\partial L}{\partial b}(w, b; a) = -\sum_{i=1}^N a_i t_i = 0$
By substitution into the Lagrangian, we end up with the following problem:
$\max_a \tilde{L}(a) = \sum_{i=1}^N a_i - \frac{1}{2} \sum_{i=1}^N \sum_{l=1}^N a_i a_l t_i t_l\, k(x_i, x_l)$
subject to $\sum_{i=1}^N a_i t_i = 0$ and $a_i \ge 0, \; i = 1, \ldots, N$
KKT conditions: $a_i \ge 0$; $t_i y(x_i) - 1 \ge 0$; $a_i (t_i y(x_i) - 1) = 0$, for $i = 1, \ldots, N$.
Last equalities: either the point is on the margin (the constraint is active; then $a_i$ can be non-zero and we have a support vector), or the constraint is strictly satisfied (the point is beyond the margin) and then $a_i = 0$.
SVM: illustration
Illustration with a radial basis function kernel. Shown: decision boundary, plus margins. Support vectors (with non-zero weights) are on the margin curves.
Interest of the dual form:
- allows the introduction of the kernel
- unique solution, quadratic optimization
Computation of a new score (and classification): the weights are a linear combination of the projected data points, and the sum needs to run only over the set $S$ of support vectors:
$y(x) = w^T \phi(x) + b = \sum_{i=1}^N a_i t_i\, \phi(x_i)^T \phi(x) + b = \sum_{i \in S} a_i t_i\, k(x_i, x) + b$
Bias computation: the bias can be computed from any active constraint, i.e. on the support vectors; in practice, average over all support vectors:
$b = \frac{1}{N_S} \sum_{i \in S} \Big( t_i - \sum_{l \in S} a_l t_l\, k(x_l, x_i) \Big)$
(a sketch of the dual training follows).
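A minimal sketch of training the dual with a generic solver (scipy's constrained minimizer here, an assumption for illustration; real implementations use dedicated QP/SMO solvers):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(X1, X2, gamma=0.5):
    # Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def train_dual_svm(X, t, gamma=0.5):
    N = len(t)
    K = rbf_gram(X, X, gamma)
    Q = (t[:, None] * t[None, :]) * K
    # maximize sum(a) - 0.5 a^T Q a  <=>  minimize its negation
    obj = lambda a: -a.sum() + 0.5 * a @ Q @ a
    cons = ({'type': 'eq', 'fun': lambda a: a @ t},)   # sum_i a_i t_i = 0
    bounds = [(0.0, None)] * N                         # a_i >= 0 (separable case)
    a = minimize(obj, np.zeros(N), bounds=bounds, constraints=cons).x
    S = a > 1e-6                                       # support vectors
    # bias averaged over support vectors: t_i - sum_{l in S} a_l t_l k(x_l, x_i)
    b = np.mean(t[S] - (a[S] * t[S]) @ rbf_gram(X[S], X[S], gamma))
    return a, b, S

def decision(X_train, t, a, b, S, x_new, gamma=0.5):
    k_vec = rbf_gram(X_train[S], x_new[None, :], gamma).ravel()
    return (a[S] * t[S]) @ k_vec + b                   # y(x)
```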
SVM: the non-separable case
Primal problem:
$\arg\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i$ subject to $t_i y(x_i) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $\forall i = 1, \ldots, N$
equivalently:
$\arg\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \max(0, 1 - t_i y(x_i))$
The introduced variables $\xi_i$ are slack variables; their sum provides an upper bound on the error. The framework is sensitive to outliers: the error grows linearly with distance. $C$ is analogous to (the inverse of) a regularization coefficient; it controls the trade-off between model complexity (the margin) and training errors. When $C \rightarrow \infty$, we recover the separable case.
SVM: non-separable case, dual form
Lagrangian:
$L(w, b, \xi; a, r) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i - \sum_{i=1}^N a_i \left[ t_i (w^T \phi(x_i) + b) - 1 + \xi_i \right] - \sum_{i=1}^N r_i \xi_i$
Differentiating w.r.t. the weights, bias, and slack variables:
$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^N a_i t_i \phi(x_i)$; $\quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^N a_i t_i = 0$; $\quad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow a_i = C - r_i$
We end up with a dual problem very similar to the separable case:
$\max_a \tilde{L}(a) = \sum_{i=1}^N a_i - \frac{1}{2} \sum_{i=1}^N \sum_{l=1}^N a_i a_l t_i t_l\, k(x_i, x_l)$
subject to $\sum_{i=1}^N a_i t_i = 0$ and $0 \le a_i \le C, \; i = 1, \ldots, N$
with the conditions $a_i (t_i y(x_i) - 1 + \xi_i) = 0$ and $r_i \xi_i = 0$, $i = 1, \ldots, N$.
The prediction formula is the same as in the separable case. Some $a_i$ will be 0 and will not contribute to the prediction; the rest are support vectors:
- if $a_i < C$, then $r_i > 0$ and thus the slack variable $\xi_i = 0$: the data point is on the margin
- if $a_i = C$, then $r_i = 0$: the point lies inside the margin (well classified or not) or on the opposite side
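In the dual training sketch above, only the box constraint changes for the non-separable case; a hypothetical one-line modification:

```python
# soft-margin dual: same objective, but the multipliers are boxed by C
C = 1.0
bounds = [(0.0, C)] * N   # 0 <= a_i <= C instead of a_i >= 0
```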
SVM: the regression case
Idea: fit the training data using an ε-insensitive error function $E_\epsilon(z)$:
$\min \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N E_\epsilon(y(x_i) - t_i)$
As before, introduce relaxed constraints, resulting in the primal:
$\arg\min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^N (\xi_i + \hat{\xi}_i)$
subject to
$t_i \le y(x_i) + \epsilon + \xi_i, \quad \xi_i \ge 0, \; \forall i = 1, \ldots, N$
$t_i \ge y(x_i) - \epsilon - \hat{\xi}_i, \quad \hat{\xi}_i \ge 0, \; \forall i = 1, \ldots, N$
SVM: regression case, dual form
Introducing Lagrange variables, we end up maximizing
$\max_{a, \hat{a}} \tilde{L}(a, \hat{a}) = \sum_{i=1}^N (a_i - \hat{a}_i) t_i - \frac{1}{2} \sum_{i=1}^N \sum_{l=1}^N (a_i - \hat{a}_i)(a_l - \hat{a}_l)\, k(x_i, x_l) - \epsilon \sum_{i=1}^N (a_i + \hat{a}_i)$
subject to $\sum_{i=1}^N (a_i - \hat{a}_i) = 0$ and $0 \le a_i, \hat{a}_i \le C$
The weights are still obtained as a linear combination: $w = \sum_{i=1}^N (a_i - \hat{a}_i)\, \phi(x_i)$
Score of a new observation: $y(x) = \sum_{i=1}^N (a_i - \hat{a}_i)\, k(x, x_i) + b$
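This formulation is what off-the-shelf implementations solve; a short usage sketch with scikit-learn (the data and parameter values are arbitrary illustrations):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
t = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

# C: error/complexity trade-off; epsilon: half-width of the insensitive tube
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=0.5).fit(X, t)
print(svr.support_.size, "support vectors out of", len(t))
print(svr.predict([[0.5]]))
```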
SVM: optimization
Both the classification and regression cases can be viewed as a minimization of the form
$J(a) = \frac{1}{2} a^T Q a - \beta^T a$
under the constraints $a^T \gamma = 0$ (with $\gamma_i = t_i$ for classification) and $C_{min} \le a \le C_{max}$.
This problem is quadratic and convex, and solving it directly is in $O(N^3)$.
SVM: optimization, the Sequential Minimal Optimization (SMO) algorithm
Can we do coordinate descent with one variable? No: the first constraint imposes that when $N-1$ parameters are known/fixed, the last one can only be set to a single value to satisfy the constraint.
Idea: optimize with respect to two variables $a_i$ and $a_j$ (the others are fixed). The constraints reduce to
$a_i \gamma_i + a_j \gamma_j = c_{ij}$, with $C_{min} \le a_i, a_j \le C_{max}$
and the optimization problem can be solved analytically.
Choosing the pairs $a_i$ and $a_j$:
- consider the strongest gradients $g_i = [Qa - \beta]_i$
- make sure that going along these gradient directions will not hit the bounds: $g_i \gamma_i$ and $g_j \gamma_j$ must point in opposite directions
Cost: about $O(N^2)$ (see the pair-update sketch below).
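A sketch of the analytic two-variable update in the classification case, in the spirit of Platt's SMO (the variable names and the error cache `y_err` are our assumptions, and the pair-selection heuristics are omitted):

```python
import numpy as np

def smo_pair_update(a, i, j, t, K, C, y_err):
    # Analytic update of the pair (a_i, a_j) for the soft-margin dual;
    # y_err[k] = y(x_k) - t_k is the prediction error on sample k.
    if i == j:
        return a
    # box bounds keeping a_i t_i + a_j t_j constant and 0 <= a <= C
    if t[i] != t[j]:
        L, H = max(0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0, a[i] + a[j] - C), min(C, a[i] + a[j])
    eta = K[i, i] + K[j, j] - 2 * K[i, j]   # curvature along the pair direction
    if eta <= 0 or L >= H:
        return a                            # no admissible progress on this pair
    a_j_new = np.clip(a[j] + t[j] * (y_err[i] - y_err[j]) / eta, L, H)
    a[i] += t[i] * t[j] * (a[j] - a_j_new)  # preserve the equality constraint
    a[j] = a_j_new
    return a
```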
Kernel machines & sparsity
Other kernel machines exist, of the form
$y(x) = \sum_{i=1}^N w_i\, k(x, x_i) + b$, with e.g. $p(t|x, w) = \mathcal{N}(t;\, y(x),\, \sigma^2)$
- directly express the output as a linear combination and estimate the weights
- fit a probabilistic model (e.g. as in logistic regression), using the negative log-likelihood as loss
- optimize a penalized loss enforcing explicit sparsity: $L(w, b) = L_{nll}(D) + \lambda \|w\|_1$
Advantages: no need for a Mercer kernel, explicit sparsity, probabilistic interpretation, better extension to multiple classes.
Relevance Vector Machines (RVMs): another penalization function.
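A minimal sketch of such an L1-penalized kernel machine with a squared loss (Lasso on the Gram-matrix columns; using squared error instead of the negative log-likelihood is our simplification):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
t = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

def rbf_gram(X1, X2, gamma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

K = rbf_gram(X, X)                    # features = kernel evaluations k(x, x_i)
model = Lasso(alpha=0.01).fit(K, t)   # L1 penalty drives many w_i to zero
print("non-zero weights:", np.sum(model.coef_ != 0), "out of", len(t))
y_new = rbf_gram(np.array([[0.5]]), X) @ model.coef_ + model.intercept_
```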
SVM: summary
Classification:
- SVM finds the largest-margin separating hyperplane
- there is a unique solution
- it indirectly induces sparsity through the support vectors
- it leads to a quadratic (convex) minimization problem
The capacity (to fit) can be controlled in several ways:
- $C$: controls the trade-off between classification error and margin
- kernel choice
- kernel parameters, if any
The idea can be generalized to regression. Other, sparser methods: Relevance Vector Machines, L1-regularized kernel machines.