Kernel Machines
Pradeep Ravikumar (co-instructor: Manuela Veloso)
Machine Learning 10-701
SVM: linearly separable case
n training points (x_1, ..., x_n), d features; each x_j is a d-dimensional vector.
Primal problem: min_{w,b} (1/2)||w||^2 s.t. y_j (w·x_j + b) ≥ 1 for all j
w: weights on the features (a d-dimensional problem).
This is a convex quadratic program: quadratic objective, linear constraints.
But it is expensive to solve if d is very large, so it is often solved in its dual form (an n-dimensional problem).
Constrained Optimization
Toy example: min_x x^2 s.t. x ≥ b. Solution: x* = max(b, 0).
If b ≤ 0 the constraint is inactive (x* = 0); if b > 0 the constraint is active and tight (x* = b).
Constrained Optimization: Dual Problem (b positive)
Primal problem: min_x x^2 s.t. x ≥ b.
Moving the constraint into the objective function gives the Lagrangian: L(x, α) = x^2 + α(b - x), with α ≥ 0.
α = 0: constraint is inactive; α > 0: constraint is active and tight.
Dual problem: max_{α ≥ 0} d(α), where d(α) = min_x L(x, α).
Connection between Primal and Dual
Primal problem: p* = min_{x ≥ b} x^2. Dual problem: d* = max_{α ≥ 0} d(α).
Weak duality: the dual solution d* lower bounds the primal solution p*, i.e. d* ≤ p*.
To see this, recall d(α) = min_x L(x, α). For every feasible x (i.e. x ≥ b) and feasible α (i.e. α ≥ 0), notice that d(α) ≤ L(x, α) = x^2 + α(b - x) ≤ x^2, and hence d* ≤ p*.
The dual problem (a maximization) is always concave, even if the primal is not convex.
Connection between Primal and Dual
Primal problem: p* = min_{x ≥ b} x^2. Dual problem: d* = max_{α ≥ 0} d(α).
Weak duality: the dual solution d* lower bounds the primal solution p*, i.e. d* ≤ p*.
Strong duality: d* = p*. This holds for many problems of interest, e.g. if the primal is a feasible convex objective with linear constraints.
Connection between Primal and Dual
What does strong duality say about α* (the α that achieves the optimal value of the dual) and x* (the x that achieves the optimal value of the primal)?
Whenever strong duality holds, the following conditions (known as the KKT conditions) hold for α* and x*:
1. ∇L(x*, α*) = 0, i.e. the gradient of the Lagrangian at x*, α* is zero.
2. x* ≥ b, i.e. x* is primal feasible.
3. α* ≥ 0, i.e. α* is dual feasible.
4. α*(x* - b) = 0 (complementary slackness).
We use the first condition to relate x* and α*. We use the last one (complementary slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the constraint is active and tight.
Solving the dual
Dual: d(α) = min_x L(x, α) = min_x x^2 + α(b - x). The optimization over x is unconstrained: setting 2x - α = 0 gives x* = α/2, so d(α) = αb - α^2/4.
Now we need to maximize d(α) = L(x*, α) over α ≥ 0: solve the unconstrained problem (b - α/2 = 0 gives α = 2b), then take α* = max(2b, 0).
α* = 0: constraint is inactive; α* > 0: constraint is active and tight.
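As a quick check of this example, here is a small numerical sketch (the value b = 1.5 and the use of SciPy are my own illustrative choices, not from the lecture): it verifies that the primal and dual optima coincide for min_x x^2 s.t. x ≥ b.

```python
from scipy.optimize import minimize_scalar

# Toy problem from the slides: min_x x^2  s.t.  x >= b, with b = 1.5 (assumed value).
b = 1.5

# Primal: minimize x^2 over the feasible set x >= b.
primal = minimize_scalar(lambda x: x**2, bounds=(b, b + 10.0), method="bounded")
p_star = primal.fun                          # equals b^2 when b > 0

# Dual: d(alpha) = min_x x^2 + alpha*(b - x) = alpha*b - alpha^2/4, maximized over alpha >= 0.
alpha_star = max(2 * b, 0.0)                 # maximizer from the slide: alpha* = max(2b, 0)
d_star = alpha_star * b - alpha_star**2 / 4

print(p_star, d_star)                        # both ~2.25: strong duality, d* = p*
```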
Dual SVM: linearly separable case
n training points, d features: (x_1, ..., x_n), where each x_i is a d-dimensional vector.
Primal problem: min_{w,b} (1/2)||w||^2 s.t. y_j (w·x_j + b) ≥ 1 for all j.
Dual problem (derivation): form the Lagrangian L(w, b, α) = (1/2)||w||^2 - Σ_j α_j [y_j (w·x_j + b) - 1] with α_j ≥ 0, minimize over w, b, then maximize over α.
w: weights on the features (d-dimensional problem). α: weights on the training points (n-dimensional problem).
Dual SVM: linearly separable case
Dual problem: max_α Σ_j α_j - (1/2) Σ_j Σ_k α_j α_k y_j y_k (x_j·x_k) s.t. α_j ≥ 0 and Σ_j α_j y_j = 0.
If we can solve for the α's (dual problem), then we have a solution for w, b (primal problem): w = Σ_j α_j y_j x_j.
Dual SVM: linearly separable case
The dual problem is also a QP. Its solution gives the α_j's, and hence w = Σ_j α_j y_j x_j.
What about b?
Dual SVM: sparsity of the dual solution
[Figure: separating hyperplane with margin; points on the margin have α_j > 0, all other points have α_j = 0.]
Only a few α_j's can be non-zero: those where the constraint is active and tight, i.e. (w·x_j + b) y_j = 1.
Support vectors: the training points j whose α_j's are non-zero.
Dual SVM: linearly separable case
The dual problem is also a QP; its solution gives the α_j's.
Use any support vector with α_k > 0 to compute b, since its constraint is tight: (w·x_k + b) y_k = 1, so b = y_k - w·x_k.
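A minimal sketch of solving this dual QP numerically; the synthetic two-blob data and the choice of SciPy's SLSQP solver are assumptions for illustration, not part of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic, clearly separable data (assumed): two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
n = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q[j, k] = y_j y_k (x_j . x_k)

def neg_dual(a):                               # minimize the negative of the dual objective
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                              # alpha_j >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_j alpha_j y_j = 0
alpha = res.x

w = (alpha * y) @ X                            # w = sum_j alpha_j y_j x_j
sv = int(np.argmax(alpha))                     # index of a support vector (alpha_k > 0)
b = y[sv] - X[sv] @ w                          # from the tight constraint (w.x_k + b) y_k = 1
print("support vectors:", int(np.sum(alpha > 1e-5)), " b =", b)
```

On this toy data only a handful of the α_j come out non-zero, matching the sparsity picture above.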
Dual SVM: non-separable case
Primal problem: min_{w,b,{ξ_j}} (1/2)||w||^2 + C Σ_j ξ_j s.t. y_j (w·x_j + b) ≥ 1 - ξ_j and ξ_j ≥ 0 for all j.
Dual problem: max_{α,µ ≥ 0} min_{w,b,{ξ_j}} L(w, b, ξ, α, µ), where α_j and µ_j are the Lagrange multipliers for the two sets of constraints.
Dual SVM: non-separable case
Dual problem: max_α Σ_j α_j - (1/2) Σ_j Σ_k α_j α_k y_j y_k (x_j·x_k) s.t. 0 ≤ α_j ≤ C and Σ_j α_j y_j = 0.
The upper bound α_j ≤ C comes from ∂L/∂ξ_j = 0 (which gives C - α_j - µ_j = 0 with µ_j ≥ 0).
Intuition: as C → ∞, we recover the hard-margin SVM.
The dual problem is also a QP; its solution gives the α_j's.
So why solve the dual SVM?
There are quadratic programming algorithms that can solve the dual faster than the primal (especially in high dimensions, d >> n).
But, more importantly, the kernel trick!
Separable using higher-order features
[Figure: data that is not linearly separable in the original coordinates (x_1, x_2) becomes linearly separable after a feature transformation, e.g. in r = sqrt(x_1^2 + x_2^2) and θ, or in (x_1, x_1^2).]
What if the data is not linearly separable?
Use features of features of features of features...
Φ(x) = (x_1^2, x_2^2, x_1 x_2, ..., exp(x_1))
The feature space becomes really large very quickly!
Higher-Order Polynomials
m input features, d = degree of the polynomial: the number of terms grows fast!
For d = 6, m = 100: about 1.6 billion terms.
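A quick back-of-the-envelope check of that count; the closed form C(m + d - 1, d) for the number of degree-d monomials in m variables is a standard combinatorial fact rather than something stated on the slide.

```python
from math import comb

# Number of monomials of degree exactly d in m variables: C(m + d - 1, d).
m, d = 100, 6
print(comb(m + d - 1, d))   # 1,609,344,100 -- about 1.6 billion terms
```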
The dual formulation only depends on dot products, not on w!
Replace each x_j·x_k with Φ(x_j)·Φ(x_k): the feature space Φ(x) is high-dimensional, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K(x_j, x_k) = Φ(x_j)·Φ(x_k).
Dot Product of Polynomials (2-D inputs)
d = 1: Φ(x) = (x_1, x_2), so Φ(x)·Φ(z) = x·z.
d = 2: Φ(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2), so Φ(x)·Φ(z) = (x·z)^2.
General d: Φ(x)·Φ(z) = (x·z)^d.
Finally: The Kernel Trick!
Never represent the features explicitly; compute the dot products in closed form.
Constant-time "high-dimensional" dot products for many classes of features.
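A tiny numerical illustration of the trick (the vectors are made up; this is a sketch, not lecture code): the explicit degree-2 feature map from the previous slide gives exactly the same dot product as the closed-form kernel (x·z)^2.

```python
import numpy as np

# Degree-2 polynomial feature map for 2-D inputs: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = phi(x) @ phi(z)        # dot product in the expanded feature space
kernel   = (x @ z) ** 2           # constant-time kernel evaluation, never forming phi
print(explicit, kernel)           # both equal 1.0
```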
Common Kernels
Polynomials of degree exactly d: K(x, z) = (x·z)^d
Polynomials of degree up to d: K(x, z) = (x·z + 1)^d
Gaussian/Radial kernels: K(x, z) = exp(-||x - z||^2 / (2σ^2)) (polynomials of all orders; recall the series expansion of exp)
Sigmoid: K(x, z) = tanh(η x·z + ν)
Mercer Kernels
What functions are valid kernels, i.e. correspond to feature vectors Φ(x)? Answer: Mercer kernels K:
K is continuous
K is symmetric
K is positive semi-definite: z^T K z ≥ 0 for all z (where K is the Gram matrix on any set of points)
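A small numerical check of these conditions for the Gaussian/RBF kernel; the random points and the helper function below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Gaussian/RBF kernel matrix: K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(50, 3))
K = rbf_kernel(X, X)

print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # positive semi-definite (up to round-off)
```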
Overfitting
Huge feature space with kernels: what about overfitting?
Maximizing the margin leads to a sparse set of support vectors.
Some interesting theory says that SVMs search for a simple hypothesis with a large margin.
Often robust to overfitting.
What about classification time?
For a new input x, if we need to represent Φ(x) explicitly, we are in trouble!
Recall the classifier: sign(w·Φ(x) + b).
Using kernels we are cool: we never need Φ(x) explicitly.
SVMs with Kernels
Choose a set of features and a kernel function.
Solve the dual problem to obtain the support vector weights α_i.
At classification time, compute f(x) = Σ_i α_i y_i K(x_i, x) + b and classify x as sign(f(x)).
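A short sketch of that classification rule in code, assuming alpha, b and the training data come from a dual solver like the one above, and that `kernel` is any Mercer kernel (e.g. the rbf_kernel sketch earlier); these names are mine, not the lecture's.

```python
import numpy as np

def svm_predict(X_test, X_train, y_train, alpha, b, kernel):
    # K[i, j] = K(test point i, training point j)
    K = kernel(X_test, X_train)
    scores = K @ (alpha * y_train) + b      # f(x) = sum_i alpha_i y_i K(x_i, x) + b
    return np.sign(scores)                  # classify as sign(f(x))
```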
SVMs with Kernels: Iris dataset, class 2 vs classes 1,3, linear kernel
SVMs with Kernels: Iris dataset, class 1 vs classes 2,3, polynomial kernel of degree 2
SVMs with Kernels: Iris dataset, class 1 vs classes 2,3, Gaussian RBF kernel
SVMs with Kernels: Iris dataset, class 1 vs classes 2,3, Gaussian RBF kernel
SVMs with Kernels: Chessboard dataset, Gaussian RBF kernel
SVMs with Kernels: Chessboard dataset, polynomial kernel
Corel dataset
Corel dataset (Olivier Chapelle, 1998)
USPS handwritten digits
SVMs vs. Kernel Regression
SVM: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b). Kernel regression: f(x) = Σ_i y_i K(x_i, x) / Σ_i K(x_i, x).
Differences:
SVMs: learn the weights α_i (and the bandwidth); often a sparse solution.
Kernel regression: fixed weights, learn only the bandwidth; the solution may not be sparse; much simpler to implement.
SVMs vs. Logistic Regression
Loss function: SVMs use the hinge loss; logistic regression uses the log-loss.
SVMs vs. Logistic Regression
SVM: hinge loss, max(0, 1 - y f(x)).
Logistic regression: log loss (the negative log conditional likelihood), log(1 + exp(-y f(x))).
[Figure: the 0-1 loss, hinge loss, and log loss plotted as functions of the margin y f(x).]
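The figure above can be reproduced with a short script; the plotting choices (margin range, natural log for the log-loss) are assumptions, not specified on the slide.

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 2, 400)                                # margin y f(x)
plt.plot(m, (m <= 0).astype(float), label="0-1 loss")
plt.plot(m, np.maximum(0.0, 1.0 - m), label="hinge loss")
plt.plot(m, np.log(1.0 + np.exp(-m)), label="log loss")    # negative log-likelihood of the logistic model
plt.xlabel("margin y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```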
SVMs vs. Logistic Regression
Loss function: SVMs use the hinge loss; LR uses the log-loss.
High-dimensional features with kernels: SVMs yes; LR yes.
Sparse solution: SVMs often yes; LR almost always no.
Semantics of the output: SVMs give a margin; LR gives real probabilities.
Kernels in Logistic Regression
Define the weights in terms of the features: w = Σ_i α_i Φ(x_i), so that w·Φ(x) = Σ_i α_i K(x_i, x).
Derive a simple gradient descent rule on the α_i.
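One way such a gradient rule can look in code; this is a sketch under my own assumptions (labels y in {-1, +1}, a precomputed Gram matrix K, a small ridge penalty lam, plain full-batch gradient descent), not the lecture's exact derivation.

```python
import numpy as np

def kernel_logreg(K, y, lr=0.1, n_iters=500, lam=1e-3):
    # Minimize (1/n) sum_i log(1 + exp(-y_i f_i)) + (lam/2) alpha^T K alpha, with f = K alpha.
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iters):
        f = K @ alpha                              # f_i = sum_j alpha_j K(x_j, x_i)
        p = 1.0 / (1.0 + np.exp(y * f))            # sigma(-y_i f_i): model prob. of the wrong label
        grad = -K @ (y * p) / n + lam * K @ alpha  # gradient of the penalized log-loss w.r.t. alpha
        alpha -= lr * grad
    return alpha
```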
Can we use kernels in regression?
Ridge regression
Recall the solution: β̂ = (A^T A + λI)^{-1} A^T Y, where A is the n x p matrix of inputs.
Hence A^T A is a p x p matrix whose entries are the (sample) correlations between the features, NOT inner products between the data points; the inner-product matrix would be AA^T, which is n x n (also known as the Gram matrix).
Using the dual formulation, we will write the solution in terms of AA^T.
Ridge regression: β̂ = (A^T A + λI)^{-1} A^T Y. Similarity with SVMs:
Ridge primal problem: min_{β, {z_i}} Σ_{i=1}^n z_i^2 + λ||β||_2^2 s.t. z_i = Y_i - X_i β
SVM primal problem: min_{w, {ξ_i}} C Σ_{i=1}^n ξ_i + (1/2)||w||_2^2 s.t. ξ_i = max(1 - Y_i X_i w, 0)
Lagrangian (for ridge): L(β, z, α) = Σ_{i=1}^n z_i^2 + λ||β||^2 + Σ_{i=1}^n α_i (z_i - Y_i + X_i β)
α_i: Lagrange parameter, one per training point.
Ridge regression (dual): β̂ = (A^T A + λI)^{-1} A^T Y
Dual problem: max_α min_{β, {z_i}} Σ_{i=1}^n z_i^2 + λ||β||^2 + Σ_{i=1}^n α_i (z_i - Y_i + X_i β), with α = {α_i}, i = 1, ..., n: an n-dimensional optimization problem.
Taking derivatives of the Lagrangian w.r.t. β and z_i we get: β = -(1/(2λ)) A^T α and z_i = -α_i / 2.
Substituting back, the dual problem becomes: max_α -(1/4) α^T α - (1/(4λ)) α^T AA^T α - α^T Y.
Ridge regression (dual): (A^T A + λI)^{-1} A^T Y = A^T (AA^T + λI)^{-1} Y
Dual problem: max_α -(1/4) α^T α - (1/(4λ)) α^T AA^T α - α^T Y
Setting the gradient to zero gives α̂ = -2 (AA^T / λ + I)^{-1} Y.
We can then get back β̂ = -(1/(2λ)) A^T α̂ = A^T (AA^T + λI)^{-1} Y:
a weighted average of the training points A^T(...), where (AA^T + λI)^{-1} Y gives the weight of each training point (but the weights are typically not sparse).
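A quick numerical confirmation of the identity between the primal and dual forms of the ridge solution; the random A, Y and λ = 0.5 are made-up test values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.5
A = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_primal = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)   # (A^T A + lam I)^{-1} A^T Y
beta_dual   = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), Y)   # A^T (A A^T + lam I)^{-1} Y
print(np.allclose(beta_primal, beta_dual))                          # True
```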
Kernelized ridge regression
Using the dual, we can rewrite the solution (A^T A + λI)^{-1} A^T Y as β̂ = A^T (AA^T + λI)^{-1} Y.
How does this help? We only need to invert an n x n matrix (instead of a p x p or m x m one).
More importantly, the kernel trick! AA^T involves only inner products between the training points.
BUT we still have an extra A^T. Recall that the predicted label for a test point X is X β̂ = X A^T (AA^T + λI)^{-1} Y, and X A^T contains inner products between the test point X and the training points!
Kernelized ridge regression
Everything can therefore be written in terms of kernels:
f̂_n(X) = K_X (K + λI)^{-1} Y, where K_X(i) = Φ(X)·Φ(X_i) and K(i, j) = Φ(X_i)·Φ(X_j).
We work with kernels and never need to write out the high-dimensional feature vectors.
Kernelized ridge regression
f̂_n(X) = K_X (K + λI)^{-1} Y, where K_X(i) = Φ(X)·Φ(X_i) and K(i, j) = Φ(X_i)·Φ(X_j).
Work with kernels; never write out the high-dimensional vectors.
Examples of kernels: polynomials of degree exactly d, polynomials of degree up to d, Gaussian/Radial kernels.
This is ridge regression with (implicit) nonlinear features Φ(X): f(X) = β·Φ(X).
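Putting the last few slides together, here is a self-contained kernel ridge regression sketch; the RBF kernel, the sine toy data, and the regularization value are all illustrative assumptions, not lecture code.

```python
import numpy as np

def rbf(A, B, sigma=0.5):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (100, 1))                       # toy 1-D inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)       # noisy sine targets

lam = 0.1
c = np.linalg.solve(rbf(X, X) + lam * np.eye(len(Y)), Y)   # (K + lam I)^{-1} Y

X_test = np.array([[0.0], [1.5]])
print(rbf(X_test, X) @ c)       # f(X) = K_X (K + lam I)^{-1} Y, roughly sin(0)=0 and sin(1.5)~1.0
```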