Kernelized Perceptron and Support Vector Machines
Emily Fox, University of Washington
February 13, 2017

What is the perceptron optimizing?
The perceptron algorithm [Rosenblatt 58, 62]
- Classification setting: y in {-1, +1}
- Linear model
  - Prediction: ŷ = sign(w·x)
- Training:
  - Initialize weight vector: w^(0) = 0
  - At each time step t:
    - Observe features: x_t
    - Make prediction: ŷ_t = sign(w^(t)·x_t)
    - Observe true class: y_t
    - Update model: if prediction is not equal to truth, w^(t+1) ← w^(t) + y_t x_t; otherwise leave w unchanged

Perceptron prediction: margin of confidence
- The margin of confidence on a point is y (w·x): large and positive when the prediction is confidently correct, negative when the prediction is wrong.
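A minimal NumPy sketch of the training loop above; the function name perceptron_train, the epoch count, and the data arrays X, y are illustrative choices, not part of the slides.

```python
import numpy as np

def perceptron_train(X, y, n_epochs=10):
    """Perceptron: update the weights only on mistakes (labels y in {-1, +1})."""
    w = np.zeros(X.shape[1])              # initialize weight vector
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            y_hat = np.sign(w @ x_t)      # predict with current weights
            if y_hat != y_t:              # prediction is not equal to truth
                w = w + y_t * x_t         # perceptron update
    return w
```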
Hinge loss
- Perceptron prediction: ŷ_t = sign(w·x_t)
- Makes a mistake when: y_t (w·x_t) ≤ 0
- Hinge loss (closely related to the margin maximization used by SVMs):
  ℓ_t(w) = max(0, -y_t (w·x_t))

Minimizing hinge loss in batch setting
- Given a dataset: {(x_i, y_i)}, i = 1, ..., N
- Minimize average hinge loss: min_w (1/N) Σ_i max(0, -y_i (w·x_i))
- How do we compute the gradient? The hinge is not differentiable everywhere, so we need subgradients.
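A short sketch of the average hinge loss above as a function; the name avg_hinge_loss and the assumption that X is an N×d array and y a length-N vector of ±1 labels are illustrative.

```python
import numpy as np

def avg_hinge_loss(w, X, y):
    """Average perceptron-style hinge loss: mean of max(0, -y_i * (w . x_i))."""
    margins = y * (X @ w)                     # y_i * (w . x_i) for every point
    return np.mean(np.maximum(0.0, -margins))
```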
Subgradients of convex functions
- Gradients lower-bound convex functions: f(x') ≥ f(x) + ∇f(x)·(x' - x)
- Gradients are unique at x if the function is differentiable at x
- Subgradients generalize gradients to non-differentiable points:
  - Any plane that lower-bounds the function: g is a subgradient at x if f(x') ≥ f(x) + g·(x' - x) for all x'

Subgradient of hinge
- Hinge loss: ℓ_t(w) = max(0, -y_t (w·x_t))
- Subgradient of hinge loss:
  - If y_t (w·x_t) > 0: 0
  - If y_t (w·x_t) < 0: -y_t x_t
  - If y_t (w·x_t) = 0: either of the above (any point on the segment between them)
  - In one line: g = -1[y_t (w·x_t) ≤ 0] y_t x_t
Subgradient descent for hinge minimization
- Given data: {(x_i, y_i)}, i = 1, ..., N
- Want to minimize: (1/N) Σ_i max(0, -y_i (w·x_i))
- Subgradient descent works the same as gradient descent:
  - But if there are multiple subgradients at a point, just pick (any) one:
    w^(t+1) ← w^(t) + η (1/N) Σ_i 1[y_i (w^(t)·x_i) ≤ 0] y_i x_i
  (a sketch of this update appears below)

Perceptron revisited
- Perceptron update: w^(t+1) ← w^(t) + 1[y_t (w^(t)·x_t) ≤ 0] y_t x_t
- Batch hinge minimization update: w^(t+1) ← w^(t) + η (1/N) Σ_i 1[y_i (w^(t)·x_i) ≤ 0] y_i x_i
- Difference? The perceptron takes a step using a single point (a stochastic subgradient step with step size 1), while the batch update averages the subgradient over the whole dataset.
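A minimal sketch of the batch subgradient update above, assuming NumPy arrays X and y as before; the fixed step size and iteration count are illustrative.

```python
import numpy as np

def hinge_subgradient_descent(X, y, step=0.1, n_iters=100):
    """Batch subgradient descent on the average hinge loss (1/N) sum_i max(0, -y_i * (w . x_i))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        mistakes = (y * (X @ w)) <= 0     # indicator 1[y_i (w . x_i) <= 0]
        g = -(mistakes * y) @ X / n       # one chosen subgradient of the average hinge loss
        w = w - step * g                  # step along the negative subgradient
    return w
```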
The kernelized perceptron

- What if the data are not linearly separable? Use features of features of features of features...
- Feature space can get really large really quickly!
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions d, for polynomial degrees p = 2, 3, 4; the number of terms grows fast!]
- For p = 6, d = 100: about 1.6 billion terms

Perceptron revisited
- Given weight vector w^(t), predict point x by: ŷ = sign(w^(t)·x)
- Mistake at time t: w^(t+1) ← w^(t) + y_t x_t
- Thus, write the weight vector in terms of the mistaken data points only:
  - Let M^(t) be the set of time steps up to t when mistakes were made: w^(t) = Σ_{i ∈ M^(t)} y_i x_i
- Prediction rule now: ŷ = sign( Σ_{i ∈ M^(t)} y_i (x_i·x) )
- When using high-dimensional features: ŷ = sign( Σ_{i ∈ M^(t)} y_i (φ(x_i)·φ(x)) )
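The 1.6-billion figure above can be checked directly: the number of monomials of degree exactly p in d variables is C(d + p - 1, p). A quick sketch (the helper name n_monomials is illustrative):

```python
from math import comb

def n_monomials(d, p):
    """Number of monomials of degree exactly p in d variables: C(d + p - 1, p)."""
    return comb(d + p - 1, p)

print(n_monomials(100, 6))   # 1609344100 -- about 1.6 billion terms
```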
Dot-product of polynomials
- For feature maps φ built from all monomials of degree exactly p (with suitable coefficients), the dot product collapses to: φ(u)·φ(v) = (u·v)^p

Finally, the kernel trick! Kernelized perceptron
- Every time you make a mistake, remember (x_t, y_t)
- Kernelized perceptron prediction for x: ŷ = sign( Σ_{i ∈ M} y_i K(x_i, x) ), where K(u, v) = φ(u)·φ(v)
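A minimal sketch of the kernelized perceptron described above; the choice of polynomial kernel, the epoch count, and the function names are illustrative assumptions.

```python
import numpy as np

def poly_kernel(u, v, p=3):
    """Kernel for monomials of degree exactly p: (u . v)^p."""
    return (u @ v) ** p

def kernelized_perceptron(X, y, kernel=poly_kernel, n_epochs=10):
    """Train by remembering mistakes; never form the high-dimensional features."""
    mistakes = []                                     # list of (x_i, y_i) pairs
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            score = sum(y_i * kernel(x_i, x_t) for x_i, y_i in mistakes)
            if np.sign(score) != y_t:                 # mistake (zero score counts as one)
                mistakes.append((x_t, y_t))
    return mistakes

def kernel_predict(mistakes, x, kernel=poly_kernel):
    """Prediction: sign of sum_i y_i K(x_i, x) over the remembered mistakes."""
    return np.sign(sum(y_i * kernel(x_i, x) for x_i, y_i in mistakes))
```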
Polynomial kernels
- All monomials of degree exactly p in O(d) operations: K(u, v) = (u·v)^p
- How about all monomials of degree up to p?
  - Solution 0: add a kernel for each degree, K(u, v) = Σ_{q=0}^{p} (u·v)^q
  - Better solution: K(u, v) = (u·v + 1)^p

Common kernels
- Polynomials of degree exactly p: K(u, v) = (u·v)^p
- Polynomials of degree up to p: K(u, v) = (u·v + 1)^p
- Gaussian (squared exponential) kernel: K(u, v) = exp(-||u - v||² / (2σ²))
- Sigmoid: K(u, v) = tanh(η (u·v) + ν)
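The common kernels above written as one-liners; the parameter defaults (degree, bandwidth σ, and the sigmoid constants) are illustrative.

```python
import numpy as np

def poly_exact(u, v, p=3):
    return (u @ v) ** p                        # monomials of degree exactly p

def poly_up_to(u, v, p=3):
    return (u @ v + 1.0) ** p                  # monomials of degree up to p

def gaussian(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0):
    return np.tanh(eta * (u @ v) + nu)
```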
What you need to know
- Linear separability in a higher-dimensional feature space
- The kernel trick
- Kernelized perceptron
- How to derive the polynomial kernel
- Common kernels

Support vector machines (SVMs)
Linear classifiers
- Which line is better?
- Pick the one with the largest margin!
Maximize the margin

But there are many planes
Review: Normal to a plane

A convention: normalized margin and canonical hyperplanes
[Figure: canonical hyperplanes passing through the closest positive and negative points, x+ and x-]
Margin maximization using canonical hyperplanes
- Unnormalized problem: maximize the margin directly, max_{w,b} γ subject to y_i (w·x_i + b) ≥ γ for all i (with the scale of w fixed)
- Normalized problem: fix the scale so the constraint becomes 1, giving min_{w,b} ||w||² subject to y_i (w·x_i + b) ≥ 1 for all i

Support vector machines (SVMs)
- min_{w,b} ||w||² subject to y_i (w·x_i + b) ≥ 1 for all i
- Solve efficiently by many methods, e.g.:
  - quadratic programming (QP): well-studied solution algorithms
  - stochastic gradient descent
- Hyperplane defined by the support vectors (the points that meet the margin constraint with equality)
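As one illustration of an off-the-shelf solver (not part of the slides), scikit-learn's SVC can fit this problem; a very large C approximates the hard-margin formulation above, and the toy data here are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard-margin SVM
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # the learned hyperplane (w, b)
print(clf.support_vectors_)          # the hyperplane is defined by these points
```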
What if data are not linearly separable?
- Use features of features of features of features...

What if data are still not linearly separable?
- If data are not linearly separable, some points don't satisfy the margin constraint: y_i (w·x_i + b) ≥ 1
- How bad is the violation? Allow a slack ξ_i ≥ 0 per point, so that y_i (w·x_i + b) ≥ 1 - ξ_i
- Trade off margin violations against ||w||: min_{w,b,ξ} ||w||² + C Σ_i ξ_i subject to y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
SVMs for non-linearly separable data, meet my friend the perceptron
- The perceptron was minimizing the hinge loss: Σ_i max(0, -y_i (w·x_i))
- SVMs minimize the regularized hinge loss: Σ_i max(0, 1 - y_i (w·x_i + b)) + λ ||w||²

Stochastic gradient descent for SVMs
- Perceptron minimization: min_w Σ_i max(0, -y_i (w·x_i))
- SVM minimization: min_{w,b} Σ_i max(0, 1 - y_i (w·x_i + b)) + λ ||w||²
- SGD for the perceptron: pick a point t; if y_t (w·x_t) ≤ 0, update w ← w + η y_t x_t
- SGD for SVMs: pick a point t; update w ← w - η ( λ w - 1[y_t (w·x_t + b) < 1] y_t x_t ), and b ← b + η 1[y_t (w·x_t + b) < 1] y_t
(a code sketch of the SVM update follows below)
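A minimal sketch of stochastic subgradient descent on the regularized hinge loss above; the constant step size, the per-point scaling of the regularizer, and the function name svm_sgd are illustrative choices.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, step=0.1, n_epochs=20, seed=0):
    """SGD on (lam/2)*||w||^2 + mean_i max(0, 1 - y_i * (w . x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        for t in rng.permutation(n):
            margin = y[t] * (X[t] @ w + b)
            if margin < 1:                            # hinge is active: include its subgradient
                w = w - step * (lam * w - y[t] * X[t])
                b = b + step * y[t]
            else:                                     # only the regularizer contributes
                w = w - step * lam * w
    return w, b
```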
Mixture model example
[Figure: SVM decision boundary on the mixture data; from the Hastie, Tibshirani, Friedman book]

Mixture model example: kernels
[Figure: kernelized SVM decision boundary on the same mixture data; from the Hastie, Tibshirani, Friedman book]
What you need to know
- Maximizing the margin
- Derivation of the SVM formulation
- Non-linearly separable case
  - Hinge loss, a.k.a. adding slack variables
- SVMs = perceptron + L2 regularization
- Can optimize SVMs with SGD
  - Many other approaches possible