Machine Learning A Geometric Approach


Machine Learning: A Geometric Approach. CIML book, Chap. 7.7. Linear Classification: Support Vector Machines (SVM). Professor Liang Huang (some slides from Alex Smola, CMU).

Linear Separator (figure: separating ham from spam examples)

From Perceptron to SVM (timeline; * = mentioned in lectures but optional, the other papers are all covered in detail):
1959 Rosenblatt: invention of the perceptron (online)
1962 Novikoff: convergence proof
1964 Vapnik & Chervonenkis: +max margin, +kernels (batch)
1997 Cortes/Vapnik: SVM, +soft-margin for the inseparable case (at AT&T Research after the fall of the USSR)
1999 Freund/Schapire: voted/averaged perceptron (revived the perceptron; conservative updates)
2002 Collins: structured perceptron (AT&T Research)
2003 Crammer/Singer: MIRA (online approximation of max margin)
2005* McDonald/Crammer/Pereira: structured MIRA (ex-AT&T and students)
2006 Singer group: aggressive MIRA
2007-2010* Singer group: Pegasos (subgradient descent, minibatch)

Large Margin Classifier: a linear function f(x) = ⟨w, x⟩ + b, with ⟨w, x⟩ + b ≤ −1 on the negative examples and ⟨w, x⟩ + b ≥ 1 on the positive examples.

Why large margins? Maximum robustness relative to uncertainty; symmetry breaking; independent of correctly classified instances; easy to find for easy problems. (Figure: points with uncertainty radius r inside a margin of width ρ.)

Feature Map Φ: SVM is often used with kernels.

Large Margin Classifier (figure: the regions ⟨w, x⟩ + b ≤ −1 and ⟨w, x⟩ + b ≥ 1, bounded by the hyperplanes ⟨w, x⟩ + b = −1 and ⟨w, x⟩ + b = 1, with normal vector w). Functional margin: y_i (w · x_i). Geometric margin: y_i (w · x_i) / ‖w‖; for points on the margin hyperplanes this equals 1 / ‖w‖.

Large Margin Classifier. Q1: what if we want a functional margin of 2? Q2: what if we want a geometric margin of 1? SVM objective (max version): max_w 1/‖w‖ s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1, i.e., maximize the geometric margin subject to the functional margin being at least 1.

Large Margin Classifier. SVM objective (min version): min_w ‖w‖ s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1, i.e., minimize the weight vector's norm subject to the functional margin being at least 1. Interpretation: small models generalize better.

Large Margin Classifier. SVM objective (min version): min_w ½‖w‖² s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1. ‖w‖ is not differentiable, but ‖w‖² is; we still minimize the weight vector subject to the functional margin being at least 1.

SVM vs. MIRA. SVM: minimize the weight vector to enforce a functional margin of at least 1 on ALL EXAMPLES: min_w ½‖w‖² s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1. MIRA: minimize the weight change to enforce a functional margin of at least 1 on THIS EXAMPLE: min_{w'} ‖w' − w‖² s.t. w' · x ≥ 1 (for a positive example x_i). MIRA is a 1-step, online approximation of SVM; aggressive MIRA approaches SVM as p → 1. (Figure: perceptron vs. MIRA update on example x_i.)

Convex Hull Interpretation: maximize the distance between the convex hulls of the two classes. How many support vectors in 2D? The weight vector is determined by the support vectors alone; c.f. perceptron: w = Σ_{(x,y) ∈ errors} y x. What about MIRA? Why don't we use convex hulls for SVMs in practice?

Convexity and Convex Hulls (figure: convex combinations of points).

Optimization. Primal optimization problem: min_w ½‖w‖² s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1 (the constraint). Convex optimization: a convex function over a convex set. Quadratic programming: a quadratic function with linear constraints. (Figure: linear vs. quadratic functions.)

MIRA as QP. MIRA is a trivial QP and can be solved geometrically: min_{w'} ‖w' − w‖² s.t. w' · x_i ≥ 1 projects w_i onto the constraint half-space, moving it a distance (1 − w_i · x_i)/‖x_i‖ along the direction x_i/‖x_i‖. What about multiple constraints (e.g., a minibatch)?
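To make the geometric solution concrete, here is a minimal sketch (mine, not from the slides) of the single-constraint MIRA step; the closed form w' = w + τ y x with τ = (1 − y (w · x)) / ‖x‖² is exactly the projection described above.

import numpy as np

def mira_update(w, x, y):
    # Smallest change to w (in Euclidean norm) that gives example (x, y)
    # a functional margin of at least 1:
    #   min_{w'} ||w' - w||^2   s.t.   y (w' . x) >= 1
    margin = y * np.dot(w, x)
    if margin >= 1.0:                      # constraint already satisfied: no update
        return w
    tau = (1.0 - margin) / np.dot(x, x)    # step length; ||tau * x|| = (1 - margin) / ||x||
    return w + tau * y * x

For a minibatch of constraints there is no such closed form, which is where the QP machinery below comes in.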

Optimization. Primal optimization problem: min_w ½‖w‖² s.t. ∀(x, y) ∈ D, y(w · x) ≥ 1. Convex optimization: a convex function over a convex set. Lagrange function: L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i [⟨x_i, w⟩ + b] − 1]. The derivatives in w and b need to vanish: ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0 and ∂_b L(w, b, α) = −Σ_i α_i y_i = 0. Hence w = Σ_i α_i y_i x_i: the model is a linear combination of a small subset of the input (the support vectors), i.e., those with α_i > 0.

Lagrangian & Saddle Point. Toy examples: equality constraint, min x² s.t. x = 1; inequality constraint, min x² s.t. x ≥ 1. Lagrangian: L(x, α) = x² − α(x − 1). The derivative in x needs to vanish; optimality is at a saddle point with respect to α: min over x in the primal ⇒ max over α in the dual.
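Filling in the algebra for the slide's inequality example, min x² s.t. x ≥ 1:

  L(x, α) = x² − α(x − 1),  with α ≥ 0
  ∂L/∂x = 2x − α = 0  ⇒  x(α) = α/2
  g(α) = L(x(α), α) = α²/4 − α(α/2 − 1) = −α²/4 + α
  max_{α ≥ 0} g(α):  g'(α) = −α/2 + 1 = 0  ⇒  α* = 2,  g(α*) = 1
  x* = α*/2 = 1, matching the primal optimum (x = 1, objective value 1): the saddle point of L.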

Constrained Optimization. minimize_{w,b} ½‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1 (the constraints). This is quadratic programming: a quadratic objective with linear constraints. Karush-Kuhn-Tucker: w = Σ_i α_i y_i x_i, and the KKT condition (complementary slackness) α_i [y_i [⟨w, x_i⟩ + b] − 1] = 0. The optimum is achieved at the active constraints, where α_i > 0 (α_i = 0 ⇒ the constraint is inactive).

KKT ⇒ Support Vectors. minimize_{w,b} ½‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with w = Σ_i α_i y_i x_i. Karush-Kuhn-Tucker (KKT) optimality condition: α_i [y_i [⟨w, x_i⟩ + b] − 1] = 0, so either α_i = 0, or α_i > 0 ⇒ y_i [⟨w, x_i⟩ + b] = 1 (the point lies exactly on the margin).

Properties. w = Σ_i α_i y_i x_i: the weight vector is a weighted linear combination of instances. Only points on the margin matter (ignore the rest and get the same solution). Only inner products matter, so this is a quadratic program and we can replace the inner product by a kernel. The solution keeps instances away from the margin.

Alternative: Primal ⇒ Dual. Lagrange function: L(w, b, α) = ½‖w‖² − Σ_i α_i [y_i [⟨x_i, w⟩ + b] − 1]. The derivatives in w and b need to vanish: ∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0 and ∂_b L(w, b, α) = −Σ_i α_i y_i = 0. Plugging w back into L yields the dual: maximize −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i over the dual variables α, subject to Σ_i α_i y_i = 0 and α_i ≥ 0.

Primal vs. Dual. Primal: minimize_{w,b} ½‖w‖² subject to y_i [⟨x_i, w⟩ + b] ≥ 1, with w = Σ_i α_i y_i x_i. Dual: maximize −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i over the dual variables α, subject to Σ_i α_i y_i = 0 and α_i ≥ 0.

Solving the optimization problem. Dual problem: maximize −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i over the dual variables α, subject to Σ_i α_i y_i = 0 and α_i ≥ 0. If the problem is small enough (1000s of variables) we can use an off-the-shelf solver (CVXOPT, CPLEX, OOQP, LOQO). For larger problems, use the fact that only the support vectors matter and solve in blocks (active set methods).
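As an illustration of the off-the-shelf route, here is a sketch (my own, not from the slides) that feeds the hard-margin dual to CVXOPT; the function name, the 1e-6 support-vector threshold, and the way b is recovered are my choices, and linearly separable data are assumed.

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    # Dual:  max_a  sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    #        s.t.   a_i >= 0,  sum_i a_i y_i = 0
    # CVXOPT's qp solves  min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones(n))                     # maximize sum(a)  <=>  minimize -1^T a
    G = matrix(-np.eye(n))                      # -a <= 0, i.e. a >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                         # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                           # support vectors have a_i > 0
    bias = np.mean(y[sv] - X[sv] @ w)           # from y_i (<w, x_i> + b) = 1 on the margin
    return w, bias, alpha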

Quadratic Program in Dual. Dual problem in matrix form: maximize −½ αᵀQα + αᵀb subject to α ≥ 0. Q: what is the Q in the SVM primal? How about the Q in the SVM dual? Quadratic programming: the objective is a quadratic function with Q positive semidefinite, the constraints are linear functions. Methods: gradient descent; coordinate descent (a.k.a. the Hildreth algorithm); sequential minimal optimization (SMO).

Convex QP. If Q is positive (semi)definite, i.e., xᵀQx ≥ 0 for all x, then the QP is convex and a local min/max is a global min/max. If Q = 0, it reduces to linear programming. If Q is indefinite, the stationary point is a saddle point. General QP is NP-hard; convex QP is polynomial-time. (Figure: nesting of LP, convex QP, QP, and CP, with SVM inside convex QP.)

QP: Hildreth Algorithm. Idea 1: update one coordinate while fixing all other coordinates. E.g., updating coordinate i means solving argmax_{α_i} −½ αᵀQα + αᵀb subject to α ≥ 0: a quadratic function of only one variable, whose maximum is where the first-order derivative is 0 (then projected back onto α_i ≥ 0).

QP: Hildreth Algorithm. Idea 2: choose another coordinate and repeat until a stopping criterion is met: the maximum is reached, or the increase between two consecutive iterations is very small, or some number of iterations has passed. How to choose the coordinate (the sweep pattern): sequential (1, 2, ..., n, 1, 2, ..., n, ...); back-and-forth (1, 2, ..., n, n−1, n−2, ..., 1, 2, ...); random (a permutation of 1, 2, ..., n); maximal descent (choose the i with the maximal descent in the objective).

QP: Hildreth Algorithm. Initialize α_i = 0 for all i. Repeat: pick i following the sweep pattern; solve α_i ← argmax_{α_i} −½ αᵀQα + αᵀb subject to α ≥ 0; until the stopping criterion is met.
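A short sketch (mine, not from the slides) of this coordinate-ascent loop with the sequential sweep pattern; the tolerance and sweep cap are arbitrary choices, and Q is assumed positive semidefinite with a nonzero diagonal.

import numpy as np

def hildreth(Q, b, n_sweeps=100, tol=1e-8):
    # Coordinate ascent for  max_a  -1/2 a^T Q a + a^T b   s.t.  a >= 0
    n = len(b)
    alpha = np.zeros(n)
    prev = -np.inf
    for _ in range(n_sweeps):
        for i in range(n):                                # sequential sweep: 1, 2, ..., n, repeat
            rest = Q[i] @ alpha - Q[i, i] * alpha[i]      # sum_{j != i} Q_ij a_j
            alpha[i] = max(0.0, (b[i] - rest) / Q[i, i])  # 1-D maximizer, clipped to a_i >= 0
        obj = -0.5 * alpha @ Q @ alpha + alpha @ b
        if obj - prev < tol:                              # stop when the increase is tiny
            break
        prev = obj
    return alpha

It can be run on the 2x2 example of the next slide (as reconstructed there): hildreth(np.array([[4., 1.], [1., 2.]]), np.array([6., 4.])).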

QP: Hildreth Algorithm, worked example: maximize −½ αᵀQα + αᵀb subject to α ≥ 0, with a 2x2 Q whose entries read 4, 1, 1, 2 and b = (6, 4) in the transcription; choose coordinates 1, 2, 1, 2, ...

QP: Hildreth Algorithm. Pros: extremely simple, no gradient calculation, easy to implement. Cons: converges slowly compared to other methods, and cannot deal with too many constraints; it works for minibatch MIRA but not for SVM.

Linear Separator (figure: ham vs. spam, revisited)

Large Margin Classifier (figure: the regions ⟨w, x⟩ + b ≤ −1 and ⟨w, x⟩ + b ≥ 1 and the linear function f(x) = ⟨w, x⟩ + b); with the added points, a linear separator is impossible.

Large Margin Classifier: a perfect linear separator is impossible, so consider the minimum error separator. Theorem (Minsky & Papert): finding the minimum error separating hyperplane is NP-hard.

Adding slack variables: relax the constraints to ⟨w, x⟩ + b ≤ −1 + ξ and ⟨w, x⟩ + b ≥ 1 − ξ, and minimize the amount of slack. This is still a convex optimization problem.

Margin violation vs. misclassification: a misclassification (ξ > 1) is also a margin violation (ξ > 0).

Adding slack variables. Hard margin problem: minimize_{w,b} ½‖w‖² subject to y_i [⟨w, x_i⟩ + b] ≥ 1. With slack variables: minimize_{w,b} ½‖w‖² + C Σ_i ξ_i subject to y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i and ξ_i ≥ 0 (w determines ξ). The problem is always feasible; proof: w = 0, b = 0, ξ_i = 1 (this also yields an upper bound on the objective). What happens when C = 0? When C = +∞? C = +∞ means no tolerance for violations, i.e., the hard margin.

Review: Convex Optimization. Primal optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0. Lagrange function: L(x, α) = f(x) + Σ_i α_i c_i(x). First-order optimality conditions in x: ∂_x L(x, α) = ∂_x f(x) + Σ_i α_i ∂_x c_i(x) = 0. Solve for x, plug it back into L, and maximize L(x(α), α) over α (keeping the explicit constraints on α).

Dual Problem. Primal optimization problem: minimize_{w,b} ½‖w‖² + C Σ_i ξ_i subject to y_i [⟨w, x_i⟩ + b] ≥ 1 − ξ_i and ξ_i ≥ 0. Lagrange function: L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i [y_i [⟨x_i, w⟩ + b] + ξ_i − 1] − Σ_i η_i ξ_i. Optimality in (w, ξ) is at a saddle point with (α, η); the derivatives in (w, ξ) need to vanish.

Dual Problem. From the Lagrange function L(w, b, ξ, α, η) above, the derivatives need to vanish: ∂_w L = w − Σ_i α_i y_i x_i = 0; ∂_b L = −Σ_i α_i y_i = 0; ∂_{ξ_i} L = C − α_i − η_i = 0. Plugging these back into L yields the dual: maximize −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i over the dual variables α, subject to Σ_i α_i y_i = 0 and α_i ∈ [0, C]. The box constraint α_i ≤ C bounds the influence of any single example.
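If this soft-margin dual is handed to a QP solver, the only change from the hard-margin sketch earlier is the box constraint: α ≥ 0 and α ≤ C are stacked into the inequality block (a small sketch continuing the CVXOPT example above; the helper name is mine).

import numpy as np
from cvxopt import matrix

def soft_margin_box(n, C):
    # Encode 0 <= alpha_i <= C as  G alpha <= h  for CVXOPT's qp:
    #   -alpha <= 0   and   alpha <= C
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    return G, h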

Karush-Kuhn-Tucker Conditions. With L(w, b, ξ, α, η) as above and w = Σ_i α_i y_i x_i, complementary slackness gives α_i [y_i [⟨w, x_i⟩ + b] + ξ_i − 1] = 0 and η_i ξ_i = 0, with 0 ≤ α_i ≤ C and η_i = C − α_i. Hence: α_i = 0 ⇒ y_i [⟨w, x_i⟩ + b] ≥ 1; 0 < α_i < C ⇒ y_i [⟨w, x_i⟩ + b] = 1; α_i = C ⇒ y_i [⟨w, x_i⟩ + b] ≤ 1. Why are these cases not disjoint? (Figure: points labeled α = 0, 0 < α < C, and α = C relative to the margin.)

Support Vectors and Violations. All circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications).

Support Vectors and Violations (figure): non-support vectors (α = 0, ξ = 0); support vectors (0 < α ≤ C, ξ ≥ 0), among them margin violations (α = C, ξ > 0) and misclassifications (α = C, ξ > 1). All circled and squared examples are support vectors (α > 0); they include ξ = 0 (hard-margin SVs), ξ > 0 (margin violations), and ξ > 1 (misclassifications).

SVM with sklearn (figures from two runs of python demo.py 1e10)

SVM with sklearn (figures from python demo.py 0.1 and python demo.py 0.01)

SVM with sklearn. Left session: four linearly separable points with C = 1e10 (effectively hard margin):

In [2]: clf = svm.SVC(kernel='linear', C=1e10)
In [3]: X = [[1,1], [1,-1], [-1,1], [-1,-1]]
In [4]: Y = [1,1,-1,-1]
In [5]: clf.fit(X, Y)
In [6]: clf.support_
Out[6]: array([3, 1], dtype=int32)
In [7]: clf.dual_coef_
Out[7]: array([[-0.5, 0.5]])
In [8]: clf.coef_
Out[8]: array([[ 1., 0.]])
In [9]: clf.intercept_
Out[9]: array([-0.])
In [10]: clf.support_vectors_
Out[10]: array([[-1., -1.],
                [ 1., -1.]])
In [11]: clf.n_support_
Out[11]: array([1, 1], dtype=int32)

Right session: a fifth point [-0.1, -0.1] labeled +1 is added and C = 1; the added point becomes a margin violator with α = C (the slide's α=C annotation):

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.support_vectors_
Out[6]: array([[-1. ,  1. ],
               [-1. , -1. ],
               [ 1. , -1. ],
               [-0.1, -0.1]])
In [7]: clf.dual_coef_
Out[7]: array([[-0.45, -0.6 ,  0.05,  1.  ]])
In [8]: clf.coef_
Out[8]: array([[ 1.00000000e+00, 1.49011611e-09]])
In [9]: clf.intercept_
Out[9]: array([-0.])

SVM with sklearn. Left session: the five-point data refit with C = 1e10 (hard-margin-like):

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-0.1,-0.1]]
In [3]: Y = [1,1,-1,-1, 1]
In [12]: clf = svm.SVC(kernel='linear', C=1e10)
In [13]: clf.fit(X, Y)
In [14]: clf.coef_
Out[14]: array([[ 2.02010102e+00, 1.00999543e-04]])
In [15]: clf.intercept_
Out[15]: array([ 1.02013469])
In [16]: clf.support_vectors_
Out[16]: array([[-1.  ,  1.  ],
                [-1.  , -1.  ],
                [-0.01, -0.01]])
In [17]: clf.dual_coef_
Out[17]: array([[-1.01000001, -1.03050607, 2.04050608]])

Right: the earlier four-point, C = 1e10 session (In [2]-In [11] above) repeated for comparison.

SVM with sklearn. Left: the four-point, C = 1e10 session repeated for comparison. Right: a fifth point [-2, 0] labeled +1 is added with C = 1; all five points become support vectors, and the KKT cases from before apply (α_i = 0 ⇒ y_i[⟨w, x_i⟩ + b] ≥ 1; 0 < α_i < C ⇒ y_i[⟨w, x_i⟩ + b] = 1; α_i = C ⇒ y_i[⟨w, x_i⟩ + b] ≤ 1). What if C = 1e10?

In [2]: X = [[1,1], [1,-1], [-1,1], [-1,-1], [-2,0]]
In [3]: Y = [1,1,-1,-1, 1]
In [4]: clf = svm.SVC(kernel='linear', C=1)
In [5]: clf.fit(X, Y)
In [6]: clf.coef_
Out[6]: array([[ 1., 0.]])
In [7]: clf.support_vectors_
Out[7]: array([[-1.,  1.],
               [-1., -1.],
               [ 1.,  1.],
               [ 1., -1.],
               [-2.,  0.]])
In [8]: clf.dual_coef_
Out[8]: array([[-1., -1., 0.5, 0.5, 1. ]])
In [9]: clf.intercept_
Out[9]: array([-0.])
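A small follow-up check (not on the slides, but using only documented sklearn attributes): for a fitted linear SVC, dual_coef_ stores y_i α_i for the support vectors, so the KKT relation w = Σ_i α_i y_i x_i can be verified directly.

import numpy as np
from sklearn import svm

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1], [-2, 0]], dtype=float)
Y = np.array([1, 1, -1, -1, 1])

clf = svm.SVC(kernel='linear', C=1)
clf.fit(X, Y)

w = clf.dual_coef_ @ clf.support_vectors_   # sum_i (y_i alpha_i) x_i over the support vectors
print(np.allclose(w, clf.coef_))            # True: the primal weights are the dual combination
print(Y * clf.decision_function(X))         # functional margins y_i f(x_i); support vectors have values <= 1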

hard vs. soft margins

hard vs. soft margins

C=1 (figure)

C=20 (figure)

From Constrained Optimization to Unconstrained Optimization (back to the primal). Learning an SVM has been formulated as a constrained optimization problem over w and ξ: min_{w ∈ R^d, ξ_i ∈ R^+} ‖w‖² + C Σ_{i=1}^N ξ_i subject to y_i (wᵀx_i + b) ≥ 1 − ξ_i for i = 1 ... N. The constraint y_i (wᵀx_i + b) ≥ 1 − ξ_i can be written more concisely as y_i f(x_i) ≥ 1 − ξ_i, which, together with ξ_i ≥ 0, is equivalent to ξ_i = max(0, 1 − y_i f(x_i)) (w determines ξ). Hence the learning problem is equivalent to the unconstrained optimization problem over w: min_{w ∈ R^d} ‖w‖² + C Σ_{i=1}^N max(0, 1 − y_i f(x_i)), i.e., regularization + loss function. (Slides 49-56 from Andrew Zisserman (Oxford/DeepMind), with annotations.)
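A direct transcription of that unconstrained objective into code (a sketch; the function name and argument layout are mine):

import numpy as np

def svm_objective(w, b, X, y, C):
    # ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)),  with f(x) = w.x + b.
    # The hinge term is the optimal slack xi_i = max(0, 1 - y_i f(x_i)),
    # i.e. "w determines xi".
    f = X @ w + b
    xi = np.maximum(0.0, 1.0 - y * f)
    return w @ w + C * xi.sum()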

Loss function. min_{w ∈ R^d} ‖w‖² + C Σ_i max(0, 1 − y_i f(x_i)), with separating hyperplane wᵀx + b = 0 (figure: support vectors on the margin, normal w). Points fall into three categories: 1. y_i f(x_i) > 1: the point is outside the margin and contributes nothing to the loss. 2. y_i f(x_i) = 1: the point is on the margin and contributes nothing to the loss, as in the hard-margin case. 3. y_i f(x_i) < 1: the point violates the margin constraint and contributes to the loss (margin violation ξ > 0, including misclassification ξ > 1).

Loss functions. SVM uses the hinge loss max(0, 1 − y_i f(x_i)), an approximation to the 0-1 loss, plotted against y_i f(x_i): very negative values are very bad (misclassification), values between 0 and 1 are not good enough, and values ≥ 1 are good. (The perceptron uses a shifted hinge loss touching the origin.)
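The three regimes above are easy to see by evaluating the losses as functions of the margin z = y_i f(x_i) (a small illustrative sketch, not from the slides):

import numpy as np

z = np.linspace(-2.0, 2.0, 9)              # functional margins y_i f(x_i)
zero_one   = (z <= 0).astype(float)        # 0-1 loss: 1 on the wrong side, else 0
hinge      = np.maximum(0.0, 1.0 - z)      # SVM hinge loss: 0 only once the margin reaches 1
perceptron = np.maximum(0.0, -z)           # shifted hinge touching the origin (perceptron)
print(np.c_[z, zero_one, hinge, perceptron])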

SVM: min_{w ∈ R^d} C Σ_{i=1}^N max(0, 1 − y_i f(x_i)) + ‖w‖²; each term is convex, and convex + convex = convex!

Gradient (or steepest) descent algorithm for SVM. To minimize a cost function C(w), use the iterative update w_{t+1} ← w_t − η_t ∇_w C(w_t), where η is the learning rate. First, rewrite the optimization problem as an average: C(w) = (λ/2) ‖w‖² + (1/N) Σ_{i=1}^N max(0, 1 − y_i f(x_i)) = (1/N) Σ_{i=1}^N [ (λ/2) ‖w‖² + max(0, 1 − y_i f(x_i)) ], with λ = 2/(NC) up to an overall scale of the problem, and f(x) = wᵀx + b. Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for hinge loss. L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)) with f(x_i) = wᵀx_i + b. For y_i f(x_i) < 1, ∂L/∂w = −y_i x_i; for y_i f(x_i) > 1, ∂L/∂w = 0; at the kink y_i f(x_i) = 1, any value in between is a valid sub-gradient.

Sub-gradient descent algorithm for SVM. With C(w) = (1/N) Σ_{i=1}^N [ (λ/2) ‖w‖² + L(x_i, y_i; w) ], the iterative (batch gradient) update is w_{t+1} ← w_t − η ∇_{w_t} C(w_t) = w_t − η (1/N) Σ_{i=1}^N (λ w_t + ∂_w L(x_i, y_i; w_t)), where η is the learning rate. Each iteration t then involves cycling through the training data with online-gradient updates, just like the perceptron (whose threshold is 0 instead of 1): w_{t+1} ← w_t − η (λ w_t − y_i x_i) if y_i f(x_i) < 1, and w_{t+1} ← w_t − η λ w_t otherwise. In the Pegasos algorithm the learning rate is set at η_t = 1/(λ t).


Pegasos: Stochastic Gradient Descent (SGD). Randomly sample from the training data. (Figures: the objective ("energy") decreasing over iterations on a log scale, and the resulting decision boundary.) SGD is an online update: the gradient on one example (unbiasedly) approximates the gradient on the whole training data. (Slides 49-56 from Andrew Zisserman (Oxford/DeepMind), with annotations.)
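A compact sketch (mine, not the reference implementation) of the Pegasos update just described, ignoring the bias term and the optional projection step; the iteration count and seed are arbitrary choices.

import numpy as np

def pegasos(X, y, lam, n_iters=10000, seed=0):
    # SGD on  lambda/2 ||w||^2 + (1/N) sum_i max(0, 1 - y_i w.x_i)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                      # randomly sample one training example
        eta = 1.0 / (lam * t)                    # Pegasos learning rate: eta_t = 1/(lambda t)
        if y[i] * (w @ X[i]) < 1:                # margin violated: sub-gradient is lam*w - y_i x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                    # sub-gradient is just lam*w
            w = (1 - eta * lam) * w
    return w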