COMP 551 Applied Machine Learning
Lecture 11: Support Vector Machines
Instructor: jpineau@cs.mcgill.ca
Class web page: www.cs.mcgill.ca/~jpineau/cmp551
Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.
Today's quiz
- In the random forest approach proposed by Breiman, how many hyper-parameters need to be specified? 1, 2, 3, 4, 5
- What is the complexity of each iteration of AdaBoost, assuming your weak learner is a decision stump and you have all binary variables? Let M be the number of features and N be the number of examples. O(M), O(N), O(MN), O(MN²)
- Which of the two ensemble strategies is most effective for high-variance base classifiers? Bagging, Boosting
Project #2
Outline
- Perceptrons: definition, perceptron learning rule, convergence
- Margin & max-margin classifiers
- Linear Support Vector Machines: formulation as an optimization problem; generalized Lagrangian and dual
- Non-linear Support Vector Machines (next class)
A simple linear classifier

Given a binary classification task {x_i, y_i}, i = 1:n, with y_i ∈ {-1, +1}, the perceptron (Rosenblatt, 1957) is a classifier of the form:
h_w(x) = sign(w^T x) = { +1 if w^T x ≥ 0; -1 otherwise }
The decision boundary is w^T x = 0. An example <x_i, y_i> is classified correctly if and only if y_i (w^T x_i) > 0.

[Figure: the perceptron as a network: inputs x_1 … x_m with weights w_1 … w_m, plus a constant input 1 with weight w_0, feeding a linear unit followed by a threshold that outputs y.]
Perceptron learning rule (Rosenblatt, 1957)

Consider the following procedure:
- Initialize w_j, j = 0:m, randomly.
- While any training examples remain incorrectly classified:
  - Loop through all misclassified examples x_i.
  - Perform the update: w ← w + α y_i x_i, where α is the learning rate (or step size).

Intuition: for misclassified positive examples, increase w^T x; for misclassified negative examples, reduce it.
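To make the procedure concrete, here is a minimal NumPy sketch of this update loop (the function name, data conventions, and default step size are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=1000):
    # X: (n, m) feature matrix (include a constant-1 column so w_0 acts as a bias)
    # y: (n,) labels in {-1, +1}
    w = 0.01 * np.random.randn(X.shape[1])    # small random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:          # misclassified (y_i w^T x_i not > 0)
                w += alpha * y_i * x_i        # the perceptron update
                mistakes += 1
        if mistakes == 0:                     # all examples classified correctly
            break
    return w
```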
Gradient-descent learning

The perceptron learning rule can be interpreted as a gradient descent procedure, with optimization criterion:
Err(w) = Σ_{i=1:n} { 0 if y_i w^T x_i ≥ 0; -y_i w^T x_i otherwise }

For correctly classified examples, the error is zero. For incorrectly classified examples, the error tells us by how much w^T x_i is on the wrong side of the decision boundary. The error is zero when all examples are classified correctly.
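This criterion is easy to compute directly; a small sketch, using the same illustrative data conventions as above:

```python
def perceptron_error(w, X, y):
    # Err(w) = sum over misclassified examples of -y_i * w^T x_i
    scores = y * (X @ w)              # y_i w^T x_i for every example
    return -scores[scores < 0].sum()  # zero when all examples are classified correctly
```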
Linear separability

The data is linearly separable if and only if there exists a w such that y_i w^T x_i > 0 for all examples, or equivalently, the 0-1 loss is zero for some set of parameters w.

[Figure: two 2-D scatter plots of + and - points; the left dataset is linearly separable, the right one is not.]
Perceptron convergence theorem

The basic theorem: if the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates.

Additional comments:
- The number of updates depends on the dataset, on the learning rate, and on the initial weights.
- If the data is not linearly separable, there will be oscillation (which can be detected automatically).
- Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.
Perceptron learning example

[Figure: a 2-D training set plotted with x1 on the horizontal axis and x2 on the vertical axis, both ranging from 0 to 1.]
Perceptron learning example

[Figure: the same 2-D data with the perceptron's current decision boundary.]
Weight as a combination of input vectors

Recall the perceptron learning rule: w ← w + α y_i x_i.

If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples:
w = Σ_{i=1:n} α_i y_i x_i
where α_i is the sum of the step sizes used for all updates applied to example i.

By the end of training, some examples may never have participated in an update, and so will have α_i = 0. This is called the dual representation of the classifier.
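A hedged sketch of this dual representation: the same perceptron loop as before, but tracking the per-example coefficients α_i instead of w directly (function name and conventions are again illustrative, and zero initial weights are assumed):

```python
def perceptron_train_dual(X, y, alpha=1.0, max_epochs=1000):
    # Dual perceptron: w is represented implicitly as sum_i a[i] * y_i * x_i
    n = X.shape[0]
    a = np.zeros(n)                      # a[i] = sum of step sizes applied to example i
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            w = (a * y) @ X              # reconstruct w = sum_j a[j] y_j x_j
            if y[i] * (w @ X[i]) <= 0:   # example i is misclassified
                a[i] += alpha
                mistakes += 1
        if mistakes == 0:
            break
    return a                             # a[i] == 0 for examples never used in an update
```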
Perceptron learning example

Examples used in updates (bold) and not used (faint). What do you notice?

[Figure: the same 2-D data, with the examples that participated in updates shown in bold.]
Perceptron learning example

Solutions are often non-unique. The solution depends on the set of instances and on the order in which examples are sampled for updates.

[Figure: one learned separating line; its equation depends on the particular run.]
A few comments on the Perceptron

Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule. The logistic function offers a smooth version of the perceptron.

Two issues:
- Solutions are non-unique.
- What about non-linearly separable data? (Topic for next class.) Perhaps the data can be linearly separated in a different feature space? Perhaps we can relax the criterion of separating all the data?
The non-uniqueness issue

Consider a linearly separable binary classification dataset. There are infinitely many hyperplanes that separate the classes. Which plane is best?

[Figure: two classes of points with several candidate separating lines drawn between them.]

Related question: for a given plane, for which points should we be most confident in the classification?
Linear Support Vector Machine (SVM)

A linear SVM is a perceptron for which we choose w such that the margin is maximized.

For a given separating hyperplane, the margin is twice the (Euclidean) distance from the hyperplane to the nearest training example, i.e. the width of the strip around the decision boundary that contains no training examples.

[Figure: two classes separated by a hyperplane, with the margin strip drawn around it.]
Distance to the decision boundary

Suppose we have a decision boundary that separates the data.

[Figure: a separating hyperplane with w^T x > 0 on the Class 1 side and w^T x < 0 on the Class 2 side.]

Assuming y_i ∈ {-1, +1}: confidence = y_i w^T x_i.
Distance to the decision boundary

Suppose we have a decision boundary that separates the data. Let γ_i be the distance from instance x_i to the decision boundary, and define the vector w to be the normal to the decision boundary.

[Figure: point x_i at distance γ_i from the boundary, its nearest boundary point x_i0, and the normal vector w.]
Distance to the decision boundary

How can we write γ_i in terms of x_i, y_i, and w?
- Let x_i0 be the point on the decision boundary nearest x_i.
- The vector from x_i0 to x_i is γ_i w/||w||, where γ_i is a scalar (the distance from x_i to x_i0) and w/||w|| is the unit normal.
- So we can write x_i0 = x_i - γ_i w/||w||.

Since x_i0 is on the decision boundary, we have:
w^T (x_i - γ_i w/||w||) = 0
Solving for γ_i yields, for a positive example:
γ_i = w^T x_i / ||w||
or, for examples of both classes:
γ_i = y_i w^T x_i / ||w||
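A quick numerical check of this formula (the numbers are made up purely for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])                        # normal to the boundary w^T x = 0; ||w|| = 5
x_i, y_i = np.array([2.0, 1.0]), +1
gamma_i = y_i * (w @ x_i) / np.linalg.norm(w)   # gamma_i = y_i w^T x_i / ||w||
print(gamma_i)                                  # (3*2 + 4*1) / 5 = 2.0
```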
Optimization

First suggestion:
Maximize M with respect to w, subject to y_i w^T x_i / ||w|| ≥ M for all i.

This is not very convenient for optimization:
- w appears nonlinearly in the constraints.
- The problem is underconstrained: if (w, M) is optimal, so is (βw, M) for any β > 0. Add a constraint: ||w|| M = 1.

Instead try:
Minimize ||w|| with respect to w, subject to y_i w^T x_i ≥ 1.
Final formulation

Let's minimize ½||w||² instead of ||w||. (Taking the square is a monotone transform, as ||w|| is positive, so it doesn't change the optimal solution. The ½ is for mathematical convenience.) This gets us to:
min_w ½||w||²   s.t.   y_i w^T x_i ≥ 1

This can be solved! How? It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient packages are available. Better yet, it's a convex (positive semidefinite) QP.
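As a hedged sketch of how this QP could be handed to an off-the-shelf solver (cvxopt here, chosen only for illustration; the slides do not prescribe a package): the constraint y_i w^T x_i ≥ 1 is rewritten as -y_i x_i^T w ≤ -1 to match the solver's Gw ≤ h convention.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_primal(X, y):
    # Hard-margin primal: min (1/2)||w||^2  s.t.  y_i w^T x_i >= 1
    # X: (n, m) float matrix; y: (n,) floats in {-1.0, +1.0}
    n, m = X.shape
    P = matrix(np.eye(m))              # quadratic term: (1/2) w^T I w
    q = matrix(np.zeros(m))            # no linear term in the objective
    G = matrix(-(y[:, None] * X))      # -y_i x_i^T w <= -1, stacked as G w <= h
    h = matrix(-np.ones(n))
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()  # the optimal weight vector w*
```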
Constrained optimization

[Picture from: http://www.cs.cmu.edu/~aarti/class/10701_spring14/]
!"+ Example -./.0''"*+)+.'#"&!%%1...-!./.!'#"+'*$,#!"&!"*!"%!")!"$!"(!"#!"'!!!"#!"$!"%!"& ',' We have a unique slutin, but n supprt vectrs yet. Recall the dual slutin fr the Perceptrn: Extend fr the margin case. 32
Lagrange multipliers

Consider the following optimization problem, called the primal:
min_w f(w)   s.t.   g_i(w) ≤ 0, i = 1…k

We define the generalized Lagrangian:
L(w, α) = f(w) + Σ_{i=1:k} α_i g_i(w)
where the α_i, i = 1…k, are the Lagrange multipliers.

[Figure: find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c. From: https://en.wikipedia.org/wiki/Lagrange_multiplier]
Lagrangian optimization

Consider P(w) = max_{α: α_i ≥ 0} L(w, α)   ("P" stands for "primal").

Observe that the following is true:
P(w) = { f(w) if all constraints are satisfied; +∞ otherwise }

Hence, instead of computing min_w f(w) subject to the original constraints, we can compute:
p* = min_w P(w) = min_w max_{α: α_i ≥ 0} L(w, α)   (primal)

Alternately, invert the max and min to get:
d* = max_{α: α_i ≥ 0} min_w L(w, α)   (dual)
Maximum-margin perceptron

We wanted to solve:
min_w ½||w||²   s.t.   y_i w^T x_i ≥ 1

The Lagrangian is:
L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i))

The primal problem is: min_w max_{α: α_i ≥ 0} L(w, α).
The dual problem is: max_{α: α_i ≥ 0} min_w L(w, α).
Dual optimization problem

Consider both solutions:
p* = min_w max_{α: α_i ≥ 0} L(w, α)   (primal)
d* = max_{α: α_i ≥ 0} min_w L(w, α)   (dual)

If f and the g_i are convex and the g_i can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*), where w* is the optimal weight vector (= primal solution) and α* is the optimal set of Lagrange multipliers (= dual solution).

For SVMs, we have a quadratic objective and linear constraints, so both f and the g_i are convex. For linearly separable data, all g_i can be satisfied simultaneously.

Note: w*, α* solve the primal and dual if and only if they satisfy the Karush-Kuhn-Tucker conditions (see suggested readings).
Solving the dual

Taking the derivative of L(w, α) w.r.t. w, setting it to 0, and solving for w:
L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i))
∂L/∂w = w - Σ_i α_i y_i x_i = 0   ⟹   w* = Σ_i α_i y_i x_i

Just as for the perceptron with zero initial weights, the optimal solution w* is a linear combination of the x_i.

Plugging this back into L, we get the dual:
max_α Σ_i α_i - ½ Σ_{i,j} y_i y_j α_i α_j (x_i · x_j)
with constraints α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming problem.

Complexity of solving a quadratic program? Polynomial time, O(v³), where v is the number of variables in the optimization; here v = n. Fast approximations exist.
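Continuing the earlier illustrative cvxopt sketch, the dual can be posed to the same solver; the solver minimizes, so the objective is negated (all names remain assumptions):

```python
def svm_dual(X, y):
    # Dual: max_a  sum_i a_i - (1/2) sum_ij y_i y_j a_i a_j (x_i . x_j)
    #       s.t.   a_i >= 0  and  sum_i a_i y_i = 0
    n = X.shape[0]
    K = X @ X.T                                      # Gram matrix of dot products
    P = matrix(np.outer(y, y) * K)                   # P_ij = y_i y_j (x_i . x_j)
    q = matrix(-np.ones(n))                          # maximizing sum a_i = minimizing -sum a_i
    G, h = matrix(-np.eye(n)), matrix(np.zeros(n))   # a_i >= 0
    A = matrix(y.reshape(1, -1).astype(float))       # sum_i a_i y_i = 0
    b = matrix(0.0)
    a = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (a * y) @ X                                  # recover w* = sum_i a_i y_i x_i
    return a, w
```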
The support vectors

Suppose we find the optimal α's (e.g. using a QP package). Constraint i is active when α_i > 0; this corresponds to the points for which (1 - y_i w^T x_i) = 0. These are the points lying on the edge of the margin. We call them support vectors; they define the decision boundary.

The output of the classifier for a query point x is computed as:
h_w(x) = sign( Σ_{i=1:n} α_i y_i (x_i · x) )
It is determined by computing the dot products of the query point with the support vectors.
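A short sketch of this prediction rule, reusing the α returned by the dual solver above (the tolerance used to filter numerically-zero multipliers is an assumption):

```python
def svm_predict(X_train, y_train, a, X_query, tol=1e-6):
    # h_w(x) = sign(sum_i a_i y_i (x_i . x)), summing over support vectors only
    sv = a > tol                                               # support vectors: a_i > 0
    scores = (a[sv] * y_train[sv]) @ (X_train[sv] @ X_query.T)
    return np.sign(scores)                                     # one label per query row
```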
Example

[Figure: the learned maximum-margin boundary; support vectors are shown in bold.]
What you should know

From today:
- The perceptron algorithm.
- The margin definition for linear SVMs.
- The use of Lagrange multipliers to transform optimization problems.
- The primal and dual optimization problems for SVMs.

After the next class:
- The non-linearly separable case.
- The feature-space version of SVMs.
- The kernel trick and examples of common kernels.