COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

1 COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines Instructor: Class web page: Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

2 Today's quiz: In the random forest approach proposed by Breiman, how many hyper-parameters need to be specified? 1, 2, 3, 4, 5. What is the complexity of each iteration of AdaBoost, assuming your weak learner is a decision stump and you have all binary variables? Let M be the number of features and N be the number of examples. O(M), O(N), O(MN), O(MN²). Which of the two ensemble strategies is most effective for high-variance base classifiers? Bagging, Boosting.

3 Project #2

4 Outline: Perceptrons (definition, perceptron learning rule, convergence); margin & max margin classifiers; linear Support Vector Machines (formulation as an optimization problem, generalized Lagrangian and dual); non-linear Support Vector Machines (next class).

5 A simple linear classifier. Given a binary classification task: {x_i, y_i}, i=1:n, y_i ∈ {-1, +1}. The perceptron (Rosenblatt, 1957) is a classifier of the form: h_w(x) = sign(w^T x) = {+1 if w^T x ≥ 0; -1 otherwise}. The decision boundary is w^T x = 0. An example <x_i, y_i> is classified correctly if and only if: y_i (w^T x_i) > 0. [Figure: perceptron diagram: inputs x_1, ..., x_m with weights w_1, ..., w_m, a constant input 1 with weight w_0, a linear combination followed by a threshold, and output y.]

6 Perceptron learning rule (Rosenblatt, 1957). Consider the following procedure: Initialize w_j, j=0:m randomly. While any training examples remain incorrectly classified: loop through all misclassified examples x_i and perform the update w ← w + α y_i x_i, where α is the learning rate (or step size). Intuition: for misclassified positive examples, increase w^T x, and reduce it for negative examples.
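As a concrete illustration, here is a minimal NumPy sketch of this procedure (the function and variable names are mine; the slides do not prescribe an implementation). A constant-1 feature is assumed appended to each x_i so that w_0 acts as the bias.

    import numpy as np

    # Illustrative sketch of the perceptron learning rule, not course-provided code.
    def perceptron_train(X, y, alpha=1.0, max_epochs=1000):
        w = 0.01 * np.random.randn(X.shape[1])       # small random initial weights
        for _ in range(max_epochs):
            misclassified = y * (X @ w) <= 0         # wrong side: y_i w^T x_i <= 0
            if not misclassified.any():              # every example classified correctly
                return w
            for i in np.where(misclassified)[0]:     # loop through misclassified examples
                w = w + alpha * y[i] * X[i]          # the update: w <- w + alpha y_i x_i
        return w                                     # may still be oscillating if not separable

    def perceptron_predict(X, w):
        return np.where(X @ w >= 0, 1, -1)           # h_w(x) = sign(w^T x)

On linearly separable data this terminates, by the convergence theorem below; otherwise max_epochs caps the oscillation.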

7-8 Gradient-descent learning. The perceptron learning rule can be interpreted as a gradient descent procedure, with optimization criterion: Err(w) = Σ_{i=1:n} { 0 if y_i w^T x_i ≥ 0; -y_i w^T x_i otherwise }. For correctly classified examples, the error is zero. For incorrectly classified examples, the error tells by how much w^T x_i is on the wrong side of the decision boundary. The error is zero when all examples are classified correctly.
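A small sketch of this criterion in NumPy (the function name is mine; the slides give only the formula). It also makes the gradient connection explicit: on a misclassified example, ∂Err/∂w = -y_i x_i, so a gradient-descent step of size α is exactly the perceptron update w ← w + α y_i x_i.

    import numpy as np

    # Illustrative sketch, not from the slides: the perceptron criterion Err(w).
    def perceptron_error(w, X, y):
        margins = y * (X @ w)                                 # y_i w^T x_i for every example
        return np.sum(np.where(margins >= 0, 0.0, -margins))  # only misclassified contribute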

9 Linear separability. The data is linearly separable if and only if there exists a w such that: for all examples, y_i w^T x_i > 0. Or equivalently, the 0-1 loss is zero for some set of parameters (w). [Figure: two 2D datasets on axes x_1 and x_2, one linearly separable and one not linearly separable.]
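Checking this condition for a candidate w is a one-liner (array names are hypothetical):

    import numpy as np

    # Illustrative sketch: w separates the data iff every y_i w^T x_i is positive.
    separates = bool(np.all(y * (X @ w) > 0))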

10-11 Perceptron convergence theorem. The basic theorem: if the perceptron learning rule is applied to a linearly separable dataset, a solution will be found after some finite number of updates. Additional comments: The number of updates depends on the dataset, on the learning rate, and on the initial weights. If the data is not linearly separable, there will be oscillation (which can be detected automatically). Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.

12 Perceptron learning example. [Figure: a 2D training set plotted on axes x_1 and x_2.]

13 Perceptron learning example. [Figure: the 2D training set with a learned decision boundary.]

14-15 Weight as a combination of input vectors. Recall the perceptron learning rule: w ← w + α y_i x_i. If the initial weights are zero, then at any step the weights are a linear combination of the feature vectors of the examples: w = Σ_{i=1:n} α_i y_i x_i, where α_i is the sum of the step sizes used for all updates applied to example i. By the end of training, some examples may never have participated in an update, and so will have α_i = 0. This is called the dual representation of the classifier.
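A hedged sketch of this dual bookkeeping (names are mine; a batch variant of the update is used for brevity): instead of updating w directly, accumulate a per-example coefficient α_i and reconstruct w from it.

    import numpy as np

    # Illustrative sketch: dual representation of the perceptron (zero initial weights).
    def perceptron_dual(X, y, alpha=1.0, max_epochs=1000):
        a = np.zeros(X.shape[0])             # a_i: total step size applied to example i
        for _ in range(max_epochs):
            w = (a * y) @ X                  # current w = sum_i a_i y_i x_i
            misclassified = y * (X @ w) <= 0
            if not misclassified.any():
                break
            a[misclassified] += alpha        # each update adds the step size to that a_i
        return a, (a * y) @ X

Examples with a_i = 0 at the end never participated in an update.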

16 Perceptron learning example. Examples used in updates (bold) and not used (faint). What do you notice? [Figure: the 2D training set with the examples that triggered updates shown in bold.]

17 Perceptron learning example. Solutions are often non-unique. The solution depends on the set of instances and the order of sampling in updates. [Figure: a different separating line found on the same dataset.]

18-20 A few comments on the Perceptron. Perceptrons can be learned to fit linearly separable data, using a gradient-descent rule. The logistic function offers a smooth version of the perceptron. Two issues: Solutions are non-unique. What about non-linearly separable data? (Topic for next class.) Perhaps data can be linearly separated in a different feature space? Perhaps we can relax the criterion of separating all the data?

21-22 The non-uniqueness issue. Consider a linearly separable binary classification dataset. There is an infinite number of hyper-planes that separate the classes: which plane is best? [Figure: two classes of points with several different separating lines drawn between them.] Related question: for a given plane, for which points should we be most confident in the classification?

23 Linear Support Vector Machine (SVM). A linear SVM is a perceptron for which we choose w such that the margin is maximized. For a given separating hyper-plane, the margin is twice the (Euclidean) distance from the hyper-plane to the nearest training example, i.e. the width of the strip around the decision boundary that contains no training examples. [Figure: a separating hyper-plane with an empty margin strip between the two classes.]

24 Distance to the decision boundary. Suppose we have a decision boundary that separates the data. [Figure: boundary with w^T x > 0 on the Class 1 side and w^T x < 0 on the Class 2 side.] Assuming y_i ∈ {-1, +1}, confidence = y_i w^T x_i.

25 Distance to the decision boundary. Suppose we have a decision boundary that separates the data. [Figure: example x_i at distance γ_i from the boundary, its projection x_i0 on the boundary, and the normal vector w.] Let γ_i be the distance from instance x_i to the decision boundary. Define the vector w to be the normal to the decision boundary.

26-27 Distance to the decision boundary. How can we write γ_i in terms of x_i, y_i, w? Let x_i0 be the point on the decision boundary nearest x_i. The vector from x_i0 to x_i is γ_i w/||w||: γ_i is a scalar (the distance from x_i to x_i0) and w/||w|| is the unit normal. So we can define x_i0 = x_i - γ_i w/||w||. As x_i0 is on the decision boundary, we have w^T (x_i - γ_i w/||w||) = 0. Solving for γ_i yields, for a positive example: γ_i = w^T x_i / ||w||, or for examples of both classes: γ_i = y_i w^T x_i / ||w||.
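In code this distance is a one-liner per example (hypothetical names); the smallest γ_i over the training set is the distance from the boundary to the nearest example, i.e. half the margin of slide 23:

    import numpy as np

    # Illustrative sketch: geometric distance of each example to the boundary.
    gamma = y * (X @ w) / np.linalg.norm(w)   # gamma_i = y_i w^T x_i / ||w||
    half_margin = gamma.min()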

28-29 Optimization. First suggestion: maximize M with respect to w, subject to y_i w^T x_i / ||w|| ≥ M for all i. This is not very convenient for optimization: w appears nonlinearly in the constraints, and the problem is underconstrained: if (w, M) is optimal, so is (βw, M) for any β > 0. Add a constraint: ||w|| M = 1. Instead try: minimize ||w|| with respect to w, subject to y_i w^T x_i ≥ 1.

30 Final formulation. Let's minimize ½||w||² instead of ||w||. (Taking the square is a monotone transform, as ||w|| is positive, so it doesn't change the optimal solution. The ½ is for mathematical convenience.) This gets us to: min ½||w||² w.r.t. w, s.t. y_i w^T x_i ≥ 1. This can be solved! How? It is a quadratic programming (QP) problem, a standard type of optimization problem for which many efficient packages are available. Better yet, it's a convex (positive semidefinite) QP.
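As an illustration, here is a minimal sketch of this primal QP using the cvxopt package (one such package; the function name and setup are mine, not the course's). A bias term can be absorbed by appending a constant-1 feature to each x_i.

    import numpy as np
    from cvxopt import matrix, solvers

    # Illustrative sketch: hard-margin primal, min (1/2)||w||^2 s.t. y_i w^T x_i >= 1.
    def svm_primal(X, y):
        n, m = X.shape
        y = y.astype(float)
        P = matrix(np.eye(m))              # quadratic term: (1/2) w^T I w = (1/2)||w||^2
        q = matrix(np.zeros(m))            # no linear term in the objective
        G = matrix(-(y[:, None] * X))      # -y_i x_i^T w <= -1  <=>  y_i w^T x_i >= 1
        h = matrix(-np.ones(n))
        sol = solvers.qp(P, q, G, h)       # cvxopt solves min (1/2)z^T P z + q^T z, Gz <= h
        return np.array(sol['x']).ravel()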

31 Constrained optimization. [Figure: illustration of a constrained optimization problem; source link not captured in the transcription.]

32 !"+ Example -./.0''"*+)+.'#"&!%%1...-!./.!'#"+'*$,#!"&!"*!"%!")!"$!"(!"#!"'!!!"#!"$!"%!"& ',' We have a unique slutin, but n supprt vectrs yet. Recall the dual slutin fr the Perceptrn: Extend fr the margin case. 32

33 Lagrange multipliers. Consider the following optimization problem, called the primal: min_w f(w) s.t. g_i(w) ≤ 0, i=1...k. We define the generalized Lagrangian: L(w, α) = f(w) + Σ_{i=1:k} α_i g_i(w), where the α_i, i=1...k, are the Lagrange multipliers. [Figure: find x and y to maximize f(x, y) subject to a constraint (shown in red) g(x, y) = c; source link not captured in the transcription.]

34 Lagrangian optimization. Consider P(w) = max_{α: α_i ≥ 0} L(w, α) (P stands for "primal"). Observe that the following is true: P(w) = { f(w), if all constraints are satisfied; +∞, otherwise }. Hence, instead of computing min_w f(w) subject to the original constraints, we can compute: p* = min_w P(w) = min_w max_{α: α_i ≥ 0} L(w, α) (primal). Alternately, invert max and min to get: d* = max_{α: α_i ≥ 0} min_w L(w, α) (dual).

35 Maximum margin perceptron. We wanted to solve: min ½||w||² w.r.t. w, s.t. y_i w^T x_i ≥ 1. The Lagrangian is: L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i)). The primal problem is: min_w max_{α: α_i ≥ 0} L(w, α). The dual problem is: max_{α: α_i ≥ 0} min_w L(w, α).

36 Dual optimization problem. Consider both solutions: p* = min_w max_{α: α_i ≥ 0} L(w, α) (primal), d* = max_{α: α_i ≥ 0} min_w L(w, α) (dual). If f and the g_i are convex and the g_i can all be satisfied simultaneously for some w, then we have equality: d* = p* = L(w*, α*), where w* is the optimal weight vector (= primal solution) and α* is the optimal set of Lagrange multipliers (= dual solution). For SVMs, we have a quadratic objective and linear constraints, so both f and the g_i are convex. For linearly separable data, all g_i can be satisfied simultaneously. Note: w*, α* solve the primal and dual if and only if they satisfy the Karush-Kuhn-Tucker conditions (see suggested readings).

37 Solving the dual. Taking derivatives of L(w, α) with respect to w, setting to 0, and solving for w: L(w, α) = ½||w||² + Σ_i α_i (1 - y_i (w^T x_i)), so ∂L/∂w = w - Σ_i α_i y_i x_i = 0, giving w* = Σ_i α_i y_i x_i. Just like for the perceptron with zero initial weights, the optimal solution w* is a linear combination of the x_i. Plugging this back into L we get the dual: max_α Σ_i α_i - ½ Σ_{i,j} y_i y_j α_i α_j (x_i · x_j), with constraints α_i ≥ 0 and Σ_i α_i y_i = 0. This is a quadratic programming problem. Complexity of solving the quadratic program? Polynomial time, O(v³), where v is the number of variables in the optimization; here v = n. Fast approximations exist.
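Mirroring the primal sketch above, here is a hedged sketch of this dual QP with cvxopt (names and setup are mine). The constraint Σ_i α_i y_i = 0 enters as the QP's equality constraint.

    import numpy as np
    from cvxopt import matrix, solvers

    # Illustrative sketch: dual, max_a sum_i a_i - (1/2) sum_ij y_i y_j a_i a_j (x_i . x_j),
    # s.t. a_i >= 0 and sum_i a_i y_i = 0, posed as the minimization cvxopt expects.
    def svm_dual(X, y):
        n = X.shape[0]
        y = y.astype(float)
        P = matrix(np.outer(y, y) * (X @ X.T))           # y_i y_j (x_i . x_j)
        q = matrix(-np.ones(n))                          # minimizing -sum_i a_i
        G, h = matrix(-np.eye(n)), matrix(np.zeros(n))   # -a_i <= 0, i.e. a_i >= 0
        A, b = matrix(y.reshape(1, -1)), matrix(0.0)     # sum_i a_i y_i = 0
        alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
        return alpha, (alpha * y) @ X                    # also recover w* = sum_i a_i y_i x_i

In practice a tiny ridge on P (e.g. + 1e-8 * np.eye(n)) helps numerical stability, since P is only positive semidefinite.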

38 The support vectors. Suppose we find the optimal α's (e.g. using a QP package). Constraint i is active when α_i > 0. This corresponds to the points for which (1 - y_i w^T x_i) = 0, i.e. the points lying on the edge of the margin. We call them support vectors. They define the decision boundary. The output of the classifier for a query point x is computed as: h_w(x) = sign(Σ_{i=1:n} α_i y_i (x_i · x)). It is determined by computing the dot products of the query point with the support vectors.
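A short sketch of this prediction rule (hypothetical names, continuing from the dual sketch above): since α_i = 0 for every non-support vector, only the support vectors enter the sum.

    import numpy as np

    # Illustrative sketch: classify a query x using only the support vectors.
    def svm_predict(x, alpha, X, y, tol=1e-8):
        sv = alpha > tol                              # support vectors: alpha_i > 0
        score = (alpha[sv] * y[sv]) @ (X[sv] @ x)     # sum_i alpha_i y_i (x_i . x)
        return np.sign(score)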

39 Example. [Figure: the max margin solution on the example dataset; support vectors are in bold.]

40 What you should know. From today: The perceptron algorithm. The margin definition for linear SVMs. The use of Lagrange multipliers to transform optimization problems. The primal and dual optimization problems for SVMs. After the next class: The non-linearly separable case. The feature-space version of SVMs. The kernel trick and examples of common kernels.
