UVA CS / Introduction to Machine Learning and Data Mining



Lecture 11: Classification with Support Vector Machine (Review + Practical Guide)
Yanjun Qi / Jane
University of Virginia, Department of Computer Science
9/6/14

Where are we? Five major sections of this course
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models


Where are we? Three major sections for classification
We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors

A dataset for binary classification
- Output as binary class label: 1 or -1
- Data/points/instances/examples/samples/records: [rows]
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
- Target/outcome/response/label/dependent variable: special column to be predicted [last column]

Today: Review & Practical Guide
Support Vector Machine (SVM)
- Large margin linear classifier
- Define margin (M) in terms of model parameters
- Optimization to learn model parameters (w, b)
- Non-linearly separable case
- Optimization with dual form
- Nonlinear decision boundary
- Practical guide

Max margin classifiers
- Instead of fitting all points, focus on boundary points
- Learn a boundary that leads to the largest margin from points on both sides
Why?
- Intuitive, makes sense
- Some theoretical support
- Works well in practice


Linearly separable case
The decision boundary should be as far away from the data of both classes as possible.
w is a p-dimensional vector; b is a scalar.
[Figure: Class 1 and Class -1 on either side of the decision boundary]

Maximizing the margin: observation 1
Observation 1: the vector w is orthogonal to the +1 plane.
[Figure: the planes $w^T x + b = +1$, $w^T x + b = 0$ and $w^T x + b = -1$ between Class 1 and Class -1, with margin M]

Maximizing the margin: observation 2
- Classify as +1 if $w^T x + b \ge 1$
- Classify as -1 if $w^T x + b \le -1$
- Undefined if $-1 < w^T x + b < 1$
Observation 1: the vector w is orthogonal to the +1 and -1 planes.
Observation 2: if $x^+$ is a point on the +1 plane and $x^-$ is the closest point to $x^+$ on the -1 plane, then $x^+ = \lambda w + x^-$.
Since w is orthogonal to both planes, we need to travel some distance along w to get from $x^+$ to $x^-$.

Putting it together
[Figure: the planes $w^T x + b = +1$, $w^T x + b = 0$ and $w^T x + b = -1$, with margin M between "predict class +1" and "predict class -1"]
What we know:
$w^T x^+ + b = +1$
$w^T x^- + b = -1$
$x^+ = \lambda w + x^-$
$\lVert x^+ - x^- \rVert = M$

We can now define M in terms of w and b. First solve for $\lambda$:
$w^T x^+ + b = +1$
$\Rightarrow w^T (\lambda w + x^-) + b = +1$
$\Rightarrow w^T x^- + b + \lambda w^T w = +1$
$\Rightarrow -1 + \lambda w^T w = +1$
$\Rightarrow \lambda = \frac{2}{w^T w}$

Then:
$M = \lVert x^+ - x^- \rVert = \lVert \lambda w \rVert = \lambda \lVert w \rVert = \lambda \sqrt{w^T w} = \frac{2 \sqrt{w^T w}}{w^T w} = \frac{2}{\sqrt{w^T w}}$
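The margin derivation above can be checked numerically. The following is a minimal sketch in plain Python; the values of `w`, `b` and `x_minus` are hypothetical and chosen only so that `x_minus` lies on the -1 plane.

```python
import math

def margin_width(w):
    """Return M = 2 / sqrt(w^T w) for a weight vector w."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def step_to_plus_plane(w, x_minus):
    """Move from a point on the -1 plane along w by lambda = 2 / (w^T w)."""
    lam = 2.0 / sum(wi * wi for wi in w)
    return [xm + lam * wi for xm, wi in zip(x_minus, w)]

# Hypothetical parameters (not from the lecture).
w, b = [3.0, 4.0], -2.0
x_minus = [-1.0, 1.0]  # w.x + b = -3 + 4 - 2 = -1, so this point is on the -1 plane

x_plus = step_to_plus_plane(w, x_minus)
score_plus = sum(wi * xi for wi, xi in zip(w, x_plus)) + b  # should be +1
distance = math.dist(x_plus, x_minus)                       # should equal M

print(score_plus, distance, margin_width(w))
```

Both `distance` and `margin_width(w)` come out as 2 / ||w||, which is why maximizing M is the same as minimizing $w^T w$.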


Optimization step, i.e. learning the optimal parameters for the SVM
[Figure: the planes $w^T x + b = +1$, $w^T x + b = 0$ and $w^T x + b = -1$, with margin $M = \frac{2}{\sqrt{w^T w}}$]
$\min \frac{w^T w}{2}$
subject to the following constraints:
- For all x in class +1: $w^T x + b \ge 1$
- For all x in class -1: $w^T x + b \le -1$
This is a total of n constraints if we have n input samples.

Equivalently:
$\arg\min_{w,b} \sum_{i=1}^{p} w_i^2$
subject to $\forall x_i \in D_{train}: y_i (x_i \cdot w + b) \ge 1$

SVM as a QP problem
A general quadratic program:
$\min_u \frac{u^T R u}{2} + d^T u + c$
subject to n inequality constraints:
$a_{11} u_1 + a_{12} u_2 + \dots \le b_1$
...
$a_{n1} u_1 + a_{n2} u_2 + \dots \le b_n$
and k equality constraints:
$a_{n+1,1} u_1 + a_{n+1,2} u_2 + \dots = b_{n+1}$
...
$a_{n+k,1} u_1 + a_{n+k,2} u_2 + \dots = b_{n+k}$

For the SVM, take R as the identity matrix, d as the zero vector, and c as the value 0:
$\min \frac{w^T w}{2}$
subject to the following inequality constraints:
- For all x in class +1: $w^T x + b \ge 1$
- For all x in class -1: $w^T x + b \le -1$
This is a total of n constraints if we have n input samples.

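Non-linearly separable case
Instead of minimizing the number of misclassified points, we can minimize the (relative) distance between these points and their correct plane.
[Figure: the +1 plane and the -1 plane, with slack distances $\varepsilon_k$ and $\varepsilon_j$ for points on the wrong side]
The new optimization problem is:
$\min_w \frac{w^T w}{2} + \sum_{i=1}^{n} C \varepsilon_i$
subject to the following inequality constraints:
- For all $x_i$ in class +1: $w^T x_i + b \ge 1 - \varepsilon_i$
- For all $x_i$ in class -1: $w^T x_i + b \le -1 + \varepsilon_i$
  (a total of n constraints)
- For all i: $\varepsilon_i \ge 0$
  (another n constraints)

Where we are
Two optimization problems, for the separable and non-separable cases:
- Separable: $\min_w \frac{w^T w}{2}$, subject to $w^T x + b \ge 1$ for all x in class +1 and $w^T x + b \le -1$ for all x in class -1
- Non-separable: $\min_w \frac{w^T w}{2} + \sum_{i=1}^{n} C \varepsilon_i$, subject to $w^T x_i + b \ge 1 - \varepsilon_i$ for all $x_i$ in class +1, $w^T x_i + b \le -1 + \varepsilon_i$ for all $x_i$ in class -1, and $\varepsilon_i \ge 0$ for all i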
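As a rough illustration of the soft-margin objective, the sketch below computes the smallest slack that satisfies both constraints for each sample, $\varepsilon_i = \max(0, 1 - y_i (w^T x_i + b))$, and then evaluates $\frac{w^T w}{2} + C \sum_i \varepsilon_i$. The data, `w`, `b` and `C` are hypothetical; nothing is optimized here, the code only evaluates the objective for a given candidate.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest eps_i >= 0 with y_i * (w.x_i + b) >= 1 - eps_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """(w^T w) / 2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical toy data: the last point sits on the wrong side.
X = [[2.0, 2.0], [0.5, 0.0], [-2.0, -1.0], [0.5, 0.5]]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], 0.0

eps = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(eps, objective)
```

A point that is correctly classified and outside the margin gets zero slack, a point inside the margin gets a slack between 0 and 1, and a misclassified point gets a slack above 1. A larger C penalizes those violations more heavily.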

Where we are
Instead of solving these two QPs (separable and non-separable) directly, we will solve a dual formulation of the SVM optimization problem.
The main reason for switching to this type of representation is that it allows us to use a neat trick that will make our lives easier (and the run time faster).

Optimization review: constrained optimization
$\min_u u^2$ s.t. $u \ge b$
[Figure: two cases for the position of b, each marking the allowed minimum and the global minimum]

Optimization review: constrained optimization with Lagrange
- With equality constraints: optimize f(x), subject to $g_i(x) = 0$
- Method of Lagrange multipliers: convert to a higher-dimensional problem
- Minimize $f(x) + \sum_i \lambda_i g_i(x)$ w.r.t. $(x_1, \dots, x_n; \lambda_1, \dots, \lambda_k)$
- Introduce a Lagrange multiplier for each constraint
- Construct the Lagrangian for the original optimization problem


Optimization review: dual problem
Using the dual problem:
- Constrained optimization → unconstrained optimization
- Need to change maximization to minimization
- Only valid when the original optimization problem is convex/concave (strong duality)
Primal problem: $x^* = \arg\max_x f(x)$ subject to $g(x) = c$
Dual problem: $\lambda^* = \arg\min_\lambda l(\lambda)$, where $l(\lambda) = \sup_x \left( f(x) + \lambda (g(x) - c) \right)$
x* = λ* when convex/concave.

An alternative (dual) representation for the SVM QP
Here α is the Lagrange multiplier variable. We will start with the linearly separable case:
$\min \frac{w^T w}{2}$ subject to $(w^T x_i + b) y_i \ge 1$
Instead of encoding the correct classification rule as a constraint, we will use Lagrange multipliers to encode it as part of our minimization problem.
Recall that Lagrange multipliers can be applied to turn the following problem:
$\min_x x^2$ s.t. $x \ge b$
into the Lagrangian $x^2 + \alpha (b - x)$ with $\alpha \ge 0$ and $b - x \le 0$, i.e.
$\min_x \max_\alpha x^2 - \alpha (x - b)$ s.t. $\alpha \ge 0$
[Figure: the allowed minimum and the global minimum relative to b]
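To make the one-dimensional example concrete, here is a short worked derivation. It is a sketch that assumes strong duality holds (the problem is convex), so the min and max can be swapped; the symbol $g(\alpha)$ for the dual function is introduced here for illustration and is not from the slides.

```latex
\begin{align*}
L(x, \alpha) &= x^2 - \alpha (x - b), \qquad \alpha \ge 0 \\
\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;&\Rightarrow\; x = \frac{\alpha}{2} \\
g(\alpha) = \min_x L(x, \alpha) &= \frac{\alpha^2}{4} - \frac{\alpha^2}{2} + \alpha b
  = -\frac{\alpha^2}{4} + \alpha b \\
\frac{d g}{d \alpha} = -\frac{\alpha}{2} + b = 0 \;&\Rightarrow\; \alpha^* = \max(0,\, 2b) \\
x^* = \frac{\alpha^*}{2} &= \max(0,\, b)
\end{align*}
```

When $b > 0$ the constraint is active: $\alpha^* = 2b > 0$ and the allowed minimum sits at $x^* = b$. When $b \le 0$ the constraint is inactive: $\alpha^* = 0$ and the allowed minimum coincides with the global minimum at $x^* = 0$. The same pattern (a multiplier is nonzero only for active constraints) is what makes most $\alpha_i$ vanish in the SVM dual.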

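Lagrange multipliers for SVMs: linearly separable case
Original formulation:
$\min \frac{w^T w}{2}$ subject to $(w^T x_i + b) y_i \ge 1$
Dual formulation:
$\min_{w,b} \max_\alpha \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b) y_i - 1 \right]$, with $\alpha_i \ge 0 \; \forall i$
Using this new formulation we can derive w and b by setting the partial derivatives w.r.t. w and α to 0, leading to:
$w = \sum_i \alpha_i x_i y_i$
$b = y_i - w^T x_i$ for i s.t. $\alpha_i > 0$
Finally, taking the derivative w.r.t. b we get:
$\sum_i \alpha_i y_i = 0$

A geometrical interpretation
[Figure: ten training points labeled with their multipliers; $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and $\alpha_2 = \alpha_3 = \alpha_4 = \alpha_5 = \alpha_7 = \alpha_9 = \alpha_{10} = 0$]
$w = \sum_i \alpha_i x_i y_i$
Points whose $\alpha_i$ is 0 have no influence on w.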

Dual SVM for the linearly separable case
Substituting w into our target function and using the additional constraint, we get the dual formulation:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0 \; \forall i$
This is easier than the original QP; a QP solver can be used to find the $\alpha_i$.

To evaluate a new sample $x_j$ we need to compute:
$w^T x_j + b = \sum_i \alpha_i y_i x_i^T x_j + b$
- The dual target function needs the dot products among all training samples.
- Evaluation needs the dot product of the test sample with all training samples.
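The dual quantities above can be traced on a two-point toy problem. This is a sketch under stated assumptions: the data set is hypothetical, and the multipliers `alpha = [0.25, 0.25]` were worked out by hand for it (with $\alpha_1 = \alpha_2 = a$ the dual objective is $2a - 4a^2$, maximized at $a = 0.25$) rather than found by a QP solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j x_i.x_j"""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def recover_w_b(alpha, X, y):
    """w = sum_i alpha_i y_i x_i; b = y_i - w.x_i for some i with alpha_i > 0."""
    dims = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(dims)]
    sv = next(i for i, a in enumerate(alpha) if a > 0)
    return w, y[sv] - dot(w, X[sv])

def decision(alpha, X, y, b, x_new):
    """sum_i alpha_i y_i x_i.x_new + b, using only dot products with training points."""
    return sum(a * yi * dot(xi, x_new) for a, yi, xi in zip(alpha, y, X)) + b

# Hypothetical two-point training set and its hand-derived multipliers.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]

w, b = recover_w_b(alpha, X, y)
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * dot(w, w)
print(w, b, dual_value, primal_value, decision(alpha, X, y, b, [2.0, 0.0]))
```

For this toy set the primal and dual objective values agree, which is the strong-duality property the lecture relies on, and the decision value is computed without ever forming w explicitly.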


Dual formulation for the non-linearly separable case
Dual target function:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C \; \forall i$
To evaluate a new sample $x_j$ we need to compute:
$w^T x_j + b = \sum_i \alpha_i y_i x_i^T x_j + b$
- This is very similar to the optimization problem in the linearly separable case; the only difference is that the $\alpha_i$ are now bounded above by C.
- The hyperparameter C should be tuned through k-fold cross-validation.
- Once again, a QP solver can be used to find the $\alpha_i$.

Classifying in 1-d
Can an SVM correctly classify this data? What about this?
[Figure: two sets of labeled points along the x axis]
And now? (Extend with a polynomial basis.)
[Figure: the same points plotted against x and $x^2$]
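The 1-d question can be illustrated with a small sketch. The points and labels below are hypothetical (the slide's figure is not available), and the separating parameters in the mapped space are picked by hand rather than learned by an SVM. The map `phi(x) = (x, x**2)` is the simplest polynomial basis extension.

```python
def threshold_separable(xs, ys):
    """In 1-d a linear classifier is a threshold, so the classes are separable
    only if the label changes at most once along the sorted line."""
    labels = [label for _, label in sorted(zip(xs, ys))]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1

def phi(x):
    """Polynomial basis extension: x -> (x, x^2)."""
    return (x, x * x)

def linearly_separates(w, b, points, ys):
    """True if sign(w.p + b) matches the label for every point."""
    return all(y * (w[0] * p[0] + w[1] * p[1] + b) > 0 for p, y in zip(points, ys))

# Hypothetical data: positives on the outside, negatives in the middle.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

separable_1d = threshold_separable(xs, ys)
mapped = [phi(x) for x in xs]
# Hand-picked boundary x^2 = 2.5 in the mapped space (an assumption, not a fitted SVM).
separable_2d = linearly_separates((0.0, 1.0), -2.5, mapped, ys)
print(separable_1d, separable_2d)
```

No single threshold works on the original axis, but a horizontal line in the (x, x²) plane separates the two classes.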

RECAP: Polynomial regression
Introduce basis functions. (Dr. Nando de Freitas's tutorial slide)

Non-linear SVMs: 2D
The original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
$x = (x_1, x_2)$
$\varphi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)$
$\Phi: x \to \varphi(x)$
[Figure: points in the $(x_1, x_2)$ plane mapped into the three-dimensional feature space]
This slide is courtesy of

If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N-1 dimensions or more.

A little bit of theory: Vapnik-Chervonenkis (VC) dimension
- The VC dimension of the set of oriented lines in $R^2$ is 3.
- It can be shown that the VC dimension of the family of oriented separating hyperplanes in $R^{N-1}$ is at least N.

Transformation of inputs
Possible problems:
- High computation burden due to high dimensionality
- Many more parameters
SVM solves these two issues simultaneously:
- Kernel tricks for efficient computation
- The dual formulation only assigns parameters to samples, not features
[Figure: each point in the input space is mapped by φ(.) to a point in the feature space]

Kernel tricks for efficient computation, e.g. quadratic kernels
While working in higher dimensions is beneficial, it also increases our running time because of the dot product computation. However, there is a neat trick we can use.
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \Phi(x_i)^T \Phi(x_j)$
subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0 \; \forall i$
Consider all quadratic terms for $x_1, x_2, \dots, x_m$, where m is the number of features in each vector:
$\Phi(x) = (1, \sqrt{2} x_1, \dots, \sqrt{2} x_m, x_1^2, \dots, x_m^2, \sqrt{2} x_1 x_2, \dots, \sqrt{2} x_{m-1} x_m)^T$
- m+1 linear terms: $1, \sqrt{2} x_1, \dots, \sqrt{2} x_m$
- m quadratic terms: $x_1^2, \dots, x_m^2$
- m(m-1)/2 pairwise terms: $\sqrt{2} x_1 x_2, \dots, \sqrt{2} x_{m-1} x_m$
The $\sqrt{2}$ term will become clear in the next slide.
$K(x, z) := \Phi(x)^T \Phi(z)$

Dot product for quadratic kernels
How many operations do we need for the dot product?
$\Phi(x)^T \Phi(z) = \sum_i 2 x_i z_i + \sum_i x_i^2 z_i^2 + \sum_i \sum_{j=i+1} 2 x_i x_j z_i z_j + 1$
The three sums take m, m and m(m-1)/2 operations respectively, so roughly $m^2$ in total.

The kernel trick
However, we can obtain dramatic savings by noting that
$(x^T z + 1)^2 = (x \cdot z)^2 + 2 (x \cdot z) + 1$
$= \left( \sum_i x_i z_i \right)^2 + \sum_i 2 x_i z_i + 1$
$= \sum_i x_i^2 z_i^2 + \sum_i \sum_{j=i+1} 2 x_i x_j z_i z_j + \sum_i 2 x_i z_i + 1$
$= \Phi(x)^T \Phi(z)$
We only need m operations!
So, if we define the kernel function as follows, there is no need to carry out φ(.) explicitly:
$K(x, z) = (x^T z + 1)^2$
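The identity $\Phi(x)^T \Phi(z) = (x^T z + 1)^2$ can be checked numerically. The sketch below builds the explicit quadratic feature map, assuming the $\sqrt{2}$ scaling on the linear and pairwise terms that the expansion requires, and compares it with the closed-form kernel. The vectors `x` and `z` are arbitrary illustrative values.

```python
import math
from itertools import combinations

def phi(x):
    """Explicit quadratic feature map: constant, scaled linear terms,
    squares, and scaled pairwise products."""
    r2 = math.sqrt(2.0)
    return ([1.0]
            + [r2 * xi for xi in x]
            + [xi * xi for xi in x]
            + [r2 * xi * xj for xi, xj in combinations(x, 2)])

def quadratic_kernel(x, z):
    """K(x, z) = (x.z + 1)^2, computed with a single m-term dot product."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2

# Arbitrary illustrative vectors with m = 3 features.
x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # about m^2 / 2 multiplications
implicit = quadratic_kernel(x, z)                      # m multiplications
print(len(phi(x)), explicit, implicit)
```

For m = 3 the explicit map already has 10 components, while the kernel needs only the 3-term dot product; the gap grows quadratically with m.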

Where we are
$K(x, z) := \Phi(x)^T \Phi(z)$, with $K(x, z) = (x^T z + 1)^2$ for the quadratic kernel.
Our dual target function:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0 \; \forall i$
This takes $mn^2$ operations at each iteration.
To evaluate a new sample $x_j$ we need to compute:
$w^T \Phi(x_j) + b = \sum_i \alpha_i y_i K(x_i, x_j) + b$
This takes mr operations, where r is the number of support vectors ($\alpha_i > 0$).

More examples of kernel functions
[Figure: further kernel functions, from http://]
- Never represent features explicitly
- Compute dot products in closed form
- Very interesting theory: Reproducing Kernel Hilbert Spaces (not covered in detail here)

Why do SVMs work?
If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
- The number of parameters remains the same (and most are set to 0).
- While we have a lot of input values, in the end we only care about the support vectors, and these are usually a small group of samples.
- The minimization (or the maximizing of the margin) function acts as a sort of regularization term, leading to reduced overfitting.

Software
- A list of SVM implementations is available online.
- Some implementations (such as LIBSVM) can handle multi-class classification.
- SVMLight is among the earliest implementations of SVM.
- Several Matlab toolboxes for SVM are also available.

Practical guide to SVM
From the authors of LIBSVM: A Practical Guide to Support Vector Classification, by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, http:// guide.pdf

LIBSVM
http://
- Developed by Chih-Jen Lin et al.
- Tools for support vector classification
- Also supports multi-class classification
- C++/Java/Python/Matlab/Perl wrappers
- Linux/UNIX/Windows
- SMO implementation, fast!

(a) Data file formats for LIBSVM (from A Practical Guide to Support Vector Classification)
[Example files Training.dat and Testing.dat: one sample per line, a class label such as +1 followed by index:value feature pairs]
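The data file layout can be made concrete with a small reader. The sketch below assumes the usual LIBSVM sparse text format, `<label> <index>:<value> <index>:<value> ...` with 1-based feature indices and omitted entries treated as zero. The sample line is made up for illustration, and `parse_libsvm_line` and `to_dense` are hypothetical helper names, not part of LIBSVM itself.

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

def to_dense(features, n_features):
    """Expand a sparse {index: value} map (1-based) into a dense list."""
    return [features.get(i, 0.0) for i in range(1, n_features + 1)]

# Made-up sample line; feature 2 and feature 5 are omitted, i.e. zero.
sample = "+1 1:0.7 3:1 4:-0.5"
label, feats = parse_libsvm_line(sample)
dense = to_dense(feats, 5)
print(label, feats, dense)
```

Keeping the features sparse matters for data sets with many zero entries, which is why the format only lists nonzero index:value pairs.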

(b) Feature preprocessing (from A Practical Guide to Support Vector Classification)
(1) Categorical features
- Recommend using m numbers to represent an m-category attribute.
- Only one of the m numbers is one, and the others are zero.
- For example, a three-category attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
(2) Scaling before applying SVM is very important
- to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges
- to avoid numerical difficulties during the calculation
- Recommend linearly scaling each attribute to the range [-1, +1] or [0, 1].
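Both preprocessing steps are easy to sketch in plain Python. The helper names `one_hot` and `fit_scaler` are hypothetical, and the numeric values are illustrative. The scaler is fitted on the training column and then reused on new values, following the guide's advice to scale training and testing data the same way.

```python
def one_hot(value, categories):
    """Represent an m-category attribute with m numbers, exactly one of them 1."""
    return [1 if value == c else 0 for c in categories]

def fit_scaler(column, lo=-1.0, hi=1.0):
    """Return a function that linearly maps [min(column), max(column)] to [lo, hi]."""
    cmin, cmax = min(column), max(column)
    def scale(v):
        if cmax == cmin:          # constant column: nothing to spread out
            return lo
        return lo + (hi - lo) * (v - cmin) / (cmax - cmin)
    return scale

# Category order chosen so that red -> (0,0,1), green -> (0,1,0), blue -> (1,0,0).
categories = ["blue", "green", "red"]
red, green, blue = (one_hot(c, categories) for c in ("red", "green", "blue"))

train_column = [10.0, 20.0, 30.0]   # illustrative training values
scale = fit_scaler(train_column)
scaled_train = [scale(v) for v in train_column]
scaled_test_value = scale(40.0)     # a test value may fall outside [-1, +1]
print(red, green, blue, scaled_train, scaled_test_value)
```

The position assigned to each category is arbitrary; what matters is that the same assignment and the same scaling factors are applied to every data set.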


Feature preprocessing
(3) Missing values: very, very tricky!
- Easy way: substitute the missing values with the mean value of the variable
- A little bit harder way: imputation using nearest neighbors
- Even more complex: e.g. EM-based (beyond the scope)
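The "easy way" can be sketched in a few lines. This assumes missing entries are marked with `None` (an illustrative convention, not something the lecture specifies) and that the column has at least one observed value.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

column = [1.0, None, 3.0, None, 5.0]   # illustrative feature column
filled = impute_mean(column)
print(filled)
```

As with scaling, the mean would normally be computed on the training data and reused for the test data.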

(c) Model selection
RECAP: Overfitting and underfitting in regression models
$y = \theta_0 + \theta_1 x$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
$y = \sum_{j=0}^{5} \theta_j x^j$
Generalisation: learn a function / hypothesis from past data in order to explain, predict, model or control new data examples.
K-fold cross-validation!

(c) Model selection (e.g. for a linear kernel)
Select the right penalty parameter C.

(c) Model selection
Three parameters for a polynomial kernel. (From A Practical Guide to Support Vector Classification)


(d) Pipeline procedures
(1) train / test
(2) k-fold cross-validation
(3) k-fold CV on the training set to choose hyperparameters, then test

Evaluation Choice-I: Train and Test
- The training dataset consists of input-output pairs.
- Evaluation: compute f(x?) for each held-out input x?.
- Measure the loss on each pair (f(x?), y?).
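Procedures (2) and (3) can be sketched with index bookkeeping alone. The code below is a minimal illustration, not a full pipeline: `k_fold_indices` shuffles the sample indices and yields train/validation index lists, and `choose_hyperparameter` picks the candidate with the best mean validation score. The `score_fn` callback is hypothetical; in practice it would train an SVM on the training indices and return the accuracy on the validation indices.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, split them into k folds, and yield
    (train_indices, validation_indices) once per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def choose_hyperparameter(candidates, n, k, score_fn):
    """Return the candidate with the highest mean score over the k folds."""
    best, best_score = None, float("-inf")
    for c in candidates:
        scores = [score_fn(c, train, val) for train, val in k_fold_indices(n, k)]
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best, best_score = c, mean_score
    return best

splits = list(k_fold_indices(10, 5))

# Stand-in scorer (an assumption for illustration): pretends C = 1.0 is best.
def dummy_score(c, train, val):
    return -abs(c - 1.0)

best_c = choose_hyperparameter([0.1, 1.0, 10.0], n=10, k=5, score_fn=dummy_score)
print(len(splits), best_c)
```

For procedure (3), the held-out test set stays outside this loop: only the training portion is passed through the folds, and the test set is used once, after the hyperparameter has been chosen.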

Evaluation Choice-II: Cross-validation
Problem: we don't have enough data to set aside a test set.
Solution: each data point is used both as train and test.
Common types:
- K-fold cross-validation (e.g. K=5, K=10)
- 2-fold cross-validation
- Leave-one-out cross-validation (LOOCV)
A good practice is to randomly shuffle all training samples before splitting.

Why maximum margin for SVM?
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Support vectors are those datapoints that the margin pushes up against.
[Figure: datapoints denoted +1 and -1, with the maximum margin boundary and its support vectors]
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.
Copyright 2001, 2003, Andrew W. Moore

Evaluation Choice-III
- Basic solution (for HW- Q)
- More advanced solution (for HW- Q)
[Figures from A Practical Guide to Support Vector Classification]

Today: Review & Practical Guide
Support Vector Machine (SVM)
- Large margin linear classifier
- Define margin (M) in terms of model parameters
- Optimization to learn model parameters (w, b)
- Non-linearly separable case
- Optimization with dual form
- Nonlinear decision boundary
- Practical guide
  - File format / LIBSVM
  - Feature preprocessing
  - Model selection
  - Pipeline procedure

References
- Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides
- Prof. Andrew Moore @ CMU's slides
- Elements of Statistical Learning, by Hastie, Tibshirani and Friedman
