UVA CS 4501-001 / 6501-007 Introduction to Machine Learning and Data Mining
Lecture 10: Classification with Support Vector Machine (cont.)
Yanjun Qi / Jane, University of Virginia, Department of Computer Science, 9/6/14

Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models
Where are we? Three major sections for classification. We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary; e.g., support vector machine, decision tree.
2. Generative: build a generative statistical model; e.g., Bayesian networks.
3. Instance-based classifiers: use observations directly (no models); e.g., K nearest neighbors.

Today / last lecture: Support Vector Machine (SVM)
- History of SVM
- Large-margin linear classifier
- Define margin (M) in terms of model parameters
- Optimization to learn model parameters (w, b)
- Non-linearly separable case
- Optimization with dual form
- Nonlinear decision boundary
- Multiclass SVM
Today (review): Support Vector Machine (SVM) [outline as above]

History of SVM
- Young / theoretically sound: SVM is inspired by statistical learning theory [3].
- Impactful: SVM was first introduced in 1992 [1]. SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate, the same as the error rate of a carefully constructed neural network, LeNet 4. See Section 5.11 in [2] or the discussion in [3] for details.
- SVM is now regarded as an important example of kernel methods, arguably the hottest area in machine learning 10 years ago.

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Applications of SVMs
- Computer vision
- Text categorization
- Ranking (e.g., Google searches)
- Handwritten character recognition
- Time-series analysis
- Bioinformatics
Lots of very successful applications!

Handwritten digit recognition (1999, SVM)
Today (review): Support Vector Machine (SVM) [outline as above]

A dataset for binary classification
- Output as binary class label: 1 or -1
- Data/points/instances/examples/samples/records: [rows]
- Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]
- Target/outcome/response/label/dependent variable: special column to be predicted [last column]
Max-margin classifiers
- Instead of fitting all points, focus on the boundary points.
- Learn a boundary that leads to the largest margin from points on both sides.
- Why? Intuitive, makes sense; some theoretical support; works well in practice.

Max-margin and the decision boundary
- The decision boundary should be as far away from the data of both classes (Class 1 and Class -1) as possible.
- w is a p-dimensional vector; b is a scalar.
Today (review): Support Vector Machine (SVM) [outline as above]

Maximizing the margin: Observation 1
- Observation 1: the vector w is orthogonal to the +1 plane.
(Figure: the two classes separated by a margin of width M.)
Maximizing the margin: Observation 2
The three parallel planes: w^T x + b = +1, w^T x + b = 0, w^T x + b = -1.
- Classify as +1 if w^T x + b >= 1
- Classify as -1 if w^T x + b <= -1
- Undefined if -1 < w^T x + b < 1
- Observation 1: the vector w is orthogonal to the +1 and -1 planes.
- Observation 2: if x+ is a point on the +1 plane and x- is the closest point to x+ on the -1 plane, then x+ = lambda w + x- for some lambda. Since w is orthogonal to both planes, we need to travel some distance along w to get from x- to x+.

Putting it together
We know:
  w^T x+ + b = +1
  w^T x- + b = -1
  x+ = lambda w + x-
  |x+ - x-| = M
We can now define M in terms of w and b:
  w^T x+ + b = +1
  w^T (lambda w + x-) + b = +1
  w^T x- + b + lambda w^T w = +1
  -1 + lambda w^T w = +1
  lambda = 2 / (w^T w)
Putting it together (cont.)
We know:
  w^T x+ + b = +1
  w^T x- + b = -1
  x+ = lambda w + x-
  |x+ - x-| = M
  lambda = 2 / (w^T w)
We can now define M in terms of w and b:
  M = |x+ - x-| = |lambda w| = lambda |w| = lambda sqrt(w^T w)
    = 2 sqrt(w^T w) / (w^T w) = 2 / sqrt(w^T w)

Today (review): Support Vector Machine (SVM) [outline as above]
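The derivation above is easy to check numerically. A minimal sketch (the classifier parameters w and b below are made up for illustration, not taken from the slides' figure): starting from any point x- on the -1 plane, stepping by lambda = 2 / (w^T w) along w lands exactly on the +1 plane, and the distance covered equals M = 2 / sqrt(w^T w).

```python
import math

# Hypothetical linear classifier parameters (illustrative only).
w = [3.0, 4.0]          # w^T w = 25, so M should be 2/5 = 0.4
b = -2.0

dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))

# Any point on the -1 plane (w^T x + b = -1): with w=(3,4), b=-2, x-=(0, 0.25) works.
x_minus = [0.0, 0.25]

lam = 2.0 / dot(w, w)                                   # lambda = 2 / (w^T w)
x_plus = [xm + lam * wi for xm, wi in zip(x_minus, w)]  # x+ = lambda*w + x-

# x+ lies on the +1 plane, and |x+ - x-| equals 2 / sqrt(w^T w).
assert abs(dot(w, x_plus) + b - 1.0) < 1e-9
M = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))
assert abs(M - 2.0 / math.sqrt(dot(w, w))) < 1e-9
print(M)   # 0.4
```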
Optimization step, i.e., learning the optimal parameters for the SVM
  M = 2 / sqrt(w^T w)
Minimize (w^T w)/2 subject to the following constraints:
- For all x_i in class +1: w^T x_i + b >= 1
- For all x_i in class -1: w^T x_i + b <= -1
(A total of n constraints if we have n input samples.)
Equivalently:
  argmin_{w,b} sum_{i=1}^p w_i^2  subject to  y_i (x_i . w + b) >= 1 for all x_i in D_train

SVM as a QP problem
With R the identity matrix, d the zero vector, and c = 0, this is a standard quadratic program:
  min_u (1/2) u^T R u + d^T u + c
subject to n inequality constraints:
  a_11 u_1 + a_12 u_2 + ... <= b_1
  ...
  a_n1 u_1 + a_n2 u_2 + ... <= b_n
and k equality constraints:
  a_{n+1,1} u_1 + a_{n+1,2} u_2 + ... = b_{n+1}
  ...
  a_{n+k,1} u_1 + a_{n+k,2} u_2 + ... = b_{n+k}
Our problem: minimize (w^T w)/2 subject to w^T x_i + b >= 1 for all x_i in class +1 and w^T x_i + b <= -1 for all x_i in class -1 (a total of n inequality constraints and no equality constraints).
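To make the n constraints concrete, here is a small sketch (toy data and a candidate (w, b) chosen by hand for illustration, not produced by a QP solver): it evaluates the objective (w^T w)/2 and checks the combined constraint y_i (w^T x_i + b) >= 1 for every training sample.

```python
# Toy linearly separable data: rows of X with labels y in {+1, -1} (made up).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [+1, +1, -1, -1]

# Candidate parameters (hand-picked: feasible, not necessarily optimal).
w = [0.5, 0.5]
b = 0.0

dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))

objective = 0.5 * dot(w, w)                       # (w^T w)/2
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

# One inequality constraint per training sample: y_i (w^T x_i + b) >= 1.
feasible = all(m >= 1.0 for m in margins)
print(objective, feasible)   # 0.25 True
```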
Today (review): Support Vector Machine (SVM) [outline as above]

Non-linearly separable case
Instead of minimizing the number of misclassified points, we can minimize the distance between those points and their correct plane (+1 plane or -1 plane). The new optimization problem is:
  min_w (w^T w)/2 + C sum_{i=1}^n eps_i
subject to the following inequality constraints:
- For all x_i in class +1: w^T x_i + b >= 1 - eps_i
- For all x_i in class -1: w^T x_i + b <= -1 + eps_i
(a total of n constraints), and
- For all i: eps_i >= 0
(another n constraints).
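Once (w, b) is fixed, each slack has a closed form: the smallest eps_i satisfying the relaxed constraint is eps_i = max(0, 1 - y_i (w^T x_i + b)). A small sketch (data and parameters made up for illustration) computing the slacks and the soft-margin objective:

```python
# Toy data with one point inside its margin; all values are made up.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [+1, +1, -1]
w, b, C = [1.0, 0.0], 0.0, 10.0

dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))

# Smallest slack satisfying y_i (w^T x_i + b) >= 1 - eps_i with eps_i >= 0.
eps = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
objective = 0.5 * dot(w, w) + C * sum(eps)

print(eps)        # [0.0, 0.5, 0.0] -- only the margin violator pays a penalty
print(objective)  # 0.5 + 10 * 0.5 = 5.5
```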
Where we are: two optimization problems, for the separable and non-separable cases.
Separable:
  min_w (w^T w)/2
  s.t. w^T x_i + b >= 1 for all x_i in class +1
       w^T x_i + b <= -1 for all x_i in class -1
Non-separable:
  min_w (w^T w)/2 + C sum_{i=1}^n eps_i
  s.t. w^T x_i + b >= 1 - eps_i for all x_i in class +1
       w^T x_i + b <= -1 + eps_i for all x_i in class -1
       eps_i >= 0 for all i

Today: Support Vector Machine (SVM) [outline as above]
Where we are (cont.)
Instead of solving these QPs directly we will solve a dual formulation of the SVM optimization problem. The main reason for switching to this representation is that it allows us to use a neat trick that will make our lives easier (and the run time faster).

Optimization review: constrained optimization
  min_u u^2  s.t.  u >= b
- Case 1: b <= 0. The global minimum u = 0 lies inside the allowed region, so the constraint is inactive.
- Case 2: b > 0. The global minimum lies outside the allowed region, so the constrained minimum sits on the boundary, at u = b.
Optimization review: constrained optimization with Lagrange multipliers
With equality constraints, i.e., optimize f(x) subject to g_i(x) = 0, the method of Lagrange multipliers converts the problem to a higher-dimensional one:
  minimize f(x) + sum_i lambda_i g_i(x)  w.r.t.  (x_1, ..., x_n; lambda_1, ..., lambda_k)
- Introduce a Lagrange multiplier lambda_i for each constraint.
- Construct the Lagrangian for the original optimization problem.

Optimization review: dual problem
Using the dual problem:
- Constrained optimization becomes unconstrained optimization.
- Need to change maximization to minimization.
- Only valid when the original optimization problem is convex/concave (strong duality); in that case the primal and dual optima coincide: f(x*) = l(lambda*).
Primal problem:
  x* = argmax_x f(x)  subject to  g(x) = c
Dual problem:
  lambda* = argmin_lambda l(lambda),  where  l(lambda) = sup_x ( f(x) + lambda (g(x) - c) )
An alternative (dual) representation of the SVM QP
We will start with the linearly separable case. Instead of encoding the correct classification rule as a constraint, we will use Lagrange multipliers to encode it as part of our minimization problem.
  Min (w^T w)/2
  s.t. w^T x_i + b >= 1 for all x_i in class +1
       w^T x_i + b <= -1 for all x_i in class -1
Why? The two constraints combine into one:
  Min (w^T w)/2  s.t.  (w^T x_i + b) y_i >= 1
Recall that Lagrange multipliers can be applied to turn the following problem:
  min_x x^2  s.t.  x >= b
into:
  min_x max_alpha x^2 - alpha (x - b)  s.t.  alpha >= 0
When b > 0 the unconstrained global minimum x = 0 is not allowed, and the multiplier pushes the solution to the boundary x = b.
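For the toy problem min x^2 s.t. x >= b, the inner minimization of x^2 - alpha(x - b) over x gives x = alpha/2, so the dual is l(alpha) = alpha*b - alpha^2/4; maximizing over alpha >= 0 (with b > 0) gives alpha* = 2b and recovers x* = b. A quick numeric check of that algebra (the concrete value b = 3 is arbitrary):

```python
# Dual of: min x^2 s.t. x >= b, for a concrete b > 0.
b = 3.0

def lagrangian(x, alpha):
    return x * x - alpha * (x - b)

def dual(alpha):
    # Inner minimum over x: d/dx [x^2 - alpha*(x - b)] = 0  =>  x = alpha/2.
    x = alpha / 2.0
    return lagrangian(x, alpha)        # equals alpha*b - alpha^2/4

# Maximize the (concave) dual over alpha >= 0 by a coarse grid search.
alphas = [i * 0.001 for i in range(10001)]    # alpha in [0, 10]
alpha_star = max(alphas, key=dual)

x_star = alpha_star / 2.0
print(alpha_star, x_star)   # ~6.0 and ~3.0: the constrained minimum sits at x = b
```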
Lagrange multipliers for SVMs
Original formulation:
  min (w^T w)/2  s.t.  (w^T x_i + b) y_i >= 1
Dual formulation:
  min_{w,b} max_alpha (w^T w)/2 - sum_i alpha_i [ (w^T x_i + b) y_i - 1 ],  alpha_i >= 0
Using this new formulation we can derive w and b by setting the partial derivatives w.r.t. w to 0, leading to:
  w = sum_i alpha_i y_i x_i
  b = y_k - w^T x_k  for any k s.t. alpha_k > 0
Finally, taking the derivative w.r.t. b we get:
  sum_i alpha_i y_i = 0

Dual SVM: interpretation
  w = sum_i alpha_i y_i x_i
For alpha_i's that are 0, the corresponding training points have no influence on w; only the points with alpha_i > 0 (the support vectors) contribute.
A geometrical interpretation
(Figure: most points get alpha_i = 0; only the few points on the margin get nonzero multipliers, e.g. alpha_1 = 0.8, alpha_6 = 1.4, alpha_8 = 0.6, with alpha_2, alpha_3, alpha_4, alpha_5, alpha_7, alpha_9, alpha_10 all 0.)

Dual SVM for the linearly separable case
Substituting w into our target function and using the additional constraint, we get the dual formulation:
  max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
  s.t. sum_i alpha_i y_i = 0,  alpha_i >= 0
derived from:
  min_{w,b} max_alpha (w^T w)/2 - sum_i alpha_i [ (w^T x_i + b) y_i - 1 ],  alpha_i >= 0
with
  w = sum_i alpha_i y_i x_i,  b = y_k - w^T x_k for k s.t. alpha_k > 0,  sum_i alpha_i y_i = 0
Dual SVM for the linearly separable case (cont.)
Our dual target function:
  max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
  s.t. sum_i alpha_i y_i = 0,  alpha_i >= 0
(requires dot products between all pairs of training samples). To evaluate a new sample x we need to compute:
  w^T x + b = sum_i alpha_i y_i x_i^T x + b
(dot products with the training samples). Is this too much computational work (for example when using transformations of the data)?

Dual formulation for the non-linearly separable case
  max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
  s.t. sum_i alpha_i y_i = 0,  C >= alpha_i >= 0
The hyperparameter C should be tuned through k-fold CV. The only difference is that the alpha_i's are now bounded. To evaluate a new sample x we again compute w^T x + b = sum_i alpha_i y_i x_i^T x + b. This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on alpha_i. Once again, a QP solver can be used to find the alpha_i.
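For a very small dataset the dual can be solved by hand, which makes a good sanity check on these formulas. With two 1-D points x_1 = 1 (y = +1) and x_2 = -1 (y = -1), the constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = alpha; the dual objective becomes 2*alpha - 2*alpha^2, maximized at alpha = 1/2. The sketch below plugs those hand-derived alphas into w = sum_i alpha_i y_i x_i and b = y_k - w^T x_k:

```python
# Two-point 1-D example solved by hand: alpha_1 = alpha_2 = 0.5 maximizes the dual.
X = [1.0, -1.0]
y = [+1, -1]
alpha = [0.5, 0.5]

# The dual constraints hold: sum_i alpha_i y_i = 0, alpha_i >= 0.
assert sum(a * yi for a, yi in zip(alpha, y)) == 0.0

# Recover the primal solution from the dual one.
w = sum(a * yi * xi for a, yi, xi in zip(alpha, y, X))   # w = 1.0
k = 0                                                    # any index with alpha_k > 0
b = y[k] - w * X[k]                                      # b = 0.0

# The decision function uses only the support vectors (here, both points):
# f(x) = sum_i alpha_i y_i x_i^T x + b.
f = lambda x: sum(a * yi * xi * x for a, yi, xi in zip(alpha, y, X)) + b
print(w, b, f(2.0), f(-0.5))   # 1.0 0.0 2.0 -0.5
```

The recovered boundary is x = 0 with margin M = 2/|w| = 2, exactly the halfway separator between the two points.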
Today: Support Vector Machine (SVM) [outline as above]

Classifying in 1-d
Can an SVM correctly classify this data? What about this? (Figures: a 1-d dataset that a single threshold separates, and one that no single threshold can separate.)
Classifying in 1-d (cont.)
Can an SVM correctly classify this data? And now? (Extend with a polynomial basis: map x to (x, x^2); in the lifted space a line separates the classes.)

Non-linear SVMs: 2D
The original input space (x) can be mapped to some higher-dimensional feature space (phi(x)) where the training set is separable:
  x = (x_1, x_2)  ->  phi(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2)
  Phi: x -> phi(x)
(This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)
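The 1-d case can be made concrete: points of class -1 sandwiched between two groups of class +1 are not separable by any threshold on x, but after mapping x -> (x, x^2) a line in the lifted space separates them. A small sketch with made-up data (the particular rule sign(x^2 - 4) is a hand-picked separator, not an SVM solution):

```python
# 1-D data: +1 on the outside, -1 in the middle; no single threshold separates them.
X = [-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0]
y = [+1, +1, -1, -1, -1, +1, +1]

phi = lambda x: (x, x * x)   # polynomial basis: lift 1-D input to 2-D

# In feature space a *linear* rule works, e.g. w = (0, 1), b = -4 => sign(x^2 - 4).
w, b = (0.0, 1.0), -4.0
predict = lambda x: 1 if w[0] * phi(x)[0] + w[1] * phi(x)[1] + b > 0 else -1

assert all(predict(xi) == yi for xi, yi in zip(X, y))
print([predict(xi) for xi in X])   # matches y exactly
```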
Non-linear SVMs: 2D (cont.)
Same mapping: x = (x_1, x_2) -> phi(x) = (x_1^2, x_2^2, sqrt(2) x_1 x_2). If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N-1 dimensions or more!
(This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)

A little bit of theory: Vapnik-Chervonenkis (VC) dimension
If data is mapped into a sufficiently high dimension, then samples will in general be linearly separable; N data points are in general separable in a space of N-1 dimensions or more. The VC dimension of the set of oriented lines in R^2 is 3. It can be shown that the VC dimension of the family of oriented separating hyperplanes in R^N is at least N+1.
Transformation of inputs
Possible problems:
- High computation burden due to high dimensionality
- Many more parameters
SVM solves these two issues simultaneously:
- Kernel tricks for efficient computation
- The dual formulation only assigns parameters to samples, not features
(Figure: input space mapped by phi(.) into feature space.)

Quadratic kernels
While working in higher dimensions is beneficial, it also increases our running time because of the dot product computation. However, there is a neat trick we can use:
  max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j Phi(x_i)^T Phi(x_j)
  s.t. sum_i alpha_i y_i = 0,  alpha_i >= 0
Consider all quadratic terms for x_1, ..., x_m (m is the number of features in each vector):
  Phi(x) = ( 1, sqrt(2) x_1, ..., sqrt(2) x_m, x_1^2, ..., x_m^2, sqrt(2) x_1 x_2, ..., sqrt(2) x_{m-1} x_m )
That is m+1 "linear" terms (including the constant 1), m quadratic terms, and m(m-1)/2 pairwise terms. The sqrt(2) factors will become clear on the next slide.
Dot product for quadratic kernels
How many operations do we need for the dot product?
  Phi(x)^T Phi(z) = 1 + 2 sum_i x_i z_i + sum_i x_i^2 z_i^2 + 2 sum_i sum_{j>i} x_i x_j z_i z_j
That is m + m + m(m-1)/2 terms, roughly m^2/2 operations.

The kernel trick
However, we can obtain dramatic savings by noting that
  (x^T z + 1)^2 = (x.z)^2 + 2 (x.z) + 1
               = ( sum_i x_i z_i )^2 + 2 sum_i x_i z_i + 1
               = sum_i x_i^2 z_i^2 + 2 sum_i sum_{j>i} x_i x_j z_i z_j + 2 sum_i x_i z_i + 1
               = Phi(x)^T Phi(z)
We only need m operations! So, if we define the kernel function as
  K(x, z) = (x^T z + 1)^2
there is no need to carry out phi(.) explicitly.
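The identity Phi(x)^T Phi(z) = (x^T z + 1)^2 is easy to verify numerically. The sketch below builds the explicit quadratic feature map from the previous slide (constant 1, sqrt(2)-weighted linear terms, squared terms, sqrt(2)-weighted pairwise terms) and checks that its dot product matches the kernel value, for arbitrary made-up vectors:

```python
import math

def phi(x):
    """Explicit quadratic feature map: 1 + m + m + m(m-1)/2 features."""
    m = len(x)
    feats = [1.0]
    feats += [math.sqrt(2.0) * xi for xi in x]          # sqrt(2)-weighted linear terms
    feats += [xi * xi for xi in x]                      # squared terms
    feats += [math.sqrt(2.0) * x[i] * x[j]              # sqrt(2)-weighted pairwise terms
              for i in range(m) for j in range(i + 1, m)]
    return feats

dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))
kernel = lambda x, z: (dot(x, z) + 1.0) ** 2            # K(x, z) = (x^T z + 1)^2

x = [1.0, 2.0, -0.5]       # arbitrary test vectors
z = [0.3, -1.0, 4.0]

lhs = dot(phi(x), phi(z))  # O(m^2) features, expensive in general
rhs = kernel(x, z)         # O(m) operations
assert abs(lhs - rhs) < 1e-9
print(lhs, rhs)            # equal up to rounding
```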
Where we are
Our dual target function:
  max_alpha sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j Phi(x_i)^T Phi(x_j)
  s.t. sum_i alpha_i y_i = 0,  alpha_i >= 0
With the kernel trick, each Phi(x_i)^T Phi(x_j) costs only ~m operations at each iteration. To evaluate a new sample x we need to compute:
  w^T Phi(x) + b = sum_i alpha_i y_i Phi(x_i)^T Phi(x) + b
which takes mr operations, where r is the number of support vectors (alpha_i > 0). So, if we define the kernel function as K(x, z) = (x^T z + 1)^2, there is no need to carry out phi(.) explicitly.

More examples of kernel functions (credit: Eric Xing @ CMU, 2006-2008)
- Linear kernel (we've seen it): K(x, x') = x^T x'
- Polynomial kernel (we just saw an example): K(x, x') = (1 + x^T x')^p, where p = 2, 3, ... To get the feature vectors we concatenate all pth-order polynomial terms of the components of x (weighted appropriately).
- Radial basis kernel: K(x, x') = exp( -(1/2) ||x - x'||^2 ). In this case the feature space consists of functions and results in a non-parametric classifier: never represent features explicitly, compute dot products in closed form. There is very interesting theory behind this (Reproducing Kernel Hilbert Spaces), not covered in detail here.
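The three kernels above are one-liners in code. A sketch (using the bandwidth convention written on this slide, K(x, x') = exp(-(1/2)||x - x'||^2); real libraries usually expose a tunable width parameter):

```python
import math

dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))

linear = lambda x, z: dot(x, z)                                   # x^T z
poly = lambda x, z, p=2: (1.0 + dot(x, z)) ** p                   # (1 + x^T z)^p
rbf = lambda x, z: math.exp(-0.5 * sum((xi - zi) ** 2             # exp(-||x-z||^2 / 2)
                                       for xi, zi in zip(x, z)))

x, z = [1.0, 0.0], [0.0, 1.0]
print(linear(x, z))   # 0.0
print(poly(x, z))     # (1 + 0)^2 = 1.0
print(rbf(x, x))      # exp(0) = 1.0: the RBF kernel of a point with itself is always 1
assert rbf(x, z) == rbf(z, x)   # kernels are symmetric
```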
Today: Support Vector Machine (SVM) [outline as above]

Multi-class classification with SVMs
What if we have data from more than two classes? Most common solution: one vs. all.
- Create a classifier for each class against all other data.
- For a new point, use all classifiers and compare the margins for all selected classes.
Note that this is not necessarily valid, since this is not what we trained the SVMs for, but it often works well in practice.
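A sketch of the one-vs-all prediction rule. The per-class linear scores below are hand-set stand-ins for K trained binary SVMs (the training step itself is omitted); prediction picks the class whose binary classifier reports the largest signed margin:

```python
# Hand-set (w, b) per class, standing in for three trained one-vs-all SVMs.
classifiers = {
    "A": ([1.0, 0.0], 0.0),    # class A vs. rest
    "B": ([0.0, 1.0], 0.0),    # class B vs. rest
    "C": ([-1.0, -1.0], 0.0),  # class C vs. rest
}

def predict(x):
    # Signed margin w^T x + b from each binary classifier; take the argmax over classes.
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) + b
              for c, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)

print(predict([2.0, 0.5]))    # A
print(predict([0.5, 2.0]))    # B
print(predict([-1.0, -1.0]))  # C
```

As the slide notes, comparing raw margins across independently trained classifiers is a heuristic rather than something the binary SVMs were optimized for.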
Handwritten digit recognition (1999, SVM)

Why do SVMs work?
If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
- The number of parameters remains the same (and most are set to 0).
- While we have a lot of input values, at the end we only care about the support vectors, and these are usually a small group of samples.
- The minimization (or the maximizing of the margin) function acts as a sort of regularization term, leading to reduced overfitting.
Software
- A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
- Some implementations (such as LIBSVM) can handle multi-class classification.
- SVMLight is among the earliest implementations of SVM.
- Several Matlab toolboxes for SVM are also available.

References
- Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides
- Prof. Andrew Moore @ CMU's slides
- Elements of Statistical Learning, by Hastie, Tibshirani and Friedman