Machine Learning 10-701/15-781, Fall 2010
Advanced Topics in Max-Margin Learning
Eric Xing
Lecture 20, November 2, 2010

Recap: the SVM problem
- We solve the following constrained opt problem:
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{s.t.}\quad \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
- This is a quadratic programming problem. A global maximum of $\alpha_i$ can always be found.
- The solution: $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$
- How to predict: $y = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b)$
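To make the recap concrete, here is a minimal sketch (not from the lecture) that solves this dual QP on an illustrative four-point dataset with scipy's general-purpose constrained optimizer, then recovers $\mathbf{w}$ and $b$; the dataset and all names are assumptions for illustration:

```python
# Minimal sketch: solve the SVM dual QP with scipy's SLSQP solver.
# The toy dataset and all variable names are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T       # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                            # minimize -J(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
alpha = minimize(neg_dual, np.zeros(m),
                 bounds=[(0, None)] * m,        # alpha_i >= 0
                 constraints=cons).x

w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                               # support vectors
b = np.mean(y[sv] - X[sv] @ w)                  # from y_i (w . x_i + b) = 1
print(np.sign(X @ w + b))                       # predictions recover the labels
```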
Non-linearly Separable Problems
(Figure: Class 1 and Class 2 points with slack.)
- We allow "errors" $\xi_i$ in classification; they are based on the output of the discriminant function $\mathbf{w}^T\mathbf{x} + b$.
- $\xi_i$ approximates the number of misclassified samples.

Soft Margin Hyperplane
- Now we have a slightly different opt problem:
$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^m \xi_i$$
$$\text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,m$$
- $\xi_i$ are "slack variables" in the optimization.
- Note that $\xi_i = 0$ if there is no error for $\mathbf{x}_i$, and $\sum_i \xi_i$ is an upper bound on the number of errors.
- $C$: tradeoff parameter between error and margin.
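As a hedged illustration of the $C$ tradeoff (assuming scikit-learn is available; the overlapping-class dataset is synthetic), a small $C$ tolerates slack while a large $C$ penalizes it:

```python
# Illustrative sketch of the error/margin tradeoff controlled by C,
# using scikit-learn's soft-margin SVC on synthetic overlapping classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates slack (wider margin, typically more support vectors);
    # large C penalizes slack (narrower margin, typically fewer support vectors).
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()}")
```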
Lagrangian Duality, cont.
- Recall the Primal Problem: $\min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w, \alpha, \beta)$
- The Dual Problem: $\max_{\alpha,\beta:\,\alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
- Theorem (weak duality):
$$d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0} \min_w L(w,\alpha,\beta) \;\le\; \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} L(w,\alpha,\beta) = p^*$$
- Theorem (strong duality): iff there exists a saddle point of $L(w,\alpha,\beta)$, we have $d^* = p^*$.

A sketch of strong and weak duality
- Now, ignoring $h(x)$ for simplicity, let's look at what's happening graphically in the duality theorems:
$$d^* = \max_{\alpha \ge 0} \min_w \big[f(w) + \alpha\, g(w)\big] \;\le\; \min_w \max_{\alpha \ge 0} \big[f(w) + \alpha\, g(w)\big] = p^*$$
(Figure: the set of $(g(w), f(w))$ values, with $p^*$ and $d^*$ marked.)
The KKT conditions
- If there exists some saddle point of $L$, then the saddle point satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
$$\frac{\partial}{\partial w_i} L(w,\alpha,\beta) = 0, \quad i = 1,\dots,k$$
$$\frac{\partial}{\partial \beta_i} L(w,\alpha,\beta) = 0, \quad i = 1,\dots,l$$
$$\alpha_i\, g_i(w) = 0, \qquad g_i(w) \le 0, \qquad \alpha_i \ge 0, \quad i = 1,\dots,m$$
- Theorem: if $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.
The Optimization Problem
- The dual of this new constrained optimization problem is
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
- This is very similar to the optimization problem in the linearly separable case, except that there is an upper bound $C$ on $\alpha_i$ now.
- Once again, a QP solver can be used to find $\alpha_i$.

The SMO algorithm
- Consider solving the unconstrained opt problem first.
- We've already seen three opt algorithms!
  - Coordinate ascent
  - Gradient ascent
  - Newton-Raphson
- Coordinate ascent (a minimal numerical sketch follows):
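This sketch runs coordinate ascent on a small concave quadratic (the toy objective and names are illustrative assumptions); each inner step maximizes $J$ exactly along one coordinate while holding the others fixed:

```python
# Minimal coordinate-ascent sketch on a concave objective
# J(a) = c^T a - a^T Q a / 2 (illustrative toy problem).
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite, so J is concave
c = np.array([1.0, 1.0])
a = np.zeros(2)

for _ in range(50):                      # sweeps over coordinates
    for i in range(len(a)):
        # Solve dJ/da_i = c_i - (Q a)_i = 0 for a_i, holding a_j (j != i) fixed.
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a, np.linalg.solve(Q, c))          # matches the exact optimum Q^{-1} c
```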
Coordinate ascent
(Figure: contours of a 2-D objective and the axis-aligned zig-zag path taken by coordinate ascent.)

Sequential minimal optimization
- Constrained optimization:
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
- Question: can we do coordinate ascent along one direction at a time (i.e., hold all $\alpha_{[-i]}$ fixed, and update $\alpha_i$)?
The SMO algorithm
Repeat till convergence:
1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Re-optimize $J(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.

Will this procedure converge? (A sketch of the pair update appears below.)

Convergence of SMO
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{s.t. (from the KKT conditions):}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
- Let's hold $\alpha_3, \dots, \alpha_m$ fixed and re-optimize $J$ w.r.t. $\alpha_1$ and $\alpha_2$.
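Here is a hedged sketch of the step-2 pair update, following the simplified Platt-style closed form (the function and names are illustrative, not the lecture's code). With all other $\alpha_k$ fixed, $\alpha_i y_i + \alpha_j y_j$ must stay constant, so the joint update reduces to a 1-D quadratic maximization followed by clipping to the box $[0, C]$:

```python
# Sketch of one SMO pair update (simplified Platt update; illustrative code).
import numpy as np

def smo_pair_update(K, y, alpha, b, i, j, C):
    # Prediction errors E_k = f(x_k) - y_k under the current alpha and b.
    f = (alpha * y) @ K + b
    Ei, Ej = f[i] - y[i], f[j] - y[j]
    eta = K[i, i] + K[j, j] - 2 * K[i, j]         # curvature along the line
    if eta <= 0:
        return alpha                               # skip degenerate pairs
    aj = alpha[j] + y[j] * (Ei - Ej) / eta         # unconstrained 1-D optimum
    if y[i] == y[j]:                               # feasible segment endpoints
        L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    aj = np.clip(aj, L, H)                         # clip to the box [0, C]
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)  # keep sum_k alpha_k y_k fixed
    alpha = alpha.copy()
    alpha[i], alpha[j] = ai, aj
    return alpha
```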
Convergence of SMO
- The constraints: $\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^m \alpha_i y_i = \zeta$ (a constant), with $0 \le \alpha_1, \alpha_2 \le C$, so the pair lives on a line segment inside the box.
- The objective: substituting $\alpha_1 = (\zeta - \alpha_2 y_2)\, y_1$ makes $J$ a quadratic in $\alpha_2$ alone.
- Constrained opt: maximize that 1-D quadratic, then clip $\alpha_2$ to the feasible segment.

Cross-validation error of SVM
- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space but only on the number of support vectors!
$$\text{Leave-one-out CV error} = \frac{\#\text{ support vectors}}{\#\text{ of training examples}}$$
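A quick sketch checking the support-vector ratio empirically (assuming scikit-learn; the dataset is synthetic, and with a soft margin the ratio behaves as an upper bound on the measured leave-one-out error rather than an exact equality):

```python
# Sketch: compare the support-vector ratio (#SV / m) with the measured
# leave-one-out error, using scikit-learn on a synthetic dataset.
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
bound = clf.n_support_.sum() / len(y)                    # #SV / m
loo = 1 - cross_val_score(SVC(kernel="linear", C=1.0),
                          X, y, cv=LeaveOneOut()).mean() # measured LOO error
print(f"#SV/m = {bound:.3f}   LOO error = {loo:.3f}")
```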
Advanced topics in Max-Margin Learning
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
- Kernel?
- Point rule or average rule?
- Can we predict vec(y)?

Outline
- The Kernel trick
- Maximum entropy discrimination
- Structured SVM, a.k.a. Maximum Margin Markov Networks
(1) Non-linear Decision Boundary
- So far, we have only considered large-margin classifiers with a linear decision boundary.
- How to generalize it to become nonlinear?
- Key idea: transform $\mathbf{x}_i$ to a higher dimensional space to "make life easier"
  - Input space: the space where the points $\mathbf{x}_i$ are located
  - Feature space: the space of $\phi(\mathbf{x}_i)$ after transformation
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  - Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature $x_1 x_2$ makes the problem linearly separable (homework).

The Kernel Trick
- Is this data linearly separable? How about a quadratic mapping $\phi(\mathbf{x}_i)$?
(Figure: a dataset separable only by a nonlinear boundary.)
The Kernel Trick
- Recall the SVM optimization problem:
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (\mathbf{x}_i^T \mathbf{x}_j)$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0.$$
- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function $K$ by $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.

II. The Kernel Trick
- Computation depends on the feature space: bad if its dimension is much larger than the input space.
$$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
$$\text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0, \qquad \text{where } K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$
- Prediction: $y^* = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\Big)$
Transforming the Data
(Figure: points mapped from the input space to the feature space by $\phi(\cdot)$.)
- Note: the feature space is of higher dimension than the input space in practice.
- Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.

An Example of feature mapping and kernels
- Consider an input $\mathbf{x} = [x_1, x_2]$. Suppose $\phi(\cdot)$ is given as follows:
$$\phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \left[\,1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\,\right]$$
- An inner product in the feature space is
$$\left\langle \phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right), \phi\!\left(\begin{bmatrix} x_1' \\ x_2' \end{bmatrix}\right) \right\rangle = (1 + \mathbf{x}^T \mathbf{x}')^2$$
- So, if we define the kernel function as follows, there is no need to carry out $\phi(\cdot)$ explicitly:
$$K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^2$$
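A quick numerical check (illustrative test points, assumed for the example) that the quadratic kernel equals the inner product of the explicit 6-dimensional feature map above:

```python
# Verify numerically that K(x, x') = (1 + x.x')^2 matches phi(x).phi(x').
import numpy as np

def phi(x):
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1**2, x2**2, r2 * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
lhs = (1 + x @ xp) ** 2        # kernel: no explicit mapping needed
rhs = phi(x) @ phi(xp)         # explicit 6-D feature-space inner product
print(np.isclose(lhs, rhs))    # True
```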
More examples of kernel functions
- Linear kernel (we've seen it): $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial kernel (we just saw an example): $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}^T \mathbf{x}')^p$, where $p = 2, 3, \dots$ To get the feature vectors we concatenate all $p$th-order polynomial terms of the components of $\mathbf{x}$ (weighted appropriately).
- Radial basis kernel: $K(\mathbf{x}, \mathbf{x}') = \exp\!\big(-\tfrac{1}{2}\|\mathbf{x} - \mathbf{x}'\|^2\big)$. In this case the feature space consists of functions and results in a nonparametric classifier.

The essence of kernel
- Feature mapping, but without paying a cost. E.g., the polynomial kernel:
  - How many dimensions have we got in the new space?
  - How many operations does it take to compute $K()$?
- Kernel design: any principle?
  - $K(\mathbf{x}, \mathbf{z})$ can be thought of as a similarity function between $\mathbf{x}$ and $\mathbf{z}$.
  - This intuition is well reflected in the Gaussian function above (similarly, one can easily come up with other $K()$ in the same spirit).
  - Does this necessarily lead to a "legal" kernel? (In the above particular case, $K()$ is a legal one; do you know how many dimensions $\phi(\mathbf{x})$ has?)
Kernel matrix
- Suppose for now that $K$ is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $\mathbf{x}_1, \dots, \mathbf{x}_m$ we can compute an $m \times m$ matrix $K$, where $K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.
- This is called a kernel matrix!
- Now, if a kernel function is indeed a valid kernel, and its elements are dot products in the transformed feature space, it must satisfy:
  - Symmetry: $K = K^T$ (proof?)
  - Positive semidefiniteness (proof?)

Mercer kernel
- Mercer's condition: $K$ is a valid (Mercer) kernel if and only if, for any finite set of points, the corresponding kernel matrix is symmetric positive semidefinite.
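A small sketch that builds a Gram matrix for the RBF kernel on random points and checks the two necessary conditions numerically (dataset and names are illustrative):

```python
# Sketch: build an RBF Gram matrix and check symmetry and positive
# semidefiniteness numerically.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K = np.exp(-0.5 * sq)                                # K_ij = k(x_i, x_j)

print(np.allclose(K, K.T))                     # symmetry: K = K^T
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # all eigenvalues >= 0 (PSD)
```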
SVM examples
(Figure: SVM decision boundaries with different kernels.)

(2) Model averaging
- Inputs $\mathbf{x}$, class $y = +1, -1$; data $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_m, y_m)\}$
- Point Rule:
  - learn $f_{opt}(\mathbf{x})$, a discriminant function from $F = \{f\}$, a family of discriminants
  - classify $y = \operatorname{sign} f_{opt}(\mathbf{x})$
  - e.g., SVM
Model averaging
- There exist many $f$ with near-optimal performance.
- Instead of choosing $f_{opt}$, average over all $f$ in $F$, with $Q(f)$ = weight of $f$.
- How to specify $F = \{f\}$, the family of discriminant functions?
- How to learn $Q(f)$, the distribution over $F$?

Recall Bayesian Inference
- Bayesian learning: $p(f \mid D) \propto p(D \mid f)\, p_0(f)$
- Bayes Predictor (model averaging): $y = \operatorname{sign} \int Q(f)\, f(\mathbf{x})\, df$
- Recall in SVM: a single $f_{opt}$ is used. What plays the role of $p_0$?
How to score distributions? Entropy
- Entropy $H(X)$ of a random variable $X$:
$$H(X) = -\sum_i P(X=i)\, \log_2 P(X=i)$$
- $H(X)$ is the expected number of bits needed to encode a randomly drawn value of $X$ (under the most efficient code).
- Why? Information theory: the most efficient code assigns $-\log_2 P(X=i)$ bits to encode the message $X=i$. So, the expected number of bits to code one random $X$ is $\sum_i -P(X=i)\log_2 P(X=i)$.

Maximum Entropy Discrimination
- Given a data set, find the solution $Q_{ME}$ that:
  - correctly classifies $D$
  - among all admissible $Q$, has maximum entropy
- Max entropy ⇒ "minimum assumption" about $f$
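As a quick numeric grounding for the entropy score used above (a minimal sketch; the example distributions are illustrative):

```python
# Minimal sketch: entropy as the expected code length, in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # 0 log 0 = 0 by convention
    return -(p * np.log2(p)).sum()        # sum_i -P(X=i) log2 P(X=i)

print(entropy([0.5, 0.5]))    # 1 bit: a fair coin
print(entropy([0.9, 0.1]))    # ~0.469 bits: a biased coin is cheaper to encode
print(entropy([0.25] * 4))    # 2 bits: uniform over four outcomes
```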
Introducing Priors
- Prior $Q_0(f)$
- Minimum Relative Entropy Discrimination: among all admissible $Q$, pick the one closest to the prior in KL divergence, $\min_Q KL(Q \,\|\, Q_0)$.
- Convex problem: $Q_{MRE}$ is the unique solution.
- MRE ⇒ "minimum additional assumption" over $Q_0$ about $f$

Solution: Q_ME as a projection
- Convex problem: $Q_{ME}$ unique; $\alpha = 0$ gives the uniform distribution.
- Theorem: $Q_{ME}$ is the projection of $Q_0$ onto the admissible set.
(Figure: $Q_0$ projected onto the set of admissible distributions, yielding $Q_{ME}$.)
- Lagrange multipliers: to find $Q_{ME}$, start with $\alpha = 0$ and follow the gradient of unsatisfied constraints.
Solution to MED
- Theorem (Solution to MED):
  - Posterior distribution: $Q(f) = \frac{1}{Z(\alpha)}\, Q_0(f)\, \exp\big\{\sum_t \alpha_t\, y_t f(\mathbf{x}_t)\big\}$
  - Dual optimization problem: maximize the concave dual objective $J(\alpha)$, essentially the negative log-partition function of this posterior.
- Algorithm: to compute $\alpha_t$, $t = 1, \dots$:
  - start with $\alpha_t = 0$ (uniform distribution)
  - iterative ascent on $J(\alpha)$ until convergence

Examples: SVMs
- Theorem: for $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, $Q_0(\mathbf{w}) = \mathcal{N}(0, I)$, and $Q_0(b)$ a non-informative prior, the Lagrange multipliers $\alpha$ are obtained by maximizing $J(\alpha)$ subject to $0 \le \alpha_t \le C$ and $\sum_t \alpha_t y_t = 0$.
- Separable $D$ ⇒ SVM recovered exactly.
- Inseparable $D$ ⇒ SVM recovered with a different misclassification penalty.
SVM extensions
- Example: Leptograpsus Crabs (5 inputs, $N_{train} = 80$, $N_{test} = 120$)
(Figure: test errors of Linear SVM, Max Likelihood Gaussian, and MRE Gaussian.)

(3) Structured Prediction
- Unstructured prediction: a single label per input.
- Structured prediction:
  - Part-of-speech tagging: "Do you want sugar in it?" ⇒ <verb pron verb noun prep pron>
  - Image segmentation
OCR example
(Figure: a handwritten word image $\mathbf{x}$ mapped to the label sequence $\mathbf{y}$ = "brace"; the sequential structure ties one label to each letter image.)

Classical Classification Models
- Inputs: a set of training samples $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^m$, where $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.
- Outputs: a predictive function $h(\mathbf{x})$.
- Examples:
  - SVM: $h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b)$
  - Logistic Regression: $h(\mathbf{x}) = \arg\max_y p(y \mid \mathbf{x})$, where $p(y \mid \mathbf{x}) \propto \exp\{\mathbf{w}_y^T \mathbf{x}\}$
Structured Models
$$h(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} F(\mathbf{x}, \mathbf{y}; \mathbf{w})$$
- Assumptions:
  - $\mathcal{Y}(\mathbf{x})$: space of feasible outputs
  - $F$: discriminant function, a linear combination of features
  - Sum of partial scores, where the index $p$ represents a part in the structure: $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \sum_p \mathbf{w}_p^T f_p(\mathbf{x}, \mathbf{y}_p)$
  - Random fields or Markov network features

Discriminative Learning Strategies
- Max Conditional Likelihood:
  - We predict based on: $\mathbf{y}^* \mid \mathbf{x} = \arg\max_{\mathbf{y}} p_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} \frac{1}{Z(\mathbf{w}, \mathbf{x})} \exp\big\{\sum_c w_c f_c(\mathbf{x}, \mathbf{y}_c)\big\}$
  - And we learn based on: $\mathbf{w}^* \mid \{\mathbf{y}_i, \mathbf{x}_i\} = \arg\max_{\mathbf{w}} \prod_i \frac{1}{Z(\mathbf{w}, \mathbf{x}_i)} \exp\big\{\sum_c w_c f_c(\mathbf{x}_i, \mathbf{y}_{i,c})\big\}$
- Max Margin:
  - We predict based on: $\mathbf{y}^* \mid \mathbf{x} = \arg\max_{\mathbf{y}} \sum_c w_c f_c(\mathbf{x}, \mathbf{y}_c) = \arg\max_{\mathbf{y}} \mathbf{w}^T \mathbf{f}(\mathbf{x}, \mathbf{y})$
  - And we learn based on: $\mathbf{w}^* \mid \{\mathbf{y}_i, \mathbf{x}_i\} = \arg\max_{\mathbf{w}} \min_{i,\, \mathbf{y} \ne \mathbf{y}_i} \big(\mathbf{w}^T \mathbf{f}(\mathbf{y}_i, \mathbf{x}_i) - \mathbf{w}^T \mathbf{f}(\mathbf{y}, \mathbf{x}_i)\big)$
  (Both strategies need the structured argmax; a dynamic-programming sketch follows.)
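For a chain-structured model the argmax over outputs is tractable by dynamic programming. Below is a minimal Viterbi sketch in which the random node/edge score arrays are illustrative stand-ins for the partial scores $w_c f_c(\mathbf{x}, \mathbf{y}_c)$:

```python
# Sketch: argmax_y over a chain via Viterbi dynamic programming.
import numpy as np

def viterbi(node_scores, edge_scores):
    # node_scores: (T, K) score of label k at position t
    # edge_scores: (K, K) score of transitioning from label k to label k'
    T, K = node_scores.shape
    delta = node_scores[0].copy()                 # best score ending in each label
    back = np.zeros((T, K), dtype=int)            # backpointers
    for t in range(1, T):
        cand = delta[:, None] + edge_scores       # (K, K): prev label -> next label
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + node_scores[t]
    y = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # trace the best path backwards
        y.append(int(back[t][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```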
E.g. Max-Margin Markov Networks
- Convex optimization problem: maximize the margin subject to the constraints below.
- Feasible subspace of weights: $\mathbf{w}$ such that the true output scores higher than every alternative output.
- Predictive function: $h(\mathbf{x}) = \arg\max_{\mathbf{y}} \mathbf{w}^T \mathbf{f}(\mathbf{x}, \mathbf{y})$

OCR Example
- We want: $\arg\max_{word}\, \mathbf{w}^T \mathbf{f}(\text{image}, word) = $ "brace"
- Equivalently:
  - $\mathbf{w}^T \mathbf{f}(\text{image}, \text{"brace"}) > \mathbf{w}^T \mathbf{f}(\text{image}, \text{"aaaaa"})$
  - $\mathbf{w}^T \mathbf{f}(\text{image}, \text{"brace"}) > \mathbf{w}^T \mathbf{f}(\text{image}, \text{"aaaab"})$
  - ...
  - $\mathbf{w}^T \mathbf{f}(\text{image}, \text{"brace"}) > \mathbf{w}^T \mathbf{f}(\text{image}, \text{"zzzzz"})$
- That's a lot of constraints!
Min-max Formulation
- Brute-force enumeration of constraints: the constraints are exponential in the size of the structure.
- Alternative: min-max formulation
  - add only the most violated constraint
  - handles more general loss functions
  - only a polynomial # of constraints needed
  - several algorithms exist (a schematic loop follows the results below)

Results: Handwriting Recognition
- Length: ~8 chars; letter: 16x8 pixels
- 10-fold train/test; 5000/50000 letters; 600/6000 words
- Models: Multiclass-SVMs (Crammer & Singer '01) and M^3 nets
(Figure: average per-character test error for raw-pixel, quadratic-kernel, and cubic-kernel features; M^3 nets achieve a 33% error reduction over multiclass SVMs.)
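Returning to the min-max idea above, here is a schematic sketch of the constraint-generation (cutting-plane) loop. `solve_qp` and `loss_augmented_argmax` are hypothetical placeholders, not code from the lecture, and outputs are assumed hashable (e.g., tuples of labels):

```python
# Schematic cutting-plane loop for structured max-margin learning: instead of
# enumerating the exponentially many constraints, repeatedly add only the most
# violated one per example, then re-solve the QP over the working set.
# solve_qp and loss_augmented_argmax are hypothetical placeholder callables.
def cutting_plane(examples, solve_qp, loss_augmented_argmax, max_iters=100):
    working_set = set()                          # active constraints (i, y_hat)
    w = solve_qp(working_set)                    # initial weights, no constraints
    for _ in range(max_iters):
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat = loss_augmented_argmax(w, x, y)   # most violated output for x
            if y_hat != y and (i, y_hat) not in working_set:
                working_set.add((i, y_hat))          # add one cutting plane
                added = True
        if not added:                            # no new violations: done
            return w
        w = solve_qp(working_set)                # re-solve over the working set
    return w
```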
Discriminative Learning Paradigms
(Figure: relationships among SVM, MED, and M^3N on the OCR "brace" task; MED-MN = SMED + Bayesian M^3N. See [Zhu and Xing, 2008].)

Summary
- Maximum margin nonlinear separator
  - Kernel trick
  - Project into a linearly separable space (possibly high- or infinite-dimensional)
  - No need to know the explicit projection function
- Max-entropy discrimination
  - Average rule for prediction; the average is taken over a posterior distribution of $\mathbf{w}$, which defines the separation hyperplane
  - $P(\mathbf{w})$ is obtained by the max-entropy or min-KL principle, subject to expected margin constraints on the training examples
- Max-margin Markov network
  - Multivariate, rather than univariate, output $Y$
  - Variables in the output are not independent of each other (structured input/output)
  - Margin constraints over every possible configuration of $Y$ (exponentially many!)