Machine Learning 10-701/15-781, Fall 2011
Advanced Topics in Max-Margin Learning
Eric Xing
Lecture 20, November 2, 2011

Recap: the SVM problem
We solve the following constrained opt problem:
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
s.t. $\alpha_i \ge 0,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
This is a quadratic programming problem. A global maximum of $\alpha_i$ can always be found.
The solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$
How to predict: $y^* = \mathrm{sign}(w^\top x_{\text{new}} + b)$
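As a concrete illustration, here is a minimal numpy/scipy sketch that solves this dual QP on a small hypothetical toy set with a generic constrained optimizer (standing in for the QP solver mentioned above) and then builds the predictor from the resulting $\alpha$; the data points and tolerances are made up for illustration only.

```python
# Minimal sketch: solve the hard-margin SVM dual with a generic solver
# (scipy's SLSQP stands in for the QP solver mentioned on the slide).
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (hypothetical example).
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_J(alpha):                              # maximize J <=> minimize -J
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = {"type": "eq", "fun": lambda a: a @ y}  # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * m                       # alpha_i >= 0
alpha = minimize(neg_J, np.zeros(m), bounds=bnds, constraints=[cons]).x

w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                              # support vectors
b = np.mean(y[sv] - X[sv] @ w)                 # b recovered from the support vectors

predict = lambda x_new: np.sign(x_new @ w + b)
print(alpha.round(3), w.round(3), round(b, 3), predict(np.array([2.0, 1.0])))
```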

Non-linearly Separable Problems
[Figure: two overlapping classes (Class 1, Class 2) with slack variables marked]
We allow "error" $\xi_i$ in classification; it is based on the output of the discriminant function $w^\top x + b$.
$\xi_i$ approximates the number of misclassified samples.

Soft Margin Hyperplane
Now we have a slightly different opt problem:
$\min_{w,b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^m \xi_i$
s.t. $y_i (w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1,\dots,m$
$\xi_i$ are "slack" variables in optimization.
Note that $\xi_i = 0$ if there is no error for $x_i$.
$\sum_i \xi_i$ is an upper bound on the number of errors.
C: tradeoff parameter between error and margin.
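The slack variables and the role of C can be made concrete in a few lines of numpy. This sketch, with hypothetical data and a hand-picked $(w, b)$, just evaluates the soft-margin primal objective and shows which points get nonzero slack.

```python
# Sketch of the soft-margin primal objective for a candidate (w, b):
# slack xi_i = max(0, 1 - y_i (w^T x_i + b)) and cost 0.5 ||w||^2 + C * sum(xi).
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)        # slack variables
    return 0.5 * w @ w + C * xi.sum(), xi

# Hypothetical data with one point on the wrong side of its margin.
X = np.array([[2.0, 2.0], [0.5, 0.5], [1.4, 1.4]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), -2.5

obj, xi = soft_margin_objective(w, b, X, y, C=1.0)
print(xi)      # xi_i > 0 exactly for the points violating their margin
print(obj)     # larger C penalizes those violations more heavily
```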

Lagrangian Duality, cont.
Recall the Primal Problem: $\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
The Dual Problem: $\max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
Theorem (weak duality):
$d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
Theorem (strong duality): Iff there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.

A sketch of strong and weak duality
Now, ignoring $h(x)$ for simplicity, let's look at what's happening graphically in the duality theorems.
$d^* = \max_{\alpha \ge 0} \min_w \left[ f(w) + \alpha\, g(w) \right] \;\le\; \min_w \max_{\alpha \ge 0} \left[ f(w) + \alpha\, g(w) \right] = p^*$
[Figure: the feasible set plotted in the (g(w), f(w)) plane, illustrating the duality gap]

The KKT conditions
If there exists some saddle point of $L$, then the saddle point satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
$\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0, \quad i = 1,\dots,k$
$\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0, \quad i = 1,\dots,l$
$\alpha_i\, g_i(w) = 0, \quad i = 1,\dots,m$
$g_i(w) \le 0, \quad i = 1,\dots,m$
$\alpha_i \ge 0, \quad i = 1,\dots,m$
Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they also give a solution to the primal and the dual problems.
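These conditions are easy to check numerically for any candidate SVM solution. The sketch below does so on a hypothetical two-point problem whose max-margin solution is known by hand.

```python
# Numerical check of the KKT conditions for a hard-margin SVM solution.
import numpy as np

def check_kkt(alpha, w, b, X, y, tol=1e-6):
    g = 1.0 - y * (X @ w + b)                  # g_i(w) = 1 - y_i(w^T x_i + b) <= 0
    return {
        "stationarity":       np.allclose(w, (alpha * y) @ X, atol=tol),
        "primal feasibility": bool(np.all(g <= tol)),
        "dual feasibility":   bool(np.all(alpha >= -tol)),
        "complementary":      bool(np.all(np.abs(alpha * g) <= tol)),
    }

# Two-point toy problem whose exact solution is known by hand:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha, w, b = np.array([0.25, 0.25]), np.array([0.5, 0.5]), 0.0
print(check_kkt(alpha, w, b, X, y))            # every condition holds at the saddle point
```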

The Optimization Problem
The dual of this new constrained optimization problem is
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
s.t. $0 \le \alpha_i \le C,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on $\alpha_i$.
Once again, a QP solver can be used to find $\alpha_i$.

The SMO algorithm
Consider solving the unconstrained opt problem:
We've already seen three opt algorithms!
- Coordinate ascent
- Gradient ascent
- Newton-Raphson
Coordinate ascent: $\alpha_i \leftarrow \arg\max_{\hat\alpha_i} J(\alpha_1, \dots, \hat\alpha_i, \dots, \alpha_m)$

Coordinate ascent
[Figure: contours of a quadratic objective with the zig-zag path taken by coordinate ascent]

Sequential minimal optimization
Constrained optimization:
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
s.t. $0 \le \alpha_i \le C,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
Question: can we do coordinate ascent along one direction at a time (i.e., hold all $\alpha_{[-i]}$ fixed, and update $\alpha_i$)?

The SMO algorithm
Repeat till convergence:
1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Re-optimize $J(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \ne i, j$) fixed.
Will this procedure converge?

Convergence of SMO
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
s.t. (KKT) $0 \le \alpha_i \le C,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
Let's hold $\alpha_3, \dots, \alpha_m$ fixed and reoptimize $J$ w.r.t. $\alpha_1$ and $\alpha_2$.
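Below is a compact sketch of this procedure, in the spirit of the "simplified SMO" often used for teaching: it picks the second index at random rather than with the heuristic described above, uses a linear kernel, and is meant only to show the two-variable update and clipping, not to be a production solver.

```python
# A compact "simplified SMO" sketch (random second index instead of the
# heuristic pair selection on the slide). Linear kernel only.
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    K = X @ X.T                                 # linear kernel matrix
    alpha, b = np.zeros(m), 0.0
    f = lambda i: (alpha * y) @ K[:, i] + b     # current decision value on x_i
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(m) if k != i])   # second index
                E_j = f(j) - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
                # keep b consistent with the updated pair (standard SMO update)
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
                     - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
                     - y[j] * (alpha[j] - a_j_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Example usage on a tiny made-up dataset:
# X = np.array([[2., 2.], [2.5, 1.5], [.5, .5], [1., 0.]]); y = np.array([1., 1., -1., -1.])
# alpha, b = simplified_smo(X, y)
```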

Convergence of SMO
The constraints: with all other $\alpha_k$ held fixed, $\alpha_1 y_1 + \alpha_2 y_2$ is a constant, and $0 \le \alpha_1, \alpha_2 \le C$.
The objective: substituting for $\alpha_1$ makes $J$ a one-dimensional quadratic function of $\alpha_2$.
Constrained opt: maximize that quadratic in closed form, then clip $\alpha_2$ to the feasible box.

Cross-validation error of SVM
The leave-one-out cross-validation error does not depend on the dimensionality of the feature space but only on the number of support vectors!
Leave-one-out CV error $\le$ (# support vectors) / (# of training examples)
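For a quick illustration of the bound, one can fit an off-the-shelf SVM and compare the number of support vectors to the training-set size; the data below are random and purely hypothetical.

```python
# Sketch: the leave-one-out bound (#support vectors / #training examples)
# computed from a fitted scikit-learn SVC on made-up data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
loo_bound = len(clf.support_) / len(y)         # #SV / m upper-bounds the LOO-CV error
print(f"support vectors: {len(clf.support_)}, LOO bound: {loo_bound:.2f}")
```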

Advanced topics in Max-Margin Learning
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
- Kernel
- Point rule or average rule
- Can we predict vec(y)?

Outline
- The Kernel trick
- Maximum entropy discrimination
- Structured SVM, a.k.a. Maximum Margin Markov Networks

(1) Non-linear Decision Boundary
So far, we have only considered large-margin classifiers with a linear decision boundary.
How to generalize it to become nonlinear?
Key idea: transform $x_i$ to a higher-dimensional space to "make life easier"
- Input space: the space the points $x_i$ are located in
- Feature space: the space of $\phi(x_i)$ after transformation
Why transform?
- Linear operation in the feature space is equivalent to non-linear operation in input space
- Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature $x_1 x_2$ makes the problem linearly separable (homework)

The Kernel Trick
Is this data linearly separable?
How about a quadratic mapping $\phi(x_i)$?
[Figure: data that are not linearly separable in the input space become separable after the quadratic mapping]

The Kernel Trick
Recall the SVM optimization problem
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$
s.t. $0 \le \alpha_i \le C,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
The data points only appear as inner products.
As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
Many common geometric operations (angles, distances) can be expressed by inner products.
Define the kernel function $K$ by $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.

II. The Kernel Trick
Computation depends on the feature space: bad if its dimension is much larger than the input space.
$\max_\alpha \; J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
s.t. $\alpha_i \ge 0,\; i = 1,\dots,m$, and $\sum_{i=1}^m \alpha_i y_i = 0$,
where $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$.
$y^* = \mathrm{sign}\left( \sum_{i \in SV} \alpha_i y_i K(x_i, x_{\text{new}}) + b \right)$
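The practical consequence is that prediction only ever needs kernel evaluations against the training points, never $\phi(x)$ itself. A small sketch (the RBF kernel is chosen arbitrarily; $\alpha$, $b$, and the training set are assumed to come from some dual solver):

```python
# Kernelized prediction sketch: only K(x_i, x_new) is needed, never phi(x).
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_predict(X_train, y_train, alpha, b, X_new, kernel=rbf_kernel):
    K = kernel(X_train, X_new)                 # only inner products are needed
    return np.sign((alpha * y_train) @ K + b)
```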

Transforming the Data
[Figure: the mapping $\phi(\cdot)$ sends points from the input space to the feature space]
Note: the feature space is of higher dimension than the input space in practice.
- Computation in the feature space can be costly because it is high-dimensional.
- The feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.

An Example of feature mapping and kernels
Consider an input $x = [x_1, x_2]$.
Suppose $\phi(\cdot)$ is given as follows:
$\phi([x_1, x_2]) = \left[\, 1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2 \,\right]$
An inner product in the feature space is
$\left\langle \phi([x_1, x_2]),\; \phi([x_1', x_2']) \right\rangle = (1 + x^\top x')^2$
So, if we define the kernel function as follows, there is no need to carry out $\phi(\cdot)$ explicitly:
$K(x, x') = (1 + x^\top x')^2$
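This identity is easy to verify numerically; the sketch below checks, for two arbitrary points, that the explicit 6-dimensional mapping and the kernel $(1 + x^\top x')^2$ give the same inner product.

```python
# Numerical check of the slide's example: phi(x) . phi(x') equals (1 + x^T x')^2,
# so the mapping never has to be built explicitly.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.7, -1.2]), np.array([2.0, 0.5])
lhs = phi(x) @ phi(xp)
rhs = (1.0 + x @ xp) ** 2
print(np.isclose(lhs, rhs))   # True
```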

More examples of kernel functions
- Linear kernel (we've seen it): $K(x, x') = x^\top x'$
- Polynomial kernel (we just saw an example): $K(x, x') = (1 + x^\top x')^p$, where $p = 2, 3, \dots$ To get the feature vectors we concatenate all $p$th-order polynomial terms of the components of $x$ (weighted appropriately).
- Radial basis kernel: $K(x, x') = \exp\!\left( -\tfrac{1}{2} \| x - x' \|^2 \right)$. In this case the feature space consists of functions and results in a non-parametric classifier.

The essence of kernels
Feature mapping, but without paying a cost
- E.g., polynomial kernel
- How many dimensions have we got in the new space?
- How many operations does it take to compute K()?
Kernel design, any principle?
- K(x, z) can be thought of as a similarity function between x and z.
- This intuition is well reflected in the Gaussian function above (similarly, one can easily come up with other K() in the same spirit).
- Does this necessarily lead to a "legal" kernel? (In the above particular case, K() is a legal one; do you know how many dimensions $\phi(x)$ has?)

Kernel matrix
Suppose for now that K is indeed a valid kernel corresponding to some feature mapping $\phi$. Then for $x_1, \dots, x_m$ we can compute an $m \times m$ matrix $K$, where $K_{ij} = \phi(x_i)^\top \phi(x_j)$.
This is called a kernel matrix!
Now, if a kernel function is indeed a valid kernel, and its elements are dot products in the transformed feature space, it must satisfy:
- Symmetry: $K = K^\top$ (proof?)
- Positive semidefiniteness (proof?)

Mercer kernel
Theorem (Mercer): K is a valid kernel if and only if, for any finite set $\{x_1, \dots, x_m\}$, the corresponding kernel matrix is symmetric positive semidefinite.
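Both properties can be checked numerically for a candidate kernel on a finite sample. This sketch builds an RBF Gram matrix on random (hypothetical) points and tests symmetry and positive semidefiniteness via its eigenvalues.

```python
# Sketch: build a Gram matrix for the RBF kernel on random points and check
# the two conditions from the slide: symmetry and positive semidefiniteness.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)                           # kernel matrix K_ij = K(x_i, x_j)

print(np.allclose(K, K.T))                      # symmetry: K = K^T
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # all eigenvalues >= 0 (PSD, up to rounding)
```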

SVM examples
[Figure: decision boundaries produced by SVMs with different kernels]

(2) Model averaging
Inputs $x$, class $y = +1, -1$; data $D = \{ (x_1, y_1), \dots, (x_m, y_m) \}$
Point Rule:
- learn $f_{\text{opt}}(x)$, a discriminant function from $F = \{f\}$, a family of discriminants
- classify $y = \mathrm{sign}\, f_{\text{opt}}(x)$
- E.g., SVM

Model averaging
There exist many $f$ with near-optimal performance.
Instead of choosing $f_{\text{opt}}$, average over all $f$ in $F$:
- $Q(f)$ = weight of $f$
How to specify:
- $F = \{f\}$, the family of discriminant functions?
- How to learn $Q(f)$, a distribution over $F$?

Recall Bayesian Inference
Bayesian learning: $p(f \mid D) \propto p_0(f)\, p(D \mid f)$
Bayes Predictor (model averaging): $\hat y = \mathrm{sign} \int Q(f)\, f(x)\, df$
Recall in SVM:
What $p_0$?

How to score distributions? Entropy
Entropy $H(X)$ of a random variable $X$:
$H(X) = -\sum_i P(X = i) \log_2 P(X = i)$
$H(X)$ is the expected number of bits needed to encode a randomly drawn value of $X$ (under the most efficient code).
Why? Information theory: the most efficient code assigns $-\log_2 P(X = i)$ bits to encode the message $X = i$. So, the expected number of bits to code one random $X$ is $\sum_i -P(X = i) \log_2 P(X = i)$.

Maximum Entropy Discrimination
Given a data set, find the solution $Q_{\text{ME}}$ that:
- correctly classifies $D$
- among all admissible $Q$, $Q_{\text{ME}}$ has maximum entropy
Max entropy ⇒ "minimum assumption" about $f$
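A direct translation of this definition into code, with a few hypothetical distributions to make the "expected number of bits" reading concrete:

```python
# Sketch: entropy H(X) = -sum_i P(X=i) log2 P(X=i) for a small discrete distribution.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # 0 log 0 is taken to be 0
    return -(p * np.log2(p)).sum()

print(entropy([0.5, 0.5]))                # 1 bit: a fair coin
print(entropy([0.25] * 4))                # 2 bits: uniform over 4 values
print(entropy([0.9, 0.1]))                # < 1 bit: a biased coin is cheaper to encode
```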

Introducing Priors
Prior $Q_0(f)$
Minimum Relative Entropy Discrimination
Convex problem: $Q_{\text{MRE}}$ is the unique solution
MRE ⇒ "minimum additional assumption" over $Q_0$ about $f$

Solution: $Q_{\text{ME}}$ as a projection
Convex problem: $Q_{\text{ME}}$ unique
[Figure: $Q_{\text{ME}}$ is the projection of the uniform distribution ($\alpha = 0$) onto the admissible set; the MRE solution projects the prior $Q_0$ instead]
Lagrange multipliers: to find $Q_{\text{ME}}$, start with $\alpha = 0$ and follow the gradient of unsatisfied constraints.

Solution to MED
Theorem (Solution to MED):
- Posterior Distribution:
- Dual Optimization Problem:
Algorithm: to compute $\alpha_t$, $t = 1, \dots$
- start with $\alpha_t = 0$ (uniform distribution)
- iterative ascent on $J(\alpha)$ until convergence

Examples: SVMs
Theorem: For $f(x) = w^\top x + b$, $Q_0(w) = \mathcal{N}(0, I)$, and $Q_0(b)$ a non-informative prior, the Lagrange multipliers $\alpha$ are obtained by maximizing $J(\alpha)$ subject to $0 \le \alpha_t \le C$ and $\sum_t \alpha_t y_t = 0$.
- Separable data $D$: the SVM is recovered exactly.
- Inseparable data $D$: the SVM is recovered with a different misclassification penalty.

SVM extensions
Example: Leptograpsus Crabs (5 inputs, $N_{\text{train}} = 80$, $N_{\text{test}} = 120$)
- Linear SVM
- Max Likelihood Gaussian
- MRE Gaussian
[Figure: decision boundaries of the three classifiers on the crabs data]

(3) Structured Prediction
Unstructured prediction
Structured prediction:
- Part-of-speech tagging: "Do you want sugar in it?" ⇒ <verb pron verb noun prep pron>
- Image segmentation

OCR example
[Figure: a handwritten word $x$ mapped to its label sequence $y$ = "brace"]
Sequential structure: the word is a chain of letter images $x_1, \dots, x_5$ with labels $y_1, \dots, y_5$.

Classical Classification Models
Inputs: a set of training samples $D = \{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$
Outputs: a predictive function $h(x)$
Examples:
- SVM: $h(x) = \mathrm{sign}(w^\top x + b)$
- Logistic Regression: $h(x) = \arg\max_y p(y \mid x; w)$

Structured Models
$h(x) = \arg\max_{y \in \mathcal{Y}(x)} F(x, y; w)$
Assumptions:
- $\mathcal{Y}(x)$: space of feasible outputs
- $F(x, y; w)$: discriminant function
Linear combination of features: $F(x, y; w) = w^\top f(x, y)$
Sum of partial scores: $F(x, y; w) = \sum_p w^\top f_p(x, y_p)$, where the index $p$ represents a part in the structure
Random fields or Markov network features:

Discriminative Learning Strategies
Max Conditional Likelihood:
- We predict based on: $y^* \mid x = \arg\max_y p_w(y \mid x) = \arg\max_y \frac{1}{Z(w, x)} \exp\left\{ \sum_c w_c f_c(x, y_c) \right\}$
- And we learn based on: $w^* \mid \{y_i, x_i\} = \arg\max_w \prod_i p_w(y_i \mid x_i)$
Max Margin:
- We predict based on: $y^* \mid x = \arg\max_y \sum_c w_c f_c(x, y_c) = \arg\max_y w^\top f(x, y)$
- And we learn based on: $w^* \mid \{y_i, x_i\} = \arg\max_w \min_{i,\; y \ne y_i} \left( w^\top f(y_i, x_i) - w^\top f(y, x_i) \right)$
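For chain-structured outputs such as the OCR example, the sum-of-partial-scores assumption is what makes the arg max tractable: a Viterbi-style dynamic program over per-position and transition scores. A small sketch with made-up score matrices:

```python
# Sketch of structured prediction on a chain: the score decomposes into unary
# and pairwise parts, so arg max over the exponentially many label sequences
# reduces to a simple dynamic program.
import numpy as np

def chain_argmax(unary, pairwise):
    """unary: (T, K) scores per position/label; pairwise: (K, K) transition scores."""
    T, K = unary.shape
    best = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = best[:, None] + pairwise + unary[t][None, :]   # (prev label, cur label)
        back[t] = scores.argmax(axis=0)
        best = scores.max(axis=0)
    y = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1], float(best.max())

# Hypothetical scores for a 5-letter word over a 3-letter alphabet.
rng = np.random.default_rng(2)
labels, score = chain_argmax(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)))
print(labels, score)
```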

E.g. Max-Margin Markov Networks
Convex Optimization Problem:
Feasible subspace of weights:
Predictive Function: $h(x) = \arg\max_y w^\top f(x, y)$

OCR Example
We want: $\arg\max_{\text{word}} w^\top f(x, \text{word}) = $ "brace"
Equivalently:
$w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"aaaaa"})$
$w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"aaaab"})$
...
$w^\top f(x, \text{"brace"}) > w^\top f(x, \text{"zzzzz"})$
... a lot of constraints!

Min-max Formulation
Brute force enumeration of constraints:
- The constraints are exponential in the size of the structure
Alternative: min-max formulation
- add only the most violated constraint
- handles more general loss functions
- only a polynomial # of constraints needed
- several algorithms exist

Results: Handwriting Recognition
- Length: ~8 chars; Letters: 16x8 pixels
- 10-fold train/test; 5000/50000 letters; 600/6000 words
Models: Multiclass-SVMs* (*Crammer & Singer 01), M^3 nets
[Figure: test error (average per-character) for raw pixels, quadratic kernel, and cubic kernel; M^3 nets give a 33% error reduction over multiclass SVMs]
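A rough sketch of the "most violated constraint" idea: loss-augmented inference proposes one candidate output per example per round, and the weights are updated only when that constraint is violated. For brevity this uses a plain perceptron-style update instead of re-solving a QP over a working set, and the output space is a tiny multiclass one so the arg max is a simple enumeration (for chains it could reuse the Viterbi sketch above); all data and feature choices are made up.

```python
# Rough sketch of selecting and enforcing only the most violated constraint.
import numpy as np

def joint_features(x, y, n_classes):
    f = np.zeros(n_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x           # class-indexed copy of x
    return f

def most_violated(w, x, y_true, n_classes):
    # arg max over y of [loss(y_true, y) + w . f(x, y)]  (loss-augmented inference)
    scores = [(y != y_true) + w @ joint_features(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

def predict(w, x, n_classes):
    return int(np.argmax([w @ joint_features(x, y, n_classes) for y in range(n_classes)]))

def learn(data, n_classes, dim, rounds=50, lr=0.1):
    w = np.zeros(n_classes * dim)
    for _ in range(rounds):
        for x, y_true in data:
            y_hat = most_violated(w, x, y_true, n_classes)
            margin = w @ joint_features(x, y_true, n_classes) - w @ joint_features(x, y_hat, n_classes)
            if margin < (y_hat != y_true):        # constraint violated: margin < loss
                # perceptron-style update on the single most violated constraint
                w += lr * (joint_features(x, y_true, n_classes) - joint_features(x, y_hat, n_classes))
    return w

rng = np.random.default_rng(3)
data = [(rng.normal(loc=c, size=4), c) for c in range(3) for _ in range(10)]
w = learn(data, n_classes=3, dim=4)
print(sum(predict(w, x, 3) == y for x, y in data), "of", len(data), "training examples correct")
```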

Discriminative Learning Paradigms
[Figure: diagram relating SVM, M^3N, and MED, using the "b r a c e" example; MED-MN = SMED + Bayesian M^3N]
See [Zhu and Xing, 2008]

Summary
Maximum margin nonlinear separator
- Kernel trick: project into a linearly separable space (possibly high- or infinite-dimensional)
- No need to know the explicit projection function
Max-entropy discrimination
- Average rule for prediction; the average is taken over a posterior distribution of $w$, which defines the separation hyperplane
- $P(w)$ is obtained by the max-entropy or min-KL principle, subject to expected margin constraints on the training examples
Max-margin Markov network
- Multivariate, rather than univariate, output Y
- Variables in the output are not independent of each other (structured input/output)
- Margin constraint over every possible configuration of Y (exponentially many!)