Support-Vector Machines

Introduction

The support vector machine is a linear machine with some very nice properties. Haykin chapter 6; see Alpaydin chapter 13 for similar content. Note: part of this lecture drew material from Ricardo Gutierrez-Osuna's Pattern Analysis lectures.

The basic idea of the SVM is to construct a separating hyperplane for which the margin of separation between positive and negative examples is maximized.

Principled derivation (structural risk minimization): the error rate is bounded by (1) the training error rate and (2) the VC dimension of the model. The SVM makes (1) become zero and minimizes (2).

Optimal Hyperplane

For linearly separable patterns $\{(x_i, d_i)\}_{i=1}^N$ (with $d_i \in \{+1, -1\}$), the separating hyperplane is $w^T x + b = 0$, with

    $w^T x_i + b \ge 0$  for $d_i = +1$
    $w^T x_i + b < 0$  for $d_i = -1$.

Let $w_o$ denote the weight vector of the optimal hyperplane and $b_o$ the optimal bias.

Distance to the Optimal Hyperplane

[Figure: a point $x_i$ at distance $r$ from the optimal hyperplane, which lies at distance $d$ from the origin; $\theta$ is the angle between $x_i$ and $w_o$.]

For a point $x_i$ on the hyperplane, $w_o^T x_i = -b_o$, so the distance from the origin to the hyperplane is calculated as

    $d = \|x_i\| \cos(x_i, w_o) = \frac{-b_o}{\|w_o\|},$

since $w_o^T x_i = \|w_o\| \, \|x_i\| \cos(w_o, x_i) = -b_o$.
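The distance formula above is easy to check numerically. Below is a minimal NumPy sketch (not from the lecture; the weight vector, bias, and points are made-up values for illustration) that computes the distance of the hyperplane from the origin and the signed distance $(w^T x + b)/\|w\|$ for a few points.

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 in 2-D (values chosen for illustration).
w = np.array([3.0, 4.0])   # weight vector, ||w|| = 5
b = -10.0                  # bias

# Distance from the origin to the hyperplane: |b| / ||w||  (here -b/||w|| since b < 0).
d_origin = -b / np.linalg.norm(w)
print("distance from origin:", d_origin)        # 2.0

# Signed distance r = (w^T x + b) / ||w||: positive on the d = +1 side,
# negative on the d = -1 side, zero on the hyperplane itself.
X = np.array([[2.0, 1.0],    # w^T x + b = 0  -> on the hyperplane
              [4.0, 3.0],    # positive side
              [0.0, 0.0]])   # negative side (the origin)
r = (X @ w + b) / np.linalg.norm(w)
print("signed distances:", r)                   # [ 0.   2.8  -2. ]
```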

Distance to the Optimal Hyperplane (cont'd)

The distance $r$ from an arbitrary point $x$ to the hyperplane can be calculated as follows.

When the point is in the positive region:

    $r = \|x\| \cos(x, w_o) - d = \frac{x^T w_o}{\|w_o\|} + \frac{b_o}{\|w_o\|} = \frac{w_o^T x + b_o}{\|w_o\|}$

When the point is in the negative region:

    $r = d - \|x\| \cos(x, w_o) = -\frac{x^T w_o}{\|w_o\|} - \frac{b_o}{\|w_o\|} = -\frac{w_o^T x + b_o}{\|w_o\|}$

Optimal Hyperplane and Support Vectors

[Figure: optimal hyperplane with margin of separation $\rho$; the circled input points closest to it are the support vectors.]

Support vectors: the input points closest to the separating hyperplane.

Margin of separation $\rho$: the distance between the separating hyperplane and the closest input point.

Optimal Hyperplane and Support Vectors (cont'd)

The optimal hyperplane is supposed to maximize the margin of separation $\rho$. With that requirement, we can write the conditions that $w_o$ and $b_o$ must meet:

    $w_o^T x_i + b_o \ge +1$  for $d_i = +1$
    $w_o^T x_i + b_o \le -1$  for $d_i = -1$

Note the +1 and -1; support vectors are those $x^{(s)}$ for which equality holds (i.e., $w_o^T x^{(s)} + b_o = +1$ or $-1$). Since $r = (w_o^T x + b_o)/\|w_o\|$,

    $r = 1/\|w_o\|$ if $d = +1$,    $r = -1/\|w_o\|$ if $d = -1$.

Optimal Hyperplane and Support Vectors (cont'd)

The margin of separation between the two classes is

    $\rho = 2r = \frac{2}{\|w_o\|}.$

Thus, maximizing the margin of separation between the two classes is equivalent to minimizing the Euclidean norm of the weight $w_o$!
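To make the canonical-form conditions concrete, here is a small NumPy sketch (my own illustration; the six points and the hand-scaled $(w, b)$ are made up) that verifies $d_i(w^T x_i + b) \ge 1$ for every point, reads off the margin $\rho = 2/\|w\|$, and lists the support vectors, i.e., the points where equality holds.

```python
import numpy as np

# Toy 2-D data (made up): positive class d = +1 and negative class d = -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 1.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
d = np.array([+1, +1, +1, -1, -1, -1])

# A separating hyperplane scaled to canonical form, min_i d_i(w^T x_i + b) = 1
# (hand-picked for this toy set; it is the maximum-margin separator here).
w = np.array([2.0 / 3.0, 2.0 / 3.0])
b = -5.0 / 3.0

functional_margin = d * (X @ w + b)
print(functional_margin)                       # every entry >= 1
assert np.all(functional_margin >= 1 - 1e-9)

# Margin of separation rho = 2 / ||w||.
print("rho =", 2 / np.linalg.norm(w))          # ~2.12

# Support vectors: points where the constraint is active (equality holds).
support = np.isclose(functional_margin, 1.0)
print("support vectors:\n", X[support])        # (2,2), (3,1), (1,0), (0,1)
```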

Primal Problem: Constrained Optimization

For the training set $T = \{(x_i, d_i)\}_{i=1}^N$, find $w$ and $b$ such that they minimize a certain value ($1/\rho$) while satisfying a constraint (all examples are correctly classified):

    Constraint: $d_i(w^T x_i + b) \ge 1$ for $i = 1, 2, \ldots, N$.
    Cost function: $\Phi(w) = \frac{1}{2} w^T w$.

This problem can be solved using the method of Lagrange multipliers (see the next two slides).

Mathematical Aside: Lagrange Multipliers

Turn a constrained optimization problem into an unconstrained optimization problem by absorbing the constraints into the cost function, weighted by the Lagrange multipliers.

Example: find the point on the circle $x^2 + y^2 = 1$ closest to the point $(2, 2)$ (adapted from Ballard, An Introduction to Natural Computation, 1997, pp. 119-120). Minimize $F(x, y) = (x - 2)^2 + (y - 2)^2$ subject to the constraint $x^2 + y^2 - 1 = 0$. Absorb the constraint into the cost function, after multiplying by the Lagrange multiplier $\alpha$:

    $F(x, y, \alpha) = (x - 2)^2 + (y - 2)^2 + \alpha (x^2 + y^2 - 1).$

Lagrange Multipliers (cont'd)

We must find $x$, $y$, $\alpha$ that minimize $F(x, y, \alpha) = (x - 2)^2 + (y - 2)^2 + \alpha (x^2 + y^2 - 1)$. Set the partial derivatives to 0 and solve the system of equations:

    $\frac{\partial F}{\partial x} = 2(x - 2) + 2\alpha x = 0$
    $\frac{\partial F}{\partial y} = 2(y - 2) + 2\alpha y = 0$
    $\frac{\partial F}{\partial \alpha} = x^2 + y^2 - 1 = 0$

Solve for $x$ and $y$ in the first and second equations, and plug those into the third equation:

    $x = y = \frac{2}{1 + \alpha}$, so $\left(\frac{2}{1 + \alpha}\right)^2 + \left(\frac{2}{1 + \alpha}\right)^2 = 1,$

from which we get $\alpha = 2\sqrt{2} - 1$. Thus, $(x, y) = (1/\sqrt{2}, 1/\sqrt{2})$.

Primal Problem: Constrained Optimization (cont'd)

Putting the constrained optimization problem into the Lagrangian form, we get (utilizing the Kuhn-Tucker theorem)

    $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i \left[ d_i (w^T x_i + b) - 1 \right].$

From $\frac{\partial J(w, b, \alpha)}{\partial w} = 0$:   $w = \sum_{i=1}^N \alpha_i d_i x_i.$

From $\frac{\partial J(w, b, \alpha)}{\partial b} = 0$:   $\sum_{i=1}^N \alpha_i d_i = 0.$
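The circle example can be sanity-checked numerically. The sketch below (my own illustration, not from the lecture) uses scipy's SLSQP solver to minimize $(x - 2)^2 + (y - 2)^2$ subject to $x^2 + y^2 = 1$ and compares the result with the closed-form answer $(1/\sqrt{2}, 1/\sqrt{2})$.

```python
import numpy as np
from scipy.optimize import minimize

# Objective: squared distance from (x, y) to the target point (2, 2).
def objective(p):
    x, y = p
    return (x - 2.0) ** 2 + (y - 2.0) ** 2

# Equality constraint: the point must lie on the unit circle, x^2 + y^2 - 1 = 0.
constraint = {"type": "eq", "fun": lambda p: p[0] ** 2 + p[1] ** 2 - 1.0}

result = minimize(objective, x0=[1.0, 0.0], method="SLSQP", constraints=[constraint])

print("numerical solution:", result.x)                    # ~[0.7071, 0.7071]
print("closed form:       ", [1 / np.sqrt(2), 1 / np.sqrt(2)])
```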

Primal Problem: Constrained Optimization (cont'd)

Note that when the optimal solution is reached, the following condition must hold (the Karush-Kuhn-Tucker complementarity condition) for all $i = 1, 2, \ldots, N$:

    $\alpha_i \left[ d_i (w^T x_i + b) - 1 \right] = 0.$

Thus, nonzero $\alpha_i$ can be attained only when $d_i (w^T x_i + b) - 1 = 0$, i.e., when $\alpha_i$ is associated with a support vector $x^{(s)}$! Other conditions include $\alpha_i \ge 0$.

Primal Problem: Constrained Optimization (cont'd)

Plugging $w = \sum_{i=1}^N \alpha_i d_i x_i$ and $\sum_{i=1}^N \alpha_i d_i = 0$ back into $J(w, b, \alpha)$, we get the dual problem:

    $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$
    $= \frac{1}{2} w^T w - \sum_{i=1}^N \alpha_i d_i w^T x_i - b \sum_{i=1}^N \alpha_i d_i + \sum_{i=1}^N \alpha_i$

Noting $w^T w = \sum_{i=1}^N \alpha_i d_i w^T x_i$ and $\sum_{i=1}^N \alpha_i d_i = 0$,

    $= -\frac{1}{2} \sum_{i=1}^N \alpha_i d_i w^T x_i + \sum_{i=1}^N \alpha_i$
    $= -\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j d_i d_j x_i^T x_j + \sum_{i=1}^N \alpha_i = Q(\alpha).$

So, $J(w, b, \alpha) = Q(\alpha)$ (with $\alpha_i \ge 0$). This results in the dual problem (next slide).

Dual Problem

Given the training sample $\{(x_i, d_i)\}_{i=1}^N$, find the Lagrange multipliers $\{\alpha_i\}_{i=1}^N$ that maximize the objective function

    $Q(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j d_i d_j x_i^T x_j$

subject to the constraints

    $\sum_{i=1}^N \alpha_i d_i = 0$
    $\alpha_i \ge 0$ for all $i = 1, 2, \ldots, N$.

The problem is stated entirely in terms of the training data $(x_i, d_i)$, and the dot products $x_i^T x_j$ play a key role.

Solution to the Optimization Problem

Once all the optimal Lagrange multipliers $\alpha_{o,i}$ are found (using sequential minimal optimization, etc.), $w_o$ and $b_o$ can be found as follows:

    $w_o = \sum_{i=1}^N \alpha_{o,i} d_i x_i$

and, from $w_o^T x_i + b_o = d_i$ when $x_i$ is a support vector,

    $b_o = d^{(s)} - w_o^T x^{(s)}.$

Note: calculating the final estimated function does not need any explicit calculation of $w_o$, since it can be computed from dot products between the input vectors:

    $w_o^T x = \sum_{i=1}^N \alpha_{o,i} d_i x_i^T x.$
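To see the dual in action, here is a small sketch (my own illustration; the four training points are made up, and a general-purpose SLSQP solver is used instead of SMO purely for simplicity). It maximizes $Q(\alpha)$ under $\sum_i \alpha_i d_i = 0$ and $\alpha_i \ge 0$, then recovers $w_o = \sum_i \alpha_{o,i} d_i x_i$ and $b_o = d^{(s)} - w_o^T x^{(s)}$ from one support vector.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made up): two points per class in 2-D.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(d)

# H[i, j] = d_i d_j x_i^T x_j; the dual objective is Q(a) = sum(a) - 0.5 a^T H a.
H = (d[:, None] * X) @ (d[:, None] * X).T

neg_Q = lambda a: 0.5 * a @ H @ a - a.sum()        # minimize -Q(alpha)
cons = {"type": "eq", "fun": lambda a: a @ d}      # sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                         # alpha_i >= 0

res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP", bounds=bounds, constraints=[cons])
alpha = res.x
print("alpha:", np.round(alpha, 4))                # nonzero only for support vectors

# Recover w_o and b_o from the multipliers.
w = ((alpha * d)[:, None] * X).sum(axis=0)
s = int(np.argmax(alpha))                          # index of one support vector
b = d[s] - w @ X[s]
print("w:", w, " b:", b)
print("check d_i (w^T x_i + b):", d * (X @ w + b)) # >= 1, with equality at the SVs
```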

Margin of Separation in SVM and VC Dimension

Statistical learning theory shows that it is desirable to reduce both the error (empirical risk) and the VC dimension of the classifier. Vapnik (1995, 1998) showed the following. Let $D$ be the diameter of the smallest ball containing all input vectors $x_i$. The set of optimal hyperplanes defined by $w_o^T x + b_o = 0$ has a VC dimension $h$ bounded from above as

    $h \le \min\left\{ \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right\} + 1$

where $\lceil \cdot \rceil$ is the ceiling, $\rho$ is the margin of separation (equal to $2/\|w_o\|$), and $m_0$ is the dimensionality of the input space. The implication is that the VC dimension can be controlled independently of $m_0$, by choosing an appropriate (large) $\rho$!

Soft-Margin Classification

[Figure: optimal hyperplane with margin $\rho$ and support vectors; some points lie inside the margin but are correctly classified, others lie inside the margin and are incorrectly classified.]

Some problems can violate the condition

    $d_i (w^T x_i + b) \ge 1.$

We can introduce a new set of variables $\{\xi_i\}_{i=1}^N$:

    $d_i (w^T x_i + b) \ge 1 - \xi_i$

where $\xi_i$ is called the slack variable.

Soft-Margin Classification (cont'd)

We want to find a separating hyperplane that minimizes

    $\Phi(\xi) = \sum_{i=1}^N I(\xi_i - 1)$

where $I(\xi) = 0$ if $\xi \le 0$ and $1$ otherwise. Solving the above is NP-complete, so we instead solve an approximation:

    $\Phi(\xi) = \sum_{i=1}^N \xi_i.$

Soft-Margin Classification: Solution

Following a similar route involving Lagrange multipliers, and the more restrictive condition $0 \le \alpha_i \le C$, we get the solution:

    $w_o = \sum_{i=1}^{N_s} \alpha_{o,i} d_i x_i$
    $b_o = d_i (1 - \xi_i) - w_o^T x_i$

Furthermore, the weight vector is factored into the cost function:

    $\Phi(w, \xi) = \underbrace{\frac{1}{2} w^T w}_{\text{controls VC dim}} + \underbrace{C \sum_{i=1}^N \xi_i}_{\text{controls error}}$

with a control parameter $C$.
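The role of the slack variables and of $C$ can be seen directly with a standard soft-margin solver. The sketch below (my own illustration on made-up, slightly overlapping data) fits scikit-learn's SVC with a linear kernel for two values of $C$, computes the slacks $\xi_i = \max(0, 1 - d_i(w^T x_i + b))$, and prints the margin $2/\|w\|$; a smaller $C$ tolerates more slack and typically gives a wider margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up 2-D data with some class overlap, so slack variables are needed.
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(40, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))
X = np.vstack([X_pos, X_neg])
d = np.hstack([np.ones(40), -np.ones(40)])

for C in [0.1, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, d)
    w = clf.coef_.ravel()
    b = clf.intercept_[0]

    # Slack variables: xi_i = max(0, 1 - d_i (w^T x_i + b)).
    xi = np.maximum(0.0, 1.0 - d * (X @ w + b))

    print(f"C={C:6.1f}  margin={2 / np.linalg.norm(w):.3f}  "
          f"sum(xi)={xi.sum():.2f}  #SV={len(clf.support_)}")
```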

Nonlinear SVM

[Figure: input-space points $x$, $x_i$ mapped via $\varphi(\cdot)$ to the feature-space points $\varphi(x)$, $\varphi(x_i)$.]

The input is mapped to $\varphi(x)$:

    1. Nonlinear mapping of an input vector to a high-dimensional feature space (exploit Cover's theorem).
    2. Construction of an optimal hyperplane for separating the features identified in the above step.

Inner-Product Kernel

With the weight $w$ (including the bias $b$), the decision surface in the feature space becomes (assume $\varphi_0(x) = 1$):

    $w^T \varphi(x) = 0.$

Using the steps in the linear SVM, we get

    $w = \sum_{i=1}^N \alpha_i d_i \varphi(x_i).$

Combining the above two, we get the decision surface

    $\sum_{i=1}^N \alpha_i d_i \varphi^T(x_i) \varphi(x) = 0.$

Inner-Product Kernel (cont'd)

The inner product $\varphi^T(x) \varphi(x_i)$ is between two vectors in the feature space. The calculation of this inner product can be simplified by the use of an inner-product kernel $K(x, x_i)$:

    $K(x, x_i) = \varphi^T(x) \varphi(x_i) = \sum_{j=0}^{m_1} \varphi_j(x) \varphi_j(x_i)$

where $m_1$ is the dimension of the feature space. (Note: $K(x, x_i) = K(x_i, x)$.)

Inner-Product Kernel (cont'd)

Mercer's theorem states that a kernel $K(x, x_i)$ satisfying certain conditions (continuous, symmetric, positive semi-definite) can be expressed in terms of an inner product in a nonlinearly mapped feature space. The kernel function $K(x, x_i)$ allows us to calculate the inner product $\varphi^T(x) \varphi(x_i)$ in the mapped feature space without any explicit calculation of the mapping function $\varphi(\cdot)$. So, the optimal hyperplane becomes:

    $\sum_{i=1}^N \alpha_i d_i K(x, x_i) = 0.$
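The decision surface $\sum_i \alpha_i d_i K(x, x_i) + b = 0$ can be reproduced from a fitted kernel SVM. The sketch below (my own illustration on made-up ring-vs-blob data) fits scikit-learn's SVC with an RBF kernel, then recomputes the decision function by hand from the stored support vectors and the signed multipliers $\alpha_{o,i} d_i$ (exposed as dual_coef_), and checks that it matches the library's decision_function.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Made-up 2-D data that is not linearly separable (inner blob vs. outer ring).
X_in = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
X_out = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(scale=0.3, size=(50, 2))
X = np.vstack([X_in, X_out])
d = np.hstack([np.ones(50), -np.ones(50)])

gamma = 0.5  # RBF kernel K(x, x') = exp(-gamma ||x - x'||^2), i.e. sigma^2 = 1/(2*gamma)
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, d)

# Manual decision function: f(x) = sum_i (alpha_i d_i) K(x, x_i) + b over support vectors.
def decision(x):
    sq_dist = np.sum((clf.support_vectors_ - x) ** 2, axis=1)
    K = np.exp(-gamma * sq_dist)
    return (clf.dual_coef_ @ K + clf.intercept_).item()

x_test = np.array([0.2, -0.1])
print("manual :", decision(x_test))
print("sklearn:", clf.decision_function(x_test.reshape(1, -1))[0])   # same value
```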

Examples of Kernel Functions

    Linear: $K(x, x_i) = x^T x_i$.
    Polynomial: $K(x, x_i) = (x^T x_i + 1)^p$.
    RBF: $K(x, x_i) = \exp\left( -\frac{1}{2\sigma^2} \|x - x_i\|^2 \right)$.
    Two-layer perceptron: $K(x, x_i) = \tanh(\beta_0 x^T x_i + \beta_1)$ (for some $\beta_0$ and $\beta_1$).

Expanding Kernel Example

Take $K(x, x_i) = (1 + x^T x_i)^2$ with $x = [x_1, x_2]^T$ and $x_i = [x_{i1}, x_{i2}]^T$:

    $K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
    $= [1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2] \; [1,\ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\, x_{i1},\ \sqrt{2}\, x_{i2}]^T$
    $= \varphi(x)^T \varphi(x_i),$

where $\varphi(x) = [1,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2]^T$.

Nonlinear SVM: Solution

The solution is basically the same as in the linear case, with $x^T x_i$ replaced by $K(x, x_i)$ and the additional constraint $\alpha_i \le C$ added.

Nonlinear SVM Summary

Project the input to a high-dimensional space to turn the problem into a linearly separable one.

Issues with a projection to a higher-dimensional feature space:

    Statistical problem: danger of invoking the curse of dimensionality and a higher chance of overfitting. Use large margins to reduce the VC dimension.
    Computational problem: the computational overhead of calculating the mapping $\varphi(\cdot)$. Solve by using the kernel trick.
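The expansion above is easy to verify numerically. Below is a small sketch (my own check, with arbitrary test vectors) that compares $(1 + x^T x_i)^2$ against the explicit feature map $\varphi(x)^T \varphi(x_i)$.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x^T z)^2 in 2-D."""
    x1, x2 = v
    return np.array([1.0,
                     x1 ** 2,
                     np.sqrt(2) * x1 * x2,
                     x2 ** 2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

# Arbitrary test vectors.
x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])

kernel_value = (1.0 + x @ z) ** 2          # kernel trick: no explicit mapping needed
explicit_value = phi(x) @ phi(z)           # inner product in the 6-D feature space

print(kernel_value, explicit_value)        # identical (up to floating point)
assert np.isclose(kernel_value, explicit_value)
```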