
Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, IA 52242-1527
andrew-kusiak@uiowa.edu
http://www.icaen.uiowa.edu/~ankusiak
(Based on the material provided by Professor V. Kecman)

SVM

The maximal margin classifier is similar to the perceptron:
- It also assumes that the data points are linearly separable.
- It aims at finding the separating hyperplane with the maximal geometric margin (not just any one, which is typical of a perceptron).

[Figure: two plots in the (x1, x2) plane showing Class 1 (y = +1) and Class 2 (y = -1) divided by separating lines, i.e., decision boundaries, i.e., hyperplanes; one separation has a small margin, the other a large margin.]

The larger the margin, the smaller the probability of misclassification.

SVM: Terminology 1(6)

Before introducing the formal (constructive) part of statistical learning theory, the terminology is defined. Vapnik and Chervonenkis introduced a nested set of hypothesis (a.k.a. approximating or decision) functions:

  H_1 ⊂ H_2 ⊂ ... ⊂ H_{n-1} ⊂ H_n

SVM: Terminology 2(6)

- Approximation or training error ~ Empirical risk ~ Bias
- Estimation error ~ Variance ~ Confidence of the training error ~ VC confidence interval
- Generalization (true, expected) error ~ Bound on test error ~ Guaranteed or true risk

SVM: Terminology 3(6)

[Figure: error (risk) plotted against model capacity h ~ n, with underfitting and overfitting regions. Labeled are the approximation or training error e_app ~ bias, the estimation error e_est ~ variance (confidence, bound on the test error), the generalization or true error e_gen ~ guaranteed or true risk, the hypothesis spaces of increasing complexity H_1 ⊂ H_2 ⊂ ... ⊂ H_{n-1} ⊂ H_n, the target space, and the functions f_o and f_n.]

SVM: Terminology 4(6)

- Decision functions and/or hyperplanes and/or hypersurfaces
- Discriminant functions and/or hyperplanes and/or hypersurfaces
- Decision boundaries (hyperplanes, hypersurfaces)
- Separation lines, functions and/or hyperplanes and/or hypersurfaces

SVM: Terminology 5(6)

Downloadable software illustrates some SVM relationships. Input space and feature space are used. More recently, SVM developers introduced the feature space, analogous to the NN hidden layer, or imaginary z-space.

SVM: Terminology 6(6)

[Figure: the desired value y (+1, -1) plotted over the input plane (x1, x2). The indicator function i_F(x, w, b) = sign(d) is basically a threshold function. The decision boundary, or separating line, is the intersection of d(x, w, b) and the input plane (x1, x2): d = w^T x + b = 0. Stars denote support vectors. The optimal separating hyperplane d(x, w, b) is an argument of the indicator function.]

More similarities between NNs and SVMs 1(3)

Classic multilayer perceptron (closeness to data):

  E = Σ_{i=1}^{P} (d_i - f(x_i, w))^2

Regularization (RBF) NN (closeness to data + smoothness):

  E = Σ_{i=1}^{P} (d_i - f(x_i, w))^2 + λ ||Pf||^2

Support Vector Machine (closeness to data + capacity of machine):

  E = Σ_{i=1}^{ℓ} L_εi + λ ||Pf||^2 = Σ_{i=1}^{ℓ} L_εi + Ω(ℓ, h)

In the last expression, h is a control parameter for minimizing the generalization error E (i.e., the risk R).

More similarities between NNs and SVMs 2(3)

There are two basic, constructive approaches to the minimization of the previous equations (Vapnik, 1995 and 1998):
1. Select an appropriate structure (order of polynomials, number of HL neurons, number of rules in the Fuzzy Logic model) and keep the confidence interval fixed. This way the training error (i.e., empirical risk) is minimized, or
2. Keep the value of the training error fixed (equal to zero or at some acceptable level) and minimize the confidence interval.

More similarities between NNs and SVMs 3(3)

Classical NNs implement the first approach (or some of its more sophisticated variants) and SVMs implement the second strategy. In both cases the resulting model should resolve the trade-off between under-fitting and over-fitting the training data. The final model structure (order) should ideally match the learning machine capacity with the training data complexity.

Analysis of SVM Learning (see the sketch following this list)

1) Linear Maximal Margin Classifier for Linearly Separable Data; no overlapping of samples.
2) Linear Soft Margin Classifier for Overlapping Classes.
3) Nonlinear Classifier.
4) Regression by SV Machine, which can be either linear or nonlinear.
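As a quick illustration of the four cases just listed, the sketch below fits each of them on toy data. It is only a sketch under the assumption that NumPy and scikit-learn are available (neither is part of the original slides), and the hard margin of case 1 is approximated with a very large penalty C.

```python
# Minimal sketch of the four SVM learning tasks, assuming NumPy and scikit-learn.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# 1) Linearly separable data, (near) hard-margin linear classifier.
X_sep = np.vstack([rng.normal(-2, 0.4, (20, 2)), rng.normal(2, 0.4, (20, 2))])
y_sep = np.array([-1] * 20 + [1] * 20)
hard = SVC(kernel="linear", C=1e6).fit(X_sep, y_sep)

# 2) Overlapping classes, linear soft-margin classifier (finite C allows slack).
X_ovl = X_sep + rng.normal(0, 1.5, X_sep.shape)
soft = SVC(kernel="linear", C=1.0).fit(X_ovl, y_sep)

# 3) Nonlinear classifier via a kernel (here Gaussian RBF).
nonlin = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_ovl, y_sep)

# 4) Support vector regression, linear or nonlinear depending on the kernel.
x = np.linspace(-3, 3, 60)[:, None]
t = np.sin(x).ravel() + rng.normal(0, 0.1, 60)
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, t)

print(hard.n_support_, soft.n_support_, nonlin.n_support_, len(svr.support_))
```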

1) Linear Maximal Margin Classifier

Given training data (x_1, y_1), ..., (x_ℓ, y_ℓ), y_i ∈ {-1, +1}, find a function f(x, w_o) from the set f(x, w) that best approximates the unknown discriminant (separation) function y = f(x).

Linearly separable data can be separated by an infinite number of linear hyperplanes

  f(x, w) = w^T x + b.

Find the optimal separating hyperplane.

1) Linear Maximal Margin Classifier

The MARGIN is defined by w as follows (Vapnik, Chervonenkis 1974):

  M = 2 / ||w||

The relationship between the weight vector w and the margin M is obtained from a simple geometric analysis. The optimal separating hyperplane with the largest margin intersects half-way between the two classes.

[Figure: Class 1 (y = +1) and Class 2 (y = -1) in the (x1, x2) plane; the separating hyperplane (w^T x) + b = 0, the canonical hyperplanes (w^T x) + b = +1 and (w^T x) + b = -1, the weight vector w, the distances D_1 and D_2 of the closest points (at angles α, β), and the margin M.]
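The geometric analysis behind M = 2/||w|| is not written out on the slides; a standard derivation goes as follows.

```latex
% Distance of a point x_k from the hyperplane w^T x + b = 0:
\[
  D(\mathbf{x}_k) = \frac{|\mathbf{w}^{T}\mathbf{x}_k + b|}{\|\mathbf{w}\|} .
\]
% The support vectors lie on the canonical hyperplanes
% w^T x_+ + b = +1 and w^T x_- + b = -1, so each is at distance 1/||w||
% from the separating hyperplane, and the margin is the sum of the two distances:
\[
  M = D_1 + D_2 = \frac{1}{\|\mathbf{w}\|} + \frac{1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|} .
\]
```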

1) Linear Maximal Margin Classifier

The optimal canonical separating hyperplane (OCSH), i.e., a separating hyperplane with the largest margin (defined by M = 2/||w||), specifies support vectors, i.e., the training data points closest to it, which satisfy

  y_j [w^T x_j + b] = 1,  j = 1, ..., N_SV.

At the same time, the OCSH must separate the data correctly, i.e., it should satisfy the constraints

  y_i [w^T x_i + b] ≥ 1,  i = 1, ..., ℓ,

where ℓ denotes the number of training data and N_SV denotes the number of support vectors.

1) Linear Maximal Margin Classifier

Note that maximization of M means minimization of ||w||. Consequently, minimization of the norm ||w|| equals minimization of

  w^T w = w_1^2 + w_2^2 + ... + w_n^2,

and this leads to maximization of the margin M.

1) Linear Maximal Margin Classifier

Minimize

  J = (1/2) w^T w = (1/2) ||w||^2    (margin maximization!)

subject to the constraints

  y_i [w^T x_i + b] ≥ 1    (correct classification!)

This is a classic QP problem with constraints that leads to forming and solving a primal and/or dual Lagrangian.

1) Linear Maximal Margin Classifier

The QP problem, J = (1/2) w^T w = (1/2) ||w||^2 subject to the constraints y_i [w^T x_i + b] ≥ 1, can be solved by the Lagrangian relaxation approach. In forming the Lagrangian for constraints of the form g_i ≥ 0, the inequality constraint equations are multiplied by nonnegative Lagrange multipliers α_i (i.e., α_i ≥ 0) and subtracted from the objective function.

1) Linear Maximal Margin Classifier

Thus, the Lagrangian L(w, b, α) is

  L(w, b, α) = (1/2) w^T w - Σ_{i=1}^{ℓ} α_i { y_i [w^T x_i + b] - 1 },

where the α_i are Lagrange multipliers. The Lagrangian L is minimized with respect to w and b and maximized with respect to the nonnegative α_i. This problem can be solved either in the primal space (which is the space of the parameters w and b) or in the dual space (which is the space of the Lagrange multipliers α_i).

1) Linear Maximal Margin Classifier

Solving the dual problem: the Karush-Kuhn-Tucker (KKT) optimality conditions are used. At the optimum (saddle) point (w_o, b_o, α_o), the derivatives of the Lagrangian L with respect to the primal variables are zero, i.e.,

  ∂L/∂w_o = 0,  i.e.,  w_o = Σ_{i=1}^{ℓ} α_i y_i x_i    (a)

  ∂L/∂b_o = 0,  i.e.,  Σ_{i=1}^{ℓ} α_i y_i = 0          (b)

1) Linear Maximal Margin Classifier

In addition, the complementarity conditions must be satisfied:

  α_i { y_i [w^T x_i + b] - 1 } = 0,  i = 1, ..., ℓ.

Substituting (a) and (b) for the primal variables, the Lagrangian L(w, b, α) becomes the dual Lagrangian L_d(α) in the dual variables:

  L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j x_i^T x_j
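The substitution step is compressed on the slide; for completeness, the standard expansion reads:

```latex
\begin{align*}
L(\mathbf{w},b,\alpha)
  &= \tfrac{1}{2}\,\mathbf{w}^{T}\mathbf{w}
     - \sum_{i=1}^{\ell}\alpha_i y_i\,\mathbf{w}^{T}\mathbf{x}_i
     - b\sum_{i=1}^{\ell}\alpha_i y_i
     + \sum_{i=1}^{\ell}\alpha_i \\
  &= \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i\alpha_j\,\mathbf{x}_i^{T}\mathbf{x}_j
     - \sum_{i,j=1}^{\ell} y_i y_j \alpha_i\alpha_j\,\mathbf{x}_i^{T}\mathbf{x}_j
     + \sum_{i=1}^{\ell}\alpha_i
     \qquad\text{using (a) and (b)} \\
  &= \sum_{i=1}^{\ell}\alpha_i
     - \tfrac{1}{2}\sum_{i,j=1}^{\ell} y_i y_j \alpha_i\alpha_j\,\mathbf{x}_i^{T}\mathbf{x}_j
   \;=\; L_d(\alpha).
\end{align*}
```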

1) Linear Maximal Margin Classifier

Such a standard quadratic optimization problem can be expressed in matrix notation:

  Maximize   L_d(α) = -0.5 α^T H α + f^T α
  subject to y^T α = 0,  α ≥ 0,

where (α)_i = α_i, H denotes the Hessian matrix of this problem (H_ij = y_i y_j (x_i x_j) = y_i y_j x_i^T x_j), and f is a unit vector, f = 1 = [1, ..., 1]^T.

1) Linear Maximal Margin Classifier

Standard optimization programs are often designed for solving minimization problems. Therefore we change the sign of the objective function:

  Minimize   L_d(α) = 0.5 α^T H α - f^T α
  subject to the same constraints  y^T α = 0,  α ≥ 0.

1) Linear Maximal Margin Classifier

The solution α_oi of the above dual optimization problem determines the parameters of the optimal hyperplane, w_o (according to (a)) and b_o (according to the complementarity conditions), as follows:

  w_o = Σ_{i=1}^{ℓ} α_oi y_i x_i

  b_o = (1/N_SV) Σ_{s=1}^{N_SV} ( 1/y_s - x_s^T w_o ),  s = 1, ..., N_SV,

where N_SV = the number of support vectors.

1) Linear Maximal Margin Classifier

Note that the optimal weight vector w_o and the bias term b_o are calculated by using support vectors only (despite the fact that the summation for w_o runs over all training data patterns). This is because the Lagrange multipliers of all non-support vectors equal zero (α_oi = 0, i = N_SV + 1, ..., ℓ). Finally, having calculated w_o and b_o, we obtain the indicator function i_F = sign(d(x)) and the decision hyperplane d(x):

  d(x) = Σ_i w_oi x_i + b_o = Σ_{i=1}^{ℓ} y_i α_i x^T x_i + b_o
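The slides do not prescribe a particular optimization package; the sketch below solves the minimization form of the dual with NumPy and CVXOPT (both assumed available) and then recovers w_o and b_o as above.

```python
# Minimal sketch: hard-margin dual QP solved with CVXOPT, then w_o, b_o recovered.
import numpy as np
from cvxopt import matrix, solvers

def train_linear_svm(X, y):
    """X: (l, n) training patterns; y: (l,) labels in {-1, +1}."""
    l = X.shape[0]
    H = (y[:, None] * y[None, :]) * (X @ X.T)       # H_ij = y_i y_j x_i^T x_j
    P, q = matrix(H), matrix(-np.ones(l))           # minimize 0.5 a'Ha - f'a
    G, h = matrix(-np.eye(l)), matrix(np.zeros(l))  # alpha_i >= 0
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)    # y' alpha = 0
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    sv = alpha > 1e-6                               # support vectors only
    w_o = ((alpha * y)[:, None] * X).sum(axis=0)    # w_o = sum alpha_oi y_i x_i
    b_o = np.mean(1.0 / y[sv] - X[sv] @ w_o)        # b_o from complementarity
    return w_o, b_o, alpha

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
w_o, b_o, alpha = train_linear_svm(X, y)
print(np.sign(X @ w_o + b_o))                       # indicator i_F = sign(d(x))
```

For the soft-margin case introduced next, only the constraint block changes: G and h gain the rows needed for the upper bound α_i ≤ C.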

1) Linear Maximal Margin Classifier

The previous approach will not work for NOT linearly separable classes, i.e., in the case when there is data overlapping, as shown below.

[Figure: two overlapping classes in the input plane.]

There is no single hyperplane that can perfectly separate all the data. However, the separation can now be done in two ways:
- allowing for misclassification of data, or
- finding a NONLINEAR separation boundary.

2) Linear Soft Margin Classifier for Overlapping Classes

Now one minimizes

  J(w, ξ) = (1/2) w^T w + C ( Σ_{i=1}^{ℓ} ξ_i )^k

s.t.  w^T x_i + b ≥ +1 - ξ_i,  for y_i = +1,
      w^T x_i + b ≤ -1 + ξ_i,  for y_i = -1,
      ξ_i ≥ 0.

2) Linear Soft Margin Classifier for Overlapping Classes

The problem is no longer convex and the solution is given by the saddle point of the primal Lagrangian L_p(w, b, ξ, α, β), where the α_i and β_i are the Lagrange multipliers. Again, we should find an optimal saddle point (w_o, b_o, ξ_o, α_o, β_o), because the Lagrangian L_p has to be minimized with respect to w, b and ξ, and maximized with respect to the nonnegative α_i and β_i.

[Figure: nonlinear SV classification example over Feature x1; for Class 1 (y = +1) the solution is a hyperplane, but there is no perfect separation. A perfect hyperplane cannot be found for nonlinear decision boundaries.]
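To see the soft-margin trade-off governed by C in practice, the sketch below (assuming scikit-learn, which the slides do not use) sweeps C on overlapping data and reports the resulting margin, support vector count, and training errors.

```python
# Minimal sketch of the soft-margin trade-off controlled by C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping classes: no single hyperplane separates them perfectly.
X = np.vstack([rng.normal(-1.0, 1.2, (50, 2)), rng.normal(1.0, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)        # M = 2 / ||w||
    errors = (clf.predict(X) != y).sum()
    print(f"C={C:7.2f}  margin={margin:5.2f}  "
          f"support vectors={clf.n_support_.sum()}  training errors={errors}")
# Small C -> wide margin but many slack (xi_i) violations;
# large C -> narrow margin and fewer misclassified training points.
```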

SVM Design

The SVM is constructed by:
i) mapping input vectors nonlinearly into a high-dimensional feature space, and
ii) constructing the OCSH in the high-dimensional feature space.

Example

Mapping into a feature space for the classical XOR (nonlinearly separable) problem. Many different nonlinear discriminant functions that separate the 1s from the 0s can be drawn in a feature plane.

[Figure: the four XOR patterns plotted in the feature plane (f1, f2), with a linear discriminant function f(x) separating the 1s from the 0s.]

Example

[Figure: a network with INPUT, HIDDEN and OUTPUT layers. The inputs x1, x2, together with a constant input +1 (bias), feed the hidden units φ_1(x), ..., φ_9(x); the output d(x) = Σ_i w_i φ_i(x) + b, with indicator function i_F = sign(d(x)), is a second-order polynomial hypersurface d(x) in input space and, after the mapping z = Φ(x), a hyperplane in a feature space F: d(z) = w^T z + b. The plane in the last slide is produced by this NN.]

The SVM maps input vectors x = [x_1 ... x_n]^T into feature vectors z = Φ(x).
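A minimal sketch of this XOR example, assuming scikit-learn: a degree-2 polynomial kernel corresponds to a second-order feature mapping and makes the four XOR patterns separable.

```python
# XOR becomes linearly separable in the degree-2 polynomial feature space.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                 # XOR labels: 1s vs 0s

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(X, y)
print(clf.predict(X))                      # -> [0 1 1 0]
```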

The kernel trick

Map input vectors x ∈ R^n into vectors z of a higher-dimensional feature space F (z = Φ(x), where Φ represents the mapping R^n → R^f) and solve a linear classification problem in this feature space:

  x ∈ R^n → z(x) = [a_1 φ_1(x), a_2 φ_2(x), ..., a_f φ_f(x)]^T ∈ R^f

The kernel trick

The solution for an indicator function i_F(x) = sign(w^T z(x) + b), which is a linear classifier in the feature space F, creates a nonlinear separating hypersurface in the original input space, given by

  i_F(x) = sign( Σ_{i=1}^{ℓ} α_i y_i z^T(x) z(x_i) + b )

  K(x_i, x_j) = z_i^T z_j = Φ^T(x_i) Φ(x_j)

Note that a kernel function K(x_i, x_j) is a function in the input space.

Kernel Functions

  Kernel function                                          | Classifier type
  K(x, x_i) = [(x^T x_i) + 1]^d                            | Polynomial of degree d
  K(x, x_i) = exp( -(1/2) (x - x_i)^T Σ^{-1} (x - x_i) )   | Gaussian RBF
  K(x, x_i) = tanh[(x^T x_i) + b]*                         | Multilayer perceptron
  (*only for certain values of b)

Kernel Functions

The learning procedure is the same as the construction of hard and soft margin classifiers in x-space. In z-space, the dual Lagrangian that should be maximized is

  L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j z_i^T z_j

or

  L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j K(x_i, x_j)
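The sketch below (assuming only NumPy) verifies the kernel trick for the degree-2 polynomial kernel from the table: evaluating K(x, x') = (x^T x' + 1)^2 in input space gives the same number as the inner product of explicit feature vectors z = Φ(x), without ever forming Φ.

```python
# Verify K(x, x') = (x^T x' + 1)^2 equals phi(x)^T phi(x') for x in R^2.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel, x in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def K(x, xi, d=2):
    return (x @ xi + 1.0) ** d

a = np.array([0.5, -1.0])
b = np.array([2.0, 3.0])
print(K(a, b), phi(a) @ phi(b))   # identical up to rounding: the kernel computes
                                  # the feature-space inner product in input space.
```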

Kernel Functions

The constraints are α_i ≥ 0, i = 1, ..., ℓ. In a more general case, because of noise or generic class features, the training data points overlap. Nothing but the constraints changes, as for the soft margin classifier above. Thus, the nonlinear soft margin classifier will be the solution of the quadratic optimization problem given above, subject to the constraints

  C ≥ α_i ≥ 0,  i = 1, ..., ℓ,  and  Σ_{i=1}^{ℓ} α_i y_i = 0.

Kernel Functions

The decision hypersurface is given by

  d(x) = Σ_{i=1}^{ℓ} y_i α_i K(x, x_i) + b

Note that the final structure of the SVM is equivalent to the NN model. In essence, it is a weighted linear combination of some kernel (basis) functions.

Reference

V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, MA, 2001.