Andrew Kusiak
Intelligent Systems Laboratory
39 Seamans Center
The University of Iowa
Iowa City, IA
andrew-kusiak@uiowa.edu

SVM

(Based on the material provided by Professor V. Kecman)

The maximal margin classifier is similar to the perceptron: it also assumes that the data points are linearly separable. It aims at finding the separating hyperplane with the maximal geometric margin (not just any one, which is typical of a perceptron).

[Figure: two plots of Class 1 (y = +1) and Class 2 (y = -1) in the (x1, x2) plane, showing separating lines, i.e., decision boundaries, i.e., hyperplanes, one with a small margin and one with a large margin.] The larger the margin, the smaller the probability of misclassification.

SVM: Terminology 1(6)

Before introducing the formal (constructive) part of statistical learning theory, the terminology is defined. Vapnik and Chervonenkis introduced a nested set of hypothesis (a.k.a. approximating or decision) functions:

H1 ⊂ H2 ⊂ ... ⊂ Hn-1 ⊂ Hn

SVM: Terminology 2(6)

Approximation or training error ~ Empirical risk ~ Bias
Estimation error ~ Variance ~ Confidence of the training error ~ VC confidence interval
Generalization (true, expected) error ~ Bound on test error ~ Guaranteed or true risk
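The nested hypothesis spaces H1 ⊂ H2 ⊂ ... and the training-error terminology above can be illustrated with a small sketch: fitting polynomials of increasing degree to noisy data, the training (empirical) error can only shrink as the hypothesis space grows, while generalization error eventually suffers. The data set and degrees below are assumptions for illustration, not from the slides.

```python
import numpy as np

# Nested hypothesis spaces: polynomials of degree 1 ⊂ 3 ⊂ 5 ⊂ 9.
# Least squares over a larger space can never fit the training data worse,
# so the empirical risk is non-increasing in the model capacity.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(20)

def training_error(degree):
    coeffs = np.polyfit(x, y, degree)          # least-squares fit in H_degree
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = [training_error(d) for d in (1, 3, 5, 9)]
print(errors)  # monotonically non-increasing with capacity
```

The trade-off the following slides formalize is that this shrinking training error is only one term of the generalization error; the VC confidence term grows with capacity.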
SVM: Terminology 3(6)

Decision functions and/or hyperplanes and/or hypersurfaces
Discriminant functions and/or hyperplanes and/or hypersurfaces
Decision boundaries (hyperplanes, hypersurfaces)
Separation lines, functions and/or hyperplanes and/or hypersurfaces

[Figure: hypothesis spaces of increasing complexity, H1 ⊂ H2 ⊂ ... ⊂ Hn-1 ⊂ Hn, nested inside the target space.]

SVM: Terminology 4(6)

[Figure: error (risk) versus model capacity h ~ n, showing the underfitting and overfitting regions.]

Approximation or training error e_app ~ Bias
Estimation error e_est ~ Variance
Generalization or true error e_gen ~ Confidence ~ Bound on test error ~ Guaranteed, or true risk

SVM: Terminology 5(6)

Input space and feature space are used. More recently, SVM developers introduced the feature space, analogous to the NN hidden layer, or imaginary z-space.

SVM: Terminology 6(6)

Downloadable software illustrates some SVM relationships. [Figure: desired value y, input plane (x1, x2), decision function d(x, w, b) and indicator function i_F(x, w, b) = sign(d); stars denote support vectors.] The decision boundary, or separating line, is an intersection of d(x, w, b) and the input plane (x1, x2): d = w·x + b = 0. The indicator function is basically a threshold function. The optimal separating hyperplane d(x, w, b) is an argument of the indicator function.
More similarities between NNs and SVMs 1(3)

Classic multilayer perceptron: E = Σ_{i=1}^{P} (d_i - f(x_i, w))^2 (closeness to data)
Regularization (RBF) NN: E = Σ_{i=1}^{P} (d_i - f(x_i, w))^2 + λ ||Pf||^2 (closeness to data + smoothness)
Support Vector Machine: E = Σ_{i=1}^{ℓ} L_εi + λ ||Pf||^2 = Σ_{i=1}^{ℓ} L_εi + Ω(ℓ, h) (closeness to data + capacity of machine)

In the last expression, h is a control parameter for minimizing the generalization error E (i.e., risk R).

More similarities between NNs and SVMs 2(3)

There are two basic, constructive approaches to the minimization of the previous equations (Vapnik, 1995 and 1998):

1. Select an appropriate structure (order of polynomials, number of HL neurons, number of rules in the Fuzzy Logic model) and keep the confidence interval fixed. This way the training error (i.e., empirical risk) is minimized; or
2. Keep the value of the training error fixed (equal to zero or at some acceptable level) and minimize the confidence interval.

More similarities between NNs and SVMs 3(3)

Classical NNs implement the first approach (or some of its more sophisticated variants), and SVMs implement the second strategy. In both cases the resulting model should resolve the trade-off between under-fitting and over-fitting the training data. The final model structure (order) should ideally match the learning machine capacity with the training data complexity.

Analysis of SVM Learning

1) Linear Maximal Margin Classifier for Linearly Separable Data; no overlapping of samples.
2) Linear Soft Margin Classifier for Overlapping Classes.
3) Nonlinear Classifier.
4) Regression by SV Machine, which can be either linear or nonlinear.
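The three error functionals above differ mainly in their "closeness to data" term. A minimal sketch comparing the squared error of the classic perceptron with Vapnik's ε-insensitive loss L_ε, on a toy target/prediction pair (assumed values):

```python
import numpy as np

def squared_loss(d, f):
    # MLP / regularization-network data term: sum of squared residuals
    return float(np.sum((d - f) ** 2))

def eps_insensitive_loss(d, f, eps=0.1):
    # Vapnik's L_ε = max(0, |d - f| - ε): errors inside the ε-tube cost nothing
    return float(np.sum(np.maximum(0.0, np.abs(d - f) - eps)))

d = np.array([1.0, 2.0, 3.0])     # desired values (toy)
f = np.array([1.05, 2.5, 3.0])    # model outputs (toy)

print(squared_loss(d, f))         # 0.2525: every residual contributes
print(eps_insensitive_loss(d, f)) # 0.4: only the 0.5 residual exceeds ε
```

The insensitivity inside the ε-tube is what makes the SVM solution depend on a subset of the data (the support vectors) rather than on every training point.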
1) Linear Maximal Margin Classifier

Given training data (x1, y1), ..., (xℓ, yℓ), y_i ∈ {-1, +1}, find a function f(x, w0) from the family f(x, w) that best approximates the unknown discriminant (separation) function y = f(x).

Linearly separable data can be separated by an infinite number of linear hyperplanes f(x, w) = w·x + b. Find the optimal separating hyperplane.

THE MARGIN IS DEFINED by w as follows (Vapnik and Chervonenkis, 1974):

M = 2 / ||w||

The relationship between the weight vector w and the margin M is obtained from a simple geometric analysis. The optimal separating hyperplane with the largest margin intersects half-way between the two classes.

[Figure: Class 1 (y = +1) and Class 2 (y = -1) separated by the hyperplanes (w·x) + b = +1, (w·x) + b = 0 and (w·x) + b = -1, with margin M between the two outer hyperplanes.]
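The formula M = 2/||w|| can be checked numerically; it also shows why minimizing ||w|| (below) maximizes the margin. The weight vector is a toy assumption:

```python
import numpy as np

# A hypothetical canonical hyperplane with weight vector w:
# the margin is M = 2 / ||w||.
w = np.array([2.0, 0.0])
M = 2.0 / np.linalg.norm(w)
print(M)                              # 1.0

# Halving ||w|| doubles the margin, so the widest-margin hyperplane
# is the one with the smallest norm among all canonical hyperplanes.
M_wide = 2.0 / np.linalg.norm(w / 2.0)
print(M_wide)                         # 2.0
```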
1) Linear Maximal Margin Classifier

The optimal canonical separating hyperplane (OCSH), i.e., a separating hyperplane with the largest margin (defined by M = 2/||w||), specifies support vectors, i.e., the training data points closest to it, which satisfy y_j[w·x_j + b] = 1, j = 1, ..., N_SV. At the same time, the OCSH must separate the data correctly, i.e., it should satisfy the constraints

y_i[w·x_i + b] ≥ 1,  i = 1, ..., ℓ

where ℓ denotes the number of training data and N_SV denotes the number of support vectors.

Note that maximization of M means minimization of ||w||. Consequently, minimization of the norm ||w|| equals minimization of w·w = w1^2 + w2^2 + ... + wn^2, and this leads to maximization of the margin M.

Minimize

J = (1/2) w·w = (1/2) ||w||^2

subject to the constraints

y_i[w·x_i + b] ≥ 1

(Margin maximization! Correct classification!) This is a classic QP problem with constraints that leads to forming and solving a primal and/or dual Lagrangian.

The QP problem can be solved by the Lagrangian relaxation approach. In forming the Lagrangian for constraints of the form g_i ≥ 0, the inequality constraint equations are multiplied by nonnegative Lagrange multipliers α_i (i.e., α_i ≥ 0) and subtracted from the objective function.
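The primal QP above can be handed to a general-purpose solver. A sketch using SciPy's SLSQP on a hypothetical four-point data set (a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Primal hard-margin QP: minimize J = 1/2 w.w subject to y_i(w.x_i + b) >= 1.
# Toy, linearly separable data (assumed for illustration).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                     # v = [w1, w2, b]
    return 0.5 * np.dot(v[:2], v[:2])

cons = [{'type': 'ineq',
         'fun': (lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0)}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
w_o, b_o = res.x[:2], res.x[2]
margin = 2.0 / np.linalg.norm(w_o)
print(w_o, b_o, margin)               # roughly [0.5, 0], 0, margin 4
```

For this data the support vectors are [2, 0] and [-2, 0], so the optimum is w_o = [0.5, 0], b_o = 0, with margin M = 4.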
1) Linear Maximal Margin Classifier

Thus, the Lagrangian L(w, b, α) is

L(w, b, α) = (1/2) w·w - Σ_{i=1}^{ℓ} α_i { y_i[w·x_i + b] - 1 }

where the α_i are Lagrange multipliers. The Lagrangian L is minimized with respect to w and b and maximized with respect to the nonnegative α_i. This problem can be solved either in a primal space (which is the space of the parameters w and b) or in a dual space (which is the space of the Lagrange multipliers α_i). To solve the dual problem, the Karush-Kuhn-Tucker (KKT) optimality conditions are used.

The Karush-Kuhn-Tucker (KKT) conditions: at the optimal (saddle) point (w_o, b_o, α_o), the derivatives of the Lagrangian L with respect to the primal variables are zero, i.e.,

∂L/∂w_o = 0,  i.e.,  w_o = Σ_{i=1}^{ℓ} α_i y_i x_i    (a)
∂L/∂b_o = 0,  i.e.,  Σ_{i=1}^{ℓ} α_i y_i = 0          (b)

In addition, the complementarity condition must be satisfied:

α_i { y_i[w·x_i + b] - 1 } = 0,  i = 1, ..., ℓ

Substituting (a) and (b) for the primal variables, the Lagrangian L(w, b, α) becomes the Lagrangian L_d(α) in dual variables:

L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j x_i·x_j
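Conditions (a), (b) and the complementarity condition are easy to check numerically. The sketch below uses a hand-worked toy solution (data, multipliers and the optimal w_o, b_o are assumptions for illustration):

```python
import numpy as np

# Toy solution: support vectors x=[2,0] (y=+1) and x=[-2,0] (y=-1) with
# alpha = 0.125 each, a non-support vector with alpha = 0, w_o = [0.5, 0],
# b_o = 0. Verify (a), (b) and complementarity.
X = np.array([[2.0, 0.0], [-2.0, 0.0], [3.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.125, 0.125, 0.0])

w_o = (alpha * y) @ X                 # (a): w_o = sum_i alpha_i y_i x_i
print(w_o)                            # [0.5, 0.0]
print(float(alpha @ y))               # (b): sum_i alpha_i y_i = 0

slack = y * (X @ w_o + 0.0) - 1.0     # y_i[w.x_i + b] - 1 with b_o = 0
print(alpha * slack)                  # complementarity: all zeros
```

Note how the non-support vector has positive slack but zero multiplier, while the support vectors have zero slack: exactly the complementarity pattern the slide states.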
1) Linear Maximal Margin Classifier

Such a standard quadratic optimization problem can be expressed using matrix notation:

Maximize
L_d(α) = -0.5 α'Hα + f'α
subject to
y'α = 0,  α ≥ 0

where (α)_i = α_i, H denotes the Hessian matrix of this problem (H_ij = y_i y_j (x_i·x_j) = y_i y_j x_i'x_j), and f is a unit vector, f = 1 = [1, ..., 1]'.

Standard optimization programs are often designed for solving minimization problems. Therefore we change the sign of the objective function:

Minimize
L_d(α) = 0.5 α'Hα - f'α
subject to the same constraints, y'α = 0, α ≥ 0.

The solution α_oi of the above dual optimization problem determines the parameters of the optimal hyperplane, w_o (according to (a)) and b_o (according to the complementarity conditions), as follows:

w_o = Σ_{i=1}^{ℓ} α_oi y_i x_i
b_o = (1/N_SV) Σ_{s=1}^{N_SV} (1/y_s - x_s·w_o)

where N_SV denotes the number of support vectors.

Note that the optimal weight vector w_o and bias term b_o are calculated using support vectors only (despite the fact that the summation for w_o is over all training data patterns). This is because the Lagrange multipliers for all non-support vectors equal zero (α_oi = 0, i = N_SV + 1, ..., ℓ). Finally, having calculated w_o and b_o, we obtain an indicator function i_F = sign(d(x)) and a decision hyperplane d(x):

d(x) = w_o·x + b_o = Σ_{i=1}^{ℓ} y_i α_i (x_i·x) + b_o
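The matrix-form dual can likewise be solved numerically, and w_o, b_o recovered exactly as the slide describes. A sketch with SciPy's SLSQP on toy data (a general-purpose solver used here as an illustration; dedicated QP codes are the norm):

```python
import numpy as np
from scipy.optimize import minimize

# Dual QP: minimize 0.5 a'Ha - f'a subject to y'a = 0, a >= 0,
# with H_ij = y_i y_j (x_i . x_j). Toy data (assumed).
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
H = (y[:, None] * y[None, :]) * (X @ X.T)
f = np.ones(len(y))

res = minimize(lambda a: 0.5 * a @ H @ a - f @ a,
               x0=np.zeros(len(y)), method='SLSQP',
               bounds=[(0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x

w_o = (alpha * y) @ X                         # w_o = sum alpha_i y_i x_i
sv = alpha > 1e-4                             # support vectors: alpha > 0
b_o = float(np.mean(1.0 / y[sv] - X[sv] @ w_o))
print(w_o, b_o)                               # roughly [0.5, 0], 0
```

Only two of the four multipliers come out nonzero, confirming that the expansion for w_o effectively runs over the support vectors alone.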
1) Linear Maximal Margin Classifier

The previous approach will not work for NOT linearly separable classes, i.e., in the case when there is data overlapping, as shown below. There is no single hyperplane that can perfectly separate all the data. However, the separation can now be done in two ways:

- allowing for misclassification of data, or
- finding a NONLINEAR separation boundary.

2) Linear Soft Margin Classifier for Overlapping Classes

Now one minimizes

J(w, ξ) = (1/2) w·w + C Σ_{i=1}^{ℓ} ξ_i^k

s.t.  w·x_i + b ≥ +1 - ξ_i, for y_i = +1,
      w·x_i + b ≤ -1 + ξ_i, for y_i = -1.

The problem is no longer convex, and the solution is given by the saddle point of the primal Lagrangian L_p(w, b, ξ, α, β), where the α_i and β_i are Lagrange multipliers. Again, we should find an optimal saddle point (w_o, b_o, ξ_o, α_o, β_o), because the Lagrangian L_p has to be minimized with respect to w, b and ξ, and maximized with respect to the nonnegative α_i and β_i.

[Figure: nonlinear SV classification of overlapping classes in the feature plane. The soft margin solution is a hyperplane, but there is no perfect separation; a perfect hyperplane cannot be found for nonlinear decision boundaries.]
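The soft margin program (with k = 1) can be sketched the same way: on overlapping toy data, the solver trades margin width against the total slack penalty C Σ ξ_i. The one-dimensional data set below, with one outlier per class, is an assumption for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Soft margin: minimize 1/2 w^2 + C * sum(xi) subject to
# y_i (w x_i + b) >= 1 - xi_i and xi_i >= 0, in 1-D for clarity.
x = np.array([2.0, 3.0, -0.5, -2.0, -3.0, 0.5])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # -0.5 and 0.5 are outliers
C = 1.0
n = len(x)

def objective(v):                     # v = [w, b, xi_1, ..., xi_n]
    return 0.5 * v[0] ** 2 + C * np.sum(v[2:])

cons = [{'type': 'ineq',
         'fun': (lambda v, i=i: y[i] * (v[0] * x[i] + v[1]) - 1.0 + v[2 + i])}
        for i in range(n)]
bounds = [(None, None), (None, None)] + [(0, None)] * n

res = minimize(objective, x0=np.zeros(n + 2), method='SLSQP',
               bounds=bounds, constraints=cons)
w, b, xi = res.x[0], res.x[1], res.x[2:]
print(round(w, 3), round(b, 3), round(float(xi.sum()), 3))
```

Only the two misclassified outliers receive nonzero slack; the cleanly classified points sit on or outside the margin with ξ_i = 0.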
SVM Design

The SVM is constructed by:

i) mapping the input vectors nonlinearly into a high dimensional feature space, and
ii) constructing the OCSH in the high dimensional feature space.

Example

Mapping into a feature space for the classical XOR (nonlinearly separable) problem. Many different nonlinear discriminant functions that separate 1s from 0s can be drawn in a feature plane, e.g., f(x) = x1 + x2 - 2x1x2 - 1/3; with z = x1x2, f(x) = x1 + x2 - 2z - 1/3.

[Figure: the XOR points in the feature plane, with the region f > 0 separating the two classes.]

Example

[Figure: a network with INPUT, HIDDEN and OUTPUT layers; inputs x1, x2 and a constant input (bias) feed hidden units φ1(x), ..., φ9(x), whose outputs are combined with weights w1, ..., w9 and bias b into d(x), with i_F = sign(d(x)).] The plane in the last slide is produced by this NN.

A second order polynomial hypersurface d(x) in the input space corresponds, through the mapping z = Φ(x), to a hyperplane in a feature space F: d(z) = w·z + b. The SVM maps input vectors x = [x1 ... xn]' into feature vectors z = Φ(x).
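The XOR mapping can be verified directly: adding the product feature z = x1·x2 makes the four XOR points linearly separable. The linear-in-features function below (f = x1 + x2 - 2z - 1/3) is one valid separator consistent with the slide's feature-plane figure:

```python
import numpy as np

# XOR truth table: not linearly separable in (x1, x2) alone.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

def phi(x):
    # Feature map: append the product feature z = x1 * x2
    return np.array([x[0], x[1], x[0] * x[1]])

# Linear function in the feature space: f = x1 + x2 - 2z - 1/3
w, b = np.array([1.0, 1.0, -2.0]), -1.0 / 3.0
f = np.array([phi(x) @ w + b for x in X])
pred = (f > 0).astype(int)
print(pred)                            # [0 1 1 0], matching XOR
```

A hyperplane in the three-dimensional feature space thus plays the role of the nonlinear (second order polynomial) decision boundary in the original input plane.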
The kernel trick

Map the input vectors x ∈ R^n into vectors z of a higher dimensional feature space F, z(x) = Φ(x), where Φ represents the mapping R^n → R^f, and solve a linear classification problem in this feature space:

x ∈ R^n → z(x) = [a1 φ1(x), a2 φ2(x), ..., a_f φ_f(x)]' ∈ R^f

The solution for an indicator function i_F(x) = sign(w·z(x) + b), which is a linear classifier in the feature space F, creates a nonlinear separating hypersurface in the original input space, given by

i_F(x) = sign( Σ_{i=1}^{ℓ} α_i y_i z(x)·z(x_i) + b )

The kernel is K(x_i, x_j) = z_i·z_j = Φ(x_i)·Φ(x_j). Note that a kernel function K(x_i, x_j) is a function in the input space.

Kernel Functions

Kernel function: K(x, x_i) = [(x·x_i) + 1]^d -- classifier type: polynomial of degree d
Kernel function: K(x, x_i) = exp(-(1/2) (x - x_i)' Σ^{-1} (x - x_i)) -- classifier type: Gaussian RBF
Kernel function: K(x, x_i) = tanh[(x·x_i) + b]* -- classifier type: multilayer perceptron (*only for certain values of b)

The learning procedure is the same as the construction of hard and soft margin classifiers in x-space. In z-space, the dual Lagrangian that should be maximized is

L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j z_i·z_j

or

L_d(α) = Σ_{i=1}^{ℓ} α_i - (1/2) Σ_{i,j=1}^{ℓ} y_i y_j α_i α_j K(x_i, x_j)
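The identity K(x_i, x_j) = Φ(x_i)·Φ(x_j) can be checked numerically for the degree-2 polynomial kernel, whose explicit feature map in two dimensions is known in closed form. The Gaussian kernel is shown with a scalar width, an assumed simplification of the slide's full covariance Σ:

```python
import numpy as np

def poly_kernel(x, xi, d=2):
    # Polynomial kernel of degree d: K = (x.xi + 1)^d
    return (x @ xi + 1.0) ** d

def phi(x):
    # Explicit feature map realizing the degree-2 polynomial kernel in 2-D
    return np.array([1.0, np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def rbf_kernel(x, xi, gamma=0.5):
    # Gaussian RBF with scalar width gamma (assumed in place of Σ^{-1})
    return np.exp(-gamma * np.sum((x - xi) ** 2))

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(a, b))          # (0.5 - 2 + 1)^2 = 0.25
print(phi(a) @ phi(b))            # same value via explicit features
print(rbf_kernel(a, a))           # 1.0 at zero distance
```

This is the point of the kernel trick: the dot product in the six-dimensional feature space is computed from the two-dimensional inputs alone, without ever forming Φ.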
Kernel Functions

The constraints are α_i ≥ 0, i = 1, ..., ℓ. In a more general case, because of noise or generic class features, the training data points overlap. Nothing but the constraints change, as for the soft margin classifier above. Thus, the nonlinear soft margin classifier will be the solution of the quadratic optimization problem given above, subject to the constraints

C ≥ α_i ≥ 0,  i = 1, ..., ℓ    and    Σ_{i=1}^{ℓ} α_i y_i = 0

The decision hypersurface is given by

d(x) = Σ_{i=1}^{ℓ} y_i α_i K(x, x_i) + b

Note that the final structure of the SVM is equivalent to the NN model. In essence, it is a weighted linear combination of some kernel (basis) functions.

Reference

V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, MA.
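Putting the pieces together, the sketch below solves the kernelized hard-margin dual on the XOR data with a Gaussian kernel and evaluates the decision hypersurface d(x) = Σ y_i α_i K(x, x_i) + b. Toy data and a general-purpose solver, as an illustration:

```python
import numpy as np
from scipy.optimize import minimize

# XOR with labels in {-1, +1}: separable with an RBF kernel.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

def K(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

G = np.array([[K(xi, xj) for xj in X] for xi in X])   # Gram matrix
H = (y[:, None] * y[None, :]) * G                     # dual Hessian

res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),
               x0=np.zeros(4), method='SLSQP',
               bounds=[(0, None)] * 4,
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x

# Bias from a support vector s, where y_s d(x_s) = 1 must hold:
s = int(np.argmax(alpha))
b = y[s] - sum(alpha[i] * y[i] * K(X[s], X[i]) for i in range(4))

def d(x):
    # Decision hypersurface: weighted combination of kernel basis functions
    return sum(alpha[i] * y[i] * K(x, X[i]) for i in range(4)) + b

print([float(np.sign(d(xi))) for xi in X])   # matches y: [-1, 1, 1, -1]
```

All four points come out as support vectors here, each contributing one kernel "basis function" to d(x); this is exactly the NN-like structure the slide describes.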
MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After
More informationL5 Support Vector Classification
L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander
More informationDiscriminant Analysis: A Unified Approach
Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University
More informationAn introduction to Support Vector Machines
1 An introduction to Support Vector Machines Giorgio Valentini DSI - Dipartimento di Scienze dell Informazione Università degli Studi di Milano e-mail: valenti@dsi.unimi.it 2 Outline Linear classifiers
More informationSteepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1
Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rues 1 R.J. Marks II, S. Oh, P. Arabshahi Λ, T.P. Caude, J.J. Choi, B.G. Song Λ Λ Dept. of Eectrica Engineering Boeing Computer Services University
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationLearning From Data Lecture 25 The Kernel Trick
Learning From Data Lecture 25 The Kernel Trick Learning with only inner products The Kernel M. Magdon-Ismail CSCI 400/600 recap: Large Margin is Better Controling Overfitting Non-Separable Data 0.08 random
More informationOn the Goal Value of a Boolean Function
On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationBayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?
Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine
More informationSupport Vector Machines. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Support Vector Machines CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Linear Classifier Naive Bayes Assume each attribute is drawn from Gaussian distribution with the same variance Generative model:
More informationModelli Lineari (Generalizzati) e SVM
Modelli Lineari (Generalizzati) e SVM Corso di AA, anno 2018/19, Padova Fabio Aiolli 19/26 Novembre 2018 Fabio Aiolli Modelli Lineari (Generalizzati) e SVM 19/26 Novembre 2018 1 / 36 Outline Linear methods
More informationSupport Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs
E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in
More informationMath 124B January 17, 2012
Math 124B January 17, 212 Viktor Grigoryan 3 Fu Fourier series We saw in previous ectures how the Dirichet and Neumann boundary conditions ead to respectivey sine and cosine Fourier series of the initia
More informationOutline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22
Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems
More informationu(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0
Bocconi University PhD in Economics - Microeconomics I Prof M Messner Probem Set 4 - Soution Probem : If an individua has an endowment instead of a monetary income his weath depends on price eves In particuar,
More informationASummaryofGaussianProcesses Coryn A.L. Bailer-Jones
ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe
More informationSection 6: Magnetostatics
agnetic fieds in matter Section 6: agnetostatics In the previous sections we assumed that the current density J is a known function of coordinates. In the presence of matter this is not aways true. The
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationResearch of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance
Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient
More informationMachine Learning. Lecture 6: Support Vector Machine. Feng Li.
Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)
More informationXSAT of linear CNF formulas
XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open
More information8 Digifl'.11 Cth:uits and devices
8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,
More informationLecture 9: Large Margin Classifiers. Linear Support Vector Machines
Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation
More informationLinear Classification and SVM. Dr. Xin Zhang
Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a
More informationTheory and implementation behind: Universal surface creation - smallest unitcell
Teory and impementation beind: Universa surface creation - smaest unitce Bjare Brin Buus, Jaob Howat & Tomas Bigaard September 15, 218 1 Construction of surface sabs Te aim for tis part of te project is
More information$, (2.1) n="# #. (2.2)
Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationLecture 10: A brief introduction to Support Vector Machine
Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department
More information6 Wave Equation on an Interval: Separation of Variables
6 Wave Equation on an Interva: Separation of Variabes 6.1 Dirichet Boundary Conditions Ref: Strauss, Chapter 4 We now use the separation of variabes technique to study the wave equation on a finite interva.
More informationStatistics for Applications. Chapter 7: Regression 1/43
Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)
More informationTrainable fusion rules. I. Large sample size case
Neura Networks 19 (2006) 1506 1516 www.esevier.com/ocate/neunet Trainabe fusion rues. I. Large sampe size case Šarūnas Raudys Institute of Mathematics and Informatics, Akademijos 4, Vinius 08633, Lithuania
More informationA proposed nonparametric mixture density estimation using B-spline functions
A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),
More information