Support Vector Machines

Support vector machines (SVMs) learn a hypothesis:

    h(x) = b + Σ_{i=1}^m y_i α_i k(x, x_i)

(x_1, y_1), ..., (x_m, y_m) are the training examples, y_i ∈ {−1, +1}.
b is the bias weight.
α_1, ..., α_m are the Lagrange multipliers, α_i ≥ 0, a nonnegative weight for each training example.
k is a kernel function, e.g., we might choose the dot product, k(x, x′) = x · x′ = Σ_j x_j x′_j.
If α_i > 0, then x_i is a support vector.

Maximizing the Margin

The margin is the region −1 ≤ h(x) ≤ 1. The goal of learning is to maximize the width of the margin, with positive examples on one side and negative examples on the other. An SVM is defined by its support vectors, the positive and negative examples nearest the decision boundary. The dot product (linear) kernel function k(x, x′) = x · x′ = Σ_j x_j x′_j leads to a classifier similar to a perceptron, but with a margin. Other kernel functions lead to nonlinear decision boundaries.
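To make the hypothesis concrete, here is a minimal Python sketch of evaluating h(x); the names (svm_predict, dot_kernel) are illustrative rather than from the notes, and the multipliers, bias, and examples are assumed to have been found by training:

    import numpy as np

    def dot_kernel(x, xp):
        """Linear kernel: k(x, x') = x . x'."""
        return float(np.dot(x, xp))

    def svm_predict(x, train_x, train_y, alpha, b, kernel=dot_kernel):
        """Evaluate h(x) = b + sum_i y_i alpha_i k(x, x_i).
        Only examples with alpha_i > 0 (the support vectors) contribute."""
        return b + sum(a * y * kernel(x, xi)
                       for a, y, xi in zip(alpha, train_y, train_x))

The predicted class is the sign of h(x).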
[Figure: Example SVM for Separable Examples, showing the hyperplanes w·x + b = −1, 0, +1.]

[Figure: Example SVM for Nonseparable Examples, showing the hyperplanes w·x + b = −1, 0, +1.]
[Figure: Example Gaussian Kernel SVM.]

[Figure: Example Gaussian Kernel, Zoomed In.]
Hyperplane Classification

Consider the class of hyperplanes, i.e., dot product plus a bias:

    (w · x) + b = 0

For linearly separable examples, there is a unique optimal hyperplane, defined by maximizing the margin. This can be expressed as:

    max_{w,b} min{ ‖x − x_i‖ : (w · x) + b = 0, i = 1, ..., m }

That is, choose w and b to maximize the minimum distance from an example to the hyperplane.

Hyperplane Example

[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]
The optimal hyperplane can be found by solving:

    minimize ‖w‖²/2
    subject to y_i((w · x_i) + b) ≥ 1, i ∈ {1, ..., m}

That is, we require the smallest weights such that positive examples have h(x_i) ≥ 1 and negative examples have h(x_i) ≤ −1. The smallest weights correspond to the maximum margin. Note that ‖w‖ = √(w · w), so the width of the margin is equal to:

    2/√(w · w) = 2/‖w‖

Math Tricks

Convert to a Lagrangian:

    ‖w‖²/2 − Σ_{i=1}^m α_i (y_i((w · x_i) + b) − 1)

The derivatives are zero when:

    w = Σ_{i=1}^m α_i y_i x_i   and   0 = Σ_{i=1}^m α_i y_i

Substituting for w in the Lagrangian leads to the objective function:

    Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j (x_i · x_j)

which we want to maximize subject to α_i ≥ 0 and Σ_{i=1}^m α_i y_i = 0.
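As a numerical sketch (not from the notes, and assuming SciPy is available), this dual can be maximized with a general-purpose solver on an invented four-point data set; the support vectors then emerge as the examples with α_i > 0:

    import numpy as np
    from scipy.optimize import minimize

    # Invented, linearly separable 2-D data: two points per class.
    X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    Yx = y[:, None] * X
    G = Yx @ Yx.T                      # G[i, j] = y_i y_j (x_i . x_j)

    def neg_dual(alpha):
        # SciPy minimizes, so negate the dual objective.
        return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

    res = minimize(neg_dual, np.zeros(len(y)),
                   bounds=[(0.0, None)] * len(y),
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x                      # support vectors have alpha_i > 0
    w = (alpha * y) @ X                # w = sum_i alpha_i y_i x_i
    sv = int(np.argmax(alpha))         # index of one support vector
    b = y[sv] - w @ X[sv]              # margin condition y_i((w . x_i) + b) = 1
    print(alpha.round(3), w.round(3), round(float(b), 3))

For this data, only the two examples nearest the boundary should receive nonzero weight.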
Substituting for w in the hypothesis h(x) = (w · x) + b leads to:

    h(x) = b + Σ_{i=1}^m y_i α_i (x · x_i)

In this case (dot product kernel, linearly separable examples), α_i > 0 implies y_i((w · x_i) + b) = 1, i.e., support vectors lie on the margin boundary.

Kernel Tricks

The kernel trick is to substitute other functions k(x, x′) in place of (x · x′). The most popular are:

    Polynomial: (x · x′)^d or (a(x · x′) + c)^d, where d ∈ {1, 2, ...} and a, c > 0
    Gaussian: exp{−‖x − x′‖²/(2σ²)}, where σ > 0

The trick of a kernel function is efficiently computing the dot product of a large number of basis functions:

    k(x, x′) = (Φ(x) · Φ(x′))   where   Φ(x) = (Φ_1(x), Φ_2(x), ...)
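For reference, these two kernels translate directly into code (a sketch; the parameter defaults are arbitrary):

    import numpy as np

    def poly_kernel(x, xp, d=2, a=1.0, c=1.0):
        """Polynomial kernel: (a (x . x') + c)^d."""
        return (a * np.dot(x, xp) + c) ** d

    def gaussian_kernel(x, xp, sigma=1.0):
        """Gaussian kernel: exp(-||x - x'||^2 / (2 sigma^2))."""
        diff = np.asarray(x) - np.asarray(xp)
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))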
Example Kernel Function

Let k(x, x′) = ((x · x′) + 1)². Then in two dimensions:

    ((u · v) + 1)² = (((u_1, u_2) · (v_1, v_2)) + 1)²
                  = (u_1 v_1 + u_2 v_2 + 1)²
                  = u_1² v_1² + u_2² v_2² + 1 + 2 u_1 u_2 v_1 v_2 + 2 u_1 v_1 + 2 u_2 v_2
                  = (u_1², u_2², 1, √2 u_1 u_2, √2 u_1, √2 u_2) · (v_1², v_2², 1, √2 v_1 v_2, √2 v_1, √2 v_2)

The kernel function obtains the same result as the dot product of the basis functions, without explicitly computing all the basis functions. The Gaussian kernel corresponds to an infinite number of basis functions!

Nonlinear Classification

Using a kernel function, the hypothesis becomes:

    h(x) = b + Σ_{i=1}^m y_i α_i k(x, x_i)

The Lagrangian conversion results in the problem: find the α_i weights that solve:

    maximize   Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j k(x_i, x_j)
    subject to α_i ≥ 0 and Σ_{i=1}^m α_i y_i = 0

The support vectors are {x_i : α_i > 0}. They satisfy y_i h(x_i) = 1.
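A quick numerical check of the expansion above (a sketch; the test points u and v are arbitrary):

    import numpy as np

    def phi(x):
        """Basis expansion whose dot product reproduces ((x . x') + 1)^2 in 2-D."""
        x1, x2 = x
        r2 = np.sqrt(2.0)
        return np.array([x1**2, x2**2, 1.0, r2*x1*x2, r2*x1, r2*x2])

    u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    assert np.isclose((np.dot(u, v) + 1.0) ** 2, np.dot(phi(u), phi(v)))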
[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]

Soft Margin Classification

It might not be possible, or desirable, to satisfy:

    y_i((w · x_i) + b) ≥ 1

To allow violations, an error term can be added:

    y_i((w · x_i) + b) ≥ 1 − ξ_i   where ξ_i ≥ 0

and the problem is to:

    minimize ‖w‖²/2 + C Σ_{i=1}^m ξ_i

where C is chosen by the user. Note that Σ_{i=1}^m ξ_i is an upper bound on the number of training mistakes. In the conversion, replace α_i ≥ 0 with 0 ≤ α_i ≤ C.
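As an illustration of the role of C (a sketch using scikit-learn, which implements this soft-margin formulation; the data is made up):

    import numpy as np
    from sklearn.svm import SVC

    # Nearly separable data with one straggler near the boundary.
    X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.1],
                  [4.0, 4.0], [5.0, 5.0]])
    y = np.array([-1, -1, 1, 1, 1])

    # Small C tolerates margin violations (softer, wider margin);
    # large C approaches the hard-margin classifier.
    for C in (0.1, 1.0, 100.0):
        clf = SVC(C=C, kernel="linear").fit(X, y)
        print(C, clf.support_)  # indices of the support vectors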
[Figure from Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2:121-167, 1998.]

ν-Parameterization

Another type of soft margin classifier satisfies:

    y_i((w · x_i) + b) ≥ ρ − ξ_i

where ρ > 0 is a free parameter, and solves:

    minimize ‖w‖²/2 − ρ + 1/(νm) Σ_{i=1}^m ξ_i

where ν is chosen by the user. In the conversion, α_i ≥ 0 is replaced with:

    0 ≤ α_i ≤ 1/(νm)   and   Σ_{i=1}^m α_i = 1

ν ∈ (0, 1) means at least νm support vectors.
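scikit-learn's NuSVC exposes this ν parameter directly; a small sketch with made-up data (note that scikit-learn's gamma corresponds to 1/(2σ²) for the Gaussian kernel):

    import numpy as np
    from sklearn.svm import NuSVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
                  [4.0, 4.0], [4.5, 3.5], [5.0, 5.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # nu in (0, 1) lower-bounds the fraction of support vectors and
    # upper-bounds the fraction of margin errors.
    clf = NuSVC(nu=0.5, kernel="rbf", gamma=0.5).fit(X, y)
    print(len(clf.support_) / len(y))  # fraction of support vectors, >= nu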
[Figure: C = …, Gaussian kernel, σ = ….]

[Figure: C = …0, Gaussian kernel, σ = ….]

[Figure: C = …00, Gaussian kernel, σ = ….]

[Figure: ν = 0.…, Gaussian kernel, σ = ….]

[Figure: ν = 0.…, Gaussian kernel, σ = ….]

[Figure: ν = 0.…, Gaussian kernel, σ = ….]
Support Vector Regression

SV classification uses y ∈ {−1, +1}, while regression tries to predict y ∈ R. SV regression uses the ε-insensitive loss:

    |y − z|_ε = 0 if |y − z| ≤ ε, and |y − z| − ε otherwise

or equivalently:

    |y − z|_ε = max(0, |y − z| − ε)

To obtain a hypothesis h(x) = b + w · x:

    minimize ‖w‖²/2 + C Σ_{i=1}^m |y_i − h(x_i)|_ε

One can think of SV regression as fitting a tube of radius ε to the data.

[Figure from Schölkopf and Smola, Learning with Kernels, MIT Press, 2002.]
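The ε-insensitive loss is one line of code (a sketch):

    import numpy as np

    def eps_insensitive(y, z, eps):
        """|y - z|_eps = max(0, |y - z| - eps): zero inside the eps-tube."""
        return np.maximum(0.0, np.abs(y - z) - eps)

    print(eps_insensitive(np.array([1.0, 2.0, 5.0]), 2.0, eps=1.5))
    # -> [0.  0.  1.5]; predictions within the tube incur no loss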
SV Regression Problem

Applying the usual math tricks results in:

    h(x) = b + Σ_{i=1}^m α_i k(x, x_i)

where the α_i weights are found by solving:

    maximize   −ε Σ_{i=1}^m |α_i| + Σ_{i=1}^m α_i y_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j k(x_i, x_j)
    subject to −C ≤ α_i ≤ C and Σ_{i=1}^m α_i = 0
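In practice this problem is solved by library code; a minimal sketch with scikit-learn's SVR on synthetic data (epsilon is the tube radius, C the penalty on points outside the tube):

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 6.0, 40).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

    # epsilon sets the tube radius; C trades off flatness against errors.
    reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
    print(reg.predict([[3.0]]), len(reg.support_))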