
A NEW ALGORITHM FOR FINDING THE MINIMUM DISTANCE BETWEEN TWO CONVEX HULLS

Dougsoo Kaown, B.Sc., M.Sc.

Dissertation Prepared for the Degree of DOCTOR OF PHILOSOPHY

UNIVERSITY OF NORTH TEXAS

May 2009

APPROVED:
Jianguo Liu, Major Professor
John Neuberger, Committee Member
Xiaohui Yuan, Committee Member
Matthew Douglass, Chair of the Department of Mathematics
Michael Monticino, Interim Dean of the Robert B. Toulouse School of Graduate Studies

Kaown, Dougsoo. A New Algorithm for Finding the Minimum Distance between Two Convex Hulls. Doctor of Philosophy (Mathematics), May 2009, 66 pp., 24 illustrations, bibliography, 27 titles. The problem of computing the minimum distance between two convex hulls has applications to many areas including robotics, computer graphics and path planning. Moreover, determining the minimum distance between two convex hulls plays a significant role in support vector machines (SVM). In this study, a new algorithm for finding the minimum distance between two convex hulls is proposed and investigated. Convergence of the algorithm is proved, and its applicability to support vector machines is demonstrated. The performance of the new algorithm is compared with the performance of one of the most popular algorithms, the sequential minimal optimization (SMO) method. The new algorithm is simple to understand, easy to implement, and can be more efficient than the SMO method for many SVM problems.

Copyright 2009 by Dougsoo Kaown

ACKNOWLEDGEMENTS

I thank my advisor Dr. Jianguo Liu, who gave me endless support and guidance. There are no words to describe his support and guidance. I thank my PhD committee members Dr. Neuberger and Dr. Yuan. I thank my father. Even though he has already left our family, I always think about him. Without my father, my mother raised me with her whole heart and endless love. I thank my mother; she always supports me. I also thank my wife and my daughter, who always love me and are with me. Without them, I could not have finished this pleasant and wonderful journey. I thank the Department of Mathematics for giving me this opportunity and supporting me on this journey. I thank all staff and faculty members in the department. They are my lifetime teachers.

CONTENTS

ACKNOWLEDGEMENTS
CHAPTER 1. INTRODUCTION
1.1. Introduction and Discussion of the Problem
1.2. Previous Works
CHAPTER 2. A NEW ALGORITHM FOR FINDING THE MINIMUM DISTANCE BETWEEN TWO CONVEX HULLS
2.1. A New Algorithm for Finding the Minimum Distance of Two Convex Hulls
2.2. Convergence for Algorithm 2.1
CHAPTER 3. SUPPORT VECTOR MACHINES
3.1. Introduction for Support Vector Machines (SVM)
3.2. SVM for Linearly Separable Sets
3.3. SVM Dual Formulation
3.4. Linearly Nonseparable Training Set
3.5. Kernel SVM
CHAPTER 4. SOLVING SUPPORT VECTOR MACHINES USING THE NEW ALGORITHM
4.1. Solving SVM Problems Using the New Algorithm
4.2. Stopping Method for the Algorithm
CHAPTER 5. EXPERIMENTS
5.1. Comparison with Linearly Separable Sets
5.2. Comparison with Real World Problems
CHAPTER 6. CONCLUSION AND FUTURE WORKS
6.1. Conclusion
6.2. Future Work
BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

1.1. Introduction and Discussion of the Problem

Finding the minimum distance between two convex hulls is a popular research topic with important real-world applications. The fields of robotics, animation, computer graphics and path planning all use this distance calculation. For example, in robotics, the distance between two convex hulls is calculated to determine the path of the robot so that it can avoid collisions. However, computing the minimum distance between two convex hulls plays its most significant role in support vector machines (SVM) [6], [7], [10], [15]. SVM is a very useful classification tool that owes its popularity to its impressive performance on many real-world problems. Many fast iterative methods have been introduced to solve SVM problems; the popular sequential minimal optimization (SMO) method is one of them. There are also many algorithms for finding the minimum distance between two convex hulls, including the Gilbert algorithm and the Mitchell-Dem'yanov-Malozemov (MDM) algorithm. These two algorithms are the most basic algorithms for solving SVM problems geometrically, and many published papers are based on them. Keerthi's paper [2] is an important example: it combines these two algorithms into a new hybrid algorithm and shows that SVM problems can be transformed into the problem of computing the minimum distance between two convex hulls.

Definition 1.1 Let $\{x_1, \ldots, x_m\}$ be a set of vectors in $\mathbb{R}^n$. A convex combination of $\{x_1, \ldots, x_m\}$ is given by

(1) $\sum_{i=1}^m \alpha_i x_i$, where $\sum_{i=1}^m \alpha_i = 1$ and $\alpha_i \ge 0$.

Definition 1.2 For a given set $X = \{x_1, \ldots, x_m\}$, $\mathrm{co}X$ denotes the convex hull of $X$, the set of all convex combinations of elements of $X$:

(2) $\mathrm{co}X = \{\sum_i \alpha_i x_i \mid \sum_i \alpha_i = 1,\ \alpha_i \ge 0\}$

Gilbert's algorithm and the MDM algorithm are easy to understand and implement. However, these two algorithms are designed for finding the nearest point to the origin from a given convex hull. Let $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_s\}$ be finite point sets in $\mathbb{R}^n$, and let $U$ and $V$ denote the convex hulls generated by $X$ and $Y$, respectively:

$U = \{u = \sum_i \alpha_i x_i \mid \sum_i \alpha_i = 1,\ \alpha_i \ge 0\}$
$V = \{v = \sum_j \beta_j y_j \mid \sum_j \beta_j = 1,\ \beta_j \ge 0\}$

Then the Gilbert and MDM algorithms find the solution of the following problem:

$\min \|u\|$ subject to $u = \sum_{i=1}^m \alpha_i x_i$, $\sum_{i=1}^m \alpha_i = 1$, $\alpha_i \ge 0$

This problem is called the minimal norm problem (MNP). The solution of the above MNP is the nearest point to the origin from $U$. For $x, y \in \mathbb{R}^n$, the Euclidean norm is defined by $\|x\| = \sqrt{(x, x)}$, where the inner product is $(x, y) = x^t y = x_1 y_1 + \ldots + x_n y_n$.

The minimum distance between two convex hulls can be found by solving an optimization problem called the nearest point problem (NPP); an NPP is shown in Figure 1.1. Using Gilbert's algorithm or the MDM algorithm, one can find the minimum distance between two convex hulls by finding the nearest point to the origin from the convex hull generated by the differences of vectors.

Nearest Point Problem (NPP): $\min \|u - v\|$

subject to
$u = \sum_{i=1}^m \alpha_i x_i$, $\sum_{i=1}^m \alpha_i = 1$, $\alpha_i \ge 0$
$v = \sum_{j=1}^s \beta_j y_j$, $\sum_{j=1}^s \beta_j = 1$, $\beta_j \ge 0$

Figure 1.1. Nearest Point Problem

Set $Z = \{x_i - y_j \mid x_i \in X \text{ and } y_j \in Y\}$ and let $W = \mathrm{co}Z$:

$W = \{w = \sum_l \gamma_l z_l \mid \sum_l \gamma_l = 1,\ \gamma_l \ge 0\}$

Then the minimum distance between the two convex hulls can be found by solving the following MNP:

$\min \|w\|$ subject to $w = \sum_l \gamma_l z_l$, $\sum_l \gamma_l = 1$, $\gamma_l \ge 0$

However, computing the minimum distance between two convex hulls with such algorithms requires large memory storage, because the convex hull $W$ has $m \cdot s$ vertices. For example, let $X = \{x_1, x_2\}$ and $Y = \{y_1, y_2, y_3\}$;

then $Z = X - Y = \{x_1 - y_1, x_1 - y_2, x_1 - y_3, x_2 - y_1, x_2 - y_2, x_2 - y_3\}$. This means that if one set has 1000 points and the other set has 1000 points, then 1,000,000 points will be used in the Gilbert or the MDM method. If these methods are used to find the minimum distance between two convex hulls, the calculation becomes too expensive. Thus, a direct method that avoids forming the differences of vectors is desired. This dissertation presents such a direct method for computing the minimum distance between two convex hulls. Chapter 2 proposes a new algorithm for finding the solution of the NPP and proves its convergence. Chapter 3 introduces support vector machines (SVM). Chapter 4 discusses solving SVM using the new algorithm. Chapter 5 shows the experimental results comparing SMO and the new algorithm. Finally, the conclusion and future work for the new algorithm are discussed in Chapter 6.

1.2. Previous Works

In this section, two important iterative algorithms for finding the nearest point to the origin from a given convex hull are discussed, beginning with Gilbert's algorithm [5]. Gilbert's algorithm was the first algorithm developed for finding the minimum distance to the origin from a convex hull; it is a geometric method and can be explained using the convex hull representation. First, let $X = \{x_1, \ldots, x_m\}$ and $U = \{u = \sum_i \alpha_i x_i \mid \sum_i \alpha_i = 1,\ \alpha_i \ge 0\}$.

Definition 1.3 For $u \in U$, define

(3) $\delta_{MDM}(u) = (u, u) - \min_i (x_i, u)$

Theorem 1 in [1] states that $\delta_{MDM}(u) = 0$ if and only if $u$ is the nearest point to the origin.

Gilbert's Algorithm

Step 1 Choose $u_k \in U$.
Step 2 Find $x_{i_k}$ with $(x_{i_k}, u_k) = \min_i (x_i, u_k)$.
Step 3 If $\delta_{MDM}(u_k) = 0$ then stop. Otherwise, find $u_{k+1}$, where $u_{k+1}$ is the point of the line segment joining $u_k$ and $x_{i_k}$ that has the minimum norm.
Step 4 Go to Step 2.

In Step 3, the minimum-norm point of the line segment joining $u_k$ and $x_{i_k}$ can be found in the following way. It is

$u_k$ if $(-u_k, x_{i_k} - u_k) \le 0$;
$x_{i_k}$ if $\|x_{i_k} - u_k\|^2 \le (-u_k, x_{i_k} - u_k)$;
$u_k + \dfrac{(-u_k, x_{i_k} - u_k)}{\|x_{i_k} - u_k\|^2}(x_{i_k} - u_k)$ otherwise.

Gilbert's algorithm tends to converge fast at the beginning, but it becomes very slow as it approaches the solution, because it keeps zigzagging until it finds the solution. Exploiting the fast initial convergence, Keerthi [2] used it when constructing his hybrid algorithm. After Gilbert's algorithm, Mitchell, Dem'yanov and Malozemov (MDM) suggested a geometric algorithm to find the nearest point to the origin. Like Gilbert's algorithm, the MDM algorithm is an iterative method for finding the nearest point to the origin. Let $X$ and $U$ be as above, and let $u^*$ be the nearest point to the origin. Define $(x_{i'}, u) = \max_{\alpha_i > 0}(x_i, u)$, where $\alpha_i$ is the coefficient of $x_i$ in the convex hull representation, and $(x_{i''}, u) = \min_i(x_i, u)$.

Definition 1.4 Define

(4) $\Delta_{MDM}(u) = (x_{i'}, u) - (x_{i''}, u)$

Note that

(5) $(x_{i'}, u) - (x_{i''}, u) = (x_{i'} - x_{i''}, u)$

Lemma 2 in [1] states that $\alpha_{i'}\Delta_{MDM}(u) \le \delta_{MDM}(u) \le \Delta_{MDM}(u)$, where $\alpha_{i'}$ is the coefficient of $x_{i'}$ in the convex hull representation.

Then by Theorem 1 and Lemma 2 in [1], $u^*$ is the nearest point to the origin if and only if $\Delta_{MDM}(u^*) = 0$. The algorithm for finding $u^*$ is the following.

The MDM Algorithm
Step 1 Choose $u_k \in U$.
Step 2 Find $x_{i''_k}$ and $x_{i'_k}$.
Step 3 Set $u_k(t) = u_k + t\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$, where $\alpha_{i'_k}$ is the coefficient of $x_{i'_k}$ and $t \in [0, 1]$.
Step 4 Set $u_{k+1} = u_k + t_k\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$, where $(u_k(t_k), u_k(t_k)) = \min_{t \in [0,1]} (u_k(t), u_k(t))$.
Step 5 If $\Delta_{MDM,k}(u_k) = (x_{i'_k} - x_{i''_k}, u_k) = 0$ then stop. Otherwise go to Step 2.

The convergence was proved in [1]. This algorithm is also very simple to implement. In this algorithm, $u_k$ needs to stay in the convex hull: the algorithm adds $t\alpha_{i'_k} x_{i''_k}$ and subtracts $t\alpha_{i'_k} x_{i'_k}$ from the current point so that the coefficients still sum to 1. It may take some time to see why only vertices with nonzero $\alpha_i$ are considered when choosing $x_{i'}$ in the convex hull representation, but the MDM algorithm has a reason to use $x_{i'}$ when the current point $u_k$ moves: if $\max_i(x_i, u)$ were used instead of $\max_{\alpha_i > 0}(x_i, u)$, then $u_k$ might converge slowly or might not converge to $u^*$ at all. The step $t_k\alpha_{i'_k}$ decides how far the current point moves, and $x_{i''_k} - x_{i'_k}$ gives the direction for $u_k$. Note that in each iteration, $u_k$ updates two coefficients in the representation. By modifying this step, four or more coefficients can be updated at a time; the method is shown in Appendix A.
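The MDM update just described is short enough to state directly in code. The following NumPy sketch of the MDM iteration for the minimal norm problem is purely illustrative: the function name, the barycentric starting point, and the tolerance-based stopping test are choices made here, not prescribed by the dissertation.

import numpy as np

def mdm_nearest_point(X, tol=1e-8, max_iter=10000):
    """Minimal MDM sketch: nearest point of co{X[0], ..., X[m-1]} to the origin.

    X is an (m, n) array whose rows are the points x_i. Returns (u, alpha).
    """
    m, _ = X.shape
    alpha = np.full(m, 1.0 / m)                 # start from the barycenter
    u = alpha @ X
    for _ in range(max_iter):
        s = X @ u                               # s[i] = (x_i, u)
        i_min = int(np.argmin(s))               # x_{i''}: smallest inner product overall
        support = np.where(alpha > 0)[0]
        i_max = int(support[np.argmax(s[support])])   # x_{i'}: largest among alpha_i > 0
        delta = s[i_max] - s[i_min]             # Delta_MDM(u) = (x_{i'} - x_{i''}, u)
        if delta <= tol:
            break                               # Theorem 1 / Lemma 2: u is (near) optimal
        d = X[i_min] - X[i_max]                 # move direction x_{i''} - x_{i'}
        # exact line search on [0, 1] for ||u + t * alpha_{i'} * d||^2
        t = min(1.0, delta / (alpha[i_max] * np.dot(d, d)))
        step = t * alpha[i_max]
        alpha[i_max] -= step                    # only two coefficients change per iteration
        alpha[i_min] += step
        u = u + step * d
    return u, alpha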

CHAPTER 2
A NEW ALGORITHM FOR FINDING THE MINIMUM DISTANCE BETWEEN TWO CONVEX HULLS

2.1. A New Algorithm for Finding the Minimum Distance of Two Convex Hulls

This chapter introduces a new algorithm for finding the minimum distance of two convex hulls and a proof of its convergence. The algorithm is an iterative method for computing the minimum distance between two convex hulls, and it has a natural application to SVM. Let $X$, $Y$, $Z$, $U$, $V$ and $W$ be the same as in Chapter 1. As mentioned in Chapter 1, finding the nearest point to the origin of $W$ is equivalent to solving the NPP. Let $w^*$ be the nearest point to the origin and $u^* - v^*$ be the solution of the NPP; then $w^* = u^* - v^*$. Note that any $w \in W$ can be written $w = u - v = \sum_i \alpha_i x_i - \sum_j \beta_j y_j$, and

$\delta_{MDM}(w) = (w, w) - \min_l (z_l, w) = (w, w) - \min_{i,j} (x_i - y_j, w)$

$\Delta_{MDM}(w) = \max_{\gamma_l > 0} (z_l, w) - \min_l (z_l, w) = \max_{\alpha_i > 0,\, \beta_j > 0} (x_i - y_j, w) - \min_{i,j} (x_i - y_j, w)$

Mitchell, Dem'yanov and Malozemov defined $\Delta_{MDM}(w)$ and $\delta_{MDM}(w)$ in their paper [1].

Definition 2.1 For $u \in U$ and $v \in V$, define

(6) $\Delta_x(u - v) = \max_{\alpha_i > 0} (x_i, u - v) - \min_i (x_i, u - v)$

(7) $\Delta_y(v - u) = \max_{\beta_j > 0} (y_j, v - u) - \min_j (y_j, v - u)$

In fact, $\Delta_{MDM}(u - v) = \Delta_x(u - v) + \Delta_y(v - u)$; Lemma 2.1 is needed in order to show this.

Lemma 2.1

(8) $\max_{i,j}(x_i - y_j, u - v) = \max_i(x_i, u - v) - \min_j(y_j, u - v)$

(9) $\min_{i,j}(x_i - y_j, u - v) = \min_i(x_i, u - v) - \max_j(y_j, u - v)$

(10) $-\max_j(y_j, u - v) = \min_j(y_j, v - u)$

(Proof) (8) can be shown by two inequalities ($\le$ and $\ge$).

($\le$) Let $(x_M - y_M, u - v) = \max_{i,j}(x_i - y_j, u - v)$; then $(x_M - y_M, u - v) = (x_M, u - v) - (y_M, u - v)$. Since $(x_M, u - v) \le \max_i(x_i, u - v)$ and $(y_M, u - v) \ge \min_j(y_j, u - v)$,
$(x_M - y_M, u - v) \le \max_i(x_i, u - v) - \min_j(y_j, u - v)$,
so $\max_{i,j}(x_i - y_j, u - v) \le \max_i(x_i, u - v) - \min_j(y_j, u - v)$.

($\ge$) Let $(x_m, u - v) = \max_i(x_i, u - v)$ and $(y_m, u - v) = \min_j(y_j, u - v)$; then $(x_m, u - v) - (y_m, u - v) = (x_m - y_m, u - v)$, and
$\max_{i,j}(x_i - y_j, u - v) \ge (x_m - y_m, u - v) = \max_i(x_i, u - v) - \min_j(y_j, u - v)$.

The proof of (9) is similar to (8). (10) can also be shown by two inequalities. Let $(y_M, u - v) = \max_j(y_j, u - v)$; then $-(y_M, u - v) = (y_M, v - u)$, so $-\max_j(y_j, u - v) = (y_M, v - u) \ge \min_j(y_j, v - u)$. Let $(y_{min}, v - u) = \min_j(y_j, v - u)$; then $(y_{min}, v - u) = -(y_{min}, u - v)$, so $\min_j(y_j, v - u) = -(y_{min}, u - v) \ge -\max_j(y_j, u - v)$.

Lemma 2.2

(11) $\Delta_{MDM}(u - v) = \Delta_x(u - v) + \Delta_y(v - u)$

(Proof) Since $u - v \in W$,

$\Delta_{MDM}(u - v) = \max_{\alpha_i > 0,\, \beta_j > 0}(x_i - y_j, u - v) - \min_{i,j}(x_i - y_j, u - v)$
$= \max_{\alpha_i > 0}(x_i, u - v) - \min_{\beta_j > 0}(y_j, u - v) - \min_i(x_i, u - v) + \max_j(y_j, u - v)$
$= \max_{\alpha_i > 0}(x_i, u - v) - \min_i(x_i, u - v) - \min_{\beta_j > 0}(y_j, u - v) + \max_j(y_j, u - v)$

$= \max_{\alpha_i > 0}(x_i, u - v) - \min_i(x_i, u - v) + \max_{\beta_j > 0}(y_j, v - u) - \min_j(y_j, v - u) = \Delta_x(u - v) + \Delta_y(v - u)$.

Using (8), (9) and (10), Lemma 2.2 is shown.

The new algorithm for finding the minimum distance between two convex hulls is an iterative method. In the new algorithm, the following notations are used. Let $(x_{i'_k}, u_k - v_k) = \max_{\alpha_{ik} > 0} (x_i, u_k - v_k)$ and $(x_{i''_k}, u_k - v_k) = \min_i (x_i, u_k - v_k)$, and set

(12) $\Delta_x(u_k - v_k) = (x_{i'_k} - x_{i''_k}, u_k - v_k)$

(13) $\Delta_y(v_k - u_k) = (y_{j'_k} - y_{j''_k}, v_k - u_k)$

Algorithm 2.1
Step 1 Choose $u_k \in U$ and $v_k \in V$, where $u_k = \sum_i \alpha_{ik} x_i$ and $v_k = \sum_j \beta_{jk} y_j$.
Step 2 Find $x_{i''_k}$ and $x_{i'_k}$, where $(x_{i''_k}, u_k - v_k) = \min_i (x_i, u_k - v_k)$ and $(x_{i'_k}, u_k - v_k) = \max_{\alpha_{ik} > 0} (x_i, u_k - v_k)$.
Step 3 $u_{k+1} = u_k + t_k\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$, where $0 \le t_k \le 1$ and $t_k = \min\{1, \frac{\Delta_x(u_k - v_k)}{\alpha_{i'_k}\|x_{i'_k} - x_{i''_k}\|^2}\}$. Note that $t_k$ is defined by $(u_k(t_k) - v_k, u_k(t_k) - v_k) = \min_{t \in [0,1]} (u_k(t) - v_k, u_k(t) - v_k)$, where $u_k(t) = u_k + t\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$ and
$(u_k(t) - v_k, u_k(t) - v_k) = \alpha_{i'_k}^2\|x_{i'_k} - x_{i''_k}\|^2 t^2 - 2\alpha_{i'_k}\Delta_x(u_k - v_k)\,t + (u_k - v_k, u_k - v_k)$.
Step 4 Find $y_{j''_k}$ and $y_{j'_k}$, where $(y_{j''_k}, v_k - u_{k+1}) = \min_j (y_j, v_k - u_{k+1})$ and $(y_{j'_k}, v_k - u_{k+1}) = \max_{\beta_{jk} > 0} (y_j, v_k - u_{k+1})$.
Step 5 $v_{k+1} = v_k + t'_k\beta_{j'_k}(y_{j''_k} - y_{j'_k})$, where $0 \le t'_k \le 1$ and $t'_k = \min\{1, \frac{\Delta_y(v_k - u_{k+1})}{\beta_{j'_k}\|y_{j'_k} - y_{j''_k}\|^2}\}$. Note that $t'_k$ is defined by

$(v_k(t'_k) - u_{k+1}, v_k(t'_k) - u_{k+1}) = \min_{t' \in [0,1]} (v_k(t') - u_{k+1}, v_k(t') - u_{k+1})$, where $v_k(t') = v_k + t'\beta_{j'_k}(y_{j''_k} - y_{j'_k})$ and
$(v_k(t') - u_{k+1}, v_k(t') - u_{k+1}) = \beta_{j'_k}^2\|y_{j'_k} - y_{j''_k}\|^2 t'^2 - 2\beta_{j'_k}\Delta_y(v_k - u_{k+1})\,t' + (v_k - u_{k+1}, v_k - u_{k+1})$.
Step 6 Iterating in this way, a sequence $\{u_k - v_k\}$ is obtained such that $\|u_{k+1} - v_{k+1}\| \le \|u_k - v_k\|$, since $\|u_{k+1} - v_k\| \le \|u_k - v_k\|$ and $\|u_{k+1} - v_{k+1}\| \le \|u_{k+1} - v_k\|$.

Figure 2.1. Algorithm 2.1

Figure 2.1 describes how to find the new $u$ and the new $v$: first find the new $u$ using the old $u$ and old $v$, then find the new $v$ using the old $v$ and the new $u$. Iterating in this way, the algorithm finds the solution $u^* - v^*$ of the NPP.
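Algorithm 2.1 alternates an MDM-type update on $U$ (holding $v$ fixed) with the same update on $V$ (holding the freshly updated $u$ fixed). The following NumPy sketch is an illustrative implementation of this alternation; the function names, the starting point, and the stopping test on $\Delta_x + \Delta_y$ (cf. Theorem 2 below) are choices made here, not part of the dissertation.

import numpy as np

def mdm_step(P, coeff, p, q, tol):
    """One MDM-type update of p = coeff @ P toward min ||p - q|| over co(P).

    Returns the updated (coeff, p) and the gap Delta = (x_{i'} - x_{i''}, p - q).
    """
    s = P @ (p - q)                               # s[i] = (x_i, p - q)
    i_min = int(np.argmin(s))                     # x_{i''}
    support = np.where(coeff > 0)[0]
    i_max = int(support[np.argmax(s[support])])   # x_{i'} (only coefficients > 0)
    delta = s[i_max] - s[i_min]
    if delta <= tol:
        return coeff, p, delta
    d = P[i_min] - P[i_max]
    t = min(1.0, delta / (coeff[i_max] * np.dot(d, d)))   # exact line search on [0, 1]
    step = t * coeff[i_max]
    coeff = coeff.copy()
    coeff[i_max] -= step
    coeff[i_min] += step
    return coeff, p + step * d, delta

def algorithm_2_1(X, Y, tol=1e-8, max_iter=10000):
    """Sketch of Algorithm 2.1: minimum distance between co(X) and co(Y)."""
    alpha = np.full(len(X), 1.0 / len(X))
    beta = np.full(len(Y), 1.0 / len(Y))
    u, v = alpha @ X, beta @ Y
    for _ in range(max_iter):
        alpha, u, dx = mdm_step(X, alpha, u, v, tol)   # update u with v fixed
        beta, v, dy = mdm_step(Y, beta, v, u, tol)     # update v with the new u fixed
        if dx + dy <= tol:                             # Delta_x + Delta_y = Delta_MDM
            break
    return u, v, np.linalg.norm(u - v)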

2.2. Convergence for Algorithm 2.1

Before proving the convergence of Algorithm 2.1, define the following.

Definition 2.2 For $u \in U$ and $v \in V$, define

(14) $\delta_x(u - v) = (u, u - v) - \min_i (x_i, u - v)$

(15) $\delta_y(v - u) = (v, v - u) - \min_j (y_j, v - u)$

Lemma 2.3 $\|(u - v) - (u^* - v^*)\|^2 \le \delta_x(u - v) + \delta_y(v - u)$.

(Proof) Lemma 2 in [1] states that for any $w \in W$, $\|w - w^*\|^2 \le \delta_{MDM}(w)$. Note that since $u - v \in W$,

$\delta_{MDM}(u - v) = (u - v, u - v) - \min_{i,j}(x_i - y_j, u - v)$
$= (u, u - v) - (v, u - v) - \min_i(x_i, u - v) + \max_j(y_j, u - v)$
$= (u, u - v) - \min_i(x_i, u - v) + (v, v - u) - \min_j(y_j, v - u)$
$= \delta_x(u - v) + \delta_y(v - u)$.

Then $\|(u - v) - (u^* - v^*)\|^2 \le \delta_{MDM}(u - v) = \delta_x(u - v) + \delta_y(v - u)$.

Lemma 2.4 Suppose $u_k \in U$, $v_k \in V$, $k = 1, 2, \ldots$ is a sequence of points such that $\|u_{k+1} - v_{k+1}\| \le \|u_k - v_k\|$ and there is a subsequence such that $\delta_x(u_{k_j} - v_{k_j}) \to 0$ and $\delta_y(v_{k_j} - u_{k_j}) \to 0$. Then $u_k - v_k \to u^* - v^* = w^*$.

(Proof) Since $\|u_k - v_k\|$ is a nonincreasing sequence bounded below, it has a limit $\|u_k - v_k\| \to \mu$. By Lemma 2.3, $\|u_{k_j} - v_{k_j} - (u^* - v^*)\| \to 0$ since $\delta_x(u_{k_j} - v_{k_j}) \to 0$ and $\delta_y(v_{k_j} - u_{k_j}) \to 0$. This implies $u_{k_j} - v_{k_j} \to u^* - v^* = w^*$, so $\mu = \|w^*\|$. Any convergent subsequence of $u_k - v_k$ must then converge to $w^* = u^* - v^*$ because of the uniqueness of $w^* = u^* - v^*$. Hence $u_k - v_k \to u^* - v^* = w^*$.

Lemmas 2.3 and 2.4 are the key lemmas for proving the convergence of Algorithm 2.1. After the following theorem, support vectors will be defined; support vectors are a significant concept in this dissertation and in support vector machines (SVM).

Theorem 1

(16) $\min_i (x_i, u^* - v^*) = (u^*, u^* - v^*)$

(17) $\min_j (y_j, v^* - u^*) = (v^*, v^* - u^*)$

(Proof) Let $u^* = \sum_i \alpha_i^* x_i$ and $v^* = \sum_j \beta_j^* y_j$. Note that $\min_i(x_i - v^*, u^* - v^*) \ge (u^* - v^*, u^* - v^*)$ and $\min_j(y_j - u^*, v^* - u^*) \ge (v^* - u^*, v^* - u^*)$, since each $x_i - v^*$ lies in $W$ and $w^* = u^* - v^*$ is the nearest point of $W$ to the origin.

Then $\min_i(x_i, u^* - v^*) - (v^*, u^* - v^*) \ge (u^*, u^* - v^*) - (v^*, u^* - v^*)$, so $\min_i(x_i, u^* - v^*) \ge (u^*, u^* - v^*)$. But $(u^*, u^* - v^*) = (\sum_i \alpha_i^* x_i, u^* - v^*) = \sum_i \alpha_i^*(x_i, u^* - v^*) \ge \min_i(x_i, u^* - v^*)$. Thus $\min_i(x_i, u^* - v^*) = (u^*, u^* - v^*)$.

Also $\min_j(y_j, v^* - u^*) - (u^*, v^* - u^*) \ge (v^*, v^* - u^*) - (u^*, v^* - u^*)$, so $\min_j(y_j, v^* - u^*) \ge (v^*, v^* - u^*)$. But $(v^*, v^* - u^*) = (\sum_j \beta_j^* y_j, v^* - u^*) = \sum_j \beta_j^*(y_j, v^* - u^*) \ge \min_j(y_j, v^* - u^*)$. Thus $\min_j(y_j, v^* - u^*) = (v^*, v^* - u^*)$.

Definition 2.3 Let the solution of the NPP be $u^* - v^*$. Those $x_i$ corresponding to $\alpha_i^* > 0$ and those $y_j$ corresponding to $\beta_j^* > 0$ are called support vectors. As seen in Figure 2.2, $u^*$ and $v^*$ can be represented by the support vectors $x_1$, $x_2$, $y_1$ and $y_6$.

Subtracting $(v^*, u^* - v^*)$ from both sides of (16) gives $\min_i (x_i - v^*, u^* - v^*) = (u^* - v^*, u^* - v^*)$. Similarly, from (17), $\min_j (y_j - u^*, v^* - u^*) = (v^* - u^*, v^* - u^*)$.

Figure 2.2. Support Vectors

Lemma 2.5 For any vector $u - v$, $u \in U$, $v \in V$,

(18) $\alpha_{i'}\Delta_x(u - v) \le \delta_x(u - v) \le \Delta_x(u - v)$, where $\alpha_{i'} > 0$ and $(x_{i'}, u - v) = \max_{\alpha_i > 0}(x_i, u - v)$;

(19) $\beta_{j'}\Delta_y(v - u) \le \delta_y(v - u) \le \Delta_y(v - u)$, where $\beta_{j'} > 0$ and $(y_{j'}, v - u) = \max_{\beta_j > 0}(y_j, v - u)$.

(Proof of (18)) Let $u = \sum_{i=1}^m \alpha_i x_i$. Then

$(u - v, u - v) = \sum_{i=1}^m \alpha_i(x_i - v, u - v) \le \max_{\alpha_i > 0}(x_i - v, u - v)$,

so $\delta_x(u - v) \le \Delta_x(u - v)$.

Let $(x_{i''}, u - v) = \min_i(x_i, u - v)$ and $(x_{i'}, u - v) = \max_{\alpha_i > 0}(x_i, u - v)$; then $\Delta_x(u - v) = (x_{i'} - x_{i''}, u - v)$. Set $A' = (\alpha'_1, \ldots, \alpha'_m)$, where $\alpha'_i = \alpha_i$ if $i \ne i', i''$; $\alpha'_{i'} = 0$; $\alpha'_{i''} = \alpha_{i''} + \alpha_{i'}$, and let $u' = u + \alpha_{i'}(x_{i''} - x_{i'})$. Then

$(u' - v, u - v) = (u + \alpha_{i'}(x_{i''} - x_{i'}) - v, u - v) = (u - v, u - v) + \alpha_{i'}(x_{i''} - x_{i'}, u - v) = (u - v, u - v) - \alpha_{i'}\Delta_x(u - v)$.

Since $u' - v$ is a convex combination of the $x_i - v$,

$\delta_x(u - v) = (u - v, u - v) - \min_i(x_i - v, u - v) \ge (u - v, u - v) - (u' - v, u - v)$,

and the right-hand side of the inequality equals $(u - v, u - v) - (u - v, u - v) + \alpha_{i'}\Delta_x(u - v)$, so $\delta_x(u - v) \ge \alpha_{i'}\Delta_x(u - v)$.

The proof of (19) is similar to the proof of (18).

Lemma 2.6

(20) $\lim_{k \to \infty} \alpha_{i'_k}\Delta_x(u_k - v_k) = 0$

(21) $\lim_{k \to \infty} \beta_{j'_k}\Delta_y(v_k - u_k) = 0$

(Proof of (20)) First, observe that $u_k(t) = u_k + t\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$, $v_k(t') = v_k + t'\beta_{j'_k}(y_{j''_k} - y_{j'_k})$ and $\|u_{k+1} - v_{k+1}\| \le \|u_k - v_k\|$. Note that

$(u_k(t) - v_k, u_k(t) - v_k) = (u_k - v_k, u_k - v_k) - 2t\alpha_{i'_k}\Delta_x(u_k - v_k) + t^2(\alpha_{i'_k}\|x_{i'_k} - x_{i''_k}\|)^2$.

Suppose the assertion is false. Then there is a subsequence $u_{k_j} - v_{k_j}$ for which $\alpha_{i'_{k_j}}\Delta_x(u_{k_j} - v_{k_j}) \ge \varepsilon > 0$. Then

$(u_{k_j}(t) - v_{k_j}, u_{k_j}(t) - v_{k_j}) \le (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - 2t\varepsilon + t^2 d^2$, where $d = \max_{l,p}\|x_l - x_p\|$.

The right-hand side of the inequality is minimized over $[0, 1]$ at $t^* = \min\{\varepsilon/d^2, 1\}$.

(2.6.1) If $t^* = \varepsilon/d^2$, then
$(u_{k_j}(t^*) - v_{k_j}, u_{k_j}(t^*) - v_{k_j}) \le (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - 2\frac{\varepsilon}{d^2}\varepsilon + (\frac{\varepsilon}{d^2})^2 d^2 = (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - \frac{\varepsilon^2}{d^2} = (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - t^*\varepsilon$.

(2.6.2) If $t^* = 1$ (i.e. $\varepsilon/d^2 > 1$, so $\varepsilon > t^* d^2$), then
$(u_{k_j}(t^*) - v_{k_j}, u_{k_j}(t^*) - v_{k_j}) \le (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - 2t^*\varepsilon + t^{*2} d^2 \le (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - 2t^*\varepsilon + \varepsilon t^* = (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - t^*\varepsilon$.

By (2.6.1) and (2.6.2),
$(u_{k_j}(t^*) - v_{k_j}, u_{k_j}(t^*) - v_{k_j}) \le (u_{k_j} - v_{k_j}, u_{k_j} - v_{k_j}) - t^*\varepsilon$.

Since $\|u_{k_j+1} - v_{k_j+1}\|^2 \le (u_{k_j}(t^*) - v_{k_j}, u_{k_j}(t^*) - v_{k_j})$ and the sequence $\|u_k - v_k\|^2$ is nonincreasing, the squared norms would decrease by at least $t^*\varepsilon$ infinitely often and tend to $-\infty$. This contradicts $\|u_{k+1} - v_{k+1}\| \le \|u_k - v_k\| \ge 0$.

The proof of (21) is similar to (20).

Lemma 2.7

(22) $\lim \Delta_x(u_k - v_k) = 0$

(23) $\lim \Delta_y(v_k - u_k) = 0$

(Proof) Suppose $\lim \Delta_x(u_k - v_k) \ne 0$ and $\lim \Delta_y(v_k - u_k) \ne 0$. Then there is $\Delta_x^*(u^* - v^*) > 0$ such that $\lim \Delta_x(u_k - v_k) = \Delta_x^*(u^* - v^*)$.

Then for sufficiently large $k > K_1$,

(24) $\Delta_x(u_k - v_k) \ge \frac{\Delta_x^*(u^* - v^*)}{2}$.

And there is $\Delta_y^*(v^* - u^*) > 0$ such that $\lim \Delta_y(v_k - u_k) = \Delta_y^*(v^* - u^*)$. Then for sufficiently large $k > K_2$,

(25) $\Delta_y(v_k - u_k) \ge \frac{\Delta_y^*(v^* - u^*)}{2}$.

By Lemma 2.6, $\alpha_{i'_k} \to 0$. Let $\hat{t}_k$ be the point at which $(u_k(t) - v_k, u_k(t) - v_k)$ assumes its global minimum. Since

$(u_k(t) - v_k, u_k(t) - v_k) = (u_k + t\alpha_{i'_k}(x_{i''_k} - x_{i'_k}) - v_k,\ u_k + t\alpha_{i'_k}(x_{i''_k} - x_{i'_k}) - v_k)$
$= (u_k - v_k, u_k - v_k) - 2t\alpha_{i'_k}\Delta_x(u_k - v_k) + t^2(\alpha_{i'_k}\|x_{i'_k} - x_{i''_k}\|)^2$,

we have

(26) $\hat{t}_k = \dfrac{\Delta_x(u_k - v_k)}{\alpha_{i'_k}\|x_{i'_k} - x_{i''_k}\|^2}$.

By $\Delta_x(u_k - v_k) \ge \Delta_x^*(u^* - v^*)/2$ and $\alpha_{i'_k} \to 0$, $\hat{t}_k \to \infty$. Again by Lemma 2.6, $\beta_{j'_k} \to 0$; let $\hat{t}'_k$ be the point at which $(v_k(t') - u_k, v_k(t') - u_k)$ assumes its global minimum. Then

(27) $\hat{t}'_k = \dfrac{\Delta_y(v_k - u_k)}{\beta_{j'_k}\|y_{j'_k} - y_{j''_k}\|^2}$.

Again, by $\Delta_y(v_k - u_k) \ge \Delta_y^*(v^* - u^*)/2$ and $\beta_{j'_k} \to 0$, $\hat{t}'_k \to \infty$. Hence, for large $k > K > K_1$, the minimum of $(u_k(t) - v_k, u_k(t) - v_k)$ on the segment $0 \le t \le 1$ is attained at $t_k = 1$, so that for such $k$

(28) $u_{k+1} = u_k + \alpha_{i'_k}(x_{i''_k} - x_{i'_k})$.

Also, for large $k > K' > K_2$, the minimum of $(v_k(t') - u_k, v_k(t') - u_k)$ on the segment $0 \le t' \le 1$ is attained at $t'_k = 1$, so that for such $k$

(29) $v_{k+1} = v_k + \beta_{j'_k}(y_{j''_k} - y_{j'_k})$.

Let $u_{k_j} - v_{k_j}$ be a subsequence such that

(30) $\Delta_x(u_{k_j} - v_{k_j}) \to \Delta_x^*(u^* - v^*)$.

By discarding terms from the sequence $u_{k_j} - v_{k_j}$, the following conditions can be satisfied:

(31) $\alpha_{i'_{k_j}} \to \alpha_{i'}^*$ and $\beta_{j'_{k_j}} \to \beta_{j'}^*$, where $\alpha_{i'}^*$ and $\beta_{j'}^*$ are coefficients of $u^*$ and $v^*$ respectively.

(32) $x_{i''_{k_j}} = x_{i''} \in X$; this means that $\min_i(x_i, u_{k_j} - v_{k_j})$ is attained for all $k_j$ at the same vector, $(x_{i''}, u_{k_j} - v_{k_j}) = \min_i (x_i, u_{k_j} - v_{k_j})$.

(33) $x_{i'_{k_j}} = x_{i'} \in X$

(34) $(x_{i'}, u_{k_j} - v_{k_j}) = \max_{\alpha_i > 0} (x_i, u_{k_j} - v_{k_j})$

Using (30), (31), (32) and (33), the following is obtained:

$(x_{i'} - x_{i''}, u^* - v^*) = \Delta_x^*(u^* - v^*)$

Now introduce the new notations:

(35) $\rho_x = \min_i (x_i, u^* - v^*)$

(36) $X_1 = \{x_i \in X \mid (x_i, u^* - v^*) = \rho_x\}$

(37) $X_2 = X \setminus X_1$

Note that

(38) $x_i \in X_2 \Rightarrow (x_i, u^* - v^*) > \rho_x$

(39) $x_{i''} \in X_1$

Now, for $X_2 \ne \emptyset$, set

(40) $\rho'_x = \min_{x_i \in X_2} (x_i, u^* - v^*)$

so that

(41) $\rho'_x > \rho_x$,

and let

(42) $\tau_x = \min\{\Delta_x^*(u^* - v^*),\ \rho'_x - \rho_x\}$.

Note that for $x_i \in X_2$, $(x_i, u^* - v^*) \ge \rho_x + \tau_x$. Choose $\delta_{0x}$ such that whenever $\|(u - v) - (u^* - v^*)\| < \delta_{0x}$,

(43) $\max_i |(x_i, u - v) - (x_i, u^* - v^*)| < \frac{\tau_x}{4}$

(Claim) Let $k_j$ be such that $\|u_{k_j} - v_{k_j} - (u^* - v^*)\| < \delta_{0x}$. Then $X_1(u_{k_j} - v_{k_j}) \subset X_1$, where $X_1(u_{k_j} - v_{k_j}) = \{x_i \in X \mid (x_i, u_{k_j} - v_{k_j}) = \min_l(x_l, u_{k_j} - v_{k_j})\}$.

(Proof of claim) Let $x_i \in X_1(u_{k_j} - v_{k_j})$. Then

$(x_i, u^* - v^*) = (x_i, u_{k_j} - v_{k_j}) + (x_i, (u^* - v^*) - (u_{k_j} - v_{k_j}))$
$= \min_l(x_l, u_{k_j} - v_{k_j}) + (x_i, (u^* - v^*) - (u_{k_j} - v_{k_j}))$
$= \rho_x + (x_i, (u^* - v^*) - (u_{k_j} - v_{k_j})) + [\,-\min_l(x_l, u^* - v^*) + \min_l(x_l, u_{k_j} - v_{k_j})\,]$
$\le \rho_x + |(x_i, (u^* - v^*) - (u_{k_j} - v_{k_j}))| + \max_l|(x_l, u_{k_j} - v_{k_j}) - (x_l, u^* - v^*)|$
$\le \rho_x + \tau_x/4 + \tau_x/4 = \rho_x + \tau_x/2$.

Since $(x_i, u^* - v^*) \le \rho_x + \tau_x/2 < \rho_x + \tau_x$, it follows from the definition of $\tau_x$ that $x_i \in X_1$. (End of claim)

Let $k_j > K$ be such that the following conditions hold:

(44) $\Delta_x(u_{k_j} - v_{k_j}) \ge \Delta_x^*(u^* - v^*) - \tau_x/4$

(45) $\|u_{k_j} - v_{k_j} - (u^* - v^*)\| < \delta_{0x}/2$

(46) $\sum_{l=1}^{p} \alpha_{i'_{k_j + l - 1}} < \delta_{0x}/4d$ (possible since $\alpha_{i'_k} \to 0$), where $d_x = \max_{l,r}\|x_l - x_r\| > 0$, $d_y = \max_{l,r}\|y_l - y_r\|$ and $d = \max\{d_x, d_y\}$

(47) $\sum_{l=1}^{p} \beta_{j'_{k_j + l - 1}} < \delta_{0x}/4d$ (possible since $\beta_{j'_k} \to 0$)

By (28) and (46),

(48) $u_{k_j + p} = u_{k_j} + \sum_{l=1}^{p} \alpha_{i'_{k_j + l - 1}}(x_{i''_{k_j + l - 1}} - x_{i'_{k_j + l - 1}})$

(49) $\|u_{k_j + p} - u_{k_j}\| < \frac{\delta_{0x}}{4d}\, d = \delta_{0x}/4$

By (29) and (47),

$v_{k_j + p} = v_{k_j} + \sum_{l=1}^{p} \beta_{j'_{k_j + l - 1}}(y_{j''_{k_j + l - 1}} - y_{j'_{k_j + l - 1}})$, and $\|v_{k_j + p} - v_{k_j}\| < \frac{\delta_{0x}}{4d}\, d = \delta_{0x}/4$.

Then by (45),

$\|u_{k_j + p} - v_{k_j + p} - (u^* - v^*)\| \le \|u_{k_j + p} - v_{k_j + p} - (u_{k_j} - v_{k_j})\| + \|u_{k_j} - v_{k_j} - (u^* - v^*)\| < \delta_{0x}$

Hence $X_1(u_{k_j+p} - v_{k_j+p}) \subset X_1$. Next, for all $x_i \in X_1$ and $p \in [1:m]$, the following is obtained:

$(x_i, u_{k_j+p} - v_{k_j+p}) = (x_i, u^* - v^*) + (x_i, u_{k_j+p} - v_{k_j+p} - (u^* - v^*)) \le \rho_x + \tau_x/4 \le \Delta_x^*(u^* - v^*) - \tau_x + \rho_x + \tau_x/4$,

i.e.

(50) $(x_i, u_{k_j+p} - v_{k_j+p}) \le \Delta_x^*(u^* - v^*) + \rho_x - \frac{3\tau_x}{4}$.

Now we show that for some vector $u_{k_j+p} - v_{k_j+p}$, $p \in [1:m]$,

$\max_{\{i \mid \alpha_{i,k_j+p} > 0\}} (x_i, u_{k_j+p} - v_{k_j+p}) \le \Delta_x^*(u^* - v^*) + \rho_x - \frac{3\tau_x}{4}$.

If this inequality fails to hold for some $p$, then $x_{i'_{k_j+p}} \in X_2$. Since $X_1(u_{k_j+p} - v_{k_j+p}) \subset X_1$, the update $u_{k_j+p+1} = u_{k_j+p} + \alpha_{i'_{k_j+p}}(x_{i''_{k_j+p}} - x_{i'_{k_j+p}})$ leaves the vector $x_{i'_{k_j+p}}$ with zero coefficient, while the newly introduced vector $x_{i''_{k_j+p}}$ satisfies (50). Since $X_2$ contains at most $m - 1$ vectors, there is $p \in [1:m]$ such that all vectors appearing in the representation satisfy (50). On the other hand,

$\max_{\alpha_i > 0} (x_i, u_{k_j+p} - v_{k_j+p}) \le \Delta_x^*(u^* - v^*) + \rho_x - \frac{3\tau_x}{4}$

(51) $\min_i (x_i, u_{k_j+p} - v_{k_j+p}) \ge \rho_x - \tau_x/4$.

Thus

$\Delta_x(u_{k_j+p} - v_{k_j+p}) = \max_{\alpha_i > 0} (x_i, u_{k_j+p} - v_{k_j+p}) - \min_i (x_i, u_{k_j+p} - v_{k_j+p}) \le \Delta_x^*(u^* - v^*) - \tau_x/2$.

This contradicts (44). Thus $\lim \Delta_x(u_k - v_k) = 0$. Similarly, $\lim \Delta_y(v_k - u_k) = 0$ can be shown.

Theorem 2 $\Delta_x(u - v) + \Delta_y(v - u) = 0$ if and only if $u - v = u^* - v^*$.

(Proof) By Lemma 2.2, $\Delta_x(u - v) + \Delta_y(v - u) = \Delta_{MDM}(u - v)$, and by Theorem 1 and Lemma 2 in [1], $\Delta_{MDM}(u - v) = 0$ if and only if $u - v = u^* - v^*$. Thus the theorem is proved.

Theorem 3 The sequence $u_k - v_k$ converges to $u^* - v^*$.

(Proof) By Lemma 2.7, there are subsequences $u_{k_j} - v_{k_j}$ and $v_{k_j} - u_{k_j}$ such that $\Delta_x(u_{k_j} - v_{k_j}) \to 0$ and $\Delta_y(v_{k_j} - u_{k_j}) \to 0$. Then by Lemmas 2.3, 2.4 and 2.5, $u_k - v_k$ converges to $u^* - v^*$.

Theorem 4 Let

(52) $G = \{u = \sum_i \alpha_i x_i \in U \mid (u, u^* - v^*) = (u^*, u^* - v^*)\}$

(53) $G' = \{v = \sum_j \beta_j y_j \in V \mid (v, v^* - u^*) = (v^*, v^* - u^*)\}$

Then for large $k$, $u_k \in G$ and $v_k \in G'$.

(Proof) By Theorem 1, $\min_i (x_i, u^* - v^*) = (u^*, u^* - v^*)$ and $\min_j (y_j, v^* - u^*) = (v^*, v^* - u^*)$. Set

$X_1 = \{x_i \in X \mid (x_i, u^* - v^*) = (u^*, u^* - v^*)\}$,
$X_2 = X \setminus X_1 = \{x_i \in X \mid (x_i, u^* - v^*) > (u^*, u^* - v^*)\}$,
$Y_1 = \{y_j \in Y \mid (y_j, v^* - u^*) = (v^*, v^* - u^*)\}$,
$Y_2 = \{y_j \in Y \mid (y_j, v^* - u^*) > (v^*, v^* - u^*)\}$.

If $X_2 = \emptyset$ then $u_k \in G$ for all $k$, and if $Y_2 = \emptyset$ then $v_k \in G'$ for all $k$. Now suppose $X_2, Y_2 \ne \emptyset$, and let

(54) $\tau = \min_{x_i \in X_2} (x_i, u^* - v^*) - (u^*, u^* - v^*) > 0$

and

(55) $\tau' = \min_{y_j \in Y_2} (y_j, v^* - u^*) - (v^*, v^* - u^*) > 0$

Note that $(x_i, u_k - v_k) \to (x_i, u^* - v^*)$ and $(y_j, v_k - u_k) \to (y_j, v^* - u^*)$, since $u_k - v_k \to u^* - v^*$. It follows that for large $k > K_1$,

(56) $\max_i |(x_i, u_k - v_k) - (x_i, u^* - v^*)| < \frac{\tau}{4}$

(57) $\max_j |(y_j, v_k - u_k) - (y_j, v^* - u^*)| < \frac{\tau'}{4}$

Then by the claim in Lemma 2.7, for large $k > K_1$, $X_1(u_k - v_k) \subset X_1$, where $X_1(u_k - v_k) = \{x_i \in X \mid (x_i, u_k - v_k) = \min_l (x_l, u_k - v_k)\}$, and $Y_1(v_k - u_k) \subset Y_1$, where $Y_1(v_k - u_k) = \{y_j \in Y \mid (y_j, v_k - u_k) = \min_l (y_l, v_k - u_k)\}$.

Now let $x_i \in X_1$; then

(58) $(x_i, u^* - v^*) = (u^*, u^* - v^*)$

By (56), $|(x_i, u_k - v_k) - (x_i, u^* - v^*)| < \frac{\tau}{4}$, so

(59) $(x_i, u_k - v_k) - (x_i, u^* - v^*) < \frac{\tau}{4}$

By (58), $(x_i, u_k - v_k) - (u^*, u^* - v^*) < \frac{\tau}{4}$, and therefore

(60) $(x_i, u_k - v_k) < (u^*, u^* - v^*) + \frac{\tau}{4}$

Now let $x_i \in X_2$; then by (54)

(61) $(x_i, u^* - v^*) - (u^*, u^* - v^*) \ge \tau$

and by (56)

(62) $-\frac{\tau}{4} < (x_i, u_k - v_k) - (x_i, u^* - v^*) < \frac{\tau}{4}$

Add $(x_i, u_k - v_k) - (x_i, u^* - v^*)$ to both sides of (61). Then

$(x_i, u_k - v_k) - (x_i, u^* - v^*) + (x_i, u^* - v^*) - (u^*, u^* - v^*) \ge \tau + (x_i, u_k - v_k) - (x_i, u^* - v^*)$,

and by (62),

(63) $(x_i, u_k - v_k) \ge (u^*, u^* - v^*) + \frac{3\tau}{4}$.

Similarly,

(64) $(y_j, v_k - u_k) \le (v^*, v^* - u^*) + \frac{\tau'}{4}$ if $y_j \in Y_1$

and

(65) $(y_j, v_k - u_k) \ge (v^*, v^* - u^*) + \frac{3\tau'}{4}$ if $y_j \in Y_2$

are obtained. For $u_k$, $k > K_1$, if the convex hull representation of $u_k$ involves some $x_i \in X_2$ with a nonzero coefficient, then $\Delta_x(u_k - v_k) \ge \frac{\tau}{2}$ by (60), (63) and $X_1(u_k - v_k) \subset X_1$. Similarly, if $v_k$, $k > K'_1$, involves some $y_j \in Y_2$ with a nonzero coefficient, then

(66) $\Delta_y(v_k - u_k) \ge \frac{\tau'}{2}$.

For large $k > K_1$ and $k > K'_1$, $u_k$ and $v_k$ can be written as

(67) $u_k = \sum_{\{i \mid x_i \in X_1\}} \alpha_i^{(k)} x_i + \sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} x_i$

(68) $v_k = \sum_{\{j \mid y_j \in Y_1\}} \beta_j^{(k)} y_j + \sum_{\{j \mid y_j \in Y_2\}} \beta_j^{(k)} y_j$

Consider $(u_k - u^*, u^* - v^*)$:

$(u_k - u^*, u^* - v^*) = (u_k, u^* - v^*) - (u^*, u^* - v^*)$
$= (\sum_{\{i \mid x_i \in X_1\}} \alpha_i^{(k)} x_i + \sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} x_i,\ u^* - v^*) - (u^*, u^* - v^*)$
$= \sum_{\{i \mid x_i \in X_1\}} \alpha_i^{(k)} (x_i, u^* - v^*) + \sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} (x_i, u^* - v^*) - \sum_{\{i \mid x_i \in X_1\}} \alpha_i^{(k)} (u^*, u^* - v^*) - \sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} (u^*, u^* - v^*)$
$= \sum_i \alpha_i^{(k)} (x_i - u^*, u^* - v^*) \ge \tau \sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)}$

since $(x_i, u^* - v^*) = (u^*, u^* - v^*)$ if $x_i \in X_1$ and $(x_i - u^*, u^* - v^*) \ge \tau$ if $x_i \in X_2$. By the convergence of $u_k - v_k$, the left-hand side of this inequality tends to zero as $k \to \infty$, which means that $\sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} \to 0$.

(69) Similarly,

$(v_k - v^*, v^* - u^*) = \sum_j \beta_j^{(k)} (y_j - v^*, v^* - u^*) \ge \tau' \sum_{\{j \mid y_j \in Y_2\}} \beta_j^{(k)}$,

so $\sum_{\{j \mid y_j \in Y_2\}} \beta_j^{(k)} \to 0$ as well.

Choose $K'' > K_1$ so large that for $k \ge K''$,

$\sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)} \le \frac{\tau}{2d'^2}$ and $\sum_{\{j \mid y_j \in Y_2\}} \beta_j^{(k)} \le \frac{\tau'}{2d'^2}$, where $d' = \max_{x_i \in X_1,\, x_j \in X_2} \|x_i - x_j\| > 0$.

Let $\hat{t}_k$ be the point at which $(u_k(t) - v_k, u_k(t) - v_k)$ assumes its global minimum. If the representation of $u_k$, $k \ge K''$, involves some $x_i \in X_2$ with a nonzero coefficient, then

$\hat{t}_k = \dfrac{\Delta_x(u_k - v_k)}{\alpha_{i'_k}\|x_{i'_k} - x_{i''_k}\|^2} \ge \dfrac{\tau/2}{(\sum_{\{i \mid x_i \in X_2\}} \alpha_i^{(k)})\, d'^2} \ge 1$

by $\Delta_x(u_k - v_k) \ge \frac{\tau}{2}$. Hence $u_{k+1} = u_k + \alpha_{i'_k}(x_{i''_k} - x_{i'_k})$. Similarly, choose $K''' > K'_1$ so large that for $k \ge K'''$, $v_{k+1} = v_k + \beta_{j'_k}(y_{j''_k} - y_{j'_k})$. Now, in $u_{k+1}$, $x_{i''_k} \in X_1$, while $x_{i'_k}$ disappears from the representation of $u_{k+1}$ (i.e. $x_{i'_k}$ has zero coefficient in $u_{k+1}$). Then there is $l \in \{1, \ldots, m\}$ such that $u_{k+l}$ does not involve any vector $x_i \in X_2$, so there is $k > K'' + l$ such that $u_k \in G$. Similarly, there is some $k$ such that $v_k \in G'$.

CHAPTER 3
SUPPORT VECTOR MACHINES

3.1. Introduction for Support Vector Machines (SVM)

Support vector machines (SVM) are a family of learning algorithms used for the classification of objects into two classes. The theory was developed by Vapnik and Chervonenkis, and SVM became very popular in the early 1990s. SVM has been broadly applied to all kinds of classification tasks, from handwritten digit recognition to face detection in images. SVM gained its popularity because it applies to a wide range of real-world problems, is computationally efficient, and is theoretically robust.

For a linearly separable data set there are many possible decision boundaries. In order to generalize well to new data, a good decision boundary is needed: it should be as far away from both classes as possible. The shortest distance between the two boundaries is called the margin. SVM maximizes the margin, which results in fewer errors on future test data, whereas other methods only reduce training classification errors; in this sense, SVM generalizes better than other methods. To illustrate this, consider Figure 3.1. Lines (1) and (2) both classify the two groups with zero error, but the goal of classification is to classify future inputs. If a test point T belongs to the O group, then (1) and (2) are no longer equivalent: (2) is a better generalization than (1). The maximal margin, and the optimal hyperplane that attains it, can be obtained by solving a quadratic programming (QP) optimization problem. The SVM QP optimization problem and the classifier will be introduced in Section 3.2. In 1992, Bernhard Boser, Isabelle Guyon and Vapnik added the kernel trick, which will be introduced

Figure 3.1. SVM Generalization

in Section 3.5, to the linear classifier. The kernel trick makes linear classification possible in a feature space that is usually larger than the original space.

3.2. SVM for Linearly Separable Sets

This section introduces the most basic SVM problem: SVM with linearly separable sets. Let $S$ be a training set of points $x_i \in \mathbb{R}^m$ with labels $t_i \in \{-1, 1\}$ for $i = 1, \ldots, N$, i.e. $S = \{(x_1, t_1), \ldots, (x_N, t_N)\}$: if $t_i = +1$ then $x_i$ belongs to the first class, and if $t_i = -1$ then $x_i$ belongs to the second class. If $S$ is linearly separable, then there is a hyperplane $wx + b = 0$ with $w \in \mathbb{R}^m$ that correctly classifies all points in $S$. Define the classifier $f(x_i) = wx_i + b$; $x_i$ is assigned to the first class when $f(x_i) > 0$ ($t_i = +1$) and to the second class when $f(x_i) < 0$ ($t_i = -1$), i.e.

$wx_i + b > 0$ if $t_i = +1$
$wx_i + b < 0$ if $t_i = -1$

Figure 3.2. Linear Classifier

Even if there is a classifier that separates the training set correctly, a maximal margin is needed: in order to separate the two sets correctly and have better generalization, it is necessary to maximize the margin. To compute an optimal hyperplane, consider the following inequalities for any $i = 1, \ldots, N$:

$wx_i + b \ge 1$ if $t_i = +1$
$wx_i + b \le -1$ if $t_i = -1$

These two inequalities define the decision boundaries. To compute the margin, consider the following two hyperplanes:

$wx_1 + b = 1$
$wx_2 + b = 0$

where the distance from $x_1$ to $x_2$ is the shortest distance between the two hyperplanes. This means that the direction $x_2 - x_1$ is parallel to $w$. Subtracting the second equality from the first, $w(x_1 - x_2) = 1$ is obtained; since $w$ and $x_1 - x_2$ are parallel, $\|x_1 - x_2\| = \frac{1}{\|w\|}$. In order to minimize future errors, $\|x_1 - x_2\|$ should be maximized, and maximizing $\|x_1 - x_2\|$ is equivalent to minimizing $\|w\|$. Thus, the optimal hyperplane can be found by solving the following optimization problem:

(70) $\min \frac{1}{2}\|w\|^2$

subject to $t_i(w^T x_i + b) \ge 1$, $i = 1, \ldots, N$.

Figure 3.3. Classifier with Margin

3.3. SVM Dual Formulation

This section introduces the dual formulation of the SVM optimization problem. The dual formulation is usually more efficient to implement; it also extends more naturally to nonseparable sets, and, most importantly, the kernel trick can be used in the dual formulation. The last part of this section derives the dual formulation using the KKT conditions of optimization theory. Consider the Lagrangian function for (70),

(71) $L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \lambda_i [t_i(w^T x_i + b) - 1]$

where $\lambda = (\lambda_1, \ldots, \lambda_N)$, and differentiate with respect to the original variables:

(72) $\frac{\partial L}{\partial w}(w, b, \lambda) = w - \sum_{i=1}^{N} \lambda_i t_i x_i = 0$

(73) $\frac{\partial L}{\partial b}(w, b, \lambda) = -\sum_{i=1}^{N} \lambda_i t_i = 0$

From (72),

(74) $w = \sum_{i=1}^{N} \lambda_i t_i x_i$.

Substituting (74) into (71),

$L(w, b, \lambda) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j - \sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j - b\sum_{i=1}^{N} \lambda_i t_i + \sum_{i=1}^{N} \lambda_i = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + \sum_{i=1}^{N} \lambda_i$.

Thus SVM-DUAL is the following:

(75) $\max\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + \sum_{i=1}^{N} \lambda_i$
subject to $\sum_{i=1}^{N} \lambda_i t_i = 0$, $\lambda_i \ge 0$.

Once $\lambda$ is found, $(w, b)$ can be found easily by using $w = \sum_{i=1}^{N} \lambda_i t_i x_i$ and $wx_i + b = 1$ for a support vector $x_i$ in the first class.

3.4. Linearly Nonseparable Training Set

In the case of a linearly nonseparable training set, a linear classifier $wx + b = 0$ cannot separate the training set correctly, which means that the linear classifier must allow some errors. The following three cases can happen with the linear classifier $wx + b = 0$:

Case 1: $t_i(wx_i + b) \ge 1$
Case 2: $0 \le t_i(wx_i + b) < 1$
Case 3: $t_i(wx_i + b) < 0$

Cases 2 and 3 have errors while Case 1 does not. These errors can be measured using slack variables $s_i$: in Case 1, $s_i = 0$; in Case 2, $0 < s_i \le 1$; and in Case 3, $s_i > 1$. Figure 3.4 shows the case of a linearly nonseparable set.

Figure 3.4. Linearly Nonseparable Set

In the case of a linearly nonseparable set, the margin is maximized while the number of $s_i > 0$ is minimized. So there are two goals: the first is to maximize the margin, and the second is to minimize the number of $s_i > 0$. Now consider

(76) $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} s_i$

In (76), the parameter $C$ was introduced. Consider two extreme cases: when $C = 0$, the second goal is ignored (this is the case of a linearly separable set), and when $C \to \infty$, the margin is ignored. $C$ is a parameter that the user of SVM can choose to tune the trade-off between the width of the margin and the number of nonzero $s_i$. So the optimization problem for the nonseparable case is the following:

(77) $\min \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} s_i$
subject to $t_i(wx_i + b) \ge 1 - s_i$, $s_i \ge 0$,

$i = 1, 2, \ldots, N$.

Now consider the Lagrangian function for (77):

(78) $L(w, b, s, \lambda, \mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} s_i - \sum_{i=1}^{N} \lambda_i [t_i(w^T x_i + b) - 1 + s_i] - \sum_{i=1}^{N} \mu_i s_i$

and differentiate with respect to the original variables:

(79) $\frac{\partial L}{\partial w}(w, b, s, \lambda, \mu) = w - \sum_{i=1}^{N} \lambda_i t_i x_i = 0$

(80) $\frac{\partial L}{\partial b}(w, b, s, \lambda, \mu) = -\sum_{i=1}^{N} \lambda_i t_i = 0$

(81) $\frac{\partial L}{\partial s_i}(w, b, s, \lambda, \mu) = C - \lambda_i - \mu_i = 0$

From (79),

(82) $w = \sum_{i=1}^{N} \lambda_i t_i x_i$

Substituting (82) into (78),

$L(w, b, s, \lambda, \mu) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + C\sum_{i=1}^{N} s_i - \sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j - b\sum_{i=1}^{N} \lambda_i t_i - \sum_{i=1}^{N} \lambda_i s_i - \sum_{i=1}^{N} \mu_i s_i + \sum_{i=1}^{N} \lambda_i$

$= -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + \sum_{i=1}^{N} \lambda_i + C\sum_{i=1}^{N} s_i - \sum_{i=1}^{N} \lambda_i s_i - \sum_{i=1}^{N} \mu_i s_i$.

By (81), $C s_i - \lambda_i s_i - \mu_i s_i = 0$ for each $i$, so

$L = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + \sum_{i=1}^{N} \lambda_i$.

Thus the dual of the linearly nonseparable SVM is obtained as follows:

(83) $\max\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j x_i x_j + \sum_{i=1}^{N} \lambda_i$
subject to $\sum_{i=1}^{N} \lambda_i t_i = 0$, $0 \le \lambda_i \le C$, $i = 1, 2, \ldots, N$.
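To make (83) concrete, the following small Python sketch solves the dual for a toy two-dimensional data set with a generic constrained optimizer. The data values, the choice of C, and the use of SciPy's SLSQP solver are illustrative choices, not part of the dissertation (which instead solves the equivalent geometric problem with Algorithm 2.1).

import numpy as np
from scipy.optimize import minimize

# Tiny toy data in R^2 (illustrative values only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
t = np.array([-1.0, -1.0, 1.0, 1.0])
C = 10.0

K = X @ X.T                                   # linear kernel (x_i, x_j)
H = (t[:, None] * t[None, :]) * K             # H_ij = t_i t_j (x_i, x_j)

def neg_dual(lam):
    # negative of the dual objective in (83), since scipy minimizes
    return 0.5 * lam @ H @ lam - lam.sum()

cons = [{"type": "eq", "fun": lambda lam: lam @ t}]   # sum_i lambda_i t_i = 0
bounds = [(0.0, C)] * len(t)                          # 0 <= lambda_i <= C

res = minimize(neg_dual, x0=np.zeros(len(t)), method="SLSQP",
               bounds=bounds, constraints=cons)
lam = res.x
w = (lam * t) @ X                             # (82): w = sum_i lambda_i t_i x_i
sv = int(np.argmax(lam))                      # any index with 0 < lambda_i < C would do
b = t[sv] - w @ X[sv]                         # from t_i (w x_i + b) = 1 on a support vector
print("lambda =", np.round(lam, 4), " w =", w, " b =", round(b, 4))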

3.5. Kernel SVM

In many problems, a linear classifier cannot achieve both high performance and good generalization in the original space.

Figure 3.5. Nonlinear Classifier

In Figure 3.5, a linear classifier cannot separate the two data sets correctly, so a nonlinear classifier must be considered; the nonlinear classifier produces an efficient generalization for future tests. How, then, does SVM separate a data set that is not linearly separable? Consider the training set $S = \{(x_1, t_1), \ldots, (x_N, t_N)\}$, which is not linearly separable. $S$ can, however, be linearly separable in a different space, usually of higher dimension than the original space, obtained using a set of real functions $\varphi_1, \ldots, \varphi_M$. Here, assume $\varphi$ is unknown. These functions are called features. Using $\varphi$, $x = (x_1, \ldots, x_m)$ is mapped to $\varphi(x) = (\varphi_1(x), \ldots, \varphi_M(x))$. After mapping to the feature space, the new training set $S' = \{(\varphi(x_1), t_1), \ldots, (\varphi(x_N), t_N)\}$ is obtained. To see this clearly, consider the following example: $S_1 = \{(a, 1), (b, -1), (c, -1), (d, 1)\}$, where $a = (0, 0)$, $b = (1, 0)$, $c = (0, 1)$ and $d = (1, 1)$. $S_1$ cannot be linearly separated without making an error.

Now let $\varphi(x) = (x_1^2,\ \sqrt{2}x_1 x_2,\ x_2^2)$, where $x = (x_1, x_2)$. By the feature function $\varphi(x)$,

$a = (0, 0) \to \varphi(a) = a' = (0, 0, 0)$
$b = (1, 0) \to \varphi(b) = b' = (1, 0, 0)$
$c = (0, 1) \to \varphi(c) = c' = (0, 0, 1)$
$d = (1, 1) \to \varphi(d) = d' = (1, \sqrt{2}, 1)$

First, check that $S_1' = \{(a', 1), (b', -1), (c', -1), (d', 1)\}$ is linearly separable in $\mathbb{R}^3$; here $\varphi(x)$ can be written out explicitly. However, as mentioned earlier, $\varphi$ is usually unknown when a kernel function is used. Let $x = (x_1, x_2)$ and $y = (y_1, y_2)$, and define $K(x, y) = (x, y)^2 = x_1^2 y_1^2 + 2x_1 y_1 x_2 y_2 + x_2^2 y_2^2$. One can easily check that $K(a, b) = \varphi(a)\varphi(b)$ and $K(c, d) = \varphi(c)\varphi(d)$. SVM uses a kernel function whose value equals the value of the inner product in the feature space. Thus, using the new training set $S'$ and (83), consider the following optimization problem:

(84) $\max\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j \varphi(x_i)\varphi(x_j) + \sum_{i=1}^{N} \lambda_i$
subject to $\sum_{i=1}^{N} \lambda_i t_i = 0$, $0 \le \lambda_i \le C$, $i = 1, 2, \ldots, N$.

Now define the kernel $K(x_i, x_j) = \varphi(x_i)\varphi(x_j)$ for the inner product $(x_i, x_j)$ in the original space. Then (84) can be written as the following:

(85) $\max\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} t_i t_j \lambda_i \lambda_j K(x_i, x_j) + \sum_{i=1}^{N} \lambda_i$

subject to $\sum_{i=1}^{N} \lambda_i t_i = 0$, $0 \le \lambda_i \le C$, $i = 1, 2, \ldots, N$.

By the definition of a kernel, $K(x_i, x_j)$ always corresponds to an inner product in some feature space, and $K(x_i, x_j)$ can be computed without computing the images $\varphi(x_i)$ and $\varphi(x_j)$. Replacing the inner products in (84) by $K$ as in (85) is called kernel substitution. By using kernel substitution, the inner product evaluations can be computed in the original space without knowing the feature space. The following are popular kernel functions:

1. Polynomial kernels: $K(x_i, x_j) = [(x_i, x_j) + c]^d$, where $d$ is the degree of the polynomial and $c$ is a constant.

2. Radial basis function kernel: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\gamma^2}$, where $\gamma$ is a parameter.

3. Sigmoid kernel: $K(x_i, x_j) = \tanh[\alpha(x_i, x_j) + \beta]$, where $\alpha$ and $\beta$ are parameters.
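The identity $K(x, y) = (x, y)^2 = \varphi(x)\varphi(y)$ used in the $S_1$ example above is easy to verify numerically. The following short Python check is purely illustrative; the function names are chosen here for readability.

import numpy as np

def phi(x):
    """Feature map from Section 3.5: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def K(x, y):
    """Polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

a, b, c, d = map(np.array, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])

for p, q in [(a, b), (c, d), (b, d)]:
    # kernel value in the original space vs. inner product in the feature space
    assert np.isclose(K(p, q), np.dot(phi(p), phi(q)))
print("K(x, y) = phi(x) . phi(y) verified on the example points")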

CHAPTER 4
SOLVING SUPPORT VECTOR MACHINES USING THE NEW ALGORITHM

4.1. Solving SVM Problems Using the New Algorithm

This chapter shows that Algorithm 2.1 can be used to solve SVM problems. Section 4.1 shows that solving an SVM is equivalent to finding the minimum distance between two convex hulls. Finding the maximal margin of the training data sets and finding the minimum distance between the two convex hulls spanned by the training data sets seem to be different processes, yet after reformulating SVM-NV, SVM-NV can be recognized as an NPP. Let $\{x_i\}_{i=1}^{m}$ be a training data set, let $I$ be the index set of the first class and $J$ the index set of the second class, and set

$U = \{u \mid u = \sum_{i \in I} \beta_i x_i,\ \sum_{i \in I} \beta_i = 1,\ \beta_i \ge 0\}$ and $V = \{v \mid v = \sum_{j \in J} \beta_j x_j,\ \sum_{j \in J} \beta_j = 1,\ \beta_j \ge 0\}$

The reformulation of SVM-NV was done by Keerthi [2]. The SVM dual is equivalent to

(86) $\min \frac{1}{2}\sum_{s}\sum_{t} \beta_s \beta_t t_s t_t (x_s, x_t)$
subject to $\beta_s \ge 0$, $\sum_{i \in I} \beta_i = 1$, $\sum_{j \in J} \beta_j = 1$

Now consider an NPP problem. In an NPP problem, $\|u - v\|^2$ is to be minimized:

$\|u - v\|^2 = \|\sum_{i} \beta_i x_i - \sum_{j} \beta_j x_j\|^2 = (\sum_{i} \beta_i x_i - \sum_{j} \beta_j x_j,\ \sum_{i} \beta_i x_i - \sum_{j} \beta_j x_j)$
$= (\sum_{i} \beta_i x_i, \sum_{i} \beta_i x_i) - (\sum_{i} \beta_i x_i, \sum_{j} \beta_j x_j) - (\sum_{j} \beta_j x_j, \sum_{i} \beta_i x_i) + (\sum_{j} \beta_j x_j, \sum_{j} \beta_j x_j)$

(87) $= \sum_{i}\sum_{i'} \beta_i \beta_{i'} (x_i, x_{i'}) - \sum_{i}\sum_{j} \beta_i \beta_j (x_i, x_j) - \sum_{j}\sum_{i} \beta_j \beta_i (x_j, x_i) + \sum_{j}\sum_{j'} \beta_j \beta_{j'} (x_j, x_{j'})$.

Note that in (87), if an inner product involves two vectors from the same index set, then its sign is positive; otherwise, it is negative. Thus, (87) can be written as

(88) $\sum_{s}\sum_{t} \beta_s \beta_t t_s t_t (x_s, x_t)$, where $t_s t_t = 1$ if $x_s$ and $x_t$ are from the same index set and $t_s t_t = -1$ otherwise.

The minimum distance between two convex hulls is nonzero if and only if the two convex hulls do not intersect; that is, if the two convex hulls intersect, then the minimum distance is zero. As shown earlier in Chapter 3, real SVM problems have some violations, and if some $x_i$ violates the margin constraints, the two convex hulls could intersect. However, a zero minimum distance between the two convex hulls is not desired. Here, Friess recommends using a sum of squared violations in the cost function:

$\min \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{k} s_k^2$
subject to $wx_i + b \ge 1 - s_i$, $i \in I$; $wx_j + b \le -1 + s_j$, $j \in J$

This problem is called SVM with the sum of squared violations (SVM-VQ). SVM-VQ can be converted to (89) by a simple transformation. To see this, set

$\tilde{w} = \begin{pmatrix} w \\ \sqrt{C}\,s \end{pmatrix}$; $\tilde{b} = b$; $\tilde{x}_i = \begin{pmatrix} x_i \\ e_i/\sqrt{C} \end{pmatrix}$, $i \in I$, and $\tilde{x}_j = \begin{pmatrix} x_j \\ -e_j/\sqrt{C} \end{pmatrix}$, $j \in J$,

where $e_i$ is the $m$-dimensional vector whose $i$th component is 1 and all others are 0. Then

$\frac{1}{2}\|\tilde{w}\|^2 = \frac{1}{2}\|w\|^2 + \frac{C}{2}\sum_{k} s_k^2$
$\tilde{w}\tilde{x}_i + \tilde{b} = wx_i + b + s_i$
$\tilde{w}\tilde{x}_j + \tilde{b} = wx_j + b - s_j$

Thus, SVM-VQ is equivalent to

(89) $\min \frac{1}{2}\|\tilde{w}\|^2$
subject to $\tilde{w}\tilde{x}_i + \tilde{b} \ge 1$, $i \in I$; $\tilde{w}\tilde{x}_j + \tilde{b} \le -1$, $j \in J$

Here, the feasible space of SVM-VQ is non-empty because of the slack variables $s_i$, and a solution of (89) is automatically feasible. By setting

$\tilde{x}_i = \begin{pmatrix} x_i \\ e_i/\sqrt{C} \end{pmatrix}$, $i \in I$, and $\tilde{x}_j = \begin{pmatrix} x_j \\ -e_j/\sqrt{C} \end{pmatrix}$, $j \in J$,

the new algorithm can be used as usual.

4.2. Stopping Method for the Algorithm

This section shows the stopping method for the NPP using Algorithm 2.1; it uses the KKT conditions of optimization theory. First, consider the general quadratic programming problem:

(Quadratic Programming Problem) $\min \frac{1}{2}x^t Q x + cx + d$ subject to $Ax - b = 0$, $x \ge 0$.

The KKT conditions for the quadratic program are the following. With $L(x, \lambda, \mu) = \frac{1}{2}x^t Q x + cx + d - \lambda^t(Ax - b) - \mu^t x$:

1. $\frac{\partial L(x, \lambda, \mu)}{\partial x} = Qx + c - \lambda^t A - \mu = 0$
2. $\lambda^t(Ax - b) = 0$, $\mu^t x = 0$
3. $\mu \ge 0$

By the KKT conditions, if $x_i \ne 0$ then $(Qx + c - \lambda^t A)_i = 0$. This means that if the nonzero indices are known, then $(Qx + c - \lambda^t A)_i = 0$ for those indices; together with condition 2, a linear system can be solved to solve the quadratic programming problem.
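Under the transformation above, applying Algorithm 2.1 to an SVM-VQ problem only requires augmenting each training vector before calling an NPP solver. The following is a minimal sketch, reusing the illustrative algorithm_2_1 function from the sketch after Section 2.1; the data and the value of C are made up for illustration.

import numpy as np

def augment(X_pos, X_neg, C):
    """Build the augmented vectors x~ of Section 4.1 for SVM-VQ.

    x~_i = (x_i, e_i/sqrt(C)) for the first class and x~_j = (x_j, -e_j/sqrt(C))
    for the second class, where e_k is the k-th standard basis vector in R^m.
    """
    m = len(X_pos) + len(X_neg)
    E = np.eye(m) / np.sqrt(C)
    U = np.hstack([X_pos, E[:len(X_pos)]])
    V = np.hstack([X_neg, -E[len(X_pos):]])
    return U, V

# toy data (illustrative)
X_pos = np.array([[1.0, 1.0], [2.0, 1.5]])
X_neg = np.array([[-1.0, -1.0], [-2.0, -0.5]])
U, V = augment(X_pos, X_neg, C=10.0)
u, v, dist = algorithm_2_1(U, V)     # minimum distance in the augmented space
print("minimum distance:", dist)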

By (88), the NPP is equivalent to the following quadratic programming problem:

(NPP-QP) $\min \frac{1}{2}a^t A a$ subject to $\sum_i a_i = 2$, $a \ge 0$, where $A = \begin{pmatrix} X^t X & -X^t Y \\ -Y^t X & Y^t Y \end{pmatrix}$.

Consider the Lagrangian for NPP-QP, $L(a, \lambda, \mu) = \frac{1}{2}a^t A a - \lambda(a^t e - 2) - \mu^t a$. Then by the KKT conditions,

1. $Aa - \lambda e - \mu = 0$
2. $\lambda(a^t e - 2) = 0$
3. $\mu^t a = 0$
4. By 1, $\mu = Aa - \lambda e$
5. By 3, if $a_i \ne 0$, then $(Aa - \lambda e)_i = 0$
6. If $a_i = 0$, then $(Aa - \lambda e)_i \ge 0$ (since $\mu \ge 0$)

Then, by the above, if the nonzero indices of $a$ are known, only the linear system given by 2 and 5 needs to be solved. Finally, by Theorem 4 in Chapter 2, the indices for $u_k$ and $v_k$ do not change for large $k$. From this, it can be assumed that $u_k$ and $v_k$ lie on $G$ and $G'$ respectively. Once $u_k$ and $v_k$ are on $G$ and $G'$, the nonzero indices of $a$ in NPP-QP can be determined. After solving the linear system given by 2 and 5, condition 6 must be checked to determine whether the zero indices are satisfied.
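A sketch of this finishing step in Python: given a guessed set of nonzero indices, conditions 2 and 5 form the bordered linear system below, and condition 6 is checked afterwards on the remaining indices. The function and variable names are illustrative, not from the dissertation.

import numpy as np

def finish_npp_qp(A, support, tol=1e-10):
    """Solve NPP-QP restricted to the guessed nonzero index set `support`.

    Conditions 2 and 5 give the bordered system
        [ A_SS  -e ] [ a_S    ]   [ 0 ]
        [ e^T    0 ] [ lambda ] = [ 2 ],
    and condition 6 ((A a - lambda e)_i >= 0 for a_i = 0) is checked afterwards.
    """
    S = np.asarray(support)
    n = A.shape[0]
    k = len(S)
    M = np.zeros((k + 1, k + 1))
    M[:k, :k] = A[np.ix_(S, S)]
    M[:k, k] = -1.0                       # -lambda e term from condition 5
    M[k, :k] = 1.0                        # sum of a_S equals 2 (condition 2)
    rhs = np.zeros(k + 1)
    rhs[k] = 2.0
    sol = np.linalg.solve(M, rhs)
    a_S, lam = sol[:k], sol[k]
    a = np.zeros(n)
    a[S] = a_S
    ok_positive = np.all(a_S >= -tol)     # a >= 0 on the support
    rest = np.setdiff1d(np.arange(n), S)
    ok_zero = np.all((A @ a - lam)[rest] >= -tol)   # condition 6 on the zero indices
    return a, lam, ok_positive and ok_zero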

CHAPTER 5
EXPERIMENTS

5.1. Comparison with Linearly Separable Sets

This section compares the performance of Algorithm 2.1 and sequential minimal optimization (SMO) on linearly separable sets; MATLAB was used for the comparisons. First, the performance of Algorithm 2.1 and the MDM algorithm are compared in dimension 100. By increasing the number of points, the results of the experiments can be seen in Figure 5.1. Second, twenty non-intersecting pairs of convex hulls in dimensions 100, 200, 300, 500, 700, 1000, 1200, 2000, and 2500 are randomly generated, and the minimum distance between the two convex hulls is computed. In each dimension, the number of vectors is varied to compare the performance of the two algorithms. The following graphs show CPU times for Algorithm 2.1 and SMO with a linear kernel with coefficient 0, a linear kernel with coefficient 1, a polynomial kernel with degree 3, and a polynomial kernel with degree 5; in the graphs these are denoted Lin(0), Lin(1), P3, and P5, respectively. Also in the graphs, an expression such as 7000 × 100 indicates 7000 vectors in dimension 100. Figures 5.2 to 5.12 illustrate the results for the experiments in Section 5.1.

5.2. Comparison with Real World Problems

In this section, the performance of Algorithm 2.1 and sequential minimal optimization (SMO) is compared on real world problems. Four kernel functions are used to compare the performance of Algorithm 2.1 and SMO: a polynomial kernel with degree 3, a polynomial kernel with degree 5, a radial basis kernel with gamma 1, and a radial basis kernel with gamma 10, denoted d3, d5, g1 and g10 respectively. Figures 5.13 to 5.16 illustrate the results for the experiments in Section 5.2.

Figure 5.1. Algorithm 2.1 and the MDM Algorithm (CPU times vs. the number of points)

(Each of the following figures plots CPU seconds for SMO and Algorithm 2.1 against the kernel: Lin(0), Lin(1), P3, P5 for Figures 5.2 to 5.12, and P3, P5, g1, g10 for Figures 5.13 to 5.16.)

Figure 5.2. 100 Points in Dimension 100
Figure 5.3. 200 Points in Dimension 200
Figure 5.4. 500 Points in Dimension 500
Figure 5.5. 1000 Points in Dimension 1000
Figure 5.6. 1200 Points in Dimension 1200
Figure 5.7. 2000 Points in Dimension 2000
Figure 5.8. 2500 Points in Dimension 2500
Figure 5.9. 7000 Points in Dimension 100
Figure 5.10. 7000 Points in Dimension 300
Figure 5.11. 7000 Points in Dimension 500
Figure 5.12. 7000 Points in Dimension 700
Figure 5.13. Heart (270 × 13)
Figure 5.14. Breast Cancer (263 × 9)
Figure 5.15. Titanic (24 × 3)
Figure 5.16. Thyroid (215 × 5)

CHAPTER 6
CONCLUSION AND FUTURE WORKS

6.1. Conclusion

As shown in Section 1 of Chapter 5, SMO performs better than Algorithm 2.1 on the 100 × 100, 200 × 200, and 500 × 500 problems. For 1000 × 1000 and higher dimensions, Algorithm 2.1 works better than SMO; Algorithm 2.1 performs better in higher dimensions. To show this, the number of points was fixed and the dimension increased from 100 to 300: in dimensions 100 and 300 with 7000 points, SMO is faster than Algorithm 2.1, but with data sets randomly generated in MATLAB, the CPU time for Algorithm 2.1 is shorter than that of SMO in dimensions 500 and 700 with 7000 points. Therefore, on randomly generated linearly separable sets of higher dimension, Algorithm 2.1 performs better than SMO. The data sets generated by MATLAB usually have a small number of support vectors, unlike many application problems; if a data set has a large number of support vectors, then Algorithm 2.1 may perform slower than SMO. Unfortunately, most of the real-world data sets used here are of low dimension. Since SMO performs better than Algorithm 2.1 on such linearly separable sets, SMO performed better on those data sets, but as shown in Section 2 of Chapter 5, Algorithm 2.1 is still comparable with SMO. By Theorem 4 in Chapter 2 and the stopping method of Section 2 in Chapter 4, Algorithm 2.1 can find the solution of the NPP in a finite number of iterations.

6.2. Future Work

In Algorithm 2.1, two coefficients of the representation of $u_k$ are updated at each iteration, but in fact four or more coefficients of $u_k$ can be updated. Instead of using

$u_{k+1} = u_k + t_k\alpha_{i'_k}(x_{i''_k} - x_{i'_k})$, the update $u_{k+1} = u_k + t_k\tilde{\alpha}_k(x_{min,av} - x_{max,av})$ can be used. In the new update, $0 \le t_k \le 1$, $\tilde{\alpha}_k = \min\{\alpha_{i'_k}, \alpha_{i'_{2,k}}\}$, $x_{min,av} = \frac{x_{i''_k} + x_{i''_{2,k}}}{2}$ and $x_{max,av} = \frac{x_{i'_k} + x_{i'_{2,k}}}{2}$, where $(x_{i''_k}, u_k - v_k) = \min_i(x_i, u_k - v_k)$, $(x_{i''_{2,k}}, u_k - v_k) = \min_{i \ne i''_k}(x_i, u_k - v_k)$, $(x_{i'_k}, u_k - v_k) = \max_{\alpha_i > 0}(x_i, u_k - v_k)$ and $(x_{i'_{2,k}}, u_k - v_k) = \max_{\alpha_i > 0,\, i \ne i'_k}(x_i, u_k - v_k)$. With this expression, four coefficients of $u_{k+1}$ are updated. The precise method of updating four coefficients at a time will be shown in Appendix B. When Algorithm 2.1 and the improved version of Algorithm 2.1 are compared, the number of iterations for the improved version is always smaller. Using the above method, the CPU time for Algorithm 2.1 is improved; moreover, the improved Algorithm 2.1 could perform better than SMO in lower dimensions.

Appendix A
Improved MDM

Define $u'_2$ by $(u'_2, u) = \max_{\alpha_j > 0,\, x_j \ne u'} (x_j, u)$ and $u''_2$ by $(u''_2, u) = \min_{x_j \ne u''} (x_j, u)$, and let $u'_{av} = \frac{u' + u'_2}{2}$ and $u''_{av} = \frac{u'' + u''_2}{2}$. This section shows the improved MDM algorithm, which uses the average points $u'_{av}$ and $u''_{av}$.

Improved MDM
Step 1 Choose $u_k \in U$ and find $u'_k$ and $u''_k$.
Step 2 Find $u'_{2k}$ and $u''_{2k}$. If $u'_{2k}$ does not exist, use the MDM step. If $(u'_{2k}, u_k) \le (u''_k, u_k)$, use the MDM step. If $(u''_{2k}, u_k) > (u'_k, u_k)$, use the MDM step. If $\delta(u_k) > \Delta_k(u_k) = (u'_{av,k} - u''_{av,k}, u_k)$, use the MDM step.
Step 3 Otherwise, set $u_k(t) = u_k + t\tilde{\alpha}_k(u''_{av,k} - u'_{av,k})$, where $\tilde{\alpha} = \min\{\alpha_{i'}, \alpha_{i'_2}\}$ and $t \in [0, 1]$. $\tilde{\alpha} = \min\{\alpha_{i'}, \alpha_{i'_2}\}$ is used since each coefficient must remain greater than or equal to 0 in order to stay in the convex hull.
Step 4 Set $u_{k+1} = u_k + t_k\tilde{\alpha}_k(u''_{av,k} - u'_{av,k})$, where $(u_k(t_k), u_k(t_k)) = \min_{t \in [0,1]} (u_k(t), u_k(t))$.
Step 5 Iterate until $\Delta_k(u_k) = (u'_{av,k} - u''_{av,k}, u_k) \le 0$.

In the representation $u_k = \sum_{i=1}^m \alpha_{ik} x_i$, the improved MDM updates four coefficients of the representation while the MDM updates two, which makes the iteration faster. However, updating too many coefficients could make the iteration slow near the solution $u^*$. By Step 4, $\|u_{k+1}\| \le \|u_k\|$ is obtained.

The proof of the convergence.

Assumption 1
(1) $u'_{2k}$ exists
(2) $(u'_{2k}, u_k) > (u''_k, u_k)$
(3) $(u''_{2k}, u_k) \le (u'_k, u_k)$
(4) $\delta(u_k) \le \Delta_k(u_k)$

Define $\delta'(u) = (u, u) - (u''_{av}, u)$; then clearly $\delta'(u) \ge 0$. By Corollary 2 to Lemma 2 in [1], if $\{u_k\}$, $u_k \in U$, $k = 0, 1, 2, \ldots$ is a sequence of points such that $\|u_{k+1}\| \le \|u_k\|$ and there is a subsequence $\{u_{k_j}\}$ such that $\delta'(u_{k_j}) \to 0$, then $u_k \to u^*$. Also, by Assumption 1, for any $u = \sum_{i=1}^m \alpha_i x_i \in U$, $\delta'(u) \le \Delta(u)$.

Lemma A.1 $\lim_{k \to \infty} \tilde{\alpha}_k \Delta_k(u_k) = 0$, where $\tilde{\alpha}_k = \min\{\alpha_{i'}, \alpha_{i'_2}\}$.

(Proof) Let $u_k(t) = u_k + 2t\tilde{\alpha}_k(u''_{av,k} - u'_{av,k})$. Then

$(u_k(t), u_k(t)) = (u_k, u_k) - 4t\tilde{\alpha}_k\Delta_k(u_k) + 4\tilde{\alpha}_k^2 t^2 \|u''_{av,k} - u'_{av,k}\|^2$.

Suppose $\lim_{k \to \infty} \tilde{\alpha}_k\Delta_k(u_k) \ne 0$. Then there is a subsequence $u_{k_j}$ such that $\tilde{\alpha}_{k_j}\Delta_{k_j}(u_{k_j}) \ge \varepsilon > 0$. Then

$(u_{k_j}(t), u_{k_j}(t)) \le (u_{k_j}, u_{k_j}) - 4t\varepsilon + 16t^2 d^2$, where $d = \max_{l,p}\|x_l - x_p\|$.

The expression $(u_{k_j}, u_{k_j}) - 4t\varepsilon + 16t^2 d^2$ is minimized at $t^* = \min\{\frac{\varepsilon}{d^2}, 1\}$.

If $t^* = 1$, then $\varepsilon > d^2 t^*$ since $\frac{\varepsilon}{d^2} > t^*$, so

$(u_{k_j}(t^*), u_{k_j}(t^*)) \le (u_{k_j}, u_{k_j}) - 4\varepsilon + 4\varepsilon = (u_{k_j}, u_{k_j})$,

which is a contradiction. If $t^* = \frac{\varepsilon}{d^2}$, then

$(u_{k_j}(t^*), u_{k_j}(t^*)) \le (u_{k_j}, u_{k_j}) - 4\frac{\varepsilon}{d^2}\varepsilon + 4(\frac{\varepsilon}{d^2})^2 d^2 = (u_{k_j}, u_{k_j})$,

which is again a contradiction. Thus $\lim_{k \to \infty} \tilde{\alpha}_k\Delta_k(u_k) = 0$.

Lemma A.2 $\lim_{k \to \infty} \Delta_k(u_k) = 0$

(Proof) Suppose $\lim_{k \to \infty} \Delta_k(u_k) = \Delta^{\#}(u^{\#}) > 0$. Then for large $k > K$, $\Delta_k(u_k) \ge \frac{\Delta^{\#}(u^{\#})}{2}$. By Lemma A.1, $\tilde{\alpha}_k \to 0$. The point at which $(u_k(t), u_k(t))$ is minimized is

$\hat{t}_k = \dfrac{\Delta_k(u_k)}{2\tilde{\alpha}_k\|u''_{av,k} - u'_{av,k}\|^2}$.

Hence, for large $k > K' \ge K_1$, $t_k = 1$, and thus $u_{k+1} = u_k + 2\tilde{\alpha}_k(u''_{av,k} - u'_{av,k})$. Let $\{u_{k_j}\}$ be a subsequence such that $\Delta_{k_j}(u_{k_j}) \to \Delta^{\#}(u^{\#})$. In the sequence $\{u_{k_j}\}$, some terms can be omitted if necessary, so that a sequence satisfying the following three conditions can be found:

(1) $\tilde{\alpha}_{k_j} \to \alpha^{\#}$