Training Support Vector Machines with Particle Swarms


U. Paquet
Department of Computer Science, University of Pretoria, South Africa
Email: upaquet@cs.up.ac.za

A.P. Engelbrecht
Department of Computer Science, University of Pretoria, South Africa
Email: engel@driesie.cs.up.ac.za

Abstract: Training a Support Vector Machine requires solving a constrained quadratic programming problem. Linear Particle Swarm Optimization is intuitive and simple to implement, and is presented as an alternative to current numeric SVM training methods. Performance of the new algorithm is demonstrated on the MNIST character recognition dataset.

I. INTRODUCTION

Support Vector Machines (SVMs) are a young and important addition to the machine learning toolbox. Since being formally introduced by Boser et al. [1], SVMs have proved their worth: in the last decade there has been remarkable growth in both the theory and practice of these learning machines.

Training an SVM requires solving a linearly constrained quadratic optimization problem. This problem often involves a matrix with an extremely large number of entries, which makes off-the-shelf optimization packages unsuitable. Several methods have been used to decompose the problem, many of which require numeric packages for solving the smaller subproblems.

Particle Swarm Optimization (PSO) is an intuitive and easy-to-implement algorithm from the swarm intelligence community, and is introduced here as a new way of training an SVM. Using PSO replaces the need for numeric solvers. A Linear PSO (LPSO) is adapted and shown to be well suited to optimizing the SVM problem.

This paper gives an overview of the SVM algorithm and explains the main methodologies for training SVMs. PSO is then discussed as an alternative method for solving an SVM's quadratic programming problem. Experimental results on character recognition illustrate the convergence properties of the algorithms.

II. SUPPORT VECTOR MACHINES

Traditionally, an SVM is a learning machine for two-class classification problems. It learns from a set of l N-dimensional example vectors x_i and their associated classes y_i, i.e.

    {x_1, y_1}, ..., {x_l, y_l} ∈ R^N × {±1}    (1)

The algorithm aims to learn a separation between the two classes by creating a linear decision surface between them. This surface is, however, not created in input space, but rather in a very high-dimensional feature space. The resulting model is nonlinear, and is accomplished through the use of kernel functions. The kernel function k gives a measure of similarity between a pattern x and a pattern x_i from the training set. The decision boundary that needs to be constructed is of the form

    f(x) = Σ_{i=1}^{l} y_i α_i k(x, x_i) + b    (2)

where the class of x is determined from the sign of f(x). The α_i are Lagrange multipliers from a primal quadratic programming (QP) problem, and there is an α_i for each vector in the training set. The value b is a threshold. Support vectors define the decision surface, and correspond to the subset of nonzero α_i. These vectors can be seen as the most informative training vectors.

Training the SVM consists of finding the values of α. By defining a Hessian matrix Q such that (Q)_{ij} = y_i y_j k(x_i, x_j), training can be expressed as a dual QP problem of solving

    max_α  W(α) = α^T 1 − (1/2) α^T Q α    (3)

subject to one equality constraint

    α^T y = 0    (4)

and a set of box constraints

    α ≥ 0,  C1 − α ≥ 0    (5)

Training an SVM thus involves solving a linearly constrained quadratic optimization problem.

III. SVM TRAINING METHODS

The QP problem is equivalent to finding the maximum of a constrained bowl-shaped objective function. Due to the definition of the kernel function, the matrix Q always gives a convex QP problem, where every local solution is also a global solution [2].
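To make the dual problem concrete, the following minimal Python/NumPy sketch builds the Hessian Q from a kernel function and evaluates the dual objective (3) and the decision function (2). The function names and the use of NumPy are illustrative choices, not the paper's implementation (which the authors wrote in Java).

```python
import numpy as np

def build_hessian(X, y, kernel):
    """(Q)_ij = y_i y_j k(x_i, x_j), the Hessian of the dual problem (3)."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return (y[:, None] * y[None, :]) * K

def dual_objective(alpha, Q):
    """W(alpha) = alpha^T 1 - (1/2) alpha^T Q alpha, maximized subject to
    alpha^T y = 0 and 0 <= alpha <= C (constraints (4) and (5))."""
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def decision_value(x, X, y, alpha, b, kernel):
    """f(x) from equation (2); the predicted class is sign(f(x))."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
```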
Certain optimality conditions, the Karush-Kuhn-Tucker (KKT) conditions [2], determine whether the constrained maximum has been found. Solving the QP problem for real-world problems can prove to be very difficult: the matrix Q has a dimension equal to the number of training examples. A training set of 60,000 vectors gives rise to a matrix Q with 3.6 billion elements, which does not fit into the memory of a standard computer. For large learning tasks, off-the-shelf optimization packages and techniques for general quadratic programming quickly become intractable in their memory and time requirements.
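The memory claim is easy to check. Assuming each entry of Q is stored as a 64-bit float (a storage choice the paper does not specify), the full matrix for 60,000 examples needs tens of gigabytes:

```python
l = 60_000                    # training examples
entries = l * l               # 3.6 billion entries in Q
gib = entries * 8 / 2**30     # assuming 8 bytes per entry (64-bit floats)
print(f"{entries:.1e} entries, about {gib:.0f} GiB")   # ~3.6e9 entries, ~27 GiB
```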

A number of other approaches, which allow for fast convergence and small memory requirements even on large problems, have been developed:

Chunking

The chunking algorithm is based on the fact that the non-support vectors play no role in the SVM decision boundary: if they are removed from the training set of examples, the SVM solution will be exactly the same. Chunking was suggested by V. Vapnik in [14], and breaks the large QP problem down into a number of smaller problems. A QP routine is used to optimize the Lagrangian on an arbitrary subset of data. After this optimization, the set of nonzero α_i (the current support vectors) is retained, and all other data points (α_i = 0) are discarded. At every subsequent step, chunking solves the QP problem that consists of all nonzero α_i, plus some of the α_i that violate the KKT conditions. After optimizing the subproblem, data points with α_i = 0 are again discarded. This procedure is iterated until the KKT conditions are met and the margin is maximized. The size of the subproblem varies, but tends to grow with time. At the last step, chunking has identified and optimized all the nonzero α_i, which correspond to the set of all the support vectors, and thus the overall QP problem is solved. Although this technique of reducing the dimension of the Q matrix from the number of training examples to approximately the number of support vectors makes it suitable for large problems, even the reduced matrix may not fit into memory.

Decomposition

Decomposition methods are similar to chunking, and were introduced by E. Osuna in [8], [9]. The large QP problem is broken down into a series of smaller subproblems, and a numeric QP optimizer solves each of these problems. It was suggested that one vector be added and one removed from the subproblem at each iteration, and that the size of the subproblems be kept fixed. The motivation behind this method is the observation that, as long as at least one α_i violating the KKT conditions is added to the previous subproblem, each step improves the objective function and maintains all of the constraints. In this fashion the sequence of QP subproblems converges asymptotically. For faster practical convergence, researchers add and delete multiple examples. While the strategy used in chunking takes advantage of the fact that the expected number of support vectors is small (< 3,000), decomposition allows for training arbitrarily large data sets.

Another decomposition method was introduced by T. Joachims [3]. Joachims' method is based on the gradient of the objective function: the idea is to pick the α_i for the QP subproblem such that they form the steepest possible direction of ascent on the objective function, where the number of nonzero elements in the direction is equal to the size of the QP subproblem. As in Osuna's method, the size of the subproblem remains fixed. Solving each subproblem requires a numeric quadratic optimizer.

Sequential Minimal Optimization

The most extreme case of decomposition is Sequential Minimal Optimization (SMO), where the smallest possible optimization problem is solved at each step [11]. Because the α_i must obey the linear equality constraint, two α_i are chosen to be jointly optimized. No numerical QP optimization is necessary; after an analytic solution, the SVM is updated to reflect the new optimal values.
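For reference, the analytic two-variable step that SMO performs is the standard one due to Platt [11]; the sketch below follows that published update rule and is not part of the LPSO method proposed in this paper. Here K is a precomputed kernel matrix, and the bookkeeping around pair selection and threshold updates is omitted.

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C):
    """One analytic SMO step on the pair (alpha_i, alpha_j), after Platt [11]."""
    f = lambda k: np.sum(alpha * y * K[:, k]) + b       # current decision value f(x_k)
    E_i, E_j = f(i) - y[i], f(j) - y[j]                 # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0:
        return None                                     # degenerate pair; skipped in this sketch
    # Feasible segment for alpha_j implied by the equality constraint (4).
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if L >= H:
        return None
    a_j = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)     # keeps alpha^T y unchanged
    return a_i, a_j
```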
With the exception of SMO, a numeric QP library is needed for training an SVM. An intuitive alternative approach is to use PSO to optimize each decomposed subproblem. The PSO algorithm is easy to implement, and certain properties of the LPSO make it well suited to the type of problem posed by SVM training.

IV. PARTICLE SWARM OPTIMIZATION

PSO [4] is similar in spirit to birds migrating in a flock toward some destination, where the intelligence and efficiency lie in the cooperation of the entire flock. PSO differs from traditional optimization methods in that a population of potential solutions is used in the search, and direct fitness information, rather than function derivatives or other related knowledge, is used to guide the search. Particles collaborate as a population, or swarm, to reach a collective goal, for example maximizing an n-dimensional objective function f.

Each particle has memory of the best solution that it has found, called its personal best. A particle's traversal of the search space is influenced by its personal best and by the best solution found by a neighborhood of particles; information sharing thus takes place. Particles profit from the discoveries and previous experience of other particles during the exploration and search for higher objective function values.

A great number of schemes exist by which this information sharing can take place; usually one of two sociometric principles is implemented. The first, called gbest (global best) PSO, conceptually connects all the particles in the population to one another, so that the very best performance of the entire population (the global best) influences each particle. The second, called lbest (local best), creates a neighborhood for each individual comprising itself and some fixed number of its nearest neighbors. Since SVM training requires solving a convex problem, the gbest version is implemented in this paper.

Let i indicate a particle's index in the swarm. In a gbest PSO, each of the s particles p_i flies through the n-dimensional search space R^n with a velocity v_i, which is dynamically adjusted according to its own previous best solution z_i and the previous best solution ẑ of the entire swarm. In the original PSO [4], each particle's velocity adjustments are calculated separately for each component of its position vector.

By calculating velocity adjustments as linear combinations of position vectors, equality constraints on the objective function can easily be met.

Equality Constraints and the Linear PSO

The Linear PSO (LPSO) was introduced in [10] to constrain the movement of a swarm to a linear hyperplane in R^n. LPSO differs from the original PSO in that velocity updates are calculated as a linear combination of position and velocity vectors. The particles of an LPSO interact and move according to the following equations:

    v_i^{(t+1)} = w v_i^{(t)} + c_1 r_1^{(t)} [z_i^{(t)} − p_i^{(t)}] + c_2 r_2^{(t)} [ẑ^{(t)} − p_i^{(t)}]    (6)
    p_i^{(t+1)} = p_i^{(t)} + v_i^{(t+1)}    (7)

where r_1^{(t)}, r_2^{(t)} ~ UNIF(0, 1) are random numbers between zero and one. These numbers are scaled by acceleration coefficients c_1 and c_2, where 0 ≤ c_1, c_2 ≤ 2, and w is an inertia weight [12]. It is possible to clamp the velocity vectors by specifying upper and lower bounds on v, to avoid too rapid movement of particles in the search space.

When the objective function f needs to be maximized subject to constraints Ap = b, the swarm should be constrained to fly through the hyperplane P. With A an m × n matrix and b an m-dimensional vector, P = {p | Ap = b} defines the set of feasible solutions to the constrained problem, and each point in P is a feasible point. It was shown in [10] that if the initial swarm lies in P, LPSO will force the particles to fly only in feasible directions, and the swarm will optimize within the search space P.

Premature convergence is overcome by using a version of van den Bergh's Guaranteed Convergence Particle Swarm Optimizer [13]. In this algorithm, the velocity update for the global best particle is changed to force it to search for a better solution in an area around that particle's position. Let τ be the index of the global best particle, such that z_τ = ẑ. The velocity update equation for particle τ is changed to

    v_τ^{(t+1)} = −p_τ^{(t)} + ẑ^{(t)} + w v_τ^{(t)} + ρ^{(t)} υ^{(t)}    (8)

where ρ^{(t)} is some scaling factor and υ^{(t)} ~ UNIF(−1, 1)^n is a random n-dimensional vector with the property that Aυ^{(t)} = 0, i.e. υ^{(t)} lies in the null space of A.

The LPSO algorithm [10] is summarized below:

1) Set the iteration number t to zero. Initialize the swarm S of s particles such that the position p_i^{(0)} of each particle meets A p_i^{(0)} = b.
2) Evaluate the performance f(p_i^{(t)}) of each particle.
3) Compare the personal best of each particle to its current performance, and set z_i^{(t)} to the better performance, i.e.

       z_i^{(t)} = z_i^{(t−1)}  if f(p_i^{(t)}) ≤ f(z_i^{(t−1)})
       z_i^{(t)} = p_i^{(t)}    if f(p_i^{(t)}) > f(z_i^{(t−1)})    (9)

4) Set the global best ẑ^{(t)} to the position of the particle with the best performance within the swarm, i.e.

       ẑ^{(t)} ∈ {z_1^{(t)}, z_2^{(t)}, ..., z_s^{(t)}}  with  f(ẑ^{(t)}) = max{f(z_1^{(t)}), f(z_2^{(t)}), ..., f(z_s^{(t)})}    (10)

5) Change the velocity vector of each particle according to equation (6).
6) Move each particle to its new position, according to equation (7).
7) Let t := t + 1.
8) Go to step 2, and repeat until convergence.

The LPSO algorithm is sufficient to optimize the SVM objective function subject to the linear equality constraint (4). The box constraints (5) are easily handled by initializing all particles p_i to lie inside the hypercube defined by the constraints, and restricting their movement to this hypercube. When a particle moves outside the boundary of the hypercube, its velocity vector is scaled by some factor such that all components of its position lie either inside the hypercube or on its boundary. A minimal sketch of the resulting gbest LPSO procedure is given below.
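The following Python/NumPy sketch is one possible reading of the gbest LPSO with the box-constraint handling just described; it omits the guaranteed-convergence update (8) and velocity clamping, and all names are illustrative rather than the authors' Java implementation. Feasibility of the equality constraint is preserved because the initial velocities are zero and every subsequent velocity is a linear combination of differences of feasible positions (r_1, r_2 are scalars per particle).

```python
import numpy as np

def lpso_maximize(f, p0, C, iters=100, w=0.7, c1=1.4, c2=1.4, rng=None):
    """Maximize f over the box [0, C]^q with a gbest Linear PSO.

    p0 has shape (s, q); every row is assumed to already satisfy the
    linear equality constraint, so the whole swarm stays feasible.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = p0.copy()                       # particle positions
    v = np.zeros_like(p)                # zero initial velocities lie in the constraint hyperplane
    z = p.copy()                        # personal bests
    fz = np.array([f(x) for x in z])    # personal-best values
    g = z[np.argmax(fz)].copy()         # global best

    for _ in range(iters):
        r1 = rng.random((p.shape[0], 1))
        r2 = rng.random((p.shape[0], 1))
        v = w * v + c1 * r1 * (z - p) + c2 * r2 * (g - p)   # equation (6)

        # Box constraints: scale each particle's step so it stays in [0, C]^q.
        for i in range(p.shape[0]):
            scale = 1.0
            for d in range(p.shape[1]):
                if v[i, d] > 0 and p[i, d] + v[i, d] > C:
                    scale = min(scale, (C - p[i, d]) / v[i, d])
                elif v[i, d] < 0 and p[i, d] + v[i, d] < 0.0:
                    scale = min(scale, (0.0 - p[i, d]) / v[i, d])
            v[i] *= scale
            p[i] += v[i]                                     # equation (7)

        fp = np.array([f(x) for x in p])
        better = fp > fz
        z[better], fz[better] = p[better], fp[better]
        g = z[np.argmax(fz)].copy()
    return g
```

For the SVM subproblem, f would be W(α_B) from equation (12) below and p0 the feasible swarm produced by the initialization described in Section V.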
The practical side of using LPSO, as well as the training algorithm, is discussed in the following section.

V. TRAINING THE SVM

Using LPSO to solve the SVM QP problem requires criteria for optimality, a way to decompose the QP, and a way to extend LPSO to optimize the SVM subproblem.

Since Q is a positive semi-definite matrix (the kernel function used is positive semi-definite), and the constraints are linear, the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for optimality [2]. A solution α of the QP problem, as stated in (3)-(5), is optimal if the following relations hold for each α_i:

    α_i = 0      ⇒  y_i f(x_i) ≥ 1
    0 < α_i < C  ⇒  y_i f(x_i) = 1
    α_i = C      ⇒  y_i f(x_i) ≤ 1    (11)

where i is the index of an example vector from the training set.

Decomposing the QP problem involves choosing a subset, or working set, of variables for optimization. The working set, called set B, is created by picking q sub-optimal variables from all l values of α_i. The working set of variables is optimized while keeping the remaining variables (called set N) constant. The general decomposition algorithm works as follows:

1) While the KKT conditions for optimality are violated:
   a) Select q variables for the working set B. The remaining l − q variables (set N) are fixed at their current values.
   b) Use LPSO to optimize W(α) on B.
   c) Return the optimized α_i from B to the original set of variables.
2) Terminate and return α.

A concern in the above algorithm is selecting the working set well. The decomposition method presented here is due to [3], and works on the method of feasible directions. The idea is to find the steepest feasible direction d of ascent on the objective function W as defined in equation (3), under the requirement that only q components be nonzero. The α_i corresponding to these q components will be included in the working set. A sketch of these two steps, the KKT check and the working set selection, follows below.
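A compact sketch of the two ingredients of the loop above: a KKT check against conditions (11) within a tolerance, and a working set chosen by sorting y_i ∇W(α)_i and taking q/2 indices from each end of the list (the sorting rule is spelled out just below). The feasibility filters of the full selection problem (d_i ≥ 0 when α_i = 0, d_i ≤ 0 when α_i = C) and tie-breaking details are omitted, so this is an illustration rather than the exact procedure of [3].

```python
import numpy as np

def kkt_violations(alpha, y, fx, C, tol=0.02):
    """Indices violating conditions (11) within tolerance tol.

    fx[i] is the current decision value f(x_i); the default tol matches the
    error threshold quoted for the experiments in Section VI.
    """
    yf = y * fx
    lower = (alpha <= 0) & (yf < 1 - tol)            # alpha_i = 0 requires y_i f(x_i) >= 1
    inner = (alpha > 0) & (alpha < C) & (np.abs(yf - 1) > tol)
    upper = (alpha >= C) & (yf > 1 + tol)            # alpha_i = C requires y_i f(x_i) <= 1
    return np.where(lower | inner | upper)[0]

def select_working_set(y, grad, q):
    """Pick q indices along the steepest feasible ascent direction (after [3]).

    Sort by y_i * grad_i and take q/2 indices from each end of the list.
    """
    order = np.argsort(y * grad)
    return np.concatenate([order[: q // 2], order[-(q // 2):]])
```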

Finding an approximation to d is equivalent to solving:

    maximize    ∇W(α)^T d
    subject to  y^T d = 0
                d_i ≥ 0  if α_i = 0
                d_i ≤ 0  if α_i = C
                d_i ∈ {−1, 0, 1}
                |{d_i : d_i ≠ 0}| = q

For y^T d to equal zero, the number of elements with sign matches between d and y must equal the number of elements with sign mismatches between d and y. Furthermore, d should be chosen to maximize the direction of ascent ∇W(α)^T d. This is achieved by first sorting the training vectors in increasing order according to y_i ∇W(α)_i. Assuming q to be even, a forward pass selects q/2 examples from the front of the sorted list, and a backward pass selects q/2 examples from the back. A full explanation of this method is given by P. Laskov in [5].

It is necessary to rewrite the objective function (3) as a function that depends only on the working set. Let α be split into two sets α_B and α_N. If α, y and Q are appropriately rearranged, we have

    α = [α_B; α_N],   y = [y_B; y_N],   Q = [Q_BB  Q_BN; Q_NB  Q_NN]

Since only α_B is going to be optimized, W is rewritten in terms of α_B. If terms that do not contain α_B are dropped, the optimization problem remains essentially the same. Since Q is symmetric, with Q_BN = Q_NB^T, the problem becomes:

    max_{α_B}  W(α_B) = α_B^T 1 − (1/2) α_B^T Q_BB α_B − α_B^T Q_BN α_N    (12)

subject to

    α_B^T y_B + α_N^T y_N = 0,   α_B ≥ 0,   C1 − α_B ≥ 0    (13)

Implementing Particle Swarm Optimization

When the decomposition algorithm starts, a feasible solution is needed that satisfies the linear constraint α^T y = 0, with the constraints 0 ≤ α ≤ C also met. The initial solution is constructed in the following way. Let c be some real number between 0 and C, and γ some positive integer less than both the number of positive examples (y_i = +1) and the number of negative examples (y_i = −1) in the training set. Randomly pick a total of γ positive examples and γ negative examples, and initialize their corresponding α_i to c. By setting all other α_i to zero, the initial solution is feasible. The value 2γ gives the total number of initial support vectors, and since these initial support vectors are a randomly chosen guess, it is suggested that the value of γ be kept small.

In optimizing the q-dimensional subproblem, LPSO requires that all particles be initialized such that α_B^T y_B + α_N^T y_N = 0 is met. This is done as follows:

1) Set each particle in the swarm to the q-dimensional vector α_B.
2) Add to each particle a random q-dimensional vector δ satisfying y_B^T δ = 0, under the condition that the particle still lies in the hypercube [0, C]^q.

Initializing the swarm in this way ensures that the initial swarm lies in the set of feasible solutions P = {p | y_B^T p = −α_N^T y_N}, allowing the flight of the swarm to be defined by feasible directions. For faster convergence, the vector υ^{(t)} used to adjust the global best particle can be chosen as an approximation to the partial derivative ∇W(α_B), subject to y_B^T υ^{(t)} = 0. A sketch of this initialization is given below.
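A sketch of this initialization, under the assumption that δ is drawn uniformly and made feasible by projecting out its component along y_B, then shrunk if necessary to stay inside the hypercube; the paper does not prescribe how δ is generated, so this is one straightforward choice. The first helper builds the initial feasible α for the full problem as described above.

```python
import numpy as np

def initial_alpha(y, C, gamma, c, rng):
    """Feasible start: gamma positive and gamma negative examples get alpha_i = c,
    so alpha^T y = gamma*c - gamma*c = 0 and 0 <= alpha <= C."""
    alpha = np.zeros(len(y))
    alpha[rng.choice(np.where(y == +1)[0], gamma, replace=False)] = c
    alpha[rng.choice(np.where(y == -1)[0], gamma, replace=False)] = c
    return alpha

def init_subproblem_swarm(alpha_B, y_B, C, s, step, rng):
    """s particles around alpha_B, each shifted by a delta with y_B^T delta = 0."""
    particles = []
    for _ in range(s):
        delta = rng.uniform(-step, step, size=alpha_B.shape)
        delta -= y_B * (delta @ y_B) / (y_B @ y_B)      # project so that y_B^T delta = 0
        scale = 1.0                                      # shrink delta to stay in [0, C]^q
        for d in range(len(delta)):
            if delta[d] > 0 and alpha_B[d] + delta[d] > C:
                scale = min(scale, (C - alpha_B[d]) / delta[d])
            elif delta[d] < 0 and alpha_B[d] + delta[d] < 0.0:
                scale = min(scale, (0.0 - alpha_B[d]) / delta[d])
        particles.append(alpha_B + scale * delta)
    return np.array(particles)
```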
VI. EXPERIMENTAL RESULTS

The SVM training algorithm presented in this paper was tested on the MNIST dataset [7]. The MNIST dataset is a database of optical characters, and consists of a training set of 60,000 handwritten digits. Each digit is a 28 by 28 pixel gray-level image, equivalent to a 784-dimensional input vector. Each pixel corresponds to an integer value in the range of 0 (white) to 255 (black). For training an SVM on the MNIST dataset, the character '8' was used to represent the positive examples, while the remaining digits defined the negative examples.

Training was done with a polynomial kernel of degree five:

    k(x_i, x_j) = (x_i · x_j + 1)^5    (14)

Due to the size of the dot product between two images, raised to the fifth power, the pixel values were scaled to the range [0, 0.1]. This gives Lagrange multipliers α_i that are easier for the LPSO to handle. (The kernel function of two unscaled black images would be (784 · 255^2 + 1)^5, while the kernel function of the scaled versions gives a more practical (784 · 0.01 + 1)^5 ≈ 835.)

For an optimal solution to be found in the following PSO experiments, the KKT conditions needed to be satisfied within an error threshold of 0.02 from the right-hand side of equations (11). Optimization of the working set terminated when the KKT conditions on the working set were met within an error of 0.001, or when the swarm had optimized for a hundred iterations.
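A small sketch of the kernel (14) and the pixel rescaling described above; the 0.1/255 factor is our reading of "scaled to the range [0, 0.1]" and is not stated explicitly in the paper.

```python
import numpy as np

def poly5_kernel(x_i, x_j):
    """Degree-five polynomial kernel from equation (14)."""
    return (np.dot(x_i, x_j) + 1.0) ** 5

def scale_pixels(image):
    """Map raw MNIST intensities 0..255 into [0, 0.1] before any kernel evaluation,
    keeping the kernel values (and hence the alpha_i) at a manageable magnitude."""
    return image.astype(np.float64) * (0.1 / 255.0)
```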

The following parameters defined the experimental PSO. By letting γ = 10, a total of 20 initial support vectors were chosen to start the algorithm. The swarm size s used in each experiment was 10, while the inertia weight w was set to 0.7. The acceleration coefficients c_1 and c_2 were both set to 1.4 [13]. Since the objective function is constrained by a set of box constraints, the velocity vectors were not clamped. For each experiment the upper bound C was kept at 100.0. The PSO training algorithm was written in Java, and does not make use of caching and shrinking methods to optimize its speed. The sparsity of the input data is used to speed up the evaluation of kernel functions. All experiments were performed on a 1.00 GHz AMD Duron processor.

Experimental results show successful and accurate training on the MNIST database. The influence of different working set sizes on the LPSO training algorithm, its scalability, and its relation to other SVM training algorithms were examined.

Influence of working set sizes

Experiments with different working set sizes were done on the first 20,000 elements of the MNIST database. Results are shown in Table I, and indicate that a working set of size q = 4 gives the fastest convergence time and the fewest support vectors. (A working set of size 2 can be solved analytically, as is done in SMO.) The results in Table I are not necessarily an indication of the speed of the PSO on the working set, as selection of the working set also burdens the speed of the algorithm: the q/2 greatest and least values of y_i ∇W(α)_i need to be selected from a list of thousands.

TABLE I
INFLUENCE OF DIFFERENT WORKING SET SIZES ON THE FIRST 20,000 ELEMENTS OF THE MNIST DATASET

    Working set size | Working set selections | Time     | SVs
    4                | 8,782                  | 02:17:43 | 1,631
    6                | 8,213                  | 03:11:40 | 1,637
    8                | 7,502                  | 03:51:24 | 1,639
    10               | 10,023                 | 06:27:06 | 1,648
    12               | 9,667                  | 07:26:23 | 1,652

Scalability of the PSO approach

Scalability of the PSO algorithm was tested by training on the first 10,000, 20,000, ..., 60,000 examples from the MNIST dataset, as shown in Table II. In each case a working set of size 4 was used. The experimental results indicate that the PSO training algorithm shows quadratic scalability, scaling as l^2.1.

TABLE II
SCALABILITY: TRAINING ON THE MNIST DATASET

    MNIST    | PSO working    | PSO      | PSO   | SMO      | SMO   | SVMlight | SVMlight
    elements | set selections | time     | SVs   | time     | SVs   | time     | SVs
    10,000   | 3,898          | 00:29:49 | 1,022 | 00:01:29 | 1,032 | 00:02:02 | 1,034
    20,000   | 8,782          | 02:17:43 | 1,631 | 00:06:14 | 1,647 | 00:10:43 | 1,641
    30,000   | 12,428         | 04:50:11 | 1,988 | 00:13:22 | 2,012 | 00:23:04 | 2,001
    40,000   | 15,725         | 08:14:26 | 2,353 | 00:22:46 | 2,355 | 00:41:09 | 2,367
    50,000   | 22,727         | 15:05:09 | 2,728 | 01:46:38 | 2,740 | 01:31:48 | 2,726
    60,000   | 25,914         | 20:54:15 | 3,025 | 04:38:11 | 3,043 | 08:01:05 | 3,026

Comparison to other algorithms

In Table II, the PSO approach is compared to SMO and a decomposition method, SVMlight [3]. WinSVM, developed by C. Longbin [6] from the SVMlight source code, was used as the implementation of SMO. Unlike these methods, the current PSO algorithm does not make use of caching and shrinking to optimize its speed. Results similar to those in Table I indicate that SVMlight gives the fastest rate of convergence with a working set size of q = 8, which is the size used in Table II's comparison. Experimental results show SMO scaling as l^2.8, and SVMlight scaling as l^3.0. Both of these algorithms are substantially faster than training an SVM with PSO on the MNIST dataset, but the PSO approach shows better scaling behavior (≈ l^2.1). Because the PSO training algorithm starts with a very small set of possible support vectors, with all other α_i set to zero, the PSO method consistently finds a few support vectors fewer than the other approaches. The main drawback of the current PSO approach is its slow running time, but many optimizations from this initial study can be implemented on both decomposition and PSO methods.
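The reported exponent can be reproduced from the PSO column of Table II with a least-squares fit in log-log space; the sketch below converts the training times to minutes and is only meant to show where the ≈ l^2.1 figure comes from.

```python
import numpy as np

# Training-set sizes and PSO training times from Table II, converted to minutes.
l = np.array([10_000, 20_000, 30_000, 40_000, 50_000, 60_000])
minutes = np.array([29 + 49/60, 137 + 43/60, 290 + 11/60,
                    494 + 26/60, 905 + 9/60, 1254 + 15/60])

# The slope of log(time) against log(l) estimates the empirical scaling exponent.
slope, _ = np.polyfit(np.log(l), np.log(minutes), 1)
print(f"time ~ l^{slope:.1f}")        # roughly l^2.1, in line with the reported scaling
```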
VII. CONCLUSION

It was shown that a PSO can be used to train an SVM. Certain properties of LPSO make it particularly useful for solving the SVM's constrained QP problem. The PSO algorithm is simple to implement, and does not require any background in numerical methods. Accurate and scalable training results were shown on the MNIST dataset, with the PSO algorithm finding fewer support vectors and showing better scaling than the other approaches. Although the algorithm is simple, its speed poses a practical bottleneck, which can be improved upon in work following this initial study.

ACKNOWLEDGMENT

The financial assistance of the National Research Foundation towards this research is hereby acknowledged. Opinions expressed in this paper and conclusions arrived at are those of the authors and are not necessarily to be attributed to the National Research Foundation.

REFERENCES

[1] B.E. Boser, I.M. Guyon, and V.N. Vapnik, "A training algorithm for optimal margin classifiers," in D. Haussler, editor, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, 1992. ACM Press.
[2] R. Fletcher, Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
[3] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, pages 169-184. MIT Press, Cambridge, MA, 1999.
[4] J. Kennedy and R.C. Eberhart, "Particle swarm optimization," in Proceedings of the IEEE International Conference on Neural Networks, IV, pages 1942-1948, 1995.
[5] P. Laskov, "Feasible direction decomposition algorithms for training support vector machines," Machine Learning, Volume 46, N. Cristianini, C. Campbell, and C. Burges, editors, pages 315-349, 2002.
[6] C. Longbin, http://liama.ia.ac.cn/personalpage/lbchen/, Institute of Automation, Chinese Academy of Sciences (CASIA).
[7] MNIST Optical Character Database at AT&T Research, http://yann.lecun.com/exdb/mnist/.
[8] E. Osuna, R. Freund, and F. Girosi, "Support vector machines: Training and applications," A.I. Memo AIM-1602, MIT A.I. Lab, 1996.
[9] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines," in Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, pages 276-285. IEEE, New York, 1997.
[10] U. Paquet and A.P. Engelbrecht, "Particle swarms for equality-constrained optimization," submitted to IEEE Transactions on Evolutionary Computation.
[11] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, pages 185-208. MIT Press, Cambridge, MA, 1999.
[12] Y.H. Shi and R.C. Eberhart, "A modified particle swarm optimizer," in IEEE International Conference on Evolutionary Computation, Anchorage, Alaska, 1998.
[13] F. van den Bergh, "An analysis of particle swarm optimizers," PhD Thesis, Department of Computer Science, University of Pretoria, 2002.
[14] V. Vapnik, Estimation of Dependences Based on Empirical Data [in Russian], Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982.)