
FLEXIBLE STATISTICAL MODELING

A dissertation submitted to the Department of Statistics and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Ji Zhu
August 2003

© Copyright by Ji Zhu 2003
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Trevor Hastie (Principal Advisor)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Bradley Efron

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert Tibshirani

Approved for the University Committee on Graduate Studies:

Abstract

The support vector machine is known for its good performance in two-class classification. The topic of this thesis is based on two variations of the standard 2-norm support vector machine. In the first part of the thesis, we replace the hinge loss of the support vector machine with the negative binomial log-likelihood and consider the kernel logistic regression model. We show that kernel logistic regression performs as well as the support vector machine in two-class classification. Furthermore, kernel logistic regression provides an estimate of the underlying probability. Based on the kernel logistic regression model, we propose a new approach for classification, called the import vector machine. Similar to the support points of the support vector machine, the import vector machine model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the support vector machine. This gives the import vector machine a computational advantage over the support vector machine, especially when the size of the training data set is large. In the second part, we replace the L_2-norm penalty term of the support vector machine with the L_1-norm penalty, and consider the 1-norm support vector machine. We argue that the 1-norm support vector machine may have some advantage over the standard 2-norm support vector machine, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm support vector machine, hence facilitating adaptive selection of the tuning parameter for the 1-norm support vector machine.

Acknowledgments

First and foremost, I wish to express my great appreciation to my advisor Trevor Hastie. He suggested the topic for this thesis and provided invaluable advice and constant encouragement throughout the course of my research. Without his support and guidance, this work would not have been completed. Trevor has been more than an academic advisor to me. When my son was one month old, my wife became infected with mastitis, which resulted in a very difficult time for our family. I still vividly remember how Trevor immediately called us after he heard the news and offered his help. I feel very fortunate to have Trevor as a mentor and a friend. I shall always appreciate his guidance which led me to the wonderful world of statistics.

I am very lucky to have met Saharon Rosset here at Stanford; I now have a talented collaborator. He is a constant inspiration to me and I value our relationship greatly. I am also very grateful to Professor Rob Tibshirani, whose stimulating suggestions, sharp comments and warm encouragement accompanied me throughout the dissertation process.

I owe my gratitude to Professor Susan Holmes and Professor Persi Diaconis; I would probably not have chosen to study statistics or come to Stanford had I not met Susan and Persi when I was at Cornell. I want to thank Susan again for writing me two important recommendation letters at two different stages, one when I applied to the graduate school and the other when I applied for a job.

I wish to thank Professor Brad Efron, Professor Jerry Friedman and Professor Art Owen for acting as members of my oral committee and providing useful suggestions on my research.

I also wish to thank my friends and classmates at Stanford, too many to name, and the

Monday fish-bowl group for helping me out in many ways and making my life here colorful and enjoyable. My special thanks go to Xizhao, whose companionship and unselfish support ever since we began our journey together have brightened my life.

Finally, I would like to thank my parents for their love and confidence in me. My parents' support is a big part of everything that I accomplish.

Contents

Abstract
Acknowledgments

1 Introduction
  1.1 The Classification Problem
    1.1.1 Handwritten Digit Recognition
    1.1.2 DNA Microarray Classification
  1.2 Support Vector Machines
    1.2.1 Margin Maximizer
    1.2.2 Regularized Function Fitting
    1.2.3 Multi-class Support Vector Machine
  1.3 Outline of the Thesis

2 Import Vector Machines
  2.1 Kernel Logistic Regression
    2.1.1 KLR as Margin Maximizer
    2.1.2 Newton-Raphson Method
    2.1.3 Sequential Minimal Optimization Method
  2.2 Import Vector Machines
    2.2.1 Algorithm
    2.2.2 Selection of λ
    2.2.3 Simulation Results
    2.2.4 Real Data Results
  2.3 Multi-class Case
    2.3.1 Multi-class KLR and Multi-class SVM
    2.3.2 Multi-class IVM
  2.4 Summary

3 1-norm Support Vector Machines
  3.1 Introduction
  3.2 Regularized support vector machines
  3.3 Algorithm
    3.3.1 Piece-wise linearity
    3.3.2 Initial solution (i.e. s = 0)
    3.3.3 Main algorithm
    3.3.4 Remarks
    3.3.5 Computational cost
  3.4 Numerical results
    3.4.1 Simulation results
    3.4.2 Real data results
  3.5 Summary

4 Microarray Classification
  4.1 Introduction
  4.2 Penalized Logistic Regression
    4.2.1 Penalized logistic regression
    4.2.2 Formulation
    4.2.3 Computational Issues
  4.3 Feature Selection
    4.3.1 Univariate Ranking
    4.3.2 Recursive Feature Elimination
  4.4 Results
    4.4.1 Choosing λ
    4.4.2 Leukemia Data
    4.4.3 SRBCT Data
    4.4.4 Ramaswamy Data
  4.5 Discussion
  4.6 Summary

5 Summary of Thesis

A Theorems and Proofs

Bibliography

List of Tables

2.1 Summary of the ten benchmark datasets
2.2 Classification performance of SVMs vs IVMs
2.3 Number of kernel basis used by SVMs vs IVMs
3.1 Simulation results of 1-norm and 2-norm support vector machine
4.1 Results on Microarray Classification
4.2 Comparison of leukemia classification methods
4.3 Comparison of SRBCT classification methods
4.4 Comparison of Ramaswamy data classification methods
4.5 Comparison of leukemia classification methods
4.6 Comparison of SRBCT classification methods

List of Figures

1.1 Handwritten digits
1.2 SRBCT data
1.3 Linear SVMs
2.1 Some loss functions
2.2 SVM vs KLR
2.3 Choose λ
2.4 Choose import points
2.5 SVM vs IVM
2.6 Effect of training data size
2.7 Multi-class IVM
3.1 Piece-wise linearity
3.2 Solution path of 1-norm SVMs
4.1 Choose λ
4.2 Leukemia data
4.3 SRBCT data
4.4 Ramaswamy data
4.5 Ramaswamy data

Chapter 1

Introduction

1.1 The Classification Problem

In standard classification problems, we are given a set of training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where the input (predictor variable) x_i ∈ R^p and the output (response variable) y_i is qualitative and assumes values in a finite set, e.g. C = {1, 2, ..., K}. The aim is to find a classification rule from the training data, so that when given a new input x, we can assign a class c(x) from C to it. The question, then, is what is the best possible classification rule. To answer this question, we need to define what we mean by best. A common definition of best is to achieve the lowest misclassification error rate. Usually it is assumed that the training data are independently and identically distributed samples from an unknown probability distribution P(X, Y). Then the misclassification error rate is:

    E_{X,Y} 1_{c(X) ≠ Y} = E_X P(c(X) ≠ Y | X)                                (1.1)
                         = 1 − E_X P(c(X) = Y | X)                             (1.2)
                         = 1 − E_X [ Σ_{k=1}^{K} 1_{c(X)=k} P(Y = k | X) ].    (1.3)

It is clear that c(x) = argmax_k P(Y = k | X = x) will minimize this quantity, with the misclassification error rate equal to 1 − E_X max_k P(Y = k | X). This classifier is known as the Bayes classifier, and the error rate it achieves is the Bayes error rate. Below we see two real classification examples that are discussed in greater detail in the thesis.

1.1.1 Handwritten Digit Recognition

Consider the digits in Figure (1.1) (reprinted from Hastie, Tibshirani & Friedman (2001)). These are scanned images of handwritten zip codes from the US Postal Service Zip Code Data Set. The images are 16 × 16 greyscale maps. We thus have a p = 256 dimensional predictor space, x ∈ R^256, and a response space containing 10 possible outcomes, C = {0, 1, ..., 9}. The task is to use a set of training data (pairs of x_i and y_i) to form a rule to predict y (the digit) given x (the pixels).

1.1.2 DNA Microarray Classification

DNA microarrays measure the expression of a gene in a cell by measuring the amount of mRNA present for that gene. Microarrays are considered a breakthrough technology in biology, for they facilitate the quantitative study of thousands of genes simultaneously from a single sample of cells. Figure (1.2) displays the heat map of a data set containing 2,308 genes (rows) and 63 samples (columns). The samples are a collection of small round blue cell tumors (SRBCT) from children, and each sample belongs to one of four tumor classes. The goal is to predict the diagnostic tumor category of a sample on the basis of its gene expression profile. Here we have a situation where the number of inputs (genes) (p = 2,308) is much larger than the number of samples (n = 63). Hence, besides predicting the correct tumor class for a given sample, another challenge in microarray diagnosis is to identify relevant genes that contribute most to the classification.

Figure 1.1: Examples of handwritten digits from U.S. postal envelopes.

1.2 Support Vector Machines

The support vector machine (SVM) has been a popular tool for classification problems in the machine learning field. Recently, it has also gained increasing attention from the statistics community. Below we briefly go over the support vector machine for two-class classification from two perspectives. See Vapnik (1995), Burges (1998), Evgeniou, Pontil & Poggio (1999) and Hastie, Tibshirani & Friedman (2001) for details.

1.2.1 Margin Maximizer

Recall the training data are (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), x_i ∈ R^p. In two-class classification, y_i ∈ {−1, 1}.

[Figure 1.2: heat map of the SRBCT gene expression data; the samples fall into the four tumor classes EWS, BL, NB and RMS.]

Let us first consider the case when the training data can be perfectly separated by a hyperplane in R^p. Define the hyperplane by

    {x : f(x) = β_0 + x^T β = 0},

where β is a unit vector, ||β||_2 = 1; then f(x) gives the signed distance from a point x to the hyperplane. Since the training data are linearly separable, we are able to find a hyperplane such that

    y_i f(x_i) > 0,  ∀i.    (1.4)

Indeed, there are infinitely many such hyperplanes. Among the hyperplanes satisfying (1.4), the support vector machine looks for the one that maximizes the margin. The margin is defined as the shortest distance from the training data to the hyperplane. Hence we can write the support vector machine problem as:

    max_{β, β_0, ||β||_2 = 1}  C    (1.5)
    subject to  y_i(β_0 + x_i^T β) ≥ C,  i = 1, ..., n.    (1.6)

When the training data are not linearly separable, we allow some training data to be on the wrong side of the edges of the margin and introduce slack variables ξ_i, ξ_i ≥ 0. The support vector machine problem then becomes

    max_{β, β_0, ||β||_2 = 1}  C    (1.7)
    subject to  y_i(β_0 + x_i^T β) ≥ C(1 − ξ_i),  i = 1, ..., n,    (1.8)
                ξ_i ≥ 0,  Σ_i ξ_i ≤ B,    (1.9)

where B is a prespecified positive number, which can be regarded as a tuning parameter.

[Figure 1.3 shows the separating hyperplane x^T β + β_0 = 0 with the margin of width C on either side; the right panel marks the slack variables ξ*_1, ..., ξ*_5 for the non-separable case.]

Figure 1.3: Linear support vector machine classifiers.

(1.7)–(1.9) can be turned into a quadratic programming problem, and the solution has the form:

    f(x) = β_0 + Σ_{i=1}^{n} α_i y_i ⟨x_i, x⟩,    (1.10)

where the α_i's are Lagrangian multipliers for the quadratic programming problem, and

    β = Σ_{i=1}^{n} α_i y_i x_i.

Notice each training data point has a corresponding α_i. Using the Karush-Kuhn-Tucker (KKT) conditions of the quadratic programming problem, one can show that a sizeable fraction of the n values of α_i are zero. This seems to be an attractive property, because only the data points near the classification boundary (including those on the wrong side of the boundary) have an influence in determining the position of the boundary, and hence have non-zero α_i's. The corresponding x_i's are called support points (support vectors). Figure (1.3) illustrates both the linearly separable and non-separable cases.
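To make (1.10) concrete, here is a minimal sketch (an illustration only, not code from the thesis; the variable names are hypothetical) that evaluates the linear SVM decision function from a set of support points, their labels and their Lagrange multipliers.

```python
import numpy as np

def svm_decision(x, support_x, support_y, alpha, beta0):
    """Evaluate f(x) = beta0 + sum_i alpha_i y_i <x_i, x> as in (1.10).

    Only support points (alpha_i > 0) contribute, so the sum can be
    restricted to them without changing the value of f."""
    inner = support_x @ x                      # inner products <x_i, x>
    return beta0 + np.sum(alpha * support_y * inner)

# toy usage: two support points in R^2 with made-up multipliers
support_x = np.array([[1.0, 0.5], [-0.5, -1.0]])
support_y = np.array([1.0, -1.0])
alpha = np.array([0.8, 0.8])
x_new = np.array([0.3, 0.2])
label = np.sign(svm_decision(x_new, support_x, support_y, alpha, beta0=0.1))
```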

As with other linear models, we can make the support vector machine more flexible by enlarging the feature space using basis expansions such as polynomials or splines. Generally, linear boundaries in the enlarged space achieve better training data separation and translate to nonlinear boundaries in the original input space. Suppose the dictionary of the basis functions of the enlarged feature space is

    D = {h_1(x), h_2(x), ..., h_q(x)},

where q is the dimension of the enlarged feature space. Note if q = p and h_j(x) is the jth component of x, the enlarged feature space is reduced to the original input space. The classification boundary in the enlarged feature space is given by

    {x : f(x) = β_0 + h(x)^T β = 0}.

Note that in the linear support vector machine, the solution (1.10) depends on the basis functions only through their inner product. Hence, when using the enlarged basis functions, the new solution will have the form:

    f(x) = β_0 + Σ_{i=1}^{n} α_i y_i ⟨h(x_i), h(x)⟩.    (1.11)

This implies that if the enlarged basis functions are basis functions of a reproducing kernel Hilbert space (RKHS) (Wahba (1990)), with

    K(x, x') = ⟨h(x), h(x')⟩,

then we do not need to know the enlarged basis functions h(x) explicitly; all we need to know is the kernel function K(x, x') that generates the reproducing kernel Hilbert space. This kernel trick allows the enlarged feature space to be even infinite dimensional, i.e., q = ∞, without causing any further computational burden, since the solution (1.11) has a finite dimensional form in terms of the kernel bases K(x_i, x), and the number of parameters is always n + 1 (n for the α_i's and 1 for β_0).

Three popular choices of K(·,·) in the support vector machine literature are:

    dth degree polynomial:  K(x, x') = (1 + ⟨x, x'⟩)^d,    (1.12)
    Radial basis:           K(x, x') = exp(−||x − x'||² / (2σ²)),    (1.13)
    Neural network:         K(x, x') = tanh(κ_1 ⟨x, x'⟩ + κ_2),    (1.14)

where d, σ, κ_1 and κ_2 are pre-specified parameters.

1.2.2 Regularized Function Fitting

Section (1.2.1) views the support vector machine from a geometric point of view, i.e., a hyperplane in an enlarged reproducing kernel Hilbert space that maximizes the margin of the training data. It turns out that the support vector machine is also equivalent to a regularized function fitting problem. With f(x) = β_0 + h(x)^T β, consider the optimization problem:

    min_{β_0, β}  Σ_{i=1}^{n} [1 − y_i f(x_i)]_+ + λ ||β||_2²,    (1.15)

where the subscript "+" indicates the positive part and λ is a tuning parameter. One can show that the solution to (1.15) is the same as the support vector machine (1.7)–(1.9). Furthermore, if h(x) are appropriate basis functions of a reproducing kernel Hilbert space, (1.15) can be written as:

    min_{f ∈ H_K}  Σ_{i=1}^{n} [1 − y_i f(x_i)]_+ + λ J(f),    (1.16)

where H_K is the reproducing kernel Hilbert space, and J(f) = ||f||²_{H_K} is the square of the L_2 norm of f(x) defined on H_K.
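As a small illustration of the kernels (1.12)–(1.14), the sketch below (not from the thesis; the parameter values d, σ, κ_1 and κ_2 are arbitrary) computes the three kernel functions for a pair of input vectors.

```python
import numpy as np

def poly_kernel(x, xp, d=3):
    """dth-degree polynomial kernel (1.12): (1 + <x, x'>)^d."""
    return (1.0 + x @ xp) ** d

def radial_kernel(x, xp, sigma=1.0):
    """Radial basis kernel (1.13): exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(x, xp, kappa1=1.0, kappa2=0.0):
    """Neural network kernel (1.14): tanh(kappa1 <x, x'> + kappa2)."""
    return np.tanh(kappa1 * (x @ xp) + kappa2)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
values = (poly_kernel(x, xp), radial_kernel(x, xp), neural_net_kernel(x, xp))
```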

Similar to Section (1.2.1), although H_K can be infinite dimensional, the solution of (1.16) has a finite dimensional form:

    f(x) = β_0 + Σ_{i=1}^{n} α_i y_i K(x_i, x),    (1.17)

where K(·,·) is the positive definite kernel function that generates the reproducing kernel Hilbert space H_K.

Notice both (1.15) and (1.16) have the form loss + penalty, which is a familiar paradigm to statisticians in function estimation. The loss function (1 − yf)_+ is called the hinge loss. Lin (2002) shows:

    argmin_f E[(1 − Y f(x))_+] = sign(p_1(x) − 1/2),  or  sign( log [ p_1(x) / (1 − p_1(x)) ] ),

where p_1(x) = P(Y = 1 | X = x) is the conditional probability of a point being in class 1 given X = x. Hence the support vector machine tries to implement the optimal Bayes classification rule without estimating the actual conditional probability p_1(x).

1.2.3 Multi-class Support Vector Machine

The support vector machine classifier so far described is for two-class classification. In going from two-class to multi-class classification, many researchers have proposed various procedures. In practice, the one-vs-rest scheme is often used: given K classes, the problem is divided into a series of K one-vs-rest problems, and each one-vs-rest problem is addressed by a different class-specific support vector machine classifier (e.g., "class 1" vs. "not class 1"); then a new sample takes the class of the classifier with the largest real valued output, c = argmax_{k=1,...,K} f_k, where f_k is the real valued output of the kth support vector machine classifier.

Instead of solving K problems, Weston & Watkins (1999) and Vapnik (1998) generalize

(1.7)–(1.9) by solving one single optimization problem:

    max_{β_k, β_{0k}}  C    (1.18)
    subject to  (β_{0 y_i} − β_{0k}) + x_i^T (β_{y_i} − β_k) ≥ C(1 − ξ_{ik}),    (1.19)
                i = 1, ..., n,  k = 1, ..., K,  k ≠ y_i,    (1.20)
                ξ_{ik} ≥ 0,  Σ_i Σ_{k ≠ y_i} ξ_{ik} ≤ B,    (1.21)
                Σ_{k=1}^{K} ||β_k||_2² = 1.    (1.22)

Similar to section 1.2.1, this setup also has a nice geometric interpretation. Recently, Lee, Lin & Wahba (2002) proposed an algorithm that implements the Bayes classification rule and estimates argmax_k P(Y = k | X = x) directly, but the geometric picture of their algorithm is still not clear.
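The one-vs-rest scheme described in this section reduces prediction to an argmax over K real-valued classifier outputs. A minimal sketch, assuming K already-trained decision functions (hypothetical stand-ins here, not trained SVMs), is:

```python
import numpy as np

def one_vs_rest_predict(decision_functions, x):
    """Assign x to the class whose one-vs-rest classifier gives the largest
    real-valued output, c = argmax_k f_k(x).

    decision_functions: list of K callables f_k(x) -> real value, each meant
    to separate class k from the rest (here only hypothetical examples)."""
    scores = np.array([f(x) for f in decision_functions])
    return int(np.argmax(scores)) + 1          # classes labeled 1, ..., K

# toy usage with three linear scoring functions standing in for SVM outputs
fs = [lambda x: x[0] - x[1], lambda x: -x[0], lambda x: x[1] - 0.5]
predicted_class = one_vs_rest_predict(fs, np.array([0.2, 0.9]))
```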

1.3 Outline of the Thesis

In Chapter 2 we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the import vector machine not only performs as well as the support vector machine in two-class classification, but also can naturally be generalized to the multi-class case. Furthermore, the import vector machine provides an estimate of the underlying probability. Similar to the support points of the support vector machine, the import vector machine model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the support vector machine. This gives the import vector machine a computational advantage over the support vector machine, especially when the size of the training data set is large. The import vector machine is based on kernel logistic regression, which replaces the hinge loss of the support vector machine with the negative binomial log-likelihood.

In Chapter 3, we replace the penalty term ||β||_2², the square of the L_2 norm of β, in (1.15) of the support vector machine, with the L_1 norm ||β||_1, and we fit the 1-norm support vector machine. The motivation for doing such a replacement is that in addition to shrinking the fitted coefficients β̂ towards zero, just as the L_2 penalty does, the L_1 penalty also tends to set some of the fitted coefficients exactly equal to zero. Hence the 1-norm support vector machine fits a model that does automatic feature selection. We show that the fitted coefficient path β̂ as a function of the tuning parameter λ is piece-wise linear, and we give an efficient algorithm that computes the whole coefficient path. This facilitates efficient adaptive selection of the tuning parameter.

In Chapter 4, we concentrate on DNA microarray classification using techniques developed in Chapter 2 and Chapter 3. Often a primary goal in microarray diagnosis is to identify the genes responsible for the classification, rather than class prediction. Besides the automatic gene selection done by the L_1 penalty as described in Chapter 3, we consider two gene selection methods used in the literature, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that the recursive feature elimination method tends to select fewer genes than other methods and also performs well in both cross-validation and test data.

In the Appendix we give proofs for all the theorems contained in the thesis.

Chapter 2

Import Vector Machines

In this chapter, we propose a new approach for classification, the import vector machine (IVM), that is built on kernel logistic regression (KLR). We show that the import vector machine not only performs as well as the support vector machine in two-class classification, but also can naturally be generalized to the multi-class case. Furthermore, the import vector machine provides an estimate of the underlying probability. Similar to the support points of the support vector machine, the import vector machine model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the support vector machine. This gives the import vector machine a potential computational advantage over the support vector machine, especially when the size of the training data set is large.

2.1 Kernel Logistic Regression

As described in Chapter 1, the standard support vector machine produces a non-linear classification boundary in the original input space by constructing a linear boundary in an enlarged version of the original input space. The dimension of the enlarged space can be very large, even infinite, in some cases. This seemingly prohibitive computation is achieved through a positive definite reproducing kernel K(·,·), which gives the inner product in the enlarged space.

In Chapter 1 we have also noted the relationship between the support vector machine and regularized function fitting in reproducing kernel Hilbert spaces (RKHS). An overview can be found in Evgeniou, Pontil & Poggio (1999), Wahba (1999) and Hastie, Tibshirani & Friedman (2001). Fitting a support vector machine is equivalent to:

    min_{f ∈ H_K}  Σ_{i=1}^{n} [1 − y_i f(x_i)]_+ + λ ||f||²_{H_K},    (2.1)

where H_K is the reproducing kernel Hilbert space generated by a kernel K(·,·). By the representer theorem (Kimeldorf & Wahba (1971)), the optimal f(x) has the form:

    f(x) = β_0 + Σ_{i=1}^{n} α_i K(x_i, x),    (2.2)

and only the support points will have non-zero α_i's.

Note that (2.1) has the form loss + penalty. The loss function (1 − yf)_+ is plotted in Figure 2.1, along with several traditional loss functions. As we can see, the negative log-likelihood (NLL) of the binomial distribution has a shape similar to that of the support vector machine: both increase linearly as yf gets very small (negative) and encourage y and f to have the same sign. If we replace (1 − yf)_+ in (2.1) with ln(1 + e^{−yf}), the negative log-likelihood of the binomial distribution, the problem becomes a kernel logistic regression problem:

    min_{f ∈ H_K}  Σ_{i=1}^{n} ln(1 + e^{−y_i f(x_i)}) + λ ||f||²_{H_K}.    (2.3)

Because of the similarity between the two loss functions, we expect that the fitted function performs similarly to the support vector machine for two-class classification.
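To see the similarity of shape between the hinge loss and the binomial negative log-likelihood that motivates (2.3), the following sketch (illustrative only, not from the thesis) evaluates both losses on a grid of margin values yf.

```python
import numpy as np

def hinge_loss(margin):
    """SVM hinge loss (1 - yf)_+ as a function of the margin yf."""
    return np.maximum(0.0, 1.0 - margin)

def binomial_nll(margin):
    """Negative binomial log-likelihood ln(1 + exp(-yf))."""
    return np.log1p(np.exp(-margin))

# both losses grow roughly linearly for very negative margins and both
# encourage y and f(x) to share the same sign
margins = np.linspace(-3, 3, 7)
hinge_vals, nll_vals = hinge_loss(margins), binomial_nll(margins)
```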

[Figure 2.1 plots the loss as a function of yf(x) for the binomial log-likelihood, squared error and support vector (hinge) losses.]

Figure 2.1: Some loss functions; y ∈ {−1, 1}.

There are two immediate advantages of making such a replacement: (a) Kernel logistic regression estimates the log-odds:

    f(x) = log [ P(Y = 1 | X = x) / P(Y = −1 | X = x) ]    (2.4)
         = β_0 + Σ_{i=1}^{n} α_i K(x_i, x).    (2.5)

Hence, besides giving a classification rule, kernel logistic regression also offers a natural estimate of the probability

    p_1(x) = e^{f(x)} / (1 + e^{f(x)}),

while the support vector machine only estimates (Lin (2002))

    sign[p_1(x) − 1/2],  or  sign( log [ p_1(x) / (1 − p_1(x)) ] ),

where p_1(x) = P(Y = 1 | X = x) is the conditional probability of a point being in class 1

given X = x; (b) kernel logistic regression can naturally be generalized to the multi-class case through kernel multi-logit regression, whereas this is not the case for the support vector machine. However, because kernel logistic regression compromises the hinge loss function of the support vector machine, it no longer has the support points property; in other words, all the α_i's in (2.5) are non-zero. Kernel logistic regression is a well-studied problem; see Wahba, Gu, Wang & Chappell (1995), Green & Yandell (1985), Hastie & Tibshirani (1990) and references therein; however, they are all under the smoothing spline analysis of variance scheme.

We use a simulation example to illustrate the similar performance of kernel logistic regression and the support vector machine. The data in each class are simulated from a mixture of Gaussian distributions (Hastie, Tibshirani & Friedman (2001)): first we generate 10 means μ_k from a bivariate Gaussian distribution N((1, 0)^T, I) and label this class +1. Similarly, 10 more are drawn from N((0, 1)^T, I) and labeled class −1. Then for each class, we generate 100 observations as follows: for each observation, we pick a μ_k at random with probability 1/10, and then generate a N(μ_k, I/5), thus leading to a mixture of Gaussian clusters for each class. We use a radial basis kernel (1.13). The regularization parameter λ is chosen to achieve good misclassification error. The results are shown in Figure 2.2. The radial basis kernel produces a boundary quite close to the Bayes optimal boundary for this simulation. We see that the fitted model of kernel logistic regression is quite similar in classification performance to that of the support vector machine. In addition to a classification boundary, since kernel logistic regression estimates the log-odds of class probabilities, it can also produce probability contours (Figure 2.2).
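The mixture-of-Gaussians simulation just described can be reproduced with a short script. The sketch below follows the recipe in the text (10 means per class, 100 observations per class, covariance I/5); it is an illustration, not the code used for the thesis figures.

```python
import numpy as np

def simulate_mixture(n_per_class=100, n_means=10, seed=0):
    """Generate the two-class mixture-of-Gaussians data described above."""
    rng = np.random.default_rng(seed)
    # 10 means per class: class +1 centred at (1, 0), class -1 at (0, 1)
    means_pos = rng.multivariate_normal([1.0, 0.0], np.eye(2), n_means)
    means_neg = rng.multivariate_normal([0.0, 1.0], np.eye(2), n_means)
    X, y = [], []
    for means, label in [(means_pos, +1), (means_neg, -1)]:
        for _ in range(n_per_class):
            mu = means[rng.integers(n_means)]   # pick a mean with prob 1/10
            X.append(rng.multivariate_normal(mu, np.eye(2) / 5.0))
            y.append(label)
    return np.array(X), np.array(y)

X, y = simulate_mixture()
```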

[Figure 2.2 (two panels): the SVM with 130 support points and KLR with a radial basis kernel; each panel reports the training, test and Bayes errors.]

Figure 2.2: The solid black lines are classification boundaries; the dashed purple lines are Bayes optimal boundaries. For the SVM, the dashed black lines are the edges of the margins and the black points are the points exactly on the edges of the margin. For KLR, the dashed black lines are the p_1(x) = 0.25 and 0.75 lines.

2.1.1 KLR as Margin Maximizer

Recall that in Chapter 1 we described the support vector machine from two perspectives: (a) geometrically as the margin maximizer, and (b) as a regularized function fitting problem. The motivation for fitting a kernel logistic regression model is the similarity in shape between the negative log-likelihood of the binomial distribution and the hinge loss of the support vector machine, and this motivation is from the regularized function fitting perspective. Since the support vector machine was initiated as a method to maximize the margin of the training data, a natural question is: what does kernel logistic regression do with the margin? It turns out that kernel logistic regression can also be regarded as a margin maximizer.

Similar to section 1.2.1, let D = {h_1(x), h_2(x), ..., h_q(x)} be the dictionary of the basis functions of the enlarged feature space, where q is the dimension of the enlarged feature space. The classification boundary, a hyperplane in this enlarged feature space, is given by:

    {x : f(x) = β_0 + h(x)^T β = 0}.

Suppose the enlarged feature space is so rich that the training data are separable; then the margin-maximizing support vector machine can be written as:

    max_{β_0, β, ||β||_2 = 1}  C    (2.6)
    subject to  y_i(β_0 + h(x_i)^T β) ≥ C,  i = 1, ..., n,    (2.7)

where C is the shortest distance from the training data to the separating hyperplane, and is defined as the margin. Now consider an equivalent setup of kernel logistic regression:

    min_{β_0, β}  Σ_{i=1}^{n} ln(1 + e^{−y_i f(x_i)})    (2.8)
    subject to  ||β||_2² ≤ s,    (2.9)
                f(x_i) = β_0 + h(x_i)^T β,  i = 1, ..., n.    (2.10)

Then we have the following theorem:

Theorem 2.1 Suppose the training data are separable, i.e. ∃ β_0, β, s.t. y_i(β_0 + h(x_i)^T β) > 0, ∀i. Let the solution of (2.8)–(2.10) be denoted by β̂(s); then

    β̂(s)/s → β*  as  s → ∞,

where β* is the solution of the margin-maximizing support vector machine (2.6)–(2.7), if β* is unique. If β* is not unique, then β̂(s)/s may have multiple convergence points, but they will all represent margin-maximizing separating hyperplanes.

The proof of the theorem is deferred to the Appendix. Theorem 2.1 implies that kernel logistic regression, similar to the support vector machine, is also a kind of margin maximizer.

2.1.2 Newton-Raphson Method

In this section, we describe how to solve kernel logistic regression using the Newton-Raphson method. Let

    H = Σ_{i=1}^{n} ln(1 + e^{−y_i f(x_i)}) + (λ/2) ||f||²_{H_K}.    (2.11)

Then, by the representer theorem (Kimeldorf & Wahba (1971)), the solution that minimizes H has the form:

    f(x) = β_0 + Σ_{i=1}^{n} α_i K(x_i, x).

Let

    p_i = 1 / (1 + e^{−y_i f(x_i)}),  i = 1, ..., n,    (2.12)
    α̃ = (β_0, α_1, ..., α_n)^T,    (2.13)
    p̃ = (p_1, ..., p_n)^T,    (2.14)
    ỹ = (y_1, ..., y_n)^T,    (2.15)
    K_1 = [ 1̃, (K(x_i, x_{i'}))_{i,i'=1}^{n} ],    (2.16)
    K_2 = [ 0  0^T ; 0  (K(x_i, x_{i'}))_{i,i'=1}^{n} ],    (2.17)
    W = diag( p_1(1 − p_1), ..., p_n(1 − p_n) ).    (2.18)

Notice K_1 is an n × (n + 1) matrix, K_2 is an (n + 1) × (n + 1) matrix and W is an n × n matrix. With some abuse of notation, (2.11) can be written in a finite dimensional form:

    H = 1̃^T ln( 1 + e^{−ỹ ∘ (K_1 α̃)} ) + (λ/2) α̃^T K_2 α̃,    (2.19)

where ∘ denotes the element-wise product.

Now we have

    ∂H/∂α̃ = −K_1^T (ỹ − p̃) + λ K_2 α̃,    (2.20)
    ∂²H/∂α̃ ∂α̃^T = K_1^T W K_1 + λ K_2.    (2.21)

The Newton-Raphson method then iterates as in Algorithm 2.1:

Algorithm 2.1 Newton-Raphson Method for KLR
1. Initialize α̃^0.
2. Compute p̃, K_1, K_2 and W.
3. Update

    α̃^t = α̃^{t−1} − ( ∂²H/∂α̃ ∂α̃^T )^{−1} ∂H/∂α̃    (2.22)
         = α̃^{t−1} + ( K_1^T W K_1 + λ K_2 )^{−1} [ K_1^T (ỹ − p̃) − λ K_2 α̃^{t−1} ]    (2.23)
         = ( K_1^T W K_1 + λ K_2 )^{−1} K_1^T W z̃,    (2.24)

   where z̃ = K_1 α̃^{t−1} + W^{−1} (ỹ − p̃).
4. Repeat steps (2) and (3) until α̃^t converges.

In (2.24), we have re-expressed the Newton-Raphson step as a weighted least-squares step. z̃ is sometimes known as the adjusted response, and the step is referred to as iteratively reweighted least squares. α̃^0 = 0 is usually a good starting value. Since H is convex, the algorithm typically does converge, but overshooting can occur. In the rare case that overshooting occurs, step size halving will guarantee convergence.
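A minimal sketch of one iteratively reweighted least-squares step in the spirit of (2.24); it is illustrative only. For concreteness it uses the equivalent 0/1 coding of the response and a dense solve of the (n+1) × (n+1) system, rather than anything optimized.

```python
import numpy as np

def klr_irls_step(K, y01, alpha, lam):
    """One IRLS update for kernel logistic regression, in the spirit of (2.24).

    K     : n x n kernel matrix [K(x_i, x_i')]
    y01   : responses coded 0/1 (an equivalent coding of y = -1/+1)
    alpha : current (n+1)-vector (beta_0, alpha_1, ..., alpha_n)
    lam   : regularization parameter lambda"""
    n = K.shape[0]
    K1 = np.hstack([np.ones((n, 1)), K])       # regressor matrix, n x (n+1)
    K2 = np.zeros((n + 1, n + 1))              # penalty matrix; beta_0 unpenalized
    K2[1:, 1:] = K
    f = K1 @ alpha                             # fitted values f(x_i)
    p = 1.0 / (1.0 + np.exp(-f))               # fitted probabilities
    w = np.clip(p * (1.0 - p), 1e-6, None)     # IRLS weights
    z = f + (y01 - p) / w                      # adjusted response
    A = K1.T @ (w[:, None] * K1) + lam * K2
    b = K1.T @ (w * z)
    return np.linalg.solve(A, b)               # weighted ridge solve

# usage: iterate from alpha = 0 until successive alphas stop changing
```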

2.1.3 Sequential Minimal Optimization Method

The drawback of the Newton-Raphson method is that in each iteration, an n × n matrix needs to be inverted. The corresponding computational cost can be high when n is large. Recently, Keerthi, Duan, Shevade & Poo (2002) proposed a dual algorithm for kernel logistic regression that avoids inverting huge matrices. It follows the spirit of the popular sequential minimal optimization (SMO) algorithm (Platt (1999)). Let C = 1/λ. Let

    F_i = Σ_{i'=1}^{n} α_{i'} y_{i'} K(x_{i'}, x_i) + y_i ln[ α_i / (C − α_i) ],    (2.25)
    f(α) = (1/2) Σ_i Σ_{i'} α_i α_{i'} y_i y_{i'} K(x_i, x_{i'}) + C Σ_i [ (α_i/C) ln(α_i/C) + (1 − α_i/C) ln(1 − α_i/C) ].    (2.26)

Then, the sequential minimal optimization algorithm proceeds as Algorithm 2.2:

Algorithm 2.2 Sequential Minimal Optimization Algorithm for KLR
1. Initialize the α_i such that 0 < α_i < C and Σ_i α_i y_i = 0.
2. Let

    F_up = max_i F_i,    (2.27)
    F_low = min_i F_i,    (2.28)
    i_up = argmax_i F_i,    (2.29)
    i_low = argmin_i F_i.    (2.30)

   If F_up = F_low, stop. If not, let

    α̃_{i_up}(s) = α_{i_up} − s / y_{i_up},    (2.31)
    α̃_{i_low}(s) = α_{i_low} + s / y_{i_low},    (2.32)
    α̃_i(s) = α_i,  ∀ i ≠ i_up, i_low.    (2.33)

   Find s* that minimizes f(α̃(s)).
3. Update α ← α̃(s*). Go to step (2).

At the end of the algorithm, we have β_0 = F_up = F_low and

    f(x) = β_0 + Σ_{i=1}^{n} α_i y_i K(x_i, x).    (2.34)

Preliminary computational experiments show that this sequential minimal optimization algorithm is robust and fast (Keerthi, Duan, Shevade & Poo (2002)). We have generalized this algorithm to the multi-class case. A detailed description of the multi-class case algorithm is given in the Appendix.

2.2 Import Vector Machines

Although the sequential minimal optimization method helps reduce the computational cost of kernel logistic regression, in the fitted model

    f(x) = β_0 + Σ_{i=1}^{n} α_i K(x_i, x),    (2.35)

as mentioned in section 2.1, all the α_i's are non-zero. This is not the case for the support vector machine, for only support points have non-zero α_i's. So the support vector machine allows for data compression and has the advantage of less storage and quicker evaluation. In this section, we propose an import vector machine model that finds a sub-model to approximate the full model (2.35) given by kernel logistic regression. The sub-model has the form:

    f(x) = β_0 + Σ_{x_i ∈ S} α_i K(x_i, x),    (2.36)

where S is a subset of the training data {x_1, x_2, ..., x_n}, and the data in S are called import points. The advantage of this sub-model is that the computational cost is reduced, especially for large training data sets, while not jeopardizing the performance in classification; and since only a subset of the training data are used to index the fitted model, data compression is achieved.

Several other researchers have also investigated techniques for selecting the subset S. Lin, Wahba, Xiang, Gao, Klein & Klein (2000) divide the training data into several clusters, then randomly select a representative from each cluster to make up S. Smola & Schölkopf (2000) develop a greedy technique to sequentially select m columns of the kernel matrix [K(x_i, x_{i'})]_{n×n}, such that the span of these m columns approximates the span of [K(x_i, x_{i'})]_{n×n} well in the Frobenius norm. Williams & Seeger (2001) propose randomly selecting m points of the training data, then using the Nyström method to approximate the eigen-decomposition of the kernel matrix [K(x_i, x_{i'})]_{n×n}, and expanding the results back up to n dimensions. None of these methods uses the output y_i in selecting the subset S (i.e., the procedure involves only x_i). The import vector machine algorithm uses both the output y_i and the input x_i to select the subset S, in such a way that the resulting fit approximates the full model well.

2.2.1 Algorithm

As mentioned before, we want to find a subset S of {x_1, x_2, ..., x_n}, such that the sub-model (2.36) is a good approximation of the full model (2.35). Since it is computationally impossible to search over every subset S, we use a greedy forward strategy as described in Algorithm 2.3. We call the points in S import points.

Algorithm 2.3 Basic IVM Algorithm
1. Let S = ∅, R = {x_1, x_2, ..., x_n}, t = 1.
2. For each x_l ∈ R, let

    f_l(x) = β_0 + Σ_{x_i ∈ S ∪ {x_l}} α_i K(x_i, x).

   Find α̃ to minimize

    H(x_l) = Σ_{i=1}^{n} ln( 1 + exp(−y_i f_l(x_i)) ) + (λ/2) ||f_l(x)||²_{H_K}    (2.37)
           = 1̃^T ln( 1 + exp(−ỹ ∘ (K_1^l α̃)) ) + (λ/2) α̃^T K_2^l α̃,    (2.38)

   where the regressor matrix K_1^l = [1̃, K(x_i, x_{i'})]_{n×(m+1)}, x_i ∈ {x_1, ..., x_n}, x_{i'} ∈ S ∪ {x_l}; the regularization matrix K_2^l = [ 0  0^T ; 0  K(x_i, x_{i'}) ]_{(m+1)×(m+1)}, x_i, x_{i'} ∈ S ∪ {x_l}; and m = |S|.
3. Find

    x_{l*} = argmin_{x_l ∈ R} H(x_l).

   Let S = S ∪ {x_{l*}}, R = R \ {x_{l*}}, H_t = H(x_{l*}), t ← t + 1.
4. Repeat steps (2) and (3) until H_t converges.

Algorithm 2.3 is computationally feasible, but in step (2) we need to use the Newton-Raphson method or the sequential minimal optimization method to find α̃ iteratively. When the number of import points m becomes large, the computation can be expensive. To reduce this computation, we use a further approximation. Instead of iteratively computing α̃ until it converges, we can simply do a one-step Newton-Raphson iteration, and use it as an approximation to the converged one. To get a good approximation, we take advantage of the fitted result from the current "optimal" S, i.e., the sub-model when |S| = m, and use it as the initial value. This one-step update is similar to the score test in generalized linear models (GLM), but the latter does not have a penalty term. The updating formula allows the weighted regression (2.24) to be computed in O(nm) time. Hence, we have the revised step (2) of Algorithm 2.3 in Algorithm 2.4.

Algorithm 2.4 Revised Step (2)
(2*) For each x_l ∈ R, correspondingly augment K_1 with a column, and K_2 with a column and a row. Use the updating formula to find α̃ in (2.24). Compute (2.37).

In step (4) of Algorithm 2.3, we need to decide when to stop the algorithm. A natural stopping rule is to look at the regularized negative log-likelihood. Let H_1, H_2, ... be the sequence of regularized negative log-likelihoods obtained in step (3). At each step t, we compare H_t with H_{t−Δt}, where Δt is a pre-chosen small integer, for example, Δt = 1. If the ratio |H_t − H_{t−Δt}| / |H_t| is less than some pre-chosen small number ε, for example, ε = 0.001, we stop adding new import points to S.
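A compact sketch of the greedy forward search of Algorithm 2.3 with the stopping rule above, using the regularized negative log-likelihood (2.37) as the selection criterion. For simplicity it refits each candidate sub-model with a few IRLS steps instead of the one-step update of Algorithm 2.4; it illustrates the idea and is not the thesis implementation.

```python
import numpy as np

def reg_nll(Kfull, y01, idx, lam, n_irls=10):
    """Fit the sub-model indexed by the points in idx and return the
    regularized negative log-likelihood (2.37)."""
    n = Kfull.shape[0]
    K1 = np.hstack([np.ones((n, 1)), Kfull[:, idx]])       # n x (m+1)
    K2 = np.zeros((len(idx) + 1, len(idx) + 1))             # beta_0 unpenalized
    K2[1:, 1:] = Kfull[np.ix_(idx, idx)]
    alpha = np.zeros(len(idx) + 1)
    for _ in range(n_irls):                                  # a few IRLS steps
        f = K1 @ alpha
        p = 1.0 / (1.0 + np.exp(-f))
        w = np.clip(p * (1.0 - p), 1e-6, None)
        z = f + (y01 - p) / w
        alpha = np.linalg.solve(K1.T @ (w[:, None] * K1) + lam * K2,
                                K1.T @ (w * z))
    f = K1 @ alpha
    nll = np.sum(np.log1p(np.exp(-np.where(y01 == 1, f, -f))))
    return nll + 0.5 * lam * alpha @ K2 @ alpha

def greedy_ivm(Kfull, y01, lam, eps=1e-3):
    """Greedy forward selection of import points (Algorithm 2.3)."""
    S, R, H_path = [], list(range(Kfull.shape[0])), []
    while R:
        scores = [reg_nll(Kfull, y01, S + [l], lam) for l in R]
        best = R[int(np.argmin(scores))]
        S.append(best); R.remove(best); H_path.append(min(scores))
        # stopping rule from the text, with Delta t = 1
        if len(H_path) > 1 and abs(H_path[-1] - H_path[-2]) < eps * abs(H_path[-1]):
            break
    return S, H_path
```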

2.2.2 Selection of λ

So far, we have assumed that the tuning parameter λ is fixed. In practice, we also need to choose an "optimal" λ. We can randomly split all the data into a training set and a tuning set, and use the misclassification error on the tuning set as a criterion for choosing λ. To reduce the computation, we take advantage of the fact that the regularized negative log-likelihood converges faster for a larger λ. Thus, instead of running the entire revised algorithm for each λ, we propose Algorithm 2.5, which combines both adding import points to S and choosing the optimal λ.

Algorithm 2.5 Simultaneous Selection of S and λ
1. Start with a large tuning parameter λ.
2. Let S = ∅, R = {x_1, ..., x_n}, t = 1.
3. Run steps (2*), (3) and (4) of the revised Algorithm 2.4, until the stopping criterion is satisfied at S = {x_{i_1}, ..., x_{i_t}}. Along the way, also compute the misclassification error on the tuning set.
4. Decrease λ to a smaller value.
5. Repeat steps (3) and (4), starting with S = {x_{i_1}, ..., x_{i_t}}.

We choose the optimal λ as the one that corresponds to the minimum misclassification error on the tuning set.

2.2.3 Simulation Results

In this section, we use a simulation to illustrate the import vector machine method. The data are generated in the same way as in Figure 2.2. The simulation results are shown in Figure 2.3 – Figure 2.5. Figure 2.3 shows how the tuning parameter λ is selected. The optimal λ is found to be equal to 1 and corresponds to a misclassification rate of 0.262. Figure 2.4 fixes the tuning parameter at λ = 1 and finds 19 import points. Figure 2.5 compares the results of the support vector machine and the import vector machine: the support vector machine has 130 support points, and the import vector machine uses 19 import points; they give similar classification boundaries. Figure 2.6 is for the same simulation but different sizes of training data: n = 200, 400, 600, 800. We see that as the training data size n increases, the number of import points does not tend to increase.

[Figure 2.3 (two panels, "Choosing Lambda"): regularized deviance and misclassification rate plotted against the number of import points, with the optimal λ = 1 marked.]

Figure 2.3: Radial kernel is used. n = 200, σ² = 0.7, Δt = 3, ε = 0.001, λ decreases from e^{10} to e^{−10}. The minimum misclassification rate 0.262 is found to correspond to λ = 1.

Remark: The support points of the support vector machine are those which are close to the classification boundary or misclassified, and usually have large weights [p(x)(1 − p(x))]. The import points of the import vector machine are those that decrease the regularized negative log-likelihood the most, and can be either close to or far from the classification boundary. The difference in properties is natural, because the support vector machine is only concerned with the classification sign[p(x) − 1/2], while the import vector machine also focuses on the unknown probability p(x). Though points away from the classification boundary do not contribute to determining the position of the classification boundary, they may contribute to estimating the unknown probability p(x). The total computational cost of the support vector machine is O(n²s), where s is the number of support points, while the computational cost of the import vector machine method is O(n²m²), where m is the number of import points. Since m does not tend to increase as n increases, as illustrated in Figure 2.6, the computational cost of the import vector machine can be smaller than that of the support vector machine, especially for large training data sets.

[Figure 2.4 (two panels, λ = 1): regularized deviance and misclassification rate plotted against the number of import points; the number of import points is 19, and the test error is marked.]

Figure 2.4: Radial kernel is used. n = 200, σ² = 0.7, Δt = 3, ε = 0.001, λ = 1. The stopping criterion is satisfied when |S| = 19.

2.2.4 Real Data Results

In this section, we compare the performance of the import vector machine and the support vector machine on some real datasets. Ten benchmark datasets are used for this purpose: Banana, Breast-cancer, Flare-solar, German, Heart, Image, Ringnorm, Splice, Thyroid, Titanic, Twonorm and Waveform. Detailed information about these datasets can be found in Rätsch, T. Onda & K.-R. Müller (2000) or is available at http://ida.first.gmd.edu/~raetsch/data.

[Figure 2.5 (two panels): the SVM with 130 support points and the IVM with 19 import points; each panel reports the training, test and Bayes errors.]

Figure 2.5: The solid black lines are classification boundaries; the dashed purple lines are Bayes optimal boundaries. For the SVM, the dashed black lines are the edges of the margins, and the black points are the support points exactly on the margin. For the IVM, the dashed black lines are the p_1(x) = 0.25 and 0.75 lines, and the black points are the import points.

Table 2.1 contains a summary of these datasets. The radial kernel (1.13), K(x, x') = e^{−||x − x'||² / (2σ²)}, is used throughout these datasets. The parameters σ and λ are fixed at specific values that are optimal for the support vector machine's generalization performance (Rätsch, T. Onda & K.-R. Müller (2000)). Each dataset has 20 realizations of the training and test data. The results are in Table 2.2 and Table 2.3. The number outside each bracket is the mean over 20 realizations of the training and test data, and the number in each bracket is the standard deviation. From Table 2.2, we can see that the import vector machine performs as well as the support vector machine in classification on these benchmark datasets. From Table 2.3, we can see that the import vector machine typically uses a much smaller fraction of the training data than the support vector machine to index kernel basis functions. This may give the import vector machine a computational advantage over the support vector machine, especially when the size of the training data is large.

[Figure 2.6 (four panels): regularized deviance plotted against the number of import points for training data sizes n = 200, 400, 600 and 800; the corresponding numbers of import points are 19, 22, 26 and 22.]

Figure 2.6: The data are generated in the same way as in Figure 2.3 – Figure 2.5. Radial kernel is used. σ = 0.7, λ = 1, Δt = 3, ε = 0.001. The sizes of training data are n = 200, 400, 600, 800, and the corresponding numbers of import points are 19, 22, 26, 22.

2.3 Multi-class Case

In this section, we briefly describe a generalization of the import vector machine to multi-class classification. Suppose there are K classes. The conditional probability of a point being in class k given X = x is denoted as p_k(x) = P(Y = k | X = x). Hence the Bayes classification rule is given by:

    c(x) = argmax_{k ∈ {1,...,K}} p_k(x).

Table 2.1: Summary of the ten benchmark datasets. n is the size of the training data, p is the dimension of the original input, σ² is the parameter of the radial kernel, λ is the tuning parameter, and N is the size of the test data.

    Dataset          n    p    σ²    λ    N
    Banana
    Breast-cancer
    Flare-solar
    German
    Heart
    Image
    Ringnorm
    Thyroid
    Titanic
    Twonorm
    Waveform

Table 2.2: Comparison of classification performance of SVM and IVM on ten benchmark datasets.

    Dataset         SVM Error (%)     IVM Error (%)
    Banana          10.78 (±0.68)     10.34 (±0.46)
    Breast-cancer   25.58 (±4.50)     25.92 (±4.79)
    Flare-solar     32.65 (±1.42)     33.66 (±1.64)
    German          22.88 (±2.28)     23.53 (±2.48)
    Heart           15.95 (±3.14)     15.80 (±3.49)
    Image            3.34 (±0.70)      3.31 (±0.80)
    Ringnorm         2.03 (±0.19)      1.97 (±0.29)
    Thyroid          4.80 (±2.98)      5.00 (±3.02)
    Titanic         22.16 (±0.60)     22.39 (±1.03)
    Twonorm          2.90 (±0.25)      2.45 (±0.15)
    Waveform         9.98 (±0.43)     10.13 (±0.47)

Table 2.3: Comparison of number of kernel basis used by SVM and IVM on ten benchmark datasets.

    Dataset         # of SV      # of IV
    Banana           90 (±10)    21 (±7)
    Breast-cancer   115 (±5)     14 (±3)
    Flare-solar     597 (±8)      9 (±1)
    German          407 (±10)    17 (±2)
    Heart            90 (±4)     12 (±2)
    Image           221 (±11)    72 (±18)
    Ringnorm         89 (±5)     72 (±30)
    Thyroid          21 (±2)     22 (±3)
    Titanic          69 (±9)      8 (±2)
    Twonorm          70 (±5)     24 (±4)
    Waveform        151 (±9)     26 (±3)

The model has the form:

    p_1(x) = e^{f_1(x)} / Σ_{k=1}^{K} e^{f_k(x)},    (2.39)
    p_2(x) = e^{f_2(x)} / Σ_{k=1}^{K} e^{f_k(x)},    (2.40)
        ⋮                                            (2.41)
    p_K(x) = e^{f_K(x)} / Σ_{k=1}^{K} e^{f_k(x)},    (2.42)

where f_k(x) ∈ H_K, and H_K is the reproducing kernel Hilbert space generated by a positive definite kernel K(·,·). Notice that f_1(x), ..., f_K(x) are not identifiable in this model, for if we add a common term to each f_k(x), then p_1(x), ..., p_K(x) will not change. To make f_k(x) identifiable, we consider the symmetric constraint

    Σ_{k=1}^{K} f_k(x) = 0.    (2.43)

Then the multi-class kernel logistic regression fits a model to minimize the regularized

negative log-likelihood

    H = −Σ_{i=1}^{n} ln p_{y_i}(x_i) + (λ/2) ||f||²_{H_K}    (2.44)
      = Σ_{i=1}^{n} [ −y_i^T f(x_i) + ln( e^{f_1(x_i)} + ··· + e^{f_K(x_i)} ) ] + (λ/2) ||f||²_{H_K},    (2.45)

where y_i is a binary K-vector with values all zero except a 1 in position k if the class is k, and

    f(x_i) = (f_1(x_i), ..., f_K(x_i))^T,    (2.46)
    ||f||²_{H_K} = Σ_{k=1}^{K} ||f_k||²_{H_K}.    (2.47)

Using the representer theorem (Kimeldorf & Wahba (1971)), one can show that f_k(x), which minimizes H, has the form

    f_k(x) = β_{0k} + Σ_{i=1}^{n} α_{ik} K(x_i, x).    (2.48)

Hence, (2.44) becomes

    H = Σ_{i=1}^{n} [ −y_i^T (K_1(i,·) A)^T + ln( 1^T e^{(K_1(i,·) A)^T} ) ] + (λ/2) Σ_{k=1}^{K} α_k^T K_2 α_k,    (2.49)

where A = (α_1 ... α_K), K_1 and K_2 are defined in the same way as in the two-class case, and K_1(i,·) is the ith row of K_1. Notice that in this model, the constraint (2.43) is not necessary anymore, for at the minimum of (2.44), Σ_{k=1}^{K} f_k(x) = 0 is automatically satisfied.
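A small sketch of the multi-class model (2.39)–(2.42) and the regularized negative log-likelihood (2.44)–(2.45), evaluated for a given matrix of fitted values; illustrative only, with toy inputs.

```python
import numpy as np

def class_probabilities(F):
    """Softmax probabilities p_k(x_i) = exp(f_k(x_i)) / sum_l exp(f_l(x_i)),
    as in (2.39)-(2.42). F is an n x K matrix of fitted values f_k(x_i)."""
    F = F - F.max(axis=1, keepdims=True)        # stabilize the exponentials
    E = np.exp(F)
    return E / E.sum(axis=1, keepdims=True)

def regularized_nll(F, Y, penalty, lam):
    """Regularized negative log-likelihood (2.44)-(2.45).

    Y is the n x K indicator matrix (a 1 in the column of the observed class);
    penalty stands in for ||f||^2_{H_K} = sum_k ||f_k||^2_{H_K}."""
    P = class_probabilities(F)
    nll = -np.sum(Y * np.log(P))
    return nll + 0.5 * lam * penalty

# toy usage: three observations, K = 3 classes
F = np.array([[2.0, 0.1, -1.0], [0.0, 1.5, 0.2], [-0.3, 0.0, 0.8]])
Y = np.eye(3)
value = regularized_nll(F, Y, penalty=1.0, lam=0.5)
```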

2.3.1 Multi-class KLR and Multi-class SVM

Similar to Theorem 2.1, a connection between the multi-class kernel logistic regression and the multi-class support vector machine also exists. Let D = {h_1(x), ..., h_q(x)} be the dictionary of the basis functions of the enlarged feature space. Consider

    min_{β_{0k}, β_k}  Σ_{i=1}^{n} [ −y_i^T f(x_i) + ln( e^{f_1(x_i)} + ··· + e^{f_K(x_i)} ) ]    (2.50)
    subject to  f_k(x_i) = β_{0k} + h(x_i)^T β_k,  i = 1, ..., n,    (2.51)
                Σ_{k=1}^{K} ||β_k||_2² ≤ s.    (2.52)

Theorem 2.2 Suppose the training data are pairwise separable, i.e. ∃ β_{0k}, β_k, s.t. (β_{0 y_i} − β_{0k}) + h(x_i)^T (β_{y_i} − β_k) > 0, ∀i, ∀k ≠ y_i. Let the solution of (2.50)–(2.52) be denoted by β̂_k(s); then

    β̂_k(s)/s → β*  as  s → ∞,

where β* is the solution of the multi-class support vector machine (1.18)–(1.22), if β* is unique. If β* is not unique, then β̂_k(s)/s may have multiple convergence points, but they are all solutions of (1.18)–(1.22).

The proof of the theorem is very similar to that of Theorem 2.1; we omit it here.

2.3.2 Multi-class IVM

The multi-class import vector machine procedure is similar to the two-class case, and the computational cost is O(Kn²m²). Figure 2.7 is a simulation of the multi-class import vector machine. The data in each class are generated from a mixture of Gaussians (Hastie,

Tibshirani & Friedman (2001)).

[Figure 2.7 (panel title: "Multi-class IVM - with 32 import points"): the panel reports the training, test and Bayes errors.]

Figure 2.7: Radial kernel is used. K = 3, N = 300, λ = 0.368, |S| = 32.

2.4 Summary

We have discussed the import vector machine method in two-class and multi-class classification. We showed that it not only performs as well as the support vector machine, but also provides an estimate of the probability p(x). The computational cost of the import vector machine is O(n²m²) for the two-class case and O(Kn²m²) for the multi-class case, where m is the number of import points.

Chapter 3

1-norm Support Vector Machines

In Chapter 2, we replace the hinge loss function of the support vector machine with the binomial log-likelihood and fit kernel logistic regression and import vector machine models. In this chapter, we replace the L_2-norm penalty of the support vector machine with the L_1-norm, and consider the 1-norm support vector machine. We argue that the 1-norm support vector machine may have some advantage over the standard 2-norm support vector machine, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm support vector machine, hence facilitating adaptive selection of the tuning parameter for the 1-norm support vector machine.

3.1 Introduction

In standard two-class classification problems, we are given a set of training data (x_1, y_1), ..., (x_n, y_n), where the input x_i ∈ R^p, and the output y_i ∈ {1, −1} is binary. We wish to find a classification rule from the training data, so that when given a new input x, we can assign a class y from {1, −1} to it.
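As a preview of this chapter, the sketch below (illustrative only, not the path algorithm developed later) evaluates the 1-norm support vector machine objective obtained by swapping the L_2 penalty in (1.15) for an L_1 penalty on a linear decision function.

```python
import numpy as np

def one_norm_svm_objective(beta0, beta, X, y, lam):
    """Hinge loss plus L1 penalty: sum_i (1 - y_i f(x_i))_+ + lam * ||beta||_1,
    with f(x) = beta0 + x^T beta.  The L1 penalty tends to set some fitted
    coefficients exactly to zero, giving automatic feature selection."""
    f = beta0 + X @ beta
    hinge = np.maximum(0.0, 1.0 - y * f).sum()
    return hinge + lam * np.abs(beta).sum()

# toy usage with two redundant noise features (columns 2 and 3)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=20))
value = one_norm_svm_objective(0.0, np.array([1.0, -1.0, 0.0, 0.0]), X, y, lam=0.5)
```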

Pattern Recognition 2014 Support Vector Machines

Pattern Recognition 2014 Support Vector Machines Pattern Recgnitin 2014 Supprt Vectr Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recgnitin 1 / 55 Overview 1 Separable Case 2 Kernel Functins 3 Allwing Errrs (Sft

More information

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines COMP 551 Applied Machine Learning Lecture 11: Supprt Vectr Machines Instructr: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted fr this curse

More information

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d)

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d) COMP 551 Applied Machine Learning Lecture 9: Supprt Vectr Machines (cnt d) Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Class web page: www.cs.mcgill.ca/~hvanh2/cmp551 Unless therwise

More information

IAML: Support Vector Machines

IAML: Support Vector Machines 1 / 22 IAML: Supprt Vectr Machines Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester 1 2 / 22 Outline Separating hyperplane with maimum margin Nn-separable training data Epanding the input int

More information

What is Statistical Learning?

What is Statistical Learning? What is Statistical Learning? Sales 5 10 15 20 25 Sales 5 10 15 20 25 Sales 5 10 15 20 25 0 50 100 200 300 TV 0 10 20 30 40 50 Radi 0 20 40 60 80 100 Newspaper Shwn are Sales vs TV, Radi and Newspaper,

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants 12 Supprt Vectr Machines and Flexible Discriminants This is page 417 Printer: Opaque this 12.1 Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal

More information

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw: In SMV I IAML: Supprt Vectr Machines II Nigel Gddard Schl f Infrmatics Semester 1 We sa: Ma margin trick Gemetry f the margin and h t cmpute it Finding the ma margin hyperplane using a cnstrained ptimizatin

More information

Support-Vector Machines

Support-Vector Machines Supprt-Vectr Machines Intrductin Supprt vectr machine is a linear machine with sme very nice prperties. Haykin chapter 6. See Alpaydin chapter 13 fr similar cntent. Nte: Part f this lecture drew material

More information

The blessing of dimensionality for kernel methods

The blessing of dimensionality for kernel methods fr kernel methds Building classifiers in high dimensinal space Pierre Dupnt Pierre.Dupnt@ucluvain.be Classifiers define decisin surfaces in sme feature space where the data is either initially represented

More information

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017 Resampling Methds Crss-validatin, Btstrapping Marek Petrik 2/21/2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins in R (Springer, 2013) with

More information

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification COMP 551 Applied Machine Learning Lecture 5: Generative mdels fr linear classificatin Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Jelle Pineau Class web page: www.cs.mcgill.ca/~hvanh2/cmp551

More information

COMP 551 Applied Machine Learning Lecture 4: Linear classification

COMP 551 Applied Machine Learning Lecture 4: Linear classification COMP 551 Applied Machine Learning Lecture 4: Linear classificatin Instructr: Jelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted

More information

Tree Structured Classifier

Tree Structured Classifier Tree Structured Classifier Reference: Classificatin and Regressin Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stne, Chapman & Hall, 98. A Medical Eample (CART): Predict high risk patients

More information

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression 3.3.4 Prstate Cancer Data Example (Cntinued) 3.4 Shrinkage Methds 61 Table 3.3 shws the cefficients frm a number f different selectin and shrinkage methds. They are best-subset selectin using an all-subsets

More information

Contents. This is page i Printer: Opaque this

Contents. This is page i Printer: Opaque this Cntents This is page i Printer: Opaque this Supprt Vectr Machines and Flexible Discriminants. Intrductin............. The Supprt Vectr Classifier.... Cmputing the Supprt Vectr Classifier........ Mixture

More information

Resampling Methods. Chapter 5. Chapter 5 1 / 52

Resampling Methods. Chapter 5. Chapter 5 1 / 52 Resampling Methds Chapter 5 Chapter 5 1 / 52 1 51 Validatin set apprach 2 52 Crss validatin 3 53 Btstrap Chapter 5 2 / 52 Abut Resampling An imprtant statistical tl Pretending the data as ppulatin and

More information

, which yields. where z1. and z2

, which yields. where z1. and z2 The Gaussian r Nrmal PDF, Page 1 The Gaussian r Nrmal Prbability Density Functin Authr: Jhn M Cimbala, Penn State University Latest revisin: 11 September 13 The Gaussian r Nrmal Prbability Density Functin

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

Chapter 3: Cluster Analysis

Chapter 3: Cluster Analysis Chapter 3: Cluster Analysis } 3.1 Basic Cncepts f Clustering 3.1.1 Cluster Analysis 3.1. Clustering Categries } 3. Partitining Methds 3..1 The principle 3.. K-Means Methd 3..3 K-Medids Methd 3..4 CLARA

More information

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels Mtivating Example Memry-Based Learning Instance-Based Learning K-earest eighbr Inductive Assumptin Similar inputs map t similar utputs If nt true => learning is impssible If true => learning reduces t

More information

Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) >

Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) > Btstrap Methd > # Purpse: understand hw btstrap methd wrks > bs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(bs) > mean(bs) [1] 21.64625 > # estimate f lambda > lambda = 1/mean(bs);

More information

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeff Reading: Chapter 2 STATS 202: Data mining and analysis September 27, 2017 1 / 20 Supervised vs. unsupervised learning In unsupervised

More information

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall Stats 415 - Classificatin Ji Zhu, Michigan Statistics 1 Classificatin Ji Zhu 445C West Hall 734-936-2577 jizhu@umich.edu Stats 415 - Classificatin Ji Zhu, Michigan Statistics 2 Examples f Classificatin

More information

Linear Classification

Linear Classification Linear Classificatin CS 54: Machine Learning Slides adapted frm Lee Cper, Jydeep Ghsh, and Sham Kakade Review: Linear Regressin CS 54 [Spring 07] - H Regressin Given an input vectr x T = (x, x,, xp), we

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint Biplts in Practice MICHAEL GREENACRE Prfessr f Statistics at the Pmpeu Fabra University Chapter 13 Offprint CASE STUDY BIOMEDICINE Cmparing Cancer Types Accrding t Gene Epressin Arrays First published:

More information

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeff Reading: Chapter 2 STATS 202: Data mining and analysis September 27, 2017 1 / 20 Supervised vs. unsupervised learning In unsupervised

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

Distributions, spatial statistics and a Bayesian perspective
Doug Nychka, National Center for Atmospheric Research. Distributions and densities; conditional distributions and Bayes' theorem; the bivariate normal; spatial statistics...

Support Vector Machines and Flexible Discriminants
Introduction: in this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating...

T-61.5060 Algorithmic methods for data mining. Slide set 6: dimensionality reduction
Reading assignment: LRU book, 11.1-11.3; PCA tutorial in MyCourses (optional); optional: An Elementary Proof of a Theorem of Johnson and Lindenstrauss...
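The reading list above mentions the Johnson-Lindenstrauss theorem. A minimal R sketch of the random-projection idea behind it follows; the dimensions and data are assumptions for illustration.

# Gaussian random projection: pairwise distances are roughly preserved.
set.seed(4)
n <- 50; p <- 1000; k <- 100                   # original and reduced dimensions (assumed)
X <- matrix(rnorm(n * p), nrow = n)
R <- matrix(rnorm(p * k), nrow = p) / sqrt(k)  # random projection matrix
Z <- X %*% R                                   # projected data, n x k
summary(as.vector(dist(Z) / dist(X)))          # distance ratios concentrate near 1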

Midwest Big Data Summer School: Machine Learning I: Introduction
Kris De Brabanter (kbrabant@iastate.edu), Iowa State University, Department of Statistics and Department of Computer Science. June 24, 2016. Outline...

AP Statistics Notes, Unit Two: The Normal Distributions
Syllabus objectives: 1.5 The student will summarize distributions of data, measuring position using quartiles, percentiles, and standardized scores (z-scores).
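A short R sketch of the z-score and normal-percentile calculations the objective above refers to; the data values are invented for illustration.

# Standardized scores and normal percentiles (illustrative data).
x <- c(62, 70, 74, 81, 85, 90, 95)
z <- (x - mean(x)) / sd(x)        # z-scores
round(z, 2)
pnorm(1.5)                        # proportion below z = 1.5 under a standard normal
qnorm(0.90)                       # z-score marking the 90th percentile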

Kinetic Model Completeness
5.68J/10.652J Spring 2003 Lecture Notes, Tuesday April 15, 2003. We say a chemical kinetic model is complete for a particular reaction condition when it contains all the species and reactions...

Section 3.2: Finding roots/zeros/x-intercepts of polynomials
Many of you WILL need to watch the corresponding videos for this section on MyOpenMath! This section is primarily focused on tools to aid us in finding roots/zeros/x-intercepts of polynomials. Essentially, our focus turns to solving...
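A minimal R sketch of finding polynomial roots numerically with base R's polyroot(); the example polynomial is invented for illustration.

# Roots of -6 + 11x - 6x^2 + x^3 = (x - 1)(x - 2)(x - 3).
# polyroot() takes coefficients in increasing order of power.
polyroot(c(-6, 11, -6, 1))        # complex roots, numerically 1, 2, 3
Re(polyroot(c(-6, 11, -6, 1)))    # drop the (numerically zero) imaginary parts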

Math Foundations 20 Work Plan
Units/Topics: 20.8 Demonstrate understanding of systems of linear inequalities in two variables. Time frame: December, 1-3 weeks. Major learning indicators: identify situations relevant...

Dynamic and static security assessment of power systems using artificial neural networks
M.E. Aggoune, M.J. Damborg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, Proceedings of the NSF Workshop on Applications... "...the results to larger systems due to properties of the projection algorithm. First, the number of hidden nodes must..."

Least Squares Optimal Filtering with Multirate Observations
Charles W. Therrien and Anthony H. Hawes, Department of Electrical... Proc. 36th Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, CA, November 2002.

Chapter 3: Inequalities
Copyright - The Institute of Chartered Accountants of India. Learning objectives: one of the widely used decision-making problems, nowadays, is to decide on the optimal mix of scarce...

Chapter 24: Inference in Regression (MATH 1342, April 25 and 27, 2013)
Make inferences about the population from which the sample data came. Chapters 4 and 5: relationships between two quantitative variables. Be able to make a graph (scatterplot), summarize the...

Simple Linear Regression (single variable)
Introduction to Machine Learning, Marek Petrik, January 31, 2017. Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications...
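A minimal R sketch of a single-variable linear regression of the kind the last entry covers; the data are simulated for illustration.

# Simple linear regression with one predictor (illustrative data).
set.seed(5)
x <- runif(60, 0, 10)
y <- 2 + 0.8 * x + rnorm(60)
fit <- lm(y ~ x)
summary(fit)$coefficients           # intercept and slope with standard errors
predict(fit, newdata = data.frame(x = 5), interval = "confidence")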

ENSC 49 - Discrete Time Systems: Project Outline, Semester 006-1
Objectives: the goal of the project is to design a channel fading simulator. Upon successful completion of the project, you will reinforce your understanding...

Linear programming III
Review of what we covered in the previous two classes. LP problem setup: linear objective function, linear constraints; there exists an extreme-point optimal solution. Simplex method: go through extreme point...
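A minimal R sketch of setting up and solving a small LP, in the spirit of the review above; the example problem and the use of the lpSolve package are assumptions for illustration.

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
library(lpSolve)                    # assumed available
out <- lp(direction    = "max",
          objective.in = c(3, 2),
          const.mat    = rbind(c(1, 1), c(1, 3)),
          const.dir    = c("<=", "<="),
          const.rhs    = c(4, 6))
out$solution        # optimal (x, y), an extreme point of the feasible region
out$objval          # optimal objective value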

A New Evaluation Measure
J. Joiner and L. Werner. Abstract: The problems of evaluation and the needed criteria of evaluation measures in the SMART system of information retrieval are reviewed and discussed.

Supplementary Material. GaGa: a simple and flexible hierarchical model for microarray data analysis
David Rossell, Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX 77030, USA. rosselldavid@gmail.com

Smoothing, penalized least squares and splines
Douglas Nychka, www.image.ucar.edu/~nychka. Locally weighted averages; penalized least squares smoothers; properties of smoothers; splines and reproducing kernels; the interpolation...
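A minimal R sketch of a penalized least squares smoother using base R's smooth.spline(); the simulated curve is an assumption for illustration.

# Smoothing spline fit to noisy data (illustrative data).
set.seed(6)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit <- smooth.spline(x, y)          # penalty chosen by generalized cross-validation
fit$lambda                          # selected smoothing parameter
plot(x, y)
lines(predict(fit), col = "blue")   # fitted smooth over the data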

Matching Techniques. Technical Track Session VI
Emanuela Galasso, The World Bank. These slides were developed by Christel Vermeersch and modified by Emanuela Galasso for the purpose of this workshop. When can we use...

Numbers, Mathematics and Equations
Australian Curriculum Physics: Getting Started with Physics. An integral part to the understanding of our physical world is the use of mathematical models which can be used to...

Survival Analysis with Support Vector Machines
Wolfgang Härdle, Ruslan Moro. Center for Applied Statistics and Economics (CASE), Humboldt-Universität zu Berlin. Motivation: applications in medicine, estimation of...

CN700 Additive Models and Trees, Chapter 9: Hastie et al. (2001)
Madhusudana Shashanka, Department of Cognitive and Neural Systems, Boston University. March 02, 2004. Overview...

Determining the Accuracy of Modal Parameter Estimation Methods
by Michael Lee, Ph.D., P.E. and Mark Richardson, Ph.D., Structural Measurement Systems, Milpitas, CA. Abstract: The most common type of modal testing system...

Reinforcement Learning: MDP Search Trees and Optimal Quantities
Content adapted from Berkeley CS188. MDP search trees: each MDP state projects an expectimax-like search tree. Optimal quantities: the value (utility) of a state s, V*(s) = expected...
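The last line above truncates the definition of V*(s). For completeness, the standard Bellman optimality equation that this kind of slide usually states is given below; the notation (transition probabilities T, rewards R, discount factor gamma) is an assumption, not taken from the excerpt.

V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right]

That is, the value of a state is the expected discounted return obtained by acting optimally from that state onward.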

MATHEMATICS SYLLABUS SECONDARY 5th YEAR (6 period/week course)
European Schools, Office of the Secretary-General, Pedagogical Development Unit. Ref.: 011-01-D-8-en. Orig.: EN. Approved by the Joint Teaching Committee...

5th grade Common Core Standards
In Grade 5, instructional time should focus on three critical areas: (1) developing fluency with addition and subtraction of fractions, and developing understanding of the multiplication...

Differentiation Applications 1: Related Rates
Model 1: Sliding Ladder. A 10 ft ladder is leaning against a wall when the bottom...

Lead/Lag Compensator Frequency Domain Properties and Design Methods (Lectures 6 and 7)
Definition: consider the compensator (i.e., controller); for ... it is called a lag compensator, and for ... it is called a lead compensator. Notation...

Preparation work for A2 Mathematics [2018]
The work studied in Y12 will form the foundations on which we will build in Year 13. It will only be reviewed during Year 13; it will not be retaught. This is to allow time...

Chapter IV - Exponents and Logarithms (B. Definition of an exponential)
A. Introduction: Starting with addition and defining the notations for subtraction, multiplication and division, we discovered negative numbers and fractions.

The Solution Path of the Slab Support Vector Machine
CCCG 2008, Montréal, Québec, August 3-5, 2008. Michael Eigensatz, Joachim Giesen, Madhusudan Manjunath. Abstract: Given a set of points in a Hilbert space that can...

Homology groups of disks with holes
THEOREM. Let {p_1, ..., p_k} be a sequence of distinct points in the interior unit disk D^n, where n >= 2, and suppose that for all j the sets E_j ⊂ Int D^n are closed, pairwise disjoint subdisks.

Fall 2013 Physics 172 Recitation 3: Momentum and Springs
Purpose: The purpose of this recitation is to give you experience working with momentum and the momentum update formula. Readings: Chapter .3-.5. Learning objectives...

Enhancing Performance of MLP/RBF Neural Classifiers via a Multivariate Data Distribution Scheme
Halis Altun, Gökhan Gelen. Nigde University, Electrical and Electronics Engineering Department, Nigde, Turkey. haltun@nigde.edu.tr

Section 7: Model Assessment. Internal vs. external validity
This section is based on Stock and Watson's Chapter 9. Internal validity refers to whether the analysis is valid for the population and sample being studied.

A Scalable Recurrent Neural Network Framework for Model-free POMDPs
Zhenzhen Liu, Itamar Elhanany. Machine Intelligence Lab, Department of Electrical and Computer Engineering, The University of Tennessee. April 3, 2007.

CS 477/677 Analysis of Algorithms, Fall 2007, Dr. George Bebis. Course Project, Due Date: 11/29/2007
Part 1: Comparison of Sorting Algorithms (70% of the project grade). The objective of the first part of the assignment is...

Chapter 4: Diagnostics for Influential Observations
Influential observations are observations whose presence in the data can have a distorting effect on the parameter estimates and possibly the entire analysis.

STATS216v Introduction to Statistical Learning, Stanford University, Summer 2016. Practice Final (Solutions)
Duration: 3 hours. Instructions: (This is a practice final and will not be graded.) Remember the university...

Module Four
This module addresses functions. SC Academic Standards: EA-3.1 Classify a relationship as being either a function or not a function when given data as a table, set of ordered pairs, or graph. EA-3.2 Use...

Part 3: Introduction to statistical classification techniques
Machine Learning, Part 3, March 07, Fabio Roli. Preamble: In Part ... we have seen that if we know the posterior probabilities P(ω_i | x), or the equivalent terms...

Blue Valley District Curriculum: Mathematics, Third Grade
The standards are taught in the following sequence. In grade 3, instructional time should focus on four critical areas: (1) developing understanding of multiplication and division and...

Slide04 (supplemental). Haykin Chapter 4 (both 2nd and 3rd ed.): Multi-Layer Perceptrons
CPSC 636-600, Instructor: Yoonsuck Choe. Heuristics for making backprop perform better: 1. Sequential vs. batch update: for large...

The Power and Limit of Neural Networks
T. Y. Lin, Department of Mathematics and Computer Science, San Jose State University, San Jose, California, tylin@cs.ssu.edu, and Berkeley Initiative in Soft Computing. 1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp...

Computational modeling techniques. Lecture 4: Model checking for ODE models
Ion Petre, Department of IT, Åbo Akademi, http://www.users.ab.fi/ipetre/cmpmd/. Content: the stoichiometric matrix; calculating the mass conservation relations...

Building to Transformations on Coordinate Axis. Grade 5: Geometry
Graph points on the coordinate plane to solve real-world and mathematical problems. 5.G.1. Use a pair of perpendicular number lines, called axes, to define...

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers
Part: Short Problems. The table at the right gives the population of Massachusetts over the past several decades. Using an exponential model, predict the...

Thermodynamics and Equilibrium
Thermodynamics is the study of the relationship between heat and other forms of energy in a chemical or physical process. We introduced the thermodynamic property of enthalpy...

Causal Inference. Technical Track Session I
Phillippe Leite, The World Bank. These slides were developed by Christel Vermeersch and modified by Phillippe Leite for the purpose of this workshop. Policy questions are causal...

Computational modeling techniques. Lecture 2: Modeling change
Ion Petre, Department of IT, Åbo Akademi, http://users.ab.fi/ipetre/cmpmd/. Content of the lecture: the basic paradigm of modeling change; examples; linear dynamical...

Modelling of Clock Behaviour
Don Percival, Applied Physics Laboratory, University of Washington, Seattle, Washington, USA. Overheads and paper for the talk available at http://faculty.washingtn.edu/dbp/talks.html. Overview...

Comparing Several Means: ANOVA. Group Means and Grand Mean
STAT 511, ANOVA and Regression. Blue Lake snap beans were grown in 12 open-top chambers which are subject to 4 treatments, 3 each, with O3 and SO2 present/absent. The total...
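A minimal R sketch of a one-way ANOVA comparing group means, echoing the snap-bean example above; the data, group labels, and effect sizes are simulated assumptions, not the study's values.

# One-way ANOVA: compare group means against the grand mean (illustrative data).
set.seed(7)
growth <- data.frame(
  treatment = factor(rep(c("control", "O3", "SO2", "O3+SO2"), each = 3)),
  yield     = c(rnorm(3, 10), rnorm(3, 9), rnorm(3, 8.5), rnorm(3, 7.5))
)
fit <- aov(yield ~ treatment, data = growth)
summary(fit)                        # F test for equality of the group means
model.tables(fit, type = "means")   # group means and the grand mean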

Insurance Markets
Department of Economics, University of California, Davis. Econ 200C Micro Theory, Professor Giacomo Bonanno. Consider an individual who has an initial wealth of ... With some probability p he faces a loss of x (0...

Matching Techniques. Technical Track Session VI
Céline Ferré, The World Bank. When can we use matching? What if the assignment to the treatment is not done randomly or based on an eligibility index, but on the basis...

Note on the Analysis of a Randomized Block Design
Junjiro Ogawa, University of North Carolina. This research was supported by the Office of Naval Research under Contract No. Nonr-855(06) for research in probability...

Preparation work for A2 Mathematics [2017]
The work studied in Y12 after the return from study leave is from the Core 3 module of the A2 Mathematics course. This work will only be reviewed during Year 13; it will...

Emphases in Common Core Standards for Mathematical Content, Kindergarten - High School
Content Emphases by Cluster, March 12, 2012. Describes content emphases in the standards at the cluster level for each grade. These...

Section 6-2: Simplex Method: Maximization with Problem Constraints of the Form ≤
Note: This method was developed by George B. Dantzig in 1947 while on assignment to the U.S. Department of the Air Force. Definition: Standard...

Pipetting 101 (Developed by BSU CityLab)
Discover the Microbes Within: The Wolbachia Project. Color Comparisons Pipetting Exercise #1. Student objectives: students will be able to choose the correct size micropipette for...

Statistical Learning. 2.1 What Is Statistical Learning?
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to...

8th Grade Math: Pre-Algebra
Hardin County Middle School (2013-2014). Course description: the purpose of this course is to enhance student understanding, participation, and real-life application of middle-school mathematics...

Correlation and Regression
4th Indian Institute of Astrophysics - PennState Astrostatistics School, July 2013, Vainu Bappu Observatory, Kavalur. Rahul Roy, Indian Statistical Institute, Delhi. Correlation: consider a two...

Principal Components
Suppose we have N measurements on each of p variables X_j, j = 1, ..., p. There are several equivalent approaches to principal components: given X = (X_1, ..., X_p), produce a derived (and small)...
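A minimal R sketch of computing principal components with prcomp(), matching the setup above of N measurements on p variables; the data are simulated for illustration.

# Principal components of N x p data (illustrative data).
set.seed(8)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), nrow = N) %*% matrix(runif(p * p), p, p)  # correlated columns
pc <- prcomp(X, scale. = TRUE)       # center, scale, then rotate to principal axes
summary(pc)                          # proportion of variance explained per component
head(pc$x[, 1:2])                    # scores on the first two derived variables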

Analysis on the Stability of Reservoir Soil Slope Based on Fuzzy Artificial Neural Network
Research Journal of Applied Sciences, Engineering and Technology 5(2): 465-469, 2013. ISSN: 2040-7459; E-ISSN: 2040-7467. Maxwell Scientific Organization, 2013. Submitted: May 08, 2012; Accepted: May 29, 2012; Published:...

22.54 Neutron Interactions and Applications (Spring 2004), Chapter 11 (3/11/04): Neutron Diffusion
References: J. R. Lamarsh, Introduction to Nuclear Reactor Theory (Addison-Wesley, Reading, 1966). To study neutron diffusion...

Lesson Plan
Recode: they will do a graphic organizer to sequence the steps of the scientific method. Reach: ask the students if they have ever popped a bag of microwave popcorn and noticed how many kernels were unpopped at the bottom of the bag, which made you wonder if other brands pop better than the one you are...

Algorithm for Estimating R and R-dot (David Sandwell, SIO, August 4, 2006)
Azimuth compression involves the alignment of successive echoes to be focused on a point target. Let s be the slow time along the satellite track... The Doppler frequency rate f_R can be related to the coefficients of this polynomial. The relationships are:...

Comments on Diffusion, Diffusivity and Derivation of Hyperbolic Equations Describing the Diffusion Phenomena (February 28, 2013)
Mental experiment regarding a 1D random walk: consider a container of gas in thermal...

Codeword Distribution for Frequency Sensitive Competitive Learning with One Dimensional Input Data
Aristides S. Galanopoulos and Stanley C. Ahalt, Department of Electrical Engineering, The Ohio State University. Abstract: ...initially located away from the data set never win the competition, resulting in a nonoptimal final codebook [2], [3], [4] and [5]. Kohonen's Self Organizing Feature...

Section 4: Sequential Circuits (Department of Electrical Engineering, University of Waterloo)
Major topics: types of sequential circuits; flip-flops; analysis of clocked sequential circuits; Moore and Mealy machines; design of clocked sequential circuits; state transition design method...

Reinforcement Learning (CMPSCI 383, Nov 29, 2011)
Today's lecture: review of Chapter 17, Making Complex Decisions; sequential decision problems; the motivation and advantages of reinforcement learning; passive learning...

Multiple Source Multiple Destination Topology Inference using Network Coding
Pegah Sattari, EECS, UC Irvine. Joint work with Athina Markopoulou (UCI) and Christina Fragouli (EPFL, Lausanne). Outline: network tomography, goal...

2004 AP Chemistry Free-Response Questions
6. An electrochemical cell is constructed with an open switch, as shown in the diagram above. A strip of Sn and a strip of an unknown metal, X, are used as electrodes.