SVM-Based Negative Data Mining to Binary Classification


Georgia State University, ScholarWorks @ Georgia State University. Computer Science Dissertations, Department of Computer Science. SVM-Based Negative Data Mining to Binary Classification. Fuhua Jiang. Follow this and additional works at: Part of the Computer Sciences Commons. Recommended Citation: Jiang, Fuhua, "SVM-Based Negative Data Mining to Binary Classification." Dissertation, Georgia State University, 2006. This Dissertation is brought to you for free and open access by the Department of Computer Science at Georgia State University. It has been accepted for inclusion in Computer Science Dissertations by an authorized administrator of Georgia State University. For more information, please contact scholarworks@gsu.edu.

SVM-BASED NEGATIVE DATA MINING TO BINARY CLASSIFICATION by FUHUA JIANG, Under the Direction of A. P. Preethy

ABSTRACT

The properties of a training data set, such as size, distribution and number of attributes, significantly contribute to the generalization error of a learning machine. A not-well-distributed data set is prone to lead to a partially overfitting model. The two approaches proposed in this paper for binary classification enhance the useful data information by mining negative data. First, the error driven compensating hypothesis approach is based on Support Vector Machines with 1+k times learning, where the base learning hypothesis is iteratively compensated k times. This approach produces a new hypothesis on a new data set, in which each label is a transformation of the label from the negative data set, and further produces the child positive and negative data subsets in subsequent iterations. This procedure refines the model created by the base learning algorithm, creating k hypotheses over k iterations. A predicting method is also proposed to trace the relationships between the negative subsets and the testing data set by a vector similarity technique. Second, a statistical negative examples learning approach, based on theoretical analysis, improves the performance of the base learning algorithm (the learner) by creating one or two additional hypotheses (audit and booster) to mine the negative examples output from the learner. The learner employs a regular support vector

machine to classify the main examples and recognize which examples are negative. The audit works on the negative training data created by the learner to predict whether an instance could be negative. The negative examples are strongly imbalanced. Boosting learning (the booster) is applied when the audit does not have enough accuracy to judge the learner correctly. The booster works on the training data subset on which the learner and the audit do not agree. The classifier for testing is the combination of learner, audit, and booster. For a specific instance, the classifier returns the learner's result if the audit acknowledges the learner's result or the learner agrees with the audit's judgment; otherwise it returns the booster's result. The error of the base learning algorithm is proved to decrease from O(·) to O(·).

INDEX WORDS: Data partition, Data preparation, Support vector machines, Multiple passes learning, Vector similarity, Data classification, Bioinformatics, Machine learning

SVM-BASED NEGATIVE DATA MINING TO BINARY CLASSIFICATION by FUHUA JIANG. A Dissertation Submitted in Partial Fulfillment of Requirements for the Degree of Doctor of Philosophy in the College of Arts and Sciences, Georgia State University, 2006

Copyright by Fuhua Jiang 2006

SVM-BASED NEGATIVE DATA MINING TO BINARY CLASSIFICATION by FUHUA JIANG. Major Professor: A. P. Preethy. Committee: Yan-Qing Zhang, Yi Pan, Yichuan Zhao. Electronic Version Approved: Office of Graduate Studies, College of Arts and Sciences, Georgia State University, August 2006

ACKNOWLEDGMENTS

Firstly, my specific thanks go to my advisor Dr. A. P. Preethy and Dr. Yan-Qing Zhang for their kind guidance and precise advisement during the process of my Ph.D. dissertation. The dissertation would not have been possible without their help. Secondly, I would like to thank Dr. Yi Pan and Dr. Yichuan Zhao for their well-appreciated support and assistance. Finally, I want to thank my family and friends for their support and belief.

Table of Contents

ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CHAPTER 1 INTRODUCTION
1.1 Learning Problem Terminology
1.2 Evaluating the Performance of Binary Classification
1.3 Relative Performance Evaluation
1.4 Challenges of Machine Learning
1.5 Negative Data Mining
1.6 Introduction to Negative Data Driven Compensating Hypothesis Approach (NDDCHA)
CHAPTER 2 RELATED WORK
2.1 Boosting and Bagging
2.2 Kernel Methods
2.2.1 Distance in the Feature Space
2.2.2 Polynomial kernel
2.3 Support Vector Machines (SVMs)
2.3.1 The Maximal Margin Classifier
2.3.2 The Soft Margin Optimization
2.3.3 Karush-Kuhn-Tucker condition (KKT)
2.4 k-Nearest Neighbor (KNN) and Knowledge Representation
CHAPTER 3 METHODOLOGY
3.1 Concepts of Negative Data
3.1.1 Introduction of negative and positive data
3.1.2 Separator and partitioner
3.1.3 μ-negative data
3.2 Motivation
3.3 Characteristics of SVMs
3.4 VC Dimension
3.5 Vector Similarity
3.6 Theoretical Analysis on NDDCHA
3.7 The Patterns of Examples Distribution in the Feature Space
3.7.1 Small size or imbalanced training data
3.7.2 Noise, outlier and missing value examples
3.7.3 Compensatable negative examples
3.7.4 Not compensatable negative examples
3.7.5 Imbalanced examples
3.8 Compensating Hypothesis Approach
CHAPTER 4 ERROR DRIVEN COMPENSATING HYPOTHESIS APPROACH
4.1 Negative Data
4.2 Training Phase
4.3 Learning Termination Criteria
4.4 Testing Phase
4.5 Discussion of Vector Similarity in the Feature Space
4.6 Algorithm of NDDCHA
4.6.1 NDDCHA Algorithm
4.7 Simulation
CHAPTER 5 STATISTICAL NEGATIVE EXAMPLES LEARNING APPROACH
5.1 Concept of True Error
5.2 Introduction to Statistical Negative Examples Learning Approach
5.3 Analysis of Two Stages Learning
5.4 Under-Sampling, Over-Sampling and Hybrid Sampling
5.5 Algorithm of Two-stage Learning
5.6 Three-stage Learning of SNELA
5.7 Simulation
CHAPTER 6 CONCLUSION AND FUTURE WORK
6.1 Summary
6.2 Future Work
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1.1 The relationship of label and predicting confidence
Figure 1.2 Model h is a hyper-surface. Instance x1 is well-separated; instance x2 is not well-separated; and instance x3 is misclassified, where their labels y1, y2, y3 > 0.
Figure 1.3 An example of ROC curve for a given hypothesis [12]; the y-axis is sensitivity and the x-axis is 1-specificity. The diagonal line from (0,0) to (1,1) is drawn for a random classifier as a reference.
Figure 1.4 Comparing two classifiers h1 and h2 of the binary classification, the model with high degree is prone to overfitting, where f is the underlying function.
Figure 1.5 The Yin-Yang symbol.
Figure 2.1 h is the hyperplane in the feature space. Points x and z in the input space are mapped into the feature space. w is the normal vector of hyperplane h.
Figure 2.2 Maximal margin, support vectors and noisy examples
Figure 3.1 Well-separated data and not well-separated data are in different areas. The points with solid pattern are misclassified.
Figure 3.2 μ-negative examples are defined in the SVM feature space; they are the points marked with solid pattern.
Figure 3.3 Distribution of target labels and predicting labels on the hepatitis data set [8]
Figure 3.4 Distribution of target labels and predicting labels on the musk data set
Figure 3.5 Linearly separable examples
Figure 3.6 An example of an outlier: the red circle on the right-bottom is an outlier which is far from the other examples.
Figure 3.7 Single side negative examples
Figure 3.8 Patching a testing example in the directly compensatable pattern. Circle points are in class +1; rectangle points are in class -1; the triangle point is a test point.
Figure 3.9 Interweaved positive and negative examples
Figure 3.10 Patching a testing example in the non-directly compensatable pattern. Circle points are in class +1; rectangle points are in class -1; the triangle point is a test point.
Figure 3.11 The negative training example x1 is compensated by h1 in the training phase, but the negative testing example x2 cannot be compensated by h1.
Figure 3.12 Imbalanced examples
Figure 3.13 Under-sampling strategy
Figure 3.14 Architecture of Yan et al. SVM ensembles
Figure 3.15 Compensating hypothesis approach
Figure 4.1 Training phase: S_i^- is the negative data subset, S_i^# is the positive data subset, h_i is the patching model or hyper-surface, and d_i are dividers, for i = 1..k
Figure 4.2 Testing phase.
Figure 4.3 Sigmoid function
Figure 5.1 Scheme of SNELA
Figure 5.2 Under-sampling strategy
Figure 5.3 Over-sampling strategy
Figure 5.4 Hybrid-sampling strategy
Figure 5.5 Possibility of sampling data
Figure 5.6 The scheme of two-stage learning including base and negative learning.
Figure 5.7 The scheme of base learning. S = P ∪ N
Figure 5.8 The scheme of base testing. T_P is the correctly predicted instances and T_N is the incorrectly predicted instances. In this stage T_P and T_N are unknown, where T = T_P ∪ T_N.
Figure 5.9 Construction of compensated training data S1 for h1 using the under-sampling strategy
Figure 5.10 The number of correctly judged examples in the negative learning is TN. The FN examples are correctly classified in base learning but are not judged correctly.
Figure 5.11 r relationship diagram
Figure 5.12 max relationship diagram: the performance is improved in predicting negative examples when the value falls into the area under the curve
Figure 5.13 Over-sampling strategy
Figure 5.14 Under-sampling and over-sampling could be considered as special cases of hybrid-sampling.
Figure 5.15 The distributions D, D1 and D2 in the three-stage learning
Figure 5.16 The parameter μ is determined by moving around the line B to minimize the size of FP + FN

LIST OF TABLES

TABLE 1.1 Confusion matrix
TABLE 4.1 Comparison of three data sets
TABLE 4.2 Simulation on the data set musk
TABLE 4.3 Simulation on the data set Cancer
TABLE 4.4 Simulation on the data set Cement
TABLE 5.1 The possibility of four areas
TABLE 5.2 Overview of negative learning performance

LIST OF ACRONYMS

SVM            Support Vector Machine
NDDCHA         Negative Data Driven Compensating Hypotheses Approach
IBL            Instance Based Learning
KNN            k-Nearest Neighbor
SNELA          Statistical Negative Examples Learning Approach
i.i.d.         Independent Identically Distributed
VC Dimension   Vapnik Chervonenkis Dimension
ROC            Receiver Operating Characteristics
AUC            Area Under Curve
PAC            Probably Approximately Correct

CHAPTER 1 INTRODUCTION

The approach to solving a complex problem without a precise model is to learn functionality from pairs of inputs and outputs of examples. Examples are the classification of protein types based on DNA sequence [1], the regression of the surface roughness of parts in manufacturing, and so forth. In general, the problem of supervised machine learning is to search a space of potential hypotheses H for the hypothesis h(x, α) that will best fit the underlying function f and any prior knowledge as well, where x is the testing vector and α is the parameter vector of the hypothesis [2]. Learning has training and testing phases. Training estimates the parameters α of the hypothesis or model h(x, α). Testing uses the model to predict the labels of testing data. A hypothesis h(x, α) can be abbreviated by h(x) once α is determined. By geometric interpretation, a hypothesis h can be considered as a hyper-surface in the n-dimensional input space, where n > 1. For example, a hypothesis of a fuzzy controller or of the support vector machine (SVM) [2, 3-7] is a hyper-surface, although a hypothesis created by instance-based learning [8, 9] is not. A hypothesis h can also be considered as a hyperplane in the linear feature space of an SVM. A hypothesis learned from training examples does not fit the underlying function perfectly, because the computational errors of approximation and estimation are inevitable, and the training data includes noise and is not well distributed. Not well distributed examples means that these examples do not well represent the whole input

space. Some areas may have more examples and other areas may only have a few examples. Therefore, some examples do not contribute positively to the learned hypothesis. These negative examples can be mined to improve the accuracy of the hypothesis. This chapter describes basic concepts and briefly introduces the main approaches proposed.

1.1 Learning Problem Terminology

There is an instance vector x from an input space X, a response or label y from an output space Y, and a hypothesis h from a hypotheses space H for a learner L. We have

x = (x_1, x_2, ..., x_n) \in X, \quad X \subseteq R^n    (1-1)

where R is the set of real numbers and the integer n > 1 is the size of the vector. Y = {-1, +1} or Y ⊆ R in binary classification, Y = {1, 2, ..., m} in m-class classification, and Y ⊆ R in regression. The learned hypothesis h returns a predicting label y = h(x) of an instance x, a real number. In binary classification, if h returns a confidence value, then y > 0 means x is in the class +1 whereas y < 0 means x is in the class -1. A training data set S is a collection of training examples or observations z_i = (x_i, y_i). It is denoted by

S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}, \quad i = 1..l    (1-2)

where l = |S| is the size of the training set. In this paper the label set of binary classification Y = {-1, +1} is extended to Y ⊆ R; the final output of binary classification is the sign of the label. There exists a true functional relationship or underlying function f: X ⊆ R^n → Y, which is often based on knowledge of the essential mechanism. These types of models are called mechanistic models. A hypothesis h is an approximation to the underlying functional relationship f between the variables of interest. The problem for the

learner L is to learn an unknown target function h: X → Y drawn from H and output a maximum likelihood hypothesis.

1.2 Evaluating the Performance of Binary Classification

In binary classification, examples of class +1 and class -1 are usually said to be positives and negatives respectively. Traditionally, three metrics named accuracy, sensitivity and specificity are used to evaluate the performance of a hypothesis, based on the confusion matrix in Table 1.1:

\text{accuracy} = \frac{TN + TP}{TN + FN + FP + TP}    (1-3)

\text{sensitivity} = \frac{TP}{TP + FN}    (1-4)

\text{specificity} = \frac{TN}{FP + TN}    (1-5)

Sensitivity is the proportion of true positives and specificity is the proportion of true negatives. The predictive value positive and predictive value negative are the evaluated accuracies of the positive and negative predictions respectively:

\text{predictive value positive} = \frac{TP}{TP + FP}    (1-6)

\text{predictive value negative} = \frac{TN}{TN + FN}    (1-7)

The sum of FP and FN is the number of misclassified examples on the unseen testing dataset, whereas the sum of TP and TN is the number of correctly classified examples. Predicted positives consist of true positives (TP) and false positives (FP); predicted negatives consist of false negatives (FN) and true negatives (TN).

TABLE 1.1 Confusion matrix

                     Real positive            Real negative
Test positive        True Positive (TP)       False Positive (FP)       TP + FP
Test negative        False Negative (FN)      True Negative (TN)        FN + TN
                     TP + FN                  FP + TN

The accuracy ρ is usually used as the metric to evaluate whether a model is good or not in binary classification. The training accuracy ρ_t and the testing accuracy ρ_p are used to evaluate the performance of a learning machine. Sometimes a high ρ_t results in a high ρ_p; at other times a high ρ_t results in a low ρ_p, which is called overfitting. The testing accuracy is a measure of the generalization capacity of a model. If there exist two models for the same learning problem with the same training accuracy, how can we determine which model has a higher probability of good performance without predicting? The support vector machine (SVM) uses the maximal margin as the metric. A hypothesis h(x) can be considered as the predicting confidence of an instance x. The relationship between confidence and label is shown in Figure 1.1, an example predicting task on a set of testing instances. Although some instances are labeled as class +1, their confidences are quite different. A prediction with high confidence can be considered true with high probability.
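The five metrics (1-3)-(1-7) follow directly from the four counts in Table 1.1. The sketch below is illustrative (it is not code from the dissertation) and assumes labels coded as +1 and -1 with predictions already thresholded.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute the Table 1.1 counts and the metrics (1-3)-(1-7).

    y_true, y_pred: arrays of +1/-1 labels (predictions already thresholded).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),        # (1-3)
        "sensitivity": tp / (tp + fn),                       # (1-4)
        "specificity": tn / (fp + tn),                       # (1-5)
        "predictive value positive": tp / (tp + fp),         # (1-6)
        "predictive value negative": tn / (tn + fn),         # (1-7)
    }

# Example on ten test instances.
print(binary_metrics([1, 1, 1, 1, -1, -1, -1, -1, 1, -1],
                     [1, 1, -1, 1, -1, -1, 1, -1, 1, -1]))
```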

Figure 1.1 The relationship of label and predicting confidence

A metric, the average residual (AR), to evaluate the model is shown in (1-8), based on generalization theory [5]:

AR = \frac{1}{l}\sum_{i=1}^{l} y_i h(x_i)    (1-8)

where l is the size of the training data set and y_i is the label of input vector x_i. A high AR means high training accuracy. AR could be negative if there are many misclassified examples. The hyper-surface h separates the hyperspace into two sides in binary classification, y ∈ {-1, +1}. In SVM, the predictive value y = h(x) is proportional to the geometric distance from point x to the hyperplane in the feature space; y is the geometric distance if the maximal margin is normalized to 1. Then the new metric is the average distance to the hyper-surface, as shown in Figure 1.2. One side, h(x) > 0, is the class +1 and the other side, h(x) < 0, is the class -1. The hyperplane h is the separator. The hypothesis h becomes a measure of performance in separating vector x.
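A one-function sketch of the AR metric (1-8), assuming labels in {-1, +1} and real-valued confidences h(x_i); it is an illustration rather than the dissertation's own code.

```python
import numpy as np

def average_residual(y, h_values):
    """Average residual AR of equation (1-8): the mean of y_i * h(x_i).

    A large AR means the training examples are, on average, well separated;
    AR can become negative when many examples are misclassified.
    """
    y = np.asarray(y, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    return float(np.mean(y * h_values))

# Four examples; the last one is misclassified (y = -1 but h(x) > 0).
print(average_residual([+1, -1, +1, -1], [0.9, -1.3, 0.2, 0.4]))  # 0.5
```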

Figure 1.2 Model h is a hyper-surface. Instance x1 is well-separated; instance x2 is not well-separated; and instance x3 is misclassified, where their labels y1, y2, y3 > 0.

Furthermore, when the training data are strongly imbalanced, accuracy may mislead, because the all-positive or all-negative classifiers may have a very good accuracy. The Receiver Operating Characteristics (ROC) curve was introduced in signal detection theory to evaluate the capability of a human operator of distinguishing signal and noise [10]. ROC analysis is now being acknowledged as a practical tool to evaluate classifiers on imbalanced data, even when the prior distribution of the classes is not known [11]. The ROC curve is a two-dimensional measure of classification performance. It can be understood as a plot of the probability of correctly classifying the positive examples against the rate of incorrectly classifying negative examples, as shown below. The AUC is defined as the area under an ROC curve. Computing the AUC would require the computation of an integral in the continuous case. The following equation gives the AUC in the discrete case, such as in classification:

AUC(h) = \frac{1}{l_{+} l_{-}} \sum_{i=1}^{l_{+}} \sum_{j=1}^{l_{-}} \mathbf{1}[h(x_i^{+}) > h(x_j^{-})]    (1-9)

Figure 1.3 An example of ROC curve for a given hypothesis [12]; the y-axis is sensitivity and the x-axis is 1-specificity. The diagonal line from (0,0) to (1,1) is drawn for a random classifier as a reference.

where h is the hypothesis, x_i^+ and x_j^- respectively denote the class +1 and class -1 examples, l_+ and l_- are respectively the numbers of class +1 and class -1 examples, and 1[π] is defined to be 1 if the predicate π holds and 0 otherwise. The AUC value is the probability P(Y > Y'), where Y is the random variable corresponding to the distribution of the outputs for the positives and Y' is the one corresponding to the negatives [13]. The average of the AUC increases monotonically with the accuracy of the hypothesis, but the standard deviation for imbalanced distributions grows [14]. Therefore AUC is a better metric than accuracy in the case of imbalanced example distributions. Alain Rakotomamonjy proposed an AUC maximization algorithm and showed that under certain conditions 2-norm soft margin SVMs can maximize AUC [15]. AUC is not used as an optimization objective in this dissertation, but as an evaluation metric. Genuine SVMs assume that misclassification costs are equal for both classes of binary classification, so SVMs are not suitable for detecting a small class in an imbalanced data set.
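Equation (1-9) is a pairwise count and can be computed directly. The sketch below is illustrative, not from the dissertation, and assumes the hypothesis outputs are already split into the class +1 and class -1 groups.

```python
import numpy as np

def auc_discrete(h_pos, h_neg):
    """AUC of equation (1-9): the fraction of (positive, negative) pairs
    that the hypothesis ranks correctly.

    h_pos: confidences h(x) for class +1 examples.
    h_neg: confidences h(x) for class -1 examples.
    """
    h_pos = np.asarray(h_pos, dtype=float)
    h_neg = np.asarray(h_neg, dtype=float)
    # Compare every positive output with every negative output.
    correct_pairs = (h_pos[:, None] > h_neg[None, :]).sum()
    return correct_pairs / (len(h_pos) * len(h_neg))

# A classifier that ranks most positives above most negatives.
print(auc_discrete([0.9, 0.4, 0.7], [-0.8, 0.5, -0.2]))  # 8/9, about 0.89
```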

For example, it is very hard to detect the negatives when the size of the negative class is far smaller than that of the positives. Thus the ROC curve approach is of interest because it reflects both true positive and false positive information. However, if the training examples are separable, any hypothesis in the version space will maximize AUC. In order not to confuse the terminologies of positives and negatives, in the following chapters and sections positives and negatives are called class +1 and class -1 examples respectively, because the labels belong to {-1, +1}. That is very convenient for extending binary classification into multiple classifications. For example, a 4-category classification problem has class 1, 2, 3, 4 examples because the labels belong to {1, 2, 3, 4}.

1.3 Relative Performance Evaluation

Most of the literature evaluates a hypothesis by using an absolute evaluation hypothesis method (AEHM). Examples include accuracy, AUC, and the least squares sum. All AEHMs have to assume the data distribution is i.i.d. These methods are necessary for evaluating generalization capacity on unseen data. When the size of the data set is small, an AEHM is not meaningful, because a small set of examples is not capable of showing the whole picture of the data distribution. Then a relative evaluation hypothesis method (REHM) is introduced. To show how REHMs work, assume a data set D includes l examples and n-fold cross-validation is used. We have n pairs of training and testing data sets. The sizes of the training and testing sets are listed below:

|S_i| = \frac{n-1}{n}|D|, \quad i = 1..n    (1-10)

|T_i| = \frac{1}{n}|D|, \quad i = 1..n    (1-11)

First, the data set D is trained and a hypothesis h_D is obtained; we get the performance value η_D by testing data set D using h_D. The performance value could be accuracy, AUC, etc. Secondly, each data set S_i is trained and a hypothesis h_i is obtained; we get the performance value η_i by testing data set T_i using h_i. Let the average of the η_i be η_S. Then REHM is defined below:

REHM = \frac{\eta_S}{\eta_D}    (1-12)

The performance value η_D is the best value of a given hypothesis in terms of data set D. No other η_i can be better than η_D, because the testing examples are included in the training data in the AEHM. Thereby REHM is a value less than or equal to 1. For separable examples, REHM is exactly the same as AEHM.
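A sketch of the relative evaluation (1-12), not from the dissertation: it assumes a scikit-learn style classifier (an SVC here) and uses accuracy as the performance value η, with η_D obtained by training and testing on all of D and η_S as the average over the n cross-validation pairs.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def rehm(model, X, y, n_folds=5):
    """Relative evaluation hypothesis method, equation (1-12): REHM = eta_S / eta_D."""
    # eta_D: train and test on the whole data set D (the best achievable value).
    eta_D = model.fit(X, y).score(X, y)
    # eta_S: average accuracy over the n cross-validation pairs (S_i, T_i).
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        scores.append(model.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))
    return np.mean(scores) / eta_D

# Toy, nearly separable data set.
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(rehm(SVC(kernel="rbf", C=1.0), X, y))
```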

1.4 Challenges of Machine Learning

The goal of learning is to have high testing or predicting accuracy rather than training accuracy. The underlying function f of a practical problem is unknown and even hard to describe. What can be known are the training data set S and the limited, not full, prior knowledge of the problem. General purpose learning algorithms, such as statistical learning methods, do not even take advantage of domain knowledge. They only consider or assume the distribution of the data, where all data are drawn from this distribution, although such an assumption is not realistic. The underlying function f of the problem is in the target space TS; the model is a hypothesis h from the hypotheses space H. Therefore three types of error are inevitable in the process of machine learning [16]. The first is the approximation error, arising when the number of hypotheses in the hypotheses space is less than that of the target space, |H| < |TS|: the underlying function f may be beyond the hypothesis space. The second is the estimation error of a training algorithm, from selecting a non-optimal model or hypothesis due to the computational technique; for example, the back propagation algorithm cannot produce the optimal solution because of the local minima problem. The last one is the generalization error, arising jointly from the approximation and estimation errors. In addition to those, the properties of the data, such as a small size, dirty, imbalanced or not well-distributed training data set, which means that the training data set does not well reflect the real problem, contribute to the generalization error. Hence the generalization error is the composite error from all aspects. In supervised machine learning the hypotheses space is selected by humans, and the number of types of hypotheses spaces available is limited. The hypotheses space in the artificial neural network (ANN) is the topology of the network and the approximation functions, such as sigmoid functions, in the neurons [17-20]. In the support vector machine (SVM), the hypotheses spaces could be regarded as the kernel functions, such as the polynomial kernel, the radial basis function (RBF) kernel, etc. Therefore the approximation error cannot be reduced once the hypotheses space is chosen. How to choose a suitable hypothesis space depends on a human's prior knowledge in identifying characteristics of a real learning problem, and on the learning accuracy. It is known from the above discussion that the performance of testing, or the capacity of generalization, relies on the shape of the hyper-surface or model. Sometimes the hypotheses space is large enough compared to the target space, yet the model is still prone to overfitting due to not well distributed training data, as shown in Figure 1.4. Not well

distributed data are scattered non-uniformly over the input space. One challenge is whether there exists a general method to compensate a hypothesis and let it fall in the hypothesis space, to reduce the approximation error. Another is how to reduce the estimation error. SVMs are proved to be a global optimization method once the kernel is chosen.

Figure 1.4 Comparing two classifiers h1 and h2 of the binary classification, the model with high degree is prone to overfitting, where f is the underlying function.

1.5 Negative Data Mining

According to traditional Chinese philosophy, Yin and Yang are the two primal cosmic principles of the universe. Yin is the passive, female principle, while Yang is the active, masculine principle. The best state for everything in the universe is a state of harmony represented by a balance of Yin and Yang. True harmony requires Yang to be dominant; it is just a natural phenomenon. As shown in Figure 1.5, the Yin-Yang symbol, when Yin and Yang are in harmony with one another, they are two halves of a circle,

one dark and the other light. The small circle within each half shows that a part of each opposite is always found within the other. They are not really opposites at all. Yin and Yang are interrelated: partial Yin is inside Yang, whereas partial Yang is inside Yin. Yin and Yang should be respected to an equal extent.

Figure 1.5 The Yin-Yang symbol.

The target space could be considered as a universe in the Yin-Yang theory. For a specified hypothesis h ∈ H, all examples in the universe TS are divided into two primal groups, positive and negative data, which match Yang and Yin. The positive data is the subset of all correctly classified examples, while the negative data is the rest. An example could be positive or negative. Negative data does not mean the data is wrong or corrupt; what can be known about negative data is that the hypothesis cannot make it well-separated. Negative data strongly depends on the hypothesis. Whether an example is positive or negative is relative. For a specific example, hypothesis A may classify it to be negative while hypothesis B may classify it to be positive. Furthermore, even for the same hypothesis, an example may belong to positive or negative in terms of the different parameters α of a hypothesis h(x, α). The Yin-Yang theory claims that Yin and Yang together form a universe; Yin and Yang are opposite groups and each can always be found within the other. This is the foundation philosophy of negative data mining. The Yin-Yang theory indicates that negative data contains positive information. The more information, the

higher the accuracy of the learning machine that can be obtained. By mining negative data, the accuracy of machine learning will be enhanced [21]. There are two ways of improving the performance of classification. One improves the learning algorithm or method to reduce the approximation and estimation errors, by choosing a suitable learning algorithm or inventing a new one. Here we only consider the other way: mining the training data to increase the accuracy of the hypothesis, such as boosting and bagging [22-25].

1.6 Introduction to the Negative Data Driven Compensating Hypothesis Approach (NDDCHA)

For a specific model of a specific learning problem, there are several ways to improve the model or hypothesis if misclassified examples exist. The first way makes the hypothesis space larger than the target space. The second is to reduce the estimation error. The third is to make the size of the example set large. The last is training data mining, in which a sequence of learning algorithms takes advantage of the distribution of data to create a combination of algorithms. A good example is the SVM ensemble powered by the bagging and boosting approach [22-25]. In bagging, each base learning algorithm is trained independently using randomly chosen training examples via a bootstrap technique. In boosting, the base learning algorithm is trained using training examples chosen according to the examples' distribution. The boosting approach calls a weak base learning algorithm more than one time. The base learning algorithm could be any algorithm, such as ANN or SVM. Each time the weak algorithm is fed with a different subset of the training examples

and generates a new weak prediction rule. After many rounds, the boosting algorithm combines these weak rules into a single prediction rule to produce a more accurate rule. The problem of how each distribution should be chosen in each round, and how the weak rules should be combined into a single rule, is addressed by maintaining a set of weights over the training set. Therefore the boosting approach combines a series of hyper-surfaces into a single hyper-surface, where each hyper-surface is independent. Chang [26] proposed a boosting SVM classifier with logistic regression for imbalanced training data by using a clustering technique. Kim et al. [27] proposed bagging and boosting SVM approaches and tested majority voting, least squares estimation based weighting, and double-layer hierarchical combination aggregating methods. Vapnik in his book [5] gave a detailed explanation of how to use an SVM ensemble powered by Schapire's AdaBoost algorithm [25]. The main drawbacks of bagging and boosting are that they are time consuming and that their performance largely depends on the probability distribution of the training data and on the aggregation methods. Boosting could be considered a negative mining algorithm which emphasizes learning on misclassified data, or negative data. The negative data driven compensating hypothesis approach (NDDCHA), driven by the negative data information, is proposed in this paper. This approach looks similar to the SVM ensemble, a learning technique where multiple SVMs are trained to solve the same problem [25, 28-30]. The SVM ensemble generates a sequence of SVMs by using bagging or boosting approaches and then combines their predictions. The difference is that the ensemble approach combines the results of SVMs and each SVM is independent, while NDDCHA compensates the labels of the base SVM by a sequence of patching SVMs, making the training examples well separated by using the AR metric

(1-8). The NDDCHA works on the negative data, and the size of the negative training data is reduced in each pass; therefore it converges quickly, at an exponential rate. In our practice the number of passes is not greater than 3. The main idea of the approach proposed here is to maximize y_i h(x_i) for every example in the training phase by using a series of hypotheses h_i, i = 1...k, whereas in the predicting phase the testing data finds the appropriate h_i by using the vector similarity technique to predict the example. The approach improves the capacity of generalization and reduces the approximation error by extending traditional learning methods like SVMs in two aspects. The first is to compensate the hypothesis by making use of the examples from the training data S with high training error, due to |H| < |TS|, and of the not well-separated examples. The second is data cleaning and data enhancing by utilizing the negative data, which has high predicting error or is not well-separated, in the phases of training and testing. The rest of this paper is organized as follows. In Chapter 2, the related work including boosting, the k-nearest neighbor algorithm (KNN) and SVM is introduced; the concepts of NDDCHA and principles of generalization theory are also introduced. In Chapter 3, the concept of negative examples is introduced. In Chapter 4, the algorithm of NDDCHA is studied in detail. In Chapter 5, the statistical negative example learning is studied. Finally, in Chapter 6, the main contribution of this paper is summarized.

CHAPTER 2 RELATED WORK

There are two general strategies for improving an algorithm. One is modification of the algorithm structure and the other is modification of the training data. The first includes changing the objective function of the optimization, for example the approach of support vector machine to decision tree [31]. The second includes bagging and boosting. The approaches in this dissertation focus on the second strategy. The related works are briefly introduced in this chapter, including boosting and bagging, locally weighted regression, kernels and support vector machines.

2.1 Boosting and Bagging

Breiman's bagging [24] and Freund and Schapire's boosting [23, 25, 32, 33] both form a set of classifiers, or hypotheses, that are combined by voting. Bagging generates replicated bootstrap examples of the data, and boosting adjusts the weights of the training examples. The two approaches are based on theoretical analyses of the behavior of the composite classifiers. Bagging can be applied in situations where a small perturbation of the training data set will result in significant changes in the classifier built. Boosting strengthens the base or weak learning algorithm. Boosting, based on PAC learning [34], causes the learner to focus on those misclassified examples, and then it generates new classifiers by adjusting the weights of the examples. A high weight on an example indicates a high influence on the classifier

constructed. Boosting learns the examples many times. In each round of learning, the weight of the examples is adjusted to reflect the accuracy of the classifier built on the previous iteration. Obviously, a misclassified example will be assigned a high weight on the next iteration. In the testing phase, multiple classifiers are combined by a majority voting strategy to form a composite classifier. Boosting uses different voting strengths in terms of the accuracy of the component classifiers in the training phase. How to determine the weight of the examples is the key point of boosting. One implementation of boosting is AdaBoost [33], shown in the following:

Given: a training data set S defined as in (1-2).
Initialize the weight distribution: D_1(i) = 1/l.
For t = 1 to T:
  Train the weak learner using distribution D_t.
  Get a weak hypothesis h_t: X → {-1, +1} with error \epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i].
  Choose a_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
  Update: D_{t+1}(i) = \frac{D_t(i)\exp(-a_t y_i h_t(x_i))}{Z_t}, which equals \frac{D_t(i)}{Z_t}e^{-a_t} if h_t(x_i) = y_i and \frac{D_t(i)}{Z_t}e^{a_t} if h_t(x_i) \neq y_i, where Z_t is a normalization factor.
Output the composite classifier or hypothesis: H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} a_t h_t(x)\right).

Determining the number of rounds T is the stop criterion, which uses two conditions: (1) \epsilon_t \geq 0.5, or (2) h_t correctly classifies all examples.
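The following is a minimal runnable sketch of the AdaBoost loop above. It is not the dissertation's implementation; it assumes a depth-1 decision tree (a stump) from scikit-learn as the weak learner and uses the two stop criteria just described.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20):
    """Train AdaBoost as in the pseudocode above; y must be coded +1/-1.

    Returns the list of (a_t, h_t) pairs making up the composite classifier.
    """
    l = len(y)
    D = np.full(l, 1.0 / l)            # initial weight distribution D_1(i) = 1/l
    ensemble = []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.sum(D[pred != y])      # weighted error under D_t
        if eps >= 0.5 or eps == 0.0:    # stop criteria: too weak, or perfect
            if eps == 0.0:
                ensemble.append((1.0, stump))
            break
        a = 0.5 * np.log((1.0 - eps) / eps)
        D = D * np.exp(-a * y * pred)   # raise weight of misclassified examples
        D /= D.sum()                    # Z_t normalization
        ensemble.append((a, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Composite classifier H(x) = sign(sum_t a_t h_t(x))."""
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)

# Toy usage on a non-linear problem.
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)
model = adaboost_train(X, y, T=30)
print("training accuracy:", np.mean(adaboost_predict(model, X) == y))
```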

Other boosting implementations include BrownBoost [35] and LogitBoost [36]. If, in the extreme condition, weight zero were assigned to the correctly classified examples, then the next iteration of learning would only use the negative data. The side effect in this case is that the next generation of training data may be imbalanced. The assumption of bagging and boosting is that a small change of examples on a given distribution will cause significant changes in the classifier built. As long as the accuracy of every component classifier is greater than 50%, Freund and Schapire proved that the accuracy of the composite classifier on the given training data set increases exponentially quickly as the number of iterations increases. However, the composite classifier cannot guarantee the generalization performance, and boosting also produces severe degradation on some datasets [37]. Most existing boosting algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Lin et al. proposed that infinite ensembles may surpass finite and/or sparse ensembles in learning performance and robustness [38]. Bagging and boosting require that the learning system should not be stable, so that small changes to the training examples produce considerable changes in the hypothesis [37, 39].

2.2 Kernel Methods

Kernel methods [40] provide an alternative solution to non-linear systems by projecting the data into a high dimensional feature space where the data can be handled by a linear system. Successful applications of kernel based algorithms have been found in different areas, for example pattern recognition [41, 42], time series prediction [43],

text categorization [44], gene expression profile analysis [45], DNA and protein analysis [46], etc. Suppose a vector x in the input space X projects into φ(x) in the feature space F:

x = (x_1, ..., x_n) \mapsto \varphi(x) = (\varphi_1(x), \varphi_2(x), ..., \varphi_m(x)), \quad x \in X \subseteq R^n, \quad F = \{\varphi(x) : x \in X\}    (2-1)

The n-dimensional vector x has n coordinates in the input space; these coordinates are called attributes. The coordinates in the feature space are called features. If m < n, this is known as dimensionality reduction. If m >> n, this is known as the curse of dimensionality. Using too large a number of features may lead to the overfitting problem [2]; in the meantime, the large number of features increases the computational cost. A kernel is a function K such that for all x, z ∈ X,

K(x, z) = \langle \varphi(x), \varphi(z) \rangle    (2-2)

where the function φ is a non-linear mapping from X to an inner product feature space F. A kernel function calculates an inner product, which expresses a degree of similarity of two vectors. A kernel function is symmetric, but not all symmetric functions over X × X are kernels: a kernel function has to be positive definite according to Mercer's theorem [2]. Much research has extended kernel functions in practical applications. Ong et al. proposed methods to learn non-positive kernels [47], which have been promising in empirical applications. The kernel function plays a key role in determining the performance of an SVM. S. Amari and S. Wu proposed a method of modifying a kernel function to improve the performance of SVM based on the Riemannian geometrical structure [48]. Srivastava et al. proposed a method of mixture density Mercer kernels,

which learn kernels directly from data [49]. The following are some of the most commonly used nonlinear kernel functions:

Polynomial kernel:  K(x, z) = (a\, x \cdot z + c)^d    (2-3)

Radial basis kernel:  K(x, z) = \exp(-\gamma \|x - z\|^2)    (2-4)

Sigmoid kernel:  K(x, z) = \tanh(\gamma\, x \cdot z + c)    (2-5)

2.2.1 Distance in the Feature Space

A kernel expresses domain knowledge about the pattern being constructed, encoded as a similarity metric between two vectors [50, 51]. Let K be a kernel over X × X; then the distance d(x, z) between two vectors x and z in the feature space is defined as [52]:

d(x, z) = \|\varphi(x) - \varphi(z)\| = \sqrt{K(x, x) - 2K(x, z) + K(z, z)}    (2-6)

The radial basis kernel (2-4) has a close relation between kernel and distance; ‖x − z‖ can be substituted by any metric that calculates the distance between x and z. The angle θ between two vectors x and z in the feature space satisfies

\cos\theta = \frac{\langle \varphi(x), \varphi(z) \rangle}{\|\varphi(x)\|\,\|\varphi(z)\|} = \frac{K(x, z)}{\sqrt{K(x, x)\,K(z, z)}}, \quad \theta \in [0, \pi]    (2-7)

Suppose θ1 is the angle between vectors x1 and z, and θ2 is the angle between vectors x2 and z. The following equation can be obtained:

\eta = \frac{\cos\theta_1}{\cos\theta_2} = \frac{K(x_1, z)\sqrt{K(x_2, x_2)}}{K(x_2, z)\sqrt{K(x_1, x_1)}}    (2-8)
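As an illustration (not from the dissertation), the kernels (2-3)-(2-5) and the induced distance (2-6) and angle (2-7) can be computed without ever forming φ(x) explicitly; the parameter values below are arbitrary choices.

```python
import numpy as np

# The three common kernels of (2-3)-(2-5); a, c, d and gamma are user-chosen parameters.
def poly_kernel(x, z, a=1.0, c=1.0, d=3):
    return (a * np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.dot(x - z, x - z))

def sigmoid_kernel(x, z, gamma=0.1, c=0.0):
    return np.tanh(gamma * np.dot(x, z) + c)

def feature_distance(K, x, z):
    """Distance (2-6) between phi(x) and phi(z), computed from the kernel alone."""
    return np.sqrt(K(x, x) - 2.0 * K(x, z) + K(z, z))

def feature_cosine(K, x, z):
    """cos(theta) of (2-7): the similarity of phi(x) and phi(z) in the feature space."""
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
for K in (poly_kernel, rbf_kernel, sigmoid_kernel):
    print(K.__name__, feature_distance(K, x, z), feature_cosine(K, x, z))
```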

Theorem 2.1. Given that the feature space defined by the kernel function K(x, z) = ⟨φ(x), φ(z)⟩ is a linear space, the hypothesis h in the feature space is linear and can be expressed as h(x) = ⟨φ(x), w⟩ + b, where w is a constant vector defining the hyperplane and b is a bias. Suppose the vector x1 is closer to vector z than vector x2 is, i.e. ‖φ(x1) − φ(z)‖ ≤ ‖φ(x2) − φ(z)‖, and cosθ1 ≤ cosθ2 with θ1, θ2 ∈ [0, π/2], defined analogously to equation (2-7): θ1 is the angle between φ(x1) − φ(z) and w, and θ2 is the angle between φ(x2) − φ(z) and w, as shown below.

Figure 2.1 h is the hyperplane in the feature space. Points x and z in the input space are mapped into the feature space. w is the normal vector of hyperplane h.

Then:

|h(x_1) - h(z)| \leq |h(x_2) - h(z)|    (2-9)

Proof: Since ‖φ(x1) − φ(z)‖ ≤ ‖φ(x2) − φ(z)‖, multiplying both sides by ‖w‖cosθ1cosθ2 ≥ 0 gives

\|\varphi(x_1) - \varphi(z)\|\,\|w\|\cos\theta_1\cos\theta_2 \leq \|\varphi(x_2) - \varphi(z)\|\,\|w\|\cos\theta_1\cos\theta_2

and hence, with cosθ1 ≤ cosθ2,

\|\varphi(x_1) - \varphi(z)\|\,\|w\|\cos\theta_1 \leq \|\varphi(x_2) - \varphi(z)\|\,\|w\|\cos\theta_2.

Since the inner product is distributive, ⟨φ(x1) − φ(z), w⟩ = ‖φ(x1) − φ(z)‖‖w‖cosθ1 = h(x1) − h(z), and ⟨φ(x2) − φ(z), w⟩ = ‖φ(x2) − φ(z)‖‖w‖cosθ2 = h(x2) − h(z). Therefore |h(x1) − h(z)| ≤ |h(x2) − h(z)|, and the theorem is proved.

Theorem 2.1 shows that predicted labels are more similar if two vectors in the feature space are closer and the angle between the vectors is smaller. It also shows that the predicted label depends on the hyperplane, because the angle is relative to w. Suppose that vector z is an instance from the testing data set and examples (x1, y1) and (x2, y2) are from the training data set. If d(x1, z) ≤ d(x2, z) in equation (2-6) and η ≥ 1 in equation (2-8), the conclusion is that the value of the predicted label h(z) of instance z is closer to y1 than to y2. If θ1 = π/2, the vector φ(x1) − φ(z) is parallel to the hyperplane and, no matter what θ2 is, h(x1) is no further from h(z) than h(x2) is. Theorem 2.1 is the foundation of vector similarity in the feature space.

2.2.2 Polynomial kernel

The polynomial kernel is defined as:

K(x, z) = (x \cdot z + c)^d, \quad c \geq 0, \ d > 0    (2-10)

For example, take d = 2 and vectors x, z with n dimensions:

K(x, z) = \left(\sum_{i=1}^{n} x_i z_i + c\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j z_i z_j + 2c\sum_{i=1}^{n} x_i z_i + c^2    (2-11)

Therefore the total number of features of the degree-2 polynomial kernel is (n+1)(n+2)/2. In general, the degree-d polynomial kernel has \binom{n+d}{d} features.
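A short check of (2-11), not from the dissertation: the explicit degree-2 feature map below is the redundant one with n^2 + n + 1 coordinates, and its inner product reproduces the kernel value; the count of distinct monomial features is the binomial coefficient given above.

```python
import numpy as np
from math import comb

def poly2_feature_map(x, c=1.0):
    """Explicit (redundant) feature map for the degree-2 polynomial kernel (2-11):
    phi(x) = (x_i * x_j for all i, j,  sqrt(2c) * x_i for all i,  c)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2.0 * c) * x, [c]])

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, -1.0, 3.0])
c = 1.0
kernel_value = (np.dot(x, z) + c) ** 2
feature_value = np.dot(poly2_feature_map(x, c), poly2_feature_map(z, c))
print(kernel_value, feature_value)                # the two values agree (12.25)
n, d = len(x), 2
print("distinct features:", comb(n + d, d))       # (n+d choose d) = 10
```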

2.3 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) [ ] form a learning algorithm for classification, regression and density estimation. SVM has been used successfully in many areas including bioinformatics [56]. For example, the SVM can be used to learn polynomial, radial basis function (RBF) and multi-layer perceptron (MLP) classifiers. SVMs are based on the structural risk minimization (SRM) principle, which incorporates capacity control to reduce overfitting.

2.3.1 The Maximal Margin Classifier

The basic SVM is a linear classifier that separates the training data S into two classes. The separator, or hyperplane, is w^T x + b = 0, where w is the weight vector and b is the bias term. Suppose the training data are separable. The optimal hyperplane satisfies the following conditions, which maximize the margin of the separator, i.e. the width of separation between the classes:

Minimize   \Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}
subject to   y_i(\mathbf{w}^T x_i + b) \geq 1, \quad i = 1, ..., l    (2-12)

The risk functional is Φ(w). Lagrange multipliers λ_i ≥ 0, i = 1...l, are introduced, one for each constraint in (2-12). The following Lagrangian is obtained:

L(\mathbf{w}, b, \Lambda) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{l}\lambda_i\left[y_i(\mathbf{w}^T x_i + b) - 1\right]    (2-13)

where Λ = (λ_1, ..., λ_l)^T are the Lagrange multipliers, one for each example. The task is to minimize (2-13) with respect to w and b and maximize it with respect to Λ. Differentiating with respect to w and b and setting the derivatives equal to 0 gives the optimal point:

\frac{\partial L(\mathbf{w}, b, \Lambda)}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l}\lambda_i y_i x_i = 0    (2-14)

\frac{\partial L(\mathbf{w}, b, \Lambda)}{\partial b} = \sum_{i=1}^{l}\lambda_i y_i = 0    (2-15)

The optimal w* is

\mathbf{w}^* = \sum_{i=1}^{l}\lambda_i^* y_i x_i, \quad \text{subject to} \quad \sum_{i=1}^{l}\lambda_i^* y_i = 0    (2-16)

Substituting (2-16) into (2-13):

F(\Lambda) = \sum_{i=1}^{l}\lambda_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\lambda_i\lambda_j y_i y_j x_i^T x_j    (2-17)

We obtain the dual problem of the primal problem (2-12):

Maximize   F(\Lambda) = \Lambda^T I - \frac{1}{2}\Lambda^T D \Lambda
subject to   \Lambda \geq 0, \quad \Lambda^T \mathbf{y} = 0    (2-18)

where y = (y_1, ..., y_l)^T, I is an identity column vector, and D is a symmetric l × l matrix with elements D_ij = y_i y_j x_i^T x_j. By solving the above convex quadratic programming (QP) problem, we get the classifier

f(x) = \mathrm{sign}(h(x)) = \mathrm{sign}(\mathbf{w}^{*T} x + b^*), \quad \mathbf{w}^* = \sum_{i=1}^{l} y_i\lambda_i^* x_i, \quad b^* = y_i - \mathbf{w}^{*T} x_i    (2-19)

where λ_i^* ≥ 0. If λ_i^* ≠ 0, the i-th vector is a support vector. For a nonlinearly separable problem, an n-dimensional input vector x is projected into a high m-dimensional space (n ≤ m) using a nonlinear function φ: R^n → R^m, and then the output is linear. Then (2-19) becomes

f(x) = \mathrm{sign}(h(x)) = \mathrm{sign}(\varphi(x)^T\mathbf{w}^* + b^*) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i\lambda_i^*\varphi(x_i)^T\varphi(x) + b^*\right)    (2-20)

The function (2-20) is in the form of inner products φ(x)^T φ(z), which can be represented by a kernel function K(x, z) = ⟨φ(x), φ(z)⟩ = φ(x)^T φ(z). Using kernel functions makes it unnecessary to find the mapping function explicitly. Therefore the kernel function is a way to construct a non-linear hyper-surface. For instance, if a polynomial kernel is used, then the hypothesis h can be represented by a continuous hyper-surface with a polynomial h(x) in the input space, whereas h is also a hyperplane in the high-dimensional feature space through the mapping function φ.

The hyperplane h is determined by the training examples x_i, i = 1...l. If the training examples are changed, the shape of the hyper-surface will also be changed; the hyper-surface is usually sensitive to the vectors. In the SVM, the hyperplane is determined by the support vectors, which are only a partial subset of the training set. Only the support vectors affect the hyperplane. Overfitting is much more serious if the number of support vectors is close to the size of the training data set. By using a kernel function, the decision function (2-19) becomes

f(x) = \mathrm{sign}(h(x)) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i\lambda_i^* K(x_i, x) + b^*\right)    (2-21)

where the bias is given by

b^* = y_i - \mathbf{w}^{*T}\varphi(x_i) = y_i - \sum_{j=1}^{l} y_j\lambda_j^* K(x_j, x_i)    (2-22)

for any support vector x_i.

2.3.2 The Soft Margin Optimization

SVM introduces a vector of slack variables Ξ = (ξ_1, ξ_2, ..., ξ_l)^T when the hypothesis is found to be inconsistent with any single example, as shown in Figure 2.2. It does not completely eliminate a hypothesis if an inconsistent example is found.

Minimize   \Phi(\mathbf{w}, b, \Xi) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{l}\xi_i^k
subject to   y_i(\mathbf{w}^T\varphi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, ..., l    (2-23)

where C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error term, and k is an integer, k ≥ 1. If C is too small, insufficient stress will be placed on fitting the training data. If C is too large, the algorithm leads to overfitting. The slack variable ξ_i^k is related to noise sensitivity. The optimization hypothesis h of learning task (2-23) has a similar form to equation (2-19).

The Lagrangian is

L(\mathbf{w}, b, \Xi, \Lambda, \Gamma) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\lambda_i\left[y_i(\mathbf{w}^T\varphi(x_i) + b) - 1 + \xi_i\right] - \sum_{i=1}^{l}\gamma_i\xi_i    (2-24)

where Λ = (λ_1, ..., λ_l)^T as before, and Γ = (γ_1, ..., γ_l)^T are the Lagrange multipliers corresponding to the positivity of the slack variables. Differentiating with respect to w, b and Ξ and setting the results equal to zero, we obtain:

\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l}\lambda_i y_i\varphi(x_i) = 0, \quad \frac{\partial L}{\partial b} = \sum_{i=1}^{l}\lambda_i y_i = 0, \quad \frac{\partial L}{\partial \Xi} = C\mathbf{1} - \Lambda - \Gamma = 0    (2-25)

The dual problem of the soft margin is:

Maximize   F(\Lambda) = \Lambda^T I - \frac{1}{2}\Lambda^T D \Lambda
subject to   0 \leq \Lambda \leq C, \quad \Lambda^T \mathbf{y} = 0    (2-26)

where y = (y_1, ..., y_l)^T, I is an identity column vector, and D is a symmetric l × l matrix with elements D_ij = y_i y_j K(x_i, x_j). The decision function implemented is exactly as before in (2-21). The bias term b* is given by (2-22), where the support vector x_i is one for which 0 < λ_i < C. Hsu et al.'s paper [57] gives a general guide to choosing a good value of C. The paper [58] also discussed how to tune the parameter automatically. An alternative algorithm was presented to get the maximum margin between training examples and the decision boundary [59]. For comparing different SVMs, C is not easy to use, so the v-Support Vector Machine (v-SVM) was introduced in [60]. In v-SVM, C is replaced by v, limited to the interval (0, 1]. The parameter v is asymptotically an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.

2.3.3 Karush-Kuhn-Tucker condition (KKT)

The Karush-Kuhn-Tucker (KKT) conditions are [2]:

\lambda_i\left[y_i(\mathbf{w}^T\varphi(x_i) + b) - 1 + \xi_i\right] = 0, \quad (C - \lambda_i)\xi_i = 0    (2-27)

The KKT conditions imply that non-zero slack variables ξ_i can occur only when λ_i = C. Then the following can be obtained [61]:

\lambda_i = 0 \Rightarrow y_i h(x_i) \geq 1 \text{ and } \xi_i = 0
0 < \lambda_i < C \Rightarrow y_i h(x_i) = 1 \text{ and } \xi_i = 0
\lambda_i = C \Rightarrow y_i h(x_i) \leq 1 \text{ and } \xi_i \geq 0    (2-28)

If y_i h(x_i) > 1, then x_i is correctly classified and well separated; otherwise x_i is a support vector. If y_i h(x_i) < 0, then x_i is misclassified. If 0 < y_i h(x_i) < 1, then x_i is correctly classified but its confidence is small.

Figure 2.2 Maximal margin, support vectors and noisy examples
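The KKT cases in (2-28) can be inspected on a trained soft-margin SVM. The sketch below is illustrative rather than the dissertation's code; it assumes scikit-learn's SVC, whose dual_coef_ attribute stores y_i * lambda_i for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with some overlap, labels in {-1, +1}.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

C = 1.0
clf = SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y)

# lambda_i for the support vectors: |dual_coef_| (dual_coef_ stores y_i * lambda_i).
lam = np.abs(clf.dual_coef_).ravel()
margins = y[clf.support_] * clf.decision_function(X[clf.support_])

# The three KKT cases of (2-28), restricted to the support vectors.
on_margin = np.sum((lam > 1e-8) & (lam < C - 1e-8))   # 0 < lambda < C  ->  y*h(x) = 1
bounded = np.sum(lam >= C - 1e-8)                      # lambda = C      ->  y*h(x) <= 1
print("support vectors:", len(clf.support_))
print("on the margin:", on_margin, " bounded (inside margin or misclassified):", bounded)
print("support vectors with y*h(x) < 1:", np.sum(margins < 1 - 1e-6))
```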

The geometrical interpretation of support vector classification is that the SVM searches for the optimal separating surface in the hypotheses space. This optimal separating hyperplane has many nice statistical properties. The value h(x)/‖w‖ is the geometric distance between vector x and the hyperplane. According to the KKT condition (2-27), for all y_i h(x_i) ≤ 1 we get λ_i = C and

y_i(\mathbf{w}^T\varphi(x_i) + b) = 1 - \xi_i    (2-29)

y_i h(x_i) = 1 - \xi_i    (2-30)

h(x_i) is the predicted label of instance x_i. Maximizing the AR metric is exactly minimizing the 1-norm empirical error in the soft margin SVM, if y_i h(x_i) is set to 1 whenever y_i h(x_i) > 1. The reason is below:

AR = \frac{1}{l}\sum_{i=1}^{l} y_i h(x_i) = \frac{1}{l}\sum_{i=1}^{l}(1 - \xi_i) = 1 - \frac{1}{l}\sum_{i=1}^{l}\xi_i    (2-31)

In an SVM application, one only needs to determine the kernel function and the regularization parameter C to control the trade-off between margin and empirical error. This characteristic is convenient because the user needs to decide fewer parameters. However, this is also a drawback, because it provides less control in a complex application. NDDCHA gives more control than the standard SVM, just as boosting provides more control than the base learning algorithm.

2.4 k-Nearest Neighbor (KNN) and Knowledge Representation

Instance-based approaches can construct a different hypothesis for each distinct testing vector, where the hypothesis is created from a subset of the training data. Aha, Kibler and Albert described three experiments in instance-based learning (IBL) [62].

In the first experiment, IB1, to learn a concept or knowledge, the program simply stored every example. When an unclassified example was presented to be classified, it used a simple Euclidean distance measure as the vector similarity method to determine the nearest neighbor of the object, and the class given to it was the class of that neighbor. This scheme has the capability to tolerate some degree of noise in the data. The disadvantage is that it requires a large amount of storage memory. IB1 is actually an instance (k = 1) of the k-nearest neighbor method under the condition that all possible examples are known. In the second experiment, IB2, the performance of IB1 was extended and the storage was reduced by classifying new examples: examples correctly classified were ignored, and only incorrectly classified examples were stored to become part of the concept. The knowledge of the correctly classified examples, the positive data, is included in the classifier or hypothesis. This scheme used less memory and was less noise tolerant than IB1. The third experiment, IB3, used the scheme of IB2 and maintained a record of the number of correct and incorrect classification attempts for each saved example. This record summarizes an example's classification performance. IB3 uses a significance test to determine which examples are good classifiers and which ones are believed to be noisy; the latter are discarded from the concept description. This method strengthens noise tolerance while keeping storage requirements down. In IBL it is a naive approach to only store and search those incorrectly classified examples, the negative data. The Euclidean distance between two vectors could be a metric of vector similarity. This technique is widely used in instance-based learning [63, 64], such as k-nearest neighbor and locally weighted regression [2]. However, the distance is calculated in the Euclidean space, which is not suitable for the feature space defined by a kernel function.

The kernel function also describes a similarity of vectors, for example the basic kernel function K(x, z) = x · z = ‖x‖‖z‖cosθ. Therefore the kernel function is a probable way to tell the similarity of vectors in the feature space, because the feature space is defined by kernel methods. When the distance metric is applied, the distance between two vectors is calculated based on all attributes of the vector. This metric may lead to performance degradation if irrelevant attributes are present, which is a type of the curse of dimensionality. As the number of attributes increases, the computational cost grows and the generalization capacity can be degraded, a phenomenon that also belongs to the category of the curse of dimensionality. The binary classification k-nearest neighbor algorithm [2, 65] is the following:

Training algorithm: for each training example (x_i, y_i), i = 1..l, add the example to the training set S.

Classification algorithm: given a query instance x_t to be classified, let x_1, x_2, ..., x_k denote the k instances from S that are nearest to x_t, and y_1, y_2, ..., y_k the labels of these instances. The hypothesis returns:

h(x_t) = \arg\max_{v \in \{-1, +1\}} \sum_{i=1}^{k} \delta(v, y_i)    (2-32)

where δ(a, b) is an indicator function: δ = 1 if a = b and δ = 0 otherwise. When the target function is a continuous real value, the k-nearest neighbor algorithm is the same as the binary classification above, except that equation (2-32) is replaced by

h(x_t) = \frac{1}{k}\sum_{i=1}^{k} y_i    (2-33)

The distance-weighted nearest neighbor algorithm replaces equation (2-32) by

h(x_t) = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}    (2-34)

where the weight w_i is the reciprocal of the squared Euclidean distance between x_t and x_i:

w_i = \frac{1}{(x_t - x_i)^T(x_t - x_i)}

The performance of the KNN depends on a locally constant posterior probability assumption. This assumption, however, becomes problematic in high dimensional spaces due to the curse of dimensionality and the noise in the data. Wang et al. proposed an adaptive nearest neighbor algorithm for classification by considering the size of the influence sphere and the confidence level of an example instead of the Euclidean distance [66]. KNN could be considered a representation of knowledge. KNN provides a method of vector similarity which could be used in the feature space defined by a kernel function.
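A small self-contained sketch of the two prediction rules above, assuming binary labels in {-1, +1}; it is illustrative, not code from the dissertation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_t, k=3, weighted=False):
    """k-nearest neighbor prediction for a single query x_t.

    Majority vote as in (2-32); with weighted=True, the distance-weighted
    rule (2-34) with w_i = 1 / ||x_t - x_i||^2 is used instead.
    """
    d2 = np.sum((X_train - x_t) ** 2, axis=1)      # squared Euclidean distances
    nearest = np.argsort(d2)[:k]
    if not weighted:
        votes = y_train[nearest]
        return 1 if np.sum(votes == 1) >= np.sum(votes == -1) else -1
    w = 1.0 / np.maximum(d2[nearest], 1e-12)       # guard against a zero distance
    return int(np.sign(np.sum(w * y_train[nearest])))

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
y_train = np.array([-1, -1, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))                  # -1 region
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3, weighted=True))  # +1 region
```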

CHAPTER 3 METHODOLOGY

The computational cost of a learning method, together with accuracy and intelligibility issues, are the concerns in empirical machine learning. With the rapidly increasing computational capability of computers, the performance/cost ratio has not been emphasized in most applications, especially in batch machine learning, which is trained off-line. Accuracy is the main concern in all applications of learning, and it is relatively easy to measure compared to intelligibility. The approach proposed here mainly focuses on accuracy, by mining negative data.

3.1 Concepts of Negative Data

3.1.1 Introduction of negative and positive data

The training data set is partitioned into two disjoint subsets, misclassified and correctly classified examples, in terms of a hypothesis h. The misclassified examples are negative data. The correctly classified examples are positive data. Positive data consists of not well-separated and well-separated data. Not well-separated data has small confidence, say its confidence h(x) < μ, where the threshold μ is an arbitrary number greater than zero used to claim that an example is positive, while well-separated data has high confidence to be positive. Negative and not well-separated examples together are called μ-extended negative data, whereas the well-separated examples are called μ-shrunk positive data. Negative data and extended negative data are used interchangeably in

contexts where no confusion arises. For the same reason, positive data and shrunk positive data are also interchangeable. Positive or negative data are exactly 0-shrunk positive or 0-extended negative data respectively. To ground our discussion of the concepts above, consider an example learning task which has 5 training examples and a hypothesis h, S = {(x1, y1), ..., (x5, y5)}. Given the predicted values h(x1), ..., h(x5) and the threshold μ = 0.6, the misclassified examples (those with y_i h(x_i) ≤ 0) form the negative data and the rest form the correctly classified data; the 0.6-extended negative data additionally contains every correctly classified example whose confidence satisfies y_i h(x_i) < 0.6, and the 0.6-shrunk positive data is what remains. An example could be positive or negative. Negative data does not mean the data is wrong or corrupt; what can be known about negative data is that the hypothesis cannot make it well-separated. Negative data depends on the hypothesis. Whether an example is positive or negative is relative. For a specific example, hypothesis A may classify it to be negative while hypothesis B may classify it to be positive. Furthermore, even for the same hypothesis, an example may belong to positive or negative under different confidence thresholds μ. The parts of the hyper-surface classifying negative examples need to be repaired in order to improve performance; as shown in Figure 1.2, the rectangle with thick solid color needs to be repaired. The other parts of the hyper-surface classify the positive data sets as well-separated, and they have high generalization capacity.
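The partition just described can be written down directly. The predicted values in this sketch are illustrative stand-ins (the example's specific numbers did not survive transcription); only the five-example setting and the threshold μ = 0.6 come from the text.

```python
import numpy as np

def partition(y, h_values, mu=0.6):
    """Split a labeled set by a hypothesis h, following section 3.1.1.

    Negative data: misclassified examples (y_i * h(x_i) <= 0).
    mu-extended negative data: negative data plus correctly classified
    examples whose confidence y_i * h(x_i) is below mu.
    mu-shrunk positive data: everything else.
    """
    y = np.asarray(y, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    conf = y * h_values
    extended_negative = conf < mu
    return {
        "negative": np.where(conf <= 0)[0],
        "mu_extended_negative": np.where(extended_negative)[0],
        "mu_shrunk_positive": np.where(~extended_negative)[0],
    }

# Five training examples with illustrative predicted values, threshold mu = 0.6.
y = [+1, -1, +1, -1, +1]
h = [0.3, -1.2, 1.1, 0.1, 0.8]
print(partition(y, h, mu=0.6))
```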

3.1.2 Separator and partitioner

A binary classification SVM is a linear classifier in the feature space. The data set must be mapped into the feature space from the input space if the classifier is non-linear; the mapping mechanism is accomplished by the so-called kernel function. It is therefore more suitable to discuss data separability concepts in the linear space if SVM is the base learning algorithm. Without loss of generality, we discuss the concepts in the linear space, where the hyper-surface becomes a hyperplane, since a non-linear space can be mapped into a linear space by a mapping function. If there is a hyperplane h that correctly classifies the entire training data set S, we say that the data set is separable. If no such hyperplane exists, the data set is said to be non-separable. In general, if the data set has noise or a non-optimal hypothesis is used, the data set cannot be separated. As shown in Figure 3.1, the subset consisting of misclassified examples is the set of points with the solid color. If an example (x, y) is correctly classified according to the classifier or separator h, (x, y) is said to be in the consistent subset, (x, y) ∈ CS ⊆ S; otherwise (x, y) is in the inconsistent subset IS = S - CS, (x, y) ∈ IS. The testing data set T is the union of all correctly and incorrectly classified data, T = TP + TN + FP + FN; the consistent subset is CS = TP + TN and the inconsistent subset is IS = FP + FN, as seen in Table 1.1. A data subset of CS is said to be not well-separated, denoted NWSS, if the points of these examples are very close to the hyperplane h. The data subset WSS = CS - NWSS is said to be well separated. Examples in IS, together with NWSS, are called the extended negative data subset, N = IS + NWSS. The metric of "very close to" is given by a partitioner p(h, x, y), which is a fuzzy notion depending on the hyperplane h and the data set S. The return value of p(h, x, y) is a logical value, either true or false.

Figure 3.1 Well-separated data and not well-separated data are in different areas. The points with solid pattern are misclassified.

One simple example of a partitioner in classification is the crisp boundary p(h, x, y): |y - h(x)| ≥ μ, 0 ≤ μ ≤ 0.5; then the not-well-separated data set is NWSS = {(x, y) | (x, y) ∈ CS, |y - h(x)| ≥ μ, 0 ≤ μ ≤ 0.5}. The second example is the fuzzy partitioner, in which WSS and NWSS are fuzzy sets; the boundary of the partitioner is a range. In the third example, a non-symmetric linear partitioner p(h, x, y): y - h(x) ≥ μ1 or y - h(x) ≤ -μ2, with μ1, μ2 ∈ [0, 0.5], is defined on both sides of the hyperplane: NWSS = {(x, y) | (x, y) ∈ CS, y - h(x) ≥ μ1 or y - h(x) ≤ -μ2, μ1, μ2 ∈ [0, 0.5]}. How to define a partitioner and how to choose its parameters depend on the real application, with reference to the ratio of the number of support vectors to training examples, the VC dimension, the size of the training set, cross-validation, and so on. Usually cross-validation is an efficient method to determine the partitioner. Note that

the well separated data set WSS is correctly classified, and it is also called the shrunk positive data subset P, briefly called positive data. The boundary between the positive data set and the negative data set is called the border. In general, the relationships of the subsets mentioned above are:

S = CS + IS
CS = {(x, y) | (x, y) ∈ S, y·h(x) > 0}
IS = {(x, y) | (x, y) ∈ S, y·h(x) ≤ 0}
CS = WSS + NWSS    (3-1)
NWSS = {(x, y) | (x, y) ∈ CS, p(h, x, y) = true}
WSS = {(x, y) | (x, y) ∈ CS, p(h, x, y) = false}
P = WSS
N = NWSS + IS

We can say the positive and negative data sets are divided by both the separator h and the partitioner p(h, x, y). Let d(h, x, y) = [p(h, x, y) = true or y·h(x) ≤ 0]. Then N = {(x, y) | (x, y) ∈ S, d(h, x, y)}, and d(h, x, y) is called a divider, which divides the training data set into positive and negative data sets. In this dissertation only the linear partitioner is considered; the precise definition of negative data is given in section 3.1.3.

3.1.3 μ-negative data

Definition 3.1 (Data Type): suppose h is a hypothesis learned from a training data set

S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}, \quad i = 1..l    (3-2)

and a vector of variables

\Xi = (\xi_1, \xi_2, ..., \xi_l)^T, \quad \xi_i \in R    (3-3)

is chosen to satisfy the following equality:

\xi_i = 1 - y_i h(x_i)    (3-4)

An example x_i is negative data if ξ_i > 1, whereas it is positive data if ξ_i < 1. An example x_i is μ-negative data if ξ_i > 1 - μ, 0 < μ < 1, which is extended negative data, whereas it is μ-positive data if ξ_i ≤ 1 - μ, which is shrunk positive data. The μ-negative data is denoted by

N(S, h, μ)    (3-5)

The μ-positive data is denoted by

P(S, h, μ)    (3-6)

The ratio of the μ-negative data to the μ-positive data is denoted by

c(S, h, \mu) = \frac{|N(S, h, \mu)|}{|P(S, h, \mu)|}    (3-7)

which is a measure of the degree of imbalance in terms of the training data, the hypothesis, and the threshold μ. The misclassified examples are N(S, h, 0) according to the definition of μ-negative data, while the correctly classified examples are P(S, h, 0). If the accuracy of the hypothesis h is at least 50%, then c(S, h, 0) < 1. However, c(S, h, μ) is not always less than 1; it depends on the number of support vectors. The threshold μ plays a divider role in the training data set. The concept of negative data can be extended to the testing data set. The terms negative data, extended negative data, and μ-negative data are used interchangeably where no confusion arises; for the same reason, the terms positive data, shrunk positive data, and μ-positive data are interchangeable.
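A compact sketch of Definition 3.1 and the imbalance measure (3-7), with illustrative labels and predicted values (they are not from the dissertation).

```python
import numpy as np

def mu_partition(y, h_values, mu=0.2):
    """Definition 3.1: xi_i = 1 - y_i h(x_i); an example is mu-negative when
    xi_i > 1 - mu and mu-positive otherwise. Returns N, P and the ratio c of (3-7)."""
    y = np.asarray(y, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    xi = 1.0 - y * h_values                # equation (3-4)
    negative_mask = xi > 1.0 - mu
    N = np.where(negative_mask)[0]
    P = np.where(~negative_mask)[0]
    c = len(N) / max(len(P), 1)            # imbalance measure (3-7)
    return N, P, c

y = [+1, +1, -1, -1, +1, -1]
h = [1.4, 0.1, -0.9, 0.3, 0.7, -1.1]
print(mu_partition(y, h, mu=0.2))          # mu = 0 would give exactly the misclassified set
```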

Figure 3.2 μ-negative examples are defined in the SVM feature space; they are the points marked with solid pattern.

Theorem 3.1 (Negative Support Vectors): all negative examples of the training data in an SVM are support vectors.

Proof: For all negative examples, the maximum of the threshold μ is less than 1, 0 < μ < 1, according to the definition of negative data. Since ξ_i > 1 - μ, we get ξ_i > 0. According to the equality ξ_i = 1 - y_i h(x_i), all negative examples satisfy the inequality y_i h(x_i) < 1 for i = 1..l. According to the KKT condition (C - λ_i)ξ_i = 0 in (2-27), we get C - λ_i = 0, hence λ_i = C and λ_i ≠ 0.

All examples with non-zero Lagrange multipliers are support vectors, according to the definition of a support vector. Therefore the theorem is proved.

The misclassified examples meet the condition y_i h(x_i) < 0; thereby all misclassified examples are support vectors. According to Theorem 3.1, all misclassified training examples and the examples located in the region between the SVM margins are support vectors. It is known that there is negative data whenever the examples are not separable, because misclassified examples then exist. Theorem 3.1 shows that the number of negative examples is related to the number of support vectors. Negative examples are caused by non-separable data. Decreasing the number of negative examples obviously enhances the generalization capability of the SVM, because it means less classification error. For this reason, reducing the number of support vectors can improve the performance of the SVM; the reduced SVM actually demonstrates this idea [67, 68]. The hypothesis of an SVM is determined by the kernel function and the support vectors; therefore the negative data dominate the performance of the classifier. One kind of negative example is named an outlier, where ξ_i >> 1. Thereby outliers will largely degrade the performance of the hypothesis. Outliers are assumed to be removed from the original training data in the data preprocessing phase in this dissertation.

3.2 Motivation

The classifier or hypothesis in classification is a hyper-surface in a multi-dimensional space. A low accuracy hypothesis indicates that some areas of the hyper-surface are not exactly on, or close to, the underlying function. Intuitively, repairing those lower accuracy areas will improve the hypothesis accuracy. Then compensating a

56 4 hypothess has the same meanng as reparng a hyper-surface. The hyper-surface n the nput space s mappng to a hyperplane n the feature space by a kernel functon n the SVM []. The hyper-surface s smooth n the SVM snce feature space s a lnear space and commonly used kernel functons are contnuous n the space. A hyper-surface s comprsed of a set of small peces of sub-hyper-surface n the fuzzy control because of the nature of fuzzy membershp functon segments [69]. Therefore the hyper-surface could have more than one outlnes ether a sngle large surface or a seres of small surfaces. As shown n Fgure.4 the sngle hyper-surface h s not enough for the hgh predctng accuracy because any mprovements n tranng accuracy wll be prone to overfttng such as n the hgh degree of polynomal hyper-surface. To reduce the possblty of overfttng the low degree of hyper-surface h s preferred here h s curve lne. However h wll lead to low predctng accuracy or underfttng. Therefore n order to mprove the predctng accuracy or generalzaton capacty the hyper-surface h needs to be repared so that t appromates to the underlyng functon f. There are fve possble ways n whch one can repar. In the frst method a splne hyper-surface s a set of pecewse sub-hyper-surface whch s appled to a collecton of subsets of nput space X by usng clusterng technque to segment the nput space X nto sub-spaces. The method can be thought n ths way usng a ple of mosacs to lay tles to cover the whole surface of terran. Each sub-hypersurface predcts the sub nput space. The advantage s that each sub-hyper-surface matches specfc sub nput space well; hence the machne gves output wth low error. A sub-hyper-surface only has relatonshp to neghbor sub-hyper-surface. Ths feature makes good mprovement capablty by decreasng the overfttng. The property of

57 4 localty of splne hyperplane makes the re-tranng machne n a partal area of hypersurface rather than the whole system whch reduces re-tranng tme and also makes the system robust. The lmtaton s to resolve the ssue of connectng the sub-hyper-surface smoothly n the boundary to combne together as a whole hyper-surface and also to dvde nto sub-space. The crtera of clusterng are stll not really clear and large sze of tranng data s needed. Ths method also requres a reasonable sze of tranng data set. Vladmr Vapnk gave a method of kernel generatng splne [5]. Splne kernel s powerful and B-splne kernel SVM can be nterpreted the CMAC network ntroduced by Horvath [7]. In the second method a splne kernel [7] n SVMs s chosen to repar hypersurface. In ths approach the contnuty s guaranteed but the localty s lost because h s not a splne functon. Another lmtaton of splne kernel method s that the order k of splne kernel cannot be hgh usually k 4 [3]. The thrd and fourth methods are tentatve deas and may not be practcal. In the thrd method cut-paste approach keeps the most area of h. It replaces the partal area wth hgh error n h by a new small sze of hyper-surface h lke patchng a hole of h by a new small hyper-surface h. Ths procedure assumes that there are a lot of holes needed to patch. In ths method the ssues lke how to determne the holes and how to make the boundary smooth between patches and hyper-surface have not been addressed to. The fourth method pre-stress s to fnd the hgh error regons n the hypersurface and further to gve the opposte regulaton on those regons. Ths method stretches and compresses the hyper-surface by the force accordng to negatve data set. The advantage s to keep the hyper-surface smooth whereas the dsadvantage s the

58 43 dffculty to control force and the neghbor areas of repared parts n the hyper-surface are affected. In the last method overlappng approach uses a base hyper-surface whch appromates the man outlne of underlyng functon whch s created by a base learnng algorthm. Further a number of patchng hyper-surfaces are overlapped onto base hypersurface to form a new hyper-surface. Ths approach s adopted n ths work. So far we know why where and how to repar a hyper-surface. Now the queston s how we could ascertan whether the compensated hyper-surface s enhanced or not. In the generalzaton theory the structural rsk mnmzaton SRM model s to mamze the hyperplane margn measures n the feature space whle mnmzng the emprcal error and hence prevent overfttng by controllng [5]. The emprcal rsk mnmzaton model ERM only mnmzes the emprcal rsk whch may lead to low capacty of generalzaton. The ERM prncple s ntended to use n the large sze of eamples. Suppose the emprcal rsk s fed the hypothess wth large margn has hgher capacty of generalzaton than those hypotheses wth small margns. The SRM prncple defnes a trade off between the qualty of the appromaton of gven eamples and complety of the appromaton functon[3 5]. In fact the SVM s based on the SRM to epand the capacty of generalzaton and then reducng the overfttng. A hypothess h wth large value h for a gven vector n the bnary classfcaton makes eamples well-separated n the data set; whle h wth small margn makes eamples not well-separated. A hypothess makng eamples wellseparated has hgh generalzaton capacty because the margn of these eamples s large. If h< t s sad that s classfed nto class - whereas f h> s n the class +.

59 44 The h s a separator. For nstance the two hypotheses h and h both make all eamples separated. Here h > means that s n the class + whle h < means that s n the class - where s an eample. If h > h then h has hgher generalzaton capacty on than h. Therefore the accuracy of tranng s not a precse measure because of the fact that both hypotheses wth % accuracy have dfferent generalzaton abltes. The formula -8 s more sutable to be a metrc for generalzaton ablty. We epect the hypothess to output a real number h R. Therefore a learner lke the decson tree s not consdered as base learnng approach here. Ths s because the hypothess of decson tree s an ndcator functon. The SVM wll be used as a learner and predctor. Yet another reason for usng SVM s that the estmaton error s well controlled [ 3 4]. 3.3 Characterstcs of SVMs A tranng data set S wth number of eamples l S s generated n ndependently and dentcally..d accordng to a fed and unknown dstrbuton D. The eamples are drawn from the dstrbuton P whle response y s from dstrbuton Py and an eample y drawn accordng to D Py PPy. A learner L conssts of some classes of functons h α defned over X whch are the subset of hypotheses H h α H αδ α s an adjustable parameter whch s generated by the learner accordng to the tranng set. For eample α s correspondng to the weghts and bases n the neural network wth fed archtecture. h α s wrtten by h shortly once α s determned. Each h n H s a functon of the form h:x Y. To choose a best appromaton to the underlyng functonal relatonshp based on the tranng set one must consder the loss Ly h α between response y from tranng set to a gven nput and the response h

(x, α) provided by the learner. The expected value of the loss is given by the risk functional (3-8) [7]:

R(α) = ∫ L(y, h(x, α)) dP(x, y).   (3-8)

The goal is to minimize the risk functional and find the hypothesis h = h(x, α) that minimizes it, a maximum-likelihood hypothesis. P(x, y) is unknown, but its information is carried by the training set. The empirical regression model, empirical risk minimization (ERM), minimizes the following loss function:

L(y, f(x, α)) = (y − f(x, α))².   (3-9)

The task of the learning algorithm is to minimize the risk functional (3-10) with loss function (3-9), where the distribution P(x, y) is unknown and fixed and the training data is given. The general model of the learning problem can be described as follows [73]. The risk functional is

R(α) = ∫ Q(z, α) dP(z), α ∈ Λ, z ∈ S,   (3-10)

where Q(z, α) is a specific loss function and z is a pair (x, y) of a training example. The empirical risk functional is

R_emp(α) = (1/l) Σ_{i=1..l} Q(z_i, α).   (3-11)

There are two theorems from Vapnik:

Theorem 3.2: A hypothesis space H has VC dimension d. For any probability distribution P(x, y) on a binary classification problem, with probability 1 − δ over random training sets S, any hypothesis h ∈ H that makes k errors on S has error no more than

err(h) ≤ k/l + sqrt( (4/l)·( d·log(2el/d) + log(4/δ) ) ),   (3-12)

provided d ≤ l. The VC dimension of a hypothesis space is a measure of the number of different classifications it can realize. The value of k/l is the training error. Inequality (3-12) shows that the capacity for generalization depends not only on the empirical error but also on the hypothesis space. It suggests several ways to lower the error bound: reduce the VC dimension d, minimize the number of training errors k, and increase the size of the data set.

Theorem 3.3: If a training set S of size l is separated by the maximal-margin hyperplane, then the expectation of the probability of test error is bounded by the expectation of the minimum of three values: the ratio n/l, where n is the number of support vectors; the ratio R²‖w‖²/l, where R = max_{x_i∈S} ‖x_i‖ and 1/‖w‖ is the value of the margin; and the ratio m/l, where m is the dimensionality of the input space X:

E[err] ≤ E[ min(n, R²‖w‖², m) ] / l.   (3-13)

Inequality (3-13) gives four ways to improve generalization ability: increase the size of the training set, make the margin as large as possible, reduce the dimensionality, and reduce the number of support vectors. Reducing support vectors can also speed up SVM classification; Quang-Anh Tran et al. proposed a method using k-means clustering, in which the k central vectors of the k clusters form a new, smaller data set. According to [74] the

62 47 support data can be etracted and then to reduce support vector. The tradeoff between speed and performance s controlled by k value[75]. ERM prncple succeeds n dealng wth large sze of tranng data. It can be justfed by consderng the nequalty 3-. When k/l s large err s small. A small value of k/l cannot guarantee a small value of the actual rsk. Classcal approach gnores the last three ways; only reles on the frst one. The NDDCHA reles on the frst one by mnng negatve data the second one by makng tranng data well-separated and thrd one by mplctly calculatng the vector smlarty n the feature space through kernel functon. SVM assumes tranng eamples s..d.; and costs of msclassfcaton nto dfferent classes are the same. When these assumptons are volated the standard SVM does not work properly[76]. 3.4 VC Dmenson VC dmenson of a set of ndcator functons Q z α. α Λ s the mamum number d of eamples that can be separated eamples z... z zd nto two classes n all d possble ways usng functons of the set f for any n there ests a set of n eamples that can be shattered by the set Q z α. α Λ then the VC dmenson s equal to nfnty. VC dmenson can be estmated by [73]: d R w 3-4 where R s the radus of the smallest sphere that contans the all vectors n the feature spaces and w s the norm of the weghts n the SVM. VC dmenson s determned by kernel and tranng eamples. Gaussan kernel s employng an nfnte dmensonal feature space.
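The estimate d ≈ R²‖w‖² of (3-14) can be evaluated from a trained kernel SVM without ever leaving the kernel representation. The sketch below is illustrative only: ‖w‖² is exact from the dual coefficients, but R is approximated by the largest feature-space distance to the kernel centroid rather than by the true minimum enclosing ball, so the result should be read as a rough, upper-bound-style figure; the function name and toy data are ours.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def estimate_vc_dimension(clf, X, gamma):
    """Rough estimate of d ~ R^2 * ||w||^2 (inequality 3-14)."""
    K = rbf_kernel(X, X, gamma=gamma)
    # ||w||^2 = sum_ij (alpha_i y_i)(alpha_j y_j) K(x_i, x_j), summed over support vectors
    sv = clf.support_
    coef = clf.dual_coef_.ravel()
    w_norm_sq = coef @ K[np.ix_(sv, sv)] @ coef
    # squared feature-space distance of each phi(x_i) to the kernel centroid
    dist_sq = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
    R_sq = dist_sq.max()                  # crude stand-in for the enclosing-sphere radius
    return R_sq * w_norm_sq

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("estimated VC dimension ~", estimate_vc_dimension(clf, X, gamma=0.5))
```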

63 Vector Smlarty The vector-smlarty s a metrc to descrbe smlar degree of two vectors. The vector-smlarty depends on the learnng problem greatly. Defnton 3.: gven vectors and { } by functon vs []. X. The smlar degree s defned f vs f 3-5 f The symbol s borrowed as smlar operator here. The Eucldean dstance of vector and the cosne of the angle between and correlaton coeffcent [77 78] and the sum of errors of all attrbutes of two vectors can be used as measures of the vectorsmlarty. Bandemer [79] gves a batch of vector smlarty functons on fuzzy set. vs s not related to label y and t can be defned n many ways e.g. sg mod 3-6 Data set smlarty descrbes overall smlar degree of two dfferent data sets by comparng vector smlarty of every vector n two data sets. In the bnary classfcaton each vector assocated wth a label y. If y and y are n the same class + or - vs keeps unchanged otherwse f y and y are n dfferent class we let vs <. Therefore the etended smlarty defned by: vs yy 3-7 Defnton 3. Vector Smlarty: gven data set A and B data sets where A X and B X and a constant number υ [].

64 49 A and B A B such that: vs A B υ where vs s the functon of vector smlarty defned n the 3-7. Then we can say data sets A and B s smlar n the degree ofυ whch s denoted by: A B υ 3-8 For eample A B. 7 s the lowest smlar degree of A and B greater than.7 for any vector A n A and B n B. To postve data P and negatve data N n terms of hypothess h t s desred P N has lower value. Otherwse an nstance from test data set T s hard to be recognzed as smlar to P or N n the KNN. Therefore the followng condtons are desred: P N υ N T υ 3-9 where υ and υ are decded by applcaton. In contrast wth smlar degree of two data sets dssmlar degree of two data sets can be defned as below: A B 3- When Eucldean dstance s used as smlar metrc vector smlarty s eactly - NEAREST NEIGHBOR algorthm. In terms of the concept of KNN vector smlarty can be etended to -k vector smlarty from - vector smlarty whch compares only two vectors. 3.6 Theoretcal Analyss on NDDCHA A feature space s a lnear space determned by a kernel functon mplctly. The dmensonalty n the feature space s very large although the dmensonalty n the nput

65 5 space s not large. And the dmensonalty ncreases abruptly as the complety of kernel ncreasng. For eample feature spaces F 3 and F 4 constructed by a polynomal kernel n degree 3 and degree 4 respectvely the dmensonalty of F 3 s m + 3 dm 3 and the m dmensonalty of F 4 s dm 4 m + 4 where m s dmensonalty n the nput space. m Supposed m5 ths number s reasonable large for most applcatons then the rado of d 4 and d 3 : dm dm 3 m m m m 33! m m m m 4! m The sparse algebrac polynomal P m α α R s a set of polynomals of arbtrary degree that contans m terms. For d d eample P α α + α α α R dmenson wth the set of loss of functons wth two nonzero terms. To estmate VC Q z α α y Pm 3- Karpnsky and Werther[8] showed that the bound of VC dmenson d for the sparse algebrac polynomals s d* * 3m d 4m If P m α s a hypothess whch s determned by a kernel functon of SVM then formula 3-3 ndcates that a hgh degree of polynomal has a large VC dmenson and a low degree of polynomal kernel has a small VC dmenson. To avod overfttng or to get a small confdence nterval one has to construct machnes wth small VC dmenson[3] such as low degree of polynomal kernel.

66 5 The VC dmenson n the above eample on 3- wll ncrease 33 tmes accordng the formula 3-3. F 3 has dm 3 terms and F 4 has dm 4 terms. The VC dmenson d 3 n the feature space s lmted n the range of formula dm 3dm 3 4 d d * 3 * 4 4dm dm Then the rato of VC dmensons n the feature space F 4 and F 3 s about 3*33. VC dmenson s a metrc to evaluate the complety of a hypothess. A hypothess wth hgh VC dmenson wll be prone to overfttng easly accordng to nequalty 3-[3]. The above eample of polynomal kernel tells that usng lower degree polynomal as much as possble can get the hgh generalzaton capacty. On the other hand f the set of functons has small VC dmenson then t s dffcult to appromate the tranng data. To other types of kernel the smlar concluson can be also gotten. For eample the eponental kernel could be appromated by polynomal kernel. However there ests a tradeoff between overfttng and poor appromaton. Therefore the approach NDDCHA proposed here uses a lower complety of kernel as a base learnng algorthm to reduce the chance of overfttng and uses compensated hypotheses to ncrease the accuracy of appromaton. 3.7 The Patterns of Eamples Dstrbuton n the Feature Space Two factors of the value of the emprcal rsk and the value of the confdence nterval have been consdered to mnmze the rsk n a gven set of functons n mplementng the SRM nductve prncple. If eamples are lnear separable then no etra work needs to do snce basc SVM can solve ths knd of applcatons perfectly as shown on Fgure 3.5. In practce data and hypotheses are not perfect. To the pont of

view of the data, the examples from the input space include noise and outliers, the distribution of examples is not i.i.d., and the numbers of examples in class + and class − are imbalanced. From the point of view of the hypothesis, the size of the hypothesis space is limited. The optimization method also prevents most learning algorithms, except SVM, from reaching a globally optimal solution. These issues are considered in NDDCHA.

Figure 3.3 Distribution of target labels and predicting labels on the hepatitis data set [8]. (Panels, top to bottom: frequency of the target label y; frequency of the predicting label/confidence; frequency of correctly classified and misclassified labels, y*confidence.)

Figures 3.3 and 3.4 show the distribution of target labels and predicting labels on the hepatitis and musk data sets. To obtain the predicting labels, a degree-3 polynomial kernel is used on hepatitis and a degree-3 polynomial kernel with C = 0.5 on musk. The top box shows the distribution of the target label; both data sets are strongly imbalanced. The middle box shows the distribution of the predicting labels; the instances whose predicting value lies within −1 and +1 are support vectors. The bottom box shows which instances are correctly classified.

Figure 3.4 Distribution of target labels and predicting labels on the musk data set. (Same panels as Figure 3.3.)

3.7.1 Small size or imbalanced training data

The example size l is said to be small when the ratio l/d is small, say l/d < 20, where d is the VC dimension of the hypothesis space [5]. E.g., the data set Hepatitis [8] has 155 examples with 19 input attributes; 32 examples are in one class and 123 examples in the other. The estimated VC dimension of Hepatitis is about d according to

69 54 nequalty 3-4 when polynomal kernel z + 3 s used n SVM lght [83 84]. The rato of l/d s.496 <<. The number of support vector s 8 and tranng error s zero. When polynomal kernel z + s used VC dmenson s about d38.7 wth.38% tranng error. The rato of l/d s.475<<. Therefore Theorem 3. of predcatng error bound cannot be appled n the small sze tranng data set applcaton. Fgure 3.5 Lnear separable eamples 3.7. Nose outler and mssng value eample An outler s an eample that les outsde the overall pattern of a dstrbuton[85]. Usually the presence of an outler ndcates some sort of problem. Ths can be a case whch does not ft the hypothess constructed or an error n measurement as shown on Fgure 3.6. In NDDCHA the outlers are negatve eamples havng long dstances to the

70 55 hyperplane whch could be smply deleted from data set. Nosy eamples are also negatve data whch are ether random or systematc. The random nose eamples can be elmnated n the SVM. One of systematc noses could be caused from the small sze of hypothess space or non-optmal parameters α of specfed hypothess h αα Λ. Because SVM depends on a small set of support vectors comparng to tranng data set the hypothess may be senstve to noses and outlers. Accordng to Theorem 3. outlers are support vectors. Sometmes a few of attrbutes n an eample are mssng value because of several reasons ncludng: the value s hard to get or the value s stll gong to get. Then the value of kernel s decreased f all values of attrbutes are normalzed to nterval [ ]. The symptom s eactly lke nosy eamples because we can thnk the mssng value s resulted from noses. Fgure 3.6 An eample of outler the red crcle on the rght-bottom s an outler whch s far from other eamples.
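Since outliers are, by Theorem 3.1, support vectors with very large slack, the preprocessing step assumed in this dissertation — removing outliers before training — can be sketched as a probe fit followed by a cut on ξ. The cut-off value and the helper name below are illustrative assumptions, not a prescription from the text.

```python
import numpy as np
from sklearn.svm import SVC

def drop_outliers(X, y, xi_cut=2.0):
    """Remove outliers before training: examples whose xi = 1 - y*h(x) is far above 1.

    xi > 1 means the example is misclassified; xi much larger than 1 means it lies
    deep on the wrong side of the separator and is treated as an outlier.
    """
    probe = SVC(kernel="rbf", C=1.0).fit(X, y)
    xi = 1.0 - y * probe.decision_function(X)
    keep = xi <= xi_cut
    return X[keep], y[keep]
```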

71 56 Chun-fu Ln and Sheng-de Wang proposed fuzzy support vector machnes FSVMs provdng a method to classfy data wth noses and outlers by assocated wth a fuzzy membershp value to each tranng eample[86]. The knowledge of membershp s acqured from strateges of kernel-target[87] and KNN whch fnd the modfed hyperplane by FSVMs n the feature space. FSVM provdes a way to deal wth the noses and outlers although the computatonal cost s hgh Compensatable negatve eamples The pattern of compensatable negatve data shown on Fgure 3.7 s compensatable drectly by patchng a sngle hypothess. Some eamples of class + appeared crcle ponts on area B are msclassfed. However these eamples are not ntervewed wth class - appeared rectangle ponts. SVM tres to mnmze total rsk for all eamples. To acheve that eamples on area B s obvously not on the margn accordng to the rsk functonal of soft margn SVM defned on Equaton -3. The regulaton parameter C s preferred to choose a small value then nsuffcent stress wll be placed on fttng the tranng data otherwse SVM wll trend to overft the tranng data. The parameter k on the equaton -3 also contrbutes to the negatve data. If k then second term C l ξ counts the number of tranng errors therefore the lowest value k s k. The k value k s also used although ths s more senstve to outlers n the data. If we choose k then we are performng regularzed least squares.e. the assumpton s that the nose n eamples s normally dstrbuted. No matter what parameters are chosen on SVM the pattern of Fgure 3.7 cannot be elmnated snce pattern s that result of that eamples s not well dstrbuted or..d..

72 57 B Fgure 3.7 Sngle sde negatve eamples Negatve eamples are smlar eamples Fgure 3.8 Patchng a testng eample n the drectly compensatable pattern. Crcle ponts are n class+; rectangle ponts are class -; trangle pont s test pont.

73 Not compensatable negatve eamples The pattern of compensatable negatve data shown on Fgure 3.9 s not compensatable drectly by patchng a hypothess. Postve and negatve eamples are nterwoven on the area A where eamples cannot be lnear separated n the feature space. One dea s to re-use SVM agan to only those eamples n the area A. Snce feature space s mplcated by kernel functon eamples cannot be gotten drectly. Furthermore the dmensonalty of vector n the feature space s very hgh based on the analyss of applyng a polynomal kernel on the secton of Theoretcal Analyzng on NDDCHA. The approach proposed here s stll patchng hypothess. However the dffculty s how to select a desred patchng hypothess n the testng phase. Applyng whch patchng hypothess to testng eamples s based on the vector smlarty between a testng eample and negatve data set. As shown on Fgure3.8 vector s n the negatve data subset N whch s smlar wth testng vector then wll apply for the patchng hypothess used for N. If any smlar vector of cannot found cannot not be compensated. Sometmes t s very dffcult to fnd a smlar vector n the negatve data subset as shown on Fgure 3. s smlar wth { } and { }. The drecton of + 3 compensatng for s opposte. Suppose real { } f we choose as smlar vector + then result s desred. Otherwse we choose 3 as smlar vector and then the result s even worse. To attach whether s n the class {+} or {-} k-nearest neghbor method are used whch method s more precse than vector smlarty by comparng a vector to a group of neghbor vectors n the negatve eamples.
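The k-nearest-neighbor comparison just described has to be carried out in the feature space, which is only available through the kernel. A minimal sketch (function name and toy data ours) uses the identity ‖φ(x) − φ(z)‖² = K(x, x) − 2K(x, z) + K(z, z), so the feature map never has to be evaluated explicitly.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def kernel_knn(x_query, X_ref, k=3, **kernel_args):
    """k nearest neighbours of x_query among X_ref, measured in the kernel feature space."""
    xq = x_query.reshape(1, -1)
    Kxx = polynomial_kernel(xq, xq, **kernel_args)[0, 0]
    Kxz = polynomial_kernel(xq, X_ref, **kernel_args).ravel()
    Kzz = polynomial_kernel(X_ref, X_ref, **kernel_args).diagonal()
    dist_sq = Kxx - 2.0 * Kxz + Kzz
    order = np.argsort(dist_sq)[:k]
    return order, dist_sq[order]

# toy usage: find the 3 negative training examples closest to a test point
rng = np.random.default_rng(2)
N_neg = rng.normal(size=(40, 6))          # stand-in for a negative subset N_i
x_test = rng.normal(size=6)
idx, d2 = kernel_knn(x_test, N_neg, k=3, degree=3, coef0=1.0)
print("nearest negative examples:", idx, "squared feature-space distances:", d2)
```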

74 59 A Fgure 3.9 Interweaved postve and negatve eamples Negatve eamples 3 Smlar eamples? Fgure 3. Patchng a testng eample n the non-drectly compensatable pattern. Crcle ponts are n class+; rectangle ponts are class -; trangle pont s test pont.

75 6 Negatve Tranng eample h Negatve Testng eample h Fgure 3. The negatve tranng eample s compensated by h when n the tranng phase but negatve testng eample can not be compensated by h Imbalanced eamples SVM mnmzes the rsk functonal for all eamples. It does not try to mnmze dstnctvely the rsk of class + eamples or the rsk of class - eamples. Ths s because SVM adopts the assumpton of eamples whch s..d. and number of two classes { + } of eamples s balanced. Because of ths nherent nsuffcency of SVM SVM has less power on mbalanced eamples than on well balanced eamples. Fgure 3. shows mbalanced eamples the number of class + eamples s great than number of class -. The hyperplane created by SVM wll be prone to the sde that has more eamples than other sde. Therefore the predctve negatve value s closed to zero f mnorty class s class - because SVM optmzes accuracy. In ths case ROC metrc can

76 6 let SVM focus on the mnorty class. In certan condtons -norm soft margn SVMs can mamze AUC[5]. Fgure 3. Imbalanced eamples To deal wth mbalanced class problem one could modfy data dstrbuton by usng over-samplng and under-samplng[88]. Over-samplng replcates the mnorty class whle under-samplng removes partal majorty class. Both samplng makes tranng dataset balanced. The drawback of under-samplng s to lose some useful data. To overcome that the support vectors can be kept and non-support vectors can be removed n the SVM because SVM s determned by support vectors. Over-samplng may brng overfttng. In ths dssertaton under-samplng s adopted as shown on Fgure 3.3. Under-samplng tranng data keeps all supports vectors and partal non-support vectors whch depend on the number of negatve data. Non-support vectors may not be ncluded when support vectors area has more postve data.
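The under-sampling rule sketched here — keep every support vector, drop only part of the non-support vectors — can be written in a few lines. The following is an illustrative sketch (the helper name, the kernel choice, and the keep fraction are ours), not the dissertation's implementation.

```python
import numpy as np
from sklearn.svm import SVC

def undersample_keep_svs(X, y, keep_non_sv=0.3, random_state=0):
    """Under-sample while keeping every support vector of a first SVM pass."""
    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    sv_mask = np.zeros(len(y), dtype=bool)
    sv_mask[clf.support_] = True
    rng = np.random.default_rng(random_state)
    keep = sv_mask | (rng.random(len(y)) < keep_non_sv)   # all SVs + a fraction of the rest
    return X[keep], y[keep]

# toy usage on a 9:1 imbalanced sample
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = np.where(rng.random(500) < 0.9, 1, -1)
X_u, y_u = undersample_keep_svs(X, y, keep_non_sv=0.3)
print(len(y), "->", len(y_u), "examples after under-sampling")
```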

77 6 Tranng Data Non-support vectors Postve Data Support vectors Negatve Data Under-samplng Data Fgure 3.3 Under-samplng strategy Yan et al proposed SVM ensemble to deal wth mbalanced eamples by combnng small class eamples wth a pece of large class eamples to form new k data sets and use k SVMs [89]. Each h.. k s a decson functon. Snce majorty votng and probablty based combnaton assumes all classfers are equal weghts the proposed strategy s creatng a herarchcal SVM. The fnal classfer or hypothess for an nstance s h h h h... h as shown on Fgure 3.4. k The bas b could be used as a regulaton parameter to control the poston of hyperplane. So far the problem how to recognze these cases above s stll remaned. Imbalanced eamples outler and lnear separatable eamples can be detected ntutvely. Our work focuses on the compensatable negatve eamples and nterweaved postve and negatve eamples. Compenstable eamples can be compensated by movng patchng hyper-surface to the desred drecton. Interweaved eamples can be dstngushed by k-nearest neghbor n the feature space. Vector smlarty can be thought as -nearest neghbor.

78 63 Fgure 3.4 Archtecture of Yan et al SVM ensembles 3.8 Compensatng Hypothess Approach Compensatng hypothess approach proposed n ths paper for the bnary classfcaton enhances the useful data nformaton by mnng negatve data. Ths approach s based on the Support Vector Machnes wth +k tmes learnng where the base learnng hypothess s teratvely compensated k tmes. Ths approach produces a new hypothess on the new data set n whch each label s a transformaton of the label from the negatve data set further producng the chld postve and negatve data subsets n subsequent teratons. Ths procedure refnes the model created by the base learnng algorthm creatng k number of hypotheses over k teratons as shown on Fgure 3.5. h- s the - th patched hypothess..k. A patchng hypothess h s patchng hypothess to compensate the base hypothess. Area A belongs to not well dstrbuted eamples. If

79 64 h- fts the area A h- wll be overfttng. The compensated hypothess s hh-+ h. Note that h s base hypothess created by bnary SVM. A h h- Fgure 3.5 Compensatng hypothess approach Ths approach s smlarly applyng for the Dvde and Conquer prncple. It perfectly accord wth the phlosophy of attackng the man ssues frst and then mnor ssues. The man ssue s attacked by creatng a man framework whch s a base hypothess to ft most of..d. eamples n the tranng data set. The mnor ssues are attacked by compensatng hypotheses.

80 65 CHAPTER 4 ERROR DRIVEN COMPENSATING HYPOTHESIS APRROACH It s mpossble to select a perfect model for a practcal problem wthout appromaton error n a learnng algorthm. Imagnng that underlyng functon f s a fluctuant terran t s hard to ft the terran by usng a huge sze of carpet h. The reason s that only tranng set and lmted pror knowledge s avalable. The man dea to reduce the appromaton error s to compensate the parts of not well-ft huge carpet by a sequence of small sze of carpets h whch s drven by the negatve data subset of tranng data. 4. Negatve Data Let tranng data set S S n the defnton -. It can be parttoned nto two subsets accordng to a dvder dhy y S where h s produced by a base # learnng algorthm. One subset s postve subset S whch s a set of well-separated eamples from S by the hypothess h h. The remanng of S s the negatve # subset S satsfyng S S + S. The negatve data does not mean the data s wrong or corrupt. What negatve data can be known s that the hypothess cannot make t wellseparated. The negatve data s not lmted to those ncorrect data alone but may also contan a few correctly classfed data as well. The defnton of negatve data s gven on t 3.. Boostng uses an equaton a t ln where s the error n the last teraton to t weght those negatve eamples. In NDDCHA the dvder s strongly related to the

81 66 number of support vectors sze of tranng eamples VC dmenson and radus of the sphere contanng all eamples whch can be determned by cross- valdaton method. Not all negatve eamples contan useful nformaton. The outlers obvously are junk. Thereby the negatve eamples wth confdence grater than. wll not be consdered n the negatve patchng learnng. To the μ-negatve data the crteron to judge whch eample needs to compensate s below: y h μ.. l 4- because the msclassfed eamples meet y h and not well separated eamples meet y h μ. The compensaton value s y h.. l 4- E.g. assume μ. 4 predctng value of an nstance s h -.3 and target label s y - y h *.3. 3 < μ then nstance needs to compensate. The compensaton value of the nstance s y h Tranng Phase Let X be a collecton of nput vectors from tranng set S and Y be a vector consstng of all labels of tranng set X { y S} and Y {y y S}. 4-3 Let h are the patchng negatve models workng on the tranng negatve subset S and d h y be dvders. And S s the negatve subset of S - accordng to d - h - y.

82 67 Here h s the comprehensve patchng model. And hk s the fnal model provdng to testng procedure. h h for.. k. 4-4 h h + h The sgn + n above epresson 4-4 provdes an overlap operaton for two models. Therefore the hypothess h of the NDDCHA approach s h k h for k > The tranng data sets are defned as follows k. S S S # S { Δy S Δy S S Δy Xd h h and Δy Δy Y} for.. k. 4-5 The labels on the tranng subset S are the dfferences or resduals of predcated labels and epected labels. For nstance a vector y.9+ has one attrbute.9 and label + from tranng data. The predctng label y of s.. Although y s correctly classfed because y.> s n the class + y s not well separated. On the net pass learnng the vector y becomes as the new tranng data. The hypothess s produced by tranng on the resdual data snce the dea of NDDCHA s to compensate the base hypothess each tme. Snce above algorthm s terated over k tmes t has to be regresson learnng algorthm. There are a total of +k passes n ths algorthm. S # are the postve data subsets and do not change durng the tranng. The algorthm has two phases tranng and testng as shown n Fgure 4. and k The fnal tranng output su S #. In the tranng phase the herarchy of tranng s a

chain.

4.3 Learning Termination Criteria

In every step, training is driven by negative data and produces a series of hypotheses h_i. The key point in training is the learning termination criterion TC(k, S_k), which determines how large the value of k becomes. There are many possible ways to determine k, of which three are mentioned here. In the first case, k is taken as the number of iterations at which the size of S_k falls below the number of input attributes of S_k. In the second case, k is taken as the number of iterations at which the difference in size between S_{k-1} and S_k, i.e. |S_{k-1}| − |S_k| ≤ μ, is small enough; here μ is a positive integer. In the third case, k is taken as the number of iterations at which h_k gives the expected output with the specified accuracy.

Figure 4.1 Training phase: S_i is the negative data subset, S_i^# is the positive data subset, h_i is the patching model or hyper-surface, and d_i are the dividers, for i = 1..k.
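A compact sketch of the (1+k)-pass training loop of Sections 4.2–4.3 is given below. It follows one reading of the procedure: an SVC provides the base hypothesis, each patch is a regression machine (an SVR here, since the residual labels are real-valued), the divider keeps examples that are misclassified or inside the μ band, and training stops under the first termination criterion. All names, the interpretation of the divider on later passes, and the kernel choices are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC, SVR

def nddcha_train(X, y, mu=0.4, k_max=5):
    """(1+k)-pass NDDCHA training: base SVC plus up to k residual patches (SVRs)."""
    base = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)
    h0 = base.decision_function(X)
    patches, neg_sets = [], []
    X_cur, res_cur = X, y - h0                 # residual labels on S_0
    neg = y * h0 <= mu                         # divider d_0: misclassified or inside the mu band
    for _ in range(k_max):
        X_cur, res_cur = X_cur[neg], res_cur[neg]   # child negative subset S_i
        if len(res_cur) <= X.shape[1]:              # termination criterion 1 (Section 4.3)
            break
        patch = SVR(kernel="rbf", C=1.0).fit(X_cur, res_cur)
        patches.append(patch)
        neg_sets.append(X_cur.copy())
        res_cur = res_cur - patch.predict(X_cur)    # residual left after this patch
        neg = np.abs(res_cur) > mu                  # still-not-well-separated examples
    return base, patches, neg_sets
```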

84 Testng Phase In the testng phase the hypotheses from tranng are used to create patchng data to compensate the base hypothess. The key pont n the testng phase s to determne the sutable patchng hypothess. The functon of vector set smlarty VS accepts two data sets S and T - one from the tranng data set and the other from the testng data set to generate a subset of T from T -. As a result each vector T - becomes smlar to at least one vector S denoted to vs δ where δ[] s the degree of smlarty. For eample when then vs and when then vs. Here P predcts labels on negatve data set T T T T VS S T f T S vs δ δ [] for k P { y T y h } P { y T S vs δ y h } for.. k 4-7 T h OV T P P # VS h OV S T P P # VS h OV S P P # 3 P # k- T k- OV VS h k S T k k P k P # k Fgure 4. Testng phase.

85 7 VS s the module of vector-smlarty OV s the module of overlappng. S s negatve data from tranng phase. T T s testng data set. T has the smlar vectors or elements between S and T - P s the patchng hyper-surface. P # s compensated testng outputs...k. P # k s fnal testng output. In above epressons δ s the regulatng parameter to control the degree of two vectors smlarty. It can be seen that T s smlar to S so that h can be used for testng T. to generate the values of P. These values are overlapped on to compensate labels as P # OVP # P # -. The fnal ouputs P # k are the predctng labels. For nstance Eucldean dstance - can be used as vector smlarty functon where s n the testng set T - and s n the negatve tranng data subset S. The vector smlarty functon vs s.- - f and are normalzed. The output labels whch are compensated value are gven as follows. P P # # P OV P # P.. k. 4-8 It can be seen that n the tranng phase the learner uses the hypotheses hh together wth the parttoner functon p h y as dvder generatng a postve group S # k + ku # S and a negatve group S k+ S-S # k. We fnd a subset of testng data T whch s smlar to S and use the hypotheses produced on S for testng T. The + operaton s one case of OV functon and hence the fnal testng result s treated as a summaton of overlappng functon k # #. P k P
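The testing phase above can be sketched alongside the training loop. In the code below (illustrative only; it continues the objects returned by the nddcha_train sketch in Section 4.3), a test point receives patch i only if it is at least δ-similar to some vector of the negative subset S_i that produced the patch, and the overlap OV is plain addition of confidences. The make_feature_space_sim helper anticipates the feature-space similarity discussed in Section 4.5; its rescaling is our choice.

```python
import numpy as np

def nddcha_predict(base, patches, neg_sets, X_test, sim, delta=0.7):
    """Testing phase of NDDCHA (Section 4.4): patch only delta-similar test points."""
    out = base.decision_function(X_test)             # P_0: base confidences
    active = np.ones(len(X_test), dtype=bool)        # T_0 = T
    for patch, S_neg in zip(patches, neg_sets):
        match = np.array([
            bool(a) and any(sim(x, s) >= delta for s in S_neg)
            for x, a in zip(X_test, active)
        ])
        if not match.any():
            break
        out[match] += patch.predict(X_test[match])   # OV: overlap patches onto the base
        active = match                               # T_i feeds pass i+1
    return np.sign(out), out

def make_feature_space_sim(h, kernel):
    """One possible `sim`: sign agreement of h plus kernel-space closeness (cf. Section 4.5)."""
    def sim(x1, x2):
        h1, h2 = h(x1.reshape(1, -1))[0], h(x2.reshape(1, -1))[0]
        if h1 * h2 < 0:                              # opposite sides of the separator
            return 0.0
        d2 = kernel(x1, x1) - 2.0 * kernel(x1, x2) + kernel(x2, x2)
        gap = 1.0 / (1.0 + np.exp(-abs(h1 - h2)))    # sigmoid of the confidence gap
        dist = 1.0 / (1.0 + np.exp(-np.sqrt(max(d2, 0.0))))
        return 2.0 * (1.0 - max(gap, dist))          # rescaled to (0, 1]
    return sim

# usage with the objects returned by nddcha_train:
#   kernel = lambda a, b: (a @ b + 1.0) ** 3     # same family as the base kernel
#   sim = make_feature_space_sim(base.decision_function, kernel)
#   labels, conf = nddcha_predict(base, patches, neg_sets, X_test, sim, delta=0.7)
```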

86 7 4.5 Dscusson of Vector Smlarty n the Feature Space Compensatng an eample strongly depends on the functon of vector smlarty. There are four possbltes of compensatng an eample:. epected to compensate the result compensates. epected to compensate but not compensate 3. epected not to compensate the result compensates 4. epected not to compensate and not compensate Case mproves accuracy of base hypothess. Case and 4 are harmless. In worst case they keep the accuracy of base hypothess. Case 3 s dangerous case and need further study. Those testng eamples whch meet the condtons of Theorem. wll be compensated. The vector-smlarty plays etremely mportant role n NDDCHA learnng. To apply the reparng hyper-surface the frst thng s to fnd out whch vectors n testng data set need to be compensated. The vector-smlarty s used to fnd the relatonshp of vectors n the negatve data subset S and testng data subset T -. Only those vectors n T - wth hgh smlarty to those n S need to be compensated that s S T υ v s the degree of smlarty. The evaluaton of vector-smlarty n the classfcaton applcaton n the SVMs cannot be obtaned drectly because of the fact that the smlarty n the feature space s dfferent wth n the nput space X. The connecton between nput space and feature space was dscussed to fnd a vector n the nput space by gven a vector n the feature space[9]. The man dea s that the comparson of two vectors and are computed on

87 7 the feature space. There are three crtera of smlarty to be consdered to determne the smlar degree of and f:. and are located on the same sde of the hyperplane on ether class + sde or class - sde.. Dstances of and to the hyperplane are close then and have smlar separable capablty. The dfference of two dstances s less than δ 3. - s small enough whch s less than δ. To compute the vector-smlarty n the feature space h suppose hw T +b. The condton s equvalent to h *h >; and condton s θ δ δ w h h h h w w h w h dst dst Note that θ s constant. The dfference of h h s bound on w that means the lmtaton s not tght f w s large. Ths phenomenon shows agan that SVM mamzng the w / s correct. Condton 3 s δ + K K K d based on the equaton -6 where K s the kernel functon used n the th tranng. The vector smlarty algorthm n the feature space of SVM s functon vector_smlarty f h *h < then return +

88 73 else return -ma sgmodh - h sgmodd The dfference of confdence h - h and dstance d need to normalze so that the output of functon vector_smlarty has comparablty among eamples. One eample of normalzaton method s a sgmod functon whch outputs a value between and. Fgure 4.3 Sgmod functon sgmod e Algorthm k-nearest neghbor s an nstance learnng method whch learns hypothess only upon a new nstance queryng. The sgnfcant advantage of nstancebased learnng s local appromaton. The dsadvantage of that s that the cost of classfyng a new eample can be hgh. When vector-smlarty functon vs s called k- nearest neghbor algorthm retreves k number of smlar related eamples from negatve data to determne how strong relatonshp between and. In the NDDCHA suppose t T n yn S where T s testng subset and S s negatve subset n the tranng data. The functon vs t n > f n s n the class + otherwse vs t n < f n s n the class -.

89 74 vs t n f y n < then return vector_smlarty t n else return + vector_smlarty t n To enhance to power of vector smlarty functon k-nearest NEIGHBOR algorthm s ntroduced. A testng eample s T and k eamples nearest T to T are... k S. The dstances between T and... k are d d... dk. The dstance weghted k-nearest NEIGHBOR gves estmated value of vector smlarty Δy T to T : Δy T k j k d Δy j j d j j. 4- where Δy j s the compensated value n the th tranng phase. To normalze the vector smlarty value vs [ + ] an etended sgmod functon s ntroduced: etended _ sgmod sgmod e 4.6 Algorthm of NDDCHA The procedure of NDDCHA has three parameters S s the tranng data set; T s the testng data set; and δ s the degree of vector smlarty. The return value of the algorthm s the predctve labels of testng data set. S subroutnes are nvoked: h LEARNS

90 75 P PREDICTT h S # + S + DIVIDERS h T VSS T - δ P # OVP # -P TCkS LEARN denotes tranng routne to get the model or hypothess; PREDICT routne predcts the labels of gven data set and model. DIVIDER s the routne to dvde tranng data set nto postve and negatve data subset by gven the hypothess and the functon parttoner dhy. In each pass the functon VS and DIVDER could be dfferent. The algorthm s descrbed below as pseudo-code. The code gven below follows the sequence or procedure developed and shown above n ths secton. NDDCHA S T δ >Learnng phase S[] S h[] LEARNS[] repeat + S # [] S[] DIVIDERS[-] h[-] h[] LEARNS[] untl TCS k > the number of teraton n repeat loop

91 76 > Testng phase T[] T P[] PREDICT T h[] P # [] P[] for to k do T[]V[] VSS[]T[-] δ f T[] Φ > T[] s not empty set then P[] PREDICTT[] h[] return P#[k] P#[] OVP#[-] P[] V[] DIVIDERS[-] h[-] X ΔY Φ foreach y n S[-] do > let XΔY be S[-] > ntalze to empty set X X {} ΔY ΔY {y} S[] Φ foreach Δy[-] n XΔY do Δy[] PREDICT h[-] f dh[-] Δy[-] then S[] S[] { Δy[]} Δy[] Δy[-]- Δy[] > update ΔY

92 77 S # S[-] - S[] return S#[] S[] VSS[]T[-] δ T[] Φ V[] Φ foreach n T[-] do foreach n S[] do f vs δ then T[] T[] {} V[] V[] {sgn vs } break return T[] V[] To apply the NDDCHA learnng algorthm t s requred that the parttoner functon dhy termnate crtera functon TCk S and vector smlarty vs to be provded. The performance of NDDCHA very much depends on the selecton of parttoner and vector-smlarty functon whch needs a pror knowledge of learnng task. 4.7 NDDCHA Algorthm Smulaton NDDCHA algorthm s mplemented by Perl whch uses a slght modfed SVM lght [83 84] as the base learnng algorthm ncludng learnng and classfyng modules. Three bnary classfcaton case studes on musk[9] breast cancer and cement have been analyzed. Before the cases were studed the three functons parttoner functon dhy

93 78 termnate crtera functon TCkS and vector smlarty vs needed to be defned. To smplfy the complety of computaton the parttoner was defned on the feature space by dh y ffh < true false [.5]. And TCkS was defned by TC S[]ff S[] true false. Vector smlarty Eucldean dstance method was used for musk; and feature space method was used for breast cancer and cement. Feature space vector smlarty approach s adopted n our work because of the fact that TABLE 4. Comparson of three data sets musk cancer cement SVM Average accuracy% NDDCHA average accuracy% Increase accuracy% the data n two cases of three has mssng value. The eucldean method s obvously not sutable for ths stuaton snce ths method wll make a vector wth mssng value n a hgh degree of smlarty. The threshold δ of vector smlarty s.7 n the dataset musk and δ.6 s used for both datasets breast cancer and cement. The data sets used n case studes musk and breast cancer are from UCI Knowledge Dscovery n Databases KDD Archve [8] and the data set used n cement s from Wangchang cement company. The n-fold cross-valdaton s performed n each case. The orgnal data s randomly dvded nto n groups; each group has the same or

94 79 appromate sze. Each group does not have the smlar dstrbuton of classes. One group s used as testng data and the remanng n- groups are grouped as tranng data. The valdaton procedure runs n tmes and each tme the testng data s from the th group...n. The average result of n-fold s the fnal accuracy gven as shown n TABLE 4.. It should be mentoned that the parameters of SVM are not well tuned because the cases are used to show the NDDCHA has better performance than the base learnng algorthm. TABLE 4. Smulaton on the data set musk No TP FN FP TN M C A% TP FN FP TN M C A% I% Aver The table has left and rght sdes. The left sde s the result of regular SVM. The rght sde s the result of NDDCHA. TP true postves FN false negatves FP false postves TN true negatves M msclassfed FN+FP Ccorrect classfed A%accuracy C/C+M*% and I% the accuracy mproved rght A% - left A%/left A% *% Musk has 6598 eamples wth 68 attrbutes and no mssng value. -fold cross-valdaton shows that the accuracy of predcton s 9.3% by usng SVM RBF model wth parameter γ.5. Before SVM s used to predct the test data all data

95 8 ncludng tranng set and test set are normalzed by unt normal scalng approach [9]. The operaton of normalzaton mproves the average accuracy mprovement from 7% to 9.3%. The unt normal scalng approach s appled on all data sets. The number of support vectors s very large near the sze of tranng set the rato of support vectors to eamples r88% whch means that there are lots of nose nvolved n the data set or the hypothess space s less than the target space. In a hgh nose problem many of the slack varables become non-zero and the correspondng eamples become support vectors [93]. The model h classfy tranng set and postve data P and negatve data N are dvded by the parttoner dh y ffh <.3true false whch mples all data TABLE 4.3 Smulaton on the data set Cancer No TP FN FP TN M C A% TP FN FP TN M C A% I% Aver The labels have the same meanng as TABLE 4. predctng output n are negatve subset. The number of compensatng s one. The sze of N s 6 data set N s traned as hypothess h and then h classfes the testng data set agan. The accuracy s as low as 5.7% and uses only support vectors. The reason s the sze of tranng set N s too small. However h wth.5% rato of number of support vectors and the number of tranng data gves % accuracy to classfy group N. It s nterestng to note that only two support vectors alone can gve such a hgh accuracy on the negatve tranng data set N. Detterch et al. [94] proposed ths algorthm terated-dscrm APR n ther paper. Compared to ther other seven algorthms descrbed n dfferent papers the terated-dscrm APR demonstrated the best

96 8 performance resultng n the correct predcton of 89.% wth confdence nterval [83.%-95.%] by usng -fold cross-valdaton wth eamples on Musk data. The NDDCHA usng -fold cross-valdaton wth 6 eamples gave the hgh predcatng confdence nterval wth [ ] as shown n Table 4.. TABLE 4.4 Smulaton on the data set Cement No. TP FN FP TN M C A% TP FN FP TN M C A% I% Aver The labels have the same meanng as TABLE 4.. Cancer has 569 eamples wth 3 attrbutes and mssng value. 5-fold crossvaldaton shows that the accuracy of predcton has been mproved from 88.9% to 9.86% by usng two learnng approaches as shown n Table 4.3. Snce there are mssng values n ths data set the vector smlarty works on the feature space. The kernel functon of SVM s mappng from nput space nto feature space. Note that the th attrbute of vector n the feature space s the combnaton of attrbutes of n the nput space and then t overcomes the effects due to mssng value. The same vector smlarty method s used for cement data set for the same reason as Cancer. The cement has eamples wth attrbutes and mssng data. The parttoner n the cancer and cement s also dh y ffh <.3true false. The smlarty degree s vs.99. The 5-fold cross-valdaton shows that the testng accuracy s ncreased from 7.56% to 74.94%. From Table and 4.4 t s observed that NDDCHA can always mprove

97 8 the testng accuracy n every fold testng. The tme cost on Musk s around 3 mnutes n the M-Pentum. GHz Wndows XP PRO computer. It s less than 5 mnutes n the other two data sets. It s shown that the vector smlarty functon s senstve to learnng problems n the cases studed. The fnal performance s dependent on the degree of vector smlarty. The Eucldean dstance method works on nput space whch means that the predctng value should be smlar f nput vectors are smlar. However Eucldean dstance method treats every attrbute n the vector wth equal contrbuton. Ths s not true for the real world problem because some attrbutes are more sgnfcant. Therefore all attrbutes must be at least normalzed before data s fed nto the learnng machne. When mssng value ests n the data set the dstances between the vector wth mssng attrbutes and normal vector wll be small. As a result these two vectors seem to be smlar but n fact they are not. In general the Eucldean method s sutable for large sze of eamples wthout mssng data such as n data set musk. Conversely the data set cancer has small sze eamples and mssng attrbutes. The good vector smlarty approach s to compare the smlarty of two vectors n the feature space. The reason s that n-dmensonal nput vector s projected nto a hgh m-dmensonal space usng nonlnear functonφ : n m R R and then mssng attrbutes n the nput space do not appear to be mssng on the feature space. The feature space s determned by kernel functons so the smlarty s related to learnng model.
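The evaluation protocol used in these case studies — unit-normal scaling followed by n-fold cross-validation of the base SVM — looks roughly like the following sketch. It is illustrative only: the experiments reported above were run with a Perl driver around SVM-light, whereas scikit-learn is used here for self-containment, and the γ and C values shown are placeholders rather than the tuned values.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_validate_base_svm(X, y, n_folds=10, gamma=0.05, C=1.0):
    """Unit-normal scaling inside each fold + n-fold cross-validation of the base SVM."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        scaler = StandardScaler().fit(X[train_idx])        # unit normal scaling
        clf = SVC(kernel="rbf", gamma=gamma, C=C)
        clf.fit(scaler.transform(X[train_idx]), y[train_idx])
        accs.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```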

98 83 CHAPTER 5 STATISTICAL NEGATIVE EXAMPLES LEARNING APPROACH Statstcal negatve eamples learnng approach SNELA has two or three stages of learnng ncludng the base learnng the negatve learnng and the boostng learnng for bnary classfcaton as shown n Fgure 5.. D :Learner h h S h D :Audt h Combnng h h h h h D 3:Booster h Fgure 5. Scheme of SNELA The base learnng whch s named learner employs a regular support vector machne to classfy man eamples and recognze whch eamples are negatve. The negatve learnng whch s named audt judges the predctng results of learner works on the negatve tranng data whch s strongly mbalanced to predct whch nstance could be negatve based on learner. When an nstance s predcted by audt as negatve ths nstance s clamed to be msclassfed by learner. The net step s compensaton where

99 84 we move ths nstance nto opposte class ether from class + to class - or vce versa f the nstance s negatve. Furthermore boostng learnng booster s appled when audt does not have enough accuracy to judge learner correctly. Booster works on the tranng data subset wth whch learner and audt do not agree. The classfer for testng s the combnaton of learner audt and booster. The classfer for testng a specfc nstance returns the learner s result f audt acknowledges learner s result and learner agrees wth audt s judgment otherwse returns the booster s result. If audt has enough accuracy boostng learnng may be skpped. 5. Concept of True Error The notaton Pr[ π ] D means the probablty of Boolean epresson π holdng on nstance drawn from nput space X accordng to dstrbuton D. In general the equaton below s held[]: D Pr[ π ] D Pr[ π ] X 5- where D s the probablty of nstance chosen under dstrbuton D. Concept c s a Boolean functon on some spaces of nstances. The probablty Pr[ h c ] s called the error of hypothess h on concept c under the dstrbuton D. D The nstance s drawn from nput space X accordng to the dstrbuton D. If the error s equal to or less than then h s called -close to the target concept c under D. In the bnary classfcaton Pr[ h c ] label of nstance. D s equvalent to Pr [ yh < ] D where y s target The true error concept s ntroduced here to demonstrate the compensaton condton n the worst case of negatve learnng. The error s related to the dstrbuton of

examples and the hypothesis specified. The following is the definition of the true error.

Definition 5.1: The true error err_D(h) of hypothesis h with respect to target label y and distribution D is the probability that a randomly generated instance drawn from D is misclassified,

err_D(h) ≡ Pr_{x∼D}[ h(x) ≠ y ],   (5-2)

where the notation Pr_{x∼D} denotes probability taken over the instance distribution D and the pair (x, y) is an underlying labeled example.

The training error err_S(h) is the ratio of the number of misclassified examples to the total number of examples in the training data set S in terms of a hypothesis h:

err_S(h) = (1/|S|) Σ_{(x,y)∈S} [[ h(x) ≠ y ]],   (5-3)

where the expression [[π]] is defined to be 1 if the predicate π holds and 0 otherwise, and |S| is the size of data set S.

Suppose the training examples S are chosen i.i.d. according to the distribution D, and suppose there exists an algorithm A, learning on S, that outputs a hypothesis h which with probability at least 1 − δ is ε-close to the target concept c under D, for given parameters δ ∈ (0, 0.5) and ε ∈ (0, 0.5). This is denoted by

h = A(S, δ, ε),   Pr_S{ err_D(h) > ε } < δ.   (5-4)

The training error could be zero; for example, we can choose a high-VC-dimension kernel and a large regularization parameter in the soft-margin SVM. The predicting error is the ratio of the number of misclassified examples to the total number of examples in the testing data set.
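Because err_D(h) in (5-2) is defined over the unknown distribution D, in practice it can only be approximated. A minimal sketch, assuming a hypothetical sampling oracle sample_from_D is available, is:

```python
import numpy as np

def empirical_error(h, X, y):
    """err_S(h): fraction of examples in S that h misclassifies (5-3)."""
    return float(np.mean(h(X) != y))

def true_error_estimate(h, sample_from_D, n=100_000):
    """Monte-Carlo stand-in for err_D(h) (5-2): draw a large i.i.d. sample from D."""
    X, y = sample_from_D(n)
    return empirical_error(h, X, y)
```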

101 86 The testng error or predctng error err T h s the rato of the number of msclassfed eamples over total number of nstance on the testng data set T n terms of a hypothess h whch s defned below: errt h h y T 5-5 T T The epresson π s defned to be f the predcate π holds and otherwse. T s the sze of testng data set T the label y s the underlyng target value of nstance. A data set S s..d. drawn from D dvded nto four parts as TP TN FP and FN n terms of a fed bnary hypothess h by whch the true error s great than tranng error S err S h such that S TP UTN U FP U FN P TP UTN N FP U FN and err h D S because the S s a sub- set of the whole space wth dstrbuton D. TP TN { y y h >.. l} 5-6 FP FN { y y h <.. l} 5-7 The true error s not tranng error. The true error s a real metrc of generalzaton capacty. A learner s consstent f t outputs hypothess that perfectly fts the tranng eamples. P s the number of eamples n the postve set whereas N s the number of eamples n the negatve data set. For any classfcaton algorthm the result s meanngless f P s less than N. The random learner can gve error of.5. Snce P > N then [.5 where s true error of hypothess h. The true accuracy s. To deal wth mbalanced class problem one could modfy data dstrbuton by usng over-samplng and under-samplng[88]. Over-samplng replcates the mnorty class whle under-samplng removes partal majorty class. Both samplng makes tranng

dataset balanced. The drawback of under-sampling is that some useful data is lost; over-sampling may bring overfitting. Given an imbalanced data set D_1 = P_1 ∪ N_1, where P_1 is the subset of class + and N_1 is the subset of class −, the size of P_1 is much larger than the size of N_1, that is |P_1| >> |N_1|. We have three methods to alleviate the unbalancing: under-sampling, over-sampling, and the combination of the two, named hybrid-sampling.

The under-sampling technique is shown in Figure 5.2. The result of under-sampling is D_2 = P_2 ∪ N_2. The data subset N_2 is simply duplicated from N_1, while P_2 is extracted randomly from P_1.

Figure 5.2 Under-sampling strategy

The under-sampling coefficient γ and the over-sampling coefficient ρ are defined as

γ = |P_2| / |P_1|,   (5-8)
ρ = |N_2| / |N_1|,   (5-9)

where γ ∈ (0, 1] and ρ ∈ [1, +∞). In under-sampling ρ = 1, while in over-sampling γ = 1. The parameter c is the ratio of negative examples |N| to positive examples |P|, which describes the degree of unbalancing; parameter r is the reciprocal of c.
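The three sampling strategies can be expressed with the two coefficients of (5-8) and (5-9). The sketch below is illustrative (function name and conventions are ours; it assumes class + is the majority class P, as in this chapter): setting ρ = 1 with γ < 1 gives pure under-sampling, γ = 1 with ρ > 1 gives pure over-sampling, and any other combination is hybrid-sampling.

```python
import numpy as np

def hybrid_sample(X, y, gamma=1.0, rho=1.0, random_state=0):
    """Hybrid-sampling (5-8, 5-9): keep a fraction gamma of the majority class P and
    enlarge the minority class N to rho times its size by duplication."""
    rng = np.random.default_rng(random_state)
    pos = np.where(y == +1)[0]                   # P, assumed to be the majority class
    neg = np.where(y == -1)[0]                   # N, the minority class
    keep_pos = rng.choice(pos, size=int(round(gamma * len(pos))), replace=False)
    if rho <= 1.0:
        keep_neg = neg
    else:
        extra = rng.choice(neg, size=int(round((rho - 1.0) * len(neg))), replace=True)
        keep_neg = np.concatenate([neg, extra])  # all of N plus duplicated examples
    idx = np.concatenate([keep_pos, keep_neg])
    return X[idx], y[idx], len(keep_neg) / len(keep_pos)   # last value: new ratio c
```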

103 88 N c P c r 5- P N Where c ] means that the samplng technque cannot make N larger than P. The over-samplng technque s shown on Fgure 5.3. The data subset P s smply duplcated from P whle N s created by copyng all eamples n N and duplcatng some eamples n N. Fgure 5.3 Over-samplng strategy The hybrd-samplng combnes the under-samplng and over-samplng strateges as shown on Fgure 5.4. We get P γ P γ and N ρ N ρ. < > Fgure 5.4 Hybrd-samplng strategy Suppose D and D are the possbltes of nstance beng chosen under D and D respectvely as shown on Fgure 5.5. P and P are n class + whereas N and N are n class -. P and N s separated by h SVM S. D P +N s the result of samplng δ from D. The hypothess h h SVM D δ f SVM learnng algorthm s used here. Let p Pr[ h y] be the chance for nstance s msclassfed by hypothess h. The true error of hypothess h s learned from data set S. P s n class +

104 89 correctly classfed by h whle N s n class -. The hypothess h wth error s learned from D whch s the output of samplng from D. The samplng strategy used here s randomly etractng or duplcatng the eamples from D. When doman knowledge s used for eample only support vectors are etracted durng samplng t s not n ths case. Then the followng equaton s obtaned[95]: D p p D [ γ + ρ ] 5- γ + ρ The hypothess h wth error s gotten by tranng on S. We have the followng equaton: Then we get X D [ p ] γ { γ + ρ X γ ρ [ t + u] γ + ρ D [ p ρ ][ p ] + X D p [ p ]} 5- [ γ + ρ uρ] t 5-3 γ The data set samplng s to construct a new data set S from another data set D accordng to the controllng parameterα and samplng technque samplng. S samplng D α such that S α D 5-4 The data set S could be the subset of D f under-samplng technque s used P UnderSamplng P γ. Or data set D s a subset of S on the oversamplng N OverSamplng N. ρ

105 9 S - h P N D s t u v P undersamplngp r N oversamplngn p D - h FN TP +TN FP Fgure 5.5 Possblty of samplng data 5. Introducton to Statstcal Negatve Eamples Learnng Approach The hypothess of the naïve learnng of SNELA s a decson functon whch consders an nstance ether n class + or class -. The etended learnng of SNELA consders the confdence whch s assocated to an eample. The scheme of two stages SNELA s shown on Fgure 5.6. The fgure shows a scheme of negatve eample learnng. S s the tranng data set. The learner gets the hypothess h h SVM S by base learnng algorthm SVM. The data sets P and N are postve δ and negatve data of S respectvely separated by h. S s the tranng data set for audt whch s under samplng from postve data P and over samplng from negatve

106 9 data N. The negatve hypothess h s audt h h SVM S. TP and TN can be used to correct testng error created by base hypothess. δ Fgure 5.6 The scheme of two stages learnng ncludng base and negatve learnng. Snce P >> N constructon compensated data set S for negatve learnng s to use under samplng on P whereas over samplng on N. The data set S s not smplfed from the unon of the results of under and over samplng. It was translated nto compensated data set format for tranng by the functon audtsamplng defned n 5-6.

107 9 The secton of SNELA consders that the hypothess outputs a confdence value. Gven a tranng data set S { y }.. l testng data set T { }.. m and a parameter > μ > an SVM s traned on S to output hypothess h. When S s not separable negatve data subset N of S s not empty as shown on the fgure below. Fgure 5.7 The scheme of base learnng. S P U N The testng data set T s predcted by h to output correctly classfed nstances T P and msclassfed nstances T N snce h s not capable to separate all eamples as shown on the fgure below. Fgure 5.8 The scheme of base testng. T P s the correctly predcted nstances and T N s ncorrectly predcted nstances. In ths stage T P and T N are unknown wheret U T P TN.

108 93 If some of nstances n the T N were known n advance those nstances could be nversed nto the opposte class. For eample the target label of an nstance 5 s class +. The nstance 5 s a msclassfed nstance n the T by usng hypothess h so nstance 5 s predcted to be class -. In ths stage of practcal applcaton whether the nstance 5 s correctly classfed or msclassfed s unknown f we do not have known target label. If 5 can be predcted to be msclassfed nstance by another hypothess h wth a certan confdence say 7% possblty 5 can be corrected from class - nto class +. For that reason the key pont s how to construct the hypothess h. In order to construct the hypothess h a tranng data set S named compensated data set s requred to be constructed. The nformaton ncludng tranng data set S hypothess h postve eamples P and negatve eamples N are avalable before h s constructed. The man dea s to etract partal eamples from P and all eamples from N because P >> N. Another reason to nclude all negatve eamples s because the goal of h s to predct msclassfed nstances whch are strong related to the negatve N eample. P N S Fgure 5.9 Constructon of compensated tranng data S for h usng under-samplng strategy The compensated tranng data set S s defned below: Let P undersamplng P and N N 5-5 γ

S′ = auditsampling(S) = {(xᵢ, h(xᵢ)·yᵢ) : xᵢ ∈ P₁} ∪ {(xᵢ, h(xᵢ)·yᵢ) : xᵢ ∈ N₁}    (5-6)

so that every example in S′ carries the label +1 when h classifies it correctly and −1 when h misclassifies it. The hypothesis h′ is learned from the training data set S′. If over-sampling is used, let N₁ = oversampling(N, ρ). The testing data set T is predicted by both hypotheses h and h′:

T_P = {xᵢ ∈ T : yᵢ = h(xᵢ)},  T_N = {xᵢ ∈ T : yᵢ ≠ h(xᵢ)},  where yᵢ denotes the (unknown) target label of xᵢ    (5-7)

Assume the errors of the hypotheses h and h′ are ε₁ and ε₂, respectively. If ε₁ and ε₂ meet certain criteria, which will be discussed in Section 5.4, then h′ can improve the prediction of h on the testing data set T. If h has low confidence in predicting an instance xᵢ to be yᵢ while h′ has high confidence that xᵢ is negative, then the label yᵢ predicted by h should be inverted: let yᵢ be −yᵢ.
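A minimal sketch of the compensated-set construction (5-5)–(5-6) and of this correction rule is given below, assuming scikit-learn's SVC with probability estimates as the base and audit learners. The confidence thresholds (0.5 for the base hypothesis, 0.7 for the audit) and the function names audit_sampling and corrected_predict are illustrative placeholders for the criteria discussed in Section 5.4, not values prescribed by this work; class labels are assumed to be coded as +1/−1.

    import numpy as np
    from sklearn.svm import SVC

    def audit_sampling(P, N, gamma, seed=None):
        """Build S': part of P, all of N, relabeled by the correctness of h (eq. 5-5, 5-6)."""
        rng = np.random.default_rng(seed)
        X_P, X_N = P[0], N[0]
        keep = rng.choice(len(X_P), size=int(gamma * len(X_P)), replace=False)
        X_audit = np.vstack([X_P[keep], X_N])
        # +1 for examples h classified correctly (from P), -1 for those it misclassified (from N)
        y_audit = np.concatenate([np.ones(len(keep)), -np.ones(len(X_N))])
        return X_audit, y_audit

    def corrected_predict(h, h_audit, X_test, base_conf=0.5, audit_conf=0.7):
        """Invert h's label when h is unconfident and the audit confidently flags an error."""
        y_pred = h.predict(X_test)
        p_base = h.predict_proba(X_test).max(axis=1)        # confidence of h in its own label
        neg_col = list(h_audit.classes_).index(-1.0)        # audit column for "h is wrong"
        p_wrong = h_audit.predict_proba(X_test)[:, neg_col]
        flip = (p_base < base_conf) & (p_wrong > audit_conf)
        y_pred[flip] = -y_pred[flip]                        # valid for +1/-1 labels
        return y_pred

    # Usage sketch: h, P, N come from the base stage; then
    # X_audit, y_audit = audit_sampling(P, N, gamma=0.3)
    # h_audit = SVC(kernel="rbf", probability=True).fit(X_audit, y_audit)
    # y_corrected = corrected_predict(h, h_audit, X_test)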

5.3 Analysis of Two-Stage Learning

Theoretically, the true error is less than the predicting error ε_T, while the predicting error ε_T is greater than the training error ε_S; in practice ε_S ≤ ε_T if the hypothesis is fixed. This results from the approximation and estimation error of the learning machine. The true error is hard to obtain without exact knowledge of the data distribution. Therefore, in most cases the predicting error is substituted for the true error as an approximate estimate, because the true error is impossible to obtain in almost all applications. When we work on the worst case of negative learning, the training error is used in place of the predicting error or the true error. The reason is that if a hypothesis with a small training error cannot bring the final prediction to the desired accuracy, the true error must be greater than the desired error. Therefore, the following discussion assumes that the true error, the training error, and the predicting error are not discriminated.

The training data set S has l examples, l = |S|. The true errors ε₁ ∈ [0, 0.5) and ε₂ ∈ [0, 0.5) are the errors of the hypotheses h and h′. Then |P| > |N| and |P₁| > |N₁| for the base training data set S = P ∪ N and the compensated training data set S′ = P₁ ∪ N₁. The parameter c is the ratio of the number of negative examples |N| over the number of positive examples |P|, and the parameter r is the reciprocal of c:

c = |N| / |P|,  r = 1 / c    (5-8)

5.3.1 Under-Sampling

In under-sampling, P₁ is a subset of P, where all examples in P₁ are extracted from P randomly. N is a small data set compared to P, so N₁ keeps all examples from N except outliers:

P₁ ⊆ P,  N₁ = N,  |P₁| = r₂·|N₁|    (5-9)

where r₂ denotes the ratio of the number of positive to negative examples in S′. The minimum of c and the maximum of r in the training phase of base learning, in terms of the hypothesis h, are also defined below; |N| can be considered a fixed value:

c_min = min(|N| / |P|),  r_max = max(|P| / |N|) = 1 / c_min,  with c ∈ [c_min, 1] and r ∈ [1, r_max]    (5-10)

In the negative learning stage, the under-sampling strategy is adopted, so let |N₁| = |N| = ε₁·l, because |S| = l and |N| << |P|. For example, |N| = 0.25·|P| if the accuracy of h is 80%. The training data of negative learning is S′ = P₁ ∪ N₁. The size of S′ and the predicted data subsets in terms of the hypothesis h′ are listed below:

|P₁| = r₂·|N₁|    (5-11)

|P₁| = ε₁·l·r₂,  |N₁| = ε₁·l, and the confusion counts FP, TN, TP, and FN of h′ on S′ follow by applying its error ε₂ to P₁ and N₁    (5-12)

To make negative learning useful, the number of correctly judged negative examples must be greater than the number of misclassified examples, including misclassified negative examples and positive examples. Then the following inequality must be met:

TN > FP + FN,  ε₁·l·(1 − ε₂) > ε₁·l·ε₂·r₂ + ε₁·l·ε₂,  ε₂ < 1 / (2 + r₂)    (5-13)

The explanation is shown in Figure 5.10. The strong condition TN > FP + FN needs to be met in order to improve the accuracy of base learning, and the weak condition TN > FN must be met as well. The FP examples are misclassified by base learning and are still not judged correctly in negative learning. This tells us that the upper bound of the true error in the negative learning stage is 1/(2 + r₂). If the training error of negative learning is greater than the upper bound 1/(2 + r₂), negative learning should not be carried out.
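The usefulness test in (5-13) reduces to comparing confusion counts for the audit on the compensated set. The sketch below is a minimal illustration under the labeling convention of (5-6), where +1 marks examples the base hypothesis classified correctly and −1 marks examples it misclassified; the function name negative_learning_is_useful is illustrative.

    import numpy as np

    def negative_learning_is_useful(h_audit, X_audit, y_audit):
        """Check the strong condition TN > FP + FN and the weak condition TN > FN."""
        pred = h_audit.predict(X_audit)
        TN = np.sum((y_audit == -1) & (pred == -1))   # base errors the audit catches
        FP = np.sum((y_audit == -1) & (pred == +1))   # base errors the audit misses
        FN = np.sum((y_audit == +1) & (pred == -1))   # correct base decisions the audit flags
        strong = TN > FP + FN
        weak = TN > FN
        return strong, weak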

As shown in Figure 5.11, the relationship between the true error ε₂ of negative learning and the ratio of the sizes of the positive and negative examples in base learning indicates that this goal is not hard to achieve, because the area under the curve is small compared to the area above the curve.

Figure 5.10 The number of correctly judged examples in the negative learning is TN. The FN examples are correctly classified in base learning but are not judged correctly.

Figure 5.11 The ε₂–r relationship diagram.
