On classifier behavior in the presence of mislabeling noise


Data Min Knowl Disc

On classifier behavior in the presence of mislabeling noise

Katsiaryna Mirylenka (1) · George Giannakopoulos (2) · Le Minh Do (3) · Themis Palpanas (4)

Received: 12 November 2015 / Accepted: 1 November 2016
© The Author(s) 2016

Responsible editor: Johannes Fuernkranz.

Katsiaryna Mirylenka: Work mainly done while at the University of Trento, Italy.

Corresponding author: Katsiaryna Mirylenka, kmi@zurich.ibm.com
George Giannakopoulos: ggianna@iit.demokritos.gr
Le Minh Do: leminh.do@studenti.unitn.it
Themis Palpanas: themis@mi.parisdescartes.fr

1 IBM Research Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
2 Institute of Informatics and Telecommunications of NCSR Demokritos, Ag. Paraskevi, Athens, Greece
3 University of Trento, via Sommarive 5, Trento, Italy
4 Department of Computer Science, Paris Descartes University, 45 Rue des Saints-Pères, Paris, France

Abstract Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise. Therefore, the choice of the right algorithm for a particular learning problem is crucial. The contribution of this paper is towards two, dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the sigmoid rule framework, which can be used to choose the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of the sigmoid function using

five representative non-sequential classifiers, namely Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, and three widely used sequential classifiers based on hidden Markov models, conditional random fields and recurrent neural networks. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, showing that we can estimate an expected performance over a dataset regardless of the underlying algorithm. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of an evolving nature.

Keywords Classification · Sequential classifiers · Classifier evaluation · Handling noise · Concept drift

1 Introduction

Transforming vast amounts of collected, possibly noisy, data into useful information through the processes of clustering and classification has important applications in many domains. One example is the detection of customer preferences for marketing. Another example is the detection of negligent providers on Mechanical Turk and the reduction of the effect of their behavior on task results. The machine learning and data mining communities have extensively studied the behavior of classifiers in different settings (e.g., Han and Kamber 2006; Theodoridis and Koutroumbas 2003); however, the effect of noise on the classification task is still an important and open problem. The importance of studying noisy data settings is augmented by the fact that noise is very common in a variety of large-scale data sources, such as sensor networks and the Web, while machine learning algorithms (both classic and sequential) perform differently in settings with varying levels of labeling noise. Thus, there arises a need for a unified framework aimed at studying the behavior of learning algorithms in the presence of noise, depending on the specifics of each particular dataset.

Labeling noise is common in cases of concept drift, where a target concept shifts over time, rendering previous training instances obsolete. Essentially, in the case of concept drift, feature noise causes the labels of previous training instances to be out of date and, thus, equivalent to label noise. Drifting concepts appear in a variety of settings in the real world, such as the state of a free market, or the traits of the most viewed movie. Thus, we expect this work to offer intuition and tools for meta-learning in similar, noise-prone, real cases of learning.

Earlier work has shown that the performance [1] of a classifier in the presence of noise can be effectively approximated by a sigmoid function, which relates the signal-to-noise ratio in training data to the expected performance of the classifier (Giannakopoulos and Palpanas 2010). We call this fact the sigmoid rule. The sigmoid rule states that the expected performance of each learning algorithm can be seen as a sigmoid function.

[1] When we write "performance" of an algorithm, we mean the classification accuracy.

Fig. 1 An example of a sigmoid function relating signal-to-noise (ratio of mislabeled instances) in the training set (horizontal axis) to expected classification performance (vertical axis)

This function indicates the impact of the signal-to-noise ratio in the training set on the performance of the algorithm, under an assumption of monotonicity of the change of performance (i.e., when cleaner data are provided, the algorithm is expected to perform better). The added value of this view is that it allows a single model, with a limited number of parameters, to be used as an illustration of the behavior of all learning algorithms in a noisy setting. We illustrate an example of a sigmoid function in Fig. 1 and elaborate on the concept in Sect. 3.

In this work, we study the effect of training set label noise [2] on a classification task. We further build on this study to achieve two, dual goals:

- We show how to support the comparison between algorithms and their behavior in a noisy setting. The workflow in this case is as follows: we select an algorithm; we sample the input for training data; we estimate a set of parameters (based on the proposed sigmoid rule framework we discuss below) that characterize the behavior of the algorithm in the presence of noise; we compare algorithms based on those parameters. Thus, in Sect. 4, we examine how much added benefit we can derive from the sigmoid rule model, by elaborating on the parameters of the sigmoid and illustrating the connection of each such parameter to the learner behavior. Based on the most prominent parameters we define several dimensions characterizing the behavior of learning algorithms. We, finally, illustrate how these dimensions can be used to select learning algorithms in different settings, based on user requirements.
- We show how to allow for the prediction of algorithm performance, given dataset attributes related to a specific classification problem. The workflow in this case is as follows: we calculate a set of dataset attributes (i.e., the number of classes, features and instances, and the fractal dimensionality; de Sousa et al. 2006); we use estimation (regression) models to predict the expected performance regardless of the choice of the underlying algorithm. We study the correlation between dataset characteristics and modeled performance in Sect. 7.

Moreover, we extend our study to sequential classification cases, to illustrate the broad applicability and usefulness of the sigmoid rule framework (SRF). The SRF uses a model of the expected performance of learning algorithms based on a sigmoid function of the signal-to-noise ratio in the training instances. SRF is specialized for noisy settings, providing more insights regarding the behavior of the algorithms. Essentially, it describes a meta-model (the SRF sigmoid) over various algorithms, thus making it a special case of meta-modeling.

[2] Throughout the rest of this work we will use the term noise to refer to label noise, unless otherwise indicated.

Contributions In summary, we make the following contributions.

- We define a set of intuitive criteria based on the SRF that can be used to compare the behavior of learning algorithms in the presence of noise. This set of criteria provides both quantitative and qualitative support for learner selection in different settings.
- We demonstrate that there is a connection between the SRF dimensions and the characteristics of the underlying dataset, using both a correlation study and regression modeling based on linear and logistic regression. In both cases we discovered statistically significant relations between SRF dimensions and dataset characteristics. Our analysis shows that these relations are stronger for the sequential classifiers than for the non-sequential, traditional classifiers, where the instances are considered to be independent.
- Our results are based on an extensive experimental evaluation, using 10 synthetic and 31 real datasets from diverse domains. The heterogeneity of the dataset collection validates the general applicability of the SRF for both non-sequential and sequential learners.

The rest of this paper [3] is organized as follows. We review the related work (Sect. 2) and define the problem studied (Sect. 3). We then focus on the sigmoid function and define the SRF dimensions (Sect. 4), describing their qualitative value. We test our framework on multiple datasets (Sect. 6), statistically analyzing the connection between dataset properties and SRF dimensions (Sect. 7) for both sequential and non-sequential classification tasks. Section 8 provides a summary of the proposed approach and the experimental results. Finally, we conclude in Sect. 9.

2 Related work

Since 1984, when Valiant published his theory of the learnable (Valiant 1984), setting the basis of probably approximately correct (PAC) learning, numerous different machine learning algorithms and classification problems have been devised and studied in the machine learning community. Given the variety of existing learning algorithms, researchers are often interested in obtaining the best algorithm for their particular tasks. Algorithm selection is considered to be a part of the meta-learning domain (Giraud-Carrier et al. 2004). According to the no-free-lunch (NFL) theorems, described in Wolpert (2001) and proven in Wolpert (1996a, b), there is no overall best classification algorithm, given the fact that all algorithms have equal performance on average over all datasets. Nevertheless, we are usually interested in applying machine learning algorithms to a specific dataset with certain characteristics, in which case we are not limited by the NFL theorems.

[3] A preliminary version of this work has appeared in Mirylenka et al. (2012).

The results of the NFL theorems hint at comparing different classification algorithms on the basis of dataset characteristics (Ali and Smith 2006), which is the aim of our work in this research direction. Concerning the measures of performance that help distinguish among learners, in Ali and Smith (2006) the authors compared algorithms on a large number of datasets, using measures of performance that take into account the distribution of the classes, which is a characteristic of a dataset. The Area Under the receiver operating Curve (AUC) is another measure used to assess machine learning algorithms and to divide them into groups of classifiers that have statistically significant differences in performance (Bradley 1997). In all the above studies, the analysis of performance has been applied to datasets without extra noise, while we study the behavior of classification algorithms in noisy settings.

In our study, we use the fact shown by Giannakopoulos and Palpanas (2010, 2013) about concept drift, which illustrates that a sigmoid function can accurately describe performance in the presence of varying levels of label noise in the training set. In an early influential work, Kuh et al. (1990) indicated and studied the problem of concept attainment in the presence of noise in the STAGGER system. In this paper, we analytically study the sigmoid function to determine the set of parameters that can be used to support the selection of the learner in different noisy classification settings.

The behavior of machine learning classifiers in the presence of noise is also considered in the work of Kalapanidas et al. (2003). The artificial datasets used for classification were created on the basis of predefined linear and nonlinear regression models, and noise was injected in the features, instead of the class labels as in our case. Noisy models of non-Markovian processes using reinforcement learning algorithms and the methods of temporal difference are analyzed by Pendrith and Sammut (1994). Chevaleyre and Zucker (2000) examine multiple-instance induction of rules for different noise models. There are also theoretical studies on regression algorithms for noisy data (Teytaud 2001) and works on denoising, like Li et al. (2002), where a wavelet-based noise removal technique was shown to increase the efficiency of four learners. Both noise-related studies (Teytaud 2001; Li et al. 2002) dealt with noise in the attributes. However, we study label-related noise and do not consider a specific noise model, which is a different problem. Label-related noise corresponds mostly to the concept drift scenario, as was also discussed in Sect. 1. Moreover, we estimate performance as a function of the signal-to-noise ratio, contrary to the works described above.

In the work of Nettleton et al. (2010), the authors study the effect of different types of noise on the precision of supervised learning techniques. They address this issue by systematically comparing how different degrees of noise affect supervised learners that belong to different paradigms. They consider four classifiers, namely the Naïve Bayes classifier, the decision tree classifier, the IBk instance-based learner and the SMO support vector machine. They test them on a collection of data sets that are perturbed with noise in the input attributes and noise in the output class, and observe that Naïve Bayes is more robust to noisy data than the other three techniques. The performance was evaluated on 13 datasets. However, no general framework was proposed for modeling the effect of noise on the performance of the classifiers.
In contrast, we propose the sigmoid rule framework for comparing the behavior of classifiers. We also

introduce different dimensions that can be used to formulate the criteria for comparing the algorithms depending on user needs. Moreover, we also apply this framework to sequential classifiers, which were not considered in the above studies.

Previous studies on meta-learning and concept drift (Klinkenberg 2005; Widmer 1997) describe approaches that adjust to changing user interests, taking into account the context. Meta-learning can occur at various levels, e.g., training instance selection, learning algorithm and parameter selection, window size selection, based on online measurements. These measurements may include error rates, distribution of class instances in a window of recent instances, and other related data or classification traits. However, these works do not explicitly address label noise, and cannot provide an a priori estimation of classifier performance in such situations, as in our work. Recent works on ensemble-based meta-learning (e.g., Cruz et al. 2015) deal with local characteristics of the data and optimize ensembles of classifiers (ECs), but do less to provide dataset-related and algorithm-independent results, as we do in this work (Sect. 7).

In Heywood (2015), the challenges of evolutionary model building are discussed, together with related approaches in the literature aimed at (evolving) streaming data in classification tasks. In this overview, the author discusses (through a multitude of works) a variety of required characteristics of learning methods, and how evolutionary computation can contribute towards these characteristics. We believe that our work is complementary, in that it can provide a basis for future meta-learning and evolutionary algorithms, where fine traits of learning (e.g., stability of performance vs. maximum performance) can prioritize selection in a learning setting. The SRF view of meta-learning provides another important tool to the currently available approaches. This view may be even more effective taking into account references to recurring contexts (Pechenizkiy 2015), where the variance of noise levels may be predictable after a period of observation and, thus, knowing an algorithm's behavior within these levels constitutes important knowledge. Furthermore, in this work we propose a novel direction of meta-learning, moving the focus from accuracy to transparency and trust (Pechenizkiy 2015). We achieve this by explaining the expected classification behavior under novel, interesting axes of analysis, which take into account not only the absolute values of the performance of a classifier, but also characterize the behavior of the performance change from various perspectives.

Other meta-learning literature has focused on the following topics: the impact of the dataset and algorithm features on the learning performance (Brazdil et al. 2003, 2009; won Lee and Giraud-Carrier 2008; Rossi et al. 2014; Garcia et al. 2016; Mantovani et al. 2015); the identification and handling of different types of noise (Brazdil et al. 2003; Rossi et al. 2014) and concept drift (Rossi et al. 2014; Garcia et al. 2016) in the learning setting; and meta-learning frameworks for fine-tuning of learning systems (Brazdil et al. 2003, 2009; Smith et al. 2014; Abdulrahman et al. 2015). Concerning the study of dataset and algorithm features, Brazdil et al. (2003, 2009) describe a number of dataset features and discuss their usefulness for the performance of individual classifiers. However, these works do not include a systematic analysis of all dataset-classifier pairs.
The work of won Lee and Giraud-Carrier (2008) introduces a dataset characteristic called hardness and demonstrates that it contributes significantly to the performance of any classifier, which supports our conclusion that the dataset

characteristics can be used to assess classifier performance a priori. Mantovani et al. (2015) address the problem of choosing the parameters of a learning algorithm that would maximize its performance on a new dataset based on its characteristics. The proposed solution uses a meta-classifier with 81 features, which include the number of attributes and classes in the dataset, statistical moments, information theory measures, and others. That work does not select the most effective dataset features. Rossi et al. (2014) and Garcia et al. (2016) apply meta-learning to regression problems and to noise filters, respectively. Similarly to Mantovani et al. (2015), these works also use various dataset statistics as features, without a systematic analysis of feature importance.

Concerning noise, in our work we study the correlation between labeling noise (as opposed to feature noise) and the expected performance. Rossi et al. (2014) and Garcia et al. (2016) also consider noisy settings. Similarly to our work, Rossi et al. (2014) focus on concept drift scenarios. However, that work does not provide generally applicable criteria for selecting the learning algorithm and does not deal with explicit label noise. Garcia et al. (2016) discuss the performance of noise filters, but do not provide a generic model of algorithm behavior in the presence of noise. In comparison to these works, we additionally address the performance of sequential models for classification in the presence of noise. We provide the added value of an estimated analytic connection between noise levels and performance (described by the sigmoid curve). The sigmoid rule allows an adaptive meta-learning system to have prior knowledge regarding algorithms and their expected performance at different levels of noise and under different requirements (stability, average performance, etc.). Meta-learning frameworks that can estimate noise levels in a dataset will also benefit from our analysis.

General meta-learning frameworks are described in detail in Brazdil et al. (2003, 2009). Efficiency modifications for fast discovery of optimal algorithms are proposed by Abdulrahman et al. (2015). In our work, we focus on a systematic analysis of the performance of classifiers depending on the dataset and classifier features, and the level of noise. A future study can adopt the architecture of a meta-learning framework for classifier selection in noisy settings. Smith et al. (2014) propose a collaborative filtering approach for meta-learning, arguing that the relevant algorithm and dataset features can be chosen while applying collaborative filtering. It is not clear how this approach would perform in the case of a concept drift scenario and in the case of multi-criteria performance objectives. In this work we analyze the performance of a classifier in an environment with varying label noise and study the effect of dataset characteristics on performance behavior. We leave the comparison of meta-learning platforms for future work.

Sequential data, and especially time series data collected from sensors, are often affected by noise. That is, the quality of the data may be decreased by errors due to limitations in the tolerance of the measurement equipment. Sequential classification based on Markov chains with noisy training data is considered in the work of Dupont (2006). The paper shows that smoothed Markov chains are very robust to classification noise. The same paper discusses the relation between the classification accuracy and the test set perplexity that is often used to measure prediction quality.
However, unlike our work, no specific formalization is proposed for analyzing the effect of different noise levels on the performance of the classifier. The development of such a formalization is becoming increasingly important as sequential and time series data classification are

applied in diverse domains, such as financial analysis, bioinformatics and information retrieval. For example, a recent survey (Xing et al. 2010) considers various methods of sequential classification in terms of methodologies and application domains. The methods for sequential classification we focus on are introduced in the works of Rabiner and Juang (1986), Sutton and McCallum (2010), and Elman (1990), which describe hidden Markov models, conditional random fields, and recurrent neural networks, respectively. The techniques described by Mirylenka et al. (2013, 2015), such as conditional heavy hitters, allow estimating sequential models in real time. The efficiency of such techniques enables the applicability of sequential classification to sensor data analysis and moving trajectories.

3 Problem statement and preliminaries

We consider settings where a classification task is performed under varying levels of label noise, similarly to a concept drift scenario, where the labels are gradually changing and the noise level is not known in advance. Such settings are common and have been detailed in previous works (e.g., see the definition of the problem of the demanding lord (Giannakopoulos and Palpanas 2013), where a user changes her preferences gradually over time and an automatic system is called to adapt to this change). Our goal is to recommend a classifier that is beneficial for a given dataset, based on custom algorithm optimality criteria (e.g., stability under noise vs. maximum possible performance). The only knowledge required for the task concerns characteristics of the dataset, such as the number of classes, the number of attributes, and the intrinsic dimensionality.

More formally, we define a training set as T. Part of the instances in this dataset have true labels. We define this set of instances as S and term them signal instances. The rest of the instances have noisy (i.e., erroneous) labels. We define this set as N and term them noisy instances. Hence, S ⊆ T, N ⊆ T and S ∪ N = T, while S ∩ N = ∅. Then S = |S| and N = |N| represent the signal magnitude and the noise magnitude of the training set T, respectively. We also allow S = 0 and N = |T|, when the whole training set is no longer valid because a shift has just occurred. Correspondingly, N = 0 and S = |T|, when no noise is induced in the training set.

Given the above definitions, we define the signal-to-noise ratio Z at a given moment in time:

Z = log(1 + S) − log(1 + N) = log((1 + S)/(1 + N)),   (1)

where log(·) represents the natural logarithm. We add one to the signal and noise magnitudes so that the logarithm is defined for their zero values.
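As an illustration, Eq. 1 can be evaluated directly from the counts of correctly and incorrectly labeled training instances; the following is a minimal sketch (function and variable names are ours, not from the paper).

```python
import math

def signal_to_noise(num_signal: int, num_noise: int) -> float:
    """Signal-to-noise ratio Z of Eq. 1: log(1 + S) - log(1 + N).

    The +1 terms keep the logarithm defined when either count is zero.
    """
    return math.log(1 + num_signal) - math.log(1 + num_noise)

# Example: a training set of 1000 instances with 5% mislabeled instances.
print(signal_to_noise(950, 50))   # positive Z: signal dominates
print(signal_to_noise(0, 1000))   # all labels invalid, e.g., right after a concept shift
```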

Fig. 2 Sigmoid function and points of interest: the inflection point Z_inf (essentially the point where the curve changes significantly), the zeros of the third derivative Z^(3)_1 and Z^(3)_2, the active area [Z*, Z**] of width d_alg, and the performance span r_alg

3.1 Characteristic transfer function

In order to describe the performance of a classifier we use the sigmoid rule, which was introduced and discussed in previous studies (Giannakopoulos and Palpanas 2010, 2013). The sigmoid rule uses a function that relates the signal-to-noise ratio of the training set to the expected performance of a learner. This function is called the characteristic transfer function (CTF) of the learning algorithm. In this work, we also refer to it as the sigmoid function of the algorithm, and use the terms CTF and sigmoid function interchangeably.

Given that an algorithm has a minimum performance m and a maximum performance M for a given domain of signal-to-noise values, the function that describes the average performance as a function of the signal-to-noise ratio Z (Eq. 1) is of the form (Giannakopoulos and Palpanas 2010, 2013):

f(Z) = m + (M − m) · 1/(1 + b·exp(−c·(Z − d))),   (2)

where m ≤ M, m, M ∈ [0, 1], b, c ∈ R+, d ∈ R are the parameters of the sigmoid function. The number of parameters can be reduced, though their presence allows for a more accurate estimation of the performance of a classifier, as has been empirically shown by Giannakopoulos and Palpanas (2013). An example of a sigmoid function can be seen in Fig. 2. The definitions of the main concepts used in the paper are given in Table 1.
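The CTF of Eq. 2 is straightforward to evaluate; a minimal sketch (names and parameter values are ours):

```python
import math

def ctf(z: float, m: float, M: float, b: float, c: float, d: float) -> float:
    """Characteristic transfer function of Eq. 2.

    m, M: minimum and maximum expected performance (0 <= m <= M <= 1)
    b, c: positive shape parameters (c is the slope indicator)
    d:    horizontal shift of the curve
    """
    return m + (M - m) / (1.0 + b * math.exp(-c * (z - d)))

# Expected accuracy of a hypothetical learner at a few signal-to-noise levels.
for z in (-2.0, 0.0, 2.0, 4.0):
    print(z, round(ctf(z, m=0.1, M=0.9, b=2.5, c=1.0, d=1.0), 3))
```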

Table 1 The definitions of the main concepts

- Signal-to-noise ratio: the ratio of correctly labeled training instances to the incorrectly labeled training instances.
- Performance: a measure of the accuracy of an algorithm in a noisy training setting. In a classification problem, we use accuracy.
- Characteristic transfer function (CTF): a function connecting the signal-to-noise in the training dataset to the expected performance of a learning algorithm.
- Sigmoid rule framework (SRF): an approach that models the expected performance of a learning algorithm, using a sigmoid function as the CTF for any algorithm.

The basis of the CTF lies on probably approximately correct learning (Haussler 1990; Valiant 1984) and, more specifically, on the expectation that in PAC-learnable settings enough (randomly drawn) samples can help a system select a hypothesis identifying concepts. Thus, in PAC learning, more correctly labeled instances decrease the probability of error, reaching a limit ε. In the CTF modeling of learning we have tried to model what happens in the presence of mislabeled instances, while keeping the limit cases aligned to PAC learning.

Let us consider the case where a set of concepts C is (PAC) learnable. This means that, provided with enough learning instances, the system will end up learning (a hypothesis H identifying) the concepts. In this case, the selected, correctly labeled instances that were input to the learning system as training are all signal, with no noise. The error of this system tends to be below ε.

Now, let us consider the case where there exists a false labeling probability 1 ≥ P_f ≥ 0 when the input instances are provided to the learner. These instances constitute a population of training instances that do not reflect the original, true instance distributions for C. In this case, the system will (try to) learn a different set of concepts C′, which is a function of the distributions of C and of the false concept set distributions F ≠ C. It is easy to deduce that, if the system learns C′, i.e., it forms a hypothesis H′ ≠ H that can map instances to C′ with an error probability below ε after enough instances, then a portion of these mappings/predictions will stand in C′ but not in C. The more mislabeled instances are provided as input to the learner, the higher the probability that the predictions will be wrong with respect to C. In the extreme case that all instances of C are provided as inputs and the learner learns all the space, the probability of error can be exactly P_f, since the learner will have mistakenly and perfectly learned all the falsely labeled instances. If all the input instances are provided mislabeled, then the learner will have learned the full set of false concepts C′ = F and C will have no effect on the learning. In this case the performance of the system with respect to C will be minimum. Different values of P_f are expected to vary the error rates between the minimum and the maximum value. If we are interested in the average error rate (and, dually, the performance) of the learner over several experiments in the presence of noise, we can only assume that, after enough tries, increasing P_f will lead to a higher error ε′ = ε + P_f > ε, for P_f > 0.

At this point we provide an alternative view of P_f: it is directly connected to the amount of noise N = P_f·|T| and signal S = (1 − P_f)·|T| in the signal-to-noise ratio Z of Eq. 1, where T is the training set introduced above.

To summarize, the intuition behind the sigmoid in Eq. 2 is based on the fact that a (non-trivial) learning algorithm starts to perform well after a certain ratio of good to bad examples has been observed. From that moment on, the performance of the algorithm constantly improves (on average) as the ratio is improved (monotonicity assumption), until the point where the best performance is reached. Then, no matter how much the ratio of good to bad examples increases, there is little change, because the algorithm cannot do much better, since it has learned the input concepts (and also possibly due to its generalization property, where such a property is applicable).

We consider the CTF to be characteristic of an algorithm for a given dataset (i.e., set of concepts, instance and hypothesis space). It has been demonstrated that the sigmoid can be estimated from training sets of varying Z and, then, it can be used as a known function for a given algorithm (Giannakopoulos and Palpanas 2013). In this work, we demonstrate that the characteristics of a dataset (e.g., the Vapnik-Chervonenkis or VC dimension; Vapnik 1998) have a strong influence over the sigmoid parameters (refer to Sect. 6). This property of the sigmoid allows reusing sigmoids across datasets with similar characteristics and suggesting the most promising learner for a new dataset, based on custom optimality criteria.

The behavior of different machine learning algorithms in the presence of noise can be compared along several axes of comparison, based on the parameters of the sigmoid function. The parameters related to performance include: (a) the minimal performance m; (b) the maximal performance M; (c) the span of the performance range r_alg = M − m.

With regard to the sensitivity of performance to the change in the signal-to-noise ratio, we consider: (a) within which range of noise levels a change in the noise leads to a significant change in performance; (b) how we can tell apart algorithms that improve their performance even when the signal-to-noise levels are low from those that only improve in high ranges of the signal-to-noise ratio; (c) how we can measure the stability of performance of an algorithm against varying noise; (d) at what noise level an algorithm reaches its average performance; (e) whether reducing the noise in a dataset is expected to have a significant impact on the performance.

To answer these questions we perform an analytic study of the sigmoid function of a particular algorithm by the means of traditional function analysis. This analysis helps to devise measurable dimensions that can answer our questions about the parameters related to the performance of classifiers. In the next section we introduce the SRF, which proposes several axes for algorithm comparison, based on the CTF. The set of parameters of the CTF and the proposed SRF is summarized in Table 2 for quick reference.

4 The sigmoid rule framework

We investigate the properties of the sigmoid function (Eq. 2) in order to determine how each of the parameters m, M, b, c, and d affects the shape of the sigmoid, and how this translates to the expected performance of the learning algorithm. We start with a direct analysis of the sigmoid function and its parameters.

Table 2 Notation of sigmoid-related variables and SRF dimensions

Sigmoid-related variables:
- S: signal magnitude in the training set
- N: noise magnitude in the training set
- Z: signal-to-noise ratio
- m, M, b, c, d: parameters of the sigmoid
- f(Z): sigmoid function
- Z_inf: point of inflection of the sigmoid (zero of the 2nd derivative)
- Z^(3)_1, Z^(3)_2: zeros of the 3rd derivative of the sigmoid
- d_s: distance between Z_inf and Z^(3)_1; characterizes the slope of the sigmoid
- [Z*, Z**]: active noise range of an algorithm, where performance changes significantly
- f^(-1)(·): inverse sigmoid function

SRF dimensions:
- m: minimal performance of a classification algorithm
- M: maximal performance of a classification algorithm
- r_alg: span of the performance range
- c: slope indicator of the sigmoid function of an algorithm
- d_alg: width of the active area of algorithm performance
- r_alg/d_alg: performance improvement of an algorithm over signal-to-noise ratio change

In Fig. 2 we see an illustration of a sigmoid, where we have indicated the points related to the curve parameters. The reader is referred to Table 2 for a summary of the symbols. The domain of the sigmoid is, in the general case, Z ∈ (−∞, +∞). The range of values is (m, M), as can also be seen in Fig. 2. We then consider the derivatives of the sigmoid function in order to study its characteristics. Since the first-order derivative is always positive, f′(Z) > 0 for Z ∈ (−∞, +∞), the function f(Z) is monotonically increasing, which corresponds to the monotonicity assumption made in Giannakopoulos and Palpanas (2013). For the second-order derivative, f″(Z), it holds that f″(Z) = 0 if Z = d + (1/c)·log b, and the point Z_inf = d + (1/c)·log b is the point of inflection, which shows when the curvature of the function changes its sign. In the case of the sigmoid, the point of inflection Z_inf (the middle point in Fig. 2) is the point of symmetry, corresponding to the average performance of the algorithm. Furthermore, Z_inf indicates the shift of the sigmoid with respect to the origin, which is equal to d + (1/c)·log b.

Fig. 3 Varying the c parameter: c = 1 (left), c = 3 (right). The other parameters are the same: m = 0, M = 1, b = 2.5, d = 3

So, the shift of the sigmoid and the average performance for the signal-to-noise range depend on three parameters: b, c, and d.

Learning algorithms can be compared based on the slope of their sigmoids. The slope of the sigmoid reflects the improvement of the expected performance of an algorithm per change in the signal-to-noise ratio. To estimate the slope, we use the distance d_s between the point of inflection Z_inf (zero of the second derivative) and the zeros of the third derivative (both zeros of the third derivative are at the same distance from the point of inflection). Figure 2 shows the sigmoid curve along with its points of interest. The zeros of the third-order derivative are Z^(3)_{1,2} = d − (1/c)·log((2 ± √3)/b). Thus, we have for d_s: d_s = Z^(3)_1 − Z_inf = Z_inf − Z^(3)_2 = (1/c)·log(1/(2 − √3)). The larger this distance is, the more expanded the sigmoid curve is. Since log(1/(2 − √3)) ≈ 1.32, d_s is in inverse proportion to the parameter c: d_s = a/c, where a ≈ 1.32. We find that the parameter c directly influences the slope of the sigmoid curve. We term c the slope indicator of the CTF. Figure 3 depicts sigmoids with different c values, illustrating gradual and sharp slopes of the sigmoid function.

In the following section, we formulate and discuss dimensions that characterize the behavior of algorithms, based on the axes of comparison discussed above.

4.1 SRF dimensions

In this section we determine several SRF dimensions based on the sigmoid properties, in addition to m, M, r_alg, and c, discussed in Sect. 4. We define the active noise range as the range [Z*, Z**] in which a change in the noise induces a measurable change in the performance. In order to calculate [Z*, Z**] we assume that there is a good-enough performance for a given task, approaching M for a given algorithm. We know that f(Z) ∈ (m, M), and we say that the performance is good enough if f(Z) = M − (M − m)·p, with p = 0.05. [4] We define the learning improvement of the algorithm as the size of the signal-to-noise interval in which f(Z) ∈ [m + (M − m)·p, M − (M − m)·p].

[4] Instead of 0.05, one can use any value close to 0, describing a normalized measure of distance from the optimal performance. In the case of p = 0.05, the distance from the optimal performance is 5%.

Then, using the inverse f^(-1)(y) = d − (1/c)·log((1/b)·((M − m)/(y − m) − 1)), where y = f(Z), we calculate the points Z* and Z**, which respectively are the bottom and the top points in Fig. 2 for a given p. We term the distance d_alg = Z** − Z* the width of the active area of the machine learning classifier (see Fig. 2). Then we define the ratio between the span of the performance range and the width of the active area, r_alg/d_alg, which describes (and is termed) the learning performance improvement over signal-to-noise ratio change. In the following paragraphs we describe how this analysis allows us to compare the performance of learning algorithms in the presence of noise.

5 Using SRF to compare algorithms

Given the performance dimensions described above, we can compare algorithms as follows. For performance-related comparison we can use the minimal performance m, the maximal performance M, and the span of the performance range r_alg. Algorithms not affected by the presence or absence of noise will have a minimal r_alg value. In a setting with a randomly changing level of noise, this parameter is related to the possible variance in performance. Related to the sensitivity of performance to the change of the signal-to-noise ratio, we can use the following dimensions (a computational sketch of these quantities is given after the list).

1. The active noise range [Z*, Z**], which shows how early the algorithm starts operating (measured by Z*) and how early it reaches good-enough performance (measured by Z**). We say that the algorithm operates when the level of noise in the data is within the active noise range of the algorithm.
2. The width of the active area d_alg = Z** − Z* of the algorithm, which is related to the speed of changing performance for a given r_alg in the domain of noise. A high d_alg value indicates that the algorithm varies its performance in a broad range of signal-to-noise ratios, implying less performance stability in an environment with heavily varying degrees of noise.
3. The parameter c of the sigmoid function (slope indicator), which is related to the distance d_s = 1.32/c. This distance has similar properties to the inverse of r_alg/d_alg, but it is independent of the predefined parameter p, which sets the fraction of performance considered good enough for a given task, and it is not directly connected to the active noise range. d_s reflects the change in the slope of the function. If c is large, then d_s is small and indicates higher stability of the algorithm in the presence of noise, and vice versa.
4. The point of inflection Z_inf, which shows the signal-to-noise ratio for which an algorithm gives its average performance. The parameter can be used to choose the algorithm that reaches its average performance earlier in a noisy environment, or the algorithm whose speed of improvement changes earlier.
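To make these quantities concrete, the sketch below derives Z_inf, d_s, the active noise range, d_alg and r_alg/d_alg from a fitted set of sigmoid parameters; the formulas follow Sect. 4 and Eq. 2, while the function names and the use of Python are ours.

```python
import math

def srf_dimensions(m, M, b, c, d, p=0.05):
    """Derive the SRF dimensions of Sects. 4 and 5 from fitted sigmoid parameters."""
    f_inv = lambda y: d - (1.0 / c) * math.log((1.0 / b) * ((M - m) / (y - m) - 1.0))
    z_low = f_inv(m + (M - m) * p)         # Z*: the algorithm starts operating
    z_high = f_inv(M - (M - m) * p)        # Z**: good-enough performance is reached
    r_alg = M - m                          # span of the performance range
    d_alg = z_high - z_low                 # width of the active area
    return {
        "Z_inf": d + (1.0 / c) * math.log(b),                 # inflection point (average performance)
        "d_s": math.log(1.0 / (2.0 - math.sqrt(3.0))) / c,    # about 1.32 / c
        "Z_star": z_low,
        "Z_star_star": z_high,
        "r_alg": r_alg,
        "d_alg": d_alg,
        "r_over_d": r_alg / d_alg,          # performance improvement per unit of Z
    }

print(srf_dimensions(m=0.1, M=0.9, b=2.5, c=1.0, d=1.0))
```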

A parameter related to both performance and sensitivity is the learning performance improvement over signal-to-noise ratio change, r_alg/d_alg. It can be used to determine whether reducing the noise in a dataset is expected to have a significant impact on the performance. A high value of r_alg/d_alg for an algorithm would imply that it makes sense to put more effort into reducing noise. Furthermore, using this dimension a decision maker can choose a more stable algorithm, when the variance of noise is known. In this case, the algorithm with the lowest value of r_alg/d_alg should be chosen in order to limit the corresponding variance in performance.

Based on the above discussion, we favor the classifiers with the following properties:

- higher maximal performance M,
- larger span of the performance range r_alg,
- higher learning performance improvement over signal-to-noise ratio change r_alg/d_alg,
- shorter width of the active area of the algorithm d_alg, and
- larger slope indicator c.

We expect to get high performance from an algorithm if the level of noise in the dataset is very low, and low performance if the level of noise in the dataset is very high. Decision makers can easily formulate different criteria, based on the proposed dimensions (summarized in Table 2). Based on the above, the workflow to select an algorithm given some user preference is as follows: we select an algorithm; we sample the input data for training; we estimate the SRF parameters; we repeat the process for different algorithms; finally, we analyze the SRF parameters in order to select the algorithm with the most appropriate behavior, i.e., the one that best matches the user preferences.

6 Experimental evaluation

In the following paragraphs we describe the experimental setup, the datasets used, and the results of our experiments. We first consider non-sequential and then sequential classifiers.

6.1 Experimental setup for non-sequential classifiers

In our study we applied the following machine learning algorithms, implemented in Weka (Hall et al. 2009): (a) IBk, a k-nearest neighbor classifier; (b) the Naïve Bayes classifier; (c) SMO, a support vector classifier (cf. Keerthi et al. 2001); (d) NBTree, a decision tree with Naïve Bayes classifiers at the leaves; (e) JRip, a RIPPER (Cohen 1995) rule learner implementation. We have chosen representative algorithms from different families of classification approaches, covering popular classification schemes. The experiments were performed on a 2.4 GHz machine with 4 GB RAM.

Fig. 4 Dataset grouping labels (classes: low < 7 ≤ high; attributes: low < 10 ≤ high; instances: low, medium and high, with 5000 separating medium from high)

Fig. 5 Distribution of the dataset characteristics (number of features, number of classes, fractal dimension, log of instance count). Real data: triangles; artificial: circles

We used a total of 24 datasets for our experiments. Most of the real datasets come from the UCI machine learning repository (Lichman 2013), and one from the study of Giannakopoulos and Palpanas (2010). Fourteen of the datasets are real, while ten are synthetic. All the datasets are divided into groups according to the number of classes, attributes (features) and instances in the dataset, as shown in Fig. 4. There are 12 possible groups that include all combinations of the parameters. Two datasets from each group were employed for the experiments. We created artificial datasets in the cases where real datasets with a certain set of characteristics were not available. The distribution of the dataset characteristics is illustrated in Fig. 5. The traits of the datasets illustrated are the number of classes, the number of attributes (features), the number of instances, and the estimated intrinsic (fractal) dimension. Table 3 contains the description of the datasets and their sources.

We build the ten artificial datasets using the following procedure. Having randomly sampled the number of classes, features and instances, we sample the parameters of each feature distribution. We assume that the features follow the Gaussian distribution with mean value (μ) in the interval [−100, 100] and standard deviation (σ) in the interval [0.1, 30]. The μ and σ intervals allow overlapping features across classes. The description of the parameters of the generated datasets is shown in Table 3.
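A minimal sketch of this generation procedure, assuming one Gaussian per class and feature (the function name and the use of NumPy are ours):

```python
import numpy as np

def generate_artificial_dataset(n_classes, n_features, n_instances, seed=0):
    """Sample a synthetic dataset with Gaussian features, as described above.

    Per class and feature, the mean is drawn from [-100, 100] and the standard
    deviation from [0.1, 30], so feature distributions may overlap across classes.
    """
    rng = np.random.default_rng(seed)
    means = rng.uniform(-100, 100, size=(n_classes, n_features))
    stds = rng.uniform(0.1, 30, size=(n_classes, n_features))
    y = rng.integers(0, n_classes, size=n_instances)   # class label per instance
    X = rng.normal(means[y], stds[y])                  # one feature row per instance
    return X, y

X, y = generate_artificial_dataset(n_classes=3, n_features=5, n_instances=1000)
print(X.shape, np.bincount(y))
```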

Table 3 The description of the datasets used. For each dataset, the columns report the number of classes (x_1), the number of attributes (x_2), the number of instances (x_3), and the fractal dimensionality (x_4); the numeric values are not reproduced in this transcription. The datasets are: 1. Wine; 2. Australian credit approval; 3. Real dataset from Giannakopoulos and Palpanas (2010); 4. Yeast; 5-9. Artificial; 10. Breast cancer Wisconsin; 11. Breast tissue; 12. Letter recognition; 13-15. Artificial; 16. Vehicle Silhouettes; 17. MAGIC gamma telescope; 18. Waveform; 19. Shuttle; 20. Artificial; 21. Libras movement; 22. Image; 23. Wine quality; 24. Artificial. The footnotes of the original table point to the web sources of the individual real datasets.
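The fractal dimensionality x_4 reported in Table 3 is estimated with a box-counting process (de Sousa et al. 2006). The following is only a rough illustration of such an estimate, computing the correlation (D2) dimension from the slope of a log-log fit; it is not the code used in the paper, and all names are ours.

```python
import numpy as np

def correlation_fractal_dimension(X, n_scales=5):
    """Rough box-counting estimate of the correlation (D2) fractal dimension.

    For several grid sizes r, sum the squared occupancy fractions of the cells and
    fit the slope of log(sum p_i^2) versus log(r); that slope approximates D2.
    """
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)   # normalize to the unit hypercube
    log_r, log_s2 = [], []
    for k in range(1, n_scales + 1):
        r = 2.0 ** (-k)                                      # cell side length
        cells = np.floor(X / r).astype(int)                  # grid cell index per point
        _, counts = np.unique(cells, axis=0, return_counts=True)
        p = counts / len(X)
        log_r.append(np.log(r))
        log_s2.append(np.log(np.sum(p ** 2)))
    slope, _ = np.polyfit(log_r, log_s2, 1)
    return slope

rng = np.random.default_rng(0)
print(correlation_fractal_dimension(rng.random((2000, 2))))  # roughly 2 for uniform 2-D data
```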

The noise was induced as follows. We created stratified training sets, equal in size to the stratified test sets. To induce noise, we created noisy versions of the training sets by mislabeling the instances. Using different levels l_n of noise, l_n = 0, 0.05, ..., 0.95 [5], a training set with l_n noise is a set with a fraction l_n of mislabeled instances. Hence we obtained 20 versions of the datasets with varying noise levels.

[5] We note that high levels of noise, such as 95%, are often observed in the presence of concept drift, e.g., when learning computer-user browsing habits in a network environment with a single IP and several different users sharing it.

Equal percentages of instances from the different classes (for stratification) were put into the test and training sets (as in the original dataset with 0% noise), using the following procedure:

1. Calculate the number N of observations in the original dataset; the number of observations in the test or training set should be the integer part of N/2.
2. Calculate the percentages of each class in the original dataset.
3. Calculate the number of observations from each class that should be in the training and test sets based on the percentages from the previous step (rounding the numbers if necessary).
4. Divide the observations according to the classes to obtain class datasets, i.e., sets of instances with only one class per set.
5. From each class dataset, sample observations (based on the results of step 3) for the training set, and place the remaining observations of each class dataset into the test set.
6. Make noisy training datasets from the original training dataset with the different levels l_n of noise.
7. Get the performance of the model trained on each noisy dataset (l_n = 0, 0.05, ..., 0.95) and tested on the noise-free stratified test set.

Given the performance of a classifier for a particular dataset, we estimate the parameters of the sigmoid function for each algorithm-dataset pair using a genetic algorithm (described in detail in Appendix 1). The idea is that the algorithm tries to find parameters that minimize the distance between the distribution of values of the real data and that of the estimated sigmoid. To this end, we use as a fitness measure a function of the D statistic of the Kolmogorov-Smirnov test (Massey 1951), which measures the maximum distance between the cumulative distribution functions of the two sample sets. Having the sets of estimated sigmoids describing the performance for particular algorithms and datasets with certain characteristics, we can reason about the properties of the algorithms along the SRF dimensions.
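Putting the above together, the raw material for each sigmoid fit is a curve of accuracy versus noise level. The following compact sketch of this sweep uses a generic scikit-learn classifier as a stand-in for the Weka learners used in the paper, and all names are ours:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def mislabel(y, noise_level, n_classes, rng):
    """Copy of y in which a fraction noise_level of labels is flipped to another class."""
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y_noisy[idx] = (y_noisy[idx] + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_noisy

def noise_sweep(X_train, y_train, X_test, y_test, n_classes, seed=0):
    """Accuracy of a classifier trained on increasingly mislabeled training sets."""
    rng = np.random.default_rng(seed)
    curve = []
    for l_n in np.arange(0.0, 1.0, 0.05):               # 20 noise levels: 0, 0.05, ..., 0.95
        clf = GaussianNB().fit(X_train, mislabel(y_train, l_n, n_classes, rng))
        z = np.log(1 + (1 - l_n) * len(y_train)) - np.log(1 + l_n * len(y_train))
        curve.append((z, accuracy_score(y_test, clf.predict(X_test))))
    return curve                                         # (Z, accuracy) pairs used to fit the sigmoid
```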

6.2 Experimental setup for sequential classifiers

We now consider sequential datasets, where the order of instances is important. Using these datasets, we study whether sequential classifiers adhere to the sigmoid rule framework. The main difference to the non-sequential case is that for the classification task we also use the information about the previous instances, in addition to the features of the current instance. For sequential data, we apply three classification algorithms based on hidden Markov models (HMM) (Rabiner and Juang 1986), conditional random fields (CRF) (Lafferty et al. 2001), and recurrent neural networks (RNN) (Elman 1990).

For the experiments, we consider the whole sequence of observations as an instance. For each algorithm, we define the order of dependency, and consider sequences of the length corresponding to that order. For the implementation of these sequential classifiers, we use the JAHMM (François 2013), Mallet (McCallum 2002), and Weka (Hall et al. 2009) libraries for HMM, CRF, and RNN, respectively. The experiments and the libraries are implemented in Java, and the classification algorithms are called as library functions. The experiments were performed on a 2.67 GHz machine with 4 GB RAM.

Hidden Markov models. For HMM-based classification, we perform the experiments in two settings, using HMMs of the third and the fourth order. In Fig. 8a and b, we depict the sigmoids of the HMM algorithm for the Robot walk 4 dataset. The green solid line corresponds to the true measurements of the performance of the HMM-based classifier for different values of the signal-to-noise ratio Z, while the dashed red line corresponds to the estimated sigmoid, the parameters being assessed with a genetic algorithm (see Appendix 1). Pearson's correlation test between the obtained performance and the estimated performance based on the sigmoid function shows statistically significant correlation values, with coef > 0.99 for significance level α = 0.05. This holds for both the first and the second setting (the third- and the fourth-order models, correspondingly). Similar results were obtained for the rest of the datasets and sequential learners. These results confirm that the sequential learners adhere to the sigmoid rule framework.

Fig. 6 Linear chain CRF architecture (factors connecting features, true labels, and noisy labels)

Conditional random fields. To better cover sequential classifiers, we employ and study linear and quadratic CRFs. These go beyond class labels, which HMMs use, and also take into account data attributes to determine a class. We experiment with CRF in two different settings, according to the order of the connection between the instances. To label an unseen instance, the most likely Viterbi labeling y* = argmax_y p(y^(i) | x^(i)) is computed. In the first setting we consider a linear chain CRF, in which an instance depends only on the previous one, as depicted in Fig. 6. Since we consider only label noise, we induce noise only in the labels of the instances. In the second setting, we consider the model of the second order, where an instance depends on the two previous instances. As an example, the sigmoids of the CRF algorithm for the Robot walk 4 dataset are shown in Fig. 8c and d. As in the case of HMMs, the sigmoids fit the performance curves of the CRF classifier well for all the datasets.

Recurrent neural networks. Whereas HMM takes into account only the data labels of the sequence, and CRF uses both class labels and the attributes of the previous instances in the sequence in order to

classify the current instance, RNN uses the attributes of the current instance, and only the labels of the previous instances.

Fig. 7 The architecture of the RNN

For RNN-based classification, we provide the previous label as input to calculate the output of the current instance (see Fig. 7), as in the work of Bodén (2002), and we use two settings with second- and third-order dependencies: in the first setting, the two previous labels are used, while in the second setting the three previous labels are used. The sigmoids of the RNN algorithms for the Robot walk 4 dataset are shown in Fig. 8e and f. According to Pearson's correlation test for setting 1 (coef = ..., p-value = ...) and setting 2 (coef = ..., p-value = ...) between the obtained performance and the estimated sigmoid function, we can say that the classifier based on RNN also follows the sigmoid rule. As in the previous cases of HMMs and CRFs, the sigmoids fit the performance curves well for all the datasets for the RNN classifier. Thus, the sigmoid rule fits RNN-based classifiers as well.

Datasets. We used 18 real datasets for our sequential experiments. All the datasets contain time series where the instances appear in chronological order. The label of the last observation is assigned to the sequence to imply causality. The distribution of the characteristics of these datasets is illustrated in Fig. 9. These characteristics are: the number of classes, the number of attributes, the number of instances, and the largest lag value that corresponds to significant autocorrelation of the labels. Autocorrelation is a linear dependence of a variable with itself at two points in time (Box et al. 2015). We use the autocorrelation function (ACF) implemented in R to calculate the autocorrelation with a maximum lag of 50 for all datasets. We choose the first lag that gives the smallest autocorrelation falling out of the significance band, as shown, for example, in Fig. 10 for the Pioneer gripper dataset, where we choose a lag equal to 37. Descriptions of the datasets are shown in Table 4.
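The lag selection described above can be reproduced, for instance, with the autocorrelation function in Python's statsmodels; the paper uses R's acf, so this sketch and its stopping rule (first lag whose autocorrelation falls inside the approximate 95% band ±1.96/√n) are ours.

```python
import numpy as np
from statsmodels.tsa.stattools import acf

def first_insignificant_lag(labels, max_lag=50):
    """First lag at which the label autocorrelation leaves the ~95% significance band.

    This mirrors the selection described above; the exact rule used in the paper
    (read off R's acf plot) may differ in detail.
    """
    values = acf(np.asarray(labels, dtype=float), nlags=max_lag, fft=False)
    band = 1.96 / np.sqrt(len(labels))      # approximate significance band of the ACF plot
    for lag in range(1, max_lag + 1):
        if abs(values[lag]) < band:
            return lag
    return max_lag

# Example with a synthetic, slowly changing label sequence.
rng = np.random.default_rng(0)
labels = np.repeat(rng.integers(0, 2, size=100), 25)   # long runs, so autocorrelation decays slowly
print(first_insignificant_lag(labels))
```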

Fig. 8 Sigmoid CTF of the two settings of the HMM-, CRF- and RNN-based classification algorithms for the Robot walk 4 dataset, indicating different algorithm behaviors. Green solid line: true measurements; dashed red line: estimated sigmoid. a Sigmoid CTF of HMM, 1st setting; b Sigmoid CTF of HMM, 2nd setting; c Sigmoid CTF of CRF, 1st setting; d Sigmoid CTF of CRF, 2nd setting; e Sigmoid CTF of RNN, 1st setting; f Sigmoid CTF of RNN, 2nd setting (Color figure online)

We induce noise into sequential data in a way similar to the independent data. We randomly split all data into equally sized training and test parts. Mislabeled instances are in the training data only. We induce varying levels of noise from 0 to 100% as follows. The noisy version of a class C_i of a training set is created by selecting the level of noise l_n, and replacing instances of that class with instances of another class with probability l_n. Then sequential classifiers are trained using the training data with various levels of noise, and are used to predict the test data. Given the classification of all the test instances, we compute the overall performance of the algorithm using the classification accuracy:

P = #CorrectClassifications / #TotalClassifications.
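A small sketch of this per-class corruption and of the accuracy P; the sequence order is left intact, only training labels are touched, and all names are ours:

```python
import numpy as np

def corrupt_class_labels(y_train, class_id, noise_level, n_classes, rng):
    """Noisy version of class class_id: each of its training instances is replaced
    by (the label of) another class with probability noise_level."""
    y_noisy = y_train.copy()
    for i in np.where(y_train == class_id)[0]:
        if rng.random() < noise_level:
            y_noisy[i] = (class_id + rng.integers(1, n_classes)) % n_classes
    return y_noisy

def accuracy(y_true, y_pred):
    """Overall performance P = #CorrectClassifications / #TotalClassifications."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
print(corrupt_class_labels(y, class_id=0, noise_level=0.5, n_classes=3, rng=rng))
print(accuracy([1, 0, 1], [1, 1, 1]))
```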

Fig. 9 Distributions of the characteristics of the sequential datasets

Fig. 10 Autocorrelation plot of the Pioneer gripper dataset (autocorrelation value vs. lag)

We perform repeated random sub-sampling validation with 20 splits by randomly choosing half of the instances as training data and the remaining instances as test data, and calculate the average performance of the classifier. For the sequential datasets the correct order of the instances is kept for both the training and test phases. The parameters of the corresponding sigmoid are assessed using the values of the signal-to-noise ratio and the corresponding average performance levels. In the following sections we consider sequential classifiers in more detail.

6.3 Experiments on classifier behavior comparison

We perform experiments of noisy classification using repeated random sub-sampling validation [7] with 10 splits per algorithm, per dataset, per noise level, and calculate the average performance for each setting. Repeated random sub-sampling is a common technique of cross-validation, in which a dataset is randomly split into training and testing parts and the accuracy results are averaged over the splits. In the case of sequential learners, we use 20 splits. Given the 20 levels of the signal-to-noise ratio and the corresponding algorithm performance (i.e., classification accuracy), we estimated the parameters of the sigmoid. The search in the space of sigmoid parameters is performed by a genetic algorithm [8], as was proposed in Giannakopoulos and Palpanas (2010).

[7] Repeated random sub-sampling validation is also known as Monte Carlo cross-validation (Kuhn and Johnson 2013).
[8] The settings of the genetic algorithm can be found in the appendix.
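The repeated random sub-sampling (Monte Carlo cross-validation) loop can be sketched as follows, with a generic train-and-predict callback standing in for the Weka/JAHMM/Mallet learners; all names are ours.

```python
import numpy as np

def monte_carlo_accuracy(X, y, train_and_predict, n_splits=10, seed=0):
    """Average accuracy over repeated random half/half train-test splits.

    train_and_predict(X_tr, y_tr, X_te) must return predicted labels for X_te;
    indices are sorted so that, for sequential data, the chronological order of
    the selected instances is preserved within each part.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        half = len(y) // 2
        tr, te = np.sort(idx[:half]), np.sort(idx[half:])
        y_pred = train_and_predict(X[tr], y[tr], X[te])
        scores.append(np.mean(y_pred == y[te]))
    return float(np.mean(scores))
```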

Table 4 Dataset descriptions. For each dataset, the columns report the number of classes (x_1), the number of instances (x_2), the autocorrelation lag (x_3), and the number of attributes (features) (x_4); the numeric values are not reproduced in this transcription. The datasets are: 1. Climate; 2. Human activity; 3. Pioneer gripper; 4. Robot walk 4; 5. Diabetes; 6. Ozone level 1 h; 7. Pioneer turn; 8. Ozone level 8 h; 9. Opportunity; 10. Robot walk 2; 11. Robot walk 24; 12. Pioneer move; 13. Callt2; 14. Electric 2; 15. PAMAP; 16. Electric 3; 17. Daphnet freezing; 18. Dodger. The Climate, Electric 2 and Electric 3 datasets come from Giannakopoulos and Palpanas (2010); the footnotes of the original table point to the web sources of the other datasets.

A sample of the true and sigmoid-estimated performance graphs for varying levels of noise for a non-sequential classifier can be seen in Fig. 11. The corresponding graphs for the sequential classifiers are shown in Fig. 8. Although in our experiments the parameters of the sigmoid were estimated offline, the sigmoid rule framework can be applied in an online scenario, following a training period.
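As an illustration of this parameter search, the sketch below fits the sigmoid of Eq. 2 by minimizing a Kolmogorov-Smirnov D statistic between the observed and the modeled performance values; it uses SciPy's differential evolution as a generic global optimizer in place of the genetic algorithm described in the appendix of the paper, and all names are ours.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import ks_2samp

def fit_sigmoid(z, perf):
    """Fit (m, M, b, c, d) of Eq. 2 to (Z, accuracy) pairs using a KS-based fitness."""
    z, perf = np.asarray(z), np.asarray(perf)

    def sigmoid(params):
        m, M, b, c, d = params
        return m + (M - m) / (1.0 + b * np.exp(-c * (z - d)))

    def fitness(params):
        m, M, b, c, d = params
        if M < m:                                   # keep the constraint m <= M
            return 1e6
        # D statistic: maximum distance between the CDFs of the two value samples.
        return ks_2samp(perf, sigmoid(params)).statistic

    bounds = [(0, 1), (0, 1), (1e-3, 10), (1e-3, 10), (-5, 5)]   # m, M, b, c, d
    return differential_evolution(fitness, bounds, seed=0).x

# Example with synthetic noisy measurements of a known sigmoid.
z = np.linspace(-3, 5, 20)
true = 0.1 + 0.8 / (1 + 2.5 * np.exp(-1.2 * (z - 1.0)))
print(fit_sigmoid(z, true + np.random.default_rng(0).normal(0, 0.01, size=20)))
```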

Fig. 11 Sigmoid CTF of the SMO and IBk algorithms for the Wine dataset. Green solid line: true measurements; dashed red line: estimated sigmoid (Color figure online)

Fig. 12 Estimated SRF parameters per algorithm. X-axis labels (left-to-right): IBk, JRip, NB, NBTree, SMO. a Min performance m, b Max performance M, c Sigmoid slope c, d Performance span r_alg, e Active area width d_alg, f Perf. improvement r_alg/d_alg

Figure 12 illustrates the values of the mean and the standard deviation of the SRF parameters per algorithm, over all 24 datasets, for the non-sequential classifiers. As an example of an interpretation of the figure using SRF, the plots indicate that (for the range of datasets studied) SMO is expected to improve its performance faster on average than all other algorithms, when the signal-to-noise ratio increases. This conclusion is based on the r_alg/d_alg dimension and the values of the slope indicator c (refer to Fig. 12c, f). Figure 11 offers a visual comparison of the performances of the SMO and IBk algorithms for the Wine dataset, where we can see that SMO improves its performance twice as fast as IBk. At the same time, the confidence interval for r_alg/d_alg and c of the Naïve Bayes algorithm overlaps with that of SMO, so these algorithms can also be recommended if the

Fig. 13 SRF parameters per algorithm. X-axis labels (left-to-right): CRF 1st and 2nd setting, HMM 1st and 2nd setting, RNN 1st and 2nd setting. a Min performance m, b Max performance M, c Sigmoid slope c, d Performance span r_alg, e Active area width d_alg, f Perf. improvement r_alg/d_alg

same criteria are optimized. Nevertheless, the SMO and Naïve Bayes algorithms can be well differentiated by the d_alg criterion (see Fig. 12e). IBk has a smaller potential for improvement of performance (but also a smaller potential for loss) than SMO, when noise levels change, given that the span of the performance range r_alg is higher for SMO (Fig. 12d). This difference can also be seen in Fig. 11, where the distance between the minimum and the maximum performance values is larger for SMO.

Figure 13 illustrates the values of the mean and the standard deviation of the SRF parameters per algorithm, over all 18 datasets, for the sequential case. The plots indicate that HMM is expected to improve its performance faster than the other algorithms, when the signal-to-noise ratio increases (Fig. 13c, f). Though the two settings of RNN perform similarly along these parameters, they are clearly distinct from the rest of the algorithms. On the other hand, CRF has a smaller potential for improvement of its performance, but also a smaller risk for low performance than HMM, when noise levels change (Fig. 13d, f). An example of the fast improvement of HMM when compared to the other families of algorithms can be seen in Fig. 8. As can be seen from Fig. 13, all the families of the sequential classifiers have their specific behavior along the devised SRF dimensions, with certain variations depending on the order of the model.

We stress that the parameter estimation does not require previous knowledge of the noise levels, but is dataset dependent. In the special case of a classifier selection process, having an estimate of the noise level in the dataset helps to reach a decision through the use of SRF. In the following, we demonstrate the existence of this connection between the parameters of the sigmoid curve and the characteristics of the

Therefore, SRF can help in choosing the better learner, provided that we know the characteristics of the dataset.

7 Modeling expected performance in new datasets

We now study the connection between the dataset characteristics and the sigmoid parameters, irrespective of the choice of the algorithm. In essence, this study shows if and how the framework can be used to estimate the performance of algorithms in a noisy setting, in the presence of a new dataset. We consider the results obtained from all the algorithms as different samples of the SRF parameters for a particular dataset. We use regression analysis to observe the cumulative effect of the dataset characteristics on a single parameter, and we use correlation analysis to detect any connection between each pair of dataset characteristic and sigmoid parameter. We examine the connections between the dataset characteristics and the sigmoid parameters both individually and all together, in order to draw the complete picture.

7.1 Linear regression

We use the lm method (fitting linear models) implemented in the package stats in R to estimate the values of the parameters of the linear regression model for each chosen SRF parameter y:

y = α_0 + α_1 x_1 + α_2 x_2 + α_3 x_3 + α_4 x_4 + ε,

where the x_i, i = 1, 2, ..., correspond to the features of a dataset, such as the number of classes, the number of instances, and other dataset characteristics specific to the non-sequential and sequential cases, which are discussed below; ε is the error term, with E(ε) = 0. For choosing the best model, we use a modification of the bestlm function from the nsRFA package in R, which in turn exploits the function leaps of the R package leaps. The criterion for the best model is the adjusted R² statistic (the closer the adjusted R² is to one, the better the estimation). R² defines the proportion of variation in the dependent variable (y) that can be explained by the predictors (the X variables) in the regression model. As some variation of y can be predicted by chance, the adjusted value R²_a is used. The adjusted R² takes into account the number of observations (N) and the number of predictors (k), and is computed as:

R²_a = 1 − (1 − R²)(N − 1) / (N − k − 1).

After applying the linear model, we exclude the non-significant parameters and search for the model that maximizes the adjusted R² statistic (i.e., the best linear model). In addition to R²_a, we applied a statistical test with significance level α = 0.05 to each chosen regression model in order to check whether the model fits the data well. If the model describes the distribution of the values in a statistically significant way, then, after the hypothesis testing, the p-value should fall below the significance level.
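As an illustration of this fitting step, the R sketch below regresses one SRF dimension on the dataset characteristics and keeps the predictor subset with the highest adjusted R². It relies on the standard lm and leaps::regsubsets functions rather than the modified bestlm routine mentioned above, so it should be read as a simplified stand-in for the actual procedure; the data frame and its contents are hypothetical.

  library(leaps)  # exhaustive best-subset search, as used internally by bestlm

  # Hypothetical data: one row per (dataset, algorithm) sample; y is one SRF
  # dimension (e.g., r_alg), x1..x4 are the dataset characteristics.
  set.seed(42)
  d <- data.frame(y  = runif(120),
                  x1 = sample(2:10, 120, replace = TRUE),      # number of classes
                  x2 = sample(4:60, 120, replace = TRUE),      # number of features
                  x3 = sample(150:5000, 120, replace = TRUE),  # number of instances
                  x4 = runif(120, 1, 6))                       # intrinsic dimensionality

  # Rank all predictor subsets by adjusted R^2 and keep the best one
  subsets <- regsubsets(y ~ x1 + x2 + x3 + x4, data = d, nvmax = 4)
  best    <- which.max(summary(subsets)$adjr2)
  keep    <- names(coef(subsets, best))[-1]    # predictors of the best model (drop the intercept)

  # Refit the selected model with lm to obtain p-values and adjusted R^2
  fit <- lm(reformulate(keep, response = "y"), data = d)
  summary(fit)$adj.r.squared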

Table 5 Prediction error of the linear regression models for the non-sequential classifiers. For each SRF dimension (min performance m, max performance M, performance span r_alg, active area width d_alg, performance improvement r_alg/d_alg, sigmoid slope c) the table reports the error measures MSE, MAE, RMSE, RMAE and NRMSE, together with the average adjusted R²_a.

We applied a leave-one-out validation process, where one dataset is left out from training and used for testing on every run. We performed this analysis separately with each of the variables m, M, r_alg, d_alg, r_alg/d_alg, and c as the dependent variable y.

Non-sequential case. For the non-sequential classifiers, the following dataset parameters were chosen: the number of classes (x_1), the number of features (x_2), the number of instances (x_3), and the intrinsic dimensionality (x_4) of a dataset, i.e., the fractal correlation dimension (Camastra and Vinciarelli 2002; de Sousa et al. 2006), calculated using a box counting process (we would like to thank Christos Faloutsos for kindly providing the code for the fractal dimensionality estimation). The results of model fitting and prediction of the SRF dimensions are reported in Table 5, where the average errors between the observed and predicted SRF dimensions are shown. For each SRF dimension chosen, we observed 5 values (since 5 machine learning algorithms were used), and having estimated them for 24 datasets, we ended up with 120 predictions for a single SRF dimension. We calculated five types of errors: (1) MSE, the mean square error, MSE = (1/n) Σ_{i=1..n} (ŷ_i − y_i)²; (2) MAE, the mean absolute error, MAE = (1/n) Σ_{i=1..n} |ŷ_i − y_i|; (3) RMSE, the relative root mean square error, RMSE = sqrt( (1/n) Σ_{i=1..n} ((ŷ_i − y_i)/y_i)² ); (4) RMAE, the relative mean absolute error (also known as the mean absolute percentage error), RMAE = (1/n) Σ_{i=1..n} |ŷ_i − y_i| / |y_i|; and (5) NRMSE, the normalized root mean square error, NRMSE = sqrt(MSE) / (y_max − y_min). In the above formulas, ŷ_i stands for the predicted value of the parameter, y_i is the actual value, and n is the number of predictions. The last column of Table 5 shows the average of the adjusted R² statistic for the models that were estimated for all the SRF dimensions (averaged over the 24 datasets).

Figure 14 illustrates how our models fit the test data, showing that in most cases the true values of the sigmoid parameters for each dataset (illustrated by circles that correspond to the 5 algorithms for each test dataset i, i = 1, 2, ..., 24) are within the 95% confidence zone around the estimated values.
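These error measures translate directly into a few lines of R. The sketch below is an illustrative implementation under the formula readings given above (MSE without the square root, NRMSE as the root of MSE scaled by the observed range); it is not the original evaluation code, and the toy values are made up.

  # Error measures between predicted (y_hat) and actual (y) SRF values,
  # following definitions (1)-(5) above.
  srf_errors <- function(y_hat, y) {
    mse <- mean((y_hat - y)^2)                     # (1) mean square error
    c(MSE   = mse,
      MAE   = mean(abs(y_hat - y)),                # (2) mean absolute error
      RMSE  = sqrt(mean(((y_hat - y) / y)^2)),     # (3) relative root mean square error
      RMAE  = mean(abs((y_hat - y) / y)),          # (4) relative mean absolute error (MAPE)
      NRMSE = sqrt(mse) / (max(y) - min(y)))       # (5) normalized root mean square error
  }

  # Toy usage with hypothetical predictions for one SRF dimension
  y     <- c(0.62, 0.70, 0.55, 0.81)
  y_hat <- c(0.60, 0.74, 0.50, 0.79)
  srf_errors(y_hat, y)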

Fig. 14 Real and estimated values of the sigmoid parameters (m, M, r_alg, d_alg, r_alg/d_alg, c) per dataset for the non-sequential classifiers. Real values: black rectangles, estimated values: circles, gray zone: 95% prediction confidence interval

This finding further supports the connection between the dataset parameters and the SRF dimensions. According to the results, the chosen parameters of the datasets can be used to reason about the parameters of the sigmoid. This allows us to recommend algorithms for a new dataset with known characteristics, if we have estimated the SRF dimensions for a set of algorithms on datasets with similar characteristics.

The relation between the noise level and the average classification performance over the Climate dataset is illustrated in Fig. 15. There, we depict the curve of the estimated performance, based on the regression models, as well as the curves of the actual performances of several algorithms. The plot shows that the estimation is quite good for the IBk, NBTree and JRip algorithms. The estimation for the other algorithms is not as good, which may be attributed to the fact that, in our experiments, we trained the regression models using all available datasets, which have very different characteristics. We expect that training the models on a large number of datasets with similar characteristics can further improve the accuracy.

Sequential case. In this set of experiments, not all the sequential classification algorithms use the features of the data instances. Therefore, we examine the influence of the dataset on the CTF parameters by studying the number of classes (x_1), the number of instances (x_2), and the maximal autocorrelation lag of a dataset (x_3), which is described in more detail in Sect. 6.2.
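For a new dataset with known characteristics, the fitted regression model can then be queried for an expected value and a prediction interval of each SRF dimension, which is essentially what the gray zones in Figs. 14 and 16 depict. The R lines below show this step with the standard predict function; the characteristics of the new dataset are hypothetical, and fit refers to the linear model from the earlier sketch rather than to the models trained in the actual experiments.

  # Characteristics of a hypothetical new (non-sequential) dataset
  new_dataset <- data.frame(x1 = 3,      # number of classes
                            x2 = 20,     # number of features
                            x3 = 1500,   # number of instances
                            x4 = 2.4)    # intrinsic (fractal) dimensionality

  # Point estimate and 95% prediction interval for the chosen SRF dimension;
  # 'fit' is the lm object selected in the previous sketch.
  predict(fit, newdata = new_dataset, interval = "prediction", level = 0.95)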

Fig. 15 Sigmoid with regression-estimated parameters, together with the actual sigmoids of the corresponding classification algorithms, for the Climate dataset

Table 6 Prediction error of the linear regression models for the sequential data. For each SRF dimension (min performance m, max performance M, performance span r_alg, active area width d_alg, performance improvement r_alg/d_alg, sigmoid slope c) the table reports the error measures MSE, MAE, RMSE, RMAE and NRMSE, together with the average adjusted R²_a.

We also applied the leave-one-out method for the six dependent variables, as in the non-sequential case. The results of model fitting and prediction of the SRF dimensions are reported in Table 6. For each SRF dimension chosen, we observed 6 values, since 3 sequential learning algorithms were used and each algorithm was applied in two different settings of dependency order. These values are estimated for 18 datasets, so 108 predictions were made for each SRF dimension. We also calculated the five types of errors, as with the non-sequential datasets (Table 6). Just as for the non-sequential data, for sequential data classification we can see the relationship between the dataset parameters and the SRF dimensions.

Figure 16 illustrates how our models fit the test data. The results show that most of the true parameter values are within the 95% prediction confidence levels. Thus, we can use the dataset parameters in order to reason about the parameters of the sigmoid of the algorithms. This enables us to recommend the algorithm of choice in advance for a new dataset with known characteristics. We can additionally recommend the order of the sequential classifier, if we have computed the SRF dimensions for a set of algorithms on datasets with similar characteristics.
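A compact way to reproduce this evaluation protocol is a leave-one-dataset-out loop around the fitting and error-computation steps sketched earlier. The outline below illustrates the sequential case (18 datasets, 6 algorithm settings, 3 dataset characteristics); the generated data and identifiers are hypothetical, and srf_errors is the helper defined in the earlier sketch.

  # Leave-one-dataset-out evaluation for one SRF dimension (sequential case):
  # 18 datasets x 6 algorithm settings = 108 samples.
  set.seed(7)
  seq_d <- data.frame(dataset_id = rep(seq_len(18), each = 6),
                      y  = runif(108),                                        # e.g., max performance M
                      x1 = rep(sample(2:8, 18, replace = TRUE), each = 6),    # number of classes
                      x2 = rep(sample(200:3000, 18, replace = TRUE), each = 6), # number of instances
                      x3 = rep(sample(1:20, 18, replace = TRUE), each = 6))     # max autocorrelation lag

  loo <- lapply(unique(seq_d$dataset_id), function(held_out) {
    train <- seq_d[seq_d$dataset_id != held_out, ]
    test  <- seq_d[seq_d$dataset_id == held_out, ]
    m     <- lm(y ~ x1 + x2 + x3, data = train)          # refit without the held-out dataset
    data.frame(actual = test$y, predicted = predict(m, newdata = test))
  })

  preds <- do.call(rbind, loo)
  srf_errors(preds$predicted, preds$actual)               # error measures defined above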

Fig. 16 Real and estimated values of the sigmoid parameters for the sequential classifiers. Real values: black rectangles, estimated values: circles, gray zone: 95% prediction confidence interval

7.2 Logistic regression

We also employed logistic regression, applying the same leave-one-out process, for the purpose of predicting the SRF dimensions. When applying logistic regression, we use the Akaike information criterion (AIC) instead of the adjusted R²_a in order to choose the best model. Being a measure of the relative quality of a statistical model for given data, AIC provides the means for model selection. In the general case, the AIC is calculated as follows:

AIC = 2k − 2 ln(L),

where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model (Akaike 1974). We choose the best prediction model by minimizing the average AIC. We also calculate the five types of errors, as in the linear regression case. The results of model fitting and prediction of the SRF dimensions for the non-sequential datasets are shown in Table 7, while the results for the sequential datasets are shown in Table 8. According to the experimental evaluation, the logistic regression model gives better predictions for some SRF dimensions, producing relatively smaller errors compared to the linear regression results. For both the non-sequential (Tables 5 and 7) and sequential (Tables 6 and 8) cases, the parameters M and r_alg are better predicted with the logistic regression model. Additionally, for the non-sequential case, c is better predicted using the logistic regression model. The accurate prediction of the SRF dimensions supports the connection between the dataset characteristics and the SRF dimensions.
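As an illustration of AIC-based selection, the R sketch below compares a small set of candidate logistic regression models and keeps the one with the lowest AIC, also checking the formula above against R's built-in AIC function. The toy binary response and predictors are invented for the example and are not one of the SRF dimensions; the sketch demonstrates the selection mechanics rather than the exact models used in the evaluation.

  # Generic illustration of AIC-based model selection with logistic regression.
  set.seed(1)
  toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
  toy$y <- rbinom(60, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))   # toy binary outcome

  candidates <- list(glm(y ~ x1,           family = binomial, data = toy),
                     glm(y ~ x1 + x2,      family = binomial, data = toy),
                     glm(y ~ x1 + x2 + x3, family = binomial, data = toy))

  aics <- sapply(candidates, AIC)           # built-in AIC = 2k - 2 ln(L)
  best <- candidates[[which.min(aics)]]     # model minimizing AIC

  # Manual check of the formula for the selected model
  k <- length(coef(best))                   # number of estimated parameters
  2 * k - 2 * as.numeric(logLik(best))      # equals AIC(best)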
