MACHINE LEARNING
Vasant Honavar
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
honavar@cs.iastate.edu
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/
A Recipe for Learning

P(h|D) = P(D|h) P(h) / P(D)

P(h): prior probability of hypothesis h
P(D): prior probability of training data D
P(h|D): probability of h given D
P(D|h): probability of D given h
A Recipe for Learning

Maximum a posteriori hypothesis:
h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

Maximum likelihood hypothesis: if P(h_i) = P(h_j) for all i, j, then
h_ML = argmax_{h∈H} P(D|h)
Brute Force MAP Hypothesis Learner

For each hypothesis h in H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
Output the hypothesis with the highest posterior probability:
h_MAP = argmax_{h∈H} P(D|h) P(h);  h_ML = argmax_{h∈H} P(D|h)
Brute Force MAP Hypothesis Learner

h_MAP = argmax_{h∈H} P(D|h) P(h)
      = argmax_{h∈H} (log₂ P(D|h) + log₂ P(h))
      = argmin_{h∈H} (−log₂ P(D|h) − log₂ P(h))

−log₂ P(h): description length of h under optimal encoding of hypotheses based on P(h)
−log₂ P(D|h): description length of data D given h under optimal encoding based on P(D|h)

Bayesian learning implies Occam's razor!
MAP Learning of 2-class classifiers

Consider a 2-class learning problem defined over an instance space X, a hypothesis space H, and training examples D = {e_i = (X_i, d_i) | X_i ∈ X; d_i = c(X_i)}
Consider a learning algorithm which outputs a hypothesis that is consistent with the examples in D
Let V_{H,D} be the subset of hypotheses in H that are consistent with D (the version space)
What would Bayes rule produce as the MAP hypothesis?
MAP Learning of Binary Concepts (2-class classifiers)

Assumptions:
1. The hypothesis space is finite.
2. The training examples are i.i.d.
3. The training data is noise free: ∀i, d_i = c(X_i).
4. The target concept c ∈ H.
5. All hypotheses are a priori equally likely: ∀h ∈ H, P(h) = 1/|H|
MAP Learning of 2-class classifiers

Assuming the samples are i.i.d.:
P(D|h) = Π_i P(e_i|h) = Π_i P(X_i, d_i|h) = Π_i P(d_i|X_i, h) P(X_i|h) = Π_i P(d_i|X_i, h) P(X_i)
(the X_i are independent of h)

P(d_i|X_i, h) = 1 if d_i = h(X_i); 0 otherwise
MAP Learning of 2-class classifiers

If h is consistent with D, that is, ∀i ∈ {1...m}, d_i = h(X_i), then
P(D|h) = Π_{i=1}^{m} P(X_i)
If h is not consistent with D, that is, ∃i ∈ {1...m} such that d_i ≠ h(X_i), then
P(D|h) = 0
MAP Learning of 2-class classifiers

P(D) = Σ_{h_j∈H} P(D|h_j) P(h_j) = (|V_{H,D}| / |H|) Π_i P(X_i)

For h ∈ V_{H,D}:  P(h|D) = P(D|h) P(h) / P(D) = (Π_i P(X_i) · 1/|H|) / ((|V_{H,D}| / |H|) Π_i P(X_i)) = 1 / |V_{H,D}|
For h ∉ V_{H,D}:  P(h|D) = 0

Every hypothesis in H that is consistent with D is a MAP hypothesis
Bayesian Recipe for Learning

If the training examples are independently identically distributed and noise-free, and if each candidate hypothesis in H is a priori equally likely, then every hypothesis that is consistent with the training data (that is, correctly classifies each training example) maximizes P(h|D)
In this setting, all that the learner has to do is produce a hypothesis from H that is consistent with the training data
Effect of Data (Evidence) on the Version Space
Bayesian Recipe for Learning

h_MAP = argmin_{h∈H} (−log₂ P(D|h) − log₂ P(h))

−log₂ P(D|h): the number of bits needed to encode the data D given the hypothesis h (under optimal encoding); in other words, the error of h, or the exceptions to h
−log₂ P(h): the number of bits needed to encode the hypothesis h (under optimal encoding)
Bayesian Recipe for Learning

Suppose the training data are i.i.d. and noise-free, but the hypotheses in H are not all a priori equally likely:
h_MAP = argmax_{h∈H} P(D|h) P(h)
      = argmax_{h∈H} (log₂ P(D|h) + log₂ P(h))
      = argmin_{h∈H} (−log₂ P(D|h) − log₂ P(h))
Bayesian Recipe for Learning

If the training data are i.i.d. and noise-free, but the hypotheses in H are not all a priori equally likely, then the learner has to produce a hypothesis from H that trades off the error on the training data against the complexity of the hypothesis
Decision tree classifiers, which we will look at next, provide an example of this approach
Learning Decision Tree Classifiers

Nature → Instance → Classifier → Class label
Training data S = S_1 ∪ S_2 ∪ ... ∪ S_m, where S_i is the multi-set of examples belonging to class C_i

On average, the information needed to convey the class membership of a random instance drawn from nature is
H(P) = −Σ_i p_i log₂ p_i
estimated by Ĥ(X) = −Σ_i p̂_i log₂ p̂_i, where P̂ is an estimate of P and X is a random variable with distribution P
Learning Decision Tree Classifiers

The task of the learner then is to extract the needed information from the training set and store it in the form of a decision tree for classification

Information gain based decision tree learner:
Start with the entire training set at the root
Recursively add nodes to the tree corresponding to tests that yield the greatest expected reduction in entropy (or the largest expected information gain), until some termination criterion is met (e.g., the training data at every leaf node has zero entropy)
Learning Decision Tree Classifiers – Example

Instances: ordered 3-tuples of attribute values corresponding to
Height (tall, short), Hair (dark, blonde, red), Eye (blue, brown)

Training data:
Instance  Attributes  Class label
I1  (t, d, l)  +
I2  (s, d, l)  +
I3  (t, b, l)  −
I4  (t, r, l)  −
I5  (s, b, l)  −
I6  (t, b, w)  +
I7  (t, d, w)  +
I8  (s, b, w)  +
Learning Decision Tree Classifiers – Example

Ĥ(X) = −(3/8) log₂(3/8) − (5/8) log₂(5/8) = 0.954 bits

Height splits the data into S_t = {I1, I3, I4, I6, I7} and S_s = {I2, I5, I8}:
Ĥ(X | Height = t) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971 bits
Ĥ(X | Height = s) = −(2/3) log₂(2/3) − (1/3) log₂(1/3) = 0.918 bits
Ĥ(X | Height) = (5/8) Ĥ(X | Height = t) + (3/8) Ĥ(X | Height = s) = (5/8)(0.971) + (3/8)(0.918) = 0.951 bits

Similarly, Ĥ(X | Eye) = 0.607 bits and Ĥ(X | Hair) = 0.5 bits

Hair is the most informative attribute because it yields the largest reduction in entropy. A test on the value of Hair is chosen to correspond to the root of the decision tree.
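The entropy computations above can be reproduced with a short script. This is an illustrative sketch (the helper names `entropy` and `split_entropy` are mine, not from the slides); it encodes the eight training examples and recovers the values quoted on this slide.

```python
import math

def entropy(labels):
    """Empirical entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(examples, attr):
    """Expected entropy after splitting on attribute index attr."""
    n = len(examples)
    groups = {}
    for x, y in examples:
        groups.setdefault(x[attr], []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# The eight training examples: (Height, Hair, Eye) -> class
data = [
    (("t", "d", "l"), "+"), (("s", "d", "l"), "+"),
    (("t", "b", "l"), "-"), (("t", "r", "l"), "-"),
    (("s", "b", "l"), "-"), (("t", "b", "w"), "+"),
    (("t", "d", "w"), "+"), (("s", "b", "w"), "+"),
]

H = entropy([y for _, y in data])   # ~0.954 bits
H_hair = split_entropy(data, 1)     # 0.5 bits: the largest entropy reduction
```

Running `split_entropy` for attribute indices 0 (Height) and 2 (Eye) gives roughly 0.951 and 0.607 bits, so Hair wins, as the slide concludes.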
Learning Decision Tree Classifiers – Example

Resulting tree:
Hair = d → +
Hair = r → −
Hair = b → test Eye:  l → −,  w → +

Compare the result with Naïve Bayes
In practice, we need some way to prune the tree to avoid overfitting the training data; more on this later.
Learning, generalization, overfitting

Consider the error of a hypothesis h over:
the training data: Error_Train(h)
the entire distribution D of data: Error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
Error_Train(h) < Error_Train(h′) and Error_D(h) > Error_D(h′)
Overfitting in decision tree learning
Causes of overfitting

As we move further away from the root, the data set used to choose the best test becomes smaller, yielding poor estimates of entropy
Noisy examples can further exacerbate overfitting
Minimizing overfitting

Use roughly the same size sample at every node to estimate entropy, when there is a large data set from which we can sample
Stop when further splitting fails to yield statistically significant information gain (estimated from a validation set)
Grow the full tree, then prune: minimize size(tree) + size(exceptions(tree))
Reduced error pruning

Each decision node in the tree is considered as a candidate for pruning
Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common label at that node, or storing the class distribution at that node (for probabilistic classification)
Reduced error pruning – Example

[Figure: a tree with root test A (100 examples) and an internal test B below it (40/60 split); estimated accuracy gain from pruning is −20% at A and +10% at B, so B is pruned. Before pruning vs. after pruning, with the subtree at B replaced by a leaf.]
Reduced error pruning

Do until further pruning is harmful:
Evaluate the impact on the validation set of pruning each candidate node
Greedily select the node which most improves performance on the validation set when the subtree rooted at that node is pruned

Drawback: holding back the validation set limits the amount of training data available; not desirable when the data set is small
Reduced error pruning: Pruned Tree
Pruning

Evaluate a candidate split to decide if the resulting information gain is significantly greater than zero, as determined using a suitable hypothesis-testing method at a desired significance level

Example: n_1 = 50 instances of class 1, n_2 = 50 of class 2, N = 100, p = 0.5 (half the instances go to L). Observed: n_1L = 50, n_2L = 0. A random split would give n_1e = p n_1 = 25 and n_2e = p n_2 = 25.

χ² = (n_1L − n_1e)² / n_1e + (n_2L − n_2e)² / n_2e = 25 + 25 = 50

This split is significantly better than random with confidence > 99%, because χ² > 6.64
Pruning based on whether information gain is significantly greater than zero

Evaluate a candidate split to decide if the resulting information gain is significantly greater than zero, as determined using a suitable hypothesis-testing method at a desired significance level

Example: the χ² statistic. In the 2-class, binary (L, R) split case: n_1 instances of class 1, n_2 of class 2; N = n_1 + n_2. The split sends pN instances to L and (1 − p)N to R. A random split would send p n_1 of class 1 to L and p n_2 of class 2 to L:

χ² = Σ_i (n_iL − n_ie)² / n_ie

The critical value of χ² depends on the degrees of freedom, which is 1 in this case (for a given p, n_1L fully specifies n_2L, n_1R and n_2R)
In general, the number of degrees of freedom can be > 1
Pruning based on whether information gain is significantly greater than zero

The greater the value of
χ² = Σ_{j=1}^{Branches} Σ_{i=1}^{Classes} (n_ij − n_ije)² / n_ije,  with expected counts n_ije = p_j n_i,
where N = n_1 + ... + n_Classes and p_1 + ... + p_Branches = 1, the less likely it is that the split is random. For a sufficiently high value of χ², the difference between the expected (random) split and the observed split is statistically significant, and we reject the null hypothesis that the split is random.

Degrees of freedom = (Classes − 1)(Branches − 1)
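The general statistic can be sketched as follows, summing over all branches and classes with expected counts n_ije = p_j n_i (the function name is mine, not from the slides). Note that summing over both branches of a binary split gives twice the left-branch-only value computed on the earlier worked example.

```python
def chi_square_split(left_counts, right_counts):
    """Chi-square statistic for a binary (L, R) split.

    left_counts[i], right_counts[i]: observed counts of class i in each
    branch. Expected counts assume a random split that preserves the
    overall branch proportions for every class.
    """
    n_left = sum(left_counts)
    n_right = sum(right_counts)
    n = n_left + n_right
    chi2 = 0.0
    for i in range(len(left_counts)):
        n_i = left_counts[i] + right_counts[i]      # total count of class i
        for obs, n_branch in ((left_counts[i], n_left),
                              (right_counts[i], n_right)):
            exp = n_i * n_branch / n                 # expected under random split
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2

# Perfect separation of 50 + 50 instances: far above the critical value
# 6.63 for (2-1)(2-1) = 1 degree of freedom at the 0.01 level.
stat = chi_square_split([50, 0], [0, 50])
```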
Rule post-pruning

Convert the tree to an equivalent set of rules:
IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...
Rule post-pruning

1. Convert the tree to an equivalent set of rules
2. Prune each rule independently of the others, by dropping one condition at a time if doing so does not reduce estimated accuracy (at the desired confidence level)
3. Sort the final rules in order of lowest to highest error

Advantage: can potentially correct bad choices made close to the root
Post-pruning based on a validation set is the most commonly used method in practice
Development of pre-pruning methods with comparable performance that do not require a validation set is an open problem
Classification of instances

Unique classification: possible when each leaf has zero entropy and there are no missing attribute values
Most likely classification: based on the distribution of classes at a node, when there are no missing attribute values
Probabilistic classification: based on the distribution of classes at a node, when there are no missing attribute values
Handling different types of attribute values

Types of attributes:
Nominal: values are names
Ordinal: values are ordered
Cardinal (numeric): values are numbers (hence ordered)
Handling numeric attributes

Attribute T: 40 48 50 54 60 70
Class:       N  N  Y  Y  Y  N

Sort instances by value of the numeric attribute under consideration
Candidate splits: T > (48 + 50)/2?  T > (60 + 70)/2?
For each attribute, find the test which yields the lowest entropy
Greedily choose the best test across all attributes

E(S | T > 49?) = (2/6)(0) + (4/6)(−(3/4) log₂(3/4) − (1/4) log₂(1/4)) ≈ 0.541 bits
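A sketch of this threshold search, assuming candidate thresholds at midpoints between consecutive sorted values (the helper names are mine). On the six-instance example above it recovers T > 49 as the best split.

```python
import math

def entropy(labels):
    """Empirical entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct sorted values; return
    the threshold with the lowest expected entropy of the split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best[1]:
            best = (t, e)
    return best

T = [40, 48, 50, 54, 60, 70]
cls = ["N", "N", "Y", "Y", "Y", "N"]
t, e = best_threshold(T, cls)   # t = 49.0, e ~ 0.541 bits
```

The slide only lists midpoints at class boundaries as candidates; trying all midpoints, as here, finds the same winner at slightly more cost.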
Handling numeric attributes

Axis-parallel split versus oblique split (e.g., a linear combination of two attributes separating classes C1 and C2)
Oblique splits cannot be realized by univariate tests
Two-way versus multi-way splits

The entropy criterion favors many-valued attributes
Pathological behavior: what if, in a medical diagnosis data set, social security number is one of the candidate attributes?
A = value versus A ≠ value

Solutions:
Only two-way splits (CART)
Entropy ratio (C4.5):
EntropyRatio(S | A) = Entropy(S | A) / SplitEntropy(S | A)
SplitEntropy(S | A) = −Σ_{v∈Values(A)} (|S_v| / |S|) log₂ (|S_v| / |S|)
Alternative split criteria

Consider a split of set S based on attribute A:
Impurity(S | A) = Σ_{v∈Values(A)} (|S_v| / |S|) Impurity(S_v)

Entropy: Impurity(Z) = −Σ_{i=1}^{Classes} (|Z_i| / |Z|) log₂ (|Z_i| / |Z|)
Gini: Impurity(Z) = 1 − Σ_{i=1}^{Classes} (|Z_i| / |Z|)²
(the expected rate of error if the class label is picked randomly according to the distribution of instances in the set)
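Both impurity measures can be written directly from class counts; a minimal sketch (function names are mine):

```python
import math

def entropy_impurity(counts):
    """-sum p_i log2 p_i over nonzero class proportions."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gini_impurity(counts):
    """1 - sum of squared class proportions: the expected error rate of
    labeling an instance randomly according to the class distribution."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Both measures vanish for a pure node and peak for a uniform
# class distribution (entropy 1 bit, Gini 0.5 for two classes).
```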
Alternative split criteria

One-sided split criteria are often useful for exploratory analysis of data:
Impurity(S | A) = min_{v∈Values(A)} {Impurity(S_v)}
Incorporating attribute costs

Not all attribute measurements are equally costly or risky. In medical diagnosis:
Blood-Test has cost $150
Exploratory-Surgery may have a cost of $3000

Goal: learn a decision tree classifier which minimizes the cost of classification

Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
Nunez (1988): (2^{Gain(S, A)} − 1) / (Cost(A) + 1)^w
where w ∈ [0, 1] determines the importance of cost
Incorporating different misclassification costs for different classes

Not all misclassifications are equally costly: an occasional false alarm about a nuclear power plant meltdown is less costly than the failure to alert when there is a chance of a meltdown

Weighted Gini impurity:
Impurity(S) = Σ_i Σ_j λ_ij (|S_i| / |S|)(|S_j| / |S|)
where λ_ij is the cost of wrongly assigning an instance belonging to class j to class i
Dealing with Missing Attribute Values (Solution 1)

Sometimes the fact that an attribute value is missing might itself be informative: a missing blood sugar level might imply that the physician had reason not to measure it
Introduce a new value (one per attribute) to denote a missing value
Decision tree construction and use of the tree for classification proceed as before
Dealing with Missing Attribute Values (Solution 2)

During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node
During use of the tree for classification: replace a missing attribute value in an instance to be classified with the most frequent value found among the training instances at the node
Dealing with Missing Attribute Values (Solution 3)

During decision tree construction: replace a missing attribute value in a training example with the most frequent value found among the instances at the node that have the same class label as the training example
During use of the tree for classification: assign to a missing attribute the most frequent value found at the node (based on the training sample), then sort the instance through the tree to generate the class label
Dealing with Missing Attribute Values (Solution 4)

During decision tree construction: generate several fractionally weighted training examples based on the distribution of values for the corresponding attribute at the node
During use of the tree for classification: generate multiple instances by assigning candidate values for the missing attribute based on the distribution of instances at the node; sort each such instance through the tree to generate candidate labels, and assign the most probable class label, or probabilistically assign a class label
Dealing with Missing Attribute Values – Example

Root test A (n+ = 60, n− = 40): A = 1 → leaf + (n+|A=1 = 50); A = 0 → test B (n+|A=0 = 10; n−|A=0 = 40)
B = 1 → leaf − (n−|A=0,B=1 = 40); B = 0 → leaf + (n+|A=0,B=0 = 10)

Suppose B is missing:
Replacement with the most frequent value at the node: B = 1
Replacement with the most frequent value if the class is +: B = 0
Dealing with Missing Attribute Values – Example

Same tree as before: A = 0 leads to test B, with n+|A=0 = 10 (all with B = 0) and n−|A=0 = 40 (all with B = 1)

Suppose B is missing:
Fractional instance based on the distribution at the node: weight 4/5 for B = 1, 1/5 for B = 0
Fractional instance based on the distribution at the node for class +: weight 1 for B = 0, 0 for B = 1
Recent Developments in Decision Tree Classifiers

Learning decision trees from distributed data (Caragea, Silvescu and Honavar, 2004)
Learning decision trees from attribute value taxonomies and partially specified data (Zhang and Honavar, 2003; 2004; Zhang et al., 2005)
Learning decision trees from relational data (Atramentov, Leiva, and Honavar, 2003)
Decision trees for multi-label classification tasks
Learning decision trees from very large data sets
Learning decision trees from data streams
Summary of Decision Trees

Simple
Fast (linear in the size of the tree, linear in the size of the training set, linear in the number of attributes)
Produce easy-to-interpret rules
Good for generating simple predictive rules from data with lots of attributes
Neural Networks

Decision trees are good at modeling nonlinear interactions among a small subset of attributes
Sometimes we are interested in linear interactions among all attributes
Simple neural networks are good at modeling such interactions
The resulting models have close connections with naïve Bayes
Measuring classifier performance

To simplify matters, assume that class labels are binary: an M-class problem is turned into M 2-class problems
Measuring Classifier Performance

N: total number of instances in the data set
TP_j: number of true positives for class j
FP_j: number of false positives for class j
TN_j: number of true negatives for class j
FN_j: number of false negatives for class j

Accuracy_j = (TP_j + TN_j) / N = P(label(c) = class(c))

Perfect classifier → Accuracy = 1
Popular measure, but biased in favor of the majority class! Should be used with caution!
Measuring Classifier Performance: Sensitivity

Sensitivity_j = TP_j / (TP_j + FN_j) = Count(label = c_j ∧ class = c_j) / Count(class = c_j) = P(label = c_j | class = c_j)

Perfect classifier → Sensitivity = 1
Probability of correctly labeling members of the target class
Also called recall or hit rate
Measuring Classifier Performance: Specificity

Specificity_j = TP_j / (TP_j + FP_j) = Count(label = c_j ∧ class = c_j) / Count(label = c_j) = P(class = c_j | label = c_j)

Perfect classifier → Specificity = 1
Also called precision
Probability that a positive prediction is correct
Classifier Learning – Measuring Performance

Confusion matrix for class C1:
              label C1    label ¬C1
class C1      TP = 55     FN = 10
class ¬C1     FP = 5      TN = 30

N = TP + FN + TN + FP = 100
sensitivity_1 = TP / (TP + FN) = 55/65
specificity_1 = TP / (TP + FP) = 55/60
accuracy_1 = (TP + TN) / N = 85/100
Measuring Performance: Precision, Recall, and False Alarm Rate

Precision = Specificity = TP / (TP + FP);  perfect classifier → Precision = 1
Recall = Sensitivity = TP / (TP + FN);  perfect classifier → Recall = 1
FalseAlarm_j = FP_j / (TN_j + FP_j) = Count(label = c_j ∧ class ≠ c_j) / Count(class ≠ c_j) = P(label = c_j | class ≠ c_j)
Perfect classifier → False Alarm Rate = 0
Measuring Performance: Correlation Coefficient

CC = (TP · TN − FP · FN) / √((TP + FN)(TP + FP)(TN + FP)(TN + FN))

−1 ≤ CC ≤ 1;  perfect classifier → CC = 1, random guessing → CC = 0

Corresponds to the standard measure of correlation between two random variables, Label and Class, estimated from the labels L and the corresponding class values C, for the special case of binary (0/1) valued labels and classes:
CC = Σ_d (label_d − mean(label))(class_d − mean(class)) / (N σ_LABEL σ_CLASS)
where label_d = 1 iff the classifier assigns d to class c_j, and class_d = 1 iff the true class of d is c_j
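The measures defined so far can be collected into one function. A sketch (the function name and dictionary keys are mine), using these slides' convention that "specificity" means precision, evaluated on the earlier confusion matrix TP = 55, FP = 5, FN = 10, TN = 30:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Per-class performance measures from confusion-matrix counts."""
    n = tp + fp + fn + tn
    m = {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # recall / hit rate
        "specificity": tp / (tp + fp),   # precision, in these slides
        "false_alarm": fp / (fp + tn),
    }
    # Correlation coefficient; 0 when any marginal is empty.
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    m["cc"] = (tp * tn - fp * fn) / denom if denom else 0.0
    return m

m = binary_metrics(55, 5, 10, 30)
# accuracy 0.85, sensitivity 55/65, specificity 55/60, cc ~ 0.68
```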
Beware of terminological confusion in the literature!

Some bioinformatics authors use "accuracy" incorrectly to refer to recall (i.e., sensitivity) or to precision (i.e., specificity)
In medical statistics, "specificity" sometimes refers to sensitivity for the negative class, i.e., TN / (TN + FP)
Some authors use "false alarm rate" to refer to the probability that a positive prediction is incorrect, i.e., FP / (FP + TP) = 1 − Precision

When you write: provide the formula in terms of TP, TN, FP, FN
When you read: check the formula in terms of TP, TN, FP, FN
Measuring Classifier Performance

TP, FP, TN, FN provide the relevant information
No single measure tells the whole story
A classifier with 90% accuracy can be useless if 90 percent of the population does not have cancer and the 10% that do are misclassified by the classifier
Use of multiple measures is recommended
Beware of terminological confusion!
Measuring Classifier Performance: Micro-averages

Micro-averaging gives equal importance to each instance; classes with a large number of instances dominate

MicroAverage Precision = Σ_j TP_j / Σ_j (TP_j + FP_j)
MicroAverage Recall = Σ_j TP_j / Σ_j (TP_j + FN_j)
MicroAverage FalseAlarm = 1 − MicroAverage Precision
MicroAverage Accuracy = Σ_j TP_j / N
Etc. (the micro-averaged CC is defined analogously from the summed counts)
Measuring Classifier Performance: Macro-averaging

Macro-averaging gives equal importance to each of the M classes

MacroAverage Sensitivity = (1/M) Σ_j Sensitivity_j
MacroAverage Specificity = (1/M) Σ_j Specificity_j
MacroAverage CorrelationCoeff = (1/M) Σ_j CorrelationCoeff_j
Receiver Operating Characteristic (ROC) Curve

We can often trade off recall versus precision, e.g., by adjusting the classification threshold θ:
label = c_j if P(c_j | X) / P(¬c_j | X) > θ

The ROC curve is a plot of Sensitivity against False Alarm Rate, which characterizes this tradeoff for a given classifier
Receiver Operating Characteristic (ROC) Curve

[Plot: Sensitivity = TP / (TP + FN) on the vertical axis (0 to 1) against False Alarm Rate = FP / (FP + TN) on the horizontal axis (0 to 1); a perfect classifier sits at the top-left corner]
Measuring Performance of Classifiers: ROC curves

ROC curves offer a more complete picture of the performance of the classifier as a function of the classification threshold
A classifier h is better than another classifier g if ROC(h) dominates ROC(g)
ROC(h) dominates ROC(g) ⇒ AreaROC(h) > AreaROC(g)
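A minimal sketch of computing ROC points and the area under the curve from classifier scores, assuming each distinct score is used as a threshold and ignoring tied scores (function names are mine):

```python
def roc_points(scores, labels):
    """(false-alarm rate, sensitivity) pairs swept over thresholds.

    scores: classifier confidence for the positive class; labels: 0/1.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one instance at a time, highest score first.
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve via the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A classifier that ranks every positive above every negative: AUC = 1
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```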
Evaluating a Classifier

How well can a classifier be expected to perform on novel data?
We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using a test data set (not used for training)
How close is the estimated performance to the true performance?

References: Evaluation of discrete valued hypotheses, Chapter 5, Mitchell; Empirical Methods for Artificial Intelligence, Cohen. Better yet, take Stat 430x
Estimating the performance of a classifier

The true error of a hypothesis h with respect to a target function f and an instance distribution D is
Error_D(h) = Pr_{x∈D} [f(x) ≠ h(x)]

The sample error of a hypothesis h with respect to a target function f and a sample S is
Error_S(h) = (1/|S|) Σ_{x∈S} δ(f(x), h(x)),  where δ(a, b) = 1 iff a ≠ b and δ(a, b) = 0 otherwise
Estimating classifier performance – Example

Domain X = {a, b, c, d}, with D(a) = 1/2, D(b) = 1/8, D(c) = 1/8, D(d) = 1/4
Suppose h agrees with f on a and d but disagrees on b and c (e.g., h = (1, 1, 0, 0) and f = (1, 0, 1, 0) on (a, b, c, d)). Then
error_D(h) = Pr_{x∈D} [h(x) ≠ f(x)] = D(b) + D(c) = 1/8 + 1/8 = 1/4
Evaluating the performance of a classifier

Bias ≡ E[Error_S(h)] − Error_D(h)
Sample error estimated from training data is an optimistic estimate
For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
Even when the estimate is unbiased, it can vary across samples!
If h misclassifies 8 out of 100 samples, Error_S(h) = 8/100 = 0.08
How close is the sample error to the true error?
How close is the estimated error to the true error?

Choose a sample S of size n according to distribution D
Measure Error_S(h)
Error_S(h) is a random variable (the outcome of a random experiment)
Given Error_S(h), what can we conclude about Error_D(h)?
More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?
Evaluation of a classifier with limited data

There is extensive literature on how to estimate classifier performance from samples and how to assign confidence to estimates (see Mitchell, Chapter 5)
Holdout method: use part of the data for training, and the rest for testing
We may be unlucky: the training data or test data may not be representative
Solution: run multiple experiments with disjoint training and test data sets, in which each class is represented in roughly the same proportion as in the entire data set
Estimating the performance of the learned classifier: K-fold cross-validation

Partition the data (multi)set S into K equal parts S_1 ... S_K, with roughly the same class distribution as S.
Errorc ← 0
For i = 1 to K do:
  S_Test ← S_i;  S_Train ← S − S_i
  α_i ← Learn(S_Train)
  Errorc ← Errorc + Error(α_i, S_Test)
Error ← Errorc / K;  Output(Error)
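The recipe above can be sketched as follows; the stratified split keeps roughly the same class distribution in each fold, as the slide requires. `learn` and `error` are caller-supplied placeholders, not functions from the slides.

```python
import random
from collections import defaultdict

def stratified_k_folds(examples, k, seed=0):
    """Split (x, y) examples into k folds, each with roughly the same
    class distribution as the full set."""
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[1]].append(ex)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for group in by_class.values():
        rng.shuffle(group)
        for i, ex in enumerate(group):
            folds[i % k].append(ex)
    return folds

def cross_val_error(examples, k, learn, error):
    """learn(train) -> hypothesis; error(h, test) -> error rate in [0, 1]."""
    folds = stratified_k_folds(examples, k)
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        total += error(learn(train), test)
    return total / k
```

For example, on 20 instances split evenly between two classes, a constant classifier scores a cross-validated error of exactly 0.5, since every fold preserves the 50/50 class balance.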
Leave-one-out cross-validation

K-fold cross-validation with K = n, where n is the total number of samples available: n experiments, each using n − 1 samples for training and the remaining sample for testing
Leave-one-out cross-validation does not guarantee the same class distribution in the training and test data!
Extreme case: 50% class 1, 50% class 2; predict the majority class label in the training data
True error: 50%; leave-one-out error estimate: 100%!
Estimating classifier performance: Recommended procedure

Use K-fold cross-validation (K = 5 or 10) for estimating performance (accuracy, precision, recall, points on the ROC curve, etc.)
Compute the mean values of the performance estimates and their standard deviations
Report the mean values of the performance estimates, and their standard deviations or 95% confidence intervals around the mean
Be skeptical: repeat the experiments several times with different random splits of the data into K folds!
Evaluating a hypothesis or a learning algorithm

How well can the decision tree be expected to perform on novel data?
We can estimate the performance (e.g., accuracy) of the decision tree using a test data set (not used for training)
How close is the estimated performance to the true performance?

Reference: Evaluation of discrete valued hypotheses, Chapter 5, Mitchell
Evaluating performance when we can afford to test on a large independent test set

The true error of a hypothesis h with respect to a target function f and an instance distribution D is
Error_D(h) = Pr_{x∈D} [f(x) ≠ h(x)]

The sample error of a hypothesis h with respect to a target function f and a sample S is
Error_S(h) = (1/|S|) Σ_{x∈S} δ(f(x), h(x)),  where δ(a, b) = 1 iff a ≠ b and δ(a, b) = 0 otherwise
Evaluating Classifier Performance

Bias ≡ E[Error_S(h)] − Error_D(h)
Sample error estimated from training data is an optimistic estimate
For an unbiased estimate, h must be evaluated on an independent sample S (which is not the case if S is the training set!)
Even when the estimate is unbiased, it can vary across samples!
If h misclassifies 8 out of 100 samples, Error_S(h) = 8/100 = 0.08
How close is the sample error to the true error?
How close is the estimated error to its true value?

Choose a sample S of size n according to distribution D
Measure Error_S(h)
Error_S(h) is a random variable (the outcome of a random experiment)
Given Error_S(h), what can we conclude about Error_D(h)?
More generally, given the estimated performance of a hypothesis, what can we say about its actual performance?
How close is the estimated accuracy to its true value?

Question: how close is p (the true probability) to p̂?
This problem is an instance of a well-studied problem in statistics: estimating the proportion of a population that exhibits some property, given the observed proportion over a random sample of the population.
In our case, the property of interest is that h correctly (or incorrectly) classifies a sample.
Testing h on a single random sample x drawn according to D amounts to performing a random experiment which succeeds if h correctly classifies x, and fails otherwise.
How close is the estimated accuracy to its true value?

The output of a hypothesis whose true error is p can be viewed as a binary random variable: the outcome of a Bernoulli trial with success rate p (the probability of correct prediction)
The number of successes r observed in n trials is a random variable Y which follows the binomial distribution:
P(r) = (n! / (r! (n − r)!)) p^r (1 − p)^{n−r}
Error_S(h) is a Random Variable

Probability of observing r misclassified examples in a sample of size n:
P(r) = (n! / (r! (n − r)!)) p^r (1 − p)^{n−r}
Recall basic statistics

Consider a random experiment with discrete valued outcomes y_1, y_2, ..., y_M
The expected value of the corresponding random variable Y is E(Y) = Σ_{i=1}^{M} y_i Pr(Y = y_i)
The variance of Y is Var(Y) = E[(Y − E[Y])²]
The standard deviation of Y is σ_Y = √Var(Y)
How close is the estimated accuracy to its true value?

The mean of a Bernoulli trial with success rate p is p; its variance is p(1 − p)
If N trials are taken from the same Bernoulli process, the observed success rate p̂ has mean p and variance p(1 − p)/N
For large N, the distribution of p̂ follows a Gaussian distribution
Binomial Probability Distribution

P(r) = (n! / (r! (n − r)!)) p^r (1 − p)^{n−r}
Probability P(r) of r heads in n coin flips, if p = Pr(heads)

Expected, or mean value of X: E[X] = Σ_{i=0}^{n} i P(i) = np
Variance of X: Var(X) = E[(X − E[X])²] = np(1 − p)
Standard deviation of X: σ_X = √(E[(X − E[X])²]) = √(np(1 − p))
Estimators, Bias, Variance, Confidence Interval

Error_S(h) = r/n;  Error_D(h) = p

σ_{Error_S(h)} = √(p(1 − p)/n) = √(Error_D(h)(1 − Error_D(h))/n) ≈ √(Error_S(h)(1 − Error_S(h))/n)

An N% confidence interval for some parameter p is the interval which is expected with probability N% to contain p
Normal distribution approximates binomial

Error_S(h) follows a binomial distribution, with
mean μ_{Error_S(h)} = Error_D(h)
standard deviation σ_{Error_S(h)} = √(Error_D(h)(1 − Error_D(h))/n)

We can approximate this by a Normal distribution with the same mean and variance when np(1 − p) ≥ 5
Normal distribution

p(x) = (1/√(2πσ²)) e^{−(1/2)((x − μ)/σ)²}

The probability that X will fall in the interval (a, b) is given by ∫_a^b p(x) dx
Expected, or mean value of X: E[X] = μ
Variance of X: Var(X) = σ²
Standard deviation of X: σ_X = σ
How close is the estimated accuracy to its true value?

Let c be the probability that a Gaussian random variable X with zero mean and unit variance takes a value between −z and z: Pr[−z ≤ X ≤ z] = c
Pr[X ≥ z] = 5% implies there is a 5% chance that X lies more than 1.65 standard deviations above the mean, or Pr[−1.65 ≤ X ≤ 1.65] = 90%

Pr[X ≥ z]:  0.001  0.005  0.01  0.05  0.10
z:          3.09   2.58   2.33  1.65  1.28
How close is the estimated accuracy to its true value?

But p̂ does not have zero mean and unit variance, so we normalize to get:
Pr[−z < (p̂ − p) / √(p(1 − p)/n) < z] = c
How close is the estimated accuracy to its true value?

To find the confidence limits: given a particular confidence figure c, use the table to find the z corresponding to the tail probability ½(1 − c). Use linear interpolation for values not in the table. Solving the normalized inequality for p gives

p = ( p̂ + z²/2n ± z √( p̂/n − p̂²/n + z²/4n² ) ) / ( 1 + z²/n )
How close is the estimated accuracy to its true value?

Example: p̂ = 0.75, n = 1000, c = 0.80 (so z = 1.28)
Then with 80% confidence, we can say that the value of p lies in the interval [0.733, 0.768]

Note: the normal distribution assumption is valid only for large n (i.e., np(1 − p) ≥ 5, or n > 30), so estimates based on smaller values of n should be taken with a generous dose of salt
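The simpler large-n approximation p̂ ± z√(p̂(1 − p̂)/n) can be sketched in a few lines (the exact formula on the previous slide additionally shifts the center by z²/2n; the function name is mine). On this slide's example it gives approximately (0.732, 0.768), close to the quoted interval.

```python
import math

def error_confidence_interval(p_hat, n, z):
    """Normal-approximation confidence interval for the true error:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    Only valid when n * p * (1 - p) >= 5 (roughly n > 30)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# p_hat = 0.75, n = 1000, 80% confidence (z = 1.28)
lo, hi = error_confidence_interval(0.75, 1000, 1.28)
```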
Estimating confidence intervals

80% of the area (probability) lies in μ ± 1.28σ
N% of the area (probability) lies in μ ± z_N σ
Confidence intervals

If S contains n examples, drawn independently of h and of each other, and n ≥ 30 or np(1 − p) ≥ 5, then with approximately N% probability, Error_S(h) lies in the interval
Error_D(h) ± z_N √(Error_D(h)(1 − Error_D(h))/n)
Equivalently, Error_D(h) lies in the interval
Error_S(h) ± z_N √(Error_D(h)(1 − Error_D(h))/n)
which is approximately
Error_S(h) ± z_N √(Error_S(h)(1 − Error_S(h))/n)
One-sided confidence intervals

What is the probability that Error_D(h) is at most U?
Symmetry of the Gaussian distribution implies that a two-sided confidence interval with 100(1 − α)% confidence, with lower bound L and upper bound U, corresponds to a one-sided confidence interval with confidence 100(1 − α/2)%, with upper bound U but no lower bound (or vice versa)
General approach to deriving confidence intervals

1. Identify the population parameter p to be estimated, e.g., Error_D(h)
2. Define a suitable estimator W, preferably unbiased with minimum variance
3. Determine the distribution D_W obeyed by W, and the mean and variance of W
4. Determine the confidence interval by finding thresholds L and U such that N% of the mass of the probability distribution D_W falls within the interval [L, U]
Central Limit Theorem Simplifies Confidence Interval Calculations

Consider a set of independent, identically distributed random variables Y_1 ... Y_n, all governed by an arbitrary probability distribution with mean μ and finite variance σ². Define the sample mean Ȳ = (1/n) Σ_{i=1}^{n} Y_i

Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean μ and variance σ²/n
Evaluation of a classifier with limited data

Holdout method: use part of the data for training, and the rest for testing
We may be unlucky: the training data or test data may not be representative
Solution: run multiple experiments with disjoint training and test data sets, in which each class is represented in roughly the same proportion as in the entire data set
Estimating the performance of the learned classifier: K-fold cross-validation

Partition the data (multi)set S into K equal parts S_1 ... S_K, where each part has roughly the same class distribution as S.
A ← 0
For i = 1 to K do:
  S_Test ← S_i;  S_Train ← S − S_i
  α_i ← Learn(S_Train)
  A ← A + Accuracy(α_i, S_Test)
Accuracy ← A / K;  Output(Accuracy)
K-fold cross-validation

Recommended procedure for evaluating classifiers when data are limited:
Use K-fold cross-validation (K = 5 or 10)
Better still, repeat K-fold cross-validation R times and average the results
Difference in error between two hypotheses

We wish to estimate d = Error_D(h_1) − Error_D(h_2)
Suppose h_1 has been tested on a sample S_1 of size n_1 drawn according to D, and h_2 has been tested on a sample S_2 of size n_2 drawn according to D
An unbiased estimator: d̂ = Error_{S1}(h_1) − Error_{S2}(h_2)
For large n_1 and n_2, the corresponding error estimates follow Normal distributions
The difference of two Normal distributions yields a Normal distribution, with variance equal to the sum of the variances of the individual distributions
Difference between errors of two hypotheses

d = Error_D(h_1) − Error_D(h_2);  d̂ = Error_{S1}(h_1) − Error_{S2}(h_2)

σ_d̂ ≈ √( Error_{S1}(h_1)(1 − Error_{S1}(h_1))/n_1 + Error_{S2}(h_2)(1 − Error_{S2}(h_2))/n_2 )

N% confidence interval: d̂ ± z_N σ_d̂

When S_1 = S_2, the variance of d̂ is smaller, and the confidence interval is correct but overly conservative
Hypothesis testing

Is one hypothesis likely to be better than another? What is the probability that Error_D(h_1) > Error_D(h_2)?
Suppose Error_{S1}(h_1) = 0.30 and Error_{S2}(h_2) = 0.20, so d̂ = 0.10
What is the probability that d > 0, given that d̂ = 0.10?
Pr(d > 0 | d̂ = 0.10) = Pr(d̂ < μ_d̂ + 0.10)
Hypothesis testing

If n_1 = n_2 = 100, σ_d̂ ≈ 0.061, so
Pr(d > 0 | d̂ = 0.10) = Pr(d̂ < μ_d̂ + 1.64 σ_d̂) ≈ 0.95
We accept the hypothesis that Error_D(h_1) > Error_D(h_2) with 95% confidence
Equivalently, we reject the opposite hypothesis (the null hypothesis) at a (1 − 0.95) = 0.05 level of significance
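The calculation above can be checked in a few lines, assuming independent test sets and the normal approximation (the function name is mine):

```python
import math

def diff_test(err1, n1, err2, n2):
    """Estimated difference in true errors and its standard deviation,
    for two hypotheses tested on independent samples of sizes n1, n2."""
    d_hat = err1 - err2
    sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d_hat, sigma

# The slide's example: Error_S1(h1) = 0.30, Error_S2(h2) = 0.20, n1 = n2 = 100
d_hat, sigma = diff_test(0.30, 100, 0.20, 100)
# d_hat = 0.10, sigma ~ 0.061: d_hat sits ~1.64 sigma above zero,
# so Error_D(h1) > Error_D(h2) with ~95% confidence
```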
Comparing learning algorithms L_A and L_B

Which learning algorithm is better at learning f?
Unlimited data:
Run L_A and L_B on a large training set S_Train drawn according to D
Test the resulting hypotheses on a large independent test set S_Test drawn according to D
Estimate Pr[Error_D(L_A(S_Train)) > Error_D(L_B(S_Train))] using Error_{S_Test}(L_A(S_Train)) and Error_{S_Test}(L_B(S_Train))
Comparing learning algorithms L_A and L_B

Estimate the expected value of the difference in errors of L_A and L_B, where the expectation is taken over training sets S_Train drawn according to D:
E_{S_Train ⊂ D} [ Error_D(L_A(S_Train)) − Error_D(L_B(S_Train)) ]
But we have a limited data set S drawn from an unknown D!
Comparing learning algorithms L_A and L_B

Limited data: paired t-test
Run L_A and L_B on a training set S_Train drawn according to D
Test the resulting hypotheses on an independent test set S_Test drawn according to D
Estimate Pr[Error_D(L_A(S_Train)) > Error_D(L_B(S_Train))] using Error_{S_Test}(L_A(S_Train)) and Error_{S_Test}(L_B(S_Train))
Comparing learning algorithms L_A and L_B: Paired t-test

Partition S into k disjoint test sets T_1, T_2, ..., T_k of equal size
For i from 1 to k do:
  S_Test ← T_i;  S_Train ← S − T_i
  δ_i ← Error_{S_Test}(L_A(S_Train)) − Error_{S_Test}(L_B(S_Train))
Return δ̄ = (1/k) Σ_{i=1}^{k} δ_i
Comparing learning algorithms L_A and L_B

For large test sets, each δ_i has a Normal distribution; δ̄ has a Normal distribution if the δ_i are independent
Can we estimate a confidence interval for δ̄ as before?
The δ_i are not exactly independent, because we sample from S as opposed to the distribution D (but we will pretend that they are)
We don't know the standard deviation of this distribution, so we estimate it from the sample. But when the estimated variance is used, the distribution is no longer Normal, unless k is large (which typically it is not)
Comparing learning algorithms L_A and L_B

An approximate N% confidence interval for E_{S_Train ⊂ S} [ Error_D(L_A(S_Train)) − Error_D(L_B(S_Train)) ] is given by
δ̄ ± t_{N,k−1} s_δ̄,  where  s_δ̄ = √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )
is the estimate of the standard deviation of the t distribution governing δ̄, and t_{N,k−1} plays a role analogous to that of z_N
As k → ∞, t_{N,k−1} → z_N and s_δ̄ → σ_δ̄
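The paired procedure can be sketched as follows; `learn_a`, `learn_b`, and `error` are caller-supplied placeholders, and the critical value t_{N,k−1} (e.g., about 2.776 for a 95% two-sided interval with k = 5) must be looked up in a t table.

```python
import math

def paired_fold_differences(folds, learn_a, learn_b, error):
    """delta_i = Error_{T_i}(L_A) - Error_{T_i}(L_B), training both
    algorithms on the other k-1 folds each time."""
    deltas = []
    k = len(folds)
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        deltas.append(error(learn_a(train), test) - error(learn_b(train), test))
    return deltas

def t_interval_halfwidth(deltas, t_crit):
    """Mean difference and t_crit * s_delta_bar, with s as on the slide."""
    k = len(deltas)
    mean = sum(deltas) / k
    s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
    return mean, t_crit * s
```

If zero lies outside the interval [mean − halfwidth, mean + halfwidth], the difference between the two algorithms is significant at the chosen level.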
Performance evaluation: summary

Rigorous statistical evaluation is extremely important in experimental computer science in general, and in machine learning in particular:
How good is a learned hypothesis?
Is one hypothesis better than another?
Is one learning algorithm better than another on a particular learning task? (No learning algorithm outperforms all others on all tasks: the No Free Lunch theorem)
Different procedures for evaluation are appropriate under different conditions (large versus limited versus small sample)
It is important to know when to use which evaluation method, and to be aware of pathological behavior (a tendency to grossly overestimate or underestimate the target value under specific conditions)