In the homework, you are (supposedly)
1. Choosing a data set
2. Extracting a test set of size > 30
3. Building a tree on the training set
4. Testing on the test set
5. Reporting the accuracy
(Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu

Does the reported accuracy exactly match the generalization performance of the tree?
If a tree has a lower measured error than an ANN, is the tree absolutely better? Why or why not?
How about the algorithms in general?

Outline
Setting of performance evaluation
Error and confidence intervals
Paired t tests and cross-validation to compare learning algorithms
Performance measures: confusion matrices, ROC analysis, precision-recall curves

Setting
Before setting up an experiment, need to understand exactly what the goal is:
  Estimate the generalization performance of a hypothesis
  Tuning a learning algorithm's parameters
  Comparing two learning algorithms on a specific task
  Etc.
Will never be able to answer the question with 100% certainty, due to variances in training set selection, test set selection, etc.
Will choose an estimator for the quantity in question, determine the probability distribution of the estimator, and bound the probability that the estimator is way off
Estimator needs to work regardless of distribution of training/testing data

Setting (cont'd)
Need to note that, in addition to statistical variations, what we determine is limited to the application that we are studying
  E.g., if naïve Bayes is better than ID3 on spam filtering, that means nothing about face recognition
In planning experiments, need to ensure that training data is not used for evaluation
  I.e., don't test on the training set! Will bias the performance estimator
  Also holds for validation set used to prune DT, tune parameters, etc.
  Validation set serves as part of training set, but not used for model building

Types of Error
For now, focus on straightforward 0/1 classification error
For hypothesis $h$, recall the two types of classification error from Chapter 2:
Empirical error (or sample error) is fraction of set $V$ that $h$ gets wrong:
  $\mathrm{error}_V(h) \equiv \frac{1}{|V|} \sum_{x \in V} \delta(C(x) \neq h(x))$,
  where $\delta(C(x) \neq h(x))$ is 1 if $C(x) \neq h(x)$, and 0 otherwise
Generalization error (or true error) is probability that a new, randomly selected, instance is misclassified by $h$:
  $\mathrm{error}_D(h) \equiv P_{x \in D}[C(x) \neq h(x)]$,
  where $D$ is probability distribution instances are drawn from
Why do we care about $\mathrm{error}_V(h)$?

True Error
Bias: If $T$ is training set, $\mathrm{error}_T(h)$ is optimistically biased
  $\mathrm{bias} \equiv E[\mathrm{error}_T(h)] - \mathrm{error}_D(h)$
  For unbiased estimate (bias = 0), $h$ and $V$ must be chosen independently $\Rightarrow$ don't test on the training set!
  (By the way, this is distinct from inductive bias)
Variance: Even with unbiased $V$, $\mathrm{error}_V(h)$ may still vary from $\mathrm{error}_D(h)$

True Error (cont'd)
Experiment:
1. Choose sample $V$ of size $N$ according to distribution $D$
2. Measure $\mathrm{error}_V(h)$
$\mathrm{error}_V(h)$ is a random variable (i.e., result of an experiment)
$\mathrm{error}_V(h)$ is an unbiased estimator for $\mathrm{error}_D(h)$
Given observed $\mathrm{error}_V(h)$, what can we conclude about $\mathrm{error}_D(h)$?

Confidence Intervals
If $V$ contains $N$ examples, drawn independently of $h$ and each other, and $N \geq 30$
Then with approximately 95% probability, $\mathrm{error}_D(h)$ lies in
  $\mathrm{error}_V(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_V(h)(1 - \mathrm{error}_V(h))}{N}}$
E.g., hypothesis $h$ misclassifies 12 of the 40 examples in test set $V$: $\mathrm{error}_V(h) = 12/40 = 0.30$
Then with approx. 95% confidence, $\mathrm{error}_D(h) \in [0.158, 0.442]$

Confidence Intervals (cont'd)
If $V$ contains $N$ examples, drawn independently of $h$ and each other, and $N \geq 30$
Then with approximately $c$% probability, $\mathrm{error}_D(h)$ lies in
  $\mathrm{error}_V(h) \pm z_c \sqrt{\frac{\mathrm{error}_V(h)(1 - \mathrm{error}_V(h))}{N}}$

  c%:    50%   68%   80%   90%   95%   98%   99%
  z_c:   0.67  1.00  1.28  1.64  1.96  2.33  2.58

Why?

$\mathrm{error}_V(h)$ is a Random Variable
Repeatedly run the experiment, each with different randomly drawn $V$ (each of size $N$)
Probability of observing $r$ misclassified examples:
  $P(r) = \binom{N}{r} \, \mathrm{error}_D(h)^r \, (1 - \mathrm{error}_D(h))^{N-r}$
[Figure: Binomial distribution for n = 40, p = 0.3; P(r) plotted against r]
I.e., let $\mathrm{error}_D(h)$ be probability of heads in biased coin, then $P(r)$ = prob. of getting $r$ heads out of $N$ flips

Binomial Probability Distribution
  $P(r) = \binom{N}{r} p^r (1-p)^{N-r} = \frac{N!}{r!(N-r)!} p^r (1-p)^{N-r}$
Probability $P(r)$ of $r$ heads in $N$ coin flips, if $p = P(\text{heads})$
Expected, or mean value of $X$, $E[X]$ (= # heads on $N$ flips = # mistakes on $N$ test exs), is
  $E[X] \equiv \sum_{i=0}^{N} i P(i) = Np = N \cdot \mathrm{error}_D(h)$
Variance of $X$ is
  $\mathrm{Var}(X) \equiv E[(X - E[X])^2] = Np(1-p)$
Standard deviation of $X$, $\sigma_X$, is
  $\sigma_X \equiv \sqrt{E[(X - E[X])^2]} = \sqrt{Np(1-p)}$
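
The interval formula and the 12-of-40 example above can be checked numerically. The following is a minimal sketch, assuming the normal approximation holds; the function name `error_confidence_interval` and the dictionary `Z_TWO_SIDED` are illustrative names, not part of the slides.

```python
import math

# Two-sided z values from the slides' table (confidence level in % -> z_c).
Z_TWO_SIDED = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}


def error_confidence_interval(num_errors, num_examples, confidence=95):
    """Return (sample_error, lower, upper) using the normal approximation."""
    sample_error = num_errors / num_examples
    # Estimated standard deviation of error_V(h): sqrt(e(1 - e) / N).
    std_err = math.sqrt(sample_error * (1.0 - sample_error) / num_examples)
    half_width = Z_TWO_SIDED[confidence] * std_err
    return sample_error, sample_error - half_width, sample_error + half_width


# The worked example from the slides: 12 mistakes on 40 test examples.
err, low, high = error_confidence_interval(12, 40)
print(f"error_V(h) = {err:.2f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Passing a different `confidence` key (for example 90) selects another row of the z table and narrows or widens the interval accordingly.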

Approximate Binomial Dist. with Normal
$\mathrm{error}_V(h) = r/N$ is binomially distributed, with
  mean $\mu_{\mathrm{error}_V(h)} = \mathrm{error}_D(h)$ (i.e., unbiased est.)
  standard deviation $\sigma_{\mathrm{error}_V(h)} = \frac{\sigma_r}{N} = \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{N}}$ (increasing $N$ decreases variance)
Want to compute confidence interval = interval centered at $\mathrm{error}_D(h)$ containing $c$% of the weight under the distribution
Approximate binomial by normal (Gaussian) dist:
  mean $\mu_{\mathrm{error}_V(h)} = \mathrm{error}_D(h)$
  standard deviation $\sigma_{\mathrm{error}_V(h)} \approx \sqrt{\frac{\mathrm{error}_V(h)(1 - \mathrm{error}_V(h))}{N}}$

Normal Probability Distribution
[Figure: Normal distribution with mean 0, standard deviation 1]
  $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)$
The probability that $X$ will fall into the interval $(a, b)$ is given by $\int_a^b p(x)\,dx$
Expected, or mean value of $X$, $E[X]$, is $E[X] = \mu$
Variance is $\mathrm{Var}(X) = \sigma^2$, standard deviation is $\sigma_X = \sigma$

Normal Probability Distribution (cont'd)
[Figure: standard normal density with the central 80% of the area shaded]
80% of area (probability) lies in $\mu \pm 1.28\sigma$
$c$% of area (probability) lies in $\mu \pm z_c \sigma$

  c%:    50%   68%   80%   90%   95%   98%   99%
  z_c:   0.67  1.00  1.28  1.64  1.96  2.33  2.58

Normal Probability Distribution (cont'd)
Can also have one-sided bounds:
[Figure: standard normal density with a one-sided region shaded]
$c$% of area lies $< \mu + z'_c \sigma$ or $> \mu - z'_c \sigma$, where $z_c = z'_{100-(100-c)/2}$ (equivalently, $z'_c = z_{2c-100}$)

  c%:    50%   68%   80%   90%   95%   98%   99%
  z'_c:  0.00  0.47  0.84  1.28  1.64  2.05  2.33

Confidence Intervals Revisited
If $V$ contains $N \geq 30$ examples, indep. of $h$ and each other
Then with approximately 95% probability, $\mathrm{error}_V(h)$ lies in
  $\mathrm{error}_D(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{N}}$
Equivalently, $\mathrm{error}_D(h)$ lies in
  $\mathrm{error}_V(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{N}}$
which is approximately
  $\mathrm{error}_V(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_V(h)(1 - \mathrm{error}_V(h))}{N}}$
(One-sided bounds yield upper or lower error bounds)

Central Limit Theorem
How can we justify approximation?
Consider set of iid random variables $Y_1, \ldots, Y_N$, all from arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$.
Define sample mean $\bar{Y} \equiv \frac{1}{N} \sum_{i=1}^{N} Y_i$
$\bar{Y}$ is itself a random variable, i.e., result of an experiment (e.g., $\mathrm{error}_V(h) = r/N$)
Central Limit Theorem: As $N \to \infty$, the distribution governing $\bar{Y}$ approaches normal distribution with mean $\mu$ and variance $\sigma^2/N$
Thus the distribution of $\mathrm{error}_V(h)$ is approximately normal for large $N$, and its expected value is $\mathrm{error}_D(h)$
(Rule of thumb: $N \geq 30$ when estimator's distribution is binomial; might need to be larger for other distributions)
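
The normal approximation can be illustrated by simulation. The sketch below assumes a hypothetical true error of 0.3 and test sets of size 40 (values chosen only for illustration), draws many test sets, and compares the spread of the observed error rates with the predicted standard deviation. The helper `sample_error_rates` is an illustrative name, not something defined in the slides.

```python
import random
import statistics


def sample_error_rates(true_error, n, trials, seed=0):
    """Simulate `trials` test sets of size n; return the observed error rates r/n."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        # Each test example is misclassified independently with prob. true_error.
        mistakes = sum(1 for _ in range(n) if rng.random() < true_error)
        rates.append(mistakes / n)
    return rates


true_error, n = 0.3, 40  # hypothetical values chosen for illustration
rates = sample_error_rates(true_error, n, trials=20000)
predicted_std = (true_error * (1 - true_error) / n) ** 0.5
# Fraction of simulated test sets whose error rate lands in the 95% band.
within = sum(abs(r - true_error) <= 1.96 * predicted_std for r in rates) / len(rates)

print(f"mean of error_V(h): {statistics.mean(rates):.3f}")
print(f"std of error_V(h):  {statistics.pstdev(rates):.3f} (predicted {predicted_std:.3f})")
print(f"fraction within 1.96 std of the true error: {within:.3f}")
```

Because the error rate takes only discrete values r/n, the observed coverage is close to, but generally not exactly, 95%.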

Calculating Confidence Intervals
1. Pick parameter to estimate: $\mathrm{error}_D(h)$
2. Choose an estimator: $\mathrm{error}_V(h)$
3. Determine probability distribution that governs estimator: $\mathrm{error}_V(h)$ governed by binomial distribution, approximated by normal when $N \geq 30$
4. Find interval $(L, U)$ such that $c$% of probability mass falls in the interval
  Could have $L = -\infty$ or $U = \infty$
  Use table of $z_c$ or $z'_c$ values (if distribution normal)

Comparing Learning Algorithms
What if we want to compare two learning algorithms $L_1$ and $L_2$ (e.g., ID3 vs k-nearest neighbor) on a specific application?
Insufficient to simply compare error rates on a single test set
Use K-fold cross validation and a paired t test

K-Fold Cross Validation
1. Partition data set $X$ into $K$ equal-sized subsets $X_1, X_2, \ldots, X_K$, where $|X_i| \geq 30$
2. For $i$ from 1 to $K$, do (use $X_i$ for testing, and rest for training)
  1. $V_i = X_i$
  2. $T_i = X \setminus X_i$
  3. Train learning algorithm $L_1$ on $T_i$ to get $h_i^1$
  4. Train learning algorithm $L_2$ on $T_i$ to get $h_i^2$
  5. Let $p_i^j$ be error of $h_i^j$ on test set $V_i$
  6. $p_i = p_i^1 - p_i^2$
3. Error difference estimate $p = \frac{1}{K} \sum_{i=1}^{K} p_i$

K-Fold Cross Validation (cont'd)
Now estimate confidence that true expected error difference $< 0$
$\Rightarrow$ Confidence that $L_1$ is better than $L_2$ on learning task
Use one-sided test, with confidence derived from student's t distribution with $K - 1$ degrees of freedom
With approximately $c$% probability, true difference of expected error between $L_1$ and $L_2$ is at most
  $p + t_{c,K-1} \, s_p$
where
  $s_p \equiv \sqrt{\frac{1}{K(K-1)} \sum_{i=1}^{K} (p_i - p)^2}$

Student's t Distribution (One-Sided Test)
If $p + t_{c,K-1} \, s_p < 0$, our assertion that $L_1$ has less error than $L_2$ is supported with confidence $c$
So if K-fold CV used, compute $p$, look up $t_{c,K-1}$ and check if $p < -t_{c,K-1} \, s_p$
One-sided test; says nothing about $L_2$ over $L_1$

Caveat
Say you want to show that learning algorithm $L_1$ performs better than algorithms $L_2, L_3, L_4, L_5$
If you use K-fold CV to show superior performance of $L_1$ over each of $L_2, \ldots, L_5$ at 95% confidence, there's a 5% chance each one is wrong
$\Rightarrow$ There's up to a 20% chance that at least one is wrong (summing the four 5% chances)
$\Rightarrow$ Our overall confidence may be only 80%
Need to account for this, or use more appropriate test
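
The one-sided paired test on K-fold results can be sketched as follows. This is an illustration under stated assumptions: the per-fold error rates are invented, the function name `paired_one_sided_bound` is hypothetical, and the critical value 1.833 (one-sided 95%, 9 degrees of freedom) is taken from a standard t table rather than from the slides.

```python
import math


def paired_one_sided_bound(errors_1, errors_2, t_crit):
    """Upper bound p + t * s_p on the true error difference (L1 minus L2).

    errors_1[i], errors_2[i] are the test errors of the two learners on fold i;
    t_crit is the one-sided critical value t_{c,K-1} looked up in a t table.
    """
    k = len(errors_1)
    diffs = [e1 - e2 for e1, e2 in zip(errors_1, errors_2)]
    p_bar = sum(diffs) / k
    s_p = math.sqrt(sum((d - p_bar) ** 2 for d in diffs) / (k * (k - 1)))
    return p_bar, s_p, p_bar + t_crit * s_p


# Hypothetical per-fold error rates for K = 10 folds (not from the slides).
errs_l1 = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.10, 0.11, 0.09]
errs_l2 = [0.14, 0.15, 0.12, 0.13, 0.15, 0.16, 0.11, 0.12, 0.15, 0.13]

T_95_DF9 = 1.833  # one-sided 95% critical value, 9 degrees of freedom
p_bar, s_p, upper = paired_one_sided_bound(errs_l1, errs_l2, T_95_DF9)
print(f"p = {p_bar:.4f}, s_p = {s_p:.4f}, upper bound = {upper:.4f}")
print("L1 better than L2 at 95% confidence" if upper < 0 else "not supported")
```

A negative upper bound supports the claim that the first learner has lower error; a non-negative bound leaves the question open and says nothing in favor of the second learner.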

More Specific Performance Measures
So far, we've looked at a single error rate to compare hypotheses/learning algorithms/etc.
This may not tell the whole story:
  1000 test examples: 20 positive, 980 negative
  $h^1$ gets 2/20 pos correct, 965/980 neg correct, for accuracy of $(2 + 965)/(20 + 980) = 0.967$
  Pretty impressive, except that always predicting negative yields accuracy = 0.980
  Would we rather have $h^2$, which gets 19/20 pos correct and 930/980 neg, for accuracy = 0.949?
  Depends on how important the positives are, i.e., frequency in practice and/or cost (e.g., cancer diagnosis)

Confusion Matrices
Break down error into type: true positive, etc.

                         Predicted Class
  True Class   Positive              Negative              Total
  Positive     tp: true positive     fn: false negative    p
  Negative     fp: false positive    tn: true negative     n
  Total        p'                    n'                    N

Generalizes to multiple classes
Allows one to quickly assess which classes are missed the most, and into what other class

ROC Curves
Consider an ANN or SVM
Normally threshold at 0, but what if we changed it?
Keeping weight vector constant while changing threshold = holding hyperplane's slope fixed while moving along its normal vector
[Figure: hyperplane with offset b moved along its normal vector, from "pred all -" at one extreme to "pred all +" at the other]
I.e., get a set of classifiers, one per labeling of test set
Similar situation with any classifier with confidence value, e.g., probability-based

ROC Curves: Plotting tp versus fp
Consider the "always -" hyp. What is fp? What is tp?
What about the "always +" hyp?
In between the extremes, we plot TP versus FP by sorting the test examples by the confidence values
[Table: ten test examples $x_1, \ldots, x_{10}$, each with a confidence value and a label; $x_1, x_2, x_4, x_5, x_9$ are labeled $+$ and $x_3, x_6, x_7, x_8, x_{10}$ are labeled $-$]

ROC Curves: Plotting tp versus fp (cont'd)
[Figure: ROC plot of TP versus FP traced out by the sorted examples]

ROC Curves: Convex Hull
[Figure: ROC curves (TP versus FP) for ID3 and naive Bayes, with their convex hull]
The convex hull of the ROC curve yields a collection of classifiers, each optimal under different conditions
If FP cost = FN cost, then draw a line with slope $|N|/|P|$ at $(0, 1)$ and drag it towards convex hull until you touch it; that's your operating point
Can use as a classifier any part of the hull, since can randomly select between two classifiers
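
The sorting procedure for tracing an ROC curve can be sketched in a few lines. This is a minimal illustration, assuming distinct confidence values and plotting rates (tp and fp divided by the number of positives and negatives); the scores and labels are invented and `roc_points` is a hypothetical helper name.

```python
def roc_points(scored_examples):
    """Return the list of (fp_rate, tp_rate) points from sweeping the threshold.

    scored_examples: list of (confidence, label) pairs, label is '+' or '-'.
    Assumes distinct confidence values, so each example moves the curve one step.
    """
    num_pos = sum(1 for _, label in scored_examples if label == '+')
    num_neg = len(scored_examples) - num_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: "always -"
    # Lower the threshold past one example at a time, highest confidence first.
    for _, label in sorted(scored_examples, key=lambda ex: ex[0], reverse=True):
        if label == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / num_neg, tp / num_pos))
    return points  # last point is (1.0, 1.0): "always +"


# Hypothetical scores and labels, invented for illustration.
examples = [(0.9, '+'), (0.8, '+'), (0.7, '-'), (0.6, '+'), (0.4, '-'), (0.2, '-')]
for fp_rate, tp_rate in roc_points(examples):
    print(f"FP rate = {fp_rate:.2f}, TP rate = {tp_rate:.2f}")
```

Each positive example moves the curve up and each negative example moves it right, so a ranking that places all positives first hugs the top-left corner.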

ROC Curves: Convex Hull
[Figure: ROC curves (TP versus FP) for ID3 and naive Bayes]
Can also compare curves against single-point classifiers when no curves are available for them
In plot, ID3 better than our SVM iff negatives scarce; nb never better

ROC Curves: Miscellany
What is the worst possible ROC curve?
One metric for measuring a curve's goodness: area under curve (AUC):
  $\mathrm{AUC} = \frac{\sum_{x_+ \in P} \sum_{x_- \in N} I(h(x_+) > h(x_-))}{|P|\,|N|}$
i.e., rank all examples by confidence in "+" prediction, count the number of times a positively-labeled example (from $P$) is ranked above a negatively-labeled one (from $N$), then normalize
What is the best value?
Distribution approximately normal if $|P|$ and $|N|$ are sufficiently large, so can find confidence intervals
Catching on as a better scalar measure of performance than error rate
ROC analysis possible (though tricky) with multi-class problems

ROC Curves: Miscellany (cont'd)
Can use ROC curve to modify classifiers, e.g., re-label decision trees
What does "ROC" stand for?
  "Receiver Operating Characteristic" from signal detection theory, where binary signals are corrupted by noise
  Use plots to determine how to set threshold to determine presence of signal
  Threshold too high: miss true hits (tp low), too low: too many false alarms (fp high)
Alternative to ROC: cost curves

Precision-Recall Curves
Consider information retrieval task, e.g., web search
  precision = $tp/p'$ = fraction of retrieved that are positive
  recall = $tp/p$ = fraction of positives retrieved

Precision-Recall Curves (cont'd)
As with ROC, can vary threshold to trade off precision against recall
Can compare curves based on containment
Use $F_\beta$-measure to combine at a specific point, where $\beta$ weights precision vs recall:
  $F_\beta \equiv \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$
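
The AUC and $F_\beta$ definitions above translate directly into code. The following is a minimal sketch with invented scores and counts; `auc` and `precision_recall_f` are hypothetical helper names, and ties between a positive and a negative score are counted as losses, matching the strict inequality in the AUC formula.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    wins = sum(1 for sp in pos_scores for sn in neg_scores if sp > sn)
    return wins / (len(pos_scores) * len(neg_scores))


def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp)   # tp / p': fraction of retrieved that are positive
    recall = tp / (tp + fn)      # tp / p: fraction of positives retrieved
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical scores and counts, invented for illustration.
print(auc([0.9, 0.8, 0.6], [0.7, 0.4, 0.2]))
print(precision_recall_f(tp=8, fp=2, fn=8))
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for small test sets; a rank-based computation would be the usual choice for large ones.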