CSCE 478/878 Lecture 4: Experimental Design and Analysis. Stephen Scott.


Introduction

In the homework, you are (supposedly):
1. Choosing a data set
2. Extracting a test set of size N > 30
3. Building a tree on the training set
4. Testing on the test set
5. Reporting the accuracy
(Adapted from Ethem Alpaydin and Tom Mitchell.)

Does the reported accuracy exactly match the generalization performance of the tree? If a tree has one error rate and an ANN a slightly higher one, is the tree absolutely better? Why or why not? How about the algorithms in general?

sscott@cse.unl.edu

Outline
- Setting of performance evaluation
- Error and confidence intervals
- Paired t tests and cross-validation to compare learning algorithms
- Performance measures: confusion matrices, ROC analysis, precision-recall curves

Setting

Before setting up an experiment, we need to understand exactly what the goal is:
- Estimating the generalization performance of a hypothesis
- Tuning a learning algorithm's parameters
- Comparing two learning algorithms on a specific task
- Etc.

We will never be able to answer such a question with 100% certainty, due to variance in training set selection, test set selection, etc. Instead, we will choose an estimator for the quantity in question, determine the probability distribution of the estimator, and bound the probability that the estimator is way off. The estimator needs to work regardless of the distribution of the training/testing data.

Setting (cont'd)

Note that, in addition to statistical variation, what we determine is limited to the application we are studying. E.g., if naive Bayes is better than ID3 on spam filtering, that means nothing about face recognition.

In planning experiments, we need to ensure that training data is not used for evaluation. I.e., don't test on the training set! Doing so will bias the performance estimator. The same holds for a validation set used to prune a decision tree, tune parameters, etc.
A validation set serves as part of the training set, but is not used for model building.

Types of Error

For now, focus on straightforward 0/1 classification error. For a hypothesis h, recall the two types of classification error from Chapter 2:

Empirical error (or sample error) is the fraction of the set V that h gets wrong:

    error_V(h) ≡ (1/|V|) Σ_{x ∈ V} δ(C(x) ≠ h(x)),

where δ(C(x) ≠ h(x)) is 1 if C(x) ≠ h(x), and 0 otherwise.

Generalization error (or true error) is the probability that a new, randomly selected instance is misclassified by h:

    error_D(h) ≡ Pr_{x ~ D}[C(x) ≠ h(x)],

where D is the probability distribution that instances are drawn from.

Why do we care about error_V(h)?
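The empirical-error formula above reduces to a few lines of Python. This is a minimal sketch; the hypothesis `h` and labeled sample `V` below are illustrative toys, not part of any library.

```python
# error_V(h): the fraction of labeled examples (x, C(x)) in V
# that hypothesis h misclassifies.

def empirical_error(h, labeled_sample):
    mistakes = sum(1 for x, c in labeled_sample if h(x) != c)
    return mistakes / len(labeled_sample)

# Toy hypothesis: predict + iff the instance value is non-negative.
h = lambda x: '+' if x >= 0 else '-'
V = [(2, '+'), (-1, '-'), (3, '-'), (-4, '-'), (0, '+')]
print(empirical_error(h, V))  # 1 of 5 wrong -> 0.2
```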

True Error (cont'd)

Bias: If T is the training set, error_T(h) is optimistically biased:

    bias ≡ E[error_T(h)] − error_D(h).

For an unbiased estimate (bias = 0), h and V must be chosen independently, so don't test on the training set! (By the way, this is distinct from inductive bias.)

Variance: Even with an unbiased V, error_V(h) may still vary from error_D(h).

Experiment:
1. Choose sample V of size n according to distribution D.
2. Measure error_V(h).

error_V(h) is a random variable (i.e., the result of an experiment) and an unbiased estimator for error_D(h). Given an observed error_V(h), what can we conclude about error_D(h)?

Confidence Intervals

If V contains n ≥ 30 examples, drawn independently of h and each other, then with approximately 95% probability, error_D(h) lies in

    error_V(h) ± 1.96 sqrt(error_V(h)(1 − error_V(h)) / n).

E.g., if hypothesis h misclassifies 12 of the 40 examples in test set V, then error_V(h) = 12/40 = 0.30, and with approx. 95% confidence, error_D(h) ∈ [0.158, 0.442].

More generally, with approximately c% probability, error_D(h) lies in

    error_V(h) ± z_c sqrt(error_V(h)(1 − error_V(h)) / n)

    c%:  50%   68%   80%   90%   95%   98%   99%
    z_c: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Why?

error_V(h) is a Random Variable

Repeatedly run the experiment, each time with a different randomly drawn V (each of size n). The probability of observing r misclassified examples follows a binomial distribution. [Figure: binomial distribution for n = 40, p = 0.3.] I.e., if error_D(h) is the probability of heads for a biased coin, then P(r) is the probability of getting r heads out of n flips.

Binomial Probability Distribution

    P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^(n − r)

is the probability of r heads in n coin flips, if p = Pr(heads). The expected, or mean, value of X (= # heads on n flips = # mistakes on n test examples) is

    E[X] ≡ Σ_{i=0}^{n} i P(i) = np = n · error_D(h).

The variance of X is

    Var(X) ≡ E[(X − E[X])²] = np(1 − p),

and the standard deviation of X is

    σ_X ≡ sqrt(E[(X − E[X])²]) = sqrt(np(1 − p)).
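The 12-of-40 example above can be checked numerically. This sketch just applies the confidence-interval formula from the slide; the function name is mine, not from a library.

```python
from math import sqrt

def error_confidence_interval(err, n, z=1.96):
    # Two-sided interval for error_D(h) given observed error_V(h), n >= 30.
    half = z * sqrt(err * (1 - err) / n)
    return err - half, err + half

# 12 of 40 test examples misclassified:
lo, hi = error_confidence_interval(12 / 40, 40)
print(round(lo, 3), round(hi, 3))  # 0.158 0.442
```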

Approximating the Binomial Distribution with a Normal

error_V(h) = r/n is binomially distributed, with mean

    µ_{error_V(h)} = error_D(h)   (i.e., an unbiased estimator)

and standard deviation

    σ_{error_V(h)} = sqrt(error_D(h)(1 − error_D(h)) / n)

(increasing n decreases the variance). We want to compute a confidence interval: an interval centered at error_D(h) containing c% of the weight under the distribution. Approximate the binomial by a normal (Gaussian) distribution with mean µ_{error_V(h)} = error_D(h) and standard deviation

    σ_{error_V(h)} ≈ sqrt(error_V(h)(1 − error_V(h)) / n).

Normal Probability Distribution

[Figure: normal distribution with mean 0, standard deviation 1.]

    p(x) = (1 / sqrt(2πσ²)) exp(−(1/2)((x − µ)/σ)²)

The probability that X falls into the interval (a, b) is ∫_a^b p(x) dx. The expected, or mean, value of X is E[X] = µ; the variance is Var(X) = σ², and the standard deviation is σ_X = σ.

Normal Probability Distribution (cont'd)

80% of the area (probability) lies in µ ± 1.28σ; in general, c% of the area lies in µ ± z_c σ:

    c%:  50%   68%   80%   90%   95%   98%   99%
    z_c: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Can also have one-sided bounds: c% of the area lies below µ + z′_c σ (or above µ − z′_c σ), where z′_c = z_{2c−100}:

    c%:   50%   68%   80%   90%   95%   98%   99%
    z′_c: 0.00  0.47  0.84  1.28  1.64  2.05  2.33

Confidence Intervals Revisited

If V contains n ≥ 30 examples, drawn independently of h and each other, then with approximately 95% probability, error_V(h) lies in

    error_D(h) ± 1.96 sqrt(error_D(h)(1 − error_D(h)) / n).

Equivalently, error_D(h) lies in

    error_V(h) ± 1.96 sqrt(error_D(h)(1 − error_D(h)) / n),

which is approximately

    error_V(h) ± 1.96 sqrt(error_V(h)(1 − error_V(h)) / n).

(One-sided bounds yield upper or lower error bounds.)

Central Limit Theorem

How can we justify the approximation? Consider a set of iid random variables Y_1, ..., Y_n, all drawn from an arbitrary probability distribution with mean µ and finite variance σ².
Define the sample mean

    Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i.

Ȳ is itself a random variable, i.e., the result of an experiment (e.g., error_V(h) = r/n).

Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a normal distribution with mean µ and variance σ²/n.

Thus the distribution of error_V(h) is approximately normal for large n, and its expected value is error_D(h). (Rule of thumb: n ≥ 30 suffices when the estimator's distribution is binomial; n might need to be larger for other distributions.)
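A quick simulation illustrates the claim. This is a sketch, not part of the lecture; the sample size, trial count, and seed are arbitrary choices of mine.

```python
import random
from math import sqrt

# Draw many size-n samples of Bernoulli(p) trials -- i.e., many values of
# error_V(h) when error_D(h) = p -- and check that the sample means
# cluster around p with standard deviation near sqrt(p(1-p)/n).
random.seed(0)
p, n, trials = 0.3, 40, 5000
means = [sum(random.random() < p for _ in range(n)) / n
         for _ in range(trials)]
avg = sum(means) / trials
sd = sqrt(sum((m - avg) ** 2 for m in means) / trials)
print(round(avg, 2), round(sd, 2))  # close to 0.3 and sqrt(.3*.7/40) ~ 0.07
```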

Calculating Confidence Intervals

1. Pick the parameter to estimate: error_D(h).
2. Choose an estimator: error_V(h).
3. Determine the probability distribution that governs the estimator: error_V(h) is governed by a binomial distribution, approximated by a normal when n ≥ 30.
4. Find the interval (L, U) such that c% of the probability mass falls in the interval. Could have L = −∞ or U = ∞. Use a table of z_c or z′_c values (if the distribution is normal).

Comparing Learning Algorithms

What if we want to compare two learning algorithms L1 and L2 (e.g., ID3 vs. k-nearest neighbor) on a specific application? It is insufficient to simply compare error rates on a single test set. Instead, use K-fold cross validation and a paired t test.

K-Fold Cross Validation

1. Partition data set X into K equal-sized subsets X_1, X_2, ..., X_K, where |X_i| ≥ 30.
2. For i from 1 to K (use X_i for testing, and the rest for training):
   1. V_i = X_i
   2. T_i = X \ X_i
   3. Train learning algorithm L1 on T_i to get h_i^1
   4. Train learning algorithm L2 on T_i to get h_i^2
   5. Let p_i^j be the error of h_i^j on test set V_i
   6. p_i = p_i^1 − p_i^2
3. The error-difference estimate is p̄ = (1/K) Σ_{i=1}^{K} p_i.

K-Fold Cross Validation (cont'd)

Now estimate the confidence that the true expected error difference is < 0, i.e., the confidence that L1 is better than L2 on the learning task. Use a one-sided test, with confidence derived from Student's t distribution with K − 1 degrees of freedom. With approximately c% probability, the true difference of expected error between L1 and L2 is at most

    p̄ + t_{c,K−1} s_p̄,

where

    s_p̄ ≡ sqrt( (1 / (K(K − 1))) Σ_{i=1}^{K} (p_i − p̄)² ).

Comparing Learning Algorithms (One-Sided Test)

If p̄ + t_{c,K−1} s_p̄ < 0, our assertion that L1 has less error than L2 is supported with confidence c. So if K-fold CV is used, compute p̄, look up t_{c,K−1}, and check whether p̄ < −t_{c,K−1} s_p̄. This is a one-sided test; it says nothing about L2 over L1.

Caveat

Say you want to show that learning algorithm L1 performs better than algorithms L2, L3, L4, and L5. If you use K-fold CV to show superior performance of L1 over each of L2, ..., L5 at 95% confidence, there's a 5% chance each comparison is wrong. So there's up to a 20% chance that at least one is wrong, and our overall confidence is only 80%. We need to account for this, or
use a more appropriate test.
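The one-sided paired test above reduces to a few lines. In this sketch the per-fold error lists are invented for illustration, and 1.833 is the one-sided 95% Student's t value for K − 1 = 9 degrees of freedom (K = 10).

```python
from math import sqrt

def paired_t_terms(errors1, errors2):
    # Returns (p_bar, s_p) for per-fold errors p_i^1, p_i^2 on the same folds.
    K = len(errors1)
    p = [a - b for a, b in zip(errors1, errors2)]
    p_bar = sum(p) / K
    s_p = sqrt(sum((pi - p_bar) ** 2 for pi in p) / (K * (K - 1)))
    return p_bar, s_p

# Hypothetical K = 10 fold errors for algorithms L1 and L2.
errs1 = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.12, 0.11, 0.10]
errs2 = [0.14, 0.13, 0.12, 0.15, 0.13, 0.14, 0.12, 0.16, 0.14, 0.13]
p_bar, s_p = paired_t_terms(errs1, errs2)
t_95_9 = 1.833  # t_{c,K-1} for c = 95%, K = 10 (one-sided)
print(p_bar + t_95_9 * s_p < 0)  # True: L1 supported at ~95% confidence
```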

More Specific Performance Measures

So far, we've looked at a single error rate to compare hypotheses, learning algorithms, etc. This may not tell the whole story. Consider 1000 test examples: 20 positive, 980 negative. Hypothesis h1 gets 2/20 positives correct and 965/980 negatives correct, for an accuracy of (2 + 965)/(20 + 980) = 0.967. Pretty impressive, except that always predicting negative yields accuracy 0.98. Would we rather have h2, which gets 19/20 positives correct and 930/980 negatives, for accuracy 0.949? It depends on how important the positives are, i.e., their frequency in practice and/or cost (e.g., cancer diagnosis).

Confusion Matrices

Break down error by type: true positive, etc.

                          Predicted Class
    True Class   Positive               Negative               Total
    Positive     tp: true positive      fn: false negative     p
    Negative     fp: false positive     tn: true negative      n
    Total        p′                     n′

This generalizes to multiple classes, and allows one to quickly assess which classes are missed the most, and into what other class.

ROC Curves

Consider an ANN or SVM. Normally we threshold at 0, but what if we changed the threshold? Keeping the weight vector constant while changing the threshold b = holding the hyperplane's slope fixed while moving along its normal vector. At one extreme of b we predict all −; at the other, all +. I.e., we get a set of classifiers, one per labeling of the test set. The situation is similar for any classifier with a confidence value, e.g., a probability-based one.

ROC Curves: Plotting tp versus fp

Consider the "always −" hypothesis: what is fp? What is tp? What about the "always +" hypothesis?
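The 1000-example scenario above can be laid out as confusion-matrix counts; the fp/fn splits below follow from the totals given on the slide. A minimal sketch:

```python
# Confusion-matrix counts for 1000 test examples (20 positive, 980 negative).

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

h1 = dict(tp=2, fn=18, fp=15, tn=965)
h2 = dict(tp=19, fn=1, fp=50, tn=930)
always_negative = dict(tp=0, fn=20, fp=0, tn=980)

print(accuracy(**h1))               # 0.967
print(accuracy(**h2))               # 0.949
print(accuracy(**always_negative))  # 0.98
```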
In between the extremes, we plot TP versus FP by sorting the test examples by their confidence values.

[Table: ten test examples x1, ..., x10 with their classifier confidence values; x1, x2, x4, x5, and x9 are labeled +, the rest −.]

ROC Curves: Plotting tp versus fp (cont'd)

[Figure: ROC curve stepping through the sorted examples in (FP, TP) space.]

ROC Curves: Convex Hull

[Figure: ROC curves for ID3 and naive Bayes in (FP, TP) space, with convex hull.]

The convex hull of the ROC curve yields a collection of classifiers, each optimal under different conditions. If FP cost = FN cost, draw a line with slope |N|/|P| at (0, 1) and drag it towards the convex hull until you touch it; that's your operating point. Any part of the hull can be used as a classifier, since one can randomly select between two classifiers.
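The threshold sweep described above can be sketched as follows; the (confidence, label) pairs are made up for illustration, not the slide's exact table.

```python
def roc_points(scored):
    # scored: (confidence, label) pairs; sweep the threshold downward,
    # emitting one (FP rate, TP rate) point per example passed.
    scored = sorted(scored, key=lambda s: -s[0])
    P = sum(1 for _, y in scored if y == '+')
    N = len(scored) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in scored:
        if y == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

scored = [(69.8, '+'), (29.2, '-'), (19.6, '+'), (9.6, '+'),
          (2.8, '+'), (-2.6, '-'), (-9.2, '-'), (-28.2, '-')]
pts = roc_points(scored)
print(pts[0], pts[1], pts[-1])  # (0.0, 0.0) (0.0, 0.25) (1.0, 1.0)
```

The first point is the "always −" classifier and the last is "always +", matching the two extremes discussed above.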

ROC Curves: Miscellany

[Figure: ROC curve for an SVM, with single-point classifiers ID3 and naive Bayes.]

We can also compare curves against single-point classifiers when no curves are available. In the plot, ID3 is better than our SVM iff negatives are scarce; naive Bayes is never better.

What is the worst possible ROC curve? One metric for measuring a curve's goodness is the area under the curve (AUC):

    AUC = (1 / (|P||N|)) Σ_{x+ ∈ P} Σ_{x− ∈ N} I(h(x+) > h(x−)),

i.e., rank all examples by confidence in the + prediction, count the number of times a positively-labeled example (from P) is ranked above a negatively-labeled one (from N), then normalize. What is the best value? The distribution is approximately normal if |P| and |N| are large enough, so we can find confidence intervals. AUC is catching on as a better scalar measure of performance than error rate. ROC analysis is possible (though tricky) with multi-class problems.

ROC Curves: Miscellany (cont'd)

We can also use ROC curves to modify classifiers, e.g., re-label decision trees. What does ROC stand for? Receiver Operating Characteristic, from signal detection theory, where binary signals are corrupted by noise: the plots are used to determine how to set the threshold for detecting the presence of a signal. Threshold too high: miss true hits (tp low); too low: too many false alarms (fp high). An alternative to ROC: cost curves.

Precision-Recall Curves

Consider an information retrieval task, e.g., web search:

    precision = tp / p′ = fraction of retrieved items that are positive
    recall = tp / p = fraction of positives that are retrieved

Precision-Recall Curves (cont'd)

As with ROC, we can vary the threshold to trade off precision against recall, and compare curves based on containment. Use the F-measure to combine the two at a specific point, where β weights precision vs. recall:

    F_β ≡ ((1 + β²) · precision · recall) / (β² · precision + recall)
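Both the AUC sum and the F-measure above reduce to a few lines; the scores and precision/recall values below are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs ranked correctly by confidence.
    wins = sum(1 for sp in pos_scores for sn in neg_scores if sp > sn)
    return wins / (len(pos_scores) * len(neg_scores))

def f_measure(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(auc([0.9, 0.7, 0.4], [0.5, 0.2]))  # 5 of 6 pairs correct
print(round(f_measure(0.5, 0.8), 3))     # F_1 (harmonic mean) -> 0.615
```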