Empirical Risk Minimization, Model Selection, and Model Assessment


CS6780 Advanced Machine Learning, Spring 2015
Thorsten Joachims, Cornell University

Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1; Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924. (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf)

Learning as Prediction: Batch Learning Model

[Diagram: a real-world process P(X,Y) generates, drawn i.i.d., a train sample S_train = (x_1,y_1), ..., (x_n,y_n) and a test sample S_test = (x_{n+1},y_{n+1}), ...; the learner is trained on S_train and outputs h.]

Definition: A particular instance of a learning problem is described by a probability distribution P(X,Y).

Definition: Each example (X_i, Y_i) is a random variable that is independently and identically distributed according to P(X,Y).

Training / Validation / Test Sample

Definition: A training / validation / test sample S = ((x_1, y_1), ..., (x_n, y_n)) is drawn i.i.d. from P(X,Y):

P(S) = P\big((x_1, y_1), \ldots, (x_n, y_n)\big) = \prod_{i=1}^{n} P(X_i = x_i, Y_i = y_i)

Risk

Definition: The risk / prediction error / true error / generalization error of a hypothesis h for a learning task P(X,Y) is

Err_P(h) = \sum_{(x,y)} \Delta\big(y, h(x)\big) \, P(X = x, Y = y)

Definition: The loss function \Delta(y, \hat{y}) \in \mathbb{R} measures the quality of the prediction \hat{y} if the true label is y.

Empirical Risk

Definition: The empirical risk / error of hypothesis h on sample S = ((x_1, y_1), ..., (x_n, y_n)) is

Err_S(h) = \frac{1}{n} \sum_{i=1}^{n} \Delta\big(y_i, h(x_i)\big)
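As a concrete illustration, here is a minimal Python sketch (the hypothesis and the data are hypothetical stand-ins, not from the slides) that computes the empirical risk under 0/1 loss:

# Minimal sketch: empirical risk under 0/1 loss.

def zero_one_loss(y, y_pred):
    """Delta(y, y_hat): 1 if the prediction is wrong, 0 otherwise."""
    return 0 if y == y_pred else 1

def empirical_risk(h, sample, loss=zero_one_loss):
    """Err_S(h) = (1/n) * sum_i Delta(y_i, h(x_i))."""
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)

# Hypothetical example: a threshold classifier on 1-d inputs.
h = lambda x: 1 if x > 0.5 else 0
S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]  # (x_i, y_i) pairs
print(empirical_risk(h, S))  # 0.25: one of four examples misclassified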

Learning as Prediction: Overview

[Diagram: as before, the train sample S_train = (x_1,y_1), ..., (x_n,y_n) and the test sample S_test = (x_{n+1},y_{n+1}), ... are drawn i.i.d. from the real-world process P(X,Y); the learner is trained on S_train and outputs h.]

Goal: Find h with small prediction error Err_P(h) with respect to P(X,Y).
Training error: the error Err_{S_train}(h) on the training sample.
Test error: the error Err_{S_test}(h) on the test sample is an estimate of Err_P(h).

Bayes Risk

Given knowledge of P(X,Y), the true error of the best possible hypothesis h_bayes is, for the 0/1 loss,

Err_P(h_{bayes}) = E_{x \sim P(X)}\Big[\min_{y \in Y} \big(1 - P(Y = y \mid X = x)\big)\Big]

For example, if P(Y = 1 | X = x) = 0.7 for every x, then h_bayes always predicts 1 and the Bayes risk is 0.3.

Three Roadmaps for Designing ML Methods

Generative model: Learn P(X,Y) from the training sample.
Discriminative conditional model: Learn P(Y|X) from the training sample.
Discriminative ERM model: Learn h directly from the training sample.

Empirical Risk Minimization

Definition [ERM principle]: Given a training sample S = ((x_1, y_1), ..., (x_n, y_n)) and a hypothesis space H, select the rule h_ERM ∈ H that minimizes the empirical risk (i.e., the training error) on S:

h_{ERM} = \operatorname{argmin}_{h \in H} \sum_{i=1}^{n} \Delta\big(y_i, h(x_i)\big)
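To make the ERM principle concrete, here is a hedged Python sketch that selects, from a small finite hypothesis space of 1-d threshold classifiers (a hypothetical choice for illustration), the rule with minimum training error:

# Minimal ERM sketch over a finite hypothesis space.

def zero_one_loss(y, y_pred):
    return int(y != y_pred)

def erm(hypotheses, sample):
    """Return the h in H minimizing sum_i Delta(y_i, h(x_i))."""
    return min(hypotheses,
               key=lambda h: sum(zero_one_loss(y, h(x)) for x, y in sample))

# Hypothesis space H: thresholds at 0.1, 0.2, ..., 0.9 (illustrative).
H = [lambda x, t=t / 10: int(x > t) for t in range(1, 10)]
S = [(0.15, 0), (0.35, 0), (0.55, 1), (0.85, 1)]
h_erm = erm(H, S)
print([h_erm(x) for x, _ in S])  # [0, 0, 1, 1]: fits this toy sample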

MODEL SELECTION

Overfitting

[Figure from Mitchell: accuracy on the training data versus accuracy on the test data as the tree grows.] Note: Accuracy = 1.0 − Error.

Decision Tree Example: Revisited

[Decision tree diagram: splits on attributes such as complete (correct / partial / guessing) and presentation (clear / unclear), with yes/no class labels at the leaves.]

Controlling Overfitting in Decision Trees

Early stopping: Stop growing the tree and introduce a leaf when splitting is no longer reliable.
- Restrict the size of the tree (e.g., number of nodes, depth)
- Require a minimum number of examples in a node
- Put a threshold on the splitting criterion

Post-pruning: Grow the full tree, then simplify it (see the sketch after this list).
- Reduced-error tree pruning
- Rule post-pruning
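As a concrete illustration, the sketch below shows how these controls map onto scikit-learn's DecisionTreeClassifier (the library and dataset are assumptions of this example, not prescribed by the slides). max_depth and min_samples_leaf implement early-stopping-style restrictions; ccp_alpha enables cost-complexity post-pruning, which is scikit-learn's post-pruning method and not the same as the reduced-error pruning named above:

# Hedged sketch: overfitting controls for decision trees in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Early-stopping-style restrictions: cap depth and leaf size.
early_stopped = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                       random_state=0).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune by cost-complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                     random_state=0).fit(X_train, y_train)

for name, tree in [("early stopped", early_stopped), ("post-pruned", post_pruned)]:
    print(name, tree.tree_.node_count, tree.score(X_test, y_test))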

Model Selection via Validation Sample

[Diagram: a sample drawn i.i.d. from the real-world process is split randomly into S_train and S_test; S_train is further split randomly into a smaller training sample and a validation sample S_val. Learners 1, ..., m are run on the training part, producing ĥ_1, ..., ĥ_m.]

Training: Run the learning algorithm m times (e.g., with different parameters).
Validation error: Each error Err_{S_val}(ĥ_i) is an estimate of Err_P(ĥ_i).
Selection: Use the ĥ_i with minimum Err_{S_val}(ĥ_i) for prediction on the test examples. (A minimal sketch of this procedure follows.)
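Here is a minimal Python sketch of validation-based selection, again assuming scikit-learn and an illustrative grid of decision-tree depths standing in for the m "different parameters" of the slide:

# Hedged sketch: select among m models using a held-out validation sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, random_state=0)
# Split off the test sample first; it is used only after selection.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the remainder into training and validation samples.
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# "Run the learning algorithm m times (e.g. different parameters)."
candidates = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
              for d in (1, 2, 4, 8, 16)]

# Selection: pick the hypothesis with minimum validation error.
best = min(candidates, key=lambda h: 1 - h.score(X_val, y_val))
print("chosen depth:", best.get_depth(), "test error:", 1 - best.score(X_test, y_test))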

Reduced-Error Pruning

Text Classification Example: Corporate Acquisitions Results

Unpruned tree (ID3 algorithm): size 437 nodes, training error 0.0%, test error 11.0%
Early stopping tree (ID3 algorithm): size 299 nodes, training error 2.6%, test error 9.8%
Reduced-error pruning (C4.5 algorithm): size 167 nodes, training error 4.0%, test error 10.8%
Rule post-pruning (C4.5 algorithm): size 164 tests, training error 3.1%, test error 10.3%

Examples of rules:
IF vs = 1 THEN − [99.4%]
IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]

MODEL ASSESSMENT

Evaluating Learned Hypotheses

[Diagram: a sample S = (x_1,y_1), ..., (x_n,y_n) drawn i.i.d. from the real-world process is split randomly into a training sample S_train and a test sample S_test = (x_1,y_1), ..., (x_k,y_k); the learner (including model selection) is trained on S_train and outputs ĥ.]

Goal: Find h with small prediction error Err_P(h) over P(X,Y).
Question: How good is Err_P(ĥ) for the ĥ found on the training sample S_train?
Training error: the error Err_{S_train}(ĥ) on the training sample.
Test error: the error Err_{S_test}(ĥ) is an estimate of Err_P(ĥ).

What is the True Error of a Hypothesis?

Given:
- A sample of labeled instances S
- A learning algorithm A

Setup:
- Partition S randomly into S_train and S_test.
- Train learning algorithm A on S_train; the result is ĥ.
- Apply ĥ to S_test and compare the predictions against the true labels.

Test: The error on the test sample, Err_{S_test}(ĥ), is an estimate of the true error Err_P(ĥ). Compute a confidence interval.

Binomial Distribution

The probability of observing x heads (i.e., errors) in a sample of n independent coin tosses (i.e., examples), where in each toss the probability of heads (i.e., of making an error) is p, is

P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}

Normal approximation: For np(1−p) ≥ 5, the binomial can be approximated by the normal distribution with expected value E(X) = np and variance Var(X) = np(1−p). With probability α, the observation x falls in the interval

np \pm z_{\alpha} \sqrt{np(1-p)}

α:    50%   68%   80%   90%   95%   98%   99%
z_α:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
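As an illustration, here is a small Python sketch that applies this normal approximation to the error rate rather than the raw count, turning a test-set error count into an approximate confidence interval for the true error (the sample numbers are hypothetical):

# Hedged sketch: normal-approximation confidence interval for the
# true error Err_P(h), given x errors on n test examples.
import math

def error_confidence_interval(x, n, z=1.96):  # z = 1.96 for ~95% confidence
    p_hat = x / n                      # observed test error Err_{S_test}(h)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical result: 30 errors on 300 test examples.
lo, hi = error_confidence_interval(30, 300)
print(f"Err_P(h) in [{lo:.3f}, {hi:.3f}] with ~95% confidence")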

Is Rule ĥ_1 More Accurate than ĥ_2?

Given:
- A sample of labeled instances S
- Learning algorithms A_1 and A_2

Setup:
- Partition S randomly into S_train and S_test.
- Train learning algorithms A_1 and A_2 on S_train; the results are ĥ_1 and ĥ_2.
- Apply ĥ_1 and ĥ_2 to S_test and compute Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2).

Test: Decide whether Err_P(ĥ_1) ≠ Err_P(ĥ_2).
Null hypothesis: Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2) come from binomial distributions with the same p.
Test: binomial sign test (McNemar's test).
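Below is a hedged sketch of the binomial sign test (McNemar's test) on paired test-set predictions; scipy is assumed to be available, and the labels and predictions are illustrative placeholders:

# Hedged sketch: binomial sign test (McNemar's test) comparing two
# hypotheses on the same test sample.
from scipy.stats import binom

def mcnemar_sign_test(y_true, pred1, pred2):
    """Two-sided exact sign test on the examples where h1 and h2 disagree."""
    # n01: h1 wrong, h2 right; n10: h1 right, h2 wrong.
    n01 = sum(p1 != y and p2 == y for y, p1, p2 in zip(y_true, pred1, pred2))
    n10 = sum(p1 == y and p2 != y for y, p1, p2 in zip(y_true, pred1, pred2))
    n = n01 + n10
    if n == 0:
        return 1.0  # the hypotheses never disagree
    # Under H0, each disagreement favors either side with probability 1/2.
    p_value = 2 * binom.cdf(min(n01, n10), n, 0.5)
    return min(p_value, 1.0)

y  = [0, 1, 1, 0, 1, 0, 1, 1]
h1 = [0, 1, 0, 0, 1, 1, 1, 1]
h2 = [0, 1, 1, 0, 0, 0, 1, 0]
print(mcnemar_sign_test(y, h1, h2))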

Is Learning Algorithm A_1 Better than A_2?

Given:
- k samples S_1, ..., S_k of labeled instances, all drawn i.i.d. from P(X,Y)
- Learning algorithms A_1 and A_2

Setup: For i from 1 to k:
- Partition S_i randomly into S_train and S_test.
- Train learning algorithms A_1 and A_2 on S_train; the results are ĥ_1 and ĥ_2.
- Apply ĥ_1 and ĥ_2 to S_test and compute Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2).

Test: Decide whether E_S(Err_P(A_1(S_train))) ≠ E_S(Err_P(A_2(S_train))).
Null hypothesis: Err_{S_test}(A_1(S_train)) and Err_{S_test}(A_2(S_train)) come from the same distribution over samples S.
Test: paired t-test or Wilcoxon signed-rank test.
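The following hedged sketch runs both tests on the k paired error estimates, assuming scipy; the error values below are illustrative placeholders, not results from the slides:

# Hedged sketch: paired t-test and Wilcoxon signed-rank test on
# k paired test-error estimates.
from scipy.stats import ttest_rel, wilcoxon

# Err_{S_test} for A1 and A2 on k = 8 independent samples (hypothetical).
errors_a1 = [0.110, 0.098, 0.105, 0.120, 0.101, 0.097, 0.115, 0.108]
errors_a2 = [0.118, 0.109, 0.104, 0.131, 0.112, 0.105, 0.119, 0.116]

t_stat, t_p = ttest_rel(errors_a1, errors_a2)
w_stat, w_p = wilcoxon(errors_a1, errors_a2)
print(f"paired t-test: t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Wilcoxon test: W = {w_stat:.3f}, p = {w_p:.3f}")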

Approximation via K-fold Cross-Validation

Given:
- A sample of labeled instances S
- Learning algorithms A_1 and A_2

Compute: Randomly partition S into k equally sized subsets S_1, ..., S_k. For i from 1 to k:
- Train A_1 and A_2 on S_1, ..., S_{i-1}, S_{i+1}, ..., S_k and get ĥ_1 and ĥ_2.
- Apply ĥ_1 and ĥ_2 to S_i and compute Err_{S_i}(ĥ_1) and Err_{S_i}(ĥ_2).

Estimate:
- The average Err_{S_i}(ĥ_1) is an estimate of E_S(Err_P(A_1(S_train))).
- The average Err_{S_i}(ĥ_2) is an estimate of E_S(Err_P(A_2(S_train))).
- Count how often Err_{S_i}(ĥ_1) > Err_{S_i}(ĥ_2) and how often Err_{S_i}(ĥ_1) < Err_{S_i}(ĥ_2).
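To close, here is a hedged scikit-learn sketch of this k-fold comparison; the dataset and the two algorithms are illustrative stand-ins for A_1 and A_2:

# Hedged sketch: k-fold cross-validation comparison of two learners.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errs1, errs2 = [], []
for train_idx, test_idx in kf.split(X):
    a1 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    a2 = GaussianNB().fit(X[train_idx], y[train_idx])
    errs1.append(1 - a1.score(X[test_idx], y[test_idx]))
    errs2.append(1 - a2.score(X[test_idx], y[test_idx]))

avg1, avg2 = sum(errs1) / len(errs1), sum(errs2) / len(errs2)
wins1 = sum(e1 < e2 for e1, e2 in zip(errs1, errs2))
wins2 = sum(e1 > e2 for e1, e2 in zip(errs1, errs2))
print(f"avg error A1 = {avg1:.3f}, A2 = {avg2:.3f}")
print(f"A1 better on {wins1} folds, A2 better on {wins2} folds")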