ORIE 4741: Learning with Big Messy Data. Generalization

ORIE 4741: Learning with Big Messy Data. Generalization
Professor Udell, Operations Research and Information Engineering, Cornell
September 23, 2017

Announcements

- midterm 10/5
- makeup exam 10/2, by arrangement with instructors (by email)
- homework due next Tuesday
- substitute next week: Sumanta Basu on decision trees!
- I won't hold OH next week

Generalization and Overfitting

The goal of a model is not to predict well on D; the goal is to predict well on new data. Depending on its training set error and test set error, we say the model:

                             low test set error    high test set error
    low training set error   generalizes           overfits
    high training set error  ?!?!                  underfits

Simplest case: generalizing from a mean

Exit polling: sample n voters leaving polling places.¹ For each voter i, define the Boolean random variable

    z_i = 1 if voter i voted for Clinton, 0 otherwise

- sample mean: ν = (1/n) Σ_{i=1}^n z_i
- true mean: µ = E_{i ~ US electorate} z_i, which is Clinton's expected vote share

What does the sample mean ν tell us about the true mean µ?

¹ And suppose no one votes by mail or early or..., so these are iid samples from the US electorate.

Hoeffding inequality

Theorem (Hoeffding inequality). Let z_i ∈ {0, 1}, i = 1, ..., n, be independent Boolean random variables with mean E z_i = µ, and define the sample mean ν = (1/n) Σ_{i=1}^n z_i. Then for any ε > 0,

    P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²n).

This is an example of a concentration inequality:

- µ can't be much higher than ν
- µ can't be much lower than ν
- more samples n improve the estimate exponentially quickly
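
To see what the bound looks like numerically, here is a minimal Python simulation (the values of µ, n, and ε are illustrative assumptions, not from the lecture): it draws many data sets, measures how often the sample mean misses the true mean by more than ε, and compares that frequency to 2 exp(−2ε²n).

    import numpy as np

    rng = np.random.default_rng(0)
    mu, n, eps, trials = 0.48, 500, 0.05, 10000

    # draw `trials` independent data sets of n Boolean samples each
    z = rng.random((trials, n)) < mu            # P(z = 1) = mu
    nu = z.mean(axis=1)                         # sample mean of each data set

    empirical = np.mean(np.abs(nu - mu) > eps)  # how often the estimate misses
    bound = 2 * np.exp(-2 * eps**2 * n)         # Hoeffding's guarantee

    print(f"empirical P[|nu - mu| > eps] = {empirical:.4f}")
    print(f"Hoeffding bound              = {bound:.4f}")

The empirical frequency comes out well below the bound: Hoeffding holds for any distribution on {0, 1}, so for any particular µ it is typically loose.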

Back to the learning problem

Fix a hypothesis h : X → Y. Take

    z_i = 1 if y_i = h(x_i), 0 otherwise; that is, z_i = 1(y_i = h(x_i)).

Example: build a model of voting behavior.

- y_i = f(x_i) is 1 if voter i voted for Clinton, 0 otherwise
- h(x_i) is our guess of how the voter will vote, using hypothesis h
- z_i is 1 if we guess correctly for voter i, 0 otherwise
- z_i depends on x_i, y_i, and h

Adding in probability

Make our model probabilistic:

- fix a probability distribution P(x, y)
- sample (x_i, y_i) iid from P(x, y)
- form the data set D by sampling: for i = 1, ..., n, sample (x_i, y_i) ~ P(x, y); set D = {(x_1, y_1), ..., (x_n, y_n)}

Special case: y = f(x) is deterministic conditioned on x:

    P(y | x) = 1 if y = f(x), 0 otherwise

Hoeffding for the noisy learning problem

- fix a hypothesis h : X → Y
- draw samples (x_i, y_i) iid from P(x, y) to form D = {(x_1, y_1), ..., (x_n, y_n)}
- take z_i = 1(y_i = h(x_i))
- the z_i are iid (since the (x_i, y_i) are iid and h is fixed), with Ez = E_{(x,y) ~ P(x,y)} 1(y = h(x))

So we can apply Hoeffding: for any ε > 0,

    P[ |(1/n) Σ_{i=1}^n z_i − Ez| > ε ] ≤ 2 exp(−2ε²n).

Q: What is the probability over?
A: Other data sets D = {(x_1, y_1), ..., (x_n, y_n)} drawn iid according to P(x, y).

Q: Is (1/n) Σ_{i=1}^n z_i more like training set error or test set error?
A: Test set error, since h is independent of D.

In-sample and out-of-sample error

Some new terminology:

- in-sample error: E_in(h) = fraction of D where y_i and h(x_i) disagree = (1/n) Σ_{i=1}^n 1(y_i ≠ h(x_i))
- out-of-sample error: E_out(h) = probability that y and h(x) disagree = P_{(x,y) ~ P(x,y)}[ y ≠ h(x) ]

Notice that E_out(h) = E[E_in(h)].
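
As a concrete illustration (a sketch with an assumed toy distribution, not from the lecture), the following fixes h before seeing any data, samples a data set, and compares E_in(h) to an exactly computable E_out(h):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10000

    def f(x):                                  # true decision rule
        return (x > 0.5).astype(int)

    def h(x):                                  # hypothesis fixed before seeing data
        return (x > 0.6).astype(int)

    x = rng.random(n)                          # x ~ Uniform[0, 1]
    flip = rng.random(n) < 0.1                 # labels flipped with probability 0.1
    y = np.where(flip, 1 - f(x), f(x))

    E_in = np.mean(y != h(x))                  # (1/n) sum of 1(y_i != h(x_i))
    # f and h disagree exactly on (0.5, 0.6], an interval of probability 0.1,
    # so E_out(h) = P[y != h(x)] = 0.1 * 0.9 + 0.9 * 0.1 = 0.18
    print(f"E_in(h)  = {E_in:.3f}   (E_out(h) = 0.180 exactly)")

Since h was chosen independently of the sample, E_in(h) lands within Hoeffding's ε of E_out(h) with high probability.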

Hoeffding for the noisy learning problem

- fix a hypothesis h : X → Y
- consider the (x_i, y_i) as samples drawn from P(x, y)
- take z_i = 1(y_i = h(x_i))
- the z_i are iid (since the (x_i, y_i) are iid and h is fixed), with Ez = E_{(x,y) ~ P(x,y)} 1(y = h(x))

Apply Hoeffding (noting (1/n) Σ_{i=1}^n z_i = 1 − E_in(h) and Ez = 1 − E_out(h)): for any ε > 0,

    P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 exp(−2ε²n).

Does Hoeffding work for our learned model?

Two scenarios:

1. Without looking at any data, pick a model h : X → Y to predict who will vote for Clinton. Then sample data D = {(x_1, y_1), ..., (x_n, y_n)}, and set z_i = 1(y_i = h(x_i)).
2. Sample the data D = {(x_1, y_1), ..., (x_n, y_n)}, and use it to develop a model g : X → Y to predict who will vote for Clinton (e.g., using the perceptron). Set z′_i = 1(y_i = g(x_i)).

Is the sample mean (1/n) Σ_{i=1}^n z_i a good estimate of the expected performance Ez? Is (1/n) Σ_{i=1}^n z′_i a good estimate of Ez′?

Q: Are the z_i iid? What about the z′_i?
A: The z_i are iid. The z′_i are not independent: they depend on g, which depends on the whole data set D = {(x_1, y_1), ..., (x_n, y_n)}.

Q: Does Hoeffding apply to the first scenario? The second?
A: Hoeffding applies to the first, not to the second. Extreme case for the second scenario: the model memorizes the data.
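
The memorization extreme is easy to see numerically. In this sketch (the data model is an assumption chosen to make the point stark), the labels are fair coin flips independent of x, so no hypothesis can have E_out below 0.5, yet a model that memorizes D reports E_in = 0:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    x = rng.random(n)
    y = rng.integers(0, 2, n)                 # pure noise: y independent of x

    memory = dict(zip(x, y))                  # g memorizes the training data

    def g(xi):
        return memory.get(xi, 0)              # recall if seen, guess 0 otherwise

    E_in = np.mean([g(xi) != yi for xi, yi in zip(x, y)])   # 0: perfect recall
    x_new, y_new = rng.random(n), rng.integers(0, 2, n)     # fresh draw from P
    E_test = np.mean([g(xi) != yi for xi, yi in zip(x_new, y_new)])
    print(f"E_in(g)   = {E_in:.2f}")
    print(f"E_test(g) = {E_test:.2f}   # near 0.5")

The in-sample error of the memorizer tells us nothing about its out-of-sample error, which is exactly the failure the second scenario allows.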

Recall the validation procedure

How do we decide which model to use?

- split the data into a training set D_train and a test set D_test
- pick m different interesting model classes, e.g., different feature maps φ_1, φ_2, ..., φ_m
- fit ("train") models on the training set D_train: get one model h_j : X → Y for each φ_j, and set H = {h_1, h_2, ..., h_m}
- compute the error of each model on the test set D_test and choose the lowest:

    g = argmin_{h ∈ H} E_Dtest(h)

Q: Are the {z_i = 1(y_i = g(x_i)) : (x_i, y_i) ∈ D_test} independent?
A: No; g was chosen using D_test! Hoeffding does not directly apply: E_Dtest(g) may not accurately estimate E_out(g).
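
Here is a minimal numpy sketch of this procedure (the data, feature maps, and threshold fitting are all illustrative assumptions): one hypothesis is fit per feature map on D_train, and g is the one with lowest test-set error.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-1, 1, 400)
    y = (x**2 > 0.3).astype(int)                     # assumed ground truth

    x_tr, y_tr, x_te, y_te = x[:300], y[:300], x[300:], y[300:]

    phis = [lambda v: v, lambda v: v**2, lambda v: np.abs(v)]   # feature maps

    H = []                                           # one fitted hypothesis per phi
    for phi in phis:
        # "training" = pick the threshold on phi(x) with lowest training error
        ts = np.linspace(phi(x_tr).min(), phi(x_tr).max(), 101)
        errs = [np.mean((phi(x_tr) > t).astype(int) != y_tr) for t in ts]
        t_best = ts[int(np.argmin(errs))]
        H.append(lambda v, phi=phi, t=t_best: (phi(v) > t).astype(int))

    test_err = [np.mean(h(x_te) != y_te) for h in H]
    g = H[int(np.argmin(test_err))]                  # g = argmin_{h in H} E_Dtest(h)
    print("test-set errors:", np.round(test_err, 3))

Note that choosing g this way already uses D_test, which is the subtlety the next slides address.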

The union bound

Recall the union bound: for two random events A and B,

    P(A ∪ B) ≤ P(A) + P(B).

Q: When is this bound tight, i.e., when is P(A ∪ B) = P(A) + P(B)?
A: When A ∩ B = ∅.

Rescuing Hoeffding: the union bound

- suppose H is finite, with m hypotheses in it
- the hypothesis g is one of those m hypotheses
- so if (given a data set D) |E_in(g) − E_out(g)| > ε, then for some h ∈ H, |E_in(h) − E_out(h)| > ε
  (g depends on the data set; we might choose different h's for different data sets)

So

    P[ |E_in(g) − E_out(g)| > ε ] ≤ Σ_{h ∈ H} P[ |E_in(h) − E_out(h)| > ε ]
                                  ≤ Σ_{h ∈ H} 2 exp(−2ε²n)
                                  = 2m exp(−2ε²n).
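
A simulation makes the factor of m visible (the setup is an assumption chosen so that every hypothesis has E_out = 0.5): selecting the best-looking of m hypotheses on one data set deviates far more often than the single-hypothesis bound suggests, while the union bound still holds.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, eps, trials = 100, 50, 0.12, 4000

    # z[t, j, i] = 1 iff hypothesis j errs on example i in trial t; every
    # hypothesis errs with probability 1/2, so E_out(h) = 0.5 for all h
    z = rng.integers(0, 2, (trials, m, n), dtype=np.int8)
    E_in = z.mean(axis=2)                      # in-sample error of each hypothesis
    g_dev = 0.5 - E_in.min(axis=1)             # deviation of the selected g

    print(f"empirical P[|E_in(g) - E_out(g)| > eps] = {np.mean(g_dev > eps):.3f}")
    print(f"single-hypothesis bound: {2 * np.exp(-2 * eps**2 * n):.3f}")
    print(f"union bound (times m):   {min(1.0, 2 * m * np.exp(-2 * eps**2 * n)):.3f}")

The selected g violates the single-hypothesis bound precisely because it was picked using the same data that measures it.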

Hoeffding for learning

We just proved that our learning algorithm generalizes!

Theorem (Generalization bound for learning). Let g be a hypothesis chosen from among m different hypotheses. Then

    P[ |E_in(g) − E_out(g)| > ε ] ≤ 2m exp(−2ε²n).

Q: Do you think this bound is tight?
A: No, it can overcount badly: for random events A and B, if P(A ∩ B) is large, then P(A ∪ B) ≪ P(A) + P(B).

More information: look up the Vapnik–Chervonenkis (VC) dimension, e.g., in Learning from Data, by Abu-Mostafa, Magdon-Ismail, and Lin.

A tradeoff for learning

- we want H to be big, to make E_in small
- we want H to be small, to ensure E_out is close to E_in

What does this tell us about the difficulty of learning complicated functions f?

Generalization for regression

Theorem (Generalization bound for learning). Let g be a hypothesis chosen from among m different hypotheses. Then

    P[ |E_in(g) − E_out(g)| > ε ] ≤ 2m exp(−2ε²n).

To apply Hoeffding to real-valued outputs:

- pick some small threshold ε′ > 0
- the indicator 1((y_i − h(x_i))² ≥ ε′) is 0 if hypothesis h predicts well, 1 if hypothesis h predicts poorly
- define the error of hypothesis h on a data set D as

    E_D(h) = (1/|D|) Σ_{(x,y) ∈ D} 1((y − h(x))² ≥ ε′)
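
A small sketch of this reduction (the regression data, hypothesis, and threshold are assumptions): thresholding the squared residual turns a real-valued prediction problem back into Boolean z's, to which the finite-H bound applies.

    import numpy as np

    rng = np.random.default_rng(5)
    eps = 0.25                                 # squared-error threshold
    x = rng.uniform(0, 1, 500)
    y = 2 * x + rng.normal(0, 0.3, 500)        # assumed regression data

    def h(v):                                  # some fixed regression hypothesis
        return 2.1 * v - 0.05

    z = ((y - h(x))**2 >= eps).astype(int)     # 1 iff h misses by at least sqrt(eps)
    E_D = z.mean()                             # E_D(h): fraction predicted poorly
    print(f"E_D(h) = {E_D:.3f}")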

Recap

- we introduced a probabilistic framework for generating data
- we showed that the in-sample error predicts the out-of-sample error for a single, fixed hypothesis
- we showed that the in-sample error predicts the out-of-sample error for a learned hypothesis, when H is finite
- we stopped there, because the math gets much more complicated; but indeed, generalization is possible!
- the practical lesson, especially for complex models: don't learn and estimate your error on the same data set

in, out, train, test

- the in-sample error and training error do not obey the Hoeffding inequality (without using, e.g., a union bound)
- if the test set is only used for testing (not for model selection), the test error does obey the Hoeffding inequality:

    P[ |E_test(g) − E_out(g)| > ε ] ≤ 2 exp(−2ε²|D_test|)

- so we can use the test error to predict generalization
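
In practice one inverts this bound: fix an acceptable failure probability δ, set 2 exp(−2ε²|D_test|) = δ, and solve for ε = sqrt(ln(2/δ) / (2|D_test|)). A small worked example (the test-set size and observed error are assumed values):

    import numpy as np

    n_test = 2000                              # |D_test|
    delta = 0.05                               # allowed failure probability
    E_test = 0.12                              # observed test error

    eps = np.sqrt(np.log(2 / delta) / (2 * n_test))   # solve 2 exp(-2 eps^2 n) = delta
    print(f"with probability >= {1 - delta}: "
          f"E_out in [{E_test - eps:.3f}, {E_test + eps:.3f}]  (eps = {eps:.3f})")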

Overfitting to the test set

If the test set is used for model selection, the test error obeys the Hoeffding inequality with the union bound:

- for each model family, the optimal model trained on D_train is a hypothesis h
- so a finite number of models m means a finite hypothesis space H
- the hypotheses h ∈ H are independent of the test set D
- let g_D be the hypothesis h ∈ H with lowest error on the test set D
- Hoeffding with the union bound applies:

    P[ |E_D(g_D) − E_out(g_D)| > ε ] ≤ 2m exp(−2ε²|D|)

References

- Concentration bounds for infinite model classes: see the introduction to VC dimension in Learning from Data, by Abu-Mostafa, Magdon-Ismail, and Lin.
- Concentration bounds for cross validation: https://arxiv.org/pdf/1706.05801.pdf
- Concentration bounds for time series: see papers by Cosma Shalizi.