ORIE 4741: Learning with Big Messy Data
Generalization
Professor Udell
Operations Research and Information Engineering, Cornell
September 23, 2017
Announcements

- midterm 10/5
- makeup exam 10/2, by arrangement with instructors (by email)
- homework due next Tuesday
- substitute next week: Sumanta Basu on decision trees!
- I won't hold OH next week
Generalization and Overfitting

the goal of the model is not to predict well on D; the goal of the model is to predict well on new data. if the model has training set error and test set error, we say the model:

                             low test set error    high test set error
    low training set error   generalizes           overfits
    high training set error  ?!?!                  underfits
Simplest case: generalizing from a mean

exit polling: sample n voters leaving polling places (1)

for each voter i, define the Boolean random variable

    z_i = 1 if voter i voted for Clinton, 0 otherwise

- sample mean: ν = (1/n) Σ_{i=1}^n z_i
- true mean: µ = E_{i ~ US electorate} z_i is Clinton's expected vote share

what does the sample mean ν tell us about the true mean µ?

(1) and suppose no one votes by mail or early or ..., so these are iid samples from the US electorate
Hoeffding inequality

Theorem (Hoeffding Inequality). Let z_i ∈ {0, 1}, i = 1, ..., n, be independent Boolean random variables with mean E z_i = µ. Define the sample mean ν = (1/n) Σ_{i=1}^n z_i. Then for any ε > 0,

    P[|ν − µ| > ε] ≤ 2 exp(−2ε²n).

an example of a concentration inequality:
- µ can't be much higher than ν
- µ can't be much lower than ν
- more samples n improve the estimate exponentially quickly
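a quick simulation (not from the slides; the fair-coin setup µ = 0.5 is an assumed example) that checks the bound empirically:

```python
import math
import random

def deviation_prob(mu, n, eps, trials=20000, seed=0):
    """Monte-Carlo estimate of P[|nu - mu| > eps] for the mean of n Bernoulli(mu) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            hits += 1
    return hits / trials

mu, n, eps = 0.5, 100, 0.1
empirical = deviation_prob(mu, n, eps)
hoeffding = 2 * math.exp(-2 * eps**2 * n)  # = 2 exp(-2), about 0.27
print(empirical, "<=", hoeffding)
```

the empirical deviation probability comes out far below the bound, which is typical: Hoeffding holds for any bounded variables, so it is loose for any particular distribution.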
Back to the learning problem

fix a hypothesis h : X → Y. take

    z_i = 1 if y_i = h(x_i), 0 otherwise
        = 1(y_i = h(x_i))

example. build a model of voting behavior:
- y_i = f(x_i) is 1 if voter i voted for Clinton, 0 otherwise
- h(x_i) is our guess of how the voter will vote, using hypothesis h
- z_i is 1 if we guess correctly for voter i, 0 otherwise
- z_i depends on x_i, y_i, and h
Adding in probability

make our model probabilistic:
- fix a probability distribution P(x, y)
- sample (x_i, y_i) iid from P(x, y)

form data set D by sampling:
- for i = 1, ..., n, sample (x_i, y_i) ~ P(x, y)
- set D = {(x_1, y_1), ..., (x_n, y_n)}

special case. y = f(x) is deterministic conditioned on x:

    P(y | x) = 1 if y = f(x), 0 otherwise
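the sampling model, including the deterministic special case, can be sketched in a few lines of Python; the feature distribution and the target f below are hypothetical stand-ins:

```python
import random

def f(x):
    """Hypothetical deterministic target: label is 1 iff the feature is positive."""
    return 1 if x > 0 else 0

def sample_dataset(n, seed=0):
    """Draw D = {(x_1, y_1), ..., (x_n, y_n)} iid: x from a fixed P(x), y = f(x)."""
    rng = random.Random(seed)
    return [(x, f(x)) for x in (rng.uniform(-1, 1) for _ in range(n))]

D = sample_dataset(10)
print(len(D))  # 10
```

here P(y | x) puts all its mass on y = f(x); a noisy P(y | x) would replace `f(x)` with a random draw conditioned on x.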
Hoeffding for the noisy learning problem

fix a hypothesis h : X → Y.
- draw samples (x_i, y_i) iid from P(x, y) to form D = {(x_1, y_1), ..., (x_n, y_n)}
- take z_i = 1(y_i = h(x_i))
- z_i are iid (since (x_i, y_i) are iid, and h is fixed)
- E z = E_{(x,y)~P(x,y)} 1(y = h(x))

so we can apply Hoeffding! for any ε > 0,

    P[|(1/n) Σ_{i=1}^n z_i − E z| > ε] ≤ 2 exp(−2ε²n)

Q: what is the probability over?
A: other data sets D = {(x_1, y_1), ..., (x_n, y_n)} drawn iid according to P(x, y)

Q: is (1/n) Σ_{i=1}^n z_i more like training set error or test set error?
A: test set error, since h is independent of D
In-sample and out-of-sample error

some new terminology:

in-sample error.

    E_in(h) = fraction of D where y_i and h(x_i) disagree
            = (1/n) Σ_{i=1}^n 1(y_i ≠ h(x_i))

out-of-sample error.

    E_out(h) = probability that y and h(x) disagree
             = P_{(x,y)~P(x,y)} [y ≠ h(x)]

notice E_out(h) = E[E_in(h)]
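a small sketch of the two errors under an assumed setup (x uniform on [-1, 1], with a hypothetical target and hypothesis): E_in is computed exactly on D, and E_out is approximated by Monte Carlo.

```python
import random

def f(x):   # hypothetical true target
    return int(x > 0)

def h(x):   # a fixed hypothesis, wrong exactly on the interval (0, 0.2]
    return int(x > 0.2)

def E_in(D):
    """In-sample error: fraction of D where y_i and h(x_i) disagree."""
    return sum(y != h(x) for x, y in D) / len(D)

def E_out_estimate(n=100_000, seed=1):
    """Monte-Carlo estimate of E_out(h) = P[y != h(x)] for x ~ Unif(-1, 1).
    The exact value here is P[0 < x <= 0.2] = 0.1."""
    rng = random.Random(seed)
    return sum(f(x) != h(x) for x in (rng.uniform(-1, 1) for _ in range(n))) / n

rng = random.Random(2)
D = [(x, f(x)) for x in (rng.uniform(-1, 1) for _ in range(500))]
print(E_in(D), E_out_estimate())  # both near 0.1
```

with h fixed in advance, E_in(D) concentrates around E_out(h) = 0.1 exactly as Hoeffding predicts.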
Hoeffding for the noisy learning problem

fix a hypothesis h : X → Y.
- consider (x_i, y_i) as samples drawn from P(x, y)
- take z_i = 1(y_i ≠ h(x_i))
- z_i are iid (since (x_i, y_i) are iid, and h is fixed)
- E z = E_{(x,y)~P(x,y)} 1(y ≠ h(x)) = E_out(h)

apply Hoeffding: for any ε > 0,

    P[|E_in(h) − E_out(h)| > ε] ≤ 2 exp(−2ε²n)
Does Hoeffding work for our learned model?

two scenarios:

1. Without looking at any data, pick a model h : X → Y to predict who will vote for Clinton. Then sample data D = {(x_1, y_1), ..., (x_n, y_n)}, and set z_i = 1(y_i = h(x_i)).
2. Sample the data D = {(x_1, y_1), ..., (x_n, y_n)}, and use it to develop a model g : X → Y to predict who will vote for Clinton (e.g., using perceptron). Set z̃_i = 1(y_i = g(x_i)).

Is the sample mean (1/n) Σ_{i=1}^n z_i a good estimate for the expected performance E z? Is (1/n) Σ_{i=1}^n z̃_i a good estimate for E z̃?

Q: Are the z_i's iid? What about the z̃_i's?
A: The z_i's are iid. The z̃_i's are not independent: they depend on g, which depends on the whole data set D = {(x_1, y_1), ..., (x_n, y_n)}.

Q: Does Hoeffding apply to the first? the second?
A: Hoeffding applies to the first, not to the second. Extreme case for the second scenario: the model memorizes the data.
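the "memorizes the data" extreme can be made concrete (a toy sketch, with purely random labels so that no rule can beat E_out = 1/2):

```python
import random

rng = random.Random(0)
D = [(i, rng.randint(0, 1)) for i in range(200)]  # labels are pure coin flips

lookup = dict(D)
def g(x):
    """A 'model' that memorizes the training set and guesses 0 elsewhere."""
    return lookup.get(x, 0)

train_error = sum(y != g(x) for x, y in D) / len(D)
print(train_error)  # 0.0
```

the memorizer's in-sample error is exactly zero, yet its out-of-sample error is 1/2: because g was chosen using D, the sample mean of the z̃_i's tells us nothing about E z̃, and Hoeffding gives no guarantee.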
Recall validation procedure

how to decide which model to use?
- split data into training set D_train and test set D_test
- pick m different interesting model classes, e.g., different φs: φ_1, φ_2, ..., φ_m
- fit ("train") models on training set D_train: get one model h : X → Y for each φ, and set H = {h_1, h_2, ..., h_m}
- compute the error of each model on test set D_test and choose the lowest:

    g = argmin_{h ∈ H} E_{D_test}(h)

Q: Are {z_i = 1(y_i = g(x_i)) : (x_i, y_i) ∈ D_test} independent?
A: No; g was chosen using D_test! Hoeffding does not directly apply: E_{D_test}(g) may not accurately estimate E_out(g)
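the selection step can be sketched as follows (the toy data and the threshold "model classes" are assumptions for illustration, not the course's models):

```python
import random

def error(h, data):
    """Fraction of data where the hypothesis and the label disagree."""
    return sum(y != h(x) for x, y in data) / len(data)

def make_threshold(t):
    """Hypothetical model class phi_t: classify as 1 iff x > t."""
    return lambda x: int(x > t)

rng = random.Random(0)
data = [(x, int(x > 0)) for x in (rng.uniform(-1, 1) for _ in range(200))]
train, test = data[:150], data[150:]

H = [make_threshold(t) for t in (-0.5, 0.0, 0.5)]  # one fitted model per class
g = min(H, key=lambda h: error(h, test))           # g = argmin over H of test error
print(error(g, test))
```

note the catch the slide points out: because `g` is the minimizer of the test error, `error(g, test)` is an optimistically biased estimate of E_out(g).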
The union bound

recall the union bound: for two random events A and B,

    P(A ∪ B) ≤ P(A) + P(B)

Q: When is this bound tight? i.e., when is P(A ∪ B) = P(A) + P(B)?
A: When A ∩ B = ∅
Rescuing Hoeffding: the union bound

- let's suppose H is finite, with m hypotheses in it
- the hypothesis g is one of those m hypotheses
- so if (given a data set D) |E_in(g) − E_out(g)| > ε, then for some h ∈ H, |E_in(h) − E_out(h)| > ε
- (g depends on the data set; we might choose different hs for different data sets)

so

    P[|E_in(g) − E_out(g)| > ε] ≤ Σ_{h ∈ H} P[|E_in(h) − E_out(h)| > ε]
                                ≤ Σ_{h ∈ H} 2 exp(−2ε²n)
                                = 2m exp(−2ε²n)
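a simulation of this argument (assumed setup: m fixed "hypotheses", each with true error E_out = 1/2 against fresh coin-flip labels), checking that the chance that any hypothesis deviates stays within 2m exp(−2ε²n):

```python
import math
import random

def any_bad_prob(m, n, eps, trials=2000, seed=0):
    """Estimate P[ some h among m fixed hypotheses has |E_in(h) - E_out(h)| > eps ],
    where each E_in is the mean of n iid Bernoulli(1/2) mistakes (so E_out = 0.5)."""
    rng = random.Random(seed)
    bad_trials = 0
    for _ in range(trials):
        for _h in range(m):
            e_in = sum(rng.random() < 0.5 for _ in range(n)) / n
            if abs(e_in - 0.5) > eps:
                bad_trials += 1
                break
    return bad_trials / trials

m, n, eps = 10, 200, 0.1
p_bad = any_bad_prob(m, n, eps)
union_bound = 2 * m * math.exp(-2 * eps**2 * n)  # about 0.37
print(p_bad, "<=", union_bound)
```

even with independent hypotheses (the worst case for the union bound), the empirical probability sits well below 2m exp(−2ε²n), previewing the slide's later point about looseness.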
Hoeffding for learning

we just proved that our learning algorithm generalizes!

Theorem (Generalization bound for learning). Let g be a hypothesis chosen from among m different hypotheses. Then

    P[|E_in(g) − E_out(g)| > ε] ≤ 2m exp(−2ε²n).

Q: do you think this bound is tight?
A: no, it can overcount badly. for random events A and B, if P(A ∩ B) is large, then P(A ∪ B) ≪ P(A) + P(B)

more information. look up the Vapnik–Chervonenkis (VC) dimension, e.g., in Learning from Data, by Abu-Mostafa, Magdon-Ismail, and Lin.
A tradeoff for learning

- we want H to be big to make E_in small
- we want H to be small to ensure E_out is close to E_in

what does this tell us about the difficulty of learning complicated functions f?
Generalization for regression

Theorem (Generalization bound for learning). Let g be a hypothesis chosen from among m different hypotheses. Then

    P[|E_in(g) − E_out(g)| > ε] ≤ 2m exp(−2ε²n).

to apply Hoeffding to real-valued outputs:
- pick some small ε > 0
- 1((y_i − h(x_i))² ≥ ε) is 0 if hypothesis h predicts well, 1 if hypothesis h predicts poorly
- define the error of hypothesis h on data set D as

    E_D(h) = (1/|D|) Σ_{(x,y) ∈ D} 1((y − h(x))² ≥ ε)
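a minimal sketch of this thresholded error on toy numbers (the data points and the linear hypothesis are made up for illustration):

```python
def E_D(h, D, eps=0.1):
    """Fraction of D where the squared error (y - h(x))^2 is at least eps."""
    return sum((y - h(x)) ** 2 >= eps for x, y in D) / len(D)

h = lambda x: x  # hypothetical hypothesis: predict y = x
D = [(0.0, 0.0), (1.0, 1.2), (2.0, 3.0)]
# squared errors: 0.0, 0.04, 1.0 -> only the last is >= eps = 0.1
print(E_D(h, D))  # 1/3
```

since each indicator is again a Boolean random variable, the finite-H bound above applies to E_D verbatim.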
Recap

- we introduced a probabilistic framework for generating data
- we showed that the in-sample error predicts the out-of-sample error for a single hypothesis
- we showed that the in-sample error predicts the out-of-sample error for a learned hypothesis, when H is finite
- we stopped there, because the math gets much more complicated; but indeed, generalization is possible!

the practical lesson (especially for complex models): don't learn and estimate your error on the same data set
in, out, train, test

- the in-sample error and training error do not obey the Hoeffding inequality (without using, e.g., a union bound)
- if the test set is only used for testing (not for model selection), the test error does obey the Hoeffding inequality:

    P[|E_test(g) − E_out(g)| > ε] ≤ 2 exp(−2ε² |D_test|)

- so we can use the test error to predict generalization
Overfitting to the test set

if the test set is used for model selection, the test error obeys the Hoeffding inequality with the union bound:
- for each model family, the optimal model trained on D_train is a hypothesis h
- so a finite number of models m implies a finite hypothesis space H
- the hypotheses h ∈ H are independent of the test set D_test
- let g be the hypothesis h ∈ H with lowest error on the test set D_test
- Hoeffding with the union bound applies!

    P[|E_test(g) − E_out(g)| > ε] ≤ 2m exp(−2ε² |D_test|)
References

- Concentration bounds for infinite model classes: see the introduction to VC dimension in Learning from Data by Abu-Mostafa et al.
- Concentration bounds for cross validation: https://arxiv.org/pdf/1706.05801.pdf
- Concentration bounds for time series: see papers by Cosma Shalizi