Generalization and Overfitting

1 Generalization and Overfitting. Model Selection. Maria-Florina (Nina) Balcan, February 24th, 2016

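2 PAC/SLT models for Supervised Learning. [Diagram: a data source with distribution $D$ on $X$ feeds an expert/oracle, which labels examples with the target $c^* : X \to Y$; the learning algorithm receives the labeled examples $(x_1, c^*(x_1)), \dots, (x_m, c^*(x_m))$ and outputs a hypothesis $h : X \to Y$, pictured as a small decision tree with tests $x_1 > 5$ and $x_6 > \dots$]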

3 PAC/SLT models for Supervised Learning. $X$: feature/instance space, with a distribution $D$ over $X$; e.g., $X = \mathbb{R}^d$ or $X = \{0,1\}^d$. The algorithm sees a training sample $S$: $(x_1, c^*(x_1)), \dots, (x_m, c^*(x_m))$, labeled examples with the $x_i$ drawn i.i.d. from $D$ and labeled by the target $c^*$; labels are in $\{-1, 1\}$ (binary classification). The algorithm does optimization over $S$ and finds a hypothesis $h$. Goal: $h$ has small error over $D$, where $err_D(h) = \Pr_{x \sim D}(h(x) \neq c^*(x))$. [Figure: instance space $X$ with the regions labeled by $h$ and by $c^*$.] Bias: fix a hypothesis space $H$ [whose complexity is not too large]. Realizable: $c^* \in H$. Agnostic: $c^*$ "close to" $H$.

4 PAC/SLT models for Supervised Learning. The algorithm sees a training sample $S$: $(x_1, c^*(x_1)), \dots, (x_m, c^*(x_m))$, $x_i$ i.i.d. from $D$. It does optimization over $S$ and finds a hypothesis $h \in H$. Goal: $h$ has small error over $D$. True error: $err_D(h) = \Pr_{x \sim D}(h(x) \neq c^*(x))$, i.e., how often $h(x) \neq c^*(x)$ over future instances drawn at random from $D$. But we can only measure the training error: $err_S(h) = \frac{1}{m} \sum_i I(h(x_i) \neq c^*(x_i))$, i.e., how often $h(x) \neq c^*(x)$ over training instances. Sample complexity: bound $err_D(h)$ in terms of $err_S(h)$.
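
To make the gap between true error and training error concrete, here is a minimal Python sketch. Everything specific in it is an assumption chosen for illustration and does not come from the slides: $D$ is uniform on $[0, 1)$, the target `c_star` is a threshold at 0.5, and the hypothesis `h` is a threshold at 0.6, so that $err_D(h) = 0.1$ under this assumed setup.

```python
import random

def c_star(x):
    """Hypothetical target concept: label +1 iff x >= 0.5."""
    return 1 if x >= 0.5 else -1

def h(x):
    """Hypothetical hypothesis: label +1 iff x >= 0.6."""
    return 1 if x >= 0.6 else -1

def draw_sample(m, rng):
    """Draw m points i.i.d. from the assumed D = Uniform[0, 1), labeled by c*."""
    xs = [rng.random() for _ in range(m)]
    return [(x, c_star(x)) for x in xs]

def training_error(hyp, sample):
    """err_S(h) = (1/m) * sum_i I[h(x_i) != c*(x_i)] over labeled pairs (x_i, y_i)."""
    return sum(1 for x, y in sample if hyp(x) != y) / len(sample)

if __name__ == "__main__":
    rng = random.Random(0)
    # h and c* disagree exactly on [0.5, 0.6), so err_D(h) = 0.1 here.
    for m in (10, 100, 10000):
        print(m, training_error(h, draw_sample(m, rng)))
```

For small $m$ the printed training error can be far from 0.1; it concentrates around the true error as $m$ grows, which is what the sample complexity bounds quantify.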

5 Sample Complexity for Supervised Learning. Consistent Learner. Input: $S$: $(x_1, c^*(x_1)), \dots, (x_m, c^*(x_m))$. Output: find $h$ in $H$ consistent with the sample (if one exists). The bound is only logarithmic in $|H|$ and linear in $1/\varepsilon$; the probability is over different samples of $m$ training examples. So, if $c^* \in H$ and we can find consistent functions, then we only need this many examples to get generalization error $\le \varepsilon$ with probability $\ge 1 - \delta$.
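
As a rough numerical illustration, the sketch below assumes the standard bound for a consistent learner over a finite class, $m \ge \frac{1}{\varepsilon}\left[\ln|H| + \ln\frac{1}{\delta}\right]$, which matches the description above (logarithmic in $|H|$, linear in $1/\varepsilon$). The function name and the example class (conjunctions over 10 Boolean features, $|H| = 3^{10}$) are hypothetical choices.

```python
import math

def consistent_learner_sample_size(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)).

    Assumed form of the finite-H, realizable-case bound.
    """
    if h_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and eps, delta in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

if __name__ == "__main__":
    # Hypothetical example: conjunctions over n = 10 Boolean features,
    # where each feature appears positively, negatively, or not at all.
    print(consistent_learner_sample_size(3 ** 10, eps=0.1, delta=0.05))
```

Squaring the size of $H$ only doubles the $\ln|H|$ term, while halving $\varepsilon$ doubles the whole bound.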

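6 Sample Complexity: Finite Hypothesis Spaces. Agnostic Case. Important conclusion: with probability at least $1 - \delta$, $err_D(\hat{h}) \le err_D(h^*) + 2\varepsilon$, where $\hat{h}$ is the ERM output and $h^*$ is the hypothesis of smallest true error rate. This follows from uniform convergence: $err_D(\hat{h}) \le err_S(\hat{h}) + \varepsilon$ and $err_S(h^*) \le err_D(h^*) + \varepsilon$, while $err_S(\hat{h}) \le err_S(h^*)$ because $\hat{h}$ minimizes the training error.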
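
For a sense of scale in the agnostic case, the sketch below assumes the standard uniform-convergence bound obtained from Hoeffding's inequality and a union bound over $H$: $m \ge \frac{1}{2\varepsilon^2}\left[\ln|H| + \ln\frac{2}{\delta}\right]$ examples suffice so that every $h \in H$ has training error within $\varepsilon$ of its true error, which in turn gives the $2\varepsilon$ guarantee for ERM. The function name and example class size are hypothetical.

```python
import math

def uniform_convergence_sample_size(h_size, eps, delta):
    """Smallest integer m with m >= (1/(2 eps^2)) * (ln|H| + ln(2/delta)).

    Assumed form of the finite-H, agnostic-case bound (Hoeffding + union bound).
    """
    if h_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and eps, delta in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(2 / delta)) / (2 * eps ** 2))

if __name__ == "__main__":
    # Same hypothetical class size as a 10-feature conjunction class.
    m = uniform_convergence_sample_size(3 ** 10, eps=0.1, delta=0.05)
    print(m, "examples give an ERM excess error of at most", 2 * 0.1)
```

The dependence on $\varepsilon$ is now $1/\varepsilon^2$ rather than $1/\varepsilon$, which is the main price of dropping the realizability assumption.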

7 Definition: Shattering, VC-dimension. $H$ shatters $S$ if $|H[S]| = 2^{|S|}$. A set of points $S$ is shattered by $H$ if there are hypotheses in $H$ that split $S$ in all of the $2^{|S|}$ possible ways, i.e., all possible ways of classifying points in $S$ are achievable using concepts in $H$. Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space $H$ is the cardinality of the largest set $S$ that can be shattered by $H$. If arbitrarily large finite sets can be shattered by $H$, then $VCdim(H) = \infty$.

8 Shattering, VC-dimension. Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space $H$ is the cardinality of the largest set $S$ that can be shattered by $H$. If arbitrarily large finite sets can be shattered by $H$, then $VCdim(H) = \infty$. To show that the VC-dimension is $d$: (1) there exists a set of $d$ points that can be shattered; (2) there is no set of $d+1$ points that can be shattered. Fact: if $H$ is finite, then $VCdim(H) \le \log_2(|H|)$.
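
The two-step recipe above can be checked by brute force on small finite examples. The sketch below is illustrative only: the helper names (`labelings`, `shatters`, `vc_dimension`) and the two example classes (thresholds and intervals over the domain $\{0, \dots, 5\}$) are assumptions, not part of the slides.

```python
from itertools import combinations

def labelings(hypotheses, points):
    """The set H[S] of distinct label vectors that H induces on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(hypotheses, points):
    """H shatters S iff |H[S]| = 2^|S|."""
    return len(labelings(hypotheses, points)) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest d such that some d-point subset of a finite domain is shattered."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, size)):
            d = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return d

# Hypothetical example classes over a small finite domain.
domain = list(range(6))
thresholds = [lambda x, t=t: x >= t for t in range(7)]
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in range(6) for b in range(a, 6)]

if __name__ == "__main__":
    print("thresholds:", vc_dimension(thresholds, domain))
    print("intervals:", vc_dimension(intervals, domain))
```

Thresholds cannot label a left point positive and a right point negative, and intervals cannot realize the pattern positive, negative, positive on three points, which is why the search stops at 1 and 2 respectively. The search is exponential in the domain size, so it is only meant for toy examples.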

9 Sample Complexity: Infinite Hypothesis Spaces. $H[m]$: the maximum number of ways to split $m$ points using concepts in $H$. Sauer's Lemma: $H[m] = O(m^{VCdim(H)})$.
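
Sauer's lemma is often stated in the sharper form $H[m] \le \sum_{i=0}^{d} \binom{m}{i}$ with $d = VCdim(H)$, which is polynomial in $m$ once $m > d$. The sketch below evaluates that sum and compares it with a brute-force count for intervals on a line ($d = 2$); the function names and the choice of intervals as the example class are assumptions made for illustration.

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's lemma: H[m] <= sum_{i=0}^{d} C(m, i) when VCdim(H) = d."""
    return sum(comb(m, i) for i in range(d + 1))

def interval_growth(m):
    """Brute-force count of the labelings of m collinear points by intervals."""
    points = range(m)
    found = {tuple(a <= x <= b for x in points)
             for a in range(m) for b in range(a, m)}
    found.add(tuple(False for _ in points))  # the empty interval
    return len(found)

if __name__ == "__main__":
    for m in (2, 5, 10, 20):
        print(m, interval_growth(m), sauer_bound(m, 2), 2 ** m)
```

For intervals the count meets the bound exactly, and both grow quadratically while $2^m$ grows exponentially.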

10 Sample Complexity: Infinite Hypothesis Spaces Realizable Case

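11 Sample Complexity: Infinite Hypothesis Spaces. Theorem (agnostic case): $m \ge \frac{C}{\varepsilon^2}\left[VCdim(H) + \log\frac{1}{\delta}\right]$ labeled examples are sufficient so that, with probability at least $1 - \delta$, for all $h$ in $H$, $|err_D(h) - err_S(h)| \le \varepsilon$. Statistical Learning Theory style: with probability at least $1 - \delta$, for all $h$ in $H$, $err_D(h) \le err_S(h) + \sqrt{\frac{1}{2m}\left(VCdim(H) + \ln\frac{1}{\delta}\right)}$.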
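
To see how the gap term in the statistical-learning-theory-style bound behaves, the sketch below evaluates $\sqrt{\frac{1}{2m}\left(VCdim(H) + \ln\frac{1}{\delta}\right)}$ for a few sample sizes. Reading the slide's bound with a square root is an assumption (it is the form consistent with the $1/\varepsilon^2$ sample complexity), constants and log factors are ignored, and the function name `vc_gap` is hypothetical, so the numbers are indicative only.

```python
import math

def vc_gap(m, vcdim, delta):
    """Assumed gap term: sqrt((VCdim(H) + ln(1/delta)) / (2m))."""
    if m < 1 or vcdim < 0 or not (0 < delta < 1):
        raise ValueError("need m >= 1, vcdim >= 0 and delta in (0, 1)")
    return math.sqrt((vcdim + math.log(1 / delta)) / (2 * m))

if __name__ == "__main__":
    for m in (100, 1000, 10000):
        print(m, round(vc_gap(m, vcdim=10, delta=0.05), 4))
```

Multiplying $m$ by 100 shrinks the gap by a factor of 10, while increasing the VC-dimension at fixed $m$ widens it.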

12 Can we use our bounds for model selection?

13 True Error, Training Error, Overfitting. Model selection: trade-off between decreasing training error and keeping $H$ simple. $err_D(h) \le err_S(h) + \sqrt{\frac{VCdim(H) + \dots}{m}}$. [Figure: error versus complexity, with curves for the train error and the generalization error.]

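14 Structural Risk Minimization (SRM). $H_1 \subseteq H_2 \subseteq H_3 \subseteq \dots \subseteq H_i \subseteq \dots$ [Figure: error rate versus hypothesis complexity, showing the empirical error curve and the overfitting region.]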

15 What happens if we increase $m$? The black curve will stay close to the red curve for longer; everything shifts to the right.

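17 Structural Risk Minimization (SRM). $H_1 \subseteq H_2 \subseteq H_3 \subseteq \dots \subseteq H_i \subseteq \dots$ For each $k$, let $\hat{h}_k = \arg\min_{h \in H_k} err_S(h)$. As $k$ increases, $err_S(\hat{h}_k)$ goes down but the complexity term goes up. Choose $\hat{k} = \arg\min_{k \ge 1} \{ err_S(\hat{h}_k) + complexity(H_k) \}$ and output $\hat{h} = \hat{h}_{\hat{k}}$. Claim: w.h.p., $err_D(\hat{h}) \le \min_{k} \min_{h \in H_k} [ err_D(h) + 2\, complexity(H_k) ]$.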
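
The SRM procedure can be sketched on a toy nested family where ERM is easy. All specifics below are assumptions for illustration: $H_k$ is the class of classifiers on $[0, 1)$ that are constant on each of $2^k$ equal-width bins (so $H_1 \subseteq H_2 \subseteq \dots$ and $VCdim(H_k) = 2^k$), the penalty is $\sqrt{(VCdim(H_k) + \ln(1/\delta)) / (2m)}$, and the data come from a hypothetical noisy target that is $+1$ on $[0.25, 0.75)$.

```python
import math
import random

def make_sample(m, rng, noise=0.1):
    """Hypothetical data: x ~ Uniform[0,1), label +1 on [0.25, 0.75), 10% flips."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = 1 if 0.25 <= x < 0.75 else -1
        if rng.random() < noise:
            y = -y
        sample.append((x, y))
    return sample

def fit_histogram(sample, k):
    """ERM over H_k: majority label inside each of the 2**k equal-width bins."""
    bins = 2 ** k
    votes = [0] * bins
    for x, y in sample:
        votes[min(int(x * bins), bins - 1)] += y
    return [1 if v >= 0 else -1 for v in votes]

def error(labels, sample):
    """Fraction of the sample misclassified by the histogram classifier."""
    bins = len(labels)
    wrong = sum(labels[min(int(x * bins), bins - 1)] != y for x, y in sample)
    return wrong / len(sample)

def complexity(k, m, delta=0.05):
    """Assumed penalty sqrt((VCdim(H_k) + ln(1/delta)) / (2m)), VCdim(H_k) = 2**k."""
    return math.sqrt((2 ** k + math.log(1 / delta)) / (2 * m))

def srm(sample, max_k):
    """Return (k_hat, h_hat) minimizing err_S(h_k) + complexity(H_k) over k."""
    m = len(sample)
    best = None
    for k in range(1, max_k + 1):
        h_k = fit_histogram(sample, k)
        score = error(h_k, sample) + complexity(k, m)
        if best is None or score < best[0]:
            best = (score, k, h_k)
    return best[1], best[2]

if __name__ == "__main__":
    sample = make_sample(2000, random.Random(0))
    k_hat, h_hat = srm(sample, max_k=8)
    print("selected k:", k_hat, "training error:", error(h_hat, sample))
```

Training error alone would always favor the largest $k$, since the classes are nested; the penalty is what stops the search at the coarsest class that already fits the target. Here ERM is trivial within each class, which is the exception rather than the rule.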

18 Techniques to Handle Overfitting. Structural Risk Minimization (SRM): with $H_1 \subseteq H_2 \subseteq \dots \subseteq H_i \subseteq \dots$, minimize the generalization bound, $\hat{h} = \arg\min_{k \ge 1} \{ err_S(\hat{h}_k) + complexity(H_k) \}$. Often computationally hard. A nice case where it is possible: M. Kearns, Y. Mansour, ICML '98, "A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization". Regularization: a general family closely related to SRM. E.g., SVM, regularized logistic regression, etc. minimize expressions of the form $err_S(h) + \lambda \|h\|^2$. Cross Validation: hold out part of the training data and use it as a proxy for the generalization error.
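
The hold-out idea can be sketched in a few lines. This is a single train/hold-out split rather than full k-fold cross validation, and every specific is an assumption for illustration: the candidate models are histogram classifiers on $[0, 1)$ with $2^k$ bins, the data generator `make_sample` is hypothetical, and one third of the data is held out.

```python
import random

def make_sample(m, rng, noise=0.1):
    """Hypothetical data: x ~ Uniform[0,1), label +1 on [0.25, 0.75), 10% flips."""
    sample = []
    for _ in range(m):
        x = rng.random()
        y = 1 if 0.25 <= x < 0.75 else -1
        sample.append((x, -y if rng.random() < noise else y))
    return sample

def fit_histogram(train, k):
    """Majority label inside each of 2**k equal-width bins of [0, 1)."""
    bins = 2 ** k
    votes = [0] * bins
    for x, y in train:
        votes[min(int(x * bins), bins - 1)] += y
    return [1 if v >= 0 else -1 for v in votes]

def error(labels, sample):
    """Fraction of the sample misclassified by the histogram classifier."""
    bins = len(labels)
    wrong = sum(labels[min(int(x * bins), bins - 1)] != y for x, y in sample)
    return wrong / len(sample)

def holdout_select(sample, max_k, holdout_fraction=1 / 3, seed=0):
    """Fit each candidate on the training part; pick k with least hold-out error."""
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    n_hold = int(len(shuffled) * holdout_fraction)
    if n_hold == 0 or n_hold == len(shuffled):
        raise ValueError("both the training part and the hold-out part must be non-empty")
    holdout, train = shuffled[:n_hold], shuffled[n_hold:]
    best_k, best_err = None, float("inf")
    for k in range(1, max_k + 1):
        err = error(fit_histogram(train, k), holdout)
        if err < best_err:  # strict '<' keeps the simplest model on ties
            best_k, best_err = k, err
    return best_k, best_err

if __name__ == "__main__":
    sample = make_sample(3000, random.Random(0))
    print(holdout_select(sample, max_k=8))
```

Because the hold-out points play no part in fitting, their error rate is an unbiased estimate of each candidate's true error, at the cost of training on fewer examples.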

19 What you should know. Notion of sample complexity. Understand the reasoning behind the simple sample complexity bound for finite $H$ [exam question!]. Shattering, VC dimension as a measure of complexity, Sauer's lemma, form of the VC bounds (upper and lower bounds). Model Selection, Structural Risk Minimization.
