Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016
Two Core Aspects of Machine Learning. Algorithm Design: how to optimize? Automatically generate rules that do well on observed data. E.g., logistic regression, SVM, Adaboost, etc. Confidence Bounds, Generalization: confidence for rule effectiveness on future data. Very well understood: Occam's bound, VC theory, etc. Our focus so far has been on algorithm design. In this module: generalization guarantees (not overfitting).
PAC/SLT models for Supervised Learning. [Diagram: a data source generates instances according to distribution D on X; an expert/oracle labels them with the target c* : X → Y; the learning algorithm receives labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)) and outputs a hypothesis h : X → Y, e.g., a decision rule built from tests such as "x_1 > 5", "x_6 > 2".]
PAC/SLT models for Supervised Learning
X: feature/instance space; distribution D over X, e.g., X = R^d or X = {0,1}^d.
Algorithm sees training sample S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)), with the x_i drawn i.i.d. from D and labeled by the target c*; labels ∈ {-1, 1} (binary classification).
Algorithm does optimization over S, finds hypothesis h. Goal: h has small error over D, where err_D(h) = Pr_{x~D}(h(x) ≠ c*(x)).
Bias: fix a hypothesis space H [whose complexity is not too large].
Realizable: c* ∈ H. Agnostic: c* close to H.
PAC/SLT models for Supervised Learning
Algorithm sees training sample S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)), x_i i.i.d. from D. Does optimization over S, finds hypothesis h ∈ H. Goal: h has small error over D.
True error: err_D(h) = Pr_{x~D}(h(x) ≠ c*(x)) — how often h(x) ≠ c*(x) over future instances drawn at random from D. But we can only measure:
Training error: err_S(h) = (1/m) Σ_i I[h(x_i) ≠ c*(x_i)] — how often h(x_i) ≠ c*(x_i) over the training instances.
Sample complexity: bound err_D(h) in terms of err_S(h).
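As a concrete illustration of these two quantities, here is a minimal Python sketch (not from the lecture; the target, hypothesis, distribution, and sample sizes are made up): it computes err_S(h) on a small training sample and approximates err_D(h) with a very large fresh sample from D.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target concept c*: label +1 iff x > 0 (a threshold on the real line).
def c_star(x):
    return np.where(x > 0.0, 1, -1)

# An imperfect hypothesis h: threshold at 0.3 instead of 0.
def h(x):
    return np.where(x > 0.3, 1, -1)

# Training sample S of m points drawn i.i.d. from D = N(0, 1).
m = 50
S = rng.normal(size=m)

# Training error err_S(h): fraction of training points where h disagrees with c*.
err_S = np.mean(h(S) != c_star(S))

# True error err_D(h) is over the whole distribution; we can only approximate it,
# here with a very large fresh sample from D.
big_sample = rng.normal(size=1_000_000)
err_D_approx = np.mean(h(big_sample) != c_star(big_sample))

print(f"training error err_S(h) ~= {err_S:.3f}")
print(f"approximate true error err_D(h) ~= {err_D_approx:.3f}")
```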
Sample Complexity for Supervised Learning
Consistent Learner. Input: S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)). Output: find h in H consistent with the sample (if one exists).
Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that, with probability at least 1 − δ (over different samples of m training examples), every h ∈ H consistent with S has err_D(h) ≤ ε.
Bound is only logarithmic in |H|, linear in 1/ε.
So, if c* ∈ H and we can find consistent functions, then we only need this many examples to get generalization error ≤ ε with probability ≥ 1 − δ.
PAC bound: with high probability (at least 1 − δ) we are approximately correct (generalization error ≤ ε).
What if c* ∉ H?
Uniform Convergence
This basic result only bounds the chance that a bad hypothesis looks perfect on the data. What if there is no perfect h ∈ H (agnostic case)? What can we say if c* ∉ H?
Can we say that w.h.p. all h ∈ H satisfy err_D(h) ≤ err_S(h) + ε? This is called uniform convergence. It motivates optimizing over S, even if we can't find a perfect function.
Sample Complexity: Uniform Convergence, Agnostic Case
Empirical Risk Minimization (ERM). Input: S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)). Output: find h in H with smallest err_S(h).
Theorem: m ≥ (1/(2ε²))(ln|H| + ln(2/δ)) labeled examples are sufficient so that, with probability at least 1 − δ, all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε.
Note the 1/ε² dependence [as opposed to 1/ε for the realizable case].
Sample Complexity: Finite Hypothesis Spaces, Agnostic Case
Important conclusion: w.h.p. (at least 1 − δ), err_D(ĥ) ≤ err_D(h*) + 2ε, where ĥ is the ERM output and h* is the hypothesis of smallest true error rate.
Proof sketch: err_D(ĥ) ≤ err_S(ĥ) + ε ≤ err_S(h*) + ε ≤ err_D(h*) + 2ε, using uniform convergence for the first and last steps, and the fact that ĥ minimizes err_S for the middle step.
Sample Complexity: Finite Hypothesis Spaces
Realizable case: m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice so that every consistent h ∈ H has err_D(h) ≤ ε.
Agnostic case (what if there is no perfect h?): m ≥ (1/(2ε²))(ln|H| + ln(2/δ)) examples suffice so that all h ∈ H have |err_D(h) − err_S(h)| ≤ ε.
To prove bounds like this, we need some good tail inequalities (e.g., Hoeffding bounds).
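To see the 1/ε versus 1/ε² gap numerically, here is a small Python sketch evaluating the two finite-|H| bounds in the forms stated above (the particular values of |H|, ε, δ are arbitrary).

```python
import math

def m_realizable(H_size, eps, delta):
    # Consistent-learner bound: m >= (1/eps) * (ln|H| + ln(1/delta))
    return math.ceil((1.0 / eps) * (math.log(H_size) + math.log(1.0 / delta)))

def m_agnostic(H_size, eps, delta):
    # Uniform-convergence bound (Hoeffding + union bound):
    # m >= (1/(2 eps^2)) * (ln|H| + ln(2/delta))
    return math.ceil((1.0 / (2.0 * eps ** 2)) * (math.log(H_size) + math.log(2.0 / delta)))

H_size, eps, delta = 10_000, 0.05, 0.05
print("realizable:", m_realizable(H_size, eps, delta))  # linear in 1/eps
print("agnostic:  ", m_agnostic(H_size, eps, delta))    # quadratic in 1/eps
```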
What if H is infinite?
E.g., linear separators in R^d. E.g., thresholds on the real line (h_w(x) = +1 iff x > w). E.g., intervals [a, b] on the real line (+1 inside, −1 outside).
Definition: Shattering, VC-dimension
H[S]: the set of splittings of dataset S using concepts from H. H shatters S if |H[S]| = 2^|S|.
A set of points S is shattered by H if there are hypotheses in H that split S in all of the 2^|S| possible ways; i.e., all possible ways of classifying the points in S are achievable using concepts in H.
Definition (VC-dimension, Vapnik-Chervonenkis dimension): The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.
Shattering, VC-dimension
Definition (VC-dimension, Vapnik-Chervonenkis dimension): The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.
To show that the VC-dimension is d: show that there exists a set of d points that can be shattered, and that there is no set of d+1 points that can be shattered.
Fact: If H is finite, then VCdim(H) ≤ log₂(|H|).
Shattering, VC-dimension
If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.
E.g., H = thresholds on the real line: VCdim(H) = 1 (one point can be shattered; for two points, the labeling with the left point + and the right point − is not achievable).
E.g., H = intervals on the real line: VCdim(H) = 2 (two points can be shattered; for three points, the labeling +, −, + is not achievable).
Shattering, VC-dimension
If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.
E.g., H = unions of k intervals on the real line: VCdim(H) = 2k.
VCdim(H) ≥ 2k: a sample of size 2k can be shattered (treat each pair of consecutive points as a separate case for one interval).
VCdim(H) < 2k + 1: on 2k + 1 points, the alternating labeling +, −, +, −, ..., + has k + 1 separated positive regions and so would require k + 1 intervals.
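These small VC-dimension claims can be checked by brute force: a class shatters a point set exactly when every labeling of the set is achieved by some hypothesis. Here is a Python sketch (not from the lecture; the point sets and the way candidate hypotheses are enumerated are chosen for illustration) verifying the threshold and interval examples.

```python
def shatters(points, hypotheses):
    """True iff the hypotheses achieve all 2^|S| labelings of `points`."""
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    return len(achieved) == 2 ** len(points)

# Candidate thresholds / interval endpoints: one value between each pair of
# consecutive points plus one on each side (enough to realize every achievable labeling).
def candidate_cuts(points):
    pts = sorted(points)
    return [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]

def threshold_hyps(points):
    # Thresholds on the real line: h_w(x) = +1 iff x > w.
    return [lambda x, w=w: +1 if x > w else -1 for w in candidate_cuts(points)]

def interval_hyps(points):
    # Intervals [a, b]: +1 inside, -1 outside.
    cuts = candidate_cuts(points)
    return [lambda x, a=a, b=b: +1 if a <= x <= b else -1
            for a in cuts for b in cuts if a <= b]

print(shatters([0.0], threshold_hyps([0.0])))                     # True  -> VCdim(thresholds) >= 1
print(shatters([0.0, 1.0], threshold_hyps([0.0, 1.0])))           # False -> labeling (+, -) unreachable
print(shatters([0.0, 1.0], interval_hyps([0.0, 1.0])))            # True  -> VCdim(intervals) >= 2
print(shatters([0.0, 1.0, 2.0], interval_hyps([0.0, 1.0, 2.0])))  # False -> labeling (+, -, +) unreachable
```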
Shattering, VC-dimension
E.g., H = linear separators in R²: VCdim(H) ≥ 3 (three points in general position, e.g., the vertices of a triangle, can be shattered).
Shattering, VC-dimension
E.g., H = linear separators in R²: VCdim(H) < 4. For any four points:
Case 1: one point lies inside the triangle formed by the others. We cannot label the inside point as positive and the outside points as negative.
Case 2: all four points are vertices of their convex hull. We cannot label two diagonally opposite points as positive and the other two as negative.
Fact: the VCdim of linear separators in R^d is d + 1.
Sample Complexity: Infinite Hypothesis Spaces, Realizable Case
Theorem: m ≥ (C/ε)(VCdim(H) log(1/ε) + log(1/δ)) labeled examples are sufficient (C an absolute constant) so that, with probability at least 1 − δ, every h ∈ H consistent with the sample has err_D(h) ≤ ε.
E.g., H = linear separators in R^d: VCdim(H) = d + 1, so the sample complexity is linear in d.
Interpretation: if we double the number of features, then we only need roughly twice the number of samples to do well.
Sample Complexity: Infinite Hypothesis Spaces
Theorem (agnostic case): m ≥ (C/ε²)(VCdim(H) + log(1/δ)) labeled examples are sufficient so that, with probability at least 1 − δ, for all h in H, |err_D(h) − err_S(h)| ≤ ε.
Statistical Learning Theory style: with probability at least 1 − δ, for all h in H,
err_D(h) ≤ err_S(h) + √((1/(2m))(VCdim(H) + ln(1/δ))).
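For a numerical feel, here is a small Python helper (illustrative only; it uses the exact form written in the Statistical Learning Theory style bound above, and constants/log factors differ across statements of the theorem), evaluated for linear separators, whose VC-dimension is d + 1.

```python
import math

def vc_deviation(m, vcdim, delta):
    # Deviation term from the SLT-style bound above:
    # err_D(h) <= err_S(h) + sqrt((VCdim(H) + ln(1/delta)) / (2m)), w.p. >= 1 - delta.
    return math.sqrt((vcdim + math.log(1.0 / delta)) / (2.0 * m))

# Linear separators in R^d have VCdim = d + 1: doubling d roughly doubles
# the sample size needed to keep the same deviation.
for d in (10, 20):
    for m in (1_000, 10_000):
        print(f"d={d:2d}, m={m:6d}: deviation <= {vc_deviation(m, d + 1, 0.05):.3f}")
```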
Can we use our bounds for model selection?
True Error, Training Error, Overfitting
Model selection: trade-off between decreasing training error and keeping H simple.
err_D(h) ≤ err_S(h) + O(√(VCdim(H)/m))
(generalization error ≤ training error + complexity term)
Structural Risk Minimization (SRM)
Nested hypothesis spaces H_1 ⊆ H_2 ⊆ H_3 ⊆ ... ⊆ H_i ⊆ ... (e.g., H_i = decision trees of depth i).
[Figure: error rate vs. hypothesis complexity — the empirical error keeps decreasing as complexity grows, while the true error eventually turns back up; the growing gap is overfitting.]
What happens if we increase m? The true-error curve stays close to the empirical-error curve for longer, and everything shifts to the right (overfitting sets in at higher complexity).
Structural Risk Minimization (SRM)
Nested hypothesis spaces H_1 ⊆ H_2 ⊆ H_3 ⊆ ... ⊆ H_i ⊆ ...
For each k, let h_k = argmin_{h ∈ H_k} err_S(h). As k increases, err_S(h_k) goes down but the complexity term goes up.
Pick k* = argmin_{k ≥ 1} { err_S(h_k) + complexity(H_k) } and output ĥ = h_{k*}.
Claim: w.h.p., err_D(ĥ) ≤ min_k [ min_{h ∈ H_k} err_D(h) + 2·complexity(H_k) ].
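Here is a minimal Python sketch of this recipe (not from the lecture), using scikit-learn decision trees of increasing depth as the nested classes H_k; the toy data and the particular penalty function are made up, standing in for the complexity(H_k) term in the bound.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy data: 2 features, noisy labels (made up for illustration).
X = rng.normal(size=(500, 2))
y = ((X[:, 0] + 0.3 * rng.normal(size=500)) > 0).astype(int)

m, delta = len(y), 0.05

def complexity_penalty(depth):
    # Stand-in for complexity(H_k): a depth-k tree has at most 2^k leaves,
    # so roughly 2^k "effective parameters"; plug that into a sqrt(./m)-style penalty.
    return np.sqrt((2 ** depth + np.log(1.0 / delta)) / (2.0 * m))

best_depth, best_score = None, np.inf
for depth in range(1, 11):                        # H_1 ⊆ H_2 ⊆ ... = trees of depth 1, 2, ...
    h_k = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    train_err = 1.0 - h_k.score(X, y)             # err_S(h_k): decreases with depth
    score = train_err + complexity_penalty(depth) # penalized objective from SRM
    if score < best_score:
        best_depth, best_score = depth, score

print("SRM picks depth:", best_depth)
```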
Techniques to Handle Overfitting
Structural Risk Minimization (SRM): minimize the generalization bound, ĥ = argmin_{k ≥ 1} { err_S(h_k) + complexity(H_k) }. Often computationally hard. A nice case where it is possible: M. Kearns and Y. Mansour, ICML '98, "A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization".
Regularization: e.g., SVM, regularized logistic regression, etc.; minimize expressions of the form err_S(h) + λ‖h‖². A general family closely related to SRM.
Cross validation: hold out part of the training data and use it as a proxy for the generalization error (see the sketch below).
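And here is a matching minimal hold-out sketch for the cross-validation bullet above (same made-up toy data; the split size is arbitrary), where the held-out error plays the role of the complexity penalty.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = ((X[:, 0] + 0.3 * rng.normal(size=500)) > 0).astype(int)

# Hold out part of the training data as a proxy for the generalization error.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_val_err = None, np.inf
for depth in range(1, 11):
    h = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_err = 1.0 - h.score(X_val, y_val)   # held-out error instead of a complexity penalty
    if val_err < best_val_err:
        best_depth, best_val_err = depth, val_err

print("hold-out validation picks depth:", best_depth)
```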
What you should know Notion of sample complexity. Understand meaning of PAC bounds. Shattering, VC dimension as measure of complexity, form of the VC bounds. Model Selection, Structural Risk Minimization.