Generalization, Overfitting, and Model Selection

Size: px

Start display at page:

Download "Generalization, Overfitting, and Model Selection"

Edgar Nichols
5 years ago
Views:

1 Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification MariaFlorina (Nina) Balcan 10/05/2016

2 Reminders Midterm Exam Mon, Oct. 10th Midterm Review Session Thu, Oct. 6th at 6:00pm Homework 4 Extension: due Fri (10/7) at 5:30pm Homework 4 Review Session Today (Wed), Oct. 5th at 6:00pm in NSH 3002 Please look for a (pending) announcement 2

3 Midterm Exam Inclass exam on Mon, Oct. 10 th 5 problems Format of questions: Multiple choice True / False (with justification) Very short derivations Short answers Interpreting figures No electronic devices You are allowed to bring one 8½ x 11 sheet of notes (front and back) 3

4 Midterm Exam How to Prepare Attend the midterm recitation session: Thu, Oct. 6 th at 6:00pm Review prior year s exams and solutions (we ll post them) Review this year s homework problems 4

5 Generalization, Overfitting, and Model Selection

6 Two Core Aspects of Machine Learning Algorithm Design. How to optimize? Automatically generate rules that do well on observed data. Confidence Bounds, Generalization Confidence for rule effectiveness on future data. Our focus so far has been on Algorithm Design. In this module, Generalization Guarantees (not overfiting) they apply to all algorithms we talk about throughout the course.

7 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle Labeled Examples (x 1,c*(x 1 )),, (x m,c*(x m )) Alg.outputs h : X! Y x 1 > 5 x 6 > c* : X! Y + +

8 PAC/SLT models for Supervised Learning X feature/instance space; distribution D over X e.g., X = R d or X = {0,1} d Algo sees training sample S: (x 1,c*(x 1 )),, (x m,c*(x m )), x i i.i.d. from D labeled examples drawn i.i.d. from D and labeled by target c * labels 2 {1,1} binary classification Algo does optimization over S, find hypothesis h. Goal: h has small error over D. err D h = Pr (h x x~ D c (x)) h c * Instance space X Bias: fix hypothesis space H [whose complexity is not too large] Realizable: c H. Agnostic: c close to H.

9 PAC/SLT models for Supervised Learning Algo sees training sample S: (x 1,c*(x 1 )),, (x m,c*(x m )), x i i.i.d. from D Does optimization over S, find hypothesis h H. Goal: h has small error over D. True error: err D h But, can only measure: = Pr (h x x~ D c (x)) How often h x c (x) over future instances drawn at random from D Training error: err S h = 1 m I h x i c x i i How often h x instances c (x) over training Sample complexity: bound err D h in terms of err S h

10 Sample Complexity: Finite Hypothesis Spaces Realizable Case So, if c H and can find consistent fns, then only need this many examples to get generalization error ε with prob. 1 δ Agnostic Case What if there is no perfect h?

11 What if H is infinite? E.g., linear separators in R d E.g., thresholds on the real line + + w + E.g., intervals on the real line + a b

12 Shattering, VCdimension Definition: VCdimension (VapnikChervonenkis dimension) The VCdimension of a hypothesis space H is the cardinality of the largest set S that can be shattered (labeled in all possible ways ) by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = To show that VCdimension is d: there exists a set of d points that can be shattered there is no set of d+1 points that can be shattered. Fact: If H is finite, then VCdim (H) log ( H ).

13 True complexity of a hypothesis class E.g., H= Thresholds on the real line w + Can label 4 points with thresholds only in 5 ways In general, can label m points with thresholds only in m m

14 Shattering, VCdimension If the VCdimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered. E.g., H= Thresholds on the real line VCdim H = 1 E.g., H= Intervals on the real line + VCdim H = 2 E.g., H= linear separators in R d VCdim H = d + 1 w +

15 Sample Complexity: Infinite Hypothesis Spaces Realizable Case E.g., H= linear separators in R d Sample complexity linear in d Interpretation: if double the number of features, thenwe only need roughly twice the number of samples to do well.

16 Sample Complexity: Infinite Hypothesis Spaces Theorem (agnostic case) m C ε 2 VCdim H + log 1 δ labeled examples are sufficient s.t. with probability at least 1 δ for all h in H err D h err S (h) ε Statistical Learning Theory Style With prob at least 1 δ for all h in H err D h err S h + 1 2m VCdim H + ln 1 δ.

17 Can we use our bounds for model selection?

18 True Error, Training Error, Overfitting Model selection: tradeoff between decreasing training error and keeping H simple. err D h err S h + VCdim(H) + m error train error generalization error complexity

19 Structural Risk Minimization (SRM) H 1 H 2 H 3 H i (E.g., H i = decision trees of depth i) error rate overfitting empirical error Hypothesis complexity

20 What happens if we increase m? Black curve will stay close to the red curve for longer, everything shift to the right

21 Structural Risk Minimization (SRM) H 1 H 2 H 3 H i error rate overfitting empirical error Hypothesis complexity

22 Structural Risk Minimization (SRM) H 1 H 2 H 3 H i h k = argmin h Hk {err S h } As k increases, err S h k goes down but complex. term goes up. k = argmin k 1 {err S h k + complexity(h k )} Output h = h k Claim: W.h.p., err D h min k min h H k err D h + 2complexity H k

23 Techniques to Handle Overfitting Structural Risk Minimization (SRM). Minimize gener. bound: Often computationally hard. Regularization: E.g., SVM, regularized logistic regression, etc. minimizes expressions of the form: err S h + λ h 2 Cross Validation: H 1 H 2 H i h = argmin k 1 {err S h k + complexity(h k )} Nice case where it is possible: M. Kearns, Y. Mansour, ICML 98, A Fast, BottomUp Decision Tree Pruning Algorithm with NearOptimal Generalization general family closely related to SRM Hold out part of the training data and use it as a proxy for the generalization error

24 What you should know The importance of sample complexity in Machine Learning. Understand meaning of PAC bounds (what PAC stands for, meaning of parameters ε and δ). Shattering, VC dimension as measure of complexity, form of the VC bounds. Model Selection, Structural Risk Minimization.

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How