Generalization, Overfitting, and Model Selection


Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016

Two Core Aspects of Machine Learning. Algorithm Design: how to optimize? Automatically generate rules that do well on observed data. E.g.: logistic regression, SVM, AdaBoost, etc. Confidence Bounds, Generalization: confidence that a rule will be effective on future data. Very well understood: Occam's bound, VC theory, etc. Our focus so far has been on Algorithm Design. In this module: Generalization Guarantees (not overfitting).

PAC/SLT models for Supervised Learning. [Diagram] A data source generates instances according to a distribution D on X; an expert/oracle labels them with the target c* : X → Y. The learning algorithm receives the labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)) and outputs a hypothesis h : X → Y (in the picture, a small decision tree testing x_1 > 5 and x_6 > 2 with leaves +1/−1).

PAC/SLT models for Supervised Learning. X: feature/instance space; distribution D over X, e.g., X = R^d or X = {0,1}^d. Algorithm sees training sample S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)), with the x_i drawn i.i.d. from D and labeled by the target c*; labels ∈ {−1, +1} (binary classification). Algorithm does optimization over S and finds a hypothesis h. Goal: h has small error over D: err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)]. [Figure: region of instance space X where h and c* disagree.] Bias: fix a hypothesis space H (whose complexity is not too large). Realizable case: c* ∈ H. Agnostic case: c* close to H.

PAC/SLT models for Supervised Learning. Algorithm sees training sample S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)), x_i i.i.d. from D. It does optimization over S and finds a hypothesis h ∈ H. Goal: h has small error over D. True error: err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)], i.e., how often h(x) ≠ c*(x) over future instances drawn at random from D. But we can only measure the training error: err_S(h) = (1/m) Σ_i I[h(x_i) ≠ c*(x_i)], i.e., how often h(x) ≠ c*(x) over the training instances. Sample complexity: bound err_D(h) in terms of err_S(h).
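To make the distinction concrete, here is a minimal Python sketch (not from the slides; the data distribution, noise rate, and helper names are assumptions for illustration). It fits a threshold rule by minimizing training error on a small sample S and then estimates its true error on a large fresh sample from the same distribution D.

# Illustrative sketch (assumed setup): training vs. true error for a 1-D threshold rule.
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Assumed distribution D: x uniform on [0, 1]; target c*(x) = +1 iff x > 0.5, plus 10% label noise.
    x = rng.uniform(0, 1, m)
    y = np.where(x > 0.5, 1, -1)
    flip = rng.uniform(0, 1, m) < 0.10
    return x, np.where(flip, -y, y)

def erm_threshold(x, y):
    # H = thresholds on the real line; pick the threshold with the smallest training error.
    candidates = np.concatenate(([0.0], np.sort(x), [1.0]))
    errs = [np.mean(np.where(x > t, 1, -1) != y) for t in candidates]
    return candidates[int(np.argmin(errs))]

x_train, y_train = sample(30)            # small training sample S
t = erm_threshold(x_train, y_train)
err_S = np.mean(np.where(x_train > t, 1, -1) != y_train)

x_test, y_test = sample(100_000)         # large fresh sample approximates err_D
err_D = np.mean(np.where(x_test > t, 1, -1) != y_test)

print(f"threshold={t:.3f}  err_S={err_S:.3f}  err_D~{err_D:.3f}")

On typical runs err_S(h) is noticeably smaller than err_D(h), because the threshold was tuned to S; the bounds below quantify how large that gap can be.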

Sample Complexity for Supervised Learning. Consistent Learner. Input: S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)). Output: find h in H consistent with the sample (if one exists). Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples are sufficient so that, with probability at least 1 − δ over the draw of the m training examples, every h ∈ H consistent with S has err_D(h) ≤ ε. The bound is only logarithmic in |H| and linear in 1/ε. So, if c* ∈ H and we can find consistent functions, then we only need this many examples to get generalization error ≤ ε with probability ≥ 1 − δ. PAC bound: with high probability (at least 1 − δ) we are approximately correct (generalization error ≤ ε).
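A small sketch of how this bound is used in practice (illustrative; the hypothesis class and its size in the example are assumptions): plug |H|, ε, and δ into m ≥ (1/ε)(ln|H| + ln(1/δ)) to get a sufficient number of examples.

# Illustrative sketch: realizable-case sample-size bound m >= (1/eps) * (ln|H| + ln(1/delta)).
import math

def realizable_sample_size(h_size, eps, delta):
    # Examples sufficient so every h in H consistent with S has err_D(h) <= eps, w.p. >= 1 - delta.
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example (assumed class): conjunctions over 20 boolean features, |H| = 3^20.
print(realizable_sample_size(3 ** 20, eps=0.05, delta=0.01))   # grows like log|H| / eps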

What if c* ∉ H?

Uniform Convergence. The basic result above only bounds the chance that a bad hypothesis looks perfect on the data. What if there is no perfect h ∈ H (agnostic case)? What can we say if c* ∉ H? Can we say that w.h.p. all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε? This is called uniform convergence. It motivates optimizing over S, even if we can't find a perfect function.

Sample Complexity: Uniform Convergence, Agnostic Case. Empirical Risk Minimization (ERM). Input: S: (x_1, c*(x_1)), ..., (x_m, c*(x_m)). Output: find h in H with smallest err_S(h). With m ≥ (1/(2ε²))(ln|H| + ln(2/δ)) labeled examples, with probability at least 1 − δ all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε. Note the 1/ε² dependence [as opposed to 1/ε for the realizable case].
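A short sketch contrasting the two regimes (illustrative; the class size and constants are assumptions): the agnostic bound grows like 1/ε², so halving ε roughly quadruples the number of examples, versus roughly doubling it in the realizable case.

# Illustrative sketch: agnostic bound m >= (1/(2 eps^2)) * (ln|H| + ln(2/delta)) vs. the realizable bound.
import math

def realizable_sample_size(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def agnostic_sample_size(h_size, eps, delta):
    return math.ceil((math.log(h_size) + math.log(2.0 / delta)) / (2 * eps ** 2))

for eps in (0.1, 0.05, 0.01):
    print(eps, realizable_sample_size(3 ** 20, eps, 0.01), agnostic_sample_size(3 ** 20, eps, 0.01))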

Sample Complexity: Finite Hypothesis Spaces, Agnostic Case. Important conclusion: w.h.p. ≥ 1 − δ, err_D(ĥ) ≤ err_D(h*) + 2ε, where ĥ is the ERM output and h* is the hypothesis in H of smallest true error rate. Reason: err_D(ĥ) ≤ err_S(ĥ) + ε (uniform convergence), err_S(ĥ) ≤ err_S(h*) (ĥ minimizes training error), and err_S(h*) ≤ err_D(h*) + ε (uniform convergence again).

Sample Complexity: Finite Hypothesis Spaces. Realizable case: some h ∈ H is perfect on the data, and the 1/ε bound above applies. Agnostic case: what if there is no perfect h? Then the 1/ε² uniform-convergence bound applies. To prove bounds like this, we need some good tail inequalities (e.g., Hoeffding-style bounds).
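As a concrete example of such a tail inequality, Hoeffding's bound says Pr[|p̂ − p| ≥ ε] ≤ 2 exp(−2mε²) for the average p̂ of m i.i.d. {0,1} samples with mean p. The sketch below (illustrative; the parameters are arbitrary choices) checks it by simulation.

# Illustrative sketch: Hoeffding's inequality Pr[|p_hat - p| >= eps] <= 2 exp(-2 m eps^2), checked by simulation.
import math
import numpy as np

rng = np.random.default_rng(0)
p, m, eps, trials = 0.3, 200, 0.05, 20_000

p_hat = rng.binomial(m, p, size=trials) / m        # empirical estimates from m samples each
observed = np.mean(np.abs(p_hat - p) >= eps)       # observed frequency of a deviation >= eps
bound = 2 * math.exp(-2 * m * eps ** 2)            # Hoeffding upper bound on that frequency
print(f"observed {observed:.3f} <= bound {bound:.3f}")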

What if H is infinite? E.g., linear separators in R^d. E.g., thresholds on the real line (label +1 on one side of a threshold w, −1 on the other). E.g., intervals on the real line (label +1 inside [a, b], −1 outside).

Definition: Shattering, VC-dimension. H[S] is the set of splittings of dataset S using concepts from H. H shatters S if |H[S]| = 2^|S|. A set of points S is shattered by H if there are hypotheses in H that split S in all of the 2^|S| possible ways; i.e., all possible ways of classifying the points in S are achievable using concepts in H. Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.

Shattering, VC-dimension. Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞. To show that the VC-dimension is d: (1) exhibit a set of d points that can be shattered, and (2) show there is no set of d+1 points that can be shattered. Fact: if H is finite, then VCdim(H) ≤ log₂(|H|).

Shattering, VC-dimension. If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered. E.g., H = thresholds on the real line: VCdim(H) = 1. Any single point can be shattered, but for any two points only three of the four labelings are achievable. E.g., H = intervals on the real line: VCdim(H) = 2. Any two points can be shattered, but for any three points the labeling +, −, + cannot be realized by a single interval.
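These small cases can be checked mechanically. The sketch below (illustrative; the helper names and test points are assumptions) enumerates all labelings each class can produce on a candidate point set and tests whether all 2^|S| labelings appear.

# Illustrative sketch: brute-force check of shattering for simple 1-D hypothesis classes.
def labelings_thresholds(points):
    # H = thresholds: label +1 iff x > w. Collect all labelings achievable on `points`.
    ws = [min(points) - 1] + sorted(points)
    return {tuple(1 if x > w else -1 for x in points) for w in ws}

def labelings_intervals(points):
    # H = intervals [a, b]: label +1 iff a <= x <= b.
    cands = sorted(points)
    labs = {tuple(-1 for _ in points)}  # empty interval
    for a in cands:
        for b in cands:
            labs.add(tuple(1 if a <= x <= b else -1 for x in points))
    return labs

def shatters(labeling_fn, points):
    return len(labeling_fn(points)) == 2 ** len(points)

print(shatters(labelings_thresholds, [0.3]))           # True  -> VCdim >= 1
print(shatters(labelings_thresholds, [0.3, 0.7]))      # False -> (+, -) not achievable
print(shatters(labelings_intervals, [0.2, 0.8]))       # True  -> VCdim >= 2
print(shatters(labelings_intervals, [0.2, 0.5, 0.8]))  # False -> (+, -, +) not achievable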

Shattering, VC-dimension. If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered. E.g., H = union of k intervals on the real line: VCdim(H) = 2k. VCdim(H) ≥ 2k: a sample of size 2k can be shattered (treat each consecutive pair of points as a separate interval). VCdim(H) < 2k + 1: on any 2k + 1 points, the alternating labeling +, −, +, −, ..., + has k + 1 separate positive regions, which would require k + 1 intervals.

Shattering, VC-dimension. E.g., H = linear separators in R^2: VCdim(H) ≥ 3, since three non-collinear points can be split in all 2^3 ways by a line.

Shattering, VC-dimension. E.g., H = linear separators in R^2: VCdim(H) < 4. For any four points: Case 1: one point is inside the triangle formed by the others; we cannot label the inside point as positive and the outside points as negative. Case 2: all points are on the boundary of their convex hull; we cannot label two diagonally opposite points as positive and the other two as negative. Fact: the VCdim of linear separators in R^d is d + 1.

Sample Complexity: Infinite Hypothesis Spaces, Realizable Case. E.g., H = linear separators in R^d (VCdim(H) = d + 1): the sample complexity is linear in d. Interpretation: if we double the number of features, then we only need roughly twice the number of samples to do well.

Sample Complexity: Infinite Hypothesis Spaces. Theorem (agnostic case): m ≥ (C/ε²)(VCdim(H) + log(1/δ)) labeled examples are sufficient so that, with probability at least 1 − δ, for all h in H: |err_D(h) − err_S(h)| ≤ ε. Statistical Learning Theory style: with probability at least 1 − δ, for all h in H: err_D(h) ≤ err_S(h) + sqrt( (1/(2m)) (VCdim(H) + ln(1/δ)) ).
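A quick sketch of how such a bound behaves numerically (illustrative; the example numbers are assumptions, and sharper statements of the theorem carry extra constant and logarithmic factors):

# Illustrative sketch: evaluating the statistical-learning-theory style bound from the slide,
# err_D(h) <= err_S(h) + sqrt( (VCdim(H) + ln(1/delta)) / (2m) ).
import math

def vc_generalization_bound(train_err, vcdim, m, delta):
    return train_err + math.sqrt((vcdim + math.log(1.0 / delta)) / (2 * m))

# Linear separators in R^d have VCdim = d + 1; here d = 100 is an assumed example.
for m in (1_000, 10_000, 100_000):
    print(m, round(vc_generalization_bound(train_err=0.05, vcdim=101, m=m, delta=0.05), 3))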

Can we use our bounds for model selection?

True Error, Training Error, Overfitting. Model selection: trade-off between decreasing training error and keeping H simple. err_D(h) ≤ err_S(h) + sqrt( (1/(2m)) (VCdim(H) + ln(1/δ)) ): generalization error ≤ train error + complexity term. [Figure: error vs. hypothesis complexity; the train error falls as complexity grows, while the complexity term, and eventually the generalization error, rise.]

Structural Risk Minimization (SRM). Use a nested sequence of hypothesis classes H_1 ⊆ H_2 ⊆ H_3 ⊆ ... ⊆ H_i ⊆ ... (e.g., H_i = decision trees of depth i). [Figure: error rate vs. hypothesis complexity; the empirical error decreases with complexity, while past some point the true error rises again (overfitting).]

What happens if we increase m? The true-error curve stays close to the empirical-error curve for longer, and everything shifts to the right: overfitting sets in only at higher complexity.

Structural Risk Minimization (SRM). Nested classes H_1 ⊆ H_2 ⊆ H_3 ⊆ ... ⊆ H_i ⊆ .... For each k, let h_k = argmin_{h ∈ H_k} {err_S(h)}. As k increases, err_S(h_k) goes down but the complexity term goes up. Choose k̂ = argmin_{k ≥ 1} {err_S(h_k) + complexity(H_k)} and output ĥ = h_k̂. Claim: w.h.p., err_D(ĥ) ≤ min_k [ min_{h ∈ H_k} err_D(h) + 2·complexity(H_k) ].
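A minimal sketch of the SRM selection rule (illustrative only; the per-class training errors, VC dimensions, and the VC-style penalty below are assumed numbers, not results from the slides):

# Illustrative sketch of SRM: pick k minimizing err_S(h_k) + complexity(H_k).
import math

def srm_select(train_errs, vcdims, m, delta=0.05):
    # VC-style complexity penalty, matching the bound used earlier in the slides.
    def complexity(d):
        return math.sqrt((d + math.log(1.0 / delta)) / (2 * m))
    scores = [err + complexity(d) for err, d in zip(train_errs, vcdims)]
    k_hat = min(range(len(scores)), key=scores.__getitem__)
    return k_hat, scores[k_hat]

# Hypothetical ERM results for nested classes H_1, H_2, ... (e.g., deeper and deeper trees):
train_errs = [0.30, 0.18, 0.12, 0.09, 0.08, 0.08]   # err_S(h_k): keeps decreasing
vcdims     = [3,    7,    15,   31,   63,   127]    # complexity of H_k: keeps increasing
print(srm_select(train_errs, vcdims, m=2000))       # the penalized score is minimized at an intermediate k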

Techniques to Handle Overfitting. Structural Risk Minimization (SRM): minimize the generalization bound, ĥ = argmin_{k ≥ 1} {err_S(h_k) + complexity(H_k)} over nested classes H_1 ⊆ H_2 ⊆ ... ⊆ H_i ⊆ .... Often computationally hard. A nice case where it is possible: M. Kearns, Y. Mansour, ICML '98, "A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization". Regularization: a general family closely related to SRM; e.g., SVM, regularized logistic regression, etc. minimize expressions of the form err_S(h) + λ‖h‖². Cross validation: hold out part of the training data and use it as a proxy for the generalization error.
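A small sketch of hold-out validation used for model selection (illustrative; the data distribution, split sizes, and the two candidate hypothesis classes are assumptions): fit each class by ERM on one part of the data and compare held-out errors on the rest.

# Illustrative sketch: hold-out validation to choose between two 1-D hypothesis classes.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = np.where((x > 0.3) & (x < 0.7), 1, -1)              # assumed target: an interval
y *= np.where(rng.uniform(0, 1, 300) < 0.1, -1, 1)      # 10% label noise

x_fit, y_fit, x_val, y_val = x[:200], y[:200], x[200:], y[200:]  # hold out 100 points

def erm_threshold(xs, ys):
    cands = np.sort(xs)
    best = min(cands, key=lambda t: np.mean(np.where(xs > t, 1, -1) != ys))
    return lambda z: np.where(z > best, 1, -1)

def erm_interval(xs, ys):
    cands = np.sort(xs)
    pairs = [(a, b) for a in cands for b in cands if a <= b]
    best = min(pairs, key=lambda ab: np.mean(np.where((xs >= ab[0]) & (xs <= ab[1]), 1, -1) != ys))
    return lambda z: np.where((z >= best[0]) & (z <= best[1]), 1, -1)

for name, learner in [("threshold", erm_threshold), ("interval", erm_interval)]:
    h = learner(x_fit, y_fit)
    print(name, "held-out error:", round(float(np.mean(h(x_val) != y_val)), 3))
# Pick the class with the smaller held-out error (here the interval class should win).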

What you should know Notion of sample complexity. Understand meaning of PAC bounds. Shattering, VC dimension as measure of complexity, form of the VC bounds. Model Selection, Structural Risk Minimization.