PAC-learning, VC Dimension and Margin-based Bounds


More details:
General: http://www.learning-with-kernels.org/
Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

PAC-learning, VC Dimension and Margin-based Bounds. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University. February 28th, 2005

Review: Generalization error in finite hypothesis spaces [Haussler 88]. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, the bound below holds for any learned hypothesis h that is consistent on the training data. Even if h makes zero errors on the training data, it may still make errors on the test set.
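The bound itself is an image on the original slide. The standard statement of the Haussler 88 result, reconstructed here rather than quoted from the slide, is:

\Pr\big[\exists\, h \in H:\ \mathrm{error}_{\mathrm{train}}(h) = 0 \ \wedge\ \mathrm{error}_{\mathrm{true}}(h) > \epsilon\big] \;\le\; |H|\, e^{-m\epsilon}

Setting the right-hand side to δ and solving for m gives the usual sample-size requirement m ≥ (ln|H| + ln(1/δ)) / ε.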

Using a PAC bound. Typically there are 2 use cases: (1) pick ε and δ, the bound gives you m; (2) pick m and δ, the bound gives you ε.
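As a concrete illustration of the two use cases (a minimal Python sketch of ours, not from the slides; the function names are made up), inverting the Haussler-style bound |H| e^{-mε} ≤ δ gives m ≥ (ln|H| + ln(1/δ))/ε for use case 1 and ε ≥ (ln|H| + ln(1/δ))/m for use case 2:

import math

def sample_size(eps, delta, H_size):
    """Use case 1: pick epsilon and delta, get the number of samples m
    that makes |H| * exp(-m * eps) <= delta."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

def error_bound(m, delta, H_size):
    """Use case 2: pick m and delta, get the epsilon guaranteed
    with probability at least 1 - delta."""
    return (math.log(H_size) + math.log(1.0 / delta)) / m

# Example: 10^6 hypotheses, want error <= 0.05 with probability >= 0.95
print(sample_size(eps=0.05, delta=0.05, H_size=1e6))   # 337 samples
print(error_bound(m=1000, delta=0.05, H_size=1e6))     # about 0.017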

Limitations of the Haussler 88 bound: it only applies to a consistent classifier, and it depends on the size of the hypothesis space.

What if our classifier does not have zero error on the training data? A learner with zero training error may make mistakes on the test set; a learner with training error error_D(h) may make even more mistakes on the test set.

Simpler question: What's the expected error of a hypothesis? Estimating the error of a hypothesis is like estimating the parameter of a coin! Chernoff bound: for m i.i.d. coin flips x_1, …, x_m, where x_i ∈ {0,1}, and 0 < ε < 1:
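The inequality itself is an image on the slide. The standard one-sided Hoeffding/Chernoff form, reconstructed here, bounds how far the empirical mean can fall below the true bias θ of the coin:

\Pr\Big[\theta - \tfrac{1}{m}\textstyle\sum_{i=1}^{m} x_i > \epsilon\Big] \;\le\; e^{-2 m \epsilon^2}

Reading error_true(h) as θ and the per-sample mistakes as the coin flips turns this into a bound on the error of a single, fixed hypothesis.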

Using Chernoff bound to estimate error of a single hypothesis

But we are comparing many hypotheses: Union bound. For each hypothesis h_i, the Chernoff bound above applies. What if I am comparing two hypotheses, h_1 and h_2?
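The formulas are images on the slide. The standard union-bound argument, reconstructed here, is that Pr[A_1 ∪ … ∪ A_k] ≤ Σ_i Pr[A_i] for any events; applying the Chernoff bound to each hypothesis and summing over all of H gives

\Pr\big[\exists\, h \in H:\ \mathrm{error}_{\mathrm{true}}(h) - \mathrm{error}_{\mathrm{train}}(h) > \epsilon\big] \;\le\; |H|\, e^{-2 m \epsilon^2}.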

Generalization bound for |H| hypotheses. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, the union-bound result above holds for any learned hypothesis h.

PAC bound and Bias-Variance tradeoff: after moving some terms around, with probability at least 1−δ the bound below holds. Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!
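The rearranged bound is an image on the slide; setting |H| e^{-2mε²} = δ and solving for ε gives the standard form: with probability at least 1 − δ,

\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2m}}.

The first term behaves like bias (a richer H fits the training data better) and the second like variance (a richer H and a smaller m make the bound looser).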

What about the size of the hypothesis space? How large is the hypothesis space?

Boolean formulas with n binary features
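The count is an image on the slide; the standard count, reconstructed here, is that with n binary features there are 2^n possible inputs, and a Boolean formula can label each of them either way, so

|H| = 2^{2^n}, \qquad \ln|H| = 2^n \ln 2,

which makes the sample-size requirement from the PAC bound exponential in n.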

Number of decision trees of depth k. Recursive solution: given n attributes, let H_k = number of decision trees of depth k. H_0 = 2. H_{k+1} = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n × H_k × H_k. Write L_k = log_2 H_k. Then L_0 = 1 and L_{k+1} = log_2 n + 2 L_k, so L_k = (2^k − 1)(1 + log_2 n) + 1.
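A quick Python sketch (ours, not part of the slides) that checks the closed form L_k = (2^k − 1)(1 + log_2 n) + 1 against the recursion above:

import math

def L_recursive(k, n):
    # L_k = log2(number of depth-k decision trees over n attributes),
    # computed via the recursion L_0 = 1, L_{k+1} = log2(n) + 2 * L_k.
    L = 1.0
    for _ in range(k):
        L = math.log2(n) + 2 * L
    return L

def L_closed_form(k, n):
    return (2**k - 1) * (1 + math.log2(n)) + 1

for k in range(6):
    assert math.isclose(L_recursive(k, n=10), L_closed_form(k, n=10))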

PAC bound for decision trees of depth k: Bad!!! The number of points needed is exponential in the depth! But for m data points a decision tree can't get too big: the number of leaves is never more than the number of data points.

Number of decision trees with k leaves. H_k = number of decision trees with k leaves; H_0 = 2. Loose bound: see the sketch below. Reminder:
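The slide's loose bound is an image. One loose bound of the same flavor (our reconstruction, so the constants may differ from the slide's): a tree with k leaves has k − 1 internal nodes, at most 4^{k−1} binary tree shapes (Catalan bound), n^{k−1} choices of internal-node attributes, and 2^k leaf labelings, so

H_k \;\le\; 4^{\,k-1}\, n^{\,k-1}\, 2^{k}, \qquad \log_2 H_k = O(k \log_2 n).

The key point, matching the slide's conclusion, is that log_2 H_k grows only linearly in the number of leaves k.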

PAC bound for decision trees with k leaves. Bias-Variance revisited.
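The bound on the slide is an image; plugging log_2 H_k = O(k log_2 n) into the consistent-learner requirement m ≥ (ln H_k + ln(1/δ))/ε gives, up to constants,

m \;=\; O\!\left(\frac{k \log_2 n + \ln\frac{1}{\delta}}{\epsilon}\right),

which is linear rather than exponential in the size of the tree, and grows as we allow more leaves (less bias, more variance).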

What did we learn from decision trees? The Bias-Variance tradeoff formalized. Moral of the story: the complexity of learning is measured not in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification. Complexity ≈ m: no bias, lots of variance. Complexity lower than m: some bias, less variance.

What about continuous hypothesis spaces? Continuous hypothesis space: |H| = ∞. Infinite variance??? As with decision trees, we only care about the maximum number of points that can be classified exactly!

How many points can a linear boundary classify exactly? (1-D)

How many points can a linear boundary classify exactly? (2-D)

How many points can a linear boundary classify exactly? (d-D)
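The worked examples are figures on the slides; the standard answers, consistent with the VC(H) = d + 1 statement below, are: in 1-D a linear boundary (threshold) can realize every labeling of 2 points but not of 3; in 2-D a line shatters 3 points in general position but not 4 (the XOR-style labeling fails); in d dimensions a hyperplane shatters d + 1 points but never d + 2.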

PAC bound using VC dimension. The number of training points that can be classified exactly is the VC dimension!!! It measures the relevant size of the hypothesis space, as with decision trees with k leaves.
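The bound is an image on the slide; the standard VC generalization bound, reconstructed here with the usual textbook constants, is: with probability at least 1 − δ, for all h ∈ H,

\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}.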

Shattering a set of points: a hypothesis space H shatters a set of points if, for every possible labeling of those points, some h ∈ H classifies all of them correctly.

VC dimension: the VC dimension of H is the size of the largest set of points that H can shatter.

Examples of VC dimension. Linear classifiers: VC(H) = d+1, for d features plus the constant term b. Neural networks: VC(H) = #parameters; local minima mean NNs will probably not find the best parameters. 1-Nearest neighbor?

PAC bound for SVMs: SVMs use a linear classifier, so for d features VC(H) = d+1 and the VC bound above applies.

VC dimension and SVMs: Problems!!! It doesn't take the margin into account. What about kernels? Polynomials: the number of features grows really fast = bad bound (n input features, degree-p polynomial). Gaussian kernels can classify any set of points exactly.
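To make "grows really fast" concrete (our note, not from the slide): the number of monomial features of degree at most p over n inputs is \binom{n+p}{p}, which grows roughly like n^p, so a VC(H) = d + 1 bound over the expanded feature space becomes useless for even moderate p.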

Margin-based VC dimension. H: class of linear classifiers w·Φ(x) (with b = 0). Canonical form: min_j |w·Φ(x_j)| = 1. VC(H) ≤ R² w·w, which doesn't depend on the number of features!!! Here R² = max_j Φ(x_j)·Φ(x_j) is the magnitude of the data. R² is bounded even for Gaussian kernels, so the VC dimension is bounded. Large margin means low w·w, which means low VC dimension. Very cool!

Applying margin VC to SVMs? VC(H) ≤ R² w·w, where R² = max_j Φ(x_j)·Φ(x_j) is the magnitude of the data and doesn't depend on the choice of w. SVMs minimize w·w, so do SVMs minimize the VC dimension to get the best bound? Not quite right: the bound assumes the VC dimension is chosen before looking at the data, which would require a union bound over an infinite number of possible VC dimensions. But it can be fixed!

Structural risk minimization theorem. For a family of hyperplanes with margin γ > 0 and w·w ≤ 1, the bound below holds. SVMs maximize the margin γ while trading it off against the hinge loss, optimizing the tradeoff between training error (bias) and margin γ (variance).
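The theorem's bound is an image on the slide. A typical margin-based bound of this type (our reconstruction; the exact constants and logarithmic factors vary between statements) says: with probability at least 1 − δ,

\mathrm{error}_{\mathrm{true}}(h) \;\le\; \mathrm{error}_{\mathrm{train}}(h) + O\!\left(\sqrt{\frac{(R^2/\gamma^2)\,\log^2 m + \log\frac{1}{\delta}}{m}}\right),

so a larger margin γ (relative to the data radius R) gives a tighter guarantee, independent of the dimension of the feature space.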

Reality check: bounds are loose. [Plot on slide: the bound on ε as a function of m (in units of 10^5), for d = 2, 20, 200, 2000.] The bound can be very loose; why should you care? There are tighter, albeit more complicated, bounds. Bounds give us formal guarantees that empirical studies can't provide, and they give us intuition about the complexity of problems and the convergence rate of algorithms.

What you need to know. Finite hypothesis spaces: derive the results; counting the number of hypotheses; mistakes on training data. The complexity of the classifier depends on the number of points that can be classified exactly: finite case, decision trees; infinite case, VC dimension. The Bias-Variance tradeoff in learning theory. Margin-based bound for SVMs. Remember: will your algorithm find the best classifier?