CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims


Inductive learning. Simplest form: learn a function from examples. f is the target function; an example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set of examples. (This is a highly simplified model of real learning: it ignores prior knowledge and assumes examples are given.)

Inductive learning method Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:

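As a concrete sketch of the curve-fitting example, the following fits polynomials of increasing degree to noisy samples of an assumed target function and compares training and test error; the target, noise level, sample sizes, and degrees are illustrative choices, not from the lecture.

```python
import numpy as np

# Fit polynomials of increasing degree to noisy samples of an assumed target
# f(x) = sin(2*pi*x); higher-degree fits agree with the training set more closely
# but can generalize worse.
rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 10)
y_train = f(x_train) + rng.normal(0, 0.1, x_train.size)
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(0, 0.1, x_test.size)

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```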

Ockham's razor: prefer the simplest hypothesis consistent with the data. Why? A simple hypothesis is unlikely to fit the data merely "by chance", so if it does fit, it is more likely to generalize well.

Ockham's razor: why? If we have two hypotheses with equally small training error, how can we pick the right one? If we pick the wrong one, with enough data we will eventually find out. The amount of data we need (to be sure we pick the right hypothesis) depends on the complexity of the hypothesis class: there are more ways to "accidentally" fit the training data if we have a very flexible hypothesis class. If we want to avoid the possibility of overfitting, we should restrict the complexity of the hypothesis class, or use a larger training set. (Of course, an overly simple hypothesis class may underfit.)

Empirical risk minimization. The ERM principle says to pick the hypothesis with the lowest error on the training set. When does low training error guarantee low generalization error? Equivalently, given an observed error rate on a finite sample, can we bound the expected error rate on future data? (This is the subject of empirical process theory.) The bound will depend on the size of the hypothesis class: a very complex hypothesis class can fit almost anything, so low training error alone does not imply low generalization error.
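A minimal sketch of ERM over a finite hypothesis class; the class (1-D threshold classifiers), the data-generating threshold, and the noise level are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# ERM sketch: H = {h_t(x) = 1[x >= t]} for a finite grid of thresholds t.
# Pick the hypothesis with the lowest training error.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = (x >= 0.6).astype(int)                                 # assumed target: threshold at 0.6
y = np.where(rng.uniform(size=y.size) < 0.1, 1 - y, y)     # 10% label noise

thresholds = np.linspace(0, 1, 101)                        # finite hypothesis class H
train_errors = [np.mean((x >= t).astype(int) != y) for t in thresholds]
h_erm = thresholds[int(np.argmin(train_errors))]
print(f"ERM threshold: {h_erm:.2f}, training error: {min(train_errors):.3f}")
```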

Statistical learning theory. SLT is concerned with establishing conditions under which we can say Pr[err_P(h) ≤ ε] ≥ 1 − δ, i.e., that h is probably (with probability at least 1 − δ) approximately (to within error ε) correct. This statement holds if the hypothesis class is sufficiently constrained and/or the training set size is sufficiently large.

Hoeffding/Chernoff bound. Imagine estimating the probability of heads θ from m iid coin tosses x_1, ..., x_m, with x_i ∈ {0,1}. Letting θ̂ = (1/m) Σ_i x_i, the probability of making an error of size ε is bounded by P(|θ̂ − θ| > ε) ≤ 2 exp(−2mε²).
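A quick empirical check of this bound by simulation (θ, m, ε, and the number of trials are arbitrary illustrative choices).

```python
import numpy as np

# Empirically check P(|theta_hat - theta| > eps) <= 2*exp(-2*m*eps^2) for coin tosses.
rng = np.random.default_rng(2)
theta, m, eps, trials = 0.5, 100, 0.1, 100_000

tosses = rng.random((trials, m)) < theta        # m iid Bernoulli(theta) tosses per trial
theta_hat = tosses.mean(axis=1)                 # empirical estimate per trial
empirical = np.mean(np.abs(theta_hat - theta) > eps)
bound = 2 * np.exp(-2 * m * eps ** 2)
print(f"empirical deviation prob: {empirical:.4f}, Hoeffding bound: {bound:.4f}")
```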

Training vs generalization error. Let S be the training set and h = A(S) be the hypothesis learned by algorithm A on S. Let err_S(h) be the error of h on the sample S, and err_P(h) be the true expected error under the distribution P. We want to be sure that low training error gives rise to low generalization error. We use the union and Chernoff bounds.

Bounds on err_train − err_true. Applying the union bound over H together with the Chernoff/Hoeffding bound gives P(∃ h ∈ H : |err_S(h) − err_P(h)| > ε) ≤ 2|H| exp(−2mε²). Hence w.p. 1 − δ, for every h ∈ H, |err_S(h) − err_P(h)| ≤ sqrt((ln|H| + ln(2/δ)) / (2m)). For infinite classes, the |H| term is replaced by the growth function (the number of distinct labelings H can realize on a sample).
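A small simulation comparing the worst-case empirical deviation over a finite class of threshold classifiers against this bound; the data distribution, noise level, and class are illustrative assumptions.

```python
import numpy as np

# Compare max_h |err_S(h) - err_P(h)| over a finite class of thresholds
# against the bound sqrt((ln|H| + ln(2/delta)) / (2m)).
rng = np.random.default_rng(5)
m, delta = 500, 0.05
thresholds = np.linspace(0, 1, 101)                         # finite H

# Assumed distribution: x ~ U(0,1), y = 1[x >= 0.5] with labels flipped w.p. 0.1.
x = rng.uniform(0, 1, m)
y = (x >= 0.5).astype(int)
y = np.where(rng.uniform(size=m) < 0.1, 1 - y, y)

def true_error(t):
    # err_P of h_t(x) = 1[x >= t] under the assumed distribution:
    # 0.1 where h_t agrees with the noiseless target, 0.9 where it disagrees.
    return 0.1 + 0.8 * abs(t - 0.5)

emp_err = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
true_err = np.array([true_error(t) for t in thresholds])
max_dev = np.max(np.abs(emp_err - true_err))
bound = np.sqrt((np.log(len(thresholds)) + np.log(2 / delta)) / (2 * m))
print(f"max deviation over H: {max_dev:.3f}, uniform bound: {bound:.3f}")
```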

Sample complexity. To ensure that |err_S(h) − err_P(h)| ≤ ε simultaneously for all h ∈ H w.p. at least 1 − δ, we need this many samples: m ≥ (1/(2ε²)) (ln|H| + ln(2/δ)).
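A small helper that evaluates this sample-complexity expression; the numbers plugged in are arbitrary.

```python
import math

def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """m >= (ln|H| + ln(2/delta)) / (2*eps^2) ensures |err_S - err_P| <= eps
    for all h in H simultaneously, with probability at least 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(2 / delta)) / (2 * eps ** 2))

# Example: |H| = 10^6 hypotheses, eps = 0.05, delta = 0.01
print(sample_complexity(10**6, 0.05, 0.01))   # ~3,800 samples for these settings
```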

Finite H, zero training error. Suppose H is finite, and there exists an h with zero training error ("truth is in the hypothesis space"). We showed last time that the probability that there exists an h ∈ H with true error rate above ε but zero training error (i.e., h is consistent) is bounded by |H| exp(−εm), which is tighter than the 2|H| exp(−2mε²) bound above whenever ε < 1/2: it gives sample complexity scaling as 1/ε rather than 1/ε². Tighter bounds mean lower sample complexity.

PAC bounds for finite H, zero training error. Partition H into H_ε, an ε-"ball" around f_true (the hypotheses with true error at most ε), and H_bad = H \ H_ε. What is the probability that a "seriously wrong" hypothesis h_b ∈ H_bad is consistent with m examples (so we are fooled)? A single h_b with true error above ε is consistent with one example with probability at most 1 − ε, hence with all m independent examples with probability at most (1 − ε)^m. We can use a union bound: the probability of finding such an h_b is bounded by |H_bad| (1 − ε)^m ≤ |H| exp(−εm).
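A numeric sanity check of this union bound by simulation; the number of bad hypotheses, ε, m, and the number of trials are illustrative, and each bad hypothesis is modeled as erring independently with probability exactly ε (the boundary case).

```python
import numpy as np

# Probability that some "bad" hypothesis (true error >= eps) is consistent with
# all m examples: union bound gives |H_bad|*(1-eps)^m <= |H|*exp(-eps*m).
rng = np.random.default_rng(3)
eps, m, n_bad, trials = 0.1, 50, 100, 2_000

# Each bad hypothesis errs on each example independently with probability eps.
errors = rng.random((trials, n_bad, m)) < eps
consistent = (~errors).all(axis=2)              # h_b makes no mistakes on the sample
fooled = consistent.any(axis=1)                 # at least one bad h_b looks perfect
print(f"empirical fooling probability: {fooled.mean():.3f}")
print(f"union bound: {min(1.0, n_bad * (1 - eps) ** m):.3f}")
```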

What if H is infinite? The union bound no longer works. Also, many hypotheses may be very similar (e.g., rectangles of slightly different sizes). Roughly speaking, we replace log|H| with VC(H), the VC dimension of H.

VC dimension. Consider a sample S = (x_1, ..., x_m) of size m. The set of all possible binary labelings realizable by hypothesis class H on S is {(h(x_1), ..., h(x_m)) : h ∈ H}. H shatters S if H can produce all 2^m possible labelings. The VC dimension of H is the maximal number d of examples (for some choice of S) that can be shattered. Intuitively, VC ≈ number of free parameters.
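A brute-force shattering check for a simple class (1-D threshold functions), to make the definition concrete; the class and the test points are illustrative choices.

```python
import numpy as np

def shatters(points, hypotheses):
    """Return True if the hypothesis class realizes all 2^|points| binary labelings."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# H = threshold classifiers h_t(x) = 1[x >= t]. VC(H) = 1: any single point can be
# shattered, but no pair can (we can never label the smaller point 1 and the larger 0).
thresholds = np.linspace(-2, 2, 401)
H = [lambda x, t=t: int(x >= t) for t in thresholds]

print(shatters([0.0], H))        # True: one point can be shattered
print(shatters([0.0, 1.0], H))   # False: two points cannot
```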

VC bounds on err_train − err_true. Thm: w.p. 1 − δ, with d = VCdim(H), for all h ∈ H, |err_S(h) − err_P(h)| ≤ Φ(m, d, δ), where one standard form of the confidence term is Φ(m, d, δ) = sqrt((d (ln(2m/d) + 1) + ln(4/δ)) / m).
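A one-liner evaluating this (assumed) form of Φ(m, d, δ) for concrete numbers; the example values are arbitrary.

```python
import math

def phi(m: int, d: int, delta: float) -> float:
    """Assumed standard form of the VC confidence term Phi(m, d, delta)."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)

# Example: m = 10,000 samples, VC dimension d = 10, delta = 0.05
print(f"{phi(10_000, 10, 0.05):.3f}")   # about 0.10, i.e., a ~10% gap bound
```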

Structural risk minimization. Choose the hypothesis (and the complexity d of the class it comes from) that minimizes the guaranteed risk err_S(h) + Φ(m, d, δ). Or use cross validation!
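A minimal cross-validation sketch for choosing model complexity, as the slide suggests; the data, the use of polynomial degree as the complexity parameter, and 5 folds are all illustrative choices.

```python
import numpy as np

# Pick polynomial degree by 5-fold cross-validation instead of a penalized training error.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # assumed data-generating process

folds = np.array_split(rng.permutation(x.size), 5)
cv_err = {}
for degree in range(1, 10):
    fold_errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(x.size), val_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        fold_errs.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    cv_err[degree] = float(np.mean(fold_errs))
print("degree with lowest CV error:", min(cv_err, key=cv_err.get))
```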

Data-dependent bounds. The bound Φ(m, d, δ) is independent of the observed data set, and is therefore very loose. More complex bounds can be derived; these depend on the margin, i.e., the degree of separation between the positive and negative examples.