Statistical Learning Theory: Generalization Error Bounds


CS6780 Advanced Machine Learning, Spring 2015
Thorsten Joachims, Cornell University
Reading: Murphy 6.5.4; Schoelkopf/Smola, Chapter 5 (beginning, rest later)

Outline

Questions in Statistical Learning Theory:
- How good is the learned rule after n examples?
- How many examples do I need before the learned rule is accurate?
- What can be learned and what cannot?
- Is there a universally best learning algorithm?

In particular, we will address: What is the true error of h if we only know the training error of h?
- Finite hypothesis spaces and zero training error
- Finite hypothesis spaces and non-zero training error
- Infinite hypothesis spaces and VC dimension (later)

Can you Convince me of your Psychic Abilities?

Game: I think of n bits. If somebody in the class guesses my bit sequence, that person clearly has telepathic abilities, right?

Question: If at least one of H players guesses the bit sequence correctly, is there any significant evidence that he/she has telepathic abilities? How large would n and H have to be?
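To make the intuition concrete, here is a small simulation sketch (not part of the original slides; the function name is my own). By the union bound, the chance that at least one of H purely random guessers matches all n bits is at most H * 2^(-n), so with many players a perfect guess is unsurprising even without telepathy.

```python
import random

def prob_some_lucky_guesser(n_bits, n_players, trials=10_000):
    """Estimate P(at least one of n_players guesses all n_bits correctly)
    when every player guesses each bit uniformly at random."""
    hits = 0
    for _ in range(trials):
        secret = [random.getrandbits(1) for _ in range(n_bits)]
        if any(all(random.getrandbits(1) == b for b in secret)
               for _ in range(n_players)):
            hits += 1
    return hits / trials

print(prob_some_lucky_guesser(n_bits=10, n_players=1000))  # roughly 0.62
print(1000 * 2**-10)  # union-bound upper estimate: ~0.98
```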

Discriminative Learning and Prediction (Reminder)

Real-world process P(X,Y). The training sample S_train = (x_1,y_1), ..., (x_n,y_n) and the test sample S_test = (x_{n+1},y_{n+1}), ... are both drawn i.i.d. from P(X,Y); the learner receives S_train and outputs a hypothesis h.

Goal: Find h with small prediction error Err_P(h) over P(X,Y).
Discriminative Learning: Given H, find h with small error Err_{S_train}(h) on the training sample S_train.
Training Error: Error Err_{S_train}(h) on the training sample.
Test Error: Error Err_{S_test}(h) on the test sample is an estimate of Err_P(h).
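As a concrete rendering of these definitions (illustrative only, not from the slides), the empirical error of a hypothesis on a labeled sample is simply its misclassification rate:

```python
def empirical_error(h, sample):
    """Err_S(h): fraction of labeled examples (x, y) in the sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Toy threshold hypothesis: h(x) = 1 if x >= 0 else 0
h = lambda x: 1 if x >= 0 else 0
train = [(-2, 0), (-1, 0), (0.5, 1), (3, 1), (1, 0)]  # last example is misclassified
print(empirical_error(h, train))  # 0.2
```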

Useful Formulas

Binomial Distribution: The probability of observing x heads in a sample of n independent coin tosses, where in each toss the probability of heads is p, is
$$P(X = x \mid p, n) \;=\; \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}$$

Union Bound:
$$P(X_1 = x_1 \vee X_2 = x_2 \vee \dots \vee X_n = x_n) \;\le\; \sum_{i=1}^{n} P(X_i = x_i)$$

Unnamed:
$$(1 - \varepsilon) \;\le\; e^{-\varepsilon}$$
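As a quick sanity check (not from the slides), the binomial formula and the last inequality can be verified numerically:

```python
import math

def binom_pmf(x, n, p):
    """P(X = x | p, n): probability of x heads in n independent tosses with heads-probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# The pmf sums to 1 over x = 0..n
assert abs(sum(binom_pmf(x, 10, 0.3) for x in range(11)) - 1.0) < 1e-12

# (1 - eps) <= exp(-eps)
for eps in (0.0, 0.01, 0.1, 0.5, 1.0):
    assert (1 - eps) <= math.exp(-eps)
```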

Generalization Error Bound: Finite H, Zero Error

Setting:
- Sample of n labeled instances S_train
- Learning Algorithm L with a finite hypothesis space H
- At least one h ∈ H has zero prediction error Err_P(h) = 0 (and hence Err_{S_train}(h) = 0)
- Learning Algorithm L returns a zero-training-error hypothesis ĥ

Question: What is the probability that the prediction error of ĥ is larger than ε?
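The standard answer to this question, sketched here using the union bound and the inequality (1 - ε) ≤ e^{-ε} from the Useful Formulas slide (the derivation itself is not spelled out in the transcription):

$$
P\bigl(\exists h \in H:\ \mathrm{Err}_{S_{train}}(h) = 0 \;\wedge\; \mathrm{Err}_P(h) > \varepsilon\bigr)
\;\le\; |H|\,(1-\varepsilon)^{n} \;\le\; |H|\,e^{-\varepsilon n}
$$

Each hypothesis with Err_P(h) > ε survives n i.i.d. training examples without a single error with probability at most (1 - ε)^n; applying the union bound over the at most |H| such hypotheses and then (1 - ε) ≤ e^{-ε} gives the stated bound.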

Sample Complexity: Finite H, Zero Error

Setting:
- Sample of n labeled instances S_train
- Learning Algorithm L with a finite hypothesis space H
- At least one h ∈ H has zero prediction error (and hence Err_{S_train}(h) = 0)
- Learning Algorithm L returns a zero-training-error hypothesis ĥ

Question: How many training examples does L need so that with probability at least (1 - δ) it learns an ĥ with prediction error less than ε?
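Requiring |H| e^{-εn} ≤ δ and solving for n yields the familiar sample-complexity requirement n ≥ (ln|H| + ln(1/δ))/ε. A minimal helper, offered as an illustration (the function name and numbers are my own, not from the slides):

```python
import math

def sample_complexity_zero_error(h_size, eps, delta):
    """Smallest n with |H| * exp(-eps * n) <= delta, i.e.
    n >= (ln|H| + ln(1/delta)) / eps (finite H, zero training error)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 2^20 hypotheses, eps = 0.05, delta = 0.01
print(sample_complexity_zero_error(2**20, 0.05, 0.01))  # 370 examples
```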

Example: Smart Investing

Task: Pick a stock analyst based on past performance.
Experiment: Review each analyst's next-day up/down predictions for the past 10 days. Pick the analyst that makes the fewest errors.
- Situation 1: 2 stock analysts {A1, A2}; A1 makes 5 errors.
- Situation 2: 5 stock analysts {A1, A2, B1, B2, B3}; B2 best with 1 error.
- Situation 3: 1005 stock analysts {A1, A2, B1, B2, B3, C1, ..., C1000}; C543 best with 0 errors.
Question: Which analyst are you most confident in: A1, B2, or C543?
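A back-of-the-envelope calculation (not on the slide, but in the union-bound spirit of the psychic-abilities game) shows why the records differ in evidential value: if the 1000 C-analysts were guessing up/down uniformly at random, then

$$
P(\text{some C-analyst makes 0 errors in 10 days}) \;=\; 1 - \bigl(1 - 2^{-10}\bigr)^{1000} \;\approx\; 0.62,
$$

so C543's perfect record is roughly what pure chance would produce in a pool that large. By contrast, the chance that at least one of only 5 random analysts makes at most 1 error in 10 days is at most 5 · 11/2^{10} ≈ 0.054, and A1's 5 errors out of 10 are exactly chance level.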

Useful Formula

Hoeffding/Chernoff Bound: For any distribution P(X) where X can take the values 0 and 1, the probability that the average of an i.i.d. sample of size n deviates from its mean p by more than ε is bounded as
$$P\Bigl(\Bigl|\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - p\Bigr| > \varepsilon\Bigr) \;\le\; 2\,e^{-2 n \varepsilon^{2}}$$
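As a quick empirical sanity check (illustrative only; `hoeffding_check` is my own helper, not part of the slides), the deviation probability for Bernoulli samples stays well below the bound:

```python
import math
import random

def hoeffding_check(p=0.3, n=100, eps=0.1, trials=20_000):
    """Compare the empirical probability of a deviation larger than eps
    with the Hoeffding bound 2*exp(-2*n*eps^2)."""
    deviations = 0
    for _ in range(trials):
        avg = sum(random.random() < p for _ in range(n)) / n
        if abs(avg - p) > eps:
            deviations += 1
    return deviations / trials, 2 * math.exp(-2 * n * eps**2)

empirical, bound = hoeffding_check()
print(empirical, "<=", bound)  # roughly 0.02-0.03 <= ~0.27
```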

Generalization Error Bound: Finite H, Non-Zero Error

Setting:
- Sample of n labeled instances S_train
- Learning Algorithm L with a finite hypothesis space H
- L returns the hypothesis ĥ = L(S_train) with the lowest training error

Question: What is the probability that the prediction error of ĥ exceeds its fraction of training errors by more than ε?
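The standard argument behind this question combines the Hoeffding bound with the union bound over the finite hypothesis space; a sketch (not spelled out verbatim in the transcription):

$$
P\bigl(\exists h \in H:\ \bigl|\mathrm{Err}_P(h) - \mathrm{Err}_{S_{train}}(h)\bigr| > \varepsilon\bigr)
\;\le\; \sum_{h \in H} P\bigl(\bigl|\mathrm{Err}_P(h) - \mathrm{Err}_{S_{train}}(h)\bigr| > \varepsilon\bigr)
\;\le\; 2\,|H|\,e^{-2 n \varepsilon^{2}}.
$$

Setting the right-hand side equal to δ and solving for ε yields the bound quoted on the next slide.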

Overfitting vs. Underfitting [Mitchell, 1997]

With probability at least (1 - δ):
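The formula that the transcription cuts off at this point is, in the standard finite-H analysis implied by the previous two slides, the following (a reconstruction of the usual statement rather than a verbatim quote of the slide):

$$
\forall h \in H:\quad \mathrm{Err}_P(h) \;\le\; \mathrm{Err}_{S_{train}}(h) + \sqrt{\frac{\ln|H| + \ln(2/\delta)}{2n}}.
$$

The first term shrinks as the hypothesis space grows (less underfitting), while the second term grows with ln|H| (more room to overfit); the balance between the two is the overfitting-vs-underfitting trade-off the slide attributes to Mitchell (1997).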