PAC Model and Generalization Bounds


Overview
Probably Approximately Correct (PAC) model.
Basic generalization bounds: finite hypothesis class, then infinite hypothesis class (a simple case now, more next week).

Motivating Example (PAC)
Concept: "average body-size" person.
Inputs for each person: height and weight (two-dimensional inputs).
Sample: labeled examples of persons. Label +: average body-size; label -: not average body-size.

Motivating Example (PAC)
Assumption: the target concept is a rectangle (the realizable case).
Goal: find a rectangle that approximates the target.
Formally: with high probability, output a rectangle whose error is low.

Example (Modeling)
Assume a fixed distribution over persons.
Goal: low error with respect to THIS distribution!
What does the distribution look like? Highly complex: each parameter is non-uniform, and the parameters are highly correlated.

PAC approach
Assume the distribution is fixed but unknown.
Samples are drawn i.i.d. (independent and identically distributed).
Concentrate on the decision rule rather than on the distribution.

PAC Learning
Task: learn a rectangle from examples.
Input: points (x1, x2) with classification + or -, classified by a target rectangle R.
Goal: with as few examples as possible, efficiently compute a rectangle R' that is a good approximation of R.

PAC Learning: Accuracy
Test the accuracy of a hypothesis using the distribution D of the examples.
Error region: R Δ R' (the symmetric difference of the target R and the hypothesis R').
Pr[Error] = D(Error region) = D(R Δ R').
We would like Pr[Error] to be controllable: given a parameter ε, find R' such that Pr[Error] < ε.

PAC Learning: Hypothesis
Which rectangle should we choose? Later we show that the choice is not that important.

PAC model: Setting
A distribution D over X (unknown).
Target function: c_t from C, c_t: X → {0,1}.
Hypothesis: h from H, h: X → {0,1}.
Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)].
Oracle: EX(c_t, D), which returns i.i.d. labeled examples (x, c_t(x)) with x drawn from D.
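As a concrete rendering of this setting, here is a minimal Python sketch; the uniform distribution on [0,1] and the threshold target below are illustrative assumptions for the example, not part of the model:

```python
import random

def make_oracle(draw_x, c_t):
    """EX(c_t, D): each call returns one labeled example (x, c_t(x)) with x ~ D."""
    def EX():
        x = draw_x()
        return x, c_t(x)
    return EX

def estimate_error(h, c_t, draw_x, n=100_000):
    """Monte-Carlo estimate of error(h) = Pr_D[h(x) != c_t(x)]."""
    errors = 0
    for _ in range(n):
        x = draw_x()
        errors += h(x) != c_t(x)
    return errors / n

draw_x = random.random                      # D: uniform on [0,1] (illustrative choice)
c_t = lambda x: 1 if x >= 0.3 else 0        # target concept c_t (illustrative choice)
EX = make_oracle(draw_x, c_t)
sample = [EX() for _ in range(20)]          # i.i.d. labeled sample via the oracle
print(estimate_error(lambda x: 1, c_t, draw_x))  # error of the constant-1 hypothesis, about 0.3
```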

PAC Learning: Definition
C and H are concept classes over X. C is PAC learnable by H if there exists an algorithm A such that:
for any distribution D over X, any c_t in C, and every input ε and δ,
A, given access to EX(c_t, D), outputs a hypothesis h in H such that with probability at least 1-δ, error(h) < ε.
Complexities of interest: sample size and running time.
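In symbols, the definition above reads as follows (writing m(ε, δ) for the number of calls A makes to the oracle):

```latex
\[
  \forall D,\ \forall c_t \in C,\ \forall \epsilon, \delta \in (0,1):\quad
  \Pr_{S \sim \mathrm{EX}(c_t, D)^{\,m(\epsilon,\delta)}}
    \bigl[\, \operatorname{error}(A(S)) < \epsilon \,\bigr] \;\ge\; 1 - \delta,
\]
\[
  \text{where } A(S) \in H
  \quad\text{and}\quad
  \operatorname{error}(h) = \Pr_{x \sim D}\bigl[h(x) \ne c_t(x)\bigr].
\]
```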

PAC: comments
We only assumed that the examples are i.i.d.
We have two independent parameters: accuracy ε and confidence δ.
The hypothesis is tested on the same distribution as the sample.
No assumption is made about the likelihood of concepts.

Finite Concept class
Assume C = H and H is finite (the realizable case).
h is ε-bad if error(h) > ε.
Algorithm: sample a set S of m(ε, δ) examples and find an ħ in H that is consistent with S (this is basically ERM).
The algorithm fails if ħ is ε-bad.

Analysis
A consistent hypothesis classifies all the sample examples correctly.
Fix a hypothesis g which is ε-bad. The probability that g is consistent with the sample:
Pr[g consistent] = ∏_i Pr[g(x_i) = y_i] ≤ (1-ε)^m < exp(-εm).
We need to bound Pr[∃ g: g is consistent and ε-bad].

Analysis
The probability that there exists a g which is ε-bad and consistent, by the union bound:
Pr[∃ g: g is consistent and ε-bad] ≤ Σ_{ε-bad g} Pr[g is consistent and ε-bad] ≤ |H| exp(-εm).
Requiring |H| exp(-εm) ≤ δ gives the sample size: m > (1/ε) ln(|H|/δ).
Equivalently: Pr[ħ is ε-bad] ≤ δ once m ≥ (ln|H| + ln(1/δ))/ε.
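A tiny helper for evaluating this sample-size bound (a sketch of the arithmetic only; the chosen ε, δ, |H| values are arbitrary):

```python
import math

def sample_size_realizable(eps, delta, h_size):
    """Smallest m with |H| * exp(-eps * m) <= delta, i.e. m >= (1/eps) * ln(|H|/delta)."""
    return math.ceil(math.log(h_size / delta) / eps)

print(sample_size_realizable(eps=0.1, delta=0.01, h_size=1000))  # -> 116
```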

PAC: non-realizable case
What happens if c_t is not in H? We need to redefine the goal.
Let h* in H minimize the error, and let β = error(h*).
Goal: find h in H such that error(h) ≤ error(h*) + ε = β + ε.
Algorithm: ERM (Empirical Risk Minimization).

Analysis
For each h in H, let error_S(h) be the empirical error on the sample S.
Bound the probability that |error_S(h) - error(h)| ≥ ε/2.
Chernoff (Hoeffding) bound: at most 2 exp(-2(ε/2)² m).
Consider the entire class H: we want this to hold for every h. By a union bound it holds with probability at least 1 - 2|H| exp(-2(ε/2)² m) = 1-δ.
Sample size: m > (2/ε²) ln(2|H|/δ).
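Spelled out, the step from the Hoeffding bound to the sample size is:

```latex
\[
  \Pr\Bigl[\, \bigl|\operatorname{error}_S(h) - \operatorname{error}(h)\bigr| \ge \tfrac{\epsilon}{2} \,\Bigr]
  \;\le\; 2\exp\!\Bigl(-2\bigl(\tfrac{\epsilon}{2}\bigr)^{2} m\Bigr)
  \;=\; 2e^{-\epsilon^{2} m/2}
\]
Union bound over all $h \in H$, then require failure probability at most $\delta$:
\[
  2\,|H|\,e^{-\epsilon^{2} m/2} \;\le\; \delta
  \quad\Longleftrightarrow\quad
  m \;\ge\; \frac{2}{\epsilon^{2}} \,\ln\frac{2\,|H|}{\delta}.
\]
```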

Correctness
Assume that for all h in H: |error_S(h) - error(h)| < ε/2.
In particular: error_S(h*) < error(h*) + ε/2, and error(h) - ε/2 < error_S(h) for every h.
For the output h_ERM: error_S(h_ERM) ≤ error_S(h*).
Conclusion: error(h_ERM) < error_S(h_ERM) + ε/2 ≤ error_S(h*) + ε/2 < error(h*) + ε.

ERM: Finite Hypothesis Class
Theorem: for any finite class H and any target function c_t, using sample size m, ERM outputs ħ such that
Pr[ error(ħ) > min_{h in H} error(h) + sqrt( 2 log(2|H|/δ) / m ) ] ≤ δ.
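A toy end-to-end illustration of this theorem's setting: ERM over a small finite class. The grid of 101 thresholds, the "true" threshold 0.37, and the 10% label noise are all illustrative assumptions, not part of the theorem:

```python
import math
import random

random.seed(0)

# Finite hypothesis class: 101 threshold functions on a grid over [0, 1].
H = [i / 100 for i in range(101)]

def h_theta(theta, x):
    return 0 if x < theta else 1

# Data source: best threshold is 0.37, with 10% label noise (non-realizable).
def draw_example():
    x = random.random()
    y = 0 if x < 0.37 else 1
    if random.random() < 0.10:
        y = 1 - y
    return x, y

eps, delta = 0.1, 0.05
m = math.ceil((2 / eps**2) * math.log(2 * len(H) / delta))   # agnostic sample bound
S = [draw_example() for _ in range(m)]

def empirical_error(theta, S):
    return sum(h_theta(theta, x) != y for x, y in S) / len(S)

h_erm = min(H, key=lambda theta: empirical_error(theta, S))
print(m, h_erm, empirical_error(h_erm, S))   # h_erm should land near 0.37
```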

Example: Learning OR of literals
Inputs: x_1, ..., x_n.
Literals: each variable and its negation, x_i and x̄_i.
OR functions, e.g.: x̄_1 ∨ x_4 ∨ x_7.
Number of functions: 3^n (each variable appears positively, negatively, or not at all).

ELIM: Algorithm for learning OR
Keep a list of all literals. For every example whose classification is 0, erase all the literals that evaluate to 1 on it. Example: see the sketch below.
Correctness: our hypothesis h is the OR of the remaining literals. The set of remaining literals always includes the literals of the target OR. Every time h predicts 0 we are correct (and whenever the true label is 1, some target literal is 1, so h predicts 1 as well).
Sample size: m > (1/ε) ln(3^n/δ).
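A direct implementation sketch of ELIM (the representation of literals as (index, sign) pairs and the helper names are my own choices):

```python
def elim(examples, n):
    """ELIM for learning an OR of literals over x_0..x_{n-1}.

    examples: list of (x, y), x a tuple of n bits, y in {0, 1}.
    Returns the remaining literal set as pairs (i, sign): x_i if sign == 1,
    its negation if sign == 0.
    """
    # Start with every literal: each variable and its negation.
    literals = {(i, s) for i in range(n) for s in (0, 1)}
    for x, y in examples:
        if y == 0:
            # Negative example: any literal that evaluates to 1 here
            # cannot belong to the target OR, so erase it.
            literals -= {(i, s) for (i, s) in literals if x[i] == s}
    return literals

def predict(literals, x):
    """Hypothesis h: the OR of the remaining literals."""
    return int(any(x[i] == s for (i, s) in literals))

# Tiny check: target is x_0 OR (not x_2) over n = 4 variables.
data = [((1, 0, 1, 0), 1), ((0, 1, 1, 1), 0), ((0, 0, 0, 1), 1)]
h = elim(data, 4)
print(all(predict(h, x) == y for x, y in data))   # True: h is consistent
```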

Learning parity
Functions: parities such as x_1 ⊕ x_7 ⊕ x_9.
Number of functions: 2^n.
Algorithm: sample a set of examples and solve the resulting system of linear equations over GF(2).
Sample size: m > (1/ε) ln(2^n/δ).
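A sketch of the "solve linear equations" step: Gaussian elimination over GF(2) to find some parity consistent with the sample (helper names are mine; in the realizable case, with enough examples, a consistent parity is the target with high probability):

```python
def learn_parity(examples, n):
    """Find a parity c(x) = XOR of {x_i : i in T} consistent with the examples,
    by Gaussian elimination over GF(2).

    examples: list of (x, y), x a tuple of n bits, y in {0, 1}.
    Returns a set T of indices, or None if the system is inconsistent.
    """
    rows = [(list(x), y) for x, y in examples]   # one linear equation per example
    pivots = []                                  # (column, pivot-row) pairs
    for col in range(n):
        used = {r for _, r in pivots}
        pivot = next((r for r in range(len(rows))
                      if r not in used and rows[r][0][col] == 1), None)
        if pivot is None:
            continue                             # free variable (set to 0 later)
        pivots.append((col, pivot))
        for r in range(len(rows)):               # eliminate col from all other rows
            if r != pivot and rows[r][0][col] == 1:
                rows[r] = ([a ^ b for a, b in zip(rows[r][0], rows[pivot][0])],
                           rows[r][1] ^ rows[pivot][1])
    if any(not any(coeffs) and y == 1 for coeffs, y in rows):
        return None                              # equation 0 = 1: inconsistent sample
    return {col for col, r in pivots if rows[r][1] == 1}

# Tiny check: target parity x_0 XOR x_2 over n = 4 variables.
data = [((1, 0, 1, 0), 0), ((1, 1, 0, 0), 1), ((0, 1, 1, 0), 1), ((0, 0, 1, 1), 1)]
print(learn_parity(data, 4))   # -> {0, 2}
```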

Lower Bounds
"No free lunch" theorems: impossibility results. With too few examples, any algorithm fails.
General structure: exhibit a hypothesis class H and a distribution D, and show that if m is too small then, for any output, the expected error is high.

Lower bounds: Hypothesis class
Fix a finite domain X. Let H be the class of all Boolean functions over X, so |H| = 2^|X|.
Let D be uniform over X: D(x) = 1/|X|.
Choose the target function c_t at random from H; for each x, Pr[c_t(x) = 1] = ½.
This is the realizable case.

Lower Bounds: Hypothesis class
Assume we sample only |X|/2 examples. Then at least |X|/2 points are not observed.
For each such point, Pr[c_t(x) = 1] = ½, regardless of the sample.
Error rate: for each unseen point the expected error is ½, so the expected error overall is at least ¼.

Lower Bounds: accuracy & confidence
Let H = {c_1, c_2} and let D be such that Pr_D[c_1(x) ≠ c_2(x)] = 4ε. No single function is ε-good for both c_1 and c_2.
Select c_t at random from H. If c_1(x) = c_2(x) for every x in S, we cannot distinguish the two.
Pr[∀i: c_1(x_i) = c_2(x_i)] = (1-4ε)^m.
For the failure probability to be at most δ we need (1-4ε)^m ≤ δ, i.e. m ≥ log(1/δ) / log(1/(1-4ε)) ≈ (1/(4ε)) log(1/δ).

Remark on finite precision
Reducing an infinite class to a finite one by assuming finite precision: a real value on a computer is 64 bits, so any real-valued parameter takes only 2^64 values.
A class with d parameters therefore has |H| ≤ 2^(64d), i.e. log_2|H| ≤ 64d.
Example: hyperplanes.
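As a rough illustration, combining this remark with the realizable sample bound from earlier (the dimension, ε, and δ below are arbitrary choices, and the effective precision of learned parameters is usually far lower than 64 bits):

```python
import math

def sample_size_finite_precision(eps, delta, d, bits=64):
    """m > (1/eps) * (ln|H| + ln(1/delta)) with the finite-precision class size |H| = 2**(bits*d)."""
    return math.ceil((bits * d * math.log(2) + math.log(1 / delta)) / eps)

# Hyperplanes in R^10 (d = 11 parameters, counting the bias term):
print(sample_size_finite_precision(eps=0.1, delta=0.01, d=11))   # -> 4926
```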

Infinite Concept class
X = [0,1] and H = {c_θ : θ in [0,1]}, where c_θ(x) = 0 iff x < θ. Assume C = H.
After seeing a sample, the thresholds consistent with it form an interval [min, max] (between the largest example labeled 0 and the smallest example labeled 1).
Which c_θ should we choose in [min, max]?

Threshold on a line
General idea: take a large sample S of m examples and return some consistent hypothesis (any threshold in [min, max]).
Correctness: show that with probability 1-δ the interval [min, max] has probability mass less than ε, i.e. D([min, max]) ≤ ε.

Threshold on a line: Proof
Need to show: Pr[D([min, max]) > ε] < δ.
Proof attempt, by contradiction: assume D([min, max]) > ε, i.e. the probability that x falls in [min, max] is at least ε.
Compute the probability over the sample S: each example has probability at least ε of landing in [min, max], so the probability that no x in S lands in [min, max] is at most (1-ε)^m.
Requiring (1-ε)^m ≤ δ gives m > (1/ε) ln(1/δ). (The next slide explains why this argument is flawed.)

Threshold on a line: why wrong?!
The sample S is a random variable, and min and max are functions of S: write min(S) and max(S). They are defined only after we have the sample S.
Put differently: first sample S, then set min/max; there are no more x's to sample.
Pr_S[x in [min(S), max(S)] and x in S]: actually, by the definition of min and max, this is zero (no sample point lies strictly between min(S) and max(S))!

Threshold on a line: corrected
Define events that do not depend on the sample S.
Let min′ be the point with D([min′, θ]) = ε/2, and let max′ be the point with D([θ, max′]) = ε/2.
Goal: show that with high probability min ∈ [min′, θ] and max ∈ [θ, max′].
Then any value in [min, max] is good (its error is at most D([min′, max′]) = ε).

Threshold on a line: corrected
For a single sample point x: Pr[x ∈ [min′, θ]] = ε/2 and Pr[x ∈ [θ, max′]] = ε/2.
Probabilities over the sample S: Pr[min ∉ [min′, θ]] = Pr[no sample point falls in [min′, θ]] = (1-ε/2)^m, and similarly Pr[max ∉ [θ, max′]] = (1-ε/2)^m.
So the probability that [min, max] is bad is at most 2(1-ε/2)^m ≤ δ, which holds once m ≥ (2/ε) ln(2/δ).
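A quick simulation of the corrected bound (the uniform distribution on [0,1] and the particular θ are illustrative assumptions): draw m ≥ (2/ε) ln(2/δ) points, return a consistent threshold, and check how often its true error exceeds ε.

```python
import math
import random

random.seed(1)

def trial(theta, eps, delta):
    m = math.ceil((2 / eps) * math.log(2 / delta))
    xs = [random.random() for _ in range(m)]          # D: uniform on [0, 1]
    ys = [0 if x < theta else 1 for x in xs]          # c_theta(x) = 0 iff x < theta
    # Consistent thresholds lie between the largest 0-labeled point
    # and the smallest 1-labeled point; return the midpoint.
    lo = max((x for x, y in zip(xs, ys) if y == 0), default=0.0)
    hi = min((x for x, y in zip(xs, ys) if y == 1), default=1.0)
    guess = (lo + hi) / 2
    return abs(guess - theta)                         # true error under uniform D

eps, delta, theta = 0.05, 0.05, 0.37
failures = sum(trial(theta, eps, delta) > eps for _ in range(2000))
print(failures / 2000)   # empirically far below delta = 0.05
```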

Threshold: non-realizable case
Suppose we draw a sample whose labels are not consistent with any single threshold.
ERM algorithm: find the function h with the lowest empirical error on the sample!

Threshold: non-realizable
Try to reduce to a finite class: build an ε-net.
Given the class H, define a class G such that for every h in H there exists a g in G with D(g Δ h) < ε/4.
Algorithm: find the best g in G.

Threshold: non-realizable
Define points z_i forming an ε/4-net with respect to D: D([z_i, z_{i+1}]) = ε/4.
Let G = {c_{z_i}}, so |G| = 4/ε.
For any c_θ there is a g in G with |error(g) - error(c_θ)| ≤ ε/4.
Learn using G. What is the problem?! (The net points depend on the unknown D, so G cannot actually be constructed.)
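The step behind "|error(g) - error(c_θ)| ≤ ε/4" above, spelled out: the two thresholds disagree only inside one cell of the net, so their errors can differ by at most that cell's probability mass.

```latex
\[
  z_i \le \theta \le z_{i+1}
  \;\Longrightarrow\;
  \{\, x : c_{\theta}(x) \ne c_{z_i}(x) \,\} \subseteq [z_i, z_{i+1}]
  \;\Longrightarrow\;
  \bigl| \operatorname{error}(c_{\theta}) - \operatorname{error}(c_{z_i}) \bigr|
  \;\le\; D\bigl([z_i, z_{i+1}]\bigr) \;=\; \tfrac{\epsilon}{4}.
\]
```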

Threshold: proof non-realizable
Goal: show that ERM over H works; the ε-net is used only in the proof.
Conditions needed (the first and third are sampling-error events):
(1) for every g in G: |error_S(g) - error(g)| ≤ ε/16;
(2) for any c_θ in H there is a g in G with |error(g) - error(c_θ)| ≤ ε/4;
(3) for any c_θ in H and its net approximation g: |error_S(g) - error_S(c_θ)| ≤ 3ε/8.

Threshold: proof non-realizable
Assume all three conditions hold.
For the best hypothesis h* (minimizing error over H), let g* be its approximation in G:
error_S(g*) ≤ error(g*) + ε/16 ≤ error(h*) + ε/4 + ε/16 = error(h*) + 5ε/16.

Threshold: proof non-realizable
For the hypothesis h_ERM selected by ERM, let g_ERM be its approximation in G:
error(h_ERM) ≤ error(g_ERM) + ε/4 ≤ error_S(g_ERM) + ε/4 + ε/16 = error_S(g_ERM) + 5ε/16 ≤ error_S(h_ERM) + 5ε/16 + 3ε/8 = error_S(h_ERM) + 11ε/16.
Since ERM minimizes the empirical error over H and g* ∈ G ⊆ H: error_S(h_ERM) ≤ error_S(g*).

Threshold: proof non-realizable
Putting it together:
error(h*) + 5ε/16 ≥ error_S(g*) ≥ error_S(h_ERM) ≥ error(h_ERM) - 11ε/16,
hence error(h_ERM) ≤ error(h*) + 5ε/16 + 11ε/16 = error(h*) + ε.

Threshold: proof non-realizable
Completing the proof: bound the probability that the conditions hold.
Condition (1): for every g in G, |error_S(g) - error(g)| ≤ ε/16. Since G is finite with 4/ε functions, the finite-class bound (Chernoff plus union bound over G) applies.
Condition (3): for any c_θ and its net approximation g, |error_S(g) - error_S(c_θ)| ≤ 3ε/8. It suffices that every interval [z_i, z_{i+1}] contains at most 3εm/8 sample points; the expected number is 2εm/8 = εm/4, so again a Chernoff bound applies.

Summary
PAC model; generalization bounds; Empirical Risk Minimization; finite classes; infinite classes (threshold on an interval).
Next class: infinite classes, a general methodology.