The PAC Learning Framework -II

Prof. Dan A. Simovici
UMB

Outline

1. Finite Hypothesis Space - The Inconsistent Case
2. Deterministic versus stochastic scenario
3. Bayes Error and Noise

Universal Concept Class

Let $X = \{0,1\}^n$ and let $U_n = \mathcal{P}(X)$ be the concept class formed by all subsets of $X$. To guarantee a consistent hypothesis, the hypothesis class must include the concept class, so $|H| \geq |U_n| = 2^{2^n}$. We have
$$m \geq \frac{1}{\epsilon}\left(2^n \log 2 + \log\frac{1}{\delta}\right).$$
The number of examples required by the theorem is exponential in $n$, so PAC learnability does not follow.
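
As a quick numerical illustration of this bound (a sketch of my own; the helper name and the choices $\epsilon = 0.1$, $\delta = 0.05$ are not from the slides), the required sample size already exceeds seven million for $n = 20$:

    import math

    def consistent_sample_bound(n, epsilon, delta):
        # m >= (1/epsilon) * (2^n * log 2 + log(1/delta)) examples are required,
        # since log|H| >= log|U_n| = 2^n * log 2.
        return math.ceil((2**n * math.log(2) + math.log(1 / delta)) / epsilon)

    for n in (5, 10, 20):
        print(n, consistent_sample_bound(n, epsilon=0.1, delta=0.05))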

Finite Hypothesis Space - The Inconsistent Case

Framework

If the concept class is more complex than the hypothesis space, it may be the case that there is no hypothesis consistent with a labeled training sample, that is, there is no $h_S$ for which $\hat{R}(h_S) = 0$. We use the following corollary of Hoeffding's inequality:

Corollary. Let $X_1, \ldots, X_n$ be $n$ independent random variables such that $X_i \in [0,1]$ for $1 \leq i \leq n$, and let $Z_n$ be the random variable defined by
$$Z_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
The following inequalities hold:
$$P(Z_n - E(Z_n) \geq \epsilon) \leq e^{-2n\epsilon^2}, \qquad P(Z_n - E(Z_n) \leq -\epsilon) \leq e^{-2n\epsilon^2}.$$
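
The corollary can be checked empirically. The following sketch (my own illustration; the choice of Bernoulli(0.5) variables and of $n$, $\epsilon$ is arbitrary) estimates the deviation probability by simulation and compares it with $e^{-2n\epsilon^2}$:

    import math
    import random

    # Estimate P(Z_n - E(Z_n) >= eps) for i.i.d. Bernoulli(0.5) variables and
    # compare it with the Hoeffding bound exp(-2 n eps^2).
    n, eps, trials = 200, 0.1, 20000
    random.seed(0)
    exceed = 0
    for _ in range(trials):
        z_n = sum(random.random() < 0.5 for _ in range(n)) / n
        if z_n - 0.5 >= eps:
            exceed += 1
    print("empirical:", exceed / trials, " bound:", math.exp(-2 * n * eps ** 2))

The empirical frequency should come out well below the bound: Hoeffding's inequality is distribution-free, so it is not tight for any particular distribution.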

Finite Hypothesis Space - The Inconsistent Case

Framework (cont'd)

Recall that $R(h) = E(\hat{R}(h))$. The corollary of Hoeffding's inequality applied to
$$\hat{R}(h) = \frac{1}{m}\left|\{x_i \mid h(x_i) \neq c(x_i)\}\right|$$
implies that for any $\epsilon > 0$, any sample $S = (x_1, \ldots, x_m)$ of size $m$, and any hypothesis $h : X \to \{0,1\}$ the following inequalities hold:
$$P(\hat{R}(h) - R(h) \geq \epsilon) \leq e^{-2m\epsilon^2}, \qquad P(\hat{R}(h) - R(h) \leq -\epsilon) \leq e^{-2m\epsilon^2}.$$
Therefore,
$$P(|\hat{R}(h) - R(h)| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$$
and
$$P(|\hat{R}(h) - R(h)| < \epsilon) \geq 1 - 2e^{-2m\epsilon^2}. \qquad (1)$$

Finite Hypothesis Space - The Inconsistent Case

Generalization Bound - Single Hypothesis

Corollary. For a random hypothesis $h : X \to \{0,1\}$ and for any $\delta > 0$, the inequality
$$R(h) \leq \hat{R}(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
holds with probability at least $1 - \delta$.

Proof: Taking $1 - 2e^{-2m\epsilon^2} \geq 1 - \delta$, that is, $\delta \geq 2e^{-2m\epsilon^2}$, in Inequality (1) we obtain
$$P(|\hat{R}(h) - R(h)| < \epsilon) \geq 1 - \delta$$
for $\epsilon = \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$.

Note: The inequality of the corollary is an inequality involving random variables, not numbers, because $h$ is a randomly chosen hypothesis in $H$.
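
To get a feel for the bound, the following sketch (the function name, $\delta = 0.05$, and the sample sizes are my choices) prints the width of the confidence term $\sqrt{\log(2/\delta)/(2m)}$ for a few values of $m$:

    import math

    # Width of the single-hypothesis confidence term sqrt(log(2/delta) / (2m)).
    def single_hypothesis_margin(m, delta):
        return math.sqrt(math.log(2 / delta) / (2 * m))

    for m in (100, 1000, 10000):
        print(m, round(single_hypothesis_margin(m, delta=0.05), 4))

The width shrinks like $1/\sqrt{m}$: roughly 0.136 for $m = 100$ and 0.014 for $m = 10000$.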

Finite Hypothesis Space - The Inconsistent Case

Tossing a Coin

Example. Let $p$ be the probability that a biased coin lands heads, and let $h$ be the hypothesis that always guesses tails. The generalization error is $R(h) = p$, and $\hat{R}(h) = \hat{p}$, where $\hat{p}$ is the empirical probability of heads based on the training sample drawn i.i.d. Thus, with probability at least $1 - \delta$ we have
$$|\hat{p} - p| \leq \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
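
A simulation of this example (a sketch; the values $p = 0.3$, $m = 500$, $\delta = 0.05$ are arbitrary choices of mine) checks how often the interval actually covers $p$ over repeated samples:

    import math
    import random

    # For the coin example, check the fraction of samples for which
    # |p_hat - p| <= sqrt(log(2/delta) / (2m)); it should be at least 1 - delta.
    p, m, delta, trials = 0.3, 500, 0.05, 5000
    margin = math.sqrt(math.log(2 / delta) / (2 * m))
    random.seed(1)
    covered = 0
    for _ in range(trials):
        p_hat = sum(random.random() < p for _ in range(m)) / m
        if abs(p_hat - p) <= margin:
            covered += 1
    print("coverage:", covered / trials, " target:", 1 - delta)

Hoeffding's interval is conservative, so the observed coverage typically exceeds $1 - \delta$ by a wide margin.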

Finite Hypothesis Space - The Inconsistent Case

Learning Bound - Finite H, Inconsistent Case

Theorem. Let $H$ be a finite hypothesis set. For any $\delta > 0$, the inequality
$$\forall h \in H: \quad R(h) \leq \hat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$$
holds with probability at least $1 - \delta$.

Remark: this is a uniform bound (it applies to all hypotheses in $H$).
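
The confidence term now also depends on $|H|$. The sketch below (the sample size $m = 1000$ and $\delta = 0.05$ are my choices) shows how slowly the uniform margin grows with the size of the hypothesis set:

    import math

    # Width of the uniform confidence term sqrt((log|H| + log(2/delta)) / (2m)).
    def uniform_margin(card_H, m, delta):
        return math.sqrt((math.log(card_H) + math.log(2 / delta)) / (2 * m))

    for card_H in (10, 1000, 10 ** 6):
        print(card_H, round(uniform_margin(card_H, m=1000, delta=0.05), 4))

Because $|H|$ enters only through $\log|H|$, multiplying the size of the class by $10^5$ increases the margin only from about 0.055 to about 0.094.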

Finite Hypothesis Space - The Inconsistent Case

Proof

Let $H = \{h_1, \ldots, h_{|H|}\}$ be the set of hypotheses. We have:
\begin{align*}
P\left(\exists h \in H: |R(h) - \hat{R}(h)| > \epsilon\right)
&= P\left(\left(|R(h_1) - \hat{R}(h_1)| > \epsilon\right) \vee \cdots \vee \left(|R(h_{|H|}) - \hat{R}(h_{|H|})| > \epsilon\right)\right)\\
&\leq \sum_{h \in H} P\left(|R(h) - \hat{R}(h)| > \epsilon\right)\\
&\leq 2|H| e^{-2m\epsilon^2}.
\end{align*}

Finite Hypothesis Space - The Inconsistent Case

Proof (cont'd)

Thus, we have
$$P(\exists h \in H: |R(h) - \hat{R}(h)| > \epsilon) \leq 2|H| e^{-2m\epsilon^2}.$$
Choosing $\delta = 2|H| e^{-2m\epsilon^2}$, it follows that $\log\delta = \log 2 + \log|H| - 2m\epsilon^2$, so
$$\epsilon = \sqrt{\frac{\log 2 + \log|H| - \log\delta}{2m}} = \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
With these choices we have
$$P(\exists h \in H: |R(h) - \hat{R}(h)| > \epsilon) \leq \delta,$$
which amounts to the inequality of the theorem:
$$P(\forall h \in H: |R(h) - \hat{R}(h)| < \epsilon) \geq 1 - \delta.$$
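
The union bound in this proof can also be checked by simulation. The sketch below (my own toy setup: eleven threshold rules on $\{0,\ldots,9\}$ with 20% label noise; all parameter values are arbitrary) estimates $P(\exists h \in H: |\hat{R}(h) - R(h)| > \epsilon)$ and compares it with $2|H| e^{-2m\epsilon^2}$:

    import math
    import random

    random.seed(3)
    # Finite class: threshold rules h_t(x) = 1 iff x >= t, for t = 0, ..., 10.
    H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]

    def clean(x):
        return 1 if x >= 5 else 0

    def true_risk(h):
        # Under uniform x and 20% label noise, h errs with probability 0.2 where
        # it agrees with the clean label and 0.8 where it disagrees.
        return sum(0.2 if h(x) == clean(x) else 0.8 for x in range(10)) / 10

    m, eps, trials = 100, 0.15, 2000
    bad = 0
    for _ in range(trials):
        S = []
        for _ in range(m):
            x = random.randrange(10)
            y = clean(x) if random.random() > 0.2 else 1 - clean(x)
            S.append((x, y))
        if any(abs(sum(h(x) != y for x, y in S) / m - true_risk(h)) > eps for h in H):
            bad += 1
    print("empirical:", bad / trials,
          " union bound:", 2 * len(H) * math.exp(-2 * m * eps ** 2))

The union bound over-counts because the deviations of different hypotheses are correlated, so the empirical frequency is typically far below $2|H| e^{-2m\epsilon^2}$.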

Finite Hypothesis Space - The Inconsistent Case

The previous theorem stipulates that for a finite hypothesis set $H$ we have
$$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{\log_2|H|}{m}}\right).$$
Note that $\log_2|H|$ is the number of bits needed to represent $H$; this points to Occam's principle:

- a smaller hypothesis space is better;
- a larger sample size $m$ guarantees better generalization;
- for the inconsistent case, a larger sample size is required to obtain the same guarantee as in the consistent case (where $R(h_S) \leq \epsilon$ once $m \geq \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$); see the comparison sketched below.
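
To make the comparison concrete, here is a small sketch (my own, not from the slides) that compares the consistent-case sample size $m \geq \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$ with the sample size obtained by solving the uniform bound above for $m$, namely $m \geq \frac{1}{2\epsilon^2}(\log|H| + \log\frac{2}{\delta})$; the values of $|H|$, $\epsilon$, $\delta$ are arbitrary:

    import math

    # Sample sizes suggested by the two bounds for the same |H|, epsilon, delta.
    card_H, eps, delta = 1000, 0.05, 0.05
    m_consistent = (math.log(card_H) + math.log(1 / delta)) / eps
    m_inconsistent = (math.log(card_H) + math.log(2 / delta)) / (2 * eps ** 2)
    print("consistent case  :", math.ceil(m_consistent))
    print("inconsistent case:", math.ceil(m_inconsistent))

The dependence on $\epsilon$ goes from $1/\epsilon$ to $1/\epsilon^2$, which is where the extra cost of the inconsistent case comes from.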

Deterministic versus stochastic scenario

The Stochastic Scenario

- the distribution $D$ is now defined on $X \times Y$ (in the deterministic scenario it was defined just on $X$);
- the training data is a sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, where the pairs $(x_i, y_i)$ are i.i.d. random variables;
- the output label $y_i$ is a probabilistic function of the input.

Example: if we try to predict the gender of a person based on weight and height, the result (male or female) is not unique.

Deterministic versus stochastic scenario

Agnostic PAC-algorithms

Definition. Let $H$ be a hypothesis set. An algorithm $A$ is an agnostic PAC-algorithm if there exists a polynomial function $poly(\cdot,\cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$ we have
$$P\left(R(h_S) - \min_{h \in H} R(h) < \epsilon\right) \geq 1 - \delta$$
for every sample of size $m \geq poly\left(\frac{1}{\epsilon}, \frac{1}{\delta}\right)$ and for all probability distributions $D$ over $X \times Y$. If $A$ runs in time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$, then $A$ is an efficient agnostic PAC-algorithm.

Bayes Error and Noise

Definition. Given a distribution $D$ over $X \times Y$, the Bayes error $R^*$ is
$$R^* = \inf\{R(h) \mid h \text{ is measurable}\}.$$
A hypothesis $h$ such that $R(h) = R^*$ is called a Bayes hypothesis and is denoted by $h_{Bayes}$.

Bayes Error and Noise

- in the deterministic case $R^* = 0$;
- in the stochastic case we may have $R^* \neq 0$;
- using conditional probabilities, the Bayes hypothesis can be defined by
  $$(\forall x)\quad h_{Bayes}(x) = \operatorname{argmax}_{y \in \{0,1\}} P(y \mid x),$$
  which means that the class $y$ is the most probable class a posteriori, that is, after seeing the data $x$;
- the average error made by $h_{Bayes}$ on $x$ is $\min\{P(1 \mid x), P(0 \mid x)\}$.

Bayes Error and Noise

Definition. Given a distribution $D$, the noise at $x$ is
$$noise(x) = \min\{P(1 \mid x), P(0 \mid x)\}.$$
The average noise is $E(noise(x))$.

The average noise is the Bayes error: $E(noise(x)) = R^*$. The noise indicates the level of difficulty of the learning task. A point $x \in X$ with $noise(x) = 0.5$ is said to be noisy.
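
The following sketch makes these definitions concrete on a toy finite distribution over $X \times \{0,1\}$ (the distribution itself is my illustration, not taken from the slides): it computes $h_{Bayes}$ pointwise and checks that the Bayes error equals the average noise.

    # Joint probabilities P(x, y) for a toy distribution on {a, b, c} x {0, 1}.
    joint = {
        ("a", 1): 0.30, ("a", 0): 0.10,
        ("b", 1): 0.15, ("b", 0): 0.25,
        ("c", 1): 0.05, ("c", 0): 0.15,
    }

    h_bayes = {}
    bayes_error = 0.0
    for x in {x for (x, _) in joint}:
        p_x = joint[(x, 0)] + joint[(x, 1)]          # marginal P(x)
        p1_given_x = joint[(x, 1)] / p_x             # conditional P(1 | x)
        h_bayes[x] = 1 if p1_given_x >= 0.5 else 0   # argmax_y P(y | x)
        bayes_error += p_x * min(p1_given_x, 1 - p1_given_x)   # E[noise(x)]
    print(h_bayes, round(bayes_error, 3))

Here $h_{Bayes}$ predicts 1 only on the point $a$, and the Bayes error is $R^* = E(noise(x)) = 0.3$.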

Bayes Error and Noise

Estimation and Approximation Errors

- $R(h)$: the error of hypothesis $h$;
- $R^* = \inf\{R(h) \mid h \text{ is measurable}\}$ is the Bayes error;
- $h^*$ is the hypothesis in $H$ with minimal error (the best-in-class hypothesis). It always exists when $H$ is finite; if this is not the case, instead of $R(h^*)$ we can use $\inf_{h \in H} R(h)$.

By definition, $R(h) \geq R(h^*) \geq R^*$.

Bayes Error and Noise

Since $R(h) \geq R(h^*) \geq R^*$, we can define:

- the estimation error $R(h) - R(h^*)$; it depends on the hypothesis $h$ selected;
- the approximation error $R(h^*) - R^*$; it measures how well the Bayes error can be approximated using $H$.

Then
$$R(h) - R^* = (R(h) - R(h^*)) + (R(h^*) - R^*).$$

Bayes Error and Noise

Empirical Risk Minimization (ERM)

Definition. An algorithm that returns a hypothesis $h_S^{ERM}$ with the smallest empirical error $\hat{R}(h)$ is said to be an ERM algorithm.

We have
\begin{align*}
R(h_S^{ERM}) - R(h^*)
&= (R(h_S^{ERM}) - \hat{R}(h_S^{ERM})) + (\hat{R}(h_S^{ERM}) - R(h^*))\\
&\leq (R(h_S^{ERM}) - \hat{R}(h_S^{ERM})) + (\hat{R}(h^*) - R(h^*))\\
&\leq 2 \sup_{h \in H} |\hat{R}(h) - R(h)|,
\end{align*}
where the first inequality uses $\hat{R}(h_S^{ERM}) \leq \hat{R}(h^*)$, which holds because $h_S^{ERM}$ minimizes the empirical error.

Note that:

- since $h^*$ is the hypothesis in $H$ with minimal error (the best-in-class hypothesis), $R(h^*)$ decreases as $H$ grows;
- the uniform bound $\sup_{h \in H} |\hat{R}(h) - R(h)| \leq \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$ increases with $|H|$.
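
A small end-to-end sketch of ERM over a finite class (the threshold hypotheses, the 20% label noise, and all parameter values are my illustration, not from the slides): it selects the empirical risk minimizer on a sample and verifies the inequality $R(h_S^{ERM}) - R(h^*) \leq 2\sup_{h \in H}|\hat{R}(h) - R(h)|$ numerically.

    import random

    random.seed(2)
    X = range(10)        # uniform marginal over {0, ..., 9}
    NOISE = 0.2          # each label is flipped with probability 0.2

    def clean_label(x):
        return 1 if x >= 5 else 0

    def sample_point():
        x = random.randrange(10)
        y = clean_label(x) if random.random() > NOISE else 1 - clean_label(x)
        return x, y

    # Finite hypothesis class: threshold rules h_t(x) = 1 iff x >= t.
    H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]

    def true_risk(h):
        # Under this known distribution, h errs on x with probability NOISE if it
        # agrees with the clean label and 1 - NOISE otherwise.
        return sum(NOISE if h(x) == clean_label(x) else 1 - NOISE for x in X) / len(X)

    S = [sample_point() for _ in range(200)]

    def empirical_risk(h):
        return sum(h(x) != y for x, y in S) / len(S)

    h_erm = min(H, key=empirical_risk)     # the ERM hypothesis h_S^ERM
    h_star = min(H, key=true_risk)         # the best-in-class hypothesis h*
    sup_gap = max(abs(empirical_risk(h) - true_risk(h)) for h in H)
    print("estimation error:", round(true_risk(h_erm) - true_risk(h_star), 3))
    print("2 * sup gap     :", round(2 * sup_gap, 3))

By construction the printed estimation error can never exceed twice the supremum gap; the derivation above is exactly this computation.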