Probably Approximately Correct (PAC) Learning

ECE901 Spring 2004: Statistical Regularization and Learning Theory                Lecture: 6
Lecturer: Rob Nowak                                                      Scribe: Badri Narayan

1 Introduction

1.1 Overview of the Learning Problem

The fundamental problem in learning from data is proper model selection. As we have seen in the previous lectures, a model that is too complex can overfit the training data (causing a large estimation error), while a model that is too simple can be a poor approximation of the function we are trying to estimate (causing a large approximation error). The estimation error arises because we do not know the true joint distribution of the data over the input and output spaces; we therefore minimize the empirical risk, which is a random quantity depending on the data, and estimate the average risk from the limited number of training samples we have. The approximation error measures how well the functions in the chosen model space can approximate the true dependence of the output on the input, and in general it improves as the size of the model space increases.

1.2 Lecture Outline

In the preceding lectures we looked at some proposed solutions to the overfitting problem and investigated in detail the method of sieves, which makes the complexity of the model space depend on the number of data samples. We now refine this idea and propose that we may do better if we allow the complexity of the model space to adapt to the distribution of the training data, rather than just the number of samples. The premise is that such a choice of model space reduces the probability of overfitting, since we only choose a model that is no more complex than what the data can reliably support. Based on these ideas, the second part of the lecture introduces a learning model called Probably Approximately Correct (PAC) learning, in which we derive bounds on the estimation error for a given choice of model space and training data.

2 Recap: Method of Sieves

The method of sieves underpinned our approaches to the denoising problem and the histogram classification problem. Recall that the basic idea is to define a sequence of model spaces $\mathcal{F}_1, \mathcal{F}_2, \dots$ of increasing complexity, and then, given the training data $\{X_i, Y_i\}_{i=1}^n$, select a model according to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_n(f).$$

The choice of the model space $\mathcal{F}_n$ (and hence the model complexity and structure) is determined completely by the sample size $n$ and does not depend on the values of the training data. This is a major limitation of the sieve method: in a nutshell, it tells us to average the data in a certain way (e.g., over a partition of $\mathcal{X}$) based on the sample size alone, independent of the sample values.
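To make this concrete, here is a minimal sketch of a sieve-style rule in code (illustrative only, not from the lecture): the partition used for averaging, here roughly $n^{1/3}$ histogram bins on $[0,1]$, is fixed by the sample size alone, before the data values are examined.

```python
import numpy as np

def sieve_histogram_classifier(X, Y):
    """Sieve-style rule: model complexity (number of bins) is a fixed function
    of the sample size n only, chosen before looking at the data values."""
    n = len(X)
    m = max(1, int(round(n ** (1 / 3))))     # illustrative choice: F_n = histograms with ~n^(1/3) bins
    edges = np.linspace(0.0, 1.0, m + 1)     # partition of X = [0,1] determined by n alone
    bins = np.clip(np.digitize(X, edges) - 1, 0, m - 1)
    # Empirical risk minimization within F_n reduces to a per-bin majority vote.
    votes, counts = np.zeros(m), np.zeros(m)
    for b, y in zip(bins, Y):
        votes[b] += y
        counts[b] += 1
    labels = (votes >= counts / 2.0).astype(int)   # ties and empty bins default to label 1

    def predict(x):
        b = np.clip(np.digitize(x, edges) - 1, 0, m - 1)
        return labels[b]
    return predict

# Usage on synthetic data with X uniform on [0,1] and Y = 1{X > 0.5}:
rng = np.random.default_rng(0)
X = rng.uniform(size=200)
Y = (X > 0.5).astype(int)
f_hat = sieve_histogram_classifier(X, Y)
print(np.mean(f_hat(X) != Y))   # empirical risk of the selected histogram rule
```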

Learning basically comprises two things:

1. Averaging data to reduce variability
2. Deciding where (or how) to average to reduce bias

Sieves force us to deal with (2) a priori, before we analyze the training data. This leads, in general, to suboptimal classifiers and estimators. Indeed, (2) is the really interesting and fundamental aspect of learning; once we know how to deal with (2), we have essentially solved the learning problem. There are at least two possibilities for breaking the rigidity of the method of sieves, as we shall see in the following section.

3 Data-Dependent Model Selection

1. Structural Risk Minimization
2. Complexity Regularization

3.1 Structural Risk Minimization

The basic idea is to select $\mathcal{F}_n$ based on the training data themselves. Let $\mathcal{F}_1, \mathcal{F}_2, \dots$ be a sequence of model spaces of increasing complexity, with

$$\lim_{k \to \infty} \, \inf_{f \in \mathcal{F}_k} R(f) = R^*.$$

Let

$$\hat{f}_{n,k} = \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f),$$

a function from $\mathcal{F}_k$ that minimizes the empirical risk. This gives us a sequence of selected models $\hat{f}_{n,1}, \hat{f}_{n,2}, \dots$ Also associate with each set $\mathcal{F}_k$ a value $C_{n,k} > 0$ that measures the complexity or size of the set $\mathcal{F}_k$. Typically $C_{n,k}$ is monotonically increasing in $k$ (since the sets are of increasing complexity) and decreasing in $n$ (since we become more confident with more training data). A typical example is the use of the VC dimension to characterize the complexity of the collection of model spaces, i.e., $C_{n,k}$ is derived from a bound on the estimation error.

3.2 Complexity Regularization

In this case, to every $f \in \mathcal{F}$ assign a complexity $C_n(f)$ (e.g., a code length) and select

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + C_n(f) \right\}.$$

Complexity regularization and structural risk minimization (SRM) are very similar, differing only in how one measures complexity, and can be equivalent in certain instances. The key point of SRM and complexity regularization techniques is that the complexity and structure of the model are not fixed prior to examining the data; the data aid in the selection of the best complexity. In fact, the key difference from the method of sieves is that these methods allow the data to play an integral role in deciding where and how to average the data.
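The selection rule that combines the pieces of Section 3.1 is not written out above; a minimal sketch of the usual form, choosing the class index $k$ by minimizing empirical risk plus the class penalty $C_{n,k}$, is given below (the function names and the way classes and penalties are passed in are assumptions for illustration).

```python
import numpy as np

def empirical_risk(f, X, Y):
    """Empirical 0/1 risk of classifier f on the training sample."""
    return np.mean(f(X) != Y)

def srm_select(model_classes, penalties, X, Y):
    """Structural risk minimization sketch.

    model_classes[k] is a finite list of candidate classifiers playing the role of F_{k+1},
    with complexity increasing in k; penalties[k] plays the role of C_{n,k}.
    Returns the classifier whose class best balances empirical fit against complexity.
    """
    best_f, best_score = None, np.inf
    for F_k, C_nk in zip(model_classes, penalties):
        # ERM within the class gives f_hat_{n,k} ...
        f_hat_k = min(F_k, key=lambda f: empirical_risk(f, X, Y))
        # ... and the data-dependent choice of k trades off fit and penalty.
        score = empirical_risk(f_hat_k, X, Y) + C_nk
        if score < best_score:
            best_f, best_score = f_hat_k, score
    return best_f
```

Complexity regularization differs only in that the penalty $C_n(f)$ is attached to each individual $f$ rather than to a whole class, so the loop above collapses into a single penalized minimization over $\mathcal{F}$.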

4 Probably Approximately Correct (PAC) Learning

In this section we develop the theory of the PAC learning model, which relates the sample size $n$ and the complexity of the model class to the accuracy $\varepsilon$ and the confidence $1 - \delta$ with which the estimation error can be bounded.

4.1 Approximation and Estimation Errors

In order to develop complexity regularization schemes we will need to revisit the estimation error / approximation error trade-off. Let $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$ for some space of models $\mathcal{F}$. Then

$$E[R(\hat{f}_n)] - R^* = \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}} \;+\; \underbrace{E[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}}.$$

The approximation error depends on how close the optimal rule $f^*$ is to the model space $\mathcal{F}$, and without making assumptions about $f^*$ this is unknown. The estimation error is quantifiable and depends on the complexity (size) of $\mathcal{F}$; it measures how much we can trust the empirical risk minimization process to select a model close to the best in the given class. The error decomposition is illustrated in Figure 1.

Figure 1: Relationship between the errors.

In SRM, we have a sequence of classes $\mathcal{F}_1, \mathcal{F}_2, \dots$ of increasing complexity. We bound the estimation error in each class,

$$E[R(\hat{f}_{n,k})] - \inf_{f \in \mathcal{F}_k} R(f) \le C(n, \mathcal{F}_k),$$

and then use $C(n, \mathcal{F}_k)$, $k = 1, 2, \dots$, to help us avoid overfitting to the training data. That is, we select a model from a class $\mathcal{F}_k$ that is neither too simple nor too complex, one that balances the empirical risk and the estimation error bound in an appropriate way.

4.2 The PAC Learning Model (Valiant, 1984)

The estimation error will be small if $R(\hat{f}_n)$ is close to $\inf_{f \in \mathcal{F}} R(f)$. PAC learning expresses this as follows: we want $\hat{f}_n$ to be a probably approximately correct model from $\mathcal{F}$. Formally, $\hat{f}_n$ is PAC if, for given $\varepsilon > 0$ and $\delta > 0$,

$$P\!\left( R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) > \varepsilon \right) < \delta.$$

This says that the difference between $R(\hat{f}_n)$ and $\inf_{f \in \mathcal{F}} R(f)$ exceeds $\varepsilon$ with probability less than $\delta$. Using a PAC bound of this form, one can easily bound the expected estimation error $E[R(\hat{f}_n)] - \inf_{f \in \mathcal{F}} R(f)$, as we will see. Sometimes, especially in computer science jargon, PAC bounds are stated as: with probability at least $1 - \delta$,

$$R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \le \varepsilon.$$
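As a quick illustration of how a PAC bound yields an expectation bound (a sketch, not from the notes): for the $0/1$ loss the excess risk $Z = R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)$ lies in $[0, 1]$, so

$$E[Z] = E\!\left[ Z \,\mathbf{1}\{Z \le \varepsilon\} \right] + E\!\left[ Z \,\mathbf{1}\{Z > \varepsilon\} \right] \le \varepsilon + P(Z > \varepsilon) \le \varepsilon + \delta.$$

Sharper expectation bounds, such as Corollary 1 below, integrate the tail probability instead of using this crude split.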

To introduce PAC bounds, let us consider a simple case. Let $\mathcal{F}$ consist of a finite number of models, and let $|\mathcal{F}|$ denote that number. Furthermore, assume that $\min_{f \in \mathcal{F}} R(f) = 0$.

Example 1 Let $\mathcal{F}$ be the set of all histogram classifiers with $M$ bins, so that $|\mathcal{F}| = 2^M$. The assumption $\min_{f \in \mathcal{F}} R(f) = 0$ means that some classifier in $\mathcal{F}$ has zero probability of error.

Theorem 1 Assume $|\mathcal{F}| < \infty$ and $\min_{f \in \mathcal{F}} R(f) = 0$, where $R(f) = P(f(X) \neq Y)$. Then for every $n$ and every $\varepsilon > 0$,

$$P\!\left( R(\hat{f}_n) > \varepsilon \right) \le |\mathcal{F}| \, e^{-n\varepsilon} \equiv \delta.$$

Proof: Since $\min_{f \in \mathcal{F}} R(f) = 0$, it follows that $\hat{R}_n(\hat{f}_n) = 0$. In fact, there may be several $f \in \mathcal{F}$ such that $\hat{R}_n(f) = 0$; let $G = \{f : \hat{R}_n(f) = 0\}$, so that $\hat{f}_n \in G$. Then

$$P(R(\hat{f}_n) > \varepsilon) \le P\!\left( \exists f \in G : R(f) > \varepsilon \right) = P\!\left( \bigcup_{f \in \mathcal{F}: R(f) > \varepsilon} \{\hat{R}_n(f) = 0\} \right) \le \sum_{f \in \mathcal{F}: R(f) > \varepsilon} P(\hat{R}_n(f) = 0) \le |\mathcal{F}| (1 - \varepsilon)^n,$$

since for each $f$ with $R(f) > \varepsilon$ we have $P(f(X) \neq Y) > \varepsilon$, i.e., $P(f(X) = Y) < 1 - \varepsilon$, so the probability that none of the $n$ training samples falls into the set $\{(X, Y) : f(X) \neq Y\}$ is less than $(1 - \varepsilon)^n$. Finally, apply the inequality $1 - x \le e^{-x}$ to obtain $|\mathcal{F}|(1 - \varepsilon)^n \le |\mathcal{F}|\, e^{-n\varepsilon}$.

Note that for $n$ sufficiently large, $\delta = |\mathcal{F}|\, e^{-n\varepsilon}$ is arbitrarily small. To achieve an $(\varepsilon, \delta)$-PAC bound for desired $\varepsilon > 0$ and $\delta > 0$, we require at least

$$n = \frac{\log |\mathcal{F}| + \log(1/\delta)}{\varepsilon}$$

training examples.

Corollary 1 Assume that $|\mathcal{F}| < \infty$ and $\min_{f \in \mathcal{F}} R(f) = 0$. Then for every $n$,

$$E[R(\hat{f}_n)] \le \frac{1 + \log |\mathcal{F}|}{n}.$$
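A small numerical illustration of Theorem 1 and of the sample-complexity formula, using the histogram class of Example 1 (the values of $M$, $n$, $\varepsilon$, and $\delta$ below are arbitrary choices, not from the notes):

```python
import math

def pac_delta(num_models, n, eps):
    """Theorem 1: bound on P(R(f_hat_n) > eps), namely |F| * exp(-n * eps)."""
    return num_models * math.exp(-n * eps)

def sample_complexity(num_models, eps, delta):
    """Smallest n with |F| * exp(-n * eps) <= delta, i.e. n >= (log|F| + log(1/delta)) / eps."""
    return math.ceil((math.log(num_models) + math.log(1.0 / delta)) / eps)

# Example 1: histogram classifiers with M bins, so |F| = 2^M.
M = 20
size_F = 2 ** M
print(pac_delta(size_F, n=2000, eps=0.01))              # ~0.002: bound on the failure probability
print(sample_complexity(size_F, eps=0.01, delta=0.05))  # 1686 samples for a (0.01, 0.05)-PAC guarantee
```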

Proof: Recall that for any non-negative random variable $Z$ with finite mean,

$$E[Z] = \int_0^\infty P(Z > t)\, dt.$$

This follows from an application of integration by parts. Then, for any $u > 0$,

$$E[R(\hat{f}_n)] = \int_0^\infty P(R(\hat{f}_n) > t)\, dt = \underbrace{\int_0^u P(R(\hat{f}_n) > t)\, dt}_{\le\, u} + \int_u^\infty P(R(\hat{f}_n) > t)\, dt \le u + \int_u^\infty |\mathcal{F}|\, e^{-nt}\, dt = u + \frac{|\mathcal{F}|}{n}\, e^{-nu}.$$

Minimizing with respect to $u$ produces the smallest upper bound at $u = \frac{\log |\mathcal{F}|}{n}$, which gives

$$E[R(\hat{f}_n)] \le \frac{1 + \log |\mathcal{F}|}{n}.$$
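As a closing sanity check (not part of the notes), the bound of Theorem 1 can be compared against a Monte Carlo estimate on a toy finite class that satisfies its assumptions: threshold classifiers on $[0, 1]$ whose grid contains the true threshold, so that $\min_{f \in \mathcal{F}} R(f) = 0$. All constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite class F: threshold rules f_t(x) = 1{x > t} with t on a grid that contains the
# true threshold 0.5, so min_{f in F} R(f) = 0 as Theorem 1 requires.
thresholds = np.linspace(0.0, 1.0, 21)          # |F| = 21
n, eps, trials = 100, 0.05, 2000

def true_risk(t):
    # X ~ Uniform[0,1], Y = 1{X > 0.5}, so R(f_t) = P(1{X > t} != 1{X > 0.5}) = |t - 0.5|.
    return abs(t - 0.5)

failures = 0
for _ in range(trials):
    X = rng.uniform(size=n)
    Y = (X > 0.5).astype(int)
    emp_risks = [np.mean((X > t).astype(int) != Y) for t in thresholds]
    t_hat = thresholds[int(np.argmin(emp_risks))]    # empirical risk minimizer f_hat_n
    if true_risk(t_hat) > eps:
        failures += 1

print("Monte Carlo estimate of P(R(f_hat_n) > eps):", failures / trials)
print("Theorem 1 bound |F| * exp(-n * eps):        ", len(thresholds) * np.exp(-n * eps))
```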