Understanding Generalization Error: Bounds and Decompositions


CIS 520: Machine Learning, Spring 2018: Lecture 11

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. (They may or may not cover all the material discussed in the lecture, and vice versa.)

Outline

- Introduction
- Generalization error bounds
  - Finite function classes
  - Infinite function classes: VC dimension
- Estimation-approximation error decomposition
- Bias-variance decomposition

1 Introduction

In many learning algorithms, we have some flexibility in choosing a model complexity parameter: the degree of a polynomial kernel; the number of hidden nodes in a neural network; the depth of a decision tree; the number of neighbors in nearest neighbor methods; and so on. We have seen that as the model complexity increases, the training error generally decreases, but the generalization (or test) error generally has a U shape: it is high for models of low complexity, decreases until the model complexity matches the unknown data distribution, and then becomes high again for models of higher complexity:

[Figure: training error and generalization (test) error as a function of model complexity.]

Models of low complexity tend to underfit the data: they are not flexible enough to adequately describe patterns in the data. On the other hand, models of high complexity tend to overfit the data: they are so flexible that they fit themselves not only to broad patterns, but also to various types of spurious noise in the particular training data, and so do not generalize well. In general, our goal is to select a model complexity parameter that leads to neither underfitting nor overfitting, i.e. that leads to low generalization error. The challenge, of course, is that the right model complexity depends on the unknown data distribution, and so must also be estimated from the data itself. This is known as the model selection problem.

So far, we have relied on cross-validation as a means to estimate the generalization error for various model complexities, and to thereby solve the model selection problem. However, cross-validation has two disadvantages: (1) it requires making the model selection decision based on training on a smaller number of data points than actually available (since some points need to be held out for validation purposes); (2) it requires training several models for each model complexity parameter under consideration, and is therefore computationally expensive. Wouldn't it be nice if, for each model complexity parameter under consideration, we could just train a model once on the full training data available, and somehow estimate the generalization error from the training error of the resulting model?

In this lecture, we have two goals. First, we will introduce the notion of generalization error bounds. These give bounds on the generalization error of a learned model in terms of its training error. There are several types of generalization error bounds that make use of different properties of the learning algorithm and/or data involved. We will describe the simplest type, which is a uniform convergence bound based on the capacity of the function class searched by an algorithm; in doing so, we will also introduce the Vapnik-Chervonenkis (VC) dimension, which is one widely studied measure of the capacity of a (binary-valued) function class. In practice, most generalization error bounds, particularly those that hold for all data distributions (such as the ones we will discuss here), are quite loose, and would require a very large training sample in order to actually provide useful estimates of the generalization error. However, even when they are loose, these bounds can often be useful for model selection purposes.

Second, we will try to better understand some of the components that contribute to the overall generalization error. In particular, we will try to formalize our intuition about underfitting and overfitting by considering two types of decompositions of the generalization error: a decomposition based on notions of estimation error and approximation error, and a decomposition based on notions of bias and variance. These decompositions are useful in understanding various practices in machine learning and when/why they can be helpful: for example, the estimation-approximation error decomposition is useful in motivating the practice of structural risk minimization, and the bias-variance decomposition is useful in understanding when/why the practice of bootstrap aggregation (bagging) can be helpful.

The broad notions we will discuss are applicable in many learning settings, but to keep things concrete, we will focus our discussion mostly on binary classification (under 0-1 loss). The main exception will be when we discuss the bias-variance decomposition, which is most naturally discussed in the context of regression (under squared loss).

2 Generalization Error Bounds

In this section, our goal is to provide bounds on the generalization error of a learned model in terms of its training error. As discussed above, we will focus here on binary classification under 0-1 loss, although the broad ideas apply more generally. Let $D$ be a probability distribution on $\mathcal{X} \times \{\pm 1\}$, and let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing $m$ labeled examples drawn iid from $D$. Suppose we learn a binary classifier $h_S : \mathcal{X} \to \{\pm 1\}$ from $S$, and observe its training error:

$$\hat{\mathrm{er}}^{0\text{-}1}_S[h_S] = \frac{1}{m} \sum_{i=1}^m \mathbf{1}\big(h_S(x_i) \neq y_i\big).$$

Our goal is to obtain bounds on the generalization error of $h_S$:

$$\mathrm{er}^{0\text{-}1}_D[h_S] = \mathbb{E}_{(X,Y) \sim D}\big[\mathbf{1}\big(h_S(X) \neq Y\big)\big].$$

Since the training error $\hat{\mathrm{er}}^{0\text{-}1}_S[h_S]$ is calculated using the same sample $S$ from which the model $h_S$ is learned, it is typically smaller than the generalization error $\mathrm{er}^{0\text{-}1}_D[h_S]$. A generalization error bound provides a high-confidence bound on the difference $\mathrm{er}^{0\text{-}1}_D[h_S] - \hat{\mathrm{er}}^{0\text{-}1}_S[h_S]$ (one-sided bound) or on the absolute difference $|\mathrm{er}^{0\text{-}1}_D[h_S] - \hat{\mathrm{er}}^{0\text{-}1}_S[h_S]|$ (two-sided bound).

As noted above, there are many types of generalization error bounds that make use of different properties of the learning algorithm used and/or the data distribution involved. We will consider here the simplest type of bound, which will depend only on the function class $H$ searched by the algorithm (e.g. $H$ could be the class of linear classifiers, or the class of quadratic classifiers, etc).

We start with the following classical concentration inequality, which gives a high-confidence bound on the deviation of the fraction of times a biased coin comes up heads from its expected value:

Theorem 1 (Hoeffding's inequality for iid Bernoulli random variables). Let $X_1, \ldots, X_m$ be iid Bernoulli random variables with parameter $p$, and let $\bar{X} = \frac{1}{m} \sum_{i=1}^m X_i$. Then for any $\epsilon > 0$,

$$P\big(\bar{X} - p \geq \epsilon\big) \leq e^{-2m\epsilon^2} \quad \text{and} \quad P\big(p - \bar{X} \geq \epsilon\big) \leq e^{-2m\epsilon^2}.$$

Applying Hoeffding's inequality to a fixed classifier $h$ that is independent of the sample $S$, it is easy to see that for any $\epsilon > 0$,¹

$$P_{S \sim D^m}\Big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\Big) \leq e^{-2m\epsilon^2}.$$

Equivalently, for any $0 < \delta \leq 1$, by setting $e^{-2m\epsilon^2} = \delta$ and solving for $\epsilon$, we have that with probability at least $1 - \delta$,

$$\mathrm{er}^{0\text{-}1}_D[h] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h] + \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

However, this reasoning does not apply to the learned classifier $h_S$, since it depends on the training sample $S$.² In order to obtain a bound on the generalization error of $h_S$, we need to do a little more work. To provide some intuition, we start by discussing the case when $h_S$ is learned from a finite function class $H$; we'll then discuss the more general (and more realistic) case when $H$ can be infinite.

¹ To see this, set $X_i = \mathbf{1}(h(x_i) \neq y_i)$; then the $X_i$'s are iid Bernoulli random variables with parameter $p = \mathrm{er}^{0\text{-}1}_D[h]$.

² In particular, the random variables $\mathbf{1}(h_S(x_i) \neq y_i)$ are not independent, since they depend on the full sample $S$.
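To see Hoeffding's inequality in action, here is a small simulation sketch (our addition, not part of the original notes; the values of $p$, $m$, and $\epsilon$ are arbitrary choices) that estimates the one-sided deviation probability for a biased coin and compares it with the bound $e^{-2m\epsilon^2}$:

```python
# Check Hoeffding's inequality by simulation: for iid Bernoulli(p) variables,
# the fraction of trials in which the sample mean exceeds p by at least eps
# should not exceed exp(-2*m*eps^2).
import numpy as np

rng = np.random.default_rng(0)
p, m, eps, n_trials = 0.3, 200, 0.05, 100_000

# Draw n_trials independent samples of size m and compute each sample mean.
means = rng.binomial(m, p, size=n_trials) / m

empirical = np.mean(means - p >= eps)   # estimated P(X_bar - p >= eps)
hoeffding = np.exp(-2 * m * eps**2)     # Hoeffding upper bound

print(f"empirical deviation probability: {empirical:.4f}")
print(f"Hoeffding bound:                 {hoeffding:.4f}")
```

On typical runs the empirical deviation probability is well below the bound, consistent with the remark above that such distribution-free bounds tend to be conservative.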

2.1 Finite Function Classes

Consider first the case when the function class $H$ from which $h_S$ is learned is finite. In this case, we can apply Hoeffding's inequality to each classifier $h$ in $H$ separately, and then use the union bound to obtain the following uniform bound that holds simultaneously for all classifiers in $H$:

Theorem 2 (Uniform error bound for finite $H$). Let $H$ be finite. Then for any $\epsilon > 0$,

$$P_{S \sim D^m}\Big(\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq |H|\, e^{-2m\epsilon^2}.$$

Proof. We have,

$$\begin{aligned}
P_{S \sim D^m}\Big(\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big)
&= P_{S \sim D^m}\Big(\bigcup_{h \in H}\big\{\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\big\}\Big) \\
&\leq \sum_{h \in H} P_{S \sim D^m}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h] \geq \epsilon\big), \ \text{by the union bound} \\
&\leq \sum_{h \in H} e^{-2m\epsilon^2}, \ \text{by Hoeffding's inequality} \\
&= |H|\, e^{-2m\epsilon^2}.
\end{aligned}$$

In other words, for any $0 < \delta \leq 1$, we have that with probability at least $1 - \delta$,

$$\max_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \leq \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}.$$

Since the above bound holds uniformly for all classifiers in $H$, it follows that it holds in particular for the classifier $h_S$ selected by a learning algorithm. Therefore we have the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a finite function class $H$: with probability at least $1 - \delta$,

$$\mathrm{er}^{0\text{-}1}_D[h_S] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h_S] + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}.$$

The bound becomes smaller when the number of training examples $m$ increases, or when the confidence parameter $\delta$ is loosened (increased). The quantity $\ln|H|$ here acts as a measure of the capacity of the function class $H$: as the capacity of $H$ increases (the algorithm has more flexibility to search over a larger function class), the guarantee on the difference between the generalization error and training error becomes weaker (the bound becomes larger).
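The finite-class bound is easy to evaluate numerically. The following sketch (our addition; the values of $m$, $|H|$, and $\delta$ are arbitrary choices) computes the width of the confidence term $\sqrt{(\ln|H| + \ln(1/\delta))/(2m)}$, illustrating that it grows only logarithmically in $|H|$ but shrinks as $1/\sqrt{m}$:

```python
# Width of the uniform-convergence confidence term for a finite class:
#   er_D[h_S] <= er_hat_S[h_S] + sqrt((ln|H| + ln(1/delta)) / (2m)).
import math

def finite_class_bound(size_H, m, delta=0.05):
    """Confidence-term width for a finite class of size size_H."""
    return math.sqrt((math.log(size_H) + math.log(1 / delta)) / (2 * m))

# The width depends only logarithmically on |H| but as 1/sqrt(m) on m:
for m in [100, 1000, 10000]:
    for size_H in [10, 10**6]:
        print(f"m = {m:5d}, |H| = {size_H:7d}: "
              f"bound width = {finite_class_bound(size_H, m):.3f}")
```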

2.2 Infinite Function Classes and VC Dimension

In practice, most learning algorithms we have seen learn a classifier from an infinite function class $H$. In this case, we cannot use $\ln|H|$ to measure the capacity of $H$; we need a different notion. One such widely used notion is the Vapnik-Chervonenkis (VC) dimension of a class of binary-valued functions $H$.

Definition (Shattering and VC dimension). Let $H$ be a class of $\{\pm 1\}$-valued functions on $\mathcal{X}$. We say a set of $m$ points $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is shattered by $H$ if all possible $2^m$ binary labelings of the points can be realized by functions in $H$. The VC dimension of $H$, denoted by $\mathrm{VCdim}(H)$, is the cardinality of the largest set of points in $\mathcal{X}$ that can be shattered by $H$. If $H$ shatters arbitrarily large sets of points in $\mathcal{X}$, then $\mathrm{VCdim}(H) = \infty$.

As an example, consider the class of linear classifiers of the form $h(x) = \mathrm{sign}(w^\top x + b)$ in 2 dimensions, $\mathcal{X} = \mathbb{R}^2$. Figure 1 shows a set of 3 points in $\mathbb{R}^2$ that are shattered by this class. Moreover, it can be verified that no set of 4 points is shattered by this class. Therefore the VC dimension of the class of linear classifiers in $\mathbb{R}^2$ is 3. More generally, the VC dimension of the class of linear classifiers in $d$ dimensions, $\mathcal{X} = \mathbb{R}^d$, is known to be $d + 1$.

Figure 1: Three points in $\mathbb{R}^2$ that can be shattered using linear classifiers.

For any function class $H$ that has finite VC dimension, we have the following uniform bound that holds simultaneously for all classifiers in $H$:

Theorem 3 (Uniform error bound for general $H$). Let $\mathrm{VCdim}(H)$ be finite. Then for any $\epsilon > 0$,³

$$P_{S \sim D^m}\Big(\sup_{h \in H}\big(\mathrm{er}^{0\text{-}1}_D[h] - \hat{\mathrm{er}}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq 4\,(2em)^{\mathrm{VCdim}(H)}\, e^{-m\epsilon^2/8}.$$

The proof of this result involves advanced machinery which we will not discuss here. For our purposes, this yields the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a function class $H$ of finite VC dimension: with probability at least $1 - \delta$,

$$\mathrm{er}^{0\text{-}1}_D[h_S] \leq \hat{\mathrm{er}}^{0\text{-}1}_S[h_S] + \sqrt{\frac{8\Big(\mathrm{VCdim}(H)\big(\ln(2m) + 1\big) + \ln(4/\delta)\Big)}{m}}.$$

As before, the bound becomes smaller when the number of training examples $m$ increases, or when the confidence parameter $\delta$ is loosened (increased). The capacity of the function class $H$ is now measured by $\mathrm{VCdim}(H)$.

The above bound is distribution-free, in that it holds for any distribution $D$. This is both a strength and a weakness: it is a strength since it does not require any assumptions on $D$, but it is also a weakness since it means the bound must hold even for the worst-case distribution, and will therefore be loose for most distributions. There are various other types of generalization error bounds that can be tighter than simple capacity-based uniform bounds: some that are distribution-free but that involve data-dependent capacity/complexity measures (e.g. Rademacher complexities); others that are distribution-free, but rather than giving a uniform bound for all functions in a class, bound the error of the learned classifier directly by using other properties of the learning algorithm (e.g. algorithmic stability); and yet others that require assumptions on the distribution.

In general, for small/moderate sample sizes $m$, most generalization error bounds are too loose to be used as absolute estimates of the generalization error. However, if the bounds are such that they accurately track the relative behavior of the generalization error across different function classes/algorithms, then they can be useful for model selection. For example, to use the VC dimension based bound above for model selection, one would train classifiers on the given training data from different function classes, compare the VC dimension based upper bounds on the generalization errors of the learned classifiers at some suitable confidence level $\delta$, and then select the classifier with the smallest value of this upper bound.

³ Note that there are tighter versions of this bound; we state a basic version here for simplicity.
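As a concrete illustration of this model selection recipe, the following sketch (our addition) compares hypothetical learned classifiers from three nested classes; the training errors are invented numbers, and the VC dimensions correspond to linear, quadratic, and cubic classifiers over $\mathbb{R}^{10}$ (the dimension of the corresponding polynomial feature space plus one):

```python
# VC-bound-based model selection: for each candidate class, add the VC bound
# width to the observed training error and pick the smallest total.
import math

def vc_bound_width(d_vc, m, delta=0.05):
    """Width of the (basic) VC uniform bound stated in Theorem 3."""
    return math.sqrt(8 * (d_vc * (math.log(2 * m) + 1) + math.log(4 / delta)) / m)

m = 10_000
# (training error, VC dimension); the training errors are made-up numbers.
candidates = {"linear": (0.25, 11), "quadratic": (0.18, 66), "cubic": (0.15, 286)}

for name, (train_err, d_vc) in candidates.items():
    width = vc_bound_width(d_vc, m)
    print(f"{name:9s}: training error {train_err:.2f} + bound width {width:.2f} "
          f"= upper bound {train_err + width:.2f}")
```

Note that for the larger classes the bound width exceeds 1, so the absolute guarantees are vacuous; the hope, as discussed above, is that the relative ordering of the upper bounds still tracks the relative ordering of the true generalization errors.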

3 Estimation-Approximation Error Decomposition

Some insight into the generalization error of a classifier $h_S$ learned from a function class $H$ can be obtained by decomposing it as follows:

$$\mathrm{er}^{0\text{-}1}_D[h_S] = \underbrace{\Big(\mathrm{er}^{0\text{-}1}_D[h_S] - \inf_{h \in H} \mathrm{er}^{0\text{-}1}_D[h]\Big)}_{\text{Estimation error in } H} + \underbrace{\Big(\inf_{h \in H} \mathrm{er}^{0\text{-}1}_D[h] - \mathrm{er}^{0\text{-}1,*}_D\Big)}_{\text{Approximation error of } H} + \underbrace{\mathrm{er}^{0\text{-}1,*}_D}_{\text{Irreducible Bayes error}} \tag{1}$$

Recall that the Bayes error $\mathrm{er}^{0\text{-}1,*}_D$ is the smallest possible generalization error over all possible classifiers; it is an irreducible error associated with the distribution $D$, sometimes also called the noise intrinsic to $D$. The approximation error of $H$ measures how far the best classifier in $H$ is from the Bayes optimal classifier; it is a property of the function class $H$. The estimation error measures how far the learned classifier $h_S$ is from the best classifier in $H$; this is a property of the learning algorithm, and depends on the training sample $S$ (for a good learning algorithm, one would expect that the estimation error would become smaller with increasing sample size $m$).

In general, there is a tradeoff between the estimation error and approximation error. Indeed, for a fixed training sample size $m$, as the model complexity (here, complexity of the function class $H$) increases, we would expect the approximation error to decrease, and the estimation error to increase (see Figure 2). Thus high approximation error is associated with underfitting; on the other hand, high estimation error is associated with overfitting.

Figure 2: For a fixed sample size, as model complexity increases, the approximation error decreases, while the estimation error increases. A high value of either contributes to a high generalization error: high approximation error is associated with underfitting; high estimation error is associated with overfitting.

For a learning algorithm to be statistically consistent, i.e. for its generalization error to converge to the Bayes error as $m \to \infty$, it is clear that we must find a way to make both the estimation error and the approximation error converge to zero. This is typically done via structural risk minimization, where one allows the function class $H$ to grow with the sample size $m$ (so that the approximation error goes to zero), but does so slowly enough that one can still estimate a good function in the class (so that the estimation error also goes to zero). We will come back to this at the end of the course.
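The decomposition in Eq. (1) can be computed exactly on a synthetic example where the distribution $D$ is known. The following sketch (our addition; the particular choice of $D$ and of the threshold class $H$ is invented for illustration) uses $X \sim \mathrm{Uniform}(0,1)$ with $\eta(x) = P(Y = +1 \mid X = x)$ equal to 0.9 on (0.25, 0.75) and 0.1 elsewhere, and $H$ the class of threshold classifiers $h_t(x) = \mathrm{sign}(x - t)$; since the Bayes classifier is not a threshold, the approximation error is nonzero:

```python
# Estimation-approximation decomposition on a fully known distribution D,
# with H = threshold classifiers h_t(x) = sign(x - t) on X ~ Uniform(0,1).
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 2001)

def eta(x):
    # eta(x) = P(Y = +1 | X = x)
    return np.where((x > 0.25) & (x < 0.75), 0.9, 0.1)

def true_error(t):
    # er_D[h_t] = int_0^t eta(x) dx + int_t^1 (1 - eta(x)) dx
    return np.trapz(np.where(grid < t, eta(grid), 1 - eta(grid)), grid)

bayes_error = np.trapz(np.minimum(eta(grid), 1 - eta(grid)), grid)  # = 0.1
best_in_class = min(true_error(t) for t in grid)                    # inf over H

# ERM from a finite sample: pick the threshold with smallest training error.
m = 100
x = rng.uniform(0, 1, m)
y = np.where(rng.uniform(0, 1, m) < eta(x), 1, -1)

def train_err(t):
    # training error of h_t (predict +1 iff x >= t)
    return np.mean(np.where(x < t, y == 1, y == -1))

t_hat = min(grid, key=train_err)
er_hS = true_error(t_hat)

print(f"generalization error of h_S: {er_hS:.3f}")
print(f"estimation error           : {er_hS - best_in_class:.3f}")
print(f"approximation error        : {best_in_class - bayes_error:.3f}")
print(f"Bayes error                : {bayes_error:.3f}")
```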

4 Bias-Variance Decomposition

Another type of decomposition that is often useful in analyzing generalization error is the bias-variance decomposition. In this case, it is most natural to discuss this decomposition in the context of regression under squared loss. Therefore, in this section, we let $D$ be a probability distribution on $\mathcal{X} \times \mathbb{R}$, and let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing $m$ labeled examples drawn iid from $D$. We denote by $f_S : \mathcal{X} \to \mathbb{R}$ the regression model learned by an algorithm from $S$, and denote the training and generalization errors of $f_S$ as follows:

$$\hat{\mathrm{er}}^{\mathrm{sq}}_S[f_S] = \frac{1}{m} \sum_{i=1}^m \big(y_i - f_S(x_i)\big)^2$$

$$\mathrm{er}^{\mathrm{sq}}_D[f_S] = \mathbb{E}_{(X,Y) \sim D}\big[(Y - f_S(X))^2\big].$$

The bias-variance decomposition aims to understand the average or expected generalization error of the models $f_S$ that would result if we trained an algorithm on several different training samples $S$. The analysis below applies both if we consider the full expectation over all samples $S$ drawn from $D^m$, and if we consider an average over some finite number of random samples $S$; in both cases, we will simply write $\mathbb{E}_S$ to denote this expectation. Our goal, then, is to understand the behavior of $\mathbb{E}_S[\mathrm{er}^{\mathrm{sq}}_D[f_S]]$.

In order to analyze the average generalization error $\mathbb{E}_S[\mathrm{er}^{\mathrm{sq}}_D[f_S]]$, it will be useful to also introduce an average model $\bar{f} : \mathcal{X} \to \mathbb{R}$, whose prediction on an instance $x$ is obtained by simply averaging over the predictions of the individually trained models $f_S$:

$$\bar{f}(x) = \mathbb{E}_S[f_S(x)].$$

Then the average generalization error can be decomposed as follows:

$$\mathbb{E}_S\big[\mathrm{er}^{\mathrm{sq}}_D[f_S]\big] = \underbrace{\mathbb{E}_X\Big[\mathbb{E}_S\big[(f_S(X) - \bar{f}(X))^2\big]\Big]}_{\text{Variance}} + \underbrace{\mathbb{E}_X\big[(\bar{f}(X) - f^*(X))^2\big]}_{\text{Bias}^2} + \underbrace{\mathrm{er}^{\mathrm{sq},*}_D}_{\text{Irreducible error}}$$

Recall that $\mathrm{er}^{\mathrm{sq},*}_D = \mathbb{E}_X[\mathrm{Var}[Y \mid X]]$ is the irreducible error or intrinsic noise associated with $D$, and that $f^*(x) = \mathbb{E}[Y \mid X = x]$ is the optimal regression model. The (squared) bias term measures how far the average model $\bar{f}$ is from the optimal model $f^*$. The variance term measures how much, on average, a model $f_S$ learned from a particular random sample $S$ bounces around the average model $\bar{f}$.

Again, there is a tradeoff between the bias and variance terms: for a fixed training sample size $m$, as the model complexity increases, we would expect the bias term to decrease, and the variance term to increase (see Figure 3). Thus high bias is associated with underfitting; on the other hand, high variance is associated with overfitting.

Figure 3: For a fixed sample size, as model complexity increases, the bias typically decreases, while the variance typically increases. A high value of either contributes to a high (average) generalization error: high bias is associated with underfitting; high variance is associated with overfitting.

Note that here, the notions of bias and variance apply to an algorithm, not necessarily to a function class; so for example, it is possible for two algorithms that both search the same function class to have different bias and variance properties.
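The bias-variance decomposition can be estimated empirically by training on many independent samples. The following sketch (our addition; the data-generating process and polynomial degrees are arbitrary choices) fits least-squares polynomials of several degrees to data from $y = \sin(2\pi x) + \epsilon$ and estimates the variance and squared bias terms, illustrating the tradeoff in Figure 3:

```python
# Empirical bias-variance decomposition for polynomial least-squares regression
# on synthetic data: y = sin(2*pi*x) + eps, eps ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
sigma, m, n_samples = 0.3, 30, 500
x_test = np.linspace(0, 1, 100)
f_star = np.sin(2 * np.pi * x_test)          # f*(x) = E[Y | X = x]

for degree in [1, 3, 9]:
    preds = np.empty((n_samples, len(x_test)))
    for s in range(n_samples):               # train on n_samples independent S
        x = rng.uniform(0, 1, m)
        y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, m)
        preds[s] = np.polyval(np.polyfit(x, y, degree), x_test)
    f_bar = preds.mean(axis=0)                # average model f_bar = E_S[f_S]
    variance = np.mean(preds.var(axis=0))     # E_X[ E_S[(f_S(X) - f_bar(X))^2] ]
    bias_sq = np.mean((f_bar - f_star) ** 2)  # E_X[ (f_bar(X) - f*(X))^2 ]
    print(f"degree {degree}: variance = {variance:.4f}, bias^2 = {bias_sq:.4f}, "
          f"irreducible = {sigma**2:.4f}, "
          f"avg gen. error = {variance + bias_sq + sigma**2:.4f}")
```

On typical runs, degree 1 shows high bias and low variance (underfitting), while degree 9 shows low bias and high variance (overfitting), with an intermediate degree minimizing the average generalization error.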

The variance term is related to the stability of an algorithm: an algorithm with high variance has low stability, in the sense that changing the training sample $S$ a little can produce a very different model $f_S$. The practice of bootstrap aggregation (bagging), where one creates multiple randomly selected bootstrap samples from a given training sample $S$ and aggregates (averages) the models learned from the various bootstrap samples, can be viewed as a practice aimed at reducing variance (see the simulation sketch at the end of these notes). This is especially useful in reducing the error of algorithms that otherwise have high variance, such as decision tree learning algorithms (indeed, random forests, which apply bagging and random feature selection procedures to decision trees, often have improved performance over algorithms that learn a single decision tree).

Unlike the estimation-approximation error decomposition, whose correctness can be verified by simple visual inspection, the bias-variance decomposition needs a little work to derive. It is easiest to first show the decomposition for a fixed instance $x$, and then take expectations over $x$; in particular, it can be shown that for any fixed $x$,

$$\mathbb{E}_S\Big[\mathbb{E}_{Y|X=x}\big[(f_S(x) - Y)^2\big]\Big] = \underbrace{\mathbb{E}_S\big[(f_S(x) - \bar{f}(x))^2\big]}_{\text{Variance at } x} + \underbrace{(\bar{f}(x) - f^*(x))^2}_{\text{Bias}^2 \text{ at } x} + \underbrace{\mathrm{Var}[Y \mid X = x]}_{\text{Irreducible error at } x}$$

We leave the details as an exercise for the reader.

Exercise. Show that the bias-variance decomposition is correct. (To do this, first establish that the decomposition shown above for a fixed instance $x$ is correct, and then take expectations over $x$.)

Acknowledgments

Thanks to Achintya Kundu for help in preparing Figure 1 (as part of scribing a previous lecture by the instructor).
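Finally, here is the simulation sketch referenced from the bagging discussion above (our addition, not part of the original notes; the 1-nearest-neighbor learner and the data-generating process are arbitrary choices), illustrating the variance reduction achieved by bagging a high-variance regressor:

```python
# Variance reduction from bagging, using a high-variance learner (1-nearest-
# neighbor regression) on synthetic data y = sin(2*pi*x) + noise. We compare
# the average excess error E_X[(f_S(X) - f*(X))^2] of a single model against
# that of a bagged ensemble, averaged over many training samples S.
import numpy as np

rng = np.random.default_rng(0)

def sample_data(m):
    x = rng.uniform(0, 1, m)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, m)
    return x, y

def predict_1nn(x_train, y_train, x_test):
    # 1-NN prediction: each test point copies the label of its nearest neighbor.
    idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def predict_bagged(x_train, y_train, x_test, n_bags=25):
    # Average the predictions of 1-NN models fit on bootstrap resamples.
    preds = []
    for _ in range(n_bags):
        b = rng.integers(0, len(x_train), len(x_train))  # bootstrap indices
        preds.append(predict_1nn(x_train[b], y_train[b], x_test))
    return np.mean(preds, axis=0)

x_test = np.linspace(0, 1, 200)
f_star = np.sin(2 * np.pi * x_test)        # optimal regression model f*

err_single, err_bagged = [], []
for _ in range(200):                        # average over 200 training samples S
    x_tr, y_tr = sample_data(50)
    err_single.append(np.mean((predict_1nn(x_tr, y_tr, x_test) - f_star) ** 2))
    err_bagged.append(np.mean((predict_bagged(x_tr, y_tr, x_test) - f_star) ** 2))

print(f"avg excess error, single 1-NN: {np.mean(err_single):.3f}")
print(f"avg excess error, bagged 1-NN: {np.mean(err_bagged):.3f}")
```

Averaging over bootstrap-resampled models smooths out the sample-to-sample fluctuations of the individual 1-NN fits, reducing the variance term while leaving the bias largely unchanged, which is the mechanism described above.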
