Understanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning, Spring 2018: Lecture 11

Understanding Generalization Error: Bounds and Decompositions

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline:
- Introduction
- Generalization error bounds
  - Finite function classes
  - Infinite function classes: VC dimension
- Estimation-approximation error decomposition
- Bias-variance decomposition

1 Introduction

In many learning algorithms, we have some flexibility in choosing a model complexity parameter: the degree of a polynomial kernel; the number of hidden nodes in a neural network; the depth of a decision tree; the number of neighbors in nearest neighbor methods; and so on. We have seen that as the model complexity increases, the training error generally decreases, but the generalization (or test) error generally has a U shape: it is high for models of low complexity, decreases until the model complexity matches the unknown data distribution, and then becomes high again for models of higher complexity.
Models of low complexity tend to underfit the data: they are not flexible enough to adequately describe patterns in the data. On the other hand, models of high complexity tend to overfit the data: they are so flexible that they fit themselves not only to broad patterns, but also to various types of spurious noise in the particular training data, and so do not generalize well. In general, our goal is to select a model complexity parameter that leads to neither underfitting nor overfitting, i.e. that leads to low generalization error. The challenge, of course, is that the right model complexity depends on the unknown data distribution, and so must also be estimated from the data itself. This is known as the model selection problem.

So far, we have relied on cross-validation as a means to estimate the generalization error for various model complexities, and to thereby solve the model selection problem. However, cross-validation has two disadvantages: (1) it requires making the model selection decision based on training on a smaller number of data points than actually available (since some points need to be held out for validation purposes); (2) it requires training several models for each model complexity parameter under consideration, and is therefore computationally expensive. Wouldn't it be nice if, for each model complexity parameter under consideration, we could just train a model once on the full training data available, and somehow estimate the generalization error from the training error of the resulting model?
In this lecture, we have two goals. First, we will introduce the notion of generalization error bounds. These give bounds on the generalization error of a learned model in terms of its training error. There are several types of generalization error bounds that make use of different properties of the learning algorithm and/or data involved. We will describe the simplest type, which is a uniform convergence bound based on the capacity of the function class searched by an algorithm; in doing so, we will also introduce the Vapnik-Chervonenkis (VC) dimension, which is one widely studied measure of the capacity of a (binary-valued) function class. In practice, most generalization error bounds, particularly those that hold for all data distributions (such as the ones we will discuss here), are quite loose, and would require a very large training sample in order to actually provide useful estimates of the generalization error. However, even when they are loose, these bounds can often be useful for model selection purposes.

Second, we will try to better understand some of the components that contribute to the overall generalization error. In particular, we will try to formalize our intuition about underfitting and overfitting by considering two types of decompositions of the generalization error: a decomposition based on notions of estimation error and approximation error, and a decomposition based on notions of bias and variance. These decompositions are useful in understanding various practices in machine learning and when/why they can be helpful: for example, the estimation-approximation error decomposition is useful in motivating the practice of structural risk minimization, and the bias-variance decomposition is useful in understanding when/why the practice of bootstrap aggregation (bagging) can be helpful. The broad notions we will discuss are applicable in many learning settings, but to keep things concrete, we will focus our discussion mostly on binary classification (under 0-1 loss).
The main exception will be when we discuss the bias-variance decomposition, which is most naturally discussed in the context of regression (under squared loss).

2 Generalization Error Bounds

In this section, our goal is to provide bounds on the generalization error of a learned model in terms of its training error. As discussed above, we will focus here on binary classification under 0-1 loss, although the broad ideas apply more generally.

Let $D$ be a probability distribution on $\mathcal{X} \times \{\pm 1\}$, and let $S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing $m$ labeled examples drawn i.i.d. from $D$. Suppose we learn a binary classifier $h_S : \mathcal{X} \to \{\pm 1\}$ from $S$, and observe its training error:

$$\hat{er}^{0\text{-}1}_S[h_S] = \frac{1}{m} \sum_{i=1}^m \mathbf{1}(h_S(x_i) \neq y_i).$$

Our goal is to obtain bounds on the generalization error of $h_S$:

$$er^{0\text{-}1}_D[h_S] = \mathbf{E}_{(X,Y) \sim D}\big[\mathbf{1}(h_S(X) \neq Y)\big].$$

Since the training error $\hat{er}^{0\text{-}1}_S[h_S]$ is calculated using the same sample $S$ from which the model $h_S$ is learned, it is typically smaller than the generalization error $er^{0\text{-}1}_D[h_S]$. A generalization error bound provides a high confidence bound on the difference $er^{0\text{-}1}_D[h_S] - \hat{er}^{0\text{-}1}_S[h_S]$ (one-sided bound) or on the absolute difference $|er^{0\text{-}1}_D[h_S] - \hat{er}^{0\text{-}1}_S[h_S]|$ (two-sided bound). As noted above, there are many types of generalization error bounds that make use of different properties of the learning algorithm used and/or the data distribution involved. We will consider here the simplest type of bound, which will depend only on the function class $H$ searched by the algorithm (e.g. $H$ could be the class of linear classifiers, or the class of quadratic classifiers, etc).

We start with the following classical concentration inequality, which gives a high confidence bound on the deviation of the fraction of times a biased coin comes up heads from its expected value:

Theorem 1 (Hoeffding's inequality for i.i.d. Bernoulli random variables). Let $X_1, \ldots, X_m$ be i.i.d. Bernoulli random variables with parameter $p$, and let $\bar{X} = \frac{1}{m} \sum_{i=1}^m X_i$. Then for any $\epsilon > 0$,

$$\mathbf{P}(\bar{X} - p \geq \epsilon) \leq e^{-2m\epsilon^2} \quad \text{and} \quad \mathbf{P}(p - \bar{X} \geq \epsilon) \leq e^{-2m\epsilon^2}.$$

Applying Hoeffding's inequality to a fixed classifier $h$ that is independent of the sample $S$, it is easy to see that for any $\epsilon > 0$,¹

$$\mathbf{P}_{S \sim D^m}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h] \geq \epsilon\big) \leq e^{-2m\epsilon^2}.$$

Equivalently, for any $0 < \delta \leq 1$, by setting $e^{-2m\epsilon^2} = \delta$ and solving for $\epsilon$, we have that with probability at least $1 - \delta$,

$$er^{0\text{-}1}_D[h] \leq \hat{er}^{0\text{-}1}_S[h] + \sqrt{\frac{\ln(1/\delta)}{2m}}.$$

However, this reasoning does not apply to the learned classifier $h_S$, since it depends on the training sample $S$.² In order to obtain a bound on the generalization error of $h_S$, we need to do a little more work. To provide some intuition, we start by discussing the case when $h_S$ is learned from a finite function class $H$; we'll then discuss the more general (and more realistic) case when $H$ can be infinite.

2.1 Finite Function Classes

Consider first the case when the function class $H$ from which $h_S$ is learned is finite. In this case, we can apply Hoeffding's inequality to each classifier $h$ in $H$ separately, and then can use the union bound to obtain the following uniform bound that holds simultaneously for all classifiers in $H$:

¹ To see this, set $X_i = \mathbf{1}(h(x_i) \neq y_i)$; then the $X_i$'s are i.i.d. Bernoulli random variables with parameter $p = er^{0\text{-}1}_D[h]$.
² In particular, the random variables $\mathbf{1}(h_S(x_i) \neq y_i)$ are not independent, since they depend on the full sample $S$.
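The tail probability in Theorem 1 can be checked by simulation. The following sketch (with an arbitrary, illustrative choice of $p$, $m$, and $\epsilon$) estimates the probability that the empirical mean of Bernoulli draws exceeds its expectation by $\epsilon$, and compares it to the Hoeffding bound $e^{-2m\epsilon^2}$:

```python
# Monte Carlo check of Hoeffding's inequality for i.i.d. Bernoulli(p) variables.
# The constants p, m, eps, and trials are illustrative choices, not from the notes.
import math
import random

random.seed(0)

p, m, eps, trials = 0.3, 200, 0.1, 10000
bound = math.exp(-2 * m * eps ** 2)  # Hoeffding's one-sided tail bound

exceed = 0
for _ in range(trials):
    xbar = sum(random.random() < p for _ in range(m)) / m  # empirical mean
    if xbar - p >= eps:
        exceed += 1

empirical = exceed / trials
print(f"P(Xbar - p >= {eps}) ~ {empirical:.4f}  <=  bound {bound:.4f}")
assert empirical <= bound + 0.01  # bound should hold up to Monte Carlo noise
```

As expected, the empirical tail probability sits well below the bound; the Hoeffding bound is valid for every $p$, so for any particular $p$ it is typically conservative.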
Theorem 2 (Uniform error bound for finite $H$). Let $H$ be finite. Then for any $\epsilon > 0$,

$$\mathbf{P}_{S \sim D^m}\Big(\max_{h \in H}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq |H|\, e^{-2m\epsilon^2}.$$

Proof. We have,

$$\mathbf{P}_{S \sim D^m}\Big(\max_{h \in H}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) = \mathbf{P}_{S \sim D^m}\Big(\bigcup_{h \in H}\big\{er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h] \geq \epsilon\big\}\Big)$$
$$\leq \sum_{h \in H} \mathbf{P}_{S \sim D^m}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h] \geq \epsilon\big), \quad \text{by the union bound}$$
$$\leq \sum_{h \in H} e^{-2m\epsilon^2}, \quad \text{by Hoeffding's inequality}$$
$$= |H|\, e^{-2m\epsilon^2}.$$

In other words, for any $0 < \delta \leq 1$, we have that with probability at least $1 - \delta$,

$$\max_{h \in H}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h]\big) \leq \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}.$$

Since the above bound holds uniformly for all classifiers in $H$, it follows that it holds in particular for the classifier $h_S$ selected by a learning algorithm. Therefore we have the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a finite function class $H$: with probability at least $1 - \delta$,

$$er^{0\text{-}1}_D[h_S] \leq \hat{er}^{0\text{-}1}_S[h_S] + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}.$$

The bound becomes smaller when the number of training examples $m$ increases, or when the confidence parameter $\delta$ is loosened (increased). The quantity $\ln|H|$ here acts as a measure of the capacity of the function class $H$: as the capacity of $H$ increases (the algorithm has more flexibility to search over a larger function class), the guarantee on the difference between the generalization error and training error becomes weaker (the bound becomes larger).

2.2 Infinite Function Classes and VC Dimension

In practice, most learning algorithms we have seen learn a classifier from an infinite function class $H$. In this case, we cannot use $\ln|H|$ to measure the capacity of $H$; we need a different notion. One such widely used notion is the Vapnik-Chervonenkis (VC) dimension of a class of binary-valued functions $H$.

Definition (Shattering and VC dimension). Let $H$ be a class of $\{\pm 1\}$-valued functions on $\mathcal{X}$. We say a set of $m$ points $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is shattered by $H$ if all possible $2^m$ binary labelings of the points can be realized by functions in $H$. The VC dimension of $H$, denoted by $\text{VCdim}(H)$, is the cardinality of the largest set of points in $\mathcal{X}$ that can be shattered by $H$. If $H$ shatters arbitrarily large sets of points in $\mathcal{X}$, then $\text{VCdim}(H) = \infty$.

As an example, consider the class of linear classifiers of the form $h(x) = \text{sign}(w^\top x + b)$ in 2 dimensions, $\mathcal{X} = \mathbb{R}^2$. Figure 1 shows a set of 3 points in $\mathbb{R}^2$ that are shattered by this class. Moreover, it can be verified that no set of 4 points is shattered by this class. Therefore the VC dimension of the class of linear classifiers in $\mathbb{R}^2$ is 3. More generally, the VC dimension of the class of linear classifiers in $d$ dimensions, $\mathcal{X} = \mathbb{R}^d$, is known to be $d + 1$.
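Shattering of 3 points can also be verified computationally. The sketch below uses an illustrative set of 3 non-collinear points (a hypothetical choice, not necessarily the configuration in Figure 1) and runs the perceptron algorithm on each of the $2^3 = 8$ labelings; since the perceptron finds a separating linear classifier whenever one exists, realizing all 8 labelings confirms that the set is shattered by linear classifiers in $\mathbb{R}^2$:

```python
# Brute-force shattering check for 3 points under linear classifiers
# h(x) = sign(w.x + b) in R^2. The points are an illustrative choice.
import itertools

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # non-collinear

def linearly_separable(pts, labels, max_epochs=1000):
    """Perceptron search: True if some (w, b) realizes the given +/-1 labeling."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(pts, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified or on boundary
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # converged: separating classifier found
            return True
    return False

realized = sum(linearly_separable(points, labels)
               for labels in itertools.product([-1, +1], repeat=3))
print(f"{realized} of 8 labelings realized")  # prints "8 of 8 labelings realized"
assert realized == 8  # all 2^3 labelings realized => the 3 points are shattered
```

The same search with an iteration cap cannot by itself certify non-separability, so verifying that no 4 points are shattered takes a genuine separability test (e.g. a linear program) over all 16 labelings; the positive direction shown here is the easy half.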
Figure 1: Three points in $\mathbb{R}^2$ that can be shattered using linear classifiers.

For any function class $H$ that has finite VC dimension, we have the following uniform bound that holds simultaneously for all classifiers in $H$:

Theorem 3 (Uniform error bound for general $H$). Let $\text{VCdim}(H)$ be finite. Then for any $\epsilon > 0$,³

$$\mathbf{P}_{S \sim D^m}\Big(\sup_{h \in H}\big(er^{0\text{-}1}_D[h] - \hat{er}^{0\text{-}1}_S[h]\big) \geq \epsilon\Big) \leq 4\,(2em)^{\text{VCdim}(H)}\, e^{-m\epsilon^2/8}.$$

The proof of this result involves advanced machinery which we will not discuss here. For our purposes, this yields the following generalization error bound for the classifier $h_S$ learned by any algorithm that searches a function class $H$ of finite VC dimension: with probability at least $1 - \delta$,

$$er^{0\text{-}1}_D[h_S] \leq \hat{er}^{0\text{-}1}_S[h_S] + \sqrt{\frac{8\big(\text{VCdim}(H)(\ln(2m) + 1) + \ln(4/\delta)\big)}{m}}.$$

As before, the bound becomes smaller when the number of training examples $m$ increases, or when the confidence parameter $\delta$ is loosened (increased). The capacity of the function class $H$ is now measured by $\text{VCdim}(H)$.

The above bound is distribution-free, in that it holds for any distribution $D$. This is both a strength and a weakness: it is a strength since it does not require any assumptions on $D$, but it is also a weakness since it means the bound must hold even for the worst-case distribution, and will therefore be loose for most distributions. There are various other types of generalization error bounds that can be tighter than simple capacity-based uniform bounds: some that are distribution-free but that involve data-dependent capacity/complexity measures (e.g. Rademacher complexities); others that are distribution-free, but rather than giving a uniform bound for all functions in a class, bound the error of the learned classifier directly by using other properties of the learning algorithm (e.g. algorithmic stability); and yet others that require assumptions on the distribution.

In general, for small/moderate sample sizes $m$, most generalization error bounds are too loose to be used as absolute estimates of the generalization error. However, if the bounds are such that they accurately track the relative behavior of the generalization error across different function classes/algorithms, then they can be useful for model selection. For example, to use the VC dimension based bound above for model selection, one would train classifiers on the given training data from different function classes, compare the VC dimension based upper bounds on the generalization errors of the learned classifiers at some suitable confidence level $\delta$, and then select the classifier with the smallest value of this upper bound.

³ Note that there are tighter versions of this bound; we state a basic version here for simplicity.
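This model selection recipe can be sketched in a few lines. The training errors and VC dimensions below are made-up illustrative numbers (not from any real experiment); the point is only to show the bound trading off training error against capacity:

```python
# Sketch of VC-bound-based model selection. All inputs are hypothetical.
import math

def vc_bound_term(d, m, delta):
    """Complexity term of the VC generalization bound used in these notes:
    sqrt(8 * (d * (ln(2m) + 1) + ln(4/delta)) / m)."""
    return math.sqrt(8 * (d * (math.log(2 * m) + 1) + math.log(4 / delta)) / m)

m, delta = 100000, 0.05
# (name, training error, VC dimension) for each candidate function class;
# e.g. linear classifiers in R^2 have VC dimension 3, the rest are illustrative.
candidates = [("linear", 0.12, 3), ("quadratic", 0.07, 6), ("cubic", 0.065, 10)]

bounds = {name: err + vc_bound_term(d, m, delta) for name, err, d in candidates}
best = min(bounds, key=bounds.get)
for name, ub in bounds.items():
    print(f"{name:10s} upper bound on er_D: {ub:.4f}")
print("selected:", best)
```

With these illustrative numbers the quadratic class wins: the cubic class has slightly lower training error, but its larger complexity term more than cancels the gain, which is exactly the overfitting penalty the bound is meant to encode.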
Figure 2: For a fixed sample size, as model complexity increases, the approximation error decreases, while the estimation error increases. A high value of either contributes to a high generalization error: high approximation error is associated with underfitting; high estimation error is associated with overfitting.

3 Estimation-Approximation Error Decomposition

Some insight into the generalization error of a classifier $h_S$ learned from a function class $H$ can be obtained by decomposing it as follows:

$$er^{0\text{-}1}_D[h_S] = \underbrace{\Big(er^{0\text{-}1}_D[h_S] - \inf_{h \in H} er^{0\text{-}1}_D[h]\Big)}_{\text{Estimation error in } H} + \underbrace{\Big(\inf_{h \in H} er^{0\text{-}1}_D[h] - er^{0\text{-}1,*}_D\Big)}_{\text{Approximation error of } H} + \underbrace{er^{0\text{-}1,*}_D}_{\text{Irreducible Bayes error}} \qquad (1)$$

Recall that the Bayes error $er^{0\text{-}1,*}_D$ is the smallest possible generalization error over all possible classifiers; it is an irreducible error associated with the distribution $D$, sometimes also called the noise intrinsic to $D$. The approximation error of $H$ measures how far the best classifier in $H$ is from the Bayes optimal classifier; it is a property of the function class $H$. The estimation error measures how far the learned classifier $h_S$ is from the best classifier in $H$; this is a property of the learning algorithm, and depends on the training sample $S$ (for a good learning algorithm, one would expect that the estimation error would become smaller with increasing sample size $m$).

In general, there is a tradeoff between the estimation error and approximation error. Indeed, for a fixed training sample size $m$, as the model complexity (here, complexity of the function class $H$) increases, we would expect the approximation error to decrease, and the estimation error to increase (see Figure 2). Thus high approximation error is associated with underfitting; on the other hand, high estimation error is associated with overfitting.

For a learning algorithm to be statistically consistent, i.e. for its generalization error to converge to the Bayes error as $m \to \infty$, it is clear that we must find a way to make both the estimation error and the approximation error converge to zero. This is typically done via structural risk minimization, where one allows the function class $H$ to grow with the sample size $m$ (so that the approximation error goes to zero), but does so slowly enough that one can still estimate a good function in the class (so that the estimation error also goes to zero). We will come back to this at the end of the course.

4 Bias-Variance Decomposition

Another type of decomposition that is often useful in analyzing generalization error is the bias-variance decomposition. In this case, it is most natural to discuss this decomposition in the context of regression under squared loss. Therefore, in this section, we let $D$ be a probability distribution on $\mathcal{X} \times \mathbb{R}$, and let
$S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim D^m$ be a training sample containing $m$ labeled examples drawn i.i.d. from $D$. We denote by $f_S : \mathcal{X} \to \mathbb{R}$ the regression model learned by an algorithm from $S$, and denote the training and generalization errors of $f_S$ as follows:

$$\hat{er}^{\text{sq}}_S[f_S] = \frac{1}{m} \sum_{i=1}^m \big(y_i - f_S(x_i)\big)^2$$
$$er^{\text{sq}}_D[f_S] = \mathbf{E}_{(X,Y) \sim D}\big[(Y - f_S(X))^2\big]$$

Figure 3: For a fixed sample size, as model complexity increases, the bias typically decreases, while the variance typically increases. A high value of either contributes to a high (average) generalization error: high bias is associated with underfitting; high variance is associated with overfitting.

The bias-variance decomposition aims to understand the average or expected generalization error of the models $f_S$ that would result if we trained an algorithm on several different training samples $S$. The analysis below applies both if we consider the full expectation over all samples $S$ drawn from $D^m$, and if we consider an average over some finite number of random samples $S$; in both cases, we will simply write $\mathbf{E}_S$ to denote this expectation. Our goal, then, is to understand the behavior of $\mathbf{E}_S[er^{\text{sq}}_D[f_S]]$.

In order to analyze the average generalization error $\mathbf{E}_S[er^{\text{sq}}_D[f_S]]$, it will be useful to also introduce an average model $\bar{f} : \mathcal{X} \to \mathbb{R}$, whose prediction on an instance $x$ is obtained by simply averaging over the predictions of the individually trained models $f_S$:

$$\bar{f}(x) = \mathbf{E}_S[f_S(x)].$$

Then the average generalization error can be decomposed as follows:

$$\mathbf{E}_S[er^{\text{sq}}_D[f_S]] = \underbrace{\mathbf{E}_X\big[\mathbf{E}_S\big[(f_S(X) - \bar{f}(X))^2\big]\big]}_{\text{Variance}} + \underbrace{\mathbf{E}_X\big[(\bar{f}(X) - f^*(X))^2\big]}_{\text{Bias}^2} + \underbrace{er^{\text{sq},*}_D}_{\text{Irreducible error}}$$

Recall that $er^{\text{sq},*}_D = \mathbf{E}_X[\text{Var}[Y \mid X]]$ is the irreducible error or intrinsic noise associated with $D$, and that $f^*(x) = \mathbf{E}[Y \mid X = x]$ is the optimal regression model. The (squared) bias term measures how far the average model $\bar{f}$ is from the optimal model $f^*$. The variance term measures how much, on average, a model $f_S$ learned from a particular random sample $S$ bounces around the average model $\bar{f}$.

Again, there is a tradeoff between the bias and variance terms: for a fixed training sample size $m$, as the model complexity increases, we would expect the bias term to decrease, and the variance term to increase (see Figure 3). Thus high bias is associated with underfitting; on the other hand, high variance is associated with overfitting. Note that here, the notions of bias and variance apply to an algorithm, not necessarily to a function class; so for example, it is possible for two algorithms that both search the same function class to have different bias and variance properties.
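The decomposition can also be checked numerically. The sketch below uses a deliberately simple learner, chosen for illustration: $f_S$ always predicts the mean of the training $y$'s, on data drawn as $Y = X + \text{Gaussian noise}$ (an assumed, hypothetical distribution). At a fixed test point, the expected squared error is compared to variance plus squared bias plus irreducible noise; the inner expectation over $Y$ is computed analytically from the known noise level:

```python
# Monte Carlo check of the bias-variance decomposition at a fixed instance x0,
# for the "predict the training mean" learner. All constants are illustrative.
import random
import statistics

random.seed(0)
SIGMA = 0.1            # noise std dev, so Var[Y | X = x] = SIGMA**2
m, n_samples = 50, 20000
x0 = 0.9               # fixed test instance; f*(x0) = E[Y | X = x0] = x0

def draw_sample_ys():
    xs = [random.random() for _ in range(m)]           # X ~ Uniform(0, 1)
    return [x + random.gauss(0.0, SIGMA) for x in xs]  # Y = X + noise

# Train the mean predictor on many independent samples S; its prediction
# at x0 is just the training mean, so we record that directly.
preds = [statistics.mean(draw_sample_ys()) for _ in range(n_samples)]

fbar = statistics.mean(preds)             # average model at x0
variance = statistics.pvariance(preds)    # E_S[(f_S(x0) - fbar)^2]
bias_sq = (fbar - x0) ** 2                # (fbar - f*(x0))^2
irreducible = SIGMA ** 2                  # Var[Y | X = x0]

# Left side: E_S E_{Y|X=x0}[(f_S(x0) - Y)^2], with the Y-expectation done
# analytically: E[(c - Y)^2 | X=x0] = (c - x0)^2 + SIGMA^2.
lhs = statistics.mean((p - x0) ** 2 for p in preds) + irreducible
rhs = variance + bias_sq + irreducible
print(f"lhs = {lhs:.5f}, variance + bias^2 + noise = {rhs:.5f}")
assert abs(lhs - rhs) < 1e-6  # the decomposition holds exactly
```

The mean predictor is badly biased at $x_0 = 0.9$ (its average prediction is near $0.5$), which shows up as a large squared-bias term; the two sides still agree exactly, since the decomposition is an identity, not an approximation.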
The variance term is related to the stability of an algorithm: an algorithm with high variance has low stability, in the sense that changing the training sample $S$ a little can produce a very different model $f_S$. The practice of bootstrap aggregation (bagging), where one creates multiple randomly selected bootstrap samples from a given training sample $S$ and aggregates (averages) the models learned from the various bootstrap samples, can be viewed as a practice aimed at reducing variance. This is especially useful in reducing the error of algorithms that otherwise have high variance, such as decision tree learning algorithms (indeed, random forests, which apply bagging and random feature selection procedures to decision trees, often have improved performance over algorithms that learn a single decision tree).

Unlike the estimation-approximation error decomposition, whose correctness can be verified by simple visual inspection, the bias-variance decomposition needs a little work to derive. It is easiest to first show the decomposition for a fixed instance $x$, and then take expectations over $x$; in particular, it can be shown that for any fixed $x$,

$$\mathbf{E}_S\big[\mathbf{E}_{Y \mid X = x}\big[(f_S(x) - Y)^2\big]\big] = \underbrace{\mathbf{E}_S\big[(f_S(x) - \bar{f}(x))^2\big]}_{\text{Variance at } x} + \underbrace{(\bar{f}(x) - f^*(x))^2}_{\text{Bias}^2 \text{ at } x} + \underbrace{\text{Var}[Y \mid X = x]}_{\text{Irreducible error at } x}$$

We leave the details as an exercise for the reader.

Exercise. Show that the bias-variance decomposition is correct. (To do this, first establish that the decomposition shown above for a fixed instance $x$ is correct, and then take expectations over $x$.)

Acknowledgments. Thanks to Achintya Kundu for help in preparing Figure 1 (as part of scribing a previous lecture by the instructor).
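To illustrate the variance-reduction effect of bagging discussed above, the following sketch compares a deliberately unstable base learner (1-nearest-neighbor regression, an illustrative choice) against its bagged version at a fixed test point; the data distribution and all constants are hypothetical:

```python
# Sketch: bagging reduces the variance of a high-variance base learner.
# Base learner: 1-nearest-neighbor regression. All constants are illustrative.
import random
import statistics

random.seed(0)
m, B, n_trials, x0 = 30, 25, 2000, 0.5

def sample():
    """Draw a training sample S of m points with Y = X + Gaussian noise."""
    xs = [random.random() for _ in range(m)]
    return [(x, x + random.gauss(0.0, 0.3)) for x in xs]

def one_nn(train, x):
    """1-NN regression: predict the y of the nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagged(train, x):
    """Average 1-NN predictions over B bootstrap resamples of the training set."""
    preds = []
    for _ in range(B):
        boot = [random.choice(train) for _ in range(len(train))]
        preds.append(one_nn(boot, x))
    return statistics.mean(preds)

single = [one_nn(sample(), x0) for _ in range(n_trials)]
bag = [bagged(sample(), x0) for _ in range(n_trials)]

var_single = statistics.pvariance(single)
var_bag = statistics.pvariance(bag)
print(f"variance of single 1-NN: {var_single:.4f}, bagged: {var_bag:.4f}")
assert var_bag < var_single  # bagging reduces variance here
```

Averaging over bootstrap resamples effectively smooths 1-NN into a weighted average over several near neighbors, so the prediction at $x_0$ fluctuates less from sample to sample, which is exactly the variance reduction the text attributes to bagging.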
More informationLinear Regression. Machine Learning CSE546 Kevin Jamieson University of Washington. Oct 5, Kevin Jamieson 1
Linear Regression Machine Learning CSE546 Kevin Jamieson University of Washington Oct 5, 2017 1 The regression problem Given past sales data on zillow.com, predict: y = House sale price from x = {# sq.
More informationStatistical and Computational Learning Theory
Statistical and Computational Learning Theory Fundamental Question: Predict Error Rates Given: Find: The space H of hypotheses The number and distribution of the training examples S The complexity of the
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #5 Scribe: Allen(Zhelun) Wu February 19, ). Then: Pr[err D (h A ) > ɛ] δ
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #5 Scribe: Allen(Zhelun) Wu February 19, 018 Review Theorem (Occam s Razor). Say algorithm A finds a hypothesis h A H consistent with
More informationOn Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong
On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality Weiqiang Dong 1 The goal of the work presented here is to illustrate that classification error responds to error in the target probability estimates
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory Problem set 1 Due: Monday, October 10th Please send your solutions to learning-submissions@ttic.edu Notation: Input space: X Label space: Y = {±1} Sample:
More informationComputational Learning Theory. Definitions
Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationVBM683 Machine Learning
VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationMachine Learning 4771
Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines
More informationDecision trees COMS 4771
Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationComputational Learning Theory
Computational Learning Theory Slides by and Nathalie Japkowicz (Reading: R&N AIMA 3 rd ed., Chapter 18.5) Computational Learning Theory Inductive learning: given the training set, a learning algorithm
More informationCS 6375: Machine Learning Computational Learning Theory
CS 6375: Machine Learning Computational Learning Theory Vibhav Gogate The University of Texas at Dallas Many slides borrowed from Ray Mooney 1 Learning Theory Theoretical characterizations of Difficulty
More informationHypothesis Testing and Computational Learning Theory. EECS 349 Machine Learning With slides from Bryan Pardo, Tom Mitchell
Hypothesis Testing and Computational Learning Theory EECS 349 Machine Learning With slides from Bryan Pardo, Tom Mitchell Overview Hypothesis Testing: How do we know our learners are good? What does performance
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationMachine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /
Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical
More informationCognitive Cyber-Physical System
Cognitive Cyber-Physical System Physical to Cyber-Physical The emergence of non-trivial embedded sensor units, networked embedded systems and sensor/actuator networks has made possible the design and implementation
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationSupport Vector Machines for Classification and Regression
CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may
More informationHoldout and Cross-Validation Methods Overfitting Avoidance
Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More information12.1 A Polynomial Bound on the Sample Size m for PAC Learning
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 12: PAC III Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 In this lecture will use the measure of VC dimension, which is a combinatorial
More information10.1 The Formal Model
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationLearning Theory Continued
Learning Theory Continued Machine Learning CSE446 Carlos Guestrin University of Washington May 13, 2013 1 A simple setting n Classification N data points Finite number of possible hypothesis (e.g., dec.
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationGeneralization theory
Generalization theory Chapter 4 T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is) Introduction Suppose you are given the empirical observations, (x 1, y 1 ),..., (x l, y l ) (X Y) l. Consider the
More informationTDT4173 Machine Learning
TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationA first model of learning
A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers
More informationReferences for online kernel methods
References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square
More informationLearning Theory, Overfi1ng, Bias Variance Decomposi9on
Learning Theory, Overfi1ng, Bias Variance Decomposi9on Machine Learning 10-601B Seyoung Kim Many of these slides are derived from Tom Mitchell, Ziv- 1 Bar Joseph. Thanks! Any(!) learner that outputs a
More informationCOMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)
COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationGeneralization Bounds for the Area Under an ROC Curve
Generalization Bounds for the Area Under an ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled and Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 4, 2015 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationCS 543 Page 1 John E. Boon, Jr.
CS 543 Machine Learning Spring 2010 Lecture 05 Evaluating Hypotheses I. Overview A. Given observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over
More informationDecision Tree Learning Lecture 2
Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 1, 2011 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationhttp://imgs.xkcd.com/comics/electoral_precedent.png Statistical Learning Theory CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell Chapter 7 (not 7.4.4 and 7.5)
More informationData Mining und Maschinelles Lernen
Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting
More informationLearning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27, (CS5350/6350) Learning Theory September 27, / 14
Learning Theory Piyush Rai CS5350/6350: Machine Learning September 27, 2011 (CS5350/6350) Learning Theory September 27, 2011 1 / 14 Why Learning Theory? We want to have theoretical guarantees about our
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More information