Class 2 & 3 Overfitting & Regularization


1 Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017

2 Last Class
The goal of Statistical Learning Theory is to find a good estimator $f_n : \mathcal{X} \to \mathcal{Y}$ approximating the lowest expected risk $\inf_{f:\mathcal{X}\to\mathcal{Y}} \mathcal{E}(f)$, where
$\mathcal{E}(f) = \int_{\mathcal{X}\times\mathcal{Y}} \ell(f(x), y)\, d\rho(x, y),$
given only a finite number of (training) examples $(x_i, y_i)_{i=1}^n$ sampled independently from the unknown distribution $\rho$.

3 Last Class: The SLT Wishlist
What does "good estimator" mean?
- Low excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$.
- Consistency. Does $\mathcal{E}(f_n) - \mathcal{E}(f^*) \to 0$ as $n \to +\infty$? In expectation? In probability? With respect to a training set $S = (x_i, y_i)_{i=1}^n$ of points randomly sampled from $\rho$.
- Learning rates. How fast is consistency achieved? Nonasymptotic bounds: finite sample complexity, tail bounds, error bounds...

4 Last Class (Expected Vs Empirical Risk)
Approximate the expected risk of $f : \mathcal{X} \to \mathcal{Y}$ via its empirical risk
$\mathcal{E}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i).$
Expectation: $\mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sqrt{V_f/n}$.
Probability (e.g. using Chebyshev): $P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big) \le \frac{V_f}{n\epsilon^2}$ for all $\epsilon > 0$,
where $V_f = \mathrm{Var}_{(x,y)\sim\rho}\big(\ell(f(x), y)\big)$.
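
To make the Chebyshev bound concrete, here is a small simulation sketch (my own illustrative setup, not from the slides): it fixes a predictor $f$ and a simple distribution $\rho$, estimates $P(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon)$ over repeated training samples, and compares it with $V_f/(n\epsilon^2)$.

```python
# Minimal simulation sketch (illustrative setup, not from the slides):
# compare the empirical deviation probability of E_n(f) from E(f)
# with the Chebyshev bound V_f / (n * eps^2) for the squared loss.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Toy distribution rho: x ~ U[0,1], y = x + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = x + 0.3 * rng.normal(size=n)
    return x, y

f = lambda x: 0.5 * np.ones_like(x)        # a fixed (imperfect) predictor
loss = lambda yp, y: (yp - y) ** 2          # squared loss

# Monte Carlo estimates of E(f) and V_f = Var(loss(f(x), y)).
xs, ys = sample(1_000_000)
ls = loss(f(xs), ys)
E_f, V_f = ls.mean(), ls.var()

n, eps, trials = 200, 0.05, 5000
deviations = 0
for _ in range(trials):
    x, y = sample(n)
    E_n = loss(f(x), y).mean()              # empirical risk on n points
    deviations += abs(E_n - E_f) >= eps

print("empirical P(|E_n - E| >= eps):", deviations / trials)
print("Chebyshev bound V_f/(n eps^2):", V_f / (n * eps ** 2))
```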

5 Last Class (Empirical Risk Minimization)
Idea: if $\mathcal{E}_n$ is a good approximation to $\mathcal{E}$, then we could use
$f_n = \operatorname*{argmin}_{f \in \mathcal{F}} \mathcal{E}_n(f)$
to approximate $f^*$. This is known as empirical risk minimization (ERM).
Note: if we sample the points in $S = (x_i, y_i)_{i=1}^n$ independently from $\rho$, the corresponding $f_n = f_S$ is a random variable and we have
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le \mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)].$
Question: does $\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)]$ go to zero as $n$ increases?

6 Issues with ERM
Assume $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $\rho$ with dense support¹ and $\ell(y, y) = 0$ for all $y \in \mathcal{Y}$. For any set $(x_i, y_i)_{i=1}^n$ s.t. $x_i \ne x_j$ for $i \ne j$, let $f_n : \mathcal{X} \to \mathcal{Y}$ be such that
$f_n(x) = y_i$ if $x = x_i$ for some $i \in \{1, \dots, n\}$, and $f_n(x) = 0$ otherwise.
Then, for any number $n$ of training points:
$\mathbb{E}\,\mathcal{E}_n(f_n) = 0$ while $\mathbb{E}\,\mathcal{E}(f_n) = \mathcal{E}(0)$, which is greater than zero (unless $f^* \equiv 0$).
Therefore $\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)] = \mathcal{E}(0) \not\to 0$ as $n$ increases!
¹ and such that every pair $(x, y)$ has measure zero according to $\rho$.
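
A small numerical illustration of the memorizing estimator above (my own toy setup): its empirical risk is exactly zero on the training set, yet its expected risk (estimated on fresh samples) stays close to $\mathcal{E}(0)$ no matter how large $n$ is.

```python
# Toy illustration of the "memorizing" estimator: zero empirical risk,
# but the expected risk stays at roughly E(0) regardless of n.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

def memorizer(x_train, y_train):
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    # Returns y_i on training points, 0 everywhere else.
    return lambda x: np.array([lookup.get(float(v), 0.0) for v in x])

loss = lambda yp, y: (yp - y) ** 2

for n in [10, 100, 1000]:
    x_tr, y_tr = sample(n)
    f_n = memorizer(x_tr, y_tr)
    emp_risk = loss(f_n(x_tr), y_tr).mean()          # always 0
    x_te, y_te = sample(100_000)                      # fresh data ~ rho
    exp_risk = loss(f_n(x_te), y_te).mean()           # ~ E(0)
    print(f"n={n:5d}  empirical risk={emp_risk:.3f}  expected risk={exp_risk:.3f}")
```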

7 Overfitting
An estimator $f_n$ is said to overfit the training data if for any $n \in \mathbb{N}$:
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] > C$ for a constant $C > 0$, while $\mathbb{E}\,[\mathcal{E}_n(f_n) - \mathcal{E}_n(f^*)] \le 0$.
According to this definition, ERM (over the space of all functions) overfits...

8 ERM on Finite Hypotheses Spaces
Is ERM hopeless? Consider the case $\mathcal{X}$ and $\mathcal{Y}$ finite. Then $\mathcal{F} = \mathcal{Y}^{\mathcal{X}} = \{f : \mathcal{X} \to \mathcal{Y}\}$ is finite as well (albeit possibly large), and therefore
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \mathbb{E}\,\sup_{f \in \mathcal{F}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sum_{f \in \mathcal{F}} \mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le |\mathcal{F}|\,\sqrt{V_{\mathcal{F}}/n},$
where $V_{\mathcal{F}} = \sup_{f \in \mathcal{F}} V_f$ and $|\mathcal{F}|$ denotes the cardinality of $\mathcal{F}$. Then ERM works! Namely:
$\lim_{n \to +\infty} \mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] = 0.$

9 ERM on Finite Hypotheses (Sub)Spaces
The same argument holds in general: let $\mathcal{H} \subseteq \mathcal{F}$ be a finite space of hypotheses. Then
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le |\mathcal{H}|\,\sqrt{V_{\mathcal{H}}/n}.$
In particular, if $f^* \in \mathcal{H}$, then
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le |\mathcal{H}|\,\sqrt{V_{\mathcal{H}}/n}$
and ERM is a good estimator for the problem considered.

10 Example: Threshold Functions
Consider a binary classification problem with $\mathcal{Y} = \{0, 1\}$. Someone has told us that the minimizer of the risk is a threshold function $f_a(x) = 1_{[a, +\infty)}(x)$ with $a \in [-1, 1]$.
[Figure: two threshold functions with thresholds $a$ and $b$.]
We can learn on $\mathcal{H} = \{f_a \mid a \in [-1, 1]\} \cong [-1, 1]$. However, on a computer we can only represent real numbers up to a given precision.

11 Example: Threshold Functions (with precision p)
Discretization: given a $p > 0$, we can consider
$\mathcal{H}_p = \{f_a \mid a \in [-1, 1],\ a\,10^p = [a\,10^p]\},$
with $[a]$ denoting the integer part (i.e. the closest integer) of a scalar $a$. The value $p$ can be interpreted as the precision of our space of functions $\mathcal{H}_p$. Note that $|\mathcal{H}_p| = 2 \cdot 10^p$.
If $f^* \in \mathcal{H}_p$, then we automatically have
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le |\mathcal{H}_p|\,\sqrt{V_{\mathcal{H}}/n} \approx 10^p/\sqrt{n}$
($V_{\mathcal{H}} \le 1$ since $\ell$ is the 0-1 loss and therefore $\ell(f(x), y) \le 1$ for any $f \in \mathcal{H}$).
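
The following sketch (illustrative only; the data-generating distribution and the parameter values are made up) runs ERM over the discretized space $\mathcal{H}_p$ of threshold functions with the 0-1 loss and reports the empirical risk and a Monte Carlo estimate of the expected risk.

```python
# ERM over the discretized threshold class H_p with the 0-1 loss
# (illustrative sketch; the data-generating distribution is made up).
import numpy as np

rng = np.random.default_rng(2)
a_star = 0.3172  # "true" threshold, generally not on the grid

def sample(n, flip=0.1):
    x = rng.uniform(-1, 1, n)
    y = (x >= a_star).astype(int)
    noise = rng.uniform(size=n) < flip          # label noise
    return x, np.where(noise, 1 - y, y)

def erm_threshold(x, y, p):
    """ERM on H_p: thresholds a with a*10^p integer, a in [-1, 1]."""
    grid = np.arange(-10**p, 10**p + 1) / 10**p
    errors = [np.mean((x >= a).astype(int) != y) for a in grid]
    return grid[int(np.argmin(errors))]

n, p = 500, 2
x_tr, y_tr = sample(n)
a_hat = erm_threshold(x_tr, y_tr, p)

x_te, y_te = sample(200_000)                     # Monte Carlo test set
train_err = np.mean((x_tr >= a_hat).astype(int) != y_tr)
test_err = np.mean((x_te >= a_hat).astype(int) != y_te)
print(f"a_hat={a_hat:+.2f}  empirical risk={train_err:.3f}  expected risk={test_err:.3f}")
```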

12 Rates in Expectation Vs Probability
In practice, even for small values of $p$, the bound
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le 10^p/\sqrt{n}$
needs a very large $n$ in order to say something meaningful about the expected error. Interestingly, we can get much better constants (not rates though!) by working in probability...

13 Hoeffding's Inequality
Let $X_1, \dots, X_n$ be independent random variables s.t. $X_i \in [a_i, b_i]$. Let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. Then,
$P\big(|\bar{X} - \mathbb{E}\,\bar{X}| \ge \epsilon\big) \le 2\exp\!\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$

14 Applying Hoeffding's Inequality
Assume that for all $f \in \mathcal{H}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$ the loss is bounded, $|\ell(f(x), y)| \le M$, by some constant $M > 0$. Then, for any $f \in \mathcal{H}$ we have
$P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big) \le 2\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$
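
As a quick consequence, inverting this bound gives the number of samples needed so that a single hypothesis has deviation at most $\epsilon$ with probability $1 - \delta$; the helper below (an illustrative calculation, not from the slides) also shows how the union bound over a finite $\mathcal{H}$ only adds a $\log|\mathcal{H}|$ term.

```python
# Sample sizes implied by Hoeffding + union bound (illustrative helper).
# Single f:  2*exp(-n eps^2/(2 M^2)) <= delta  =>  n >= 2 M^2 log(2/delta) / eps^2
# Finite H:  replace log(2/delta) by log(2|H|/delta)   (union bound)
import math

def n_single(eps, delta, M=1.0):
    return math.ceil(2 * M**2 * math.log(2 / delta) / eps**2)

def n_union(eps, delta, card_H, M=1.0):
    return math.ceil(2 * M**2 * math.log(2 * card_H / delta) / eps**2)

eps, delta = 0.05, 0.001
print("single hypothesis:", n_single(eps, delta))
print("|H| = 2*10^2     :", n_union(eps, delta, 2 * 10**2))
print("|H| = 2*10^6     :", n_union(eps, delta, 2 * 10**6))
```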

15 Controlling the Generalization Error
We would like to control the generalization error $|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)|$ of our estimator in probability. One possible way to do that is by controlling the generalization error of the whole set $\mathcal{H}$:
$P\big(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon\big) \le P\Big(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big).$
The latter term is the probability that at least one of the events $|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon$ occurs for $f \in \mathcal{H}$, in other words the probability of the union of such events. Therefore
$P\Big(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big) \le \sum_{f \in \mathcal{H}} P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big)$
by the so-called union bound.

16 Hoeffding the Generalization Error
By applying Hoeffding's inequality,
$P\big(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon\big) \le 2|\mathcal{H}|\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$
Or, equivalently, for any $\delta \in (0, 1]$,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2M^2\log(2|\mathcal{H}|/\delta)}{n}}$
with probability at least $1 - \delta$.

17 Example: Threshold Functions (in Probability)
Going back to the space $\mathcal{H}_p$ of threshold functions,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{4 + 6p - 2\log\delta}{n}}$
with probability at least $1 - \delta$, since $M = 1$ and $\log(2|\mathcal{H}_p|) = \log(4 \cdot 10^p) = \log 4 + p\log 10$. For example, let $\delta = 0.001$. We can say that
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$
holds at least 99.9% of the times.

18 Bounds in Expectation Vs Probability
Comparing the two bounds:
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le 10^p/\sqrt{n}$ (Expectation)
while, with probability greater than 99.9%,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$ (Probability).
Although we cannot be 100% sure of it, we can be quite confident that the generalization error will be much smaller than what the bound in expectation tells us...
Rates: note however that the rates of convergence to 0 are the same (i.e. $O(1/\sqrt{n})$).
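
A short numerical tabulation makes the gap between the two bounds visible (the values of $n$ and $p$ below are arbitrary illustrative choices):

```python
# Numerical comparison of the two generalization bounds for H_p
# (values of n and p are arbitrary illustrative choices).
import math

def bound_expectation(n, p):
    return 10**p / math.sqrt(n)            # E|E_n(f_n) - E(f_n)| <= 10^p / sqrt(n)

def bound_probability(n, p):
    return math.sqrt((6 * p + 18) / n)     # holds with probability >= 99.9%

for n in [10**3, 10**5, 10**7]:
    p = 2
    print(f"n={n:>8d}  expectation: {bound_expectation(n, p):8.3f}   "
          f"probability (99.9%): {bound_probability(n, p):6.3f}")
```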

19 Improving the Bound in Expectation
Exploiting the bound in probability and the knowledge that on $\mathcal{H}_p$ the excess risk is bounded by a constant, we can improve the bound in expectation... Let $X$ be a random variable s.t. $|X| \le M$ for some constant $M > 0$. Then, for any $\epsilon > 0$ we have
$\mathbb{E}\,|X| \le \epsilon\, P(|X| \le \epsilon) + M\, P(|X| > \epsilon).$
Applying this to our problem: for any $\delta \in (0, 1]$,
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2M^2\log(2|\mathcal{H}_p|/\delta)}{n}}\,(1 - \delta) + \delta M.$
Therefore only $\log|\mathcal{H}_p|$ appears (not $|\mathcal{H}_p|$ on its own).

20 Infinite Hypotheses Spaces
What if $f^* \in \mathcal{H} \setminus \mathcal{H}_p$ for every $p > 0$? ERM on $\mathcal{H}_p$ will never minimize the expected risk: there will always be a gap $\mathcal{E}(f_{n,p}) - \mathcal{E}(f^*) > 0$. For $p \to +\infty$ it is natural to expect such a gap to decrease... BUT if $p$ increases too fast (with respect to the number $n$ of examples) we cannot control the generalization error anymore:
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}} \to +\infty \quad \text{for } p \to +\infty.$
Therefore we need to increase $p$ gradually, as a function $p(n)$ of the number of training examples. This approach is known as regularization.

21 Approximation Error for Threshold Functions
Let us consider $f_p = 1_{[a_p, +\infty)} \in \operatorname*{argmin}_{f \in \mathcal{H}_p} \mathcal{E}(f)$ with $a_p \in [-1, 1]$. Consider the error decomposition of the excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$:
$\mathcal{E}(f_n) - \mathcal{E}_n(f_n) + \underbrace{\mathcal{E}_n(f_n) - \mathcal{E}_n(f_p)}_{\le\, 0} + \mathcal{E}_n(f_p) - \mathcal{E}(f_p) + \mathcal{E}(f_p) - \mathcal{E}(f^*).$
We already know how to control the generalization error of $f_n$ (via the supremum over $\mathcal{H}_p$) and of $f_p$ (since it is a single function). Moreover, the approximation error satisfies
$\mathcal{E}(f_p) - \mathcal{E}(f^*) \le |a_p - a^*| \le 10^{-p}.$
Note that it does not depend on the training data! (why?)

22 Approximation Error for Threshold Functions II
Putting everything together we have that, for any $\delta \in (0, 1)$ and $p \ge 0$,
$\mathcal{E}(f_n) - \mathcal{E}(f^*) \le \sqrt{\frac{4 + 6p - 2\log\delta}{n}} + 10^{-p} =: \phi(n, \delta, p)$
holds with probability greater than or equal to $1 - \delta$. In particular, for any $n$ and $\delta$, we can choose the best precision as
$p(n, \delta) = \operatorname*{argmin}_{p \ge 0}\ \phi(n, \delta, p),$
which leads to an error bound $\epsilon(n, \delta) = \phi(n, \delta, p(n, \delta))$ holding with probability larger than or equal to $1 - \delta$.
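
A minimal sketch of the precision selection rule above: minimize $\phi(n, \delta, p)$ over a grid of $p$ values (the grid and the value of $\delta$ are illustrative choices, not from the slides).

```python
# Choose the precision p(n, delta) by minimizing
#   phi(n, delta, p) = sqrt((4 + 6p - 2 log(delta)) / n) + 10^(-p)
# over a grid of p values (grid and delta are illustrative).
import math

def phi(n, delta, p):
    return math.sqrt((4 + 6 * p - 2 * math.log(delta)) / n) + 10.0 ** (-p)

def best_precision(n, delta, p_grid=None):
    p_grid = p_grid if p_grid is not None else [k / 10 for k in range(0, 101)]
    return min(p_grid, key=lambda p: phi(n, delta, p))

delta = 0.001
for n in [10**2, 10**4, 10**6, 10**8]:
    p_star = best_precision(n, delta)
    print(f"n={n:>9d}  p(n,delta)={p_star:4.1f}  error bound={phi(n, delta, p_star):.4f}")
```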

23 Regularization
Most hypotheses spaces are too large and therefore prone to overfitting. Regularization is the process of controlling the freedom of an estimator as a function of the number of training examples.
Idea. Parametrize $\mathcal{H}$ as a union $\mathcal{H} = \bigcup_{\gamma > 0} \mathcal{H}_\gamma$ of hypotheses spaces $\mathcal{H}_\gamma$ that are not prone to overfitting (e.g. finite spaces). $\gamma$ is known as the regularization parameter (e.g. the precision $p$ in our examples). Assume $\mathcal{H}_\gamma \subseteq \mathcal{H}_{\gamma'}$ if $\gamma \le \gamma'$.
Regularization Algorithm. Given $n$ training points, find an estimator $f_{\gamma,n}$ on $\mathcal{H}_\gamma$ (e.g. ERM on $\mathcal{H}_\gamma$). Let $\gamma = \gamma(n)$ increase as $n \to +\infty$.

24 Regularization and Decomposition of the Excess Risk
Let $\gamma > 0$ and $f_\gamma = \operatorname*{argmin}_{f \in \mathcal{H}_\gamma} \mathcal{E}(f)$. We can decompose the excess risk $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f^*)$ as
$\underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)}_{\text{Sample error}} + \underbrace{\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)}_{\text{Approximation error}} + \underbrace{\inf_{f \in \mathcal{H}} \mathcal{E}(f) - \mathcal{E}(f^*)}_{\text{Irreducible error}}.$

25 Irreducible Error
$\inf_{f \in \mathcal{H}} \mathcal{E}(f) - \mathcal{E}(f^*)$
Recall: $\mathcal{H}$ is the largest possible hypotheses space we are considering. If the irreducible error is zero, $\mathcal{H}$ is called universal (e.g. the RKHS induced by the Gaussian kernel is a universal hypotheses space).

26 Approximation Error
$\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)$
Does not depend on the dataset (deterministic). Does depend on the distribution $\rho$. Also referred to as bias.

27 Convergence of the Approximation Error
Under mild assumptions,
$\lim_{\gamma \to +\infty} \mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) = 0.$

28 Density Results
$\lim_{\gamma \to +\infty} \mathcal{E}(f_\gamma) - \mathcal{E}(f^*) = 0$
= Convergence of the approximation error + universal hypotheses space.
Note: it corresponds to a density property of the space $\mathcal{H}$ in $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$.

29 Approximation Error Bounds
$\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \le A(\rho, \gamma)$
No rates without assumptions, related to the so-called No Free Lunch Theorem. Studied in approximation theory using tools such as Kolmogorov n-widths, K-functionals, interpolation spaces... Prototypical result: if $f^*$ has smoothness² $s$, then $A(\rho, \gamma) = c\,\gamma^{-s}$.
² Some abstract notion of regularity parametrizing the class of target functions. Typical example: $f^*$ in a Sobolev space $W^{s,2}$.

30 Sample Error
$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)$
Random quantity depending on the data. Two main ways to study it: capacity/complexity estimates on $\mathcal{H}_\gamma$, and stability.

31 Sample Error Decomposition
We have seen how to decompose the sample error $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)$ into
$\underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})}_{\text{Generalization error}} + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_\gamma)}_{\text{Excess empirical risk}} + \underbrace{\mathcal{E}_n(f_\gamma) - \mathcal{E}(f_\gamma)}_{\text{Generalization error}}.$

32 Generalization Error(s)
As we have observed, $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})$ and $\mathcal{E}_n(f_\gamma) - \mathcal{E}(f_\gamma)$ can be controlled by studying the empirical process
$\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)|.$
Example: we have already observed that for a finite space $\mathcal{H}_\gamma$,
$P\Big(\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big) \le 2|\mathcal{H}_\gamma|\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$

33 ERM on Finite Spaces and Computational Efficiency
The strategy used for threshold functions can be generalized to any $\mathcal{H}$ for which it is possible to find a finite discretization $\mathcal{H}_p$ with respect to the $L^1(\mathcal{X}, \rho_{\mathcal{X}})$ norm (e.g. $\mathcal{H}$ compact with respect to such a norm). However, in general it could be computationally very expensive to find the empirical risk minimizer on a discretization $\mathcal{H}_p$, since in principle it could be necessary to evaluate $\mathcal{E}_n(f)$ for every $f \in \mathcal{H}_p$. As it turns out, ERM on e.g. convex (thus dense) spaces is often much more amenable to computations, but we have observed that on infinite hypotheses spaces it is difficult to control the generalization error. Interestingly, we can leverage the discretization argument to control the generalization error of ERM also for special dense hypotheses spaces.

34 Risks for Continuous Functions
Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact space and $C(\mathcal{X})$ be the space of continuous functions on $\mathcal{X}$. Let $\|\cdot\|_\infty$ be defined for any $f \in C(\mathcal{X})$ as $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. If the loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is such that $\ell(\cdot, y)$ is uniformly Lipschitz with constant $C > 0$ for any $y \in \mathcal{Y}$, we have that
1) $|\mathcal{E}(f_1) - \mathcal{E}(f_2)| \le C\,\|f_1 - f_2\|_{L^1(\mathcal{X}, \rho_{\mathcal{X}})} \le C\,\|f_1 - f_2\|_\infty$, and
2) $|\mathcal{E}_n(f_1) - \mathcal{E}_n(f_2)| \le \frac{1}{n}\sum_{i=1}^n |\ell(f_1(x_i), y_i) - \ell(f_2(x_i), y_i)| \le C\,\|f_1 - f_2\|_\infty$.
Therefore, functions that are close in $\|\cdot\|_\infty$ will have similar expected and empirical risks!

35 Compact Spaces in C(X)
Idea. If $\mathcal{H} \subset C(\mathcal{X})$ admits a finite discretization $\mathcal{H}_p = \{h_1, \dots, h_N\}$ of precision $p$ with respect to $\|\cdot\|_\infty$ (e.g. $\mathcal{H}$ is compact with respect to $\|\cdot\|_\infty$), then we can control the generalization error over it as
$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sup_{f \in \mathcal{H}} \Big( |\mathcal{E}_n(f) - \mathcal{E}_n(h_f)| + |\mathcal{E}_n(h_f) - \mathcal{E}(h_f)| + |\mathcal{E}(h_f) - \mathcal{E}(f)| \Big) \le 2L \cdot 10^{-p} + \sup_{h \in \mathcal{H}_p} |\mathcal{E}_n(h) - \mathcal{E}(h)|,$
where we have denoted $h_f = \operatorname*{argmin}_{h \in \mathcal{H}_p} \|h - f\|_\infty$.
Note: we know how to control $\sup_{h \in \mathcal{H}_p} |\mathcal{E}_n(h) - \mathcal{E}(h)|$ since $\mathcal{H}_p$ is finite!

36 Covering Numbers
We define the covering number of $\mathcal{H}$ at radius $\eta > 0$ as the cardinality of a minimal cover of $\mathcal{H}$ with balls of radius $\eta$:
$\mathcal{N}(\mathcal{H}, \eta) = \inf\Big\{ m \in \mathbb{N} \;\Big|\; \mathcal{H} \subseteq \bigcup_{i=1}^m B_\eta(h_i),\ h_i \in \mathcal{H} \Big\}.$
Example. If $\mathcal{H} = B_R(0)$ is a ball of radius $R$ in $\mathbb{R}^d$, then $\mathcal{N}(B_R(0), \eta) \le (4R/\eta)^d$.
[Figure: a cover of a set by balls of radius $\eta$. Image credits: Lorenzo Rosasco.]

37 Example: Covering Numbers (continued)
Putting the two together we have that, for any $\delta \in (0, 1)$,
$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le 2L\eta + \sqrt{\frac{2M^2 \log(2\mathcal{N}(\mathcal{H}, \eta)/\delta)}{n}}$
holds with probability at least $1 - \delta$. For $\eta \to 0$ the covering number $\mathcal{N}(\mathcal{H}, \eta) \to +\infty$; however, for fixed $\eta$ and $n \to +\infty$ the second term tends to zero. It is typically possible to show that there exists a choice $\eta(n)$ for which the whole bound tends to zero as $n \to +\infty$.
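
The sketch below evaluates this bound for the ball example, using $\mathcal{N}(B_R(0), \eta) \le (4R/\eta)^d$, and picks the radius $\eta$ that minimizes it over a grid; all the constants ($L$, $M$, $R$, $d$, $\delta$) are illustrative placeholders, not values from the slides.

```python
# Covering-number generalization bound for H = ball of radius R in R^d:
#   sup_f |E_n(f) - E(f)| <= 2*L*eta + sqrt(2 M^2 log(2 N(H,eta)/delta) / n)
# with N(H, eta) <= (4R/eta)^d.  All constants below are illustrative.
import math

def bound(n, eta, L=1.0, M=1.0, R=1.0, d=5, delta=0.001):
    log_N = d * math.log(4 * R / eta)                  # log covering number
    return 2 * L * eta + math.sqrt(2 * M**2 * (math.log(2 / delta) + log_N) / n)

def best_eta(n, grid=None):
    grid = grid if grid is not None else [10 ** (-k / 10) for k in range(1, 60)]
    return min(grid, key=lambda eta: bound(n, eta))

for n in [10**3, 10**5, 10**7]:
    eta = best_eta(n)
    print(f"n={n:>8d}  eta(n)={eta:.4f}  bound={bound(n, eta):.4f}")
```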

38 Complexity Measures
In general, the error $\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)|$ can be controlled via capacity/complexity measures: covering numbers; combinatorial dimensions, e.g. the VC dimension and the fat-shattering dimension; Rademacher complexities; Gaussian complexities...

39 Prototypical Results
A prototypical result (under suitable assumptions, e.g. regularity of $f^*$):
$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f^*) \le \underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)}_{\lesssim\ \gamma^{\beta} n^{-\alpha}\ \text{(Variance)}} + \underbrace{\mathcal{E}(f_\gamma) - \mathcal{E}(f^*)}_{\lesssim\ \gamma^{-\tau}\ \text{(Bias)}}$
Goal: find the $\gamma(n)$ achieving the best Bias - Variance trade-off.
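
Assuming the variance term behaves as $\gamma^{\beta} n^{-\alpha}$ and the bias as $\gamma^{-\tau}$ (generic exponents, consistent with $\mathcal{H}_\gamma$ growing with $\gamma$; this is a sketch, not a specific theorem), a one-line balancing computation gives the usual form of the trade-off:
$\gamma^{\beta} n^{-\alpha} = \gamma^{-\tau} \;\Longrightarrow\; \gamma(n) = n^{\alpha/(\beta + \tau)} \;\Longrightarrow\; \mathcal{E}(f_{\gamma(n),n}) - \mathcal{E}(f^*) \lesssim n^{-\alpha\tau/(\beta + \tau)}.$
In the threshold example, with $p$ in place of $\gamma$, balancing $\sqrt{(6p + 18)/n}$ against $10^{-p}$ is exactly what the choice $p(n, \delta)$ of slide 22 does numerically.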

40 Choosing γ(n) in Practice
The best $\gamma(n)$ depends on the unknown distribution $\rho$. So how can we choose such a parameter in practice? This problem is known as model selection. Possible approaches: cross validation, complexity regularization/structural risk minimization, balancing principles...
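
A minimal K-fold cross-validation sketch for picking the regularization parameter, here the precision $p$ of the threshold class (self-contained illustrative setup; the data distribution, fold count, and grid of $p$ values are arbitrary choices, not from the slides):

```python
# K-fold cross-validation to select the precision p of the threshold class
# (self-contained illustrative sketch; data distribution and grids are made up).
import numpy as np

rng = np.random.default_rng(3)
a_star = 0.3172

def sample(n, flip=0.1):
    x = rng.uniform(-1, 1, n)
    y = (x >= a_star).astype(int)
    return x, np.where(rng.uniform(size=n) < flip, 1 - y, y)

def erm_threshold(x, y, p):
    grid = np.arange(-10**p, 10**p + 1) / 10**p
    return grid[int(np.argmin([np.mean((x >= a).astype(int) != y) for a in grid]))]

def cv_select_p(x, y, p_grid=(0, 1, 2, 3), K=5):
    folds = np.array_split(rng.permutation(len(x)), K)
    scores = {}
    for p in p_grid:
        errs = []
        for k in range(K):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            a_hat = erm_threshold(x[tr], y[tr], p)           # fit on K-1 folds
            errs.append(np.mean((x[val] >= a_hat).astype(int) != y[val]))
        scores[p] = float(np.mean(errs))                     # validation error
    best_p = min(scores, key=scores.get)
    return best_p, scores

x_tr, y_tr = sample(500)
p_cv, scores = cv_select_p(x_tr, y_tr)
print("validation errors:", scores, " -> selected p =", p_cv)
```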

41 Abstract Regularization
We just got our first introduction to the concept of regularization: controlling the expressiveness of the hypotheses space according to the number of training examples, in order to guarantee good prediction performance and consistency. There are many ways to implement this strategy in practice (we will see some of them in this course): Tikhonov (and Ivanov) regularization, spectral filtering, early stopping, random sampling...

42 Wrapping Up
This class: overfitting; controlling the generalization error; abstract regularization.
Next class: Tikhonov regularization.
