Empirical Risk Minimization

Empirical Risk Minimization
Fabrice Rossi, SAMM, Université Paris 1 Panthéon Sorbonne, 2018

Outline
Introduction
PAC learning
ERM in practice

General setting
Data: X the input space, Y the output space, and D a fixed and unknown distribution on X \times Y.
Loss function: a loss function l is a function from Y \times Y to \mathbb{R}_+ such that \forall y \in Y, l(y, y) = 0.
Model, loss and risk: a model g is a function from X to Y; given a loss function l, the risk of g is R_l(g) = \mathbb{E}_{(X,Y) \sim D}[l(g(X), Y)], and the optimal risk is R_l^* = \inf_g R_l(g).

Supervised learning
Data set: \mathcal{D} = ((X_i, Y_i))_{1 \le i \le N} with (X_i, Y_i) \sim D i.i.d., i.e. \mathcal{D} \sim D^N (product distribution).
General problem: a learning algorithm builds from \mathcal{D} a model g_{\mathcal{D}}. Does R_l(g_{\mathcal{D}}) reach R_l^* when N goes to infinity? If so, how quickly?

Empirical risk minimization
Empirical risk: \hat{R}_l(g, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N l(g(X_i), Y_i) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} l(g(x), y).
ERM algorithm: choose a class of functions G from X to Y and define g_{ERM,l,G,\mathcal{D}} = \arg\min_{g \in G} \hat{R}_l(g, \mathcal{D}).
Is ERM a good machine learning algorithm?
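
To make the definitions concrete, here is a minimal Python sketch (an illustration, not part of the original slides): it computes the empirical risk of a model and runs ERM over a small finite class of threshold classifiers; the synthetic data, the threshold class and the use of the binary loss are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: X uniform on [0, 1], labels given by a threshold at 0.6 with 10% label noise.
N = 200
X = rng.uniform(0.0, 1.0, size=N)
Y = ((X >= 0.6).astype(int) ^ (rng.uniform(size=N) < 0.1)).astype(int)

def binary_loss(pred, target):
    """l_b(p, t) = 1_{p != t}."""
    return (pred != target).astype(float)

def empirical_risk(g, X, Y, loss=binary_loss):
    """Empirical risk: (1/N) * sum_i l(g(X_i), Y_i)."""
    return loss(g(X), Y).mean()

# A small finite class G of threshold classifiers g_s(x) = 1_{x >= s}.
thresholds = np.linspace(0.0, 1.0, 21)
G = [(s, lambda x, s=s: (x >= s).astype(int)) for s in thresholds]

# ERM: pick the model with the smallest empirical risk.
s_erm, g_erm = min(G, key=lambda sg: empirical_risk(sg[1], X, Y))
print(f"ERM threshold: {s_erm:.2f}, empirical risk: {empirical_risk(g_erm, X, Y):.3f}")
```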

Three distinct problems
1. An optimization problem: given l and G, how difficult is finding \arg\min_{g \in G} \hat{R}_l(g, \mathcal{D})? Given limited computational resources, how close can we get to it?
2. An estimation problem: given a class of functions G, define R_{l,G}^* = \inf_{g \in G} R_l(g); can we bound R_l(g_{\mathcal{D}}) - R_{l,G}^*?
3. An approximation problem: can we bound R_{l,G}^* - R_l^*, in a way that is compatible with estimation?
Focus of this course: the estimation problem, then the approximation problem, with a few words about the optimization problem.

Outline
Introduction
PAC learning
ERM in practice

A simplified case
Learning concepts: a concept c is a mapping from X to Y = \{0, 1\}; in concept learning the loss function is the binary loss l_b, with l_b(p, t) = \mathbf{1}_{p \neq t}; we consider only a distribution D_X over X.
Risk and empirical risk definitions are adapted to this setting:
risk: R(g) = \mathbb{E}_{X \sim D_X}[\mathbf{1}_{g(X) \neq c(X)}]
empirical risk: \hat{R}(g, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{g(X_i) \neq c(X_i)}
In essence the pair (D_X, c) replaces D: this corresponds to a noise-free situation. As a consequence, a data set is \mathcal{D} = \{X_1, ..., X_N\} and has to be complemented by a concept to learn.

PAC learning
Notations: if A is a learning algorithm, then A(\mathcal{D}) is the model produced by running A on the data set \mathcal{D}.
Definition: a concept class C (i.e. a set of concepts) is PAC-learnable if there is an algorithm A and a function N_C from [0,1]^2 to \mathbb{N} such that for any 1 > \epsilon > 0 and any 1 > \delta > 0, for any distribution D_X and any concept c \in C, if N \ge N_C(\epsilon, \delta) then P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \le \epsilon\} \ge 1 - \delta.
"Probably" corresponds to 1 - \delta, "approximately correct" to \epsilon.

Concept learning and ERM
Remark: the concept to learn c is in the class C used by ERM, thus R_C^* = 0 and, for any \mathcal{D}, \hat{R}(g_{ERM,C,\mathcal{D}}, \mathcal{D}) = 0. Hence ERM provides PAC-learnability if, for any g \in C such that \hat{R}(g, \mathcal{D}) = 0, P_{\mathcal{D} \sim D_X^N}\{R(g) \le \epsilon\} \ge 1 - \delta.
Theorem: let C be a finite concept class and let A be an algorithm that outputs A(\mathcal{D}) such that \hat{R}(A(\mathcal{D}), \mathcal{D}) = 0. Then when N \ge \frac{1}{\epsilon} \log \frac{|C|}{\delta}, P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \le \epsilon\} \ge 1 - \delta.

Proof
1. We consider the ways to break the "approximately correct" part, i.e. having both \hat{R}(g, \mathcal{D}) = 0 and R(g) > \epsilon. We have Q = P(\exists g \in C, \hat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) = P\left(\bigcup_{g \in C} \{\hat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon\}\right).
2. Union bound: Q \le \sum_{g \in C} P(\hat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon).
3. Then P(\hat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) = P(\hat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon) P(R(g) > \epsilon) \le P(\hat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon).

Proof (cont.)
Notice that R(g) = P_{X \sim D_X}(g(X) \neq c(X)), thus P_{X \sim D_X}(g(X) = c(X) \mid R(g) > \epsilon) \le 1 - \epsilon.
As the observations are i.i.d., P(\hat{R}(g, \mathcal{D}) = 0 \mid R(g) > \epsilon) \le (1 - \epsilon)^N \le e^{-N\epsilon}.
Finally, P(\exists g \in C, \hat{R}(g, \mathcal{D}) = 0 \text{ and } R(g) > \epsilon) \le |C| e^{-N\epsilon}.
Then if \hat{R}(A(\mathcal{D}), \mathcal{D}) = 0, P_{\mathcal{D} \sim D_X^N}\{R(A(\mathcal{D})) \le \epsilon\} \ge 1 - |C| e^{-N\epsilon}.
We want |C| e^{-N\epsilon} \le \delta, which happens when N \ge \frac{1}{\epsilon} \log \frac{|C|}{\delta}.

PAC concept learning
ERM: ERM provides PAC-learnability for finite concept classes; the direct optimization has a computational cost in \Theta(N |C|).
Data consumption: the data needed to reach a given PAC level grows with the logarithm of the size of the concept class. A finite set C can be encoded with \log_2 |C| bits (by numbering its elements); roughly speaking, each observation fixes one bit of the solution.

Generalization
Concept learning is too limited: no noise, fixed loss function.
Agnostic PAC-learnability: a class of models G (functions from X to Y) is agnostically PAC-learnable with respect to a loss function l if there is an algorithm A and a function N_G from [0,1]^2 to \mathbb{N} such that for any 1 > \epsilon > 0 and any 1 > \delta > 0, for any distribution D on X \times Y, if N \ge N_G(\epsilon, \delta) then P_{\mathcal{D} \sim D^N}\{R_l(A(\mathcal{D})) \le R_{l,G}^* + \epsilon\} \ge 1 - \delta.
Main questions: does ERM provide agnostic PAC-learnability? Does that apply to infinite classes of models?

Uniform approximation
Lemma: controlling the ERM can be done by ensuring that the empirical risk is uniformly a good approximation of the true risk:
R_l(g_{ERM,l,G,\mathcal{D}}) - R_{l,G}^* \le 2 \sup_{g \in G} |R_l(g) - \hat{R}_l(g, \mathcal{D})|.
Proof: for any g \in G, we have
R_l(g_{ERM,l,G,\mathcal{D}}) - R_l(g) = R_l(g_{ERM,l,G,\mathcal{D}}) - \hat{R}_l(g_{ERM,l,G,\mathcal{D}}, \mathcal{D}) + \hat{R}_l(g_{ERM,l,G,\mathcal{D}}, \mathcal{D}) - R_l(g)
\le R_l(g_{ERM,l,G,\mathcal{D}}) - \hat{R}_l(g_{ERM,l,G,\mathcal{D}}, \mathcal{D}) + \hat{R}_l(g, \mathcal{D}) - R_l(g) (by definition of the ERM)
\le 2 \sup_{g' \in G} |R_l(g') - \hat{R}_l(g', \mathcal{D})|,
which leads to the conclusion by taking the infimum over g \in G.

Finite classes
Theorem: if |G| < \infty and if l \in [a, b], then when N \ge \frac{\log(2|G|/\delta)(b - a)^2}{2\epsilon^2},
P_{\mathcal{D} \sim D^N}\{\sup_{g \in G} |R_l(g) - \hat{R}_l(g, \mathcal{D})| \le \epsilon\} \ge 1 - \delta.
Proof (very rough sketch):
1. we use the union bound technique to focus on a single model g;
2. we then use the Hoeffding inequality to bound the difference between an empirical average and an expectation; in our context it says P_{\mathcal{D} \sim D^N}\{|R_l(g) - \hat{R}_l(g, \mathcal{D})| \ge \epsilon\} \le 2 \exp\left(-\frac{2N\epsilon^2}{(b - a)^2}\right);
3. the conclusion is obtained as in the simple case of concept learning.

Finite classes
Theorem: if |G| < \infty and if l \in [a, b], then ERM provides agnostic PAC-learnability with N_G(\epsilon, \delta) = \frac{2 \log(2|G|/\delta)(b - a)^2}{\epsilon^2}.
Discussion:
this is an obvious consequence of the uniform approximation result;
the limitation l \in [a, b] can be lifted, but only asymptotically;
the dependency of the data size on the quality (i.e. on \epsilon) is far less satisfactory than in the simple case: this is a consequence of allowing noise.
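
As a rough illustration of how much heavier the data requirement becomes in the agnostic setting, the snippet below evaluates the two sample-size bounds seen so far; the values of epsilon, delta and the class size are arbitrary choices for the example, and l in [0, 1] is assumed (so b - a = 1).

```python
import math

eps, delta = 0.05, 0.05
card = 1000  # |C| = |G|, an arbitrary class size

# Realizable (noise-free) concept learning: N >= (1/eps) * log(|C| / delta).
n_realizable = math.ceil(math.log(card / delta) / eps)

# Agnostic PAC bound for a finite class with l in [0, 1]:
# N_G(eps, delta) = 2 * log(2|G| / delta) / eps^2.
n_agnostic = math.ceil(2 * math.log(2 * card / delta) / eps**2)

print(n_realizable)  # 199
print(n_agnostic)    # 8478: the 1/eps^2 dependency is much heavier than 1/eps
```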

Infinite classes
Restriction: we keep the noise but move back to a simple case, Y = \{0, 1\} and l = l_b.
Growth function: if \{v_1, ..., v_m\} is a finite subset of X, let G_{\{v_1,...,v_m\}} = \{(g(v_1), ..., g(v_m)) \mid g \in G\} \subset \{0, 1\}^m. The growth function of G is S_G(m) = \sup_{\{v_1,...,v_m\} \subset X} |G_{\{v_1,...,v_m\}}|.

Interpretation
Going back to finite things: |G_{\{v_1,...,v_m\}}| gives the number of models as seen by the inputs \{v_1, ..., v_m\}; it corresponds to the number of possible classification decisions (a.k.a. binary labellings) of those inputs. The growth function corresponds to a worst-case analysis: the set of inputs that can be labelled in the largest number of different ways.
Vocabulary: if |G_{\{v_1,...,v_m\}}| = 2^m then \{v_1, ..., v_m\} is said to be shattered by G; S_G(m) is the m-th shatter coefficient of G.
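
A brute-force illustration (not from the slides, and using an arbitrary 1-D example): for threshold classifiers g_s(x) = 1_{x >= s}, the number of distinct labellings of m points is m + 1, far below 2^m, so this class does not shatter large sets.

```python
import numpy as np

def labellings(points, thresholds):
    """Distinct binary labellings of `points` produced by g_s(x) = 1_{x >= s}."""
    return {tuple((points >= s).astype(int)) for s in thresholds}

m = 5
points = np.sort(np.random.default_rng(1).uniform(size=m))
# Candidate thresholds: one below all points, one between each pair of consecutive
# points, and one above all points are enough to produce every achievable labelling.
thresholds = np.concatenate(([points[0] - 1.0],
                             (points[:-1] + points[1:]) / 2,
                             [points[-1] + 1.0]))

print(len(labellings(points, thresholds)))  # m + 1 = 6 achievable labellings
print(2 ** m)                               # 32 conceivable labellings
```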

Uniform approximation
Theorem: for any 1 > \delta > 0 and any distribution D,
P_{\mathcal{D} \sim D^N}\left\{\sup_{g \in G} |R_{l_b}(g) - \hat{R}_{l_b}(g, \mathcal{D})| \le 4\sqrt{\frac{\log(S_G(2N)) + \log\frac{4}{\delta}}{2N}}\right\} \ge 1 - \delta.
Consequences:
strong link between the growth function and uniform approximation;
useful only if \frac{\log(S_G(2m))}{m} goes to zero when m \to \infty;
if G shatters sets of arbitrary sizes, \log(S_G(2m)) = 2m \log 2.

Vapnik-Chervonenkis dimension
VC-dimension: VCdim(G) = \sup\{m \in \mathbb{N} \mid S_G(m) = 2^m\}.
Characterization: VCdim(G) = m if and only if
1. there is a set of m points \{v_1, ..., v_m\} that is shattered by G;
2. no set of m + 1 points \{v_1, ..., v_{m+1}\} is shattered by G.
Lemma (Sauer): if VCdim(G) < \infty, then for all m, S_G(m) \le \sum_{k=0}^{VCdim(G)} \binom{m}{k}. In particular, when m \ge VCdim(G), S_G(m) \le \left(\frac{em}{VCdim(G)}\right)^{VCdim(G)}.

Finite VC-dimension
Consequences: if VCdim(G) = d < \infty, then for any 1 > \delta > 0 and any distribution D, if N \ge d,
P_{\mathcal{D} \sim D^N}\left\{\sup_{g \in G} |R_{l_b}(g) - \hat{R}_{l_b}(g, \mathcal{D})| \le 4\sqrt{\frac{d \log\frac{2eN}{d} + \log\frac{4}{\delta}}{2N}}\right\} \ge 1 - \delta.
Learnability: a finite VC-dimension ensures agnostic PAC-learnability of the ERM; it can be shown that N_G(\epsilon, \delta) = \Theta\left(\frac{VCdim(G) + \log\frac{1}{\delta}}{\epsilon^2}\right).

VC-dimension
VC-dimension calculation is very difficult! A useful result:
Theorem: let F be a vector space of functions from X to \mathbb{R} of dimension p. Let G be the class of models given by G = \{g : X \to \{0, 1\} \mid \exists f \in F, \forall x \in X, g(x) = \mathbf{1}_{f(x) \ge 0}\}. Then VCdim(G) \le p.
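
To illustrate the theorem with a hand-rolled check (not from the slides): for affine classifiers on R^2, F = span{1, x_1, x_2} has dimension 3, so VCdim(G) <= 3. The brute-force grid search below realizes all 8 labellings of a 3-point set, but cannot realize the XOR labelling of 4 points; a grid search can miss separators in general, but the XOR labelling is genuinely not linearly separable, so the negative answer is correct here.

```python
import itertools
import numpy as np

def realizable(points, labels, grid):
    """Is there (w1, w2, b) in the grid such that 1_{w.x + b >= 0} matches `labels`?"""
    for w1, w2, b in itertools.product(grid, repeat=3):
        pred = tuple(int(w1 * x1 + w2 * x2 + b >= 0) for x1, x2 in points)
        if pred == labels:
            return True
    return False

grid = np.arange(-2.0, 2.5, 0.5)  # coarse grid of candidate coefficients

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(all(realizable(three, lab, grid)
          for lab in itertools.product((0, 1), repeat=3)))  # True: the 3 points are shattered

four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print(realizable(four, (1, 1, 0, 0), grid))  # False: the XOR labelling is not realizable
```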

Is a finite VC-dimension needed?
Theorem: let G be a class of models from X to Y = \{0, 1\}. Then the following properties are equivalent:
1. G is agnostically PAC-learnable with the binary loss l_b;
2. ERM provides agnostic PAC-learnability with the binary loss l_b for G;
3. VCdim(G) < \infty.
Interpretation: learnability in the PAC sense is therefore entirely characterized by the VC-dimension of the class of models; no algorithmic trick can be used to circumvent this fact! But this applies only to a fixed class!

Beyond binary classification
Numerous extensions are available: to the regression setting (with quadratic or absolute loss) and to classification with more than two classes. Refined complexity measures are available: Rademacher complexity, covering numbers. Better bounds are also available, both in general and in the noise-free situation.
But the overall message remains the same: learnability is only possible in classes of bounded complexity.

Outline
Introduction
PAC learning
ERM in practice

ERM in practice
Empirical risk minimization: g_{ERM,l,G,\mathcal{D}} = \arg\min_{g \in G} \hat{R}_l(g, \mathcal{D}).
Implementation questions: what class G should we use? What are potential candidates? How do we choose among them? How do we implement the minimization part: what complexity? Approximate solutions?

Model classes
Some examples:
fixed basis models, e.g. for Y = \{-1, 1\}: G = \left\{X \mapsto \mathrm{sign}\left(\sum_{k=1}^K \alpha_k f_k(X)\right)\right\}, where the f_k are fixed functions from X to \mathbb{R};
parametric basis models: G = \left\{X \mapsto \mathrm{sign}\left(\sum_{k=1}^K \alpha_k f_k(w_k, X)\right)\right\}, where the f_k(w_k, \cdot) are functions from X to \mathbb{R} and the w_k are parameters that enable tuning the f_k;
these classes are also useful for Y = \mathbb{R} (remove the sign function).

Examples
Linear models, X = \mathbb{R}^P (the linearity is with respect to \alpha):
basic model: f_k(X) = X_k, so that \sum_{k=1}^P \alpha_k f_k(X) = \alpha^T X;
general models: f_k(X) can be any polynomial function on \mathbb{R}^P, or more generally any function from \mathbb{R}^P to \mathbb{R}, e.g. f_k(X) = X_1 X_2^2, f_k(X) = \log X_3, etc.
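
A tiny sketch of this point (illustrative; the particular basis functions and data are arbitrary): even with nonlinear basis functions such as X_1 X_2^2 or log X_3, the ERM problem stays linear in the coefficients alpha, so a design-matrix least-squares fit applies unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 2.0, size=(100, 3))  # N = 100 inputs in R^3
Y = 1.0 + 2.0 * X[:, 0] * X[:, 1]**2 - np.log(X[:, 2]) + 0.1 * rng.normal(size=100)

# Basis functions f_k: constant, X1 * X2^2 and log(X3); the model is linear in alpha.
Phi = np.column_stack([np.ones(100), X[:, 0] * X[:, 1]**2, np.log(X[:, 2])])
alpha, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(alpha)  # close to (1, 2, -1)
```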

Examples
Nonlinear models:
Radial Basis Function (RBF) networks, X = \mathbb{R}^P: f_k((\beta, w_k), X) = \exp\left(-\beta \|X - w_k\|^2\right);
one-hidden-layer perceptron, X = \mathbb{R}^P: f_k((\beta, w_k), X) = \frac{1}{1 + \exp(-\beta w_k^T X)}.
More complex outputs: if |Y| < \infty, write Y = \{y_1, ..., y_L\}; a possible class is G = \left\{X \mapsto y_{t(X)}, \text{ with } t(X) = \arg\max_l \exp\left(\sum_{k=1}^K \alpha_{kl} f_{kl}(w_{kl}, X)\right)\right\}.

(Meta)Parameters
Parametric view: the previous classes are described by parameters; ERM is defined at the model level but can equivalently be considered at the parameter level. If e.g. G = \left\{X \mapsto \mathrm{sign}\left(\sum_{k=1}^K \alpha_k f_k(X)\right)\right\}, solving \min_{g \in G} \hat{R}_{l_b}(g, \mathcal{D}) is equivalent to solving \min_{\alpha \in \mathbb{R}^K} \hat{R}_{l_b}(g_\alpha, \mathcal{D}), where g_\alpha is the model associated to \alpha.
Meta-parameters: to avoid confusion, we use the term "meta-parameters" for the parameters of the machine learning algorithm itself; in ERM, those are class-level parameters (G itself): K, the f_k functions, the parametric form of f_k(w_k, \cdot).

ERM and optimization
Standard optimization problem: computing \arg\min_{g \in G} \hat{R}_l(g, \mathcal{D}) is a classical optimization problem; there is no closed-form solution in general, so ERM relies on standard algorithms: gradient-based algorithms if possible, combinatorial optimization tools if needed.
Very different complexities: from easy cases (linear models with the quadratic loss) to NP-hard ones (the binary loss, even with very simple models).

Linear regression
ERM version of linear regression:
class of models: G = \{g : \mathbb{R}^P \to \mathbb{R} \mid \exists (\beta_0, \beta), \forall x \in \mathbb{R}^P, g_{\beta_0,\beta}(x) = \beta_0 + \beta^T x\};
loss: l(p, t) = (p - t)^2;
empirical risk: \hat{R}_l(g_{\beta_0,\beta}, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N (Y_i - \beta_0 - \beta^T X_i)^2.
Standard solution: (\hat\beta_0, \hat\beta^T)^T = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}, where \mathbf{X} is the N \times (P + 1) matrix whose i-th row is (1, X_i^T) and \mathbf{Y} = (Y_1, ..., Y_N)^T; computational cost in \Theta(N P^2).
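
A minimal numpy sketch of this closed-form ERM solution on synthetic data (solving the normal equations with a linear solver rather than forming the explicit inverse, which is numerically preferable; all concrete values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 4
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
Y = 0.7 + X @ beta_true + 0.2 * rng.normal(size=N)

# Design matrix with a leading column of ones for the intercept beta_0.
Xd = np.column_stack([np.ones(N), X])
# Normal equations: (Xd^T Xd) beta_hat = Xd^T Y.
beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ Y)
print(beta_hat)  # approximately (0.7, 1, -2, 0.5, 3)
```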

Linear classification
Linear model with the binary loss:
class of models: G = \{g : \mathbb{R}^P \to \{0, 1\} \mid \exists (\beta_0, \beta), \forall x \in \mathbb{R}^P, g_{\beta_0,\beta}(x) = \mathrm{sign}(\beta_0 + \beta^T x)\};
loss: l_b(p, t) = \mathbf{1}_{p \neq t};
empirical risk: the misclassification rate.
In this context ERM is NP-hard, and tight approximations are also NP-hard; notice that the input dimension is the source of complexity.
Noise: if the optimal model makes zero error, then ERM is polynomial! Complexity comes from both noise and the binary loss.

Gradient descent
Smooth functions: Y = \mathbb{R}, parametric case G = \{X \mapsto F(w, X)\}; assume the loss function l and the models in the class G are differentiable (this can be extended to subgradients).
Gradient of the empirical risk: \nabla_w \hat{R}_l(w, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N \frac{\partial l}{\partial p}(F(w, X_i), Y_i) \nabla_w F(w, X_i).
ERM through standard gradient-based algorithms, such as gradient descent: w_t = w_{t-1} - \gamma_t \nabla_w \hat{R}_l(w_{t-1}, \mathcal{D}).
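
A bare-bones gradient descent sketch for the linear model F(w, X) = w^T X with the quadratic loss (an illustration only; the step size and number of iterations are arbitrary choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 200, 3
X = rng.normal(size=(N, P))
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

def grad_empirical_risk(w, X, Y):
    """Gradient of (1/N) * sum_i (w^T X_i - Y_i)^2 with respect to w."""
    return 2.0 / len(Y) * X.T @ (X @ w - Y)

w = np.zeros(P)
gamma = 0.1  # constant step size, chosen by hand for this example
for t in range(500):
    w = w - gamma * grad_empirical_risk(w, X, Y)
print(w)  # close to w_true
```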

Stochastic gradient descent
Finite sum: leverage the structure of \hat{R}_l(w, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^N l(F(w, X_i), Y_i); what about updating according to only one example?
Stochastic gradient descent:
1. start with a random w_0;
2. iterate:
2.1 select i_t uniformly at random in \{1, ..., N\};
2.2 w_t = w_{t-1} - \gamma_t \frac{\partial l}{\partial p}(F(w_{t-1}, X_{i_t}), Y_{i_t}) \nabla_w F(w_{t-1}, X_{i_t}).
Practical tips: use Polyak-Ruppert averaging, \bar{w}_m = \frac{1}{m} \sum_{t=0}^{m-1} w_t; take \gamma_t = (\gamma_0 + t)^{-\kappa} with \kappa \in ]0.5, 1] and \gamma_0 \ge 0; numerous acceleration techniques exist, such as momentum ("averaged" gradients).
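
The same least-squares problem solved with stochastic gradient descent, using the step-size schedule gamma_t = (gamma_0 + t)^(-kappa) and Polyak-Ruppert averaging from the slide; gamma_0, kappa and the number of iterations are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 200, 3
X = rng.normal(size=(N, P))
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(P)
w_avg = np.zeros(P)
gamma0, kappa = 50.0, 0.75  # hand-picked for this example
T = 20000
for t in range(T):
    i = rng.integers(N)                      # select one example uniformly at random
    grad_i = 2.0 * (X[i] @ w - Y[i]) * X[i]  # gradient of the loss on example i
    w = w - (gamma0 + t) ** (-kappa) * grad_i
    w_avg += w / T                           # running Polyak-Ruppert average of the iterates
print(w_avg)  # close to w_true
```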

Ad hoc methods
Heuristics: numerous heuristics have been proposed for ERM and related problems; one of the main tools is alternate/separate optimization: optimize with respect to some parameters while holding the others constant.
Radial basis functions: G = \left\{X \mapsto \sum_{k=1}^K \alpha_k \exp\left(-\beta \|X - w_k\|^2\right)\right\};
set \beta heuristically, e.g. to the inverse of the smallest squared distance between two X in the data set;
set the w_k via an unsupervised method applied to the (X_i)_{1 \le i \le N} only (for instance the K-means algorithm);
consider \beta and the w_k fixed and apply standard ERM to the \alpha_k, e.g. linear regression if the loss function is quadratic and Y = \mathbb{R}.
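
A sketch of this three-step heuristic (assuming scikit-learn and scipy are available for K-means and pairwise distances; the data and K are arbitrary). One deliberate change: the slide's minimum-distance rule for beta gives an extremely peaked kernel on a dense sample, so this sketch uses the median squared pairwise distance instead, a plainly swapped-in choice.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

K = 10
# Step 1: kernel width (median heuristic instead of the slide's smallest squared distance).
beta = 1.0 / np.median(pdist(X, metric="sqeuclidean"))
# Step 2: centers w_k from K-means applied to the inputs only.
centers = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
# Step 3: with beta and the w_k fixed, ERM on the alpha_k is a linear regression.
Phi = np.exp(-beta * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1))
alpha, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.mean((Phi @ alpha - Y) ** 2))  # empirical quadratic risk of the fitted RBF model
```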

ERM and statistics
Maximum likelihood: the classical way of estimating parameters in statistics consists in maximizing the likelihood function; this is empirical risk minimization in disguise, so the learnability results apply!
Linear regression: one assumes the conditional distribution Y_i \mid X_i = x \sim \mathcal{N}(\beta_0 + \beta^T x, \sigma^2); the MLE estimate of \beta_0 and \beta is obtained as (\beta_0, \beta)_{MLE} = \arg\min_{\beta_0, \beta} \sum_{i=1}^N (Y_i - \beta_0 - \beta^T X_i)^2, so MLE = ERM here.

Logistic regression
MLE: in logistic regression one assumes that (with Y = \{0, 1\}) P(Y_i = 1 \mid X_i = x) = \frac{1}{1 + \exp(-\beta_0 - \beta^T x)} = h_{\beta_0,\beta}(x); the MLE estimate is obtained by maximizing over (\beta_0, \beta) the log-likelihood \sum_{i=1}^N \left(Y_i \log h_{\beta_0,\beta}(X_i) + (1 - Y_i) \log(1 - h_{\beta_0,\beta}(X_i))\right).

ERM version
Machine learning view: let the model output space be \mathbb{R}, use again the class of linear models, and take the loss l(p, t) = t \log(1 + \exp(-p)) + (1 - t) \log(1 + \exp(p)).
Extended ML framework: the standard ML approach consists in looking for g in the set of functions from X to Y; logistic regression does not model directly the link between X and Y but rather a probabilistic link. This version is based on an extended paradigm in which g is a function from X to an output space \tilde{Y} (here \mathbb{R}) and the loss function is defined on \tilde{Y} \times Y.
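
A quick numerical check (with arbitrary illustrative values) that the loss above is exactly the negative Bernoulli log-likelihood of the previous slide, with p = beta_0 + beta^T x and h = 1 / (1 + exp(-p)):

```python
import numpy as np

def logistic_loss(p, t):
    """l(p, t) = t * log(1 + exp(-p)) + (1 - t) * log(1 + exp(p))."""
    return t * np.log1p(np.exp(-p)) + (1 - t) * np.log1p(np.exp(p))

def neg_log_likelihood(p, t):
    """-(t * log(h) + (1 - t) * log(1 - h)) with h = 1 / (1 + exp(-p))."""
    h = 1.0 / (1.0 + np.exp(-p))
    return -(t * np.log(h) + (1 - t) * np.log(1 - h))

p = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])  # scores beta_0 + beta^T x
t = np.array([0, 1, 1, 0, 1])              # labels in {0, 1}
print(np.allclose(logistic_loss(p, t), neg_log_likelihood(p, t)))  # True
```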

Relaxation
Simplifying the ERM: ERM with the binary loss is complex (a non-convex loss with no meaningful gradient); the goal is to keep the binary decision but remove the binary loss.
Solution: ask the model for a score rather than a binary decision; build a loss function that compares a score to the binary decision; use a decision rule consistent with the score. This is generally simpler to formulate with Y = \{-1, 1\} and the sign function.
This is a relaxation, as we no longer look for a crisp 0/1 solution but for a continuous one.

Examples
Numerous possible solutions, with Y = \{-1, 1\} and the decision based on \mathrm{sign}(p):
logistic loss: l_{logi}(p, t) = \log(1 + \exp(-pt));
perceptron loss: l_{per}(p, t) = \max(0, -pt);
hinge loss (Support Vector Machine): l_{hinge}(p, t) = \max(0, 1 - pt);
exponential loss (AdaBoost): l_{exp}(p, t) = \exp(-pt);
quadratic loss: l_2(p, t) = (p - t)^2.
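
For reference, the same surrogate losses written as vectorized Python functions (a direct transcription of the formulas above, with scores p and labels t in {-1, 1}):

```python
import numpy as np

def logistic(p, t):    return np.log1p(np.exp(-p * t))
def perceptron(p, t):  return np.maximum(0.0, -p * t)
def hinge(p, t):       return np.maximum(0.0, 1.0 - p * t)
def exponential(p, t): return np.exp(-p * t)
def quadratic(p, t):   return (p - t) ** 2

p = np.linspace(-2.0, 2.0, 5)  # scores
t = np.ones_like(p)            # true labels, here all +1
for loss in (logistic, perceptron, hinge, exponential, quadratic):
    print(loss.__name__, loss(p, t))
```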

Beyond ERM
This is not ERM anymore! \mathrm{sign} \circ g is a model in the original sense (a function from X to Y), but l_{relax}(g(X), Y) \neq l_b(\mathrm{sign}(g(X)), Y): this is surrogate loss minimization. Some theoretical results are available, but in general there is no guarantee of finding the best model with a surrogate loss.
Are there other reasons to avoid ERM?

Negative results (binary classification)
VCdim(G) = \infty: consider a fixed ML algorithm that picks a classifier in G, a class with infinite VC dimension (using whatever criterion); for all \epsilon > 0 and all N, there is a distribution D such that R_G^* = 0 and \mathbb{E}_{\mathcal{D} \sim D^N}[R(g_{\mathcal{D}})] \ge \frac{1}{2e} - \epsilon.
VCdim(G) < \infty: for all \epsilon > 0, there is a distribution D such that R_G^* - R^* > \frac{1}{2} - \epsilon.

Conclusion
Summary: the empirical risk minimization framework seems appealing at first, but it has several limitations. The binary loss comes with practical difficulties: implementation is difficult (because of the lack of smoothness) and the complexity can be high in the case of noisy data. Learnability is guaranteed, but only for model classes with finite VC dimension, which are strictly limited!
Beyond ERM: surrogate loss functions, data-adaptive model classes.

Licence
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-sa/4.0/

Version
Last git commit: 2018-06-06
By: Fabrice Rossi (Fabrice.Rossi@apiacoa.org)
Git hash: 1b39c1bacfc1b07f96d689db230b2586549a62d4