Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018
Outline Introduction PAC learning ERM in practice 2
General setting Data X the input space and Y the output space D a fixed and unknown distribution on X Y Loss function A loss function l is a function from Y Y to R + such that Y Y, l(y, Y) = 0 Model, loss and risk a model g is a function from X to Y given a loss function l the risk of g is R l (g) = E (X,Y) D (l(g(x), Y)) optimal risk R l = inf g R l (g) 3
Supervised learning Data set D = ((X i, Y i )) 1 i N (X i, Y i ) D (i.i.d.) D D N (product distribution) General problem a learning algorithm creates from D a model g D does R l (g D ) reaches Rl when D goes to infinity? if so, how quickly? 4
Empirical risk minimization Empirical risk R l (g, D) = 1 N N i=1 l(g(x i ), Y i ) = 1 D (x,y) D l(g(x), y) ERM algorithm choose a class of functions G from X to Y define g ERM,l,G,D = arg min R l (g, D) g G is ERM a good machine learning algorithm? 5
Three distinct problems 1. an optimization problem given l and G how difficult is finding arg min g G Rl (g, D)? given limited computational resources, how close can we get to arg min g G Rl (g, D)? 2. an estimation problem given G a class of function, define R l,g = inf g G R l (g) can we bound R l (g D) R l,g? 3. an approximation problem can be bound R l,g R l? in a way that is compatible with estimation? 6
Three distinct problems 1. an optimization problem given l and G how difficult is finding arg min g G Rl (g, D)? given limited computational resources, how close can we get to arg min g G Rl (g, D)? 2. an estimation problem given G a class of function, define R l,g = inf g G R l (g) can we bound R l (g D) R l,g? 3. an approximation problem can be bound R l,g R l? in a way that is compatible with estimation? Focus of this course the estimation problem and then the approximation problem with a few words about the optimization problem 6
Outline Introduction PAC learning ERM in practice 7
A simplified case Learning concepts a concept c is a mapping from X to Y = {0, 1} in concept learning, the loss function l b with l b (p, t) = 1 p t we consider only a distribution D X over X risk and empirical risk definitions are adapted to this setting: risk: R(g) = E X DX (1 g(x) c(x) ) empirical risk: R(g, D) = 1 N N i=1 1 g(x i ) c(x i ) in essence the pair (D X, c) replaces D: this corresponds to a noise free situation as a consequence a data set is D = {X 1,..., X N } and has to complemented by a concept to learn 8
PAC learning Notations If A is a learning algorithm, then A(D) is the model produced by running A on the data set D Definition A concept class C (i.e. a set of concepts) is PAC-learnable if there is an algorithm A and a function N C from [0, 1] 2 to N such that: for any 1 > ɛ > 0 and any 1 > δ > 0, for any distribution D X and any concept c C, if N N C (ɛ, δ) then P D D N X {R (A(D)) ɛ} 1 δ probably 1 δ approximately correct ɛ 9
Concept learning and ERM Remark the concept to learn c is in C thus R G = 0 in addition, for any D, R(g ERM,G,D, D) = 0 then ERM provides PAC-learnability if for any g C such that R(g, D) = 0, P D D N X {R (g) ɛ} 1 δ Theorem Let C be a finite concept class and let A be an algorithm that outputs A(D) such that R(A(D), D) = 0. Then when N, P D D N X {R (A(D)) ɛ} 1 δ 1 ɛ log C δ 10
Proof 1. we consider ways to break the AC part, i.e. having both R(g, D) = 0 and R(g) > ɛ. We have Q = P( g C, R(g, D) = 0 and R(g) > ɛ) = P ) ( R(g, D) = 0 and R(g) > ɛ g C 2. union bound Q g C P( R(g, D) = 0 and R(g) > ɛ) 3. then we have P( R(g, D) = 0 and R(g) > ɛ) = P( R(g, D) = 0 R(g) > ɛ)p(r(g) > ɛ) P( R(g, D) = 0 R(g) > ɛ) 11
Proof cont. notice that R(g) = P X DX (g(x) c(x)) thus P X DX (g(x) = c(x) R(g) > ɛ) 1 ɛ as the observations are i.i.d, P( R(g, D) = 0 R(g) > ɛ) (1 ɛ) N e Nɛ finally P( g C, R(g, D) = 0 and R(g) > ɛ) C e Nɛ then if R(A(D), D) = 0, P D D N X {R (A(D)) ɛ} 1 C e Nɛ we want C e Nɛ δ, which happens when N 1 ɛ log C δ 12
PAC concept learning ERM ERM provides PAC-learnability for finite concept classes optimization computational cost in Θ (N C ) Data consumption the data needed to reach some PAC level grows with the logarithm of the concept class a finite set C can be encoded with log 2 C bits (by numbering the elements) each observation X fixes one bit of the solution 13
Generalization Concept learning is too limited no noise fixed loss function Agnostic PAC learnability A class of models G (functions from X to Y) is PAC-learnable with respect to a loss function l if there is an algorithm A and a function N G from [0, 1] 2 to N such that: for any 1 > ɛ > 0 and any 1 > δ > 0, for any distribution D on X Y if N N G (ɛ, δ) then { P D D N Rl (A(D)) Rl,G + ɛ } 1 δ Main questions does ERM provide agnostic PAC learnability? does that apply to infinite classes of models? 14
Uniform approximation Lemma Controlling the ERM can be done by ensuring the empirical risk is uniformly a good approximation of the true risk: R l (g ERM,l,G,D ) Rl,G 2 sup R l (g) R l (g, D) g G 15
Uniform approximation Lemma Controlling the ERM can be done by ensuring the empirical risk is uniformly a good approximation of the true risk: R l (g ERM,l,G,D ) Rl,G 2 sup R l (g) R l (g, D) g G Proof. for any g G, we have R l (g ERM,l,G,D ) R l (g) = R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g ERM,l,G,D, D) R l (g), R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g, D) R l (g), R l (g ERM,l,G,D ) R l (g ERM,l,G,D, D) + R l (g, D) R l (g), 2 sup R l (g) R l (g, D), g G which leads to the conclusion. 15
Finite classes Theorem If G < and if l [a, b] them when N log( 2 G δ )(b a) 2 } P D D N X {sup R l (g) R l (g, D) ɛ 1 δ g G 2ɛ 2 Proof. very rough sketch 1. we use the union technique to focus on a single model g 2. then we use the Hoeffding inequality to bound the difference between an empirical average and an expectation. In our context it says ( ) { Rl P D D N (g) R } ɛ 2 l (g, D) ɛ 2 exp 2N X (b a) 2 3. the conclusion is obtained as in the simple case of concept learning 16
Finite classes Theorem If G < and if l [0, 1] the ERM provides agnostic PAC-learnability 2 log( with N G (ɛ, δ) = 2 G δ )(b a) 2 Discussion ɛ 2 obvious consequence of the uniform approximation result the limitation l [a, b] can be lifted but only asymptotically the dependency of the data size to the quality (i.e. to ɛ) is far less satisfactory than in the simple case: this is a consequence of allowing noise 17
Infinite classes Restriction we keep the noise but move back to a simple case Y = {0, 1} and l = l b Growth function if {v 1,..., v m } is a finite subset of X G {v1,...,v m} = {(g(v 1 ),..., g(v m )) g G} {0, 1} m the growth function of G is S G (m) = sup {v 1,...,v m} X G{v1,...,v m} 18
Interpretation Going back to finite things G{v1,...,v m} gives the number of models as seen by the inputs {v 1,..., v m } it corresponds to the number of possible classification decisions (a.k.a. binary labelling) of those inputs the growth function corresponds to the worst case analysis: the set of inputs that can be labelled in the largest number of different ways Vocabulary if G{v1,...,v m} = 2 m then {v 1,..., v m } is said to be shattered by G S G (m) is the m-th shatter coefficient of G 19
Uniform approximation Theorem For any 1 > ɛ > 0 and any 1 > δ > 0 and for any distribution D P D D N X {sup R lb (g) R lb (g, D) 4 + } log(s G (2N)) g G δ 1 δ 2N Consequences strong link between the growth function and uniform approximation useful only if log(s G(2m)) m goes to zero when m if G shatters sets of arbitrary sizes log(s G (2m)) = 2m log 2 20
Vapnik Chervonenkis dimension VC-dimension Characterization VCdim(G) = m if and only if VCdim(G) = sup {m N S G (m) = 2 m } 1. there is a set of m points {v 1,..., v m } that is shattered by G 2. no set of m + 1 points {v 1,..., v m+1 } is shattered by G Lemma (Sauer) If VCdim(G) <, for all m S G (m) VCdim(G) k=0 m VCdim(G) ( em S G (m) VCdim(G) ) VCdim(G) ( m ) k. In particular when 21
Finite VC-dimension Consequences If VCdim(G) = d <, for any 1 > ɛ > 0 and any 1 > δ > 0 and for any distribution D, if N d then P D D N X sup g G log( 2eN R lb (g) R lb (g, D) 4 + δ 2N d ) 1 δ Learnability a finite VC-dimension ensures agnostic PAC-learnability of the ERM ( ) VCdim(G)+log 1 δ it can be shown that N G (ɛ, δ) = Θ ɛ 2 22
VC-dimension VC-dimension calculation is very difficult! A useful result: Theorem Let F be a vector space of functions from X to R of dimension p. Let G be the class of models given by G = { g : X {0, 1} f F, X X g(x) = 1 f (X) 0 }. Then VCdim(G) p. 23
Is a finite VC-dimension needed? Theorem Let G be a class of models from X to Y = {0, 1}. Then the following properties are equivalent: 1. G is agnostic PAC-learnable with the binary loss l b 2. ERM provides agnostic PAC-learnable with the binary loss l b for G 3. VCdim(G) < Interpretation learnability in the PAC sense is therefore uniquely characterized by the VC-dimension of the class of models no algorithmic tricks can be used to circumvent this fact! but this applies only to a fix class! 24
Beyond binary classification numerous extensions are available to the regression setting (with quadratic or absolute loss) to classification with more than two classes refined complexity measures are available Rademacher complexity Covering numbers better bounds are also available in general in the noise free situation But the overall message remains the same: learnability is only possible in classes of bounded complexity. 25
Outline Introduction PAC learning ERM in practice 26
ERM in practice Empirical risk minimization g ERM,l,G,D = arg min R l (g, D) g G Implementation? what class G should we use? potential candidates? how to chose among them? how to implement the minimization part complexity? approximate solutions? 27
Model classes Some examples fixed basis models, e.g. for Y = { 1, 1} { ( K )} G = X sign α k f k (X), k=1 where the f k are fixed functions from X to R parametric basis models { ( K )} G = X sign α k f k (w k, X), k=1 where the f k (w k,.) are fixed functions from X to R and the w k are parameters that enable tuning the f k useful also for Y = R (remove the indicator function) 28
Examples Linear models X = R P the linearity is with respect to α basic model f k (X) = X k P k=1 α kf k (X) = α T X general models f k (X) can be any polynomial function on R P or more generally a function from R P to R e.g. f k (X) = X 1 X 2 2, f k (X) = log X 3, etc. 29
Examples Nonlinear models Radial Basis Function (RBF) neural networks: X = R P f k ((β, w k ), X) = exp ( β X w k 2) one hidden layer perceptron: X = R P 1 f k ((β, w k ), X) = 1 + exp( β w T k X) More complex outputs if Y <, write Y = {y 1,..., y L } possible class G = { X y t(x), with t(x) = arg max l ( exp ( ))} K α kl f kl (w kl, X) k=1 30
(Meta)Parameters Parametric view previous classes are described by parameters ERM is defined at the model level but can equivalently be considered at the parameter level { if e.g. G = X sign( } K k=1 α kf k (X)) solving min g G Rlb (g, D) is equivalent to solving min R α R K lb (g α, D), where g α is the model associated to α 31
(Meta)Parameters Parametric view previous classes are described by parameters ERM is defined at the model level but can equivalently be considered at the parameter level { if e.g. G = X sign( } K k=1 α kf k (X)) solving min g G Rlb (g, D) is equivalent to solving min R α R K lb (g α, D), where g α is the model associated to α Meta-parameters to avoid confusion, we use the term "meta-parameters" to refer to parameters of the machine learning algorithm in ERM, those are class level parameters (G itself): K the f k functions the parametric form of f k (w k,.) 31
ERM and optimization Standard optimization problem computing arg min g G Rl (g, D) is a classical optimization problem no closed-form solution in general ERM relies on standard algorithms: gradient based algorithms if possible, combinatorial optimization tools if needed Very different complexities from easy cases: linear models with quadratic loss to NP-hard ones: binary loss even with super simple models 32
Linear regression ERM version of linear regression class of models { } G = g : R P R (β 0, β), X R P g β0,β(x) = β 0 + β T x loss: l(p, t) = (p t) 2 empirical risk: R l (g β0,β, D) = 1 N N i=1 (Y i β 0 β T X i ) 2 standard solution ( β 0, β ) T = ( X T X ) 1 X T Y with X = 1 X T 1 1 X T N computational cost in Θ ( NP 2) Y = Y T 1 Y T N 33
Linear classification Linear model with binary loss class of models { } G = g : R P {0, 1} (β 0, β), X R P g β0,β(x) = sign(β 0 + β T x) loss: l b (p, t) = 1 p t empirical risk: misclassification rate in this context ERM is NP-hard and tight approximations are also NP-hard notice that the input dimension is the source of complexity Noise if the optimal model makes zero error, then ERM is polynomial! complexity comes from both noise and the binary loss 34
Gradient descent Smooth functions Y = R parametric case G = {X F(w, X)} assume the loss function l and the models in the class G are differentiable (can be extended to subgradients) gradient of the empirical loss w Rl (w, D) = 1 N N i=1 l p (F(w, X i), Y i ) w F(w, X i ) ERM through standard gradient based algorithms, such as gradient descent w t = w t 1 γ t w Rl (w t 1, D) 35
Stochastic gradient descent Finite sum leverage the structure of R l (w, D) = 1 N N i=1 l(f (w, X i), Y i ) what about updating according to only one example? stochastic gradient descent 1. start with a random w 0 2. iterate 2.1 select i t randomly uniformly in {1,..., N} 2.2 w t = w t 1 γ t l p (F (wt 1, X i t ), Y i t ) wf (w t 1, X i t ) practical tips: use the Polyak-Ruppert averaging: w m = 1 m 1 m t=0 wt γ t = (γ 0 + t) κ, κ ]0.5, 1], γ 0 0 numerous acceleration techniques such as momentum ( averaged" gradients) 36
Ad hoc methods Heuristics numerous heuristics have been proposed for ERM and related problems one of the main tools is alternate/separate optimization: optimize with respect to some parameters while holding the other constants Radial basis function { G = X K k=1 α k exp ( β X w k 2)} set β heuristically, e.g. to the inverse of the smallest squared distance between two X in the data set set the w k via a unsupervised method applied to the (X i ) 1 i N only (for instance the K-means algorithm) consider β and the w k fixed and apply standard ERM to the α k, e.g. linear regression if the loss function is quadratic and Y = R 37
ERM and statistics Maximum likelihood the classical way of estimating parameters in statistics consists in maximizing the likelihood function this is empirical risk minimization in disguise learnability results apply! Linear regression in linear regression one assumes that the following conditional distribution: Y i X i = x N (β 0 + β T x, σ 2 ) the MLE estimate of β 0 and β is obtained as MLE=ERM here (β 0, β) MLE = arg min β 0,β N ) 2 (Y i β 0 β T X i i=1 38
Logistic Regression MLE in logistic regression one assumes that (with Y = {0, 1}) P(Y i = 1 X i = x) = 1 1 + exp( β 0 β T x) = h β 0,β(x) the MLE estimate is obtained by maximizing over (β 0, β) the following function N (Y i log h β0,β(x i ) + (1 Y i ) log(1 h β0,β(x i ))) i=1 39
ERM version Machine learning view assume Y = R use again the class of linear models the loss is given by l(p, t) = t log(1 + exp( p)) + (1 t) log(1 + exp(p)) Extended ML framework the standard ML approach consists in looking for g in the set of functions from X to Y the logistic regression does not model directly the link between X and Y but rather a probabilistic link the ML version is based on a new ML paradigm where the loss function is defined on Y Y and g is a function from X to Y 40
Relaxation Simplifying the ERM ERM with binary loss is complex: non convex loss with no meaningful gradient goal: keep the binary decision but remove the binary loss solution: ask to the model a score rather than a binary decision build a loss function that compares a score to the binary decision use a decision technique consistent with the score generally simpler to formulate with Y = { 1, 1} and the sign function this a relaxation as we do not look anymore for a crisp 0/1 solution but for a continuous one 41
Examples Numerous possible solutions Y = { 1, 1} with decision based on sign(p) logistic loss: l logi (p, t) = log(1 + exp( pt)) perceptron loss: l per (p, t) = max(0, pt) hinge loss (Support Vector Machine): l hinge (p, t) = max(0, 1 pt) exponential loss (Ada boost): l exp (p, t) = exp( pt) quadratic loss: l 2 (p, t) = (p t) 2 42
Beyond ERM This is not ERM anymore! sign g is a model in the original sense (a function from X to Y) but l relax (g(x), Y ) l b (sign(g(x)), Y) surrogate loss minimization some theoretical results are available but in general no guarantee to find the best model with a surrogate loss Are there other reasons to avoid ERM? 43
Negative results (binary classification) VCdim(G) = consider a fixed ML algorithm that picks up a classifier in G with infinite VC dimension (using whatever criterion) for all ɛ > 0 and all N, there is D such that R G = 0 and E D D N (R(g D )) 1 2e ɛ VCdim(G) < for all ɛ > 0, there is D such that R G R > 1 2 ɛ 44
Conclusion Summary The empirical risk minimization framework seems appealing at first but it has several limitations the binary loss is associated to practical difficulties: implementation is difficult (because of the lack of smoothness) complexity can be high in the case of noisy data learnability is guaranteed but only for model classes with finite VC dimension which are strictly limited! Beyond ERM surrogate loss function data adaptive model class 45
Licence This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-sa/4.0/ 46
Version Last git commit: 2018-06-06 By: Fabrice Rossi (Fabrice.Rossi@apiacoa.org) Git hash: 1b39c1bacfc1b07f96d689db230b2586549a62d4 47