Class 2 & 3 Overfitting & Regularization


1 Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017

2 Last Class
The goal of Statistical Learning Theory is to find a good estimator $f_n : \mathcal{X} \to \mathcal{Y}$ approximating the lowest expected risk $\inf_{f:\mathcal{X}\to\mathcal{Y}} \mathcal{E}(f)$, where
$\mathcal{E}(f) = \int_{\mathcal{X}\times\mathcal{Y}} \ell(f(x), y)\, d\rho(x, y),$
given only a finite number of (training) examples $(x_i, y_i)_{i=1}^n$ sampled independently from the unknown distribution $\rho$.

3 Last Class: The SLT Wishlist
What does "good estimator" mean?
- Low excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$.
- Consistency. Does $\mathcal{E}(f_n) - \mathcal{E}(f^*) \to 0$ as $n \to +\infty$? In expectation? In probability? With respect to a training set $S = (x_i, y_i)_{i=1}^n$ of points randomly sampled from $\rho$.
- Learning rates. How fast is consistency achieved? Nonasymptotic bounds: finite sample complexity, tail bounds, error bounds...

4 Last Class (Expected Vs Empirical Risk)
Approximate the expected risk of $f : \mathcal{X} \to \mathcal{Y}$ via its empirical risk
$\mathcal{E}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i).$
Expectation: $\mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sqrt{V_f/n}$.
Probability (e.g. using Chebyshev): $P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big) \le \frac{V_f}{n\epsilon^2}$ for all $\epsilon > 0$,
where $V_f = \mathrm{Var}_{(x,y)\sim\rho}\big(\ell(f(x), y)\big)$.
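
To make the Chebyshev bound concrete, here is a small simulation sketch (my own illustrative setup, not from the slides): it fixes a predictor $f$ and a simple distribution $\rho$, estimates $P(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon)$ over repeated training samples, and compares it with $V_f/(n\epsilon^2)$.

```python
# Minimal simulation sketch (illustrative setup, not from the slides):
# compare the empirical deviation probability of E_n(f) from E(f)
# with the Chebyshev bound V_f / (n * eps^2) for the squared loss.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Toy distribution rho: x ~ U[0,1], y = x + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = x + 0.3 * rng.normal(size=n)
    return x, y

f = lambda x: 0.5 * np.ones_like(x)        # a fixed (imperfect) predictor
loss = lambda yp, y: (yp - y) ** 2          # squared loss

# Monte Carlo estimates of E(f) and V_f = Var(loss(f(x), y)).
xs, ys = sample(1_000_000)
ls = loss(f(xs), ys)
E_f, V_f = ls.mean(), ls.var()

n, eps, trials = 200, 0.05, 5000
deviations = 0
for _ in range(trials):
    x, y = sample(n)
    E_n = loss(f(x), y).mean()              # empirical risk on n points
    deviations += abs(E_n - E_f) >= eps

print("empirical P(|E_n - E| >= eps):", deviations / trials)
print("Chebyshev bound V_f/(n eps^2):", V_f / (n * eps ** 2))
```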

5 Last Class (Empirical Risk Minimization)
Idea: if $\mathcal{E}_n$ is a good approximation to $\mathcal{E}$, then we could use
$f_n = \operatorname*{argmin}_{f \in \mathcal{F}} \mathcal{E}_n(f)$
to approximate $f^*$. This is known as empirical risk minimization (ERM).
Note: if we sample the points in $S = (x_i, y_i)_{i=1}^n$ independently from $\rho$, the corresponding $f_n = f_S$ is a random variable and we have
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le \mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)].$
Question: does $\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)]$ go to zero as $n$ increases?

6 Issues with ERM
Assume $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $\rho$ with dense support¹ and $\ell(y, y) = 0$ for all $y \in \mathcal{Y}$. For any set $(x_i, y_i)_{i=1}^n$ s.t. $x_i \ne x_j$ for $i \ne j$, let $f_n : \mathcal{X} \to \mathcal{Y}$ be such that
$f_n(x) = y_i$ if $x = x_i$ for some $i \in \{1, \dots, n\}$, and $f_n(x) = 0$ otherwise.
Then, for any number $n$ of training points:
$\mathbb{E}\,\mathcal{E}_n(f_n) = 0$ while $\mathbb{E}\,\mathcal{E}(f_n) = \mathcal{E}(0)$, which is greater than zero (unless $f^* \equiv 0$).
Therefore $\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}_n(f_n)] = \mathcal{E}(0) \not\to 0$ as $n$ increases!
¹ and such that every pair $(x, y)$ has measure zero according to $\rho$.
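
A small numerical illustration of the memorizing estimator above (my own toy setup): its empirical risk is exactly zero on the training set, yet its expected risk (estimated on fresh samples) stays close to $\mathcal{E}(0)$ no matter how large $n$ is.

```python
# Toy illustration of the "memorizing" estimator: zero empirical risk,
# but the expected risk stays at roughly E(0) regardless of n.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

def memorizer(x_train, y_train):
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    # Returns y_i on training points, 0 everywhere else.
    return lambda x: np.array([lookup.get(float(v), 0.0) for v in x])

loss = lambda yp, y: (yp - y) ** 2

for n in [10, 100, 1000]:
    x_tr, y_tr = sample(n)
    f_n = memorizer(x_tr, y_tr)
    emp_risk = loss(f_n(x_tr), y_tr).mean()          # always 0
    x_te, y_te = sample(100_000)                      # fresh data ~ rho
    exp_risk = loss(f_n(x_te), y_te).mean()           # ~ E(0)
    print(f"n={n:5d}  empirical risk={emp_risk:.3f}  expected risk={exp_risk:.3f}")
```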

7 Overfitting
An estimator $f_n$ is said to overfit the training data if for any $n \in \mathbb{N}$:
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] > C$ for a constant $C > 0$, while $\mathbb{E}\,[\mathcal{E}_n(f_n) - \mathcal{E}_n(f^*)] \le 0$.
According to this definition, ERM (over the space of all functions) overfits...

8 ERM on Finite Hypotheses Spaces
Is ERM hopeless? Consider the case $\mathcal{X}$ and $\mathcal{Y}$ finite. Then $\mathcal{F} = \mathcal{Y}^{\mathcal{X}} = \{f : \mathcal{X} \to \mathcal{Y}\}$ is finite as well (albeit possibly large), and therefore
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \mathbb{E}\,\sup_{f \in \mathcal{F}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sum_{f \in \mathcal{F}} \mathbb{E}\,|\mathcal{E}_n(f) - \mathcal{E}(f)| \le |\mathcal{F}|\,\sqrt{V_{\mathcal{F}}/n},$
where $V_{\mathcal{F}} = \sup_{f \in \mathcal{F}} V_f$ and $|\mathcal{F}|$ denotes the cardinality of $\mathcal{F}$. Then ERM works! Namely:
$\lim_{n \to +\infty} \mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] = 0.$

9 ERM on Finite Hypotheses (Sub)Spaces
The same argument holds in general: let $\mathcal{H} \subseteq \mathcal{F}$ be a finite space of hypotheses. Then
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le |\mathcal{H}|\,\sqrt{V_{\mathcal{H}}/n}.$
In particular, if $f^* \in \mathcal{H}$, then
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le |\mathcal{H}|\,\sqrt{V_{\mathcal{H}}/n}$
and ERM is a good estimator for the problem considered.

10 Example: Threshold Functions
Consider a binary classification problem with $\mathcal{Y} = \{0, 1\}$. Someone has told us that the minimizer of the risk is a threshold function $f_a(x) = 1_{[a, +\infty)}(x)$ with $a \in [-1, 1]$.
[Figure: two threshold functions with thresholds $a$ and $b$.]
We can learn on $\mathcal{H} = \{f_a \mid a \in [-1, 1]\} \cong [-1, 1]$. However, on a computer we can only represent real numbers up to a given precision.

11 Example: Threshold Functions (with precision p)
Discretization: given a $p > 0$, we can consider
$\mathcal{H}_p = \{f_a \mid a \in [-1, 1],\ a\,10^p = [a\,10^p]\},$
with $[a]$ denoting the integer part (i.e. the closest integer) of a scalar $a$. The value $p$ can be interpreted as the precision of our space of functions $\mathcal{H}_p$. Note that $|\mathcal{H}_p| = 2 \cdot 10^p$.
If $f^* \in \mathcal{H}_p$, then we automatically have
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le |\mathcal{H}_p|\,\sqrt{V_{\mathcal{H}}/n} \approx 10^p/\sqrt{n}$
($V_{\mathcal{H}} \le 1$ since $\ell$ is the 0-1 loss and therefore $\ell(f(x), y) \le 1$ for any $f \in \mathcal{H}$).
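
The following sketch (illustrative only; the data-generating distribution and the parameter values are made up) runs ERM over the discretized space $\mathcal{H}_p$ of threshold functions with the 0-1 loss and reports the empirical risk and a Monte Carlo estimate of the expected risk.

```python
# ERM over the discretized threshold class H_p with the 0-1 loss
# (illustrative sketch; the data-generating distribution is made up).
import numpy as np

rng = np.random.default_rng(2)
a_star = 0.3172  # "true" threshold, generally not on the grid

def sample(n, flip=0.1):
    x = rng.uniform(-1, 1, n)
    y = (x >= a_star).astype(int)
    noise = rng.uniform(size=n) < flip          # label noise
    return x, np.where(noise, 1 - y, y)

def erm_threshold(x, y, p):
    """ERM on H_p: thresholds a with a*10^p integer, a in [-1, 1]."""
    grid = np.arange(-10**p, 10**p + 1) / 10**p
    errors = [np.mean((x >= a).astype(int) != y) for a in grid]
    return grid[int(np.argmin(errors))]

n, p = 500, 2
x_tr, y_tr = sample(n)
a_hat = erm_threshold(x_tr, y_tr, p)

x_te, y_te = sample(200_000)                     # Monte Carlo test set
train_err = np.mean((x_tr >= a_hat).astype(int) != y_tr)
test_err = np.mean((x_te >= a_hat).astype(int) != y_te)
print(f"a_hat={a_hat:+.2f}  empirical risk={train_err:.3f}  expected risk={test_err:.3f}")
```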

12 Rates in Expectation Vs Probability
In practice, even for small values of $p$, the bound
$\mathbb{E}\,[\mathcal{E}(f_n) - \mathcal{E}(f^*)] \le 10^p/\sqrt{n}$
needs a very large $n$ in order to say something meaningful about the expected error. Interestingly, we can get much better constants (not rates though!) by working in probability...

13 Hoeffding's Inequality
Let $X_1, \dots, X_n$ be independent random variables s.t. $X_i \in [a_i, b_i]$. Let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. Then,
$P\big(|\bar{X} - \mathbb{E}\,\bar{X}| \ge \epsilon\big) \le 2\exp\!\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$

14 Applying Hoeffding's Inequality
Assume that for all $f \in \mathcal{H}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$ the loss is bounded, $|\ell(f(x), y)| \le M$, by some constant $M > 0$. Then, for any $f \in \mathcal{H}$ we have
$P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big) \le 2\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$
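
As a quick consequence, inverting this bound gives the number of samples needed so that a single hypothesis has deviation at most $\epsilon$ with probability $1 - \delta$; the helper below (an illustrative calculation, not from the slides) also shows how the union bound over a finite $\mathcal{H}$ only adds a $\log|\mathcal{H}|$ term.

```python
# Sample sizes implied by Hoeffding + union bound (illustrative helper).
# Single f:  2*exp(-n eps^2/(2 M^2)) <= delta  =>  n >= 2 M^2 log(2/delta) / eps^2
# Finite H:  replace log(2/delta) by log(2|H|/delta)   (union bound)
import math

def n_single(eps, delta, M=1.0):
    return math.ceil(2 * M**2 * math.log(2 / delta) / eps**2)

def n_union(eps, delta, card_H, M=1.0):
    return math.ceil(2 * M**2 * math.log(2 * card_H / delta) / eps**2)

eps, delta = 0.05, 0.001
print("single hypothesis:", n_single(eps, delta))
print("|H| = 2*10^2     :", n_union(eps, delta, 2 * 10**2))
print("|H| = 2*10^6     :", n_union(eps, delta, 2 * 10**6))
```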

15 Controlling the Generalization Error
We would like to control the generalization error $|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)|$ of our estimator in probability. One possible way to do that is by controlling the generalization error of the whole set $\mathcal{H}$:
$P\big(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon\big) \le P\Big(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big).$
The latter term is the probability that at least one of the events $|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon$ occurs for $f \in \mathcal{H}$, in other words the probability of the union of such events. Therefore
$P\Big(\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big) \le \sum_{f \in \mathcal{H}} P\big(|\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\big)$
by the so-called union bound.

16 Hoeffding the Generalization Error
By applying Hoeffding's inequality,
$P\big(|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \ge \epsilon\big) \le 2|\mathcal{H}|\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$
Or, equivalently, for any $\delta \in (0, 1]$,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2M^2\log(2|\mathcal{H}|/\delta)}{n}}$
with probability at least $1 - \delta$.

17 Example: Threshold Functions (in Probability)
Going back to the space $\mathcal{H}_p$ of threshold functions,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{4 + 6p - 2\log\delta}{n}}$
with probability at least $1 - \delta$, since $M = 1$ and $\log(2|\mathcal{H}_p|) = \log(4 \cdot 10^p) = \log 4 + p\log 10$. For example, let $\delta = 0.001$. We can say that
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$
holds at least 99.9% of the times.

18 Bounds in Expectation Vs Probability
Comparing the two bounds:
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le 10^p/\sqrt{n}$ (Expectation)
while, with probability greater than 99.9%,
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}}$ (Probability).
Although we cannot be 100% sure of it, we can be quite confident that the generalization error will be much smaller than what the bound in expectation tells us...
Rates: note however that the rates of convergence to 0 are the same (i.e. $O(1/\sqrt{n})$).
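
A short numerical tabulation makes the gap between the two bounds visible (the values of $n$ and $p$ below are arbitrary illustrative choices):

```python
# Numerical comparison of the two generalization bounds for H_p
# (values of n and p are arbitrary illustrative choices).
import math

def bound_expectation(n, p):
    return 10**p / math.sqrt(n)            # E|E_n(f_n) - E(f_n)| <= 10^p / sqrt(n)

def bound_probability(n, p):
    return math.sqrt((6 * p + 18) / n)     # holds with probability >= 99.9%

for n in [10**3, 10**5, 10**7]:
    p = 2
    print(f"n={n:>8d}  expectation: {bound_expectation(n, p):8.3f}   "
          f"probability (99.9%): {bound_probability(n, p):6.3f}")
```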

19 Improving the Bound in Expectation
Exploiting the bound in probability and the knowledge that on $\mathcal{H}_p$ the excess risk is bounded by a constant, we can improve the bound in expectation... Let $X$ be a random variable s.t. $|X| \le M$ for some constant $M > 0$. Then, for any $\epsilon > 0$ we have
$\mathbb{E}\,|X| \le \epsilon\, P(|X| \le \epsilon) + M\, P(|X| > \epsilon).$
Applying this to our problem: for any $\delta \in (0, 1]$,
$\mathbb{E}\,|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{2M^2\log(2|\mathcal{H}_p|/\delta)}{n}}\,(1 - \delta) + \delta M.$
Therefore only $\log|\mathcal{H}_p|$ appears (not $|\mathcal{H}_p|$ on its own).

20 Infinite Hypotheses Spaces
What if $f^* \in \mathcal{H} \setminus \mathcal{H}_p$ for every $p > 0$? ERM on $\mathcal{H}_p$ will never minimize the expected risk: there will always be a gap $\mathcal{E}(f_{n,p}) - \mathcal{E}(f^*) > 0$. For $p \to +\infty$ it is natural to expect such a gap to decrease... BUT if $p$ increases too fast (with respect to the number $n$ of examples) we cannot control the generalization error anymore:
$|\mathcal{E}_n(f_n) - \mathcal{E}(f_n)| \le \sqrt{\frac{6p + 18}{n}} \to +\infty \quad \text{for } p \to +\infty.$
Therefore we need to increase $p$ gradually, as a function $p(n)$ of the number of training examples. This approach is known as regularization.

21 Approximation Error for Threshold Functions
Let us consider $f_p = 1_{[a_p, +\infty)} \in \operatorname*{argmin}_{f \in \mathcal{H}_p} \mathcal{E}(f)$ with $a_p \in [-1, 1]$. Consider the error decomposition of the excess risk $\mathcal{E}(f_n) - \mathcal{E}(f^*)$:
$\mathcal{E}(f_n) - \mathcal{E}_n(f_n) + \underbrace{\mathcal{E}_n(f_n) - \mathcal{E}_n(f_p)}_{\le\, 0} + \mathcal{E}_n(f_p) - \mathcal{E}(f_p) + \mathcal{E}(f_p) - \mathcal{E}(f^*).$
We already know how to control the generalization error of $f_n$ (via the supremum over $\mathcal{H}_p$) and of $f_p$ (since it is a single function). Moreover, the approximation error satisfies
$\mathcal{E}(f_p) - \mathcal{E}(f^*) \le |a_p - a^*| \le 10^{-p}.$
Note that it does not depend on the training data! (why?)

22 Approximation Error for Threshold Functions II
Putting everything together we have that, for any $\delta \in (0, 1)$ and $p \ge 0$,
$\mathcal{E}(f_n) - \mathcal{E}(f^*) \le \sqrt{\frac{4 + 6p - 2\log\delta}{n}} + 10^{-p} =: \phi(n, \delta, p)$
holds with probability greater than or equal to $1 - \delta$. In particular, for any $n$ and $\delta$, we can choose the best precision as
$p(n, \delta) = \operatorname*{argmin}_{p \ge 0}\ \phi(n, \delta, p),$
which leads to an error bound $\epsilon(n, \delta) = \phi(n, \delta, p(n, \delta))$ holding with probability larger than or equal to $1 - \delta$.
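
A minimal sketch of the precision selection rule above: minimize $\phi(n, \delta, p)$ over a grid of $p$ values (the grid and the value of $\delta$ are illustrative choices, not from the slides).

```python
# Choose the precision p(n, delta) by minimizing
#   phi(n, delta, p) = sqrt((4 + 6p - 2 log(delta)) / n) + 10^(-p)
# over a grid of p values (grid and delta are illustrative).
import math

def phi(n, delta, p):
    return math.sqrt((4 + 6 * p - 2 * math.log(delta)) / n) + 10.0 ** (-p)

def best_precision(n, delta, p_grid=None):
    p_grid = p_grid if p_grid is not None else [k / 10 for k in range(0, 101)]
    return min(p_grid, key=lambda p: phi(n, delta, p))

delta = 0.001
for n in [10**2, 10**4, 10**6, 10**8]:
    p_star = best_precision(n, delta)
    print(f"n={n:>9d}  p(n,delta)={p_star:4.1f}  error bound={phi(n, delta, p_star):.4f}")
```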

23 Regularization
Most hypotheses spaces are too large and therefore prone to overfitting. Regularization is the process of controlling the freedom of an estimator as a function of the number of training examples.
Idea. Parametrize $\mathcal{H}$ as a union $\mathcal{H} = \bigcup_{\gamma > 0} \mathcal{H}_\gamma$ of hypotheses spaces $\mathcal{H}_\gamma$ that are not prone to overfitting (e.g. finite spaces). $\gamma$ is known as the regularization parameter (e.g. the precision $p$ in our examples). Assume $\mathcal{H}_\gamma \subseteq \mathcal{H}_{\gamma'}$ if $\gamma \le \gamma'$.
Regularization Algorithm. Given $n$ training points, find an estimator $f_{\gamma,n}$ on $\mathcal{H}_\gamma$ (e.g. ERM on $\mathcal{H}_\gamma$). Let $\gamma = \gamma(n)$ increase as $n \to +\infty$.

24 Regularization and Decomposition of the Excess Risk
Let $\gamma > 0$ and $f_\gamma = \operatorname*{argmin}_{f \in \mathcal{H}_\gamma} \mathcal{E}(f)$. We can decompose the excess risk $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f^*)$ as
$\underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)}_{\text{Sample error}} + \underbrace{\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)}_{\text{Approximation error}} + \underbrace{\inf_{f \in \mathcal{H}} \mathcal{E}(f) - \mathcal{E}(f^*)}_{\text{Irreducible error}}.$

25 Irreducible Error
$\inf_{f \in \mathcal{H}} \mathcal{E}(f) - \mathcal{E}(f^*)$
Recall: $\mathcal{H}$ is the largest possible hypotheses space we are considering. If the irreducible error is zero, $\mathcal{H}$ is called universal (e.g. the RKHS induced by the Gaussian kernel is a universal hypotheses space).

26 Approximation Error
$\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)$
Does not depend on the dataset (deterministic). Does depend on the distribution $\rho$. Also referred to as bias.

27 Convergence of the Approximation Error
Under mild assumptions,
$\lim_{\gamma \to +\infty} \mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) = 0.$

28 Density Results
$\lim_{\gamma \to +\infty} \mathcal{E}(f_\gamma) - \mathcal{E}(f^*) = 0$
= Convergence of the approximation error + universal hypotheses space.
Note: it corresponds to a density property of the space $\mathcal{H}$ in $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$.

29 Approximation Error Bounds
$\mathcal{E}(f_\gamma) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \le A(\rho, \gamma)$
No rates without assumptions, related to the so-called No Free Lunch Theorem. Studied in approximation theory using tools such as Kolmogorov n-widths, K-functionals, interpolation spaces... Prototypical result: if $f^*$ has smoothness² $s$, then $A(\rho, \gamma) = c\,\gamma^{-s}$.
² Some abstract notion of regularity parametrizing the class of target functions. Typical example: $f^*$ in a Sobolev space $W^{s,2}$.

30 Sample Error
$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)$
Random quantity depending on the data. Two main ways to study it: capacity/complexity estimates on $\mathcal{H}_\gamma$, and stability.

31 Sample Error Decomposition
We have seen how to decompose the sample error $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)$ into
$\underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})}_{\text{Generalization error}} + \underbrace{\mathcal{E}_n(f_{\gamma,n}) - \mathcal{E}_n(f_\gamma)}_{\text{Excess empirical risk}} + \underbrace{\mathcal{E}_n(f_\gamma) - \mathcal{E}(f_\gamma)}_{\text{Generalization error}}.$

32 Generalization Error(s)
As we have observed, $\mathcal{E}(f_{\gamma,n}) - \mathcal{E}_n(f_{\gamma,n})$ and $\mathcal{E}_n(f_\gamma) - \mathcal{E}(f_\gamma)$ can be controlled by studying the empirical process
$\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)|.$
Example: we have already observed that for a finite space $\mathcal{H}_\gamma$,
$P\Big(\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)| \ge \epsilon\Big) \le 2|\mathcal{H}_\gamma|\exp\!\left(-\frac{n\epsilon^2}{2M^2}\right).$

33 ERM on Finite Spaces and Computational Efficiency
The strategy used for threshold functions can be generalized to any $\mathcal{H}$ for which it is possible to find a finite discretization $\mathcal{H}_p$ with respect to the $L^1(\mathcal{X}, \rho_{\mathcal{X}})$ norm (e.g. $\mathcal{H}$ compact with respect to such a norm). However, in general it could be computationally very expensive to find the empirical risk minimizer on a discretization $\mathcal{H}_p$, since in principle it could be necessary to evaluate $\mathcal{E}_n(f)$ for every $f \in \mathcal{H}_p$. As it turns out, ERM on e.g. convex (thus dense) spaces is often much more amenable to computations, but we have observed that on infinite hypotheses spaces it is difficult to control the generalization error. Interestingly, we can leverage the discretization argument to control the generalization error of ERM also for special dense hypotheses spaces.

34 Risks for Continuous Functions
Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact space and $C(\mathcal{X})$ be the space of continuous functions on $\mathcal{X}$. Let $\|\cdot\|_\infty$ be defined for any $f \in C(\mathcal{X})$ as $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. If the loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is such that $\ell(\cdot, y)$ is uniformly Lipschitz with constant $C > 0$ for any $y \in \mathcal{Y}$, we have that
1) $|\mathcal{E}(f_1) - \mathcal{E}(f_2)| \le C\,\|f_1 - f_2\|_{L^1(\mathcal{X}, \rho_{\mathcal{X}})} \le C\,\|f_1 - f_2\|_\infty$, and
2) $|\mathcal{E}_n(f_1) - \mathcal{E}_n(f_2)| \le \frac{1}{n}\sum_{i=1}^n |\ell(f_1(x_i), y_i) - \ell(f_2(x_i), y_i)| \le C\,\|f_1 - f_2\|_\infty$.
Therefore, functions that are close in $\|\cdot\|_\infty$ will have similar expected and empirical risks!

35 Compact Spaces in C(X)
Idea. If $\mathcal{H} \subset C(\mathcal{X})$ admits a finite discretization $\mathcal{H}_p = \{h_1, \dots, h_N\}$ of precision $p$ with respect to $\|\cdot\|_\infty$ (e.g. $\mathcal{H}$ is compact with respect to $\|\cdot\|_\infty$), then we can control the generalization error over it as
$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le \sup_{f \in \mathcal{H}} \Big( |\mathcal{E}_n(f) - \mathcal{E}_n(h_f)| + |\mathcal{E}_n(h_f) - \mathcal{E}(h_f)| + |\mathcal{E}(h_f) - \mathcal{E}(f)| \Big) \le 2L \cdot 10^{-p} + \sup_{h \in \mathcal{H}_p} |\mathcal{E}_n(h) - \mathcal{E}(h)|,$
where we have denoted $h_f = \operatorname*{argmin}_{h \in \mathcal{H}_p} \|h - f\|_\infty$.
Note: we know how to control $\sup_{h \in \mathcal{H}_p} |\mathcal{E}_n(h) - \mathcal{E}(h)|$ since $\mathcal{H}_p$ is finite!

36 Covering Numbers
We define the covering number of $\mathcal{H}$ at radius $\eta > 0$ as the cardinality of a minimal cover of $\mathcal{H}$ with balls of radius $\eta$:
$\mathcal{N}(\mathcal{H}, \eta) = \inf\Big\{ m \in \mathbb{N} \;\Big|\; \mathcal{H} \subseteq \bigcup_{i=1}^m B_\eta(h_i),\ h_i \in \mathcal{H} \Big\}.$
Example. If $\mathcal{H} = B_R(0)$ is a ball of radius $R$ in $\mathbb{R}^d$, then $\mathcal{N}(B_R(0), \eta) \le (4R/\eta)^d$.
[Figure: a cover of a set by balls of radius $\eta$. Image credits: Lorenzo Rosasco.]

37 Example: Covering Numbers (continued)
Putting the two together we have that, for any $\delta \in (0, 1)$,
$\sup_{f \in \mathcal{H}} |\mathcal{E}_n(f) - \mathcal{E}(f)| \le 2L\eta + \sqrt{\frac{2M^2 \log(2\mathcal{N}(\mathcal{H}, \eta)/\delta)}{n}}$
holds with probability at least $1 - \delta$. For $\eta \to 0$ the covering number $\mathcal{N}(\mathcal{H}, \eta) \to +\infty$; however, for fixed $\eta$ and $n \to +\infty$ the second term tends to zero. It is typically possible to show that there exists a choice $\eta(n)$ for which the whole bound tends to zero as $n \to +\infty$.
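
The sketch below evaluates this bound for the ball example, using $\mathcal{N}(B_R(0), \eta) \le (4R/\eta)^d$, and picks the radius $\eta$ that minimizes it over a grid; all the constants ($L$, $M$, $R$, $d$, $\delta$) are illustrative placeholders, not values from the slides.

```python
# Covering-number generalization bound for H = ball of radius R in R^d:
#   sup_f |E_n(f) - E(f)| <= 2*L*eta + sqrt(2 M^2 log(2 N(H,eta)/delta) / n)
# with N(H, eta) <= (4R/eta)^d.  All constants below are illustrative.
import math

def bound(n, eta, L=1.0, M=1.0, R=1.0, d=5, delta=0.001):
    log_N = d * math.log(4 * R / eta)                  # log covering number
    return 2 * L * eta + math.sqrt(2 * M**2 * (math.log(2 / delta) + log_N) / n)

def best_eta(n, grid=None):
    grid = grid if grid is not None else [10 ** (-k / 10) for k in range(1, 60)]
    return min(grid, key=lambda eta: bound(n, eta))

for n in [10**3, 10**5, 10**7]:
    eta = best_eta(n)
    print(f"n={n:>8d}  eta(n)={eta:.4f}  bound={bound(n, eta):.4f}")
```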

38 Complexity Measures
In general, the error $\sup_{f \in \mathcal{H}_\gamma} |\mathcal{E}_n(f) - \mathcal{E}(f)|$ can be controlled via capacity/complexity measures: covering numbers; combinatorial dimensions, e.g. the VC dimension and the fat-shattering dimension; Rademacher complexities; Gaussian complexities...

39 Prototypical Results
A prototypical result (under suitable assumptions, e.g. regularity of $f^*$):
$\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f^*) \le \underbrace{\mathcal{E}(f_{\gamma,n}) - \mathcal{E}(f_\gamma)}_{\lesssim\ \gamma^{\beta} n^{-\alpha}\ \text{(Variance)}} + \underbrace{\mathcal{E}(f_\gamma) - \mathcal{E}(f^*)}_{\lesssim\ \gamma^{-\tau}\ \text{(Bias)}}$
Goal: find the $\gamma(n)$ achieving the best Bias - Variance trade-off.
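
Assuming the variance term behaves as $\gamma^{\beta} n^{-\alpha}$ and the bias as $\gamma^{-\tau}$ (generic exponents, consistent with $\mathcal{H}_\gamma$ growing with $\gamma$; this is a sketch, not a specific theorem), a one-line balancing computation gives the usual form of the trade-off:
$\gamma^{\beta} n^{-\alpha} = \gamma^{-\tau} \;\Longrightarrow\; \gamma(n) = n^{\alpha/(\beta + \tau)} \;\Longrightarrow\; \mathcal{E}(f_{\gamma(n),n}) - \mathcal{E}(f^*) \lesssim n^{-\alpha\tau/(\beta + \tau)}.$
In the threshold example, with $p$ in place of $\gamma$, balancing $\sqrt{(6p + 18)/n}$ against $10^{-p}$ is exactly what the choice $p(n, \delta)$ of slide 22 does numerically.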

40 Choosing γ(n) in Practice
The best $\gamma(n)$ depends on the unknown distribution $\rho$. So how can we choose such a parameter in practice? This problem is known as model selection. Possible approaches: cross validation, complexity regularization/structural risk minimization, balancing principles...
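
A minimal K-fold cross-validation sketch for picking the regularization parameter, here the precision $p$ of the threshold class (self-contained illustrative setup; the data distribution, fold count, and grid of $p$ values are arbitrary choices, not from the slides):

```python
# K-fold cross-validation to select the precision p of the threshold class
# (self-contained illustrative sketch; data distribution and grids are made up).
import numpy as np

rng = np.random.default_rng(3)
a_star = 0.3172

def sample(n, flip=0.1):
    x = rng.uniform(-1, 1, n)
    y = (x >= a_star).astype(int)
    return x, np.where(rng.uniform(size=n) < flip, 1 - y, y)

def erm_threshold(x, y, p):
    grid = np.arange(-10**p, 10**p + 1) / 10**p
    return grid[int(np.argmin([np.mean((x >= a).astype(int) != y) for a in grid]))]

def cv_select_p(x, y, p_grid=(0, 1, 2, 3), K=5):
    folds = np.array_split(rng.permutation(len(x)), K)
    scores = {}
    for p in p_grid:
        errs = []
        for k in range(K):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            a_hat = erm_threshold(x[tr], y[tr], p)           # fit on K-1 folds
            errs.append(np.mean((x[val] >= a_hat).astype(int) != y[val]))
        scores[p] = float(np.mean(errs))                     # validation error
    best_p = min(scores, key=scores.get)
    return best_p, scores

x_tr, y_tr = sample(500)
p_cv, scores = cv_select_p(x_tr, y_tr)
print("validation errors:", scores, " -> selected p =", p_cv)
```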

41 Abstract Regularization
We just got our first introduction to the concept of regularization: controlling the expressiveness of the hypotheses space according to the number of training examples, in order to guarantee good prediction performance and consistency. There are many ways to implement this strategy in practice (we will see some of them in this course): Tikhonov (and Ivanov) regularization, spectral filtering, early stopping, random sampling...

42 Wrapping Up
This class: overfitting; controlling the generalization error; abstract regularization.
Next class: Tikhonov regularization.
