12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

Size: px

Start display at page:

Download "12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016"

Gervais Garrison
6 years ago
Views:

1 12. Structural Risk Minimization ECE 830 & CS 761, Spring / 23

2 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses Y Definition: predictor / classifier A predictor is a function f : X Y, where X is the feature space and Y is the label space. Let F be a collection of predictors. We also have a loss function l : Y Y R +. e.g. y is true label, ŷ is predicted label, and we measure l(y, ŷ) 0. In this lecture we focus on l(y, ŷ) = 1 {y ŷ} (0/1 loss). Main assumption: (x i, y i ) iid P iid, where P is unknown. 2 / 23

3 Goal: Select an f F so that it minimizes Risk = R(f) = E (x,y) P [l(y, f(x))] So far we have focused on empirical risk minimization, where ˆR(f) := 1 n n l(y i, f(x i )) i=1 is the empirical risk. Then the empirical risk minimizer (ERM) is ˆf = arg min ˆR(f). 3 / 23

4 ERM performance What can be said about the ERM s performance? Case 1: Finite sets of classifiers. We showed that with probability 1 δ R(f) ˆR(f) + which leads to the bound log F + log(1/δ) n f F E[R( ˆf)] min R(f) log F + log n n 4 / 23

5 ERM performance What can be said about the ERM s performance? Case 2: Classifiers with finite VC dimension. Let S(F, n) be the Shatter coefficient for F, representing the number of different effective classifiers are in F for n training samples; the VC dimension is V F = arg max 1 {S(F,k)=2 k } k 1 and can be thought of as the largest number of examples that can be arbitrarily labeled. We showed using Sauer s lemma, S(F, n) (n + 1) V (F) that with probability 1 δ R(f) ˆR(f) V log n + log(1/δ) n which leads to the bound E[R( ˆf)] min R(f) V log n. n 4 / 23

6 Example: Histogram classifiers Let X = [0, 1] d, Y = {0, 1} with 0/1 loss. Let F k denote the set of histogram classifiers with k bins. Note F k = 2 k. If we fix m = k classifier bins, then we have with probability at least 1 δ R(f) ˆR(f) m log 2 + log(2/δ). n 5 / 23

7 Example: Histograms continued Let R be the Bayes Risk: R = min f R(f) where the minimization is over all classifiers, including those potentially not in F and based on the unknown distribution P. Then for the empirical risk minimizer over F = F m we have E[R( ˆf)] R = E[R( ˆf)] min R(f) + min R(f) R }{{}}{{} estimation error approximation error m log(n) + min R(f) R n }{{} decreases with m How should we choose m? Choosing m in advance based on n means that we cannot optimally balance between the two terms in the bound for all distributions P. We might consider F = k 1 F k, but this set has F = V (F) =. 6 / 23

8 Countably Infinite Sets of Classifiers Suppose that F is a countable, possibly infinite, collection of candidate functions. Example: Histogram classifiers F = k 1 F k Further suppose that we have some prior distribution p over this set, so that p(f) 0 f F and p(f) = 1. This provides two advantages: 1. By choosing p(f) larger for certain f, we can preferentially treat those candidates 2. We do not need F to be finite and we only require p(f) = 1 7 / 23

9 Let then we have c(f) = log p(f); e c(f) = 1. The numbers c(f) can be interpreted as -log of prior probabilities codelengths measures of complexity 8 / 23

10 Now recall Hoeffding s inequality. For each f and every ɛ > 0 ( P R(f) ˆR ) n (f) ɛ e ɛ2 or for every δ > 0 ( ) P R(f) ˆR log(1/δ) n (f) δ Suppose δ > 0 is specified. Using the values c(f) for f F, define δ(f) := δe c(f) Then we have ( ) P R(f) ˆR log(1/δ(f)) n (f) δ(f) 9 / 23

11 Furthermore we can apply the union bound as follows P R(f) ˆR log(1/δ(f)) n (f) ( ) P R(f) ˆR log(1/δ(f)) n (f) δ(f) = e c(f) δ = δ 10 / 23

12 Summary We have that f F and δ > 0 with probability at least 1-δ R(f) ˆR log(1/δ(f)) n (f) + = ˆR c(f) + log(1/δ) n (f) + 11 / 23

13 Example: Finite sets Suppose F is finite and c(f) = log F prior). Then and f F (this is a uniform e c(f) = e log F = 1 F = 1 δ(f) = δ F which implies f F, F <, and δ > 0 with probability at least 1 δ R(f) ˆR log F + log(1/δ) n (f) + Note that this is precisely the PAC bound we derived in the last lectures. 12 / 23

14 Example: Histogram Classifiers X = [0, 1] d, Y = {0, 1}. Let F k, k=1, 2,... denote the collection of histogram classification rules with k equal volume bins, and let F = k 1 F k. For f F k, we choose c(f) = 2k. (We will how to derive this and that it satisfies e c(f) 1 in the next lecture.) Then f k 1 F k and δ > 0, with probability at least 1 δ R(f) ˆR n (f) + 2kf log 2 + log(1/δ) where k f is the number of bins in histogram corresponding to f. 13 / 23

15 Example: Histograms continued Contrast with the bound we had for the class of m bin histograms alone: f F m and δ > 0, with probability 1 δ R(f) ˆR m log 2 + log(1/δ) n (f) + Notice the bound for all histograms rules is almost as good as the bound for only the m-bin rules. That is, when k f = m the bounds are within a factor of 2. On the other hand, the new bound is a big improvement, since it also gives us a guide for selecting the number of bins. 14 / 23

16 Beyond ERMs The above bounds can be used to derive a useful alternative to empirical risk minimization one which exploits the prior incapsulated by p(f) = e c(f). In general, we want to choose a classifier ˆf F so that E[R( f n )] inf R(f) is as small as possible. We d like to choose ˆf = arg min R(f) but we cannot measure R(f) to minimize it. However, we can minimize its upper bound! 15 / 23

17 Structural risk minimization Definition: Structural risk minimizer For a countably infinite or finite class F, let c(f) be a function such that e c(f) 1. Then for any δ (0, 1), where f δ n = arg min C(f, n, δ) is the structural risk minimizer. { Rn (f) + C(f, n, δ)} c(f) + log(2/δ) 16 / 23

18 According to the PAC bound, f F and δ > 0, with probability 1 δ, and in particular, R(f) R n (f) + C(f, n, δ) R( f δ n) R n ( f δ n) + C( f δ n, n, δ) so, by the definition of f δ n, f F R( f δ n) R n (f) + C(f, n, δ) We will make use of the inequality above in a moment. First note that f F E[R( f δ n)] R(f) = E[R( f δ n) R n (f)] + E[ R n (f) R(f)] The second term is exactly 0, since E[ R n (f)] = R(f). 17 / 23

19 Now consider the first term E[R( f δ n) R n (f)]. Let Ω be the set of events on which R( f δ n) R n (f) C(f, n, δ), f F From our PAC bound, we know that P (Ω) 1 δ. Thus, E[R( f δ n) R n (f)] =E[R( f δ n) R n (f) Ω]P (Ω) + E[R( f δ n) R n (f) Ω c ](1 P (Ω)) C(f, n, δ) + δ (since 0 R, R 1, P (Ω) 1 and 1 P (Ω) δ) c(f) + log(2/δ) = + δ c(f) = log n + 1 (by setting δ = 1 ) n n 18 / 23

20 We can summarize our analysis with the following theorem. Theorem: Complexity Regularized Model Selection Let F be a collection of functions, and assign a positive number c(f) to each f F such that e c(f) 1. Define the structural risk minimizer f n = arg min R n (f) + c(f) log n Then, E[R( f n )] inf R(f) + c(f) log n + 1 n 19 / 23

21 This shows that R n (f) + c(f) log n is a reasonable surrogate for c(f) R(f) + log n 20 / 23

22 Example: Histogram Classifiers Let X = [0, 1] d be the input space and Y = {0, 1} be the output space. Let F k, k = 1, 2,... denote the collection of histogram classification rules with k equal volume bins. Let f n be the structural risk minimizer ˆf n = arg min ˆR(f) + c(f) log n Recall our choice c(f) = 2k for f a k-bin histogram classifier. Then equivalently ˆf n = min k 1 min 2k Rn (f) + log n k 21 / 23

23 Example: Histograms continued That is, for each k, let f (k) n = arg min k Rn (f) Then select the best k according to k = arg min k 1 R (k) 2k + n ( f 1 2 n ) + log n and set f n = f ( k) n Then, E[R( f n )] inf k 1 min k R(f) + 2k log n + 1 n 22 / 23

24 Example: Histograms continued From the third homework, we know that if d = 2 and the Bayes decision boundary is a 1-d curve, then by setting k = n and selecting the best f from F n we have E[R( f n )] = O(n 1/4 ) It is a simple exercise to show that the complexity regularized classifier will perform just as well automatically. That is, the proper k is selected automatically, without user intervention. 23 / 23

Probably Approximately Correct (PAC) Learning

ECE91 Spring 24 Statistical Regularization and Learning Theory Lecture: 6 Probably Approximately Correct (PAC) Learning Lecturer: Rob Nowak Scribe: Badri Narayan 1 Introduction 1.1 Overview of the Learning