Fast Rates for Exp-concave Empirial Risk Minimization

Size: px

Start display at page:

Download "Fast Rates for Exp-concave Empirial Risk Minimization"

Collin Ball
5 years ago
Views:

1 Fast Rates for Exp-concave Empirial Risk Minimization Tomer Koren, Kfir Y. Levy Presenter: Zhe Li September 16, 2016

2 High Level Idea Why could we use Regularized Empirical Risk Minimization? Learning theory perspective If we could use Regularized Empirical Risk Minimization, how to solve? Optimization algorithms

3 Problem setting Consider the problem of minimizing a stochastic objective F (w) = E[f (w, Z)] w W R d, a closed and convex domain W

4 Problem setting Consider the problem of minimizing a stochastic objective F (w) = E[f (w, Z)] w W R d, a closed and convex domain W random variable Z distributed according to an unknow distribution over a parameter space Z

5 Problem setting Consider the problem of minimizing a stochastic objective F (w) = E[f (w, Z)] w W R d, a closed and convex domain W random variable Z distributed according to an unknow distribution over a parameter space Z given n samples z 1,, z n of the random variable Z

6 Problem setting Consider the problem of minimizing a stochastic objective F (w) = E[f (w, Z)] w W R d, a closed and convex domain W random variable Z distributed according to an unknow distribution over a parameter space Z given n samples z 1,, z n of the random variable Z the goal: to produce an estimate ŵ W such that is small. E[F (ŵ)] min w F (w)

7 Assumptions f (, z) is α-exp-concave over the domain W for some α > 0. discuss later. f (, z) is β-smooth over W with repect to Euclidean norm. f (, z) is bounded over W.

8 Main results How to construct an estimate ŵ? Based on the sample z 1,, z n, construct where ˆF (w) = 1 n ŵ = argmin ˆF (w) w W n f (w, z i ) + 1 n R(w) i=1 R(w) : W R: a regularizer, 1-strongly-convex w.r.t Euclidean norm. Assump that R(w) R(w ) B for all w, w W for constant B > 0

9 Main results Theorem Let f : W Z R be a loss function defined over a closed and convex domain W R d, which α-exp-concave,β-smooth and C bounded w.r.t its first argument. Let R : W R be a 1-strongly-convex and B-bounded regularization function. Then for the regularized ERM estimate ŵ based on an i.i.d samples z 1,, z n, the expected excess loss is bounded as 24βd E[F (ŵ)] min F (w) w W αn + 100Cd n + B n = O(d/n) Don t care about whatever optimazition algorithms Care about the learning framework

10 Proof Technique Uniform Stability Average leave-one-out stablity Rademancher Complexity Local Rademancher Complexity

11 Average leave-one-out stablity Define the empirical leave-one-out risk for each i = 1,, n ˆF i (w) = 1 f (w, z j ) + 1 n n R(w) j i Let ŵ i = argmin ˆF i (w) w W The average leave-one-out stability of ŵ is defined as 1 n n (f (ŵ i, z i ) f (ŵ, z i )) i=1

12 Average leave-one-out stablity Theorem (Average leave-one-out stability). For any z 1,, z n Z and for ŵ 1,, ŵ n and ŵ as defined previously, we have 1 n n i=1 (f (ŵ i, z i ) f (ŵ, z i )) 24βd αn + 100Cd n

13 Proof of Main Theorem fix an arbitrary w W, we have F (w ) + 1 n R(w ) = E[ ˆF (w )] E[ ˆF (ŵ)] E[F (ŵ n )] F (w ) E[F (ŵ n ) ˆF (ŵ)] + 1 n R(w ) since the random variable ŵ 1,, ŵ n have same distribution: E[F (ŵ n )] = 1 n n E[F (ŵ i )] = 1 n i=1 n E[f (ŵ i, z i )] i=1

14 Proof of Main Theorem E[ ˆF (ŵ)] = 1 n n i=1 E[f (ŵ, z i)] + 1 n E[R(ŵ)] Combining the above inqualities, E[F (ŵ n )] F (w ) 1 n E[f (ŵ i, z i ) f (ŵ, z i )] + 1 n n E[R(w ) R(ŵ)] i=1 24βd αn + 100Cd n + B n = O(d n ) using the average leave-one-out stability theorem and the assumption.

15 Proof of Average Leave-one-out Stability Theorem

16 Local Strongly Convexity and Stability Definition (Local strong convexity). We say that a function g : K R is locally δ-strongly convex over a domain K R d at x with respect to a norm,if y K, g(y) g(x) + g(x)(y x) + δ y x 2 2

17 Local Strongly Convexity and Stability Lemma (Lemma 5). Let g 1, g 2 : K R be two convex functions defined over a closed and convex domain K R d, and let x 1 argming 1 (x) and x 2 argming 2 (x). Assume that g 2 is locally x K x K δ-strongly convex at x 1 with repect to a norm. Then, for h = g 2 g 1 we have Futhermore, if h is convex then x 2 x 1 2 δ h(x 1) 0 h(x 1 ) h(x 2 ) 2 δ ( h(x 1) ) 2 (1)

18 Average Stability Analysis Some Definitions Lemma f i ( ) = f (, z i ) for all i, h i = f i (ŵ) H = 1 δ I d + n i=1 h ih T i and H i = 1 δ I d + n j i h ih T i x M = x T Mx denotes the norm induced by a positive definite matrix M, dual norm x M = x T M 1 x (Lemma 6) For all i = 1,, n it holds that f i (ŵ i ) f i (ŵ) 6β δ ( h i H i ) 2

19 Average Stability Analysis Lemma (Lemma 8) Let I = {i [n] : h i H > 1 2 }. Then I 2d and we have ( h i H i ) 2 2d i I Lemma 6 + Lemma 8 = Stability Theorem Proof: i I (f i(ŵ i ) f i (ŵ)) C I 1 n 1 n i I (f i(ŵ i ) f i (ŵ)) 6β δn summing up. n 2Cd n i I ( h i H i ) 2 12βd δn

20 Continue to prove of Lemma 6 and Lemma 8?

21 Proof of Lemma 8 Lemma (Lemma 8) Let I = {i [n] : h i H > 1 2 }. Then I 2d and we have ( h i H i ) 2 2d Proof: i I a i = h T i H 1 h i for i = 1,, n, a i > 0. i a i d I 2d. ( h i H i ) 2 = hi T H 1 i hi T = a i + a2 i 1 a i 2a i i I ( h i H i ) 2 2 i I a i i a i = 2d

22 Proof of Lemma 6 Lemma (Lemma 6) For all i = 1,, n it holds that f i (ŵ i ) f i (ŵ) 6β δ ( h i H i ) 2 Proof: Using property of α-exp-concave of function. Smoothness Assumption. Lemma 5.

23 α-exp-concave function Definition The function f (w) is α-exp-concave over the domain W for some α > 0, if that the function exp( αf (w)) is concave over W. Lemma (Lemma 7) Let f : K R be an α-exp-concave function over a convex domain K R d such that f (x) f (y) C for any x, y K. Then for any δ 1 2 min{ 1 4C, α}, it holds that x, y K, f (y) f (x) + f (x) T (y x) + δ 2 ( f (x)t (y x)) 2

24 Proof of Lemma 6 Proof: Let g 1 = ˆF and g 2 = ˆF i, h i = 1 n f i Lemma7 = ˆF i is locally (δ/n) strongly convex at ŵ w.r.t Hi Lemma5 = ŵ i ŵ Hi 2n δ h(ŵ) H i = 2 δ h i H i f i is convex f i (ŵ i ) f i (ŵ) f i (ŵ i ) T (ŵ i ŵ) = f i (ŵ) T (ŵ i ŵ) + ( f i (ŵ i ) f i (ŵ)) T (ŵ i ŵ)

25 Proof of Lemma 6 f i (ŵ) T (ŵ i ŵ) = h T i (ŵ) T (ŵ i ŵ) h i H i ŵ i ŵ Hi 2 δ ( h i H i ) 2 ( f i (ŵ i ) f i (ŵ)) T (ŵ i ŵ) β ŵ i ŵ 2 2 ŵ i ŵ 2 2 δ ŵ i ŵ 2 H i H i (1/δ)I d. 4 δ ( h i H i ) 2, by using

26 Open Question Is smoothness assumption necessary? not necessary from online-batch convertion. limition of the analysis? Excess risk with high probability? Morkov s inequality?

27 Open Question Is smoothness assumption necessary? not necessary from online-batch convertion. limition of the analysis? Excess risk with high probability? Morkov s inequality?

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.