Fast Rates for Exp-concave Empirical Risk Minimization
Tomer Koren, Kfir Y. Levy
Presenter: Zhe Li
September 16, 2016
High-Level Idea
- Why can we use Regularized Empirical Risk Minimization? (a learning-theory perspective)
- If we can use Regularized Empirical Risk Minimization, how do we solve it? (optimization algorithms)
Problem Setting
Consider the problem of minimizing a stochastic objective F(w) = E[f(w, Z)]:
- w ∈ W ⊆ R^d, where W is a closed and convex domain
- the random variable Z is distributed according to an unknown distribution over a parameter space Z
- we are given n samples z_1, …, z_n of the random variable Z
- the goal: produce an estimate ŵ ∈ W such that the excess loss E[F(ŵ)] − min_{w∈W} F(w) is small.
Assumptions
- f(·, z) is α-exp-concave over the domain W for some α > 0 (discussed later).
- f(·, z) is β-smooth over W with respect to the Euclidean norm.
- f(·, z) is C-bounded over W.
Main Results
How to construct an estimate ŵ? Based on the samples z_1, …, z_n, construct
  F̂(w) = (1/n) ∑_{i=1}^n f(w, z_i) + (1/n) R(w),    ŵ = argmin_{w∈W} F̂(w),
where R : W → R is a regularizer, 1-strongly-convex w.r.t. the Euclidean norm. Assume that R(w) − R(w′) ≤ B for all w, w′ ∈ W, for a constant B > 0.
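As an illustration, the regularized ERM construction above can be sketched in a few lines. This is a hypothetical instantiation, not the paper's setup: it assumes logistic loss (exp-concave on a bounded domain), the regularizer R(w) = ½‖w‖² (1-strongly convex), and unconstrained minimization in place of the projection onto W:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_erm(X, y):
    """Regularized ERM: minimize (1/n) sum_i f(w, z_i) + (1/n) R(w),
    with logistic loss f(w, (x_i, y_i)) = log(1 + exp(-y_i x_i^T w))
    and R(w) = 0.5 * ||w||^2 (1-strongly convex w.r.t. the Euclidean norm)."""
    n, d = X.shape

    def objective(w):
        margins = y * (X @ w)
        loss = np.mean(np.log1p(np.exp(-margins)))  # (1/n) sum_i f(w, z_i)
        reg = 0.5 * (w @ w) / n                     # (1/n) R(w)
        return loss + reg

    return minimize(objective, np.zeros(d), method="L-BFGS-B").x

# Synthetic samples z_1, ..., z_n with z_i = (x_i, y_i)
rng = np.random.default_rng(0)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))
w_hat = regularized_erm(X, y)
```

Note the 1/n weight on R(w), rather than a tunable λ: it is what makes the regularization contribute only the B/n term in the bound below.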
Main Results
Theorem. Let f : W × Z → R be a loss function defined over a closed and convex domain W ⊆ R^d, which is α-exp-concave, β-smooth and C-bounded w.r.t. its first argument. Let R : W → R be a 1-strongly-convex and B-bounded regularization function. Then for the regularized ERM estimate ŵ based on i.i.d. samples z_1, …, z_n, the expected excess loss is bounded as
  E[F(ŵ)] − min_{w∈W} F(w) ≤ 24βd/(αn) + 100Cd/n + B/n = O(d/n).
- We do not care which optimization algorithm computes ŵ.
- We care about the learning framework.
Proof Technique
- Uniform stability
- Average leave-one-out stability
- Rademacher complexity
- Local Rademacher complexity
Average Leave-One-Out Stability
Define the empirical leave-one-out risk for each i = 1, …, n:
  F̂_i(w) = (1/n) ∑_{j≠i} f(w, z_j) + (1/n) R(w),
and let ŵ_i = argmin_{w∈W} F̂_i(w).
The average leave-one-out stability of ŵ is defined as
  (1/n) ∑_{i=1}^n (f(ŵ_i, z_i) − f(ŵ, z_i)).
Average Leave-One-Out Stability
Theorem (average leave-one-out stability). For any z_1, …, z_n ∈ Z, and for ŵ_1, …, ŵ_n and ŵ as defined previously, we have
  (1/n) ∑_{i=1}^n (f(ŵ_i, z_i) − f(ŵ, z_i)) ≤ 24βd/(αn) + 100Cd/n.
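This quantity can be estimated numerically. A minimal sketch, under the same hypothetical setup as before (logistic loss, R(w) = ½‖w‖², unconstrained minimization): refit with each sample held out and average the loss differences. Each summand is nonnegative, since F̂(ŵ) ≤ F̂(ŵ_i) and F̂_i(ŵ_i) ≤ F̂_i(ŵ) together imply f(ŵ, z_i) ≤ f(ŵ_i, z_i).

```python
import numpy as np
from scipy.optimize import minimize

def fit(X, y, n, mask=None):
    """Minimize (1/n) sum_{j in mask} f(w, z_j) + (1/n) R(w); logistic loss,
    R(w) = 0.5 ||w||^2.  n stays the full sample size even when a point is left out."""
    if mask is None:
        mask = np.ones(len(y), dtype=bool)

    def objective(w):
        m = y[mask] * (X[mask] @ w)
        return (np.sum(np.log1p(np.exp(-m))) + 0.5 * (w @ w)) / n

    return minimize(objective, np.zeros(X.shape[1]), method="L-BFGS-B").x

def loss(w, x, yi):
    return np.log1p(np.exp(-yi * (x @ w)))

rng = np.random.default_rng(1)
n, d = 30, 2
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -1.0]) + 0.2 * rng.normal(size=n))

w_hat = fit(X, y, n)
terms = []
for i in range(n):
    mask = np.ones(n, dtype=bool)
    mask[i] = False
    w_i = fit(X, y, n, mask)                      # leave-one-out minimizer
    terms.append(loss(w_i, X[i], y[i]) - loss(w_hat, X[i], y[i]))
avg_stability = np.mean(terms)                    # the theorem says this is O(d/n)
```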
Proof of Main Theorem
Fix an arbitrary w* ∈ W. We have
  F(w*) + (1/n) R(w*) = E[F̂(w*)] ≥ E[F̂(ŵ)],
hence
  E[F(ŵ_n)] − F(w*) ≤ E[F(ŵ_n) − F̂(ŵ)] + (1/n) R(w*).
Since the random variables ŵ_1, …, ŵ_n have the same distribution (and ŵ_i is independent of z_i):
  E[F(ŵ_n)] = (1/n) ∑_{i=1}^n E[F(ŵ_i)] = (1/n) ∑_{i=1}^n E[f(ŵ_i, z_i)].
Proof of Main Theorem
Also,
  E[F̂(ŵ)] = (1/n) ∑_{i=1}^n E[f(ŵ, z_i)] + (1/n) E[R(ŵ)].
Combining the above inequalities,
  E[F(ŵ_n)] − F(w*) ≤ (1/n) ∑_{i=1}^n E[f(ŵ_i, z_i) − f(ŵ, z_i)] + (1/n) E[R(w*) − R(ŵ)]
                   ≤ 24βd/(αn) + 100Cd/n + B/n = O(d/n),
using the average leave-one-out stability theorem and the boundedness assumption on R.
Proof of the Average Leave-One-Out Stability Theorem
Local Strong Convexity and Stability
Definition (local strong convexity). We say that a function g : K → R is locally δ-strongly convex over a domain K ⊆ R^d at x with respect to a norm ‖·‖ if, for all y ∈ K,
  g(y) ≥ g(x) + ∇g(x)^T (y − x) + (δ/2) ‖y − x‖².
Local Strong Convexity and Stability
Lemma (Lemma 5). Let g_1, g_2 : K → R be two convex functions defined over a closed and convex domain K ⊆ R^d, and let x_1 ∈ argmin_{x∈K} g_1(x) and x_2 ∈ argmin_{x∈K} g_2(x). Assume that g_2 is locally δ-strongly convex at x_1 with respect to a norm ‖·‖. Then, for h = g_2 − g_1, we have
  ‖x_2 − x_1‖ ≤ (2/δ) ‖∇h(x_1)‖*.
Furthermore, if h is convex, then
  0 ≤ h(x_1) − h(x_2) ≤ (2/δ) (‖∇h(x_1)‖*)².    (1)
Average Stability Analysis
Some definitions:
- f_i(·) = f(·, z_i) for all i, and h_i = ∇f_i(ŵ)
- H = (1/δ) I_d + ∑_{i=1}^n h_i h_i^T and H_i = (1/δ) I_d + ∑_{j≠i} h_j h_j^T
- ‖x‖_M = √(x^T M x) denotes the norm induced by a positive definite matrix M; its dual norm is ‖x‖*_M = √(x^T M⁻¹ x)
Lemma (Lemma 6). For all i = 1, …, n it holds that
  f_i(ŵ_i) − f_i(ŵ) ≤ (6β/δ) (‖h_i‖*_{H_i})².
Average Stability Analysis
Lemma (Lemma 8). Let I = {i ∈ [n] : (‖h_i‖*_H)² > 1/2}. Then |I| ≤ 2d, and
  ∑_{i∉I} (‖h_i‖*_{H_i})² ≤ 2d.
Lemma 6 + Lemma 8 ⇒ stability theorem. Proof:
- (1/n) ∑_{i∈I} (f_i(ŵ_i) − f_i(ŵ)) ≤ C|I|/n ≤ 2Cd/n
- (1/n) ∑_{i∉I} (f_i(ŵ_i) − f_i(ŵ)) ≤ (6β/(δn)) ∑_{i∉I} (‖h_i‖*_{H_i})² ≤ 12βd/(δn)
Summing up the two bounds completes the proof.
It remains to prove Lemma 6 and Lemma 8.
Proof of Lemma 8
Lemma (Lemma 8). Let I = {i ∈ [n] : (‖h_i‖*_H)² > 1/2}. Then |I| ≤ 2d, and ∑_{i∉I} (‖h_i‖*_{H_i})² ≤ 2d.
Proof:
- Let a_i = h_i^T H⁻¹ h_i for i = 1, …, n; each a_i > 0.
- ∑_i a_i = tr(H⁻¹ ∑_i h_i h_i^T) = tr(H⁻¹ (H − (1/δ) I_d)) = d − (1/δ) tr(H⁻¹) ≤ d, hence |I| ≤ 2d.
- For i ∉ I (so a_i ≤ 1/2), the Sherman–Morrison formula gives
  (‖h_i‖*_{H_i})² = h_i^T H_i⁻¹ h_i = a_i + a_i²/(1 − a_i) ≤ 2a_i,
- so ∑_{i∉I} (‖h_i‖*_{H_i})² ≤ 2 ∑_{i∉I} a_i ≤ 2 ∑_i a_i ≤ 2d.
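The two algebraic facts used above, ∑_i a_i ≤ d and the Sherman–Morrison identity h_i^T H_i⁻¹ h_i = a_i/(1 − a_i) = a_i + a_i²/(1 − a_i), can be checked numerically on random vectors standing in for the gradients; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 10, 3, 0.5
h = rng.normal(size=(n, d))               # rows play the role of the gradients h_i

H = np.eye(d) / delta + h.T @ h           # H = (1/delta) I_d + sum_i h_i h_i^T
a = np.array([hi @ np.linalg.solve(H, hi) for hi in h])  # a_i = h_i^T H^{-1} h_i

# (||h_i||*_{H_i})^2 with H_i = H - h_i h_i^T, computed directly...
dual_sq = np.array([
    h[i] @ np.linalg.solve(H - np.outer(h[i], h[i]), h[i]) for i in range(n)
])
# ...and via Sherman-Morrison: a_i + a_i^2 / (1 - a_i) = a_i / (1 - a_i)
predicted = a / (1 - a)
```

Since H ⪰ h_i h_i^T + (1/δ) I_d, every a_i lies strictly in (0, 1), so the identity never divides by zero.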
Proof of Lemma 6
Lemma (Lemma 6). For all i = 1, …, n it holds that
  f_i(ŵ_i) − f_i(ŵ) ≤ (6β/δ) (‖h_i‖*_{H_i})².
Proof ingredients:
- the α-exp-concavity of the loss (Lemma 7 below),
- the smoothness assumption,
- Lemma 5.
α-Exp-concave Functions
Definition. The function f(w) is α-exp-concave over the domain W for some α > 0 if the function exp(−α f(w)) is concave over W.
Lemma (Lemma 7). Let f : K → R be an α-exp-concave function over a convex domain K ⊆ R^d such that |f(x) − f(y)| ≤ C for any x, y ∈ K. Then for any δ ≤ (1/2) min{1/(4C), α}, it holds that for all x, y ∈ K,
  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (δ/2) (∇f(x)^T (y − x))².
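The definition can be probed numerically with a hypothetical midpoint-concavity check of w ↦ exp(−α f(w)) on a one-dimensional grid. For the squared loss f(w) = (w − 1/2)² on W = [−1, 1], exp(−α f) is concave exactly when α (w − 1/2)² ≤ 1/2 for all w ∈ W, i.e. α ≤ 1/(2 · (3/2)²) = 2/9, so the check passes for small α and fails for large α:

```python
import numpy as np

def is_exp_concave(f, alpha, grid):
    """Midpoint check that g(w) = exp(-alpha * f(w)) is concave on a 1-D grid:
    concavity requires g((u + v)/2) >= (g(u) + g(v))/2 for every pair u, v."""
    g = np.exp(-alpha * f(grid))
    for i in range(len(grid)):
        for j in range(i + 1, len(grid)):
            mid = np.exp(-alpha * f((grid[i] + grid[j]) / 2))
            if mid < (g[i] + g[j]) / 2 - 1e-12:
                return False
    return True

grid = np.linspace(-1.0, 1.0, 41)
squared = lambda w: (w - 0.5) ** 2     # squared loss toward the point 1/2

small_alpha_ok = is_exp_concave(squared, 0.2, grid)   # 0.2 <= 2/9: concave
large_alpha_ok = is_exp_concave(squared, 5.0, grid)   # 5.0  > 2/9: not concave
```

This also illustrates why exp-concavity is tied to the bounded domain: the same loss fails the check for any fixed α once the domain is wide enough.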
Proof of Lemma 6
Proof: Let g_1 = F̂ and g_2 = F̂_i, so that h = g_2 − g_1 = −(1/n) f_i and ∇h(ŵ) = −(1/n) h_i.
- Lemma 7 ⇒ F̂_i is locally (δ/n)-strongly convex at ŵ w.r.t. ‖·‖_{H_i}.
- Lemma 5 ⇒ ‖ŵ_i − ŵ‖_{H_i} ≤ (2n/δ) ‖∇h(ŵ)‖*_{H_i} = (2/δ) ‖h_i‖*_{H_i}.
Since f_i is convex,
  f_i(ŵ_i) − f_i(ŵ) ≤ ∇f_i(ŵ_i)^T (ŵ_i − ŵ) = ∇f_i(ŵ)^T (ŵ_i − ŵ) + (∇f_i(ŵ_i) − ∇f_i(ŵ))^T (ŵ_i − ŵ).
Proof of Lemma 6
For the first term,
  ∇f_i(ŵ)^T (ŵ_i − ŵ) = h_i^T (ŵ_i − ŵ) ≤ ‖h_i‖*_{H_i} ‖ŵ_i − ŵ‖_{H_i} ≤ (2/δ) (‖h_i‖*_{H_i})².
For the second term, by β-smoothness,
  (∇f_i(ŵ_i) − ∇f_i(ŵ))^T (ŵ_i − ŵ) ≤ β ‖ŵ_i − ŵ‖₂² ≤ βδ ‖ŵ_i − ŵ‖²_{H_i} ≤ (4β/δ) (‖h_i‖*_{H_i})²,
using H_i ⪰ (1/δ) I_d, which gives ‖·‖₂² ≤ δ ‖·‖²_{H_i}. Combining the two terms (and using 2 + 4β ≤ 6β, i.e. β ≥ 1 w.l.o.g.) yields the bound (6β/δ)(‖h_i‖*_{H_i})².
Open Questions
- Is the smoothness assumption necessary? It is not needed via online-to-batch conversion; is this a limitation of the present analysis?
- Can we bound the excess risk with high probability? Markov's inequality?