Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hazan, Yoram Singer


1 Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, by John Duchi, Elad Hazan, and Yoram Singer. Presented by Vicente L. Malave, February 23, 2011.

2 Outline

3 Notation: minimize a sequence of convex functions φ_t(x) subject to the constraint x ∈ X. The diameter of the set in the ℓ2 norm is D_2 = sup_{x,y ∈ X} ||x − y||_2, and D_∞ = sup_{x ∈ X} ||x − x*||_∞. The Euclidean projection is Π_X(y) = argmin_{x ∈ X} ||x − y||_2^2.

4 Online Convex Optimization: the online convex optimization setting and algorithm are from [Zinkevich, 2003]. The projected gradient method takes subgradients g_t ∈ ∂φ_t(x_t), and the steps are x_{t+1} = Π_X(x_t − η g_t). (1)
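
As a concrete illustration (mine, not from the slides), here is a minimal sketch of update (1) in Python. The constraint set X is taken to be an ℓ2 ball so the projection has a closed form; the function names and that choice of X are my own assumptions.

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection of y onto the l2 ball of the given radius."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def online_projected_gradient(subgradient_fns, dim, eta=0.1, radius=1.0):
    """Update (1): x_{t+1} = Pi_X(x_t - eta * g_t), with X an l2 ball here."""
    x = np.zeros(dim)
    iterates = []
    for grad_fn in subgradient_fns:      # one loss phi_t arrives per round
        g = grad_fn(x)                   # g_t, a subgradient of phi_t at x_t
        x = project_ball(x - eta * g, radius)
        iterates.append(x.copy())
    return iterates
```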

5 Online Convex Optimization: the regret of this algorithm is R(T) ≤ √2 D_2 √(Σ_{t=1}^T ||g_t||_2^2). (2) This bound is tight [Abernethy et al., 2008].

6 Problem 1: ℓ1 regularization. The slow convergence rate means the iterates x_t might not be sparse. Regularized Dual Averaging and Mirror Descent are improved algorithms, originally developed as optimization algorithms for offline (batch) problems.

7 Problem 2: Adapting to Data. With sparse data, such as in text classification, gradient steps with a fixed stepsize can take exponentially long for weights to update. An adaptive method is like having a different learning rate for each feature.

8 Outline

9 Mirror Descent: the projected gradient method is slow to converge. Mirror descent [Beck and Teboulle, 2003] replaces the ℓ2 norm with a Bregman divergence B_ψ(w, v) = ψ(w) − ψ(v) − ⟨∇ψ(v), w − v⟩. (3) The mirror descent update is x_{t+1} = argmin_{x ∈ X} { B_ψ(x, x_t) + η ⟨φ'_t(x_t), x − x_t⟩ }, (4) and it converges faster (in the offline setting).
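
To make update (4) concrete (an illustration of my own, not from the slides): with the negative-entropy ψ(x) = Σ_i x_i log x_i on the probability simplex, B_ψ is the KL divergence and the argmin has the familiar exponentiated-gradient closed form. A minimal sketch under that assumption:

```python
import numpy as np

def entropic_mirror_descent_step(x, g, eta):
    """One step of update (4) with psi(x) = sum_i x_i log x_i on the simplex.

    B_psi is then the KL divergence, and the constrained argmin has the
    closed form x_{t+1,i} proportional to x_{t,i} * exp(-eta * g_i).
    """
    w = x * np.exp(-eta * g)
    return w / w.sum()

# usage: start at the uniform distribution and take one step against gradient g
x0 = np.ones(5) / 5
g = np.array([0.3, -0.1, 0.0, 0.5, -0.2])
x1 = entropic_mirror_descent_step(x0, g, eta=0.5)
```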

10 Composite Objective Mirror Descent: usually we minimize a function of the form φ_t(x) = f_t(x) + ϕ(x), (5) where ϕ is the regularization term, which does not depend on t. The COMID modification [Duchi et al., 2010c] does not linearize ϕ. With g_t ∈ ∂f_t(x_t), the update rule is x_{t+1} = argmin_{x ∈ X} { η ⟨g_t, x⟩ + B_ψ(x, x_t) + η ϕ(x) }. (6)

11 COMID Regret: this algorithm produces sparse solutions, and the regret of COMID [Duchi et al., 2010c] is similar to that of the basic projected gradient algorithm: R_φ(T) ≤ (1/η) B_ψ(x*, x_1) + (η/2) Σ_{t=1}^T ||g_t||^2. (7)

12 Outline

13 RDA notation: Regularized Dual Averaging [Xiao, 2010] keeps an average of the gradients, ḡ_t = (1/t) Σ_{τ=1}^t g_τ. We are (again) minimizing a function of the form φ_t(x) = f_t(x) + ϕ(x). (8) RDA combines the loss f, a regularizer ϕ, and a strongly convex term ψ.

14 Regularized Dual Averaging: similarly to COMID, we separate out the regularizer so our solutions are sparse. The update for Regularized Dual Averaging (RDA) is x_{t+1} = argmin_{x ∈ X} { η ⟨ḡ_t, x⟩ + η ϕ(x) + (1/t) ψ(x) }. (9) Combining the last two terms allows a closed-form update (example: soft-thresholding for ℓ1); this update can be very aggressive.
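
A minimal sketch of the soft-thresholding case (not from the slides): assuming ϕ(x) = λ||x||_1, ψ(x) = ½||x||_2^2, and X = R^d, the minimization in (9) separates per coordinate and has the closed form below; λ and the function name are my own notation.

```python
import numpy as np

def rda_l1_update(g_bar, t, eta, lam):
    """Closed-form minimizer of (9) when phi(x) = lam * ||x||_1,
    psi(x) = 0.5 * ||x||_2^2, and X = R^d.

    The problem separates per coordinate and gives soft-thresholding of the
    averaged gradient: x_i = -t * eta * sign(g_bar_i) * max(|g_bar_i| - lam, 0).
    """
    return -t * eta * np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
```

Any coordinate whose averaged gradient stays below λ in magnitude is set exactly to zero, which is where the aggressive sparsity comes from.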

15 RDA: the regret of RDA [Xiao, 2010] is R_ϕ(T) ≤ √T ψ(x*) + (1/√T) Σ_{t=1}^T ||g_t||^2. (10)

16 Outline

17 Adaptive projections for projected gradient: before, we had Π_X(y) = argmin_{x ∈ X} ||x − y||_2^2. (11) We replace Π_X with the generalized projection Π_X^A(y) = argmin_{x ∈ X} ⟨x − y, A(x − y)⟩. (12)
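
A small side note (my own example, not from the slides): when A is diagonal and X is a coordinate-wise box, the objective in (12) separates across coordinates, so the generalized projection is ordinary clipping. A sketch assuming that box constraint:

```python
import numpy as np

def project_box_weighted(y, lo, hi):
    """Projection (12) onto the box X = [lo, hi]^d when A is diagonal.

    The objective sum_i A_ii * (x_i - y_i)^2 separates across coordinates,
    so the minimizer is coordinate-wise clipping, whatever the (positive)
    diagonal entries of A are.
    """
    return np.clip(y, lo, hi)
```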

18 Why Change the Norm? Slides are reproduced from [Boyd and Vandenberghe, 2004].

19-21 (figure-only slides reproduced from [Boyd and Vandenberghe, 2004])

22 Regret Motivation. COMID regret: R_φ(T) ≤ (1/η) B_ψ(x*, x_1) + (η/2) Σ_{t=1}^T ||g_t||^2. (13) RDA regret: R_φ(T) ≤ √T ψ(x*) + (1/√T) Σ_{t=1}^T ||g_t||^2. (14) The dominant term of each bound is the sum over previous gradients; if we make that sum smaller, we can lower the regret.

23 Notation for ADAGRAD: collect all previous gradients into g_{1:t} = [g_1, g_2, ..., g_t] and G = Σ_{τ=1}^t g_τ g_τ^T. Choosing A = G^{1/2} or A = diag(G)^{1/2} is a good choice [Duchi et al., 2010a, Duchi et al., 2010b]; we focus on the diagonal case. g_{1:t,i} is the row of g_{1:t} corresponding to feature i across all gradients, and Σ_{i=1}^d ||g_{1:T,i}||_2 occurs in the bound. Define s_{t,i} = ||g_{1:t,i}||_2, H_t = δI + diag(s_t), and ψ_t(x) = (1/2) ⟨x, H_t x⟩.

24 Diagonal ADAGRAD: the RDA update x_{t+1} = argmin_{x ∈ X} { η ⟨ḡ_t, x⟩ + η ϕ(x) + (1/t) ψ(x) } (15) becomes x_{t+1} = argmin_{x ∈ X} { η ⟨ḡ_t, x⟩ + η ϕ(x) + (1/t) ψ_t(x) }. (16)

25 Diagonal ADAGRAD: for mirror descent, the update x_{t+1} = argmin_{x ∈ X} { η ⟨g_t, x⟩ + B_ψ(x, x_t) + η ϕ(x) } (17) becomes x_{t+1} = argmin_{x ∈ X} { η ⟨g_t, x⟩ + B_{ψ_t}(x, x_t) + η ϕ(x) }. (18)
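
To see what update (18) looks like in the simplest case (my own sketch, assuming X = R^d and ϕ ≡ 0): the Bregman divergence of ψ_t makes the step a per-coordinate gradient step scaled by 1/(δ + s_{t,i}).

```python
import numpy as np

class DiagonalAdagrad:
    """Update (18) specialized to X = R^d and phi = 0:
    x_{t+1} = x_t - eta * g_t / (delta + s_t), coordinate-wise,
    where s_{t,i} = ||g_{1:t,i}||_2."""

    def __init__(self, dim, eta=0.1, delta=1e-8):
        self.eta = eta
        self.delta = delta
        self.x = np.zeros(dim)
        self.sq_sum = np.zeros(dim)   # running sum of squared gradients per coordinate

    def step(self, g):
        self.sq_sum += g ** 2
        s_t = np.sqrt(self.sq_sum)    # s_{t,i}: l2 norm of coordinate i's gradient history
        self.x -= self.eta * g / (self.delta + s_t)
        return self.x
```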

26 Lemma 5 bounds the sum of gradient terms: Σ_{t=1}^T ⟨g_t, diag(s_t)^{-1} g_t⟩ ≤ 2 Σ_{i=1}^d ||g_{1:T,i}||_2. (19)

27 Main Theorem: let δ ≥ max_t ||g_t||_∞. The regret of the primal-dual method is R_φ(T) ≤ (δ/η) ||x*||_2^2 + (1/η) ||x*||_∞^2 Σ_{i=1}^d ||g_{1:T,i}||_2 + η Σ_{i=1}^d ||g_{1:T,i}||_2, (20) and the regret of COMID is R_φ(T) ≤ (1/(2η)) max_{t ≤ T} ||x* − x_t||_∞^2 Σ_{i=1}^d ||g_{1:T,i}||_2 + η Σ_{i=1}^d ||g_{1:T,i}||_2. (21)

28 Cleaning up the theorem: define γ_T = Σ_{i=1}^d ||g_{1:T,i}||_2. (22) For primal-dual with η = ||x*||_∞: R_φ(T) ≤ 2 ||x*||_∞ γ_T + δ ||x*||_2^2 / ||x*||_∞ ≤ 2 ||x*||_∞ γ_T + δ ||x*||_1. (23) For composite mirror descent, with η = D_∞ / √2: R_φ(T) ≤ √2 D_∞ Σ_{i=1}^d ||g_{1:T,i}||_2 = √2 D_∞ γ_T. (24)
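
A short check (mine, not on the slides) that (24) follows from (21): substitute η = D_∞/√2 and bound max_{t ≤ T} ||x* − x_t||_∞ ≤ D_∞.

```latex
% eta = D_infty / sqrt(2), and max_{t <= T} ||x* - x_t||_infty <= D_infty
\begin{aligned}
R_\phi(T)
  &\le \frac{1}{2\eta}\,D_\infty^2\,\gamma_T + \eta\,\gamma_T
   = \left(\frac{D_\infty^2}{2\,D_\infty/\sqrt{2}} + \frac{D_\infty}{\sqrt{2}}\right)\gamma_T \\
  &= \left(\frac{D_\infty}{\sqrt{2}} + \frac{D_\infty}{\sqrt{2}}\right)\gamma_T
   = \sqrt{2}\,D_\infty\,\gamma_T .
\end{aligned}
```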

29 Outline

30 Experiments: comparisons are to the Passive-Aggressive and AROW algorithms; these are adaptive, but arise from mistake bounds. FOBOS is an earlier non-adaptive algorithm.

31 Experiment 1: RCV1. RCV1 is a standard text dataset: 4 categories, documents with 2 million word-count features (about 5000 features per vector); hinge loss with ℓ1 regularization.

32 The important point here is that it not only performs well, but does so with far fewer features (a sparser predictor vector).

33 Experiment 2: Image Ranking. Ranking hinge loss with ℓ1 regularization; 2 million images; 15,000 classes; the score is precision-at-k.

34 (figure-only slide)

35 Experiment 3: MNIST. MNIST is a standard digit recognition task: 60,000 examples, 30,000 features; the classifier is a Gaussian kernel machine.

36-37 (figure-only slides)

38 Experiment 4: Census. A UCI dataset; the task is to predict whether income exceeds a dollar threshold from binary features; 199,523 training samples.

39-40 Census results (figure-only slides)

41 Conclusions: RDA and COMID exploit the regularizer better; we can derive adaptive versions of these algorithms; they achieve low regret, good predictive accuracy, and better sparsity than comparable algorithms.

42 Not Covered: lower regret for strongly convex functions; regret bounds and an algorithm for the full-matrix case; implementation details (Section 6 of the tech report).

43 Abernethy, J., Bartlett, P., Rakhlin, A., and Tewari, A. (2008). Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory. Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3). Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Duchi, J., Hazan, E., and Singer, Y. (2010a). Adaptive subgradient methods for online learning and stochastic optimization. UC Berkeley EECS Technical Report, 24:1-41.

44 Duchi, J., Hazan, E., and Singer, Y. (2010b). Adaptive subgradient methods for online learning and stochastic optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory. Duchi, J., Shalev-Shwartz, S., Singer, Y., and Tewari, A. (2010c). Composite objective mirror descent. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory. Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning.
