Composite Objective Mirror Descent


1 Composite Objective Mirror Descent

John C. Duchi (1,3), Shai Shalev-Shwartz (2), Yoram Singer (3), Ambuj Tewari (4)
1 University of California, Berkeley; 2 Hebrew University of Jerusalem, Israel; 3 Google Research; 4 Toyota Technological Institute, Chicago

June 29, 2010

2 Large scale logistic regression

Problem: $n$ huge,
$$\min_x \; \underbrace{\frac{1}{n}\sum_{i=1}^n \log\bigl(1+\exp(-\langle a_i, x\rangle)\bigr)}_{=f(x)} \;+\; \lambda\|x\|_1$$
Usual approach: online gradient descent (Zinkevich 03). Let $g_t = \nabla\log\bigl(1+\exp(-\langle a_t, x_t\rangle)\bigr)$ and update
$$x_{t+1} = x_t - \eta_t g_t - \eta_t\lambda\,\mathrm{sign}(x_t),$$
then perform online-to-batch conversion.
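For concreteness, here is a minimal NumPy sketch of this subgradient step; the function names (`sigmoid`, `ogd_l1_step`) are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + np.exp(-z))
    ez = np.exp(z)
    return ez / (1.0 + ez)

def logistic_grad(a, x):
    """Gradient of log(1 + exp(-<a, x>)) with respect to x."""
    return -sigmoid(-float(a @ x)) * a

def ogd_l1_step(x, a, eta, lam):
    """One OGD step on log(1 + exp(-<a, x>)) + lam * ||x||_1,
    using the subgradient lam * sign(x) for the l1 term."""
    g = logistic_grad(a, x) + lam * np.sign(x)
    return x - eta * g
```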

3 Problems with usual approach

Regret bound/convergence rate: set $G = \max_t \|g_t + \lambda\,\mathrm{sign}(x_t)\|_2$. Then
$$f(x_T)+\lambda\|x_T\|_1 = f(x^\star)+\lambda\|x^\star\|_1 + O\!\left(\frac{\|x^\star\|_2\, G}{\sqrt{T}}\right)$$
But $G = \Theta(\sqrt{d})$: an additional penalty from the $\mathrm{sign}(x_t)$ term.
No sparsity in $x_T$.

4 Problems with usual approach

Regret bound/convergence rate: set $G = \max_t \|g_t + \lambda\,\mathrm{sign}(x_t)\|_2$. Then
$$f(x_T)+\lambda\|x_T\|_1 = f(x^\star)+\lambda\|x^\star\|_1 + O\!\left(\frac{\|x^\star\|_2\, G}{\sqrt{T}}\right)$$
But $G = \Theta(\sqrt{d})$: an additional penalty from the $\mathrm{sign}(x_t)$ term.
No sparsity in $x_T$.
Why should we suffer from the $\ell_1$ term?

5 Online Gradient Descent

Let $g_t = \nabla\log\bigl(1+\exp(-\langle a_t, x_t\rangle)\bigr) + \lambda\,\mathrm{sign}(x_t)$. OGD step (Zinkevich 03):
$$x_{t+1} = x_t - \eta g_t = \operatorname*{argmin}_x \left\{ \eta\langle g_t, x\rangle + \tfrac12\|x - x_t\|_2^2 \right\}$$
[Figure: the objective $f(x)+\lambda\|x\|_1$ and its linearization $f(x_t)+\langle g_t, x - x_t\rangle$.]

6 Online Gradient Descent

Let $g_t = \nabla\log\bigl(1+\exp(-\langle a_t, x_t\rangle)\bigr) + \lambda\,\mathrm{sign}(x_t)$. OGD step (Zinkevich 03):
$$x_{t+1} = x_t - \eta g_t = \operatorname*{argmin}_x \left\{ \eta\langle g_t, x\rangle + \tfrac12\|x - x_t\|_2^2 \right\}$$
[Figure: the linear model $\langle g_t, x\rangle$ plus the proximal term $B_\psi(x, x_t)$.]

7 Problems with Subgradient Methods

Subgradients are non-informative at singularities.

10 Composite Objective Approach

Let $g_t = \nabla\log\bigl(1+\exp(-\langle a_t, x_t\rangle)\bigr)$. Truncated gradient (Langford et al. 08, Duchi & Singer 09):
$$x_{t+1} = \operatorname*{argmin}_x \left\{ \tfrac12\|x - x_t\|^2 + \eta\langle g_t, x\rangle + \eta\lambda\|x\|_1 \right\} = \mathrm{sign}(x_t - \eta g_t)\,\bigl[\,|x_t - \eta g_t| - \eta\lambda\,\bigr]_+$$
[Figure: soft thresholding of $x_t - \eta g_t$ at level $\eta\lambda$.]
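A minimal NumPy sketch of this composite (truncated-gradient) step; the closed form is soft thresholding, and the function name `comid_l1_step` is mine.

```python
import numpy as np

def comid_l1_step(x, g, eta, lam):
    """argmin_z 0.5*||z - x||^2 + eta*<g, z> + eta*lam*||z||_1,
    i.e. a gradient step followed by soft thresholding at level eta*lam."""
    y = x - eta * g
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)
```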

11 Composite Objective Approach

Update is
$$x_{t+1} = \mathrm{sign}(x_t - \eta g_t)\,\bigl[\,|x_t - \eta g_t| - \eta\lambda\,\bigr]_+$$
Two nice things:
- Sparsity from $[\cdot]_+$
- Convergence rate: let $G = \max_t \|g_t\|_2$. Then
$$f(x_T)+\lambda\|x_T\|_1 = f(x^\star)+\lambda\|x^\star\|_1 + O\!\left(\frac{\|x^\star\|_2\, G}{\sqrt{T}}\right)$$
No extra penalty from $\lambda\|x\|_1$!

12 Abstraction to Regularized Online Convex Optimization

Repeat:
- Learner plays point $x_t$
- Receive $f_t + \varphi$ ($\varphi$ known)
- Suffer loss $f_t(x_t) + \varphi(x_t)$
Goal: attain small regret
$$R(T) := \sum_{t=1}^T \bigl[f_t(x_t) + \varphi(x_t)\bigr] - \inf_{x\in X} \sum_{t=1}^T \bigl[f_t(x) + \varphi(x)\bigr]$$

13 Composite Objective MIrror Descent

Let $g_t = f_t'(x_t) \in \partial f_t(x_t)$. Comid step:
$$x_{t+1} = \operatorname*{argmin}_{x\in X} \bigl\{ B_\psi(x, x_t) + \eta\langle g_t, x\rangle + \eta\varphi(x) \bigr\}$$
[Figure: the loss $f(x)$, the regularizer $\varphi(x)$, and the composite $f(x)+\varphi(x)$.]

14 Composite Objective MIrror Descent

Let $g_t = f_t'(x_t) \in \partial f_t(x_t)$. Comid step:
$$x_{t+1} = \operatorname*{argmin}_{x\in X} \bigl\{ B_\psi(x, x_t) + \eta\langle g_t, x\rangle + \eta\varphi(x) \bigr\}$$
[Figure: the composite $f(x)+\varphi(x)$ and the Comid model $B_\psi(x, x_t) + \langle g_t, x\rangle + \varphi(x)$.]
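As a sketch of this abstraction, the loop below takes the per-round composite minimization as a user-supplied `prox` callback; all names are illustrative. With $\psi(x)=\tfrac12\|x\|_2^2$ and $\varphi=\lambda\|\cdot\|_1$, `prox` is exactly the soft-thresholding step sketched earlier.

```python
import numpy as np

def comid(prox, subgrad, x0, eta, T):
    """Generic Comid loop (sketch).

    subgrad(t, x) returns g_t in the subdifferential of f_t at x.
    prox(x, g, eta) returns
        argmin_z  B_psi(z, x) + eta*<g, z> + eta*phi(z)
    for the chosen Bregman divergence B_psi and regularizer phi.
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(T):
        g = subgrad(t, x)
        x = prox(x, g, eta)
    return x
```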

15 Convergence Results

Old (online gradient/mirror descent):
Theorem: For any $x^\star \in X$,
$$\sum_{t=1}^T \bigl[ f_t(x_t) + \varphi(x_t) - f_t(x^\star) - \varphi(x^\star) \bigr] \le \frac{1}{\eta} B_\psi(x^\star, x_1) + \frac{\eta}{2} \sum_{t=1}^T \bigl\| f_t'(x_t) + \varphi'(x_t) \bigr\|^2$$

16 Convergence Results

Old (online gradient/mirror descent):
Theorem: For any $x^\star \in X$,
$$\sum_{t=1}^T \bigl[ f_t(x_t) + \varphi(x_t) - f_t(x^\star) - \varphi(x^\star) \bigr] \le \frac{1}{\eta} B_\psi(x^\star, x_1) + \frac{\eta}{2} \sum_{t=1}^T \bigl\| f_t'(x_t) + \varphi'(x_t) \bigr\|^2$$
New (Comid):
Theorem: For any $x^\star \in X$,
$$\sum_{t=1}^T \bigl[ f_t(x_t) + \varphi(x_t) - f_t(x^\star) - \varphi(x^\star) \bigr] \le \frac{1}{\eta} B_\psi(x^\star, x_1) + \frac{\eta}{2} \sum_{t=1}^T \bigl\| f_t'(x_t) \bigr\|^2$$

17 Derived Algorithms

- FOBOS (Duchi & Singer, 2009)
- p-norm divergences
- Mixed-norm regularization
- Matrix Comid

18 p-norms

Better $\ell_1$ algorithms: $\varphi(x) = \lambda\|x\|_1$

19 p-norms

Better $\ell_1$ algorithms: $\varphi(x) = \lambda\|x\|_1$
Idea: non-Euclidean geometry (e.g., dense gradients, sparse $x^\star$)
Recall $\frac{1}{2(p-1)}\|x\|_p^2$ is strongly convex over $\mathbb{R}^d$ w.r.t. $\ell_p$ for $1 < p \le 2$
Take $\psi(x) = \tfrac12\|x\|_p^2$
Corollary: When $\|f_t'(x_t)\|_\infty \le G$, take $p = 1 + 1/\log d$ to get
$$R(T) = O\bigl( \|x^\star\|_1\, G\, \sqrt{T\log d} \bigr)$$

20 Derived p-norm algorithms

SMIDAS (Shalev-Shwartz & Tewari 2009): take $\varphi(x) = \lambda\|x\|_1$.
Using $\mathrm{sign}([\nabla\psi(x)]_j) = \mathrm{sign}(x_j)$, define $S_\lambda(z) = \mathrm{sign}(z)\,[\,|z| - \lambda\,]_+$. Then
$$x_{t+1} = (\nabla\psi)^{-1}\bigl( S_{\eta\lambda}\bigl( \nabla\psi(x_t) - \eta f_t'(x_t) \bigr) \bigr)$$
[Figure: shrinkage at level $\eta\lambda$ mapping the intermediate point $x_{t+1/2}$ to $x_{t+1}$.]
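This update can be sketched with the standard $p$-norm link functions for $\psi(x)=\tfrac12\|x\|_p^2$; a minimal sketch assuming an unconstrained domain, with function names of my own choosing.

```python
import numpy as np

def grad_psi(x, p):
    """Link function: gradient of psi(x) = 0.5 * ||x||_p^2."""
    nrm = np.linalg.norm(x, ord=p)
    if nrm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) * nrm ** (2 - p)

def grad_psi_inv(theta, p):
    """Inverse link: gradient of the conjugate 0.5 * ||theta||_q^2, 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    nrm = np.linalg.norm(theta, ord=q)
    if nrm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (q - 1) * nrm ** (2 - q)

def smidas_step(x, g, eta, lam, p):
    """p-norm Comid step with phi = lam * ||.||_1:
    gradient step in the dual, shrink at eta*lam, map back through the inverse link."""
    theta = grad_psi(x, p) - eta * g
    shrunk = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)
    return grad_psi_inv(shrunk, p)
```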

21 Comid with mixed norms

$$\varphi(X) = \|X\|_{\ell_1/\ell_q} = \sum_{j=1}^d \|x^j\|_q, \qquad X = \begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^d \end{bmatrix}$$
- Separable and solvable using previous methods
- Multitask and multiclass learning: row $x^j$ associated with feature $j$; penalize $x^j$ once
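For reference, the mixed norm itself is just a sum of row norms; a one-line NumPy sketch (the row-per-feature orientation is an assumption):

```python
import numpy as np

def mixed_l1_lq_norm(X, q):
    """||X||_{l1/lq}: sum over rows x^j of the vector q-norm ||x^j||_q."""
    return float(np.sum(np.linalg.norm(X, ord=q, axis=1)))
```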

22 Mixed-norm p-norm algorithms

Specialize the problem to
$$\min_x \; \langle v, x\rangle + \tfrac12\|x\|_p^2 + \lambda\|x\|_\infty$$
Closed form? No.

23 Mixed-norm p-norm algorithms

Specialize the problem to
$$\min_x \; \langle v, x\rangle + \tfrac12\|x\|_p^2 + \lambda\|x\|_\infty$$
Closed form? No.
Dual problem (with $x = v - \beta$):
$$\min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda$$

24 Mixed-norm p-norm algorithms

Problem:
$$\min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda$$
Observation: monotonicity of $\beta$, so $|v_i| \ge |v_j|$ implies $|\beta_i| \ge |\beta_j|$

25 Mixed-norm p-norm algorithms

Problem:
$$\min_\beta \; \|v - \beta\|_q \quad \text{subject to} \quad \|\beta\|_1 \le \lambda$$
Observation: monotonicity of $\beta$, so $|v_i| \ge |v_j|$ implies $|\beta_i| \ge |\beta_j|$
Root-finding problem:
$$\lambda = \sum_{i=1}^d \beta_i(\theta) = \sum_{i=1}^d \bigl[\,|v_i| - \theta^{1/(q-1)}\,\bigr]_+$$
Solve with a median-like search.
[Figure: $\beta_i(\theta)$ as a function of $\theta$ for selected coordinates $v_4, v_6, v_8$.]
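A minimal sketch of the dual solve using plain bisection on $\theta$ and the root-finding characterization from this slide; the slides use a faster median-like search, and the function name and iteration cap are illustrative.

```python
import numpy as np

def solve_mixed_norm_dual(v, lam, q, iters=200):
    """min_beta ||v - beta||_q  s.t.  ||beta||_1 <= lam, via the characterization
    sum_i [|v_i| - theta^{1/(q-1)}]_+ = lam (bisection instead of median search)."""
    a = np.abs(v)
    if a.sum() <= lam:              # constraint inactive: beta = v is optimal
        return np.asarray(v, dtype=float).copy()

    def total(theta):
        return np.maximum(a - theta ** (1.0 / (q - 1.0)), 0.0).sum()

    lo, hi = 0.0, float(a.max()) ** (q - 1.0)   # total(lo) > lam, total(hi) = 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > lam:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return np.sign(v) * np.maximum(a - theta ** (1.0 / (q - 1.0)), 0.0)
```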

26 Matrix Comid

Idea: get sparsity in the spectrum of $X \in \mathbb{R}^{d_1 \times d_2}$. Take
$$\varphi(X) = \|X\|_1 = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)$$

27 Matrix Comid

Idea: get sparsity in the spectrum of $X \in \mathbb{R}^{d_1 \times d_2}$. Take
$$\varphi(X) = \|X\|_1 = \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)$$
Schatten p-norms: apply p-norms to the singular values of $X \in \mathbb{R}^{d_1 \times d_2}$:
$$\|X\|_p = \|\sigma(X)\|_p = \Bigl( \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)^p \Bigr)^{1/p}$$
Important fact: for $1 < p \le 2$, $\psi(X) = \frac{1}{2(p-1)}\|X\|_p^2$ is strongly convex w.r.t. $\|\cdot\|_p$ (Ball et al., 1994)

28 Matrix Comid

Schatten p-norms: apply p-norms to the singular values of $X \in \mathbb{R}^{d_1 \times d_2}$:
$$\|X\|_p = \|\sigma(X)\|_p = \Bigl( \sum_{i=1}^{\min\{d_1, d_2\}} \sigma_i(X)^p \Bigr)^{1/p}$$
Important fact: for $1 < p \le 2$, $\psi(X) = \frac{1}{2(p-1)}\|X\|_p^2$ is strongly convex w.r.t. $\|\cdot\|_p$ (Ball et al., 1994)
Consequence: take $p = 1 + 1/\log d$ and $G \ge \|f_t'(X_t)\|$. Comid with the above $\psi$ has
$$R(T) = O\bigl( G\, \|X^\star\|_1\, \sqrt{T\log d} \bigr)$$
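A small sketch of the Schatten $p$-norm and the corresponding $\psi$ via the SVD; names are illustrative.

```python
import numpy as np

def schatten_norm(X, p):
    """Schatten p-norm: the vector p-norm of the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.linalg.norm(s, ord=p))

def psi_schatten(X, p):
    """psi(X) = ||X||_p^2 / (2(p-1)), strongly convex w.r.t. ||.||_p for 1 < p <= 2."""
    return schatten_norm(X, p) ** 2 / (2.0 * (p - 1.0))
```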

29 Trace-norm Regularization

Idea: get sparsity in the spectrum; take $\varphi(X) = \|X\|_1 = \sum_i \sigma_i(X)$:
$$X_{t+1} = \operatorname*{argmin}_{X \in \mathcal{X}} \; \eta\langle f_t'(X_t), X\rangle + B_\psi(X, X_t) + \eta\lambda\|X\|_1$$
For $1 < p \le 2$, the update is:
- Compute SVD: $X_t = U\,\mathrm{diag}(\sigma(X_t))\,V^\top$
- Gradient step: $X_{t+\frac12} = U\,\mathrm{diag}\bigl(\nabla\psi(\sigma(X_t))\bigr)\,V^\top - \eta f_t'(X_t)$
- Compute SVD: $X_{t+\frac12} = \tilde{U}\,\mathrm{diag}(\sigma(X_{t+\frac12}))\,\tilde{V}^\top$
- Shrinkage: $X_{t+1} = \tilde{U}\,\mathrm{diag}\bigl( (\nabla\psi)^{-1}\bigl( S_{\eta\lambda}(\sigma(X_{t+\frac12})) \bigr) \bigr)\,\tilde{V}^\top$

30 Trace-norm Regularization Example

Proximal function: $\psi(X) = \tfrac12\|X\|_2^2 = \tfrac12\|X\|_{\mathrm{Fr}}^2$
Update: $X_{t+\frac12} = X_t - \eta f_t'(X_t) \;\bigl( = U\Sigma_{t+\frac12}V^\top \bigr)$
Shrinkage: $X_{t+1} = U\bigl[ \Sigma_{t+\frac12} - \eta\lambda \bigr]_+ V^\top$
[Figure: the components $u_i\sigma_i$ of $X_{t+\frac12}$.]

31 Trace-norm Regularization Example

Proximal function: $\psi(X) = \tfrac12\|X\|_2^2 = \tfrac12\|X\|_{\mathrm{Fr}}^2$
Update: $X_{t+\frac12} = X_t - \eta f_t'(X_t) \;\bigl( = U\Sigma_{t+\frac12}V^\top \bigr)$
Shrinkage: $X_{t+1} = U\bigl[ \Sigma_{t+\frac12} - \eta\lambda \bigr]_+ V^\top$
[Figure: the surviving components $u_i[\sigma_i - \eta\lambda]_+$ after shrinkage.]
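For the Frobenius proximal function (the $p = 2$ special case of the four-step update above), the step reduces to singular-value soft thresholding; a minimal sketch with illustrative names.

```python
import numpy as np

def tracenorm_comid_step(X, G, eta, lam):
    """Frobenius-psi Comid step with phi = lam * ||.||_1 (trace norm):
    take a gradient step, then soft-threshold the singular values at eta*lam."""
    Y = X - eta * G                                  # X_{t+1/2}
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - eta * lam, 0.0)        # [Sigma_{t+1/2} - eta*lam]_+
    return (U * s_shrunk) @ Vt
```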

32 Proof ideas for trace-norm

Idea: unitary invariance to reduce to the vector case (Lewis 1995):
$$\nabla\psi(X) = U\,\mathrm{diag}\bigl[\nabla\psi(\sigma(X))\bigr]V^\top, \qquad \partial\|X\|_1 = U\,\mathrm{diag}\bigl(\partial\|\sigma(X)\|_1\bigr)V^\top$$
Simply reduce to the vector case with $\ell_1$-regularization.

33 Conclusions and Related Work

All derivations apply to Regularized Dual Averaging (Xiao 2009):
$$x_{t+1} = \operatorname*{argmin}_{x\in X} \Bigl\{ \eta\sum_{\tau=1}^t \langle g_\tau, x\rangle + \eta t\,\varphi(x) + \psi(x) \Bigr\}$$
- Analysis of online convex programming for regularized objectives
- Unify several previous algorithms (projected gradient, mirror descent, forward-backward splitting)
- Derived algorithms for several regularization functions
