Stochastic and online algorithms

Size: px

Start display at page:

Download "Stochastic and online algorithms"

Leon Pitts
5 years ago
Views:

1 Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1

2 Stochastic optimization problem minimize x X { } F(x) def = E ξ f(x,ξ) X R n is a (bounded) closed convex set ξ is a random vector whose distribution P is supported on set Ξ R d f : X Ξ R, and the expectation E ξ f(x,ξ) = Ξ f(x,ξ)dp(ξ) is well defined and has finite value for every x X F( ) continuous and convex on X, and optimal value F attained at x (e.g., F( ) is convex if f(,ξ) is convex for every ξ Ξ) Stochastic and online optimization 6 2

3 Sample average approximation minimize x X { ˆF N (x) def = 1 N N j=1 } f(x,ξ j ) assumption: {ξ j } N j=1 is a sequence of independent random outcomes reasonably efficient when solved by appropriate (deterministic) algorithm sample complexity: suppose f has bounded variation, and let V = max { f(x 1,ξ 1 ) f(x 2,ξ 2 ) : x 1,x 2 X, ξ 1,ξ 2 Ξ } then for any ǫ > 0 and ρ (0,1), sample size N = V 2 ln 2 2ǫ 2 ρ guarantees prob ( ˆF N (x) F(x) ǫ ) 1 ρ, x X (proved using Hoeding inequality) Stochastic and online optimization 6 3

4 Stochastic approximation choose x (1) X, and iterate for k = 1,2,... ) x (k+1) = π X (x (k) t k g(x (k),ξ k ) g(x,ξ) is a stochastic subgradient, i.e., g(x,ξ) x f(x,ξ) and F (x) def = E ξ g(x,ξ) F(x) assumption: there exist a constant G such that E ξ [ g(x,ξ) 2 2 ] G 2, x X π X ( ) denotes projection onto X: π X (x) = argmin y X y x 2 2 Stochastic and online optimization 6 4

5 Convergence analysis consider squared distance to x, and let r k = E [ ] x (k) x 2 2 x (k+1) x 2 2 = πx ( x (k) t k g(x (k),ξ k ) ) π X (x ) 2 2 x (k) t k g(x (k),ξ k ) x 2 2 = x (k) x 2 2t 2 k(x (k) x ) T g(x (k),ξ k )+t 2 k g(x (k),ξ k ) 2 2 since x (k) is a function of ξ [k 1] = (ξ 0,...,ξ k 1 ), it is independent of ξ k E [ (x (k) x ) T g(x (k),ξ k ) ] = E { E [ (x (k) x ) T g(x (k) ]},ξ k ) ξ [k 1] = E { (x (k) x ) T E [ g(x (k) ]},ξ k ) ξ [k 1] = E [ (x (k) x ) T F (x (k) ) ] therefore r k+1 r k 2t k E [ (x (k) x ) T F (x (k) ) ] +t 2 kg 2 (1) Stochastic and online optimization 6 5

6 by convexity of F, it holds F(x ) F(x (k) )+(x x (k) ) T F (x (k) ), hence E [ (x (k) x ) T F (x (k) ) ] E [ F(x (k) ) F ] combining with (1) gives t k E [ F(x (k) ) F ] 1 ( rk r k+1 +t 2 2 kg 2) summing over j = 1,...,k yields k t j E [ F(x (j) ) F ] = 1 ( k r 1 r k +G 2 2 j=1 j=1 t 2 k ) 1 2 ( r 1 +G 2 k j=1 t 2 k ) let ν (k) j = t j k i=1 t i and x (k) = k j=1 ν(k) j x (j) (note k j=1 ν(k) j = 1), then E [ F( x (k) ) F ] [ k E j=1 ν (k) j F(x (j) ) F ] r 1+G 2 k j=1 t2 k 2 k j=1 t j Stochastic and online optimization 6 6

7 Fixed step size suppose the number of iterations N is known in advance, then E [ F( x (k) ) F ] D2 +G 2 Nt 2 2Nt where D = max x X x x 2, so that r 1 = E x (1) x 2 2 D 2 minimizing upper bound over t > 0 gives t = D G N and E [ F( x (k) ) F ] DG N if t = θd G for some constant θ > 0, then N E [ F( x (k) ) F ] max{θ,θ 1 } DG N therefore, O(1/ N) convergence robust against step size choices Stochastic and online optimization 6 7

8 Diminishing step size following the halving trick in deterministic subgradient method, redefine t j x (j) if the step sizes are chosen as x (k) = k/2 j k k/2 j k t k = θd G k then the following holds with a small constant C > 1 E [ F( x (k) ) F ] Cmax{θ,θ 1 } DG k t j O(1/ k) convergence rate is optimal for general convex functions Stochastic and online optimization 6 8

9 Analysis for strongly convex functions assume F = E ξ f(x,ξ) is differentiable and strongly convex F(y) F(x)+ F(x) T (y x)+ µ 2 y x 2 2 x,y X or equivalently (x y) T ( F(x) F(y)) µ x y 2 2, x,y X by optimality of x, (x x ) T F(x ) 0, x X therefore (x x ) T F(x) µ x x 2 2, x X (2) Stochastic and online optimization 6 9

10 combining (1) and (2) gives r k+1 (1 2µt k )r k +t 2 kg 2 let s take step size t k = θ/k for some constant θ > 1/(2µ), then r k+1 (1 2µθ/k)r k +θ 2 G 2 /k 2 it follows by induction that (Nemirovski et al. 2009) E [ x (k) x 2 ] 2 = rk Q(θ) k where Q(θ) = max{θ 2 G 2 (2µθ 1) 1, x (1) x 2 2} if in addition F is Lipschitz continuous with constant L > 0, then E [ F(x (k) ) F ] L 2 E[ x (k) x 2 ] LQ(θ) 2 2k Stochastic and online optimization 6 10

11 Sensitivity to priori knowledge of µ example: let F(x) = x 2 /10, X = [ 1,1], µ = 0.2, and there is no noise if θ = 1 (which violates the condition θ > 1/(2µ)), then x (k+1) = x (k) 1 k F (x (k) ) = starting with x (1) = 1 leads to x (k) > 0.8k 1/5 error is larger than even after 10 9 iterations! ( 1 1 ) x (k) 5k if θ = 1/µ = 5, then x = 0 is obtained in one iteration step size t k = θ/k too small if F is not strongly convex Stochastic and online optimization 6 11

12 Online convex optimization explained as online game: for k = 1,2,3,..., player chooses x (k) X based on previous information adversary reveals cost function f k, and player encounters loss f k (x (k) ) assumptions: f k convex; X bounded, closed and convex player wants to minimize regret: R N N k=1 online subgradient method ( fk (x (k) ) ) min x X { N } f k (x) k=1 x (k+1) = π X ( x (k) t k g (k)), g (k) f k (x (k) ) with appropriate step size, can show R N O( N) Stochastic and online optimization 6 12

13 Connection to stochastic approximation a more general framework without stochastic assumptions suppose f k (x) def = f(x,ξ k ), and let x (N) = 1 N N k=1 x(k), then proof: E [ F( x (N) ) ] F 1 N E [ F( x (N) ) ] F 1 N E[R N] = 1 N N (E [ F(x (k) ) ] F ) k=1 N k=1 = 1 N E [ N ( E [ f(x (k),ξ k ) ] min x E [ f(x,ξ k ) ]) k=1 ( ) ] f(x (k),ξ k ) f(x,ξ k ) Stochastic and online optimization 6 13

14 Dual averaging method (Nesterov) initialize: choose x (1) R n and set s (0) = 0 iterate for k = 0,1,2, compute g (k) f k (x (k) ) and set 2. update: x (k+1) = argmin x X s (k) = s (k 1) +g (k) { s (k),x + β } k 2 x x(0) 2 2 = π X ( x (0) 1 β k s (k) ) choice of {β k }: e.g., β k = γ k with γ > 0 can also work with composite objectives: minimize x f(x)+ψ(x) Stochastic and online optimization 6 14

15 A soft support function for any β 0 and any x (0) X, define V β (s) = max x X { s,x x (0) β } 2 x x(0) 2 2 V β (s) 0 for any β 0; if β 2 β 1 > 0, then V β2 (s) V β1 (s) V β ( ) is convex and differentiable V β is Lipschitz continuous with constant 1/β V β (s 1 ) V β (s 2 ) 2 1 β s 1 s 2 2, s 1,s 2 R n therefore V β (s+δ) V β (s)+ δ, V β (s) + 1 2β δ 2 2 Stochastic and online optimization 6 15

16 lemma: let D = max x X x x (0) 2, then proof: max x X s,x x(0) βd2 2 +V β(s) { max x X s,x x(0) = max s,x x (0) : x X = max x X min β 0 min max β 0 x X max x X βd2 2 +V β(s) 1 2 x x(0) } 2 D2 { s,x x (0) + β 2 (D2 x x (0) 2 2) { s,x x (0) + β } 2 (D2 x x (0) 2 2) { s,x x (0) + β 2 (D2 x x (0) 2 2) } } Stochastic and online optimization 6 16

17 Convergence analysis V βk ( s (k) ) V βk 1 ( s (k) ) therefore V βk 1 ( s (k 1) )+ g (k), V βk 1 ( s (k 1) ) + 1 2β k 1 g (k) 2 2 = V βk 1 ( s (k 1) ) g (k),x (k) x (0) + 1 2β k 1 g (k) 2 2 g (k),x (k) x (0) V βk 1 ( s (k 1) ) V βk ( s (k) )+ 1 2β k 1 g (k) 2 2 summing over k = 2,...,N and choose x (0) = x (1) results in N g (k),x (k) x (0) V β1 ( s (1) ) V βn ( s (N) )+ k=1 N k=2 1 2β k 1 g (k) 2 2 Stochastic and online optimization 6 17

18 δ N def = max x X = = N g (k),x (k) x k=1 N g (k),x (k) x (0) +max k=1 x X N g (k),x (0) x k=1 N g (k),x (k) x (0) +max x X s(n),x x (0) k=1 V β1 ( s (1) ) V βn ( s (N) )+ 1 2β 1 g (1) 2 2+ β ND N 1 k=0 N k=2 N k=2 1 g (k) 2 2β 2+ β ND 2 k g (k) 2 2β 2+ β ND 2 k 1 2 G 2 2β k (for convenience, define β 0 = β 1 ) +V βn ( s (N) ) Stochastic and online optimization 6 18

19 by convexity, δ N def = max x X max x X = N k=1 k g (i),x (i) x i=1 N ( fk (x (k) ) f k (x) ) k=1 f k (x (k) ) min x X N f k (x) k=1 therefore, R N δ N, so def R N = N k=1 f k (x (k) ) min x X N f k (x) β ND 2 k=1 2 + N 1 k=0 G 2 2β k Stochastic and online optimization 6 19

20 choose parameters β k = γ k, k 1 and let β 0 = β 1, then N 1 k=0 G 2 = G2 2β k 2γ ( 1+ N 1 k=1 ) 1 k G2 2γ ( 2+ N 1 ) 1 dt t = G2 N γ finally, ) N R N (γ D2 2 + G2 γ upper bound is minimized by choosing γ = 2 G D which yields R N 2GD N Stochastic and online optimization 6 20

21 Minimizing finite average of convex functions problem minimize F(x) def = 1 n n f i (x) stochastic gradient method: pick i k {1,...,n} randomly and update i=1 x k+1 = x k η k f ik (x k ) two perspectives: stochastic optimization: a special case of minimizing E ξ f(x,ξ) deterministic optimization: a randomized incremental gradient method for a structured convex problem Stochastic and online optimization 6 21

22 Mind the problem structure stochastic optimization perspective: a general method used to solve a special case complexity theory: O( 1 ) or O( 1 ǫ 2 ǫ ) with strong convexity deterministic optimization perspective: a special method for solving a structured problem sanity test: should at least beat full gradient methods: complexity O(n L µ log 1 ǫ ) or O(n L µ log 1 ǫ ) recent progresses: SAG and SVRG by exploiting finite average structure Stochastic and online optimization 6 22

23 Stochastic average gradient (SAG) SAG method (Le Roux, Schmidt, Bach 2012) x k+1 = x k α k n n i=1 g (i) k where g (i) { k = fi (x k ) if i = i k g (i) otherwise k 1 a randomized variant of incremental aggregated gradient (IAG) of Blatt, Hero, & Gauchman (2007) complexity (# component gradient evaluations): O(max{n, L µ }log 1 ǫ ) cf. full gradient method: O(n L µ log 1 ǫ ), and stochastic gradient: O(1 ǫ ) need to store most recent gradient of each component, but can be avoided for some structured problems Stochastic and online optimization 6 23

24 Stochastic variance reduced gradient (SVRG) SVRG (Johnson & Zhang 2013, Mahdavi, Zhang & Jin 2013) x k+1 = x k η( f ik (x k ) f ik ( x)+ F( x)) and update x periodically (every few passes) still a stochastic gradient method E[ f ik (x k ) f ik ( x)+ F( x)] = F(x k ) F( x)+ F( x) = F(x k ) expected update direction is the same as Ef ik (x k ) variance can be diminishing if x updated periodically ( ) ( ) complexity: O (n+ L µ )log 1 ǫ, cf. SAG: O max{n, L µ }log 1 ǫ Stochastic and online optimization 6 24

25 Stochastic variance reduced gradient (SVRG) computational cost per iteration: unlike SAG, no need to store gradients for each components need to compute two gradients at each iteration, and also full gradient periodically (no more than three per epoch) for many structured problems, two gradients at each iteration can be reduced to only one intuition of variance reduction f ik ( x) f ik (x k ) x F( x) f ik ( x) x k F( x) f ik ( x) F( x) F(x k ) Stochastic and online optimization 6 25

26 Problem statement and assumptions assumptions: minimize x R d F(x) = 1 n n f i (x) i=1 each f i (x), for i = 1,...,n, is convex each f i (x) is smooth with Lipschitz constant L f i (x) f i (y) L x y (which implies that F(x) also has Lipschitz constant L) F(x) strongly convex: for all x,y R d, F(y) F(x)+ F(x) T (y x)+ µ y x 2 2 Stochastic and online optimization 6 26

27 SVRG method input: x 0, η, m iterate: for s = 1,2,... end x = x s 1 ṽ = F( x) x 0 = x iterate: for k = 1,2,...,m end pick i k {1,...,n} uniformly at random x k = x k 1 η ( f ik (x k 1 ) f ik ( x)+ṽ ) set x s = 1 m m k=1 x k 1 Stochastic and online optimization 6 27

28 Convergence analysis of SVRG theorem: suppose 0 < η 1/2L and m sufficiently large so that ρ = 1 µη(1 2Lη)m + 2Lη 1 2Lη < 1 then we have geometric convergence in expectation: EF( x s ) F(x ) ρ s [F( x 0 ) F(x )] more concretely, if η = θ/l, then ρ = L/µ θ(1 2θ)m + 2θ 1 2θ choosing θ = 0.1 and m = 50(L/µ) results in ρ = 1/2 overall complexity: O (( L µ +n ) log ( ) ) 1 ǫ Stochastic and online optimization 6 28

29 Proof let g k = f ik (x x 1 ) f ik ( x)+ F( x), then x k = x k 1 ηg k, and E ik [g k ] = F(x k 1 ) similar as in classical analysis of stochastic gradient methods E x k x 2 = E x k 1 ηg k x 2 = x k 1 x 2 2η(x k 1 x ) T E[g k ]+η 2 E[ g k 2 ] = x k 1 x 2 2η(x k 1 x ) T F(x k 1 )+η 2 E[ g k 2 ] x k 1 x 2 2η(F(x k 1 ) F(x ))+η 2 E[ g k 2 ] then need to bound E[ g k 2 ] carefully using the finite average structure Stochastic and online optimization 6 29

30 by smoothness of f i (x), fi (x) f i (x ) 2 2L [ f i (x) f i (x ) f i (x ) T (x x ) ] summing above inequalities over i = 1,...,n and using F(x ) = 0, 1 n n fi (x) f i (x ) 2 2L [ F(x) F(x )] i=1 E g k 2 = E fik (x k 1 ) f ik (x )+ f ik (x ) f ik ( x)+ F( x) 2 2E fik (x k 1 ) f ik (x ) 2 +2E fik ( x) f ik (x ) F( x) 2 = 2E fik (x k 1 ) f ik (x ) 2 +2E f ik ( x) f ik (x ) E[ f ik ( x) f ik (x )] 2 2E fik (x k 1 ) f ik (x ) 2 +2E fik ( x) f ik (x ) 2 4L [ F(x k 1 ) F(x )+F( x) F(x ) ] Stochastic and online optimization 6 30

31 continue derivation on page 6 29 E x k x 2 x k 1 x 2 2η(1 2Lη)[F(x k 1 ) F(x )]+4Lη 2[ F( x) F(x ) ] summing over k = 1,...,m, and take expectation w.r.t. whole history E x m x 2 +2η(1 2Lη) m 1 k=0 E x 0 x 2 +4Lmη 2 E[F(x 0 ) F(x )] E[F(x k ) F(x )] 2 µ E[F(x 0) F(x )]+4Lmη 2 E[F(x 0 ) F(x )] therefore, for each stage s E[F( x s ) F(x )] 1 m m 1 k=0 E[F(x k ) F(x )] 1 2η(1 2Lη)m ( ) 2 µ +4Lmη2 E[F(x 0 ) F(x )] Stochastic and online optimization 6 31

32 Numerical experiments binary classification: (a 1,b 1 ),...,(a n,b n ) with a i R d, b i {+1, 1} regularized logistic regression 1 minimize x R d n n i=1 log ( 1+exp( b i a T i x) ) + λ 2 2 x 2 2+λ 1 x 1 nonsmooth term x 1 handled by proximal gradient methods data sets and characteristics: data sets n d λ 2 λ 1 rcv1 20,242 47, covertype 581, sido0 12,678 4, Stochastic and online optimization 6 32

33 10 0 objective gap: P(xk) P η = 1.0/L η = 0.1/L η = 0.01/L number of effective passes SVRG on rcv1 dataset: varying step size η with m = 2n Stochastic and online optimization 6 33

34 objective gap: P(xk) P m = 1n m = 2n m = 4n m = 6n m = 8n number of effective passes SVRG on rcv1 dataset with λ 2 = 10 4 and stepsize η = 0.1/L: varying the period m between full gradient evaluations Stochastic and online optimization 6 34

35 objective gap: P(xk) P m = 1n m = 2n m = 4n m = 6n m = 8n number of effective passes SVRG on rcv1 dataset with λ 2 = 10 5 and stepsize η = 0.1/L: varying the period m between full gradient evaluations Stochastic and online optimization 6 35

36 10 0 objective gap: P(xk) P Prox-SG RDA Prox-FG Prox-AFG Prox-SAG Prox-SVRG Prox-SDCA number of effective passes comparison with related algorithms on rcv1 datasets Stochastic and online optimization 6 36

37 10 0 objective gap: P(xk) P Prox-FG Prox-AFG Prox-SG RDA Prox-SAG Prox-SDCA Prox-SVRG Prox-SVRG number of effective passes comparison with related algorithms on covertype datasets Stochastic and online optimization 6 37

38 objective gap: P(xk) P Prox-FG Prox-AFG Prox-SG RDA Prox-SAG Prox-SDCA Prox-SVRG Prox-SVRG number of effective passes comparison with related algorithms on sido0 datasets Stochastic and online optimization 6 38

39 References A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization (2009) Yu. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming (2009) L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research (2010) N. Le Roux, M. Schmidt and F. Bach, A stochastic gradient method with an exponential convergence rate for strongly convex optimization with finite training sets, NIPS (2012) R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction NIPS (2013) L. Xiao and T. Zhang, A proximal stochastic gradient method with progressive variance reduction, manuscript (2014) Stochastic and online optimization 6 39

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction