Stochastic and online algorithms

Size: px
Start display at page:

Download "Stochastic and online algorithms"

Transcription

1 Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1

2 Stochastic optimization problem minimize x X { } F(x) def = E ξ f(x,ξ) X R n is a (bounded) closed convex set ξ is a random vector whose distribution P is supported on set Ξ R d f : X Ξ R, and the expectation E ξ f(x,ξ) = Ξ f(x,ξ)dp(ξ) is well defined and has finite value for every x X F( ) continuous and convex on X, and optimal value F attained at x (e.g., F( ) is convex if f(,ξ) is convex for every ξ Ξ) Stochastic and online optimization 6 2

3 Sample average approximation minimize x X { ˆF N (x) def = 1 N N j=1 } f(x,ξ j ) assumption: {ξ j } N j=1 is a sequence of independent random outcomes reasonably efficient when solved by appropriate (deterministic) algorithm sample complexity: suppose f has bounded variation, and let V = max { f(x 1,ξ 1 ) f(x 2,ξ 2 ) : x 1,x 2 X, ξ 1,ξ 2 Ξ } then for any ǫ > 0 and ρ (0,1), sample size N = V 2 ln 2 2ǫ 2 ρ guarantees prob ( ˆF N (x) F(x) ǫ ) 1 ρ, x X (proved using Hoeding inequality) Stochastic and online optimization 6 3

4 Stochastic approximation choose x (1) X, and iterate for k = 1,2,... ) x (k+1) = π X (x (k) t k g(x (k),ξ k ) g(x,ξ) is a stochastic subgradient, i.e., g(x,ξ) x f(x,ξ) and F (x) def = E ξ g(x,ξ) F(x) assumption: there exist a constant G such that E ξ [ g(x,ξ) 2 2 ] G 2, x X π X ( ) denotes projection onto X: π X (x) = argmin y X y x 2 2 Stochastic and online optimization 6 4

5 Convergence analysis consider squared distance to x, and let r k = E [ ] x (k) x 2 2 x (k+1) x 2 2 = πx ( x (k) t k g(x (k),ξ k ) ) π X (x ) 2 2 x (k) t k g(x (k),ξ k ) x 2 2 = x (k) x 2 2t 2 k(x (k) x ) T g(x (k),ξ k )+t 2 k g(x (k),ξ k ) 2 2 since x (k) is a function of ξ [k 1] = (ξ 0,...,ξ k 1 ), it is independent of ξ k E [ (x (k) x ) T g(x (k),ξ k ) ] = E { E [ (x (k) x ) T g(x (k) ]},ξ k ) ξ [k 1] = E { (x (k) x ) T E [ g(x (k) ]},ξ k ) ξ [k 1] = E [ (x (k) x ) T F (x (k) ) ] therefore r k+1 r k 2t k E [ (x (k) x ) T F (x (k) ) ] +t 2 kg 2 (1) Stochastic and online optimization 6 5

6 by convexity of F, it holds F(x ) F(x (k) )+(x x (k) ) T F (x (k) ), hence E [ (x (k) x ) T F (x (k) ) ] E [ F(x (k) ) F ] combining with (1) gives t k E [ F(x (k) ) F ] 1 ( rk r k+1 +t 2 2 kg 2) summing over j = 1,...,k yields k t j E [ F(x (j) ) F ] = 1 ( k r 1 r k +G 2 2 j=1 j=1 t 2 k ) 1 2 ( r 1 +G 2 k j=1 t 2 k ) let ν (k) j = t j k i=1 t i and x (k) = k j=1 ν(k) j x (j) (note k j=1 ν(k) j = 1), then E [ F( x (k) ) F ] [ k E j=1 ν (k) j F(x (j) ) F ] r 1+G 2 k j=1 t2 k 2 k j=1 t j Stochastic and online optimization 6 6

7 Fixed step size suppose the number of iterations N is known in advance, then E [ F( x (k) ) F ] D2 +G 2 Nt 2 2Nt where D = max x X x x 2, so that r 1 = E x (1) x 2 2 D 2 minimizing upper bound over t > 0 gives t = D G N and E [ F( x (k) ) F ] DG N if t = θd G for some constant θ > 0, then N E [ F( x (k) ) F ] max{θ,θ 1 } DG N therefore, O(1/ N) convergence robust against step size choices Stochastic and online optimization 6 7

8 Diminishing step size following the halving trick in deterministic subgradient method, redefine t j x (j) if the step sizes are chosen as x (k) = k/2 j k k/2 j k t k = θd G k then the following holds with a small constant C > 1 E [ F( x (k) ) F ] Cmax{θ,θ 1 } DG k t j O(1/ k) convergence rate is optimal for general convex functions Stochastic and online optimization 6 8

9 Analysis for strongly convex functions assume F = E ξ f(x,ξ) is differentiable and strongly convex F(y) F(x)+ F(x) T (y x)+ µ 2 y x 2 2 x,y X or equivalently (x y) T ( F(x) F(y)) µ x y 2 2, x,y X by optimality of x, (x x ) T F(x ) 0, x X therefore (x x ) T F(x) µ x x 2 2, x X (2) Stochastic and online optimization 6 9

10 combining (1) and (2) gives r k+1 (1 2µt k )r k +t 2 kg 2 let s take step size t k = θ/k for some constant θ > 1/(2µ), then r k+1 (1 2µθ/k)r k +θ 2 G 2 /k 2 it follows by induction that (Nemirovski et al. 2009) E [ x (k) x 2 ] 2 = rk Q(θ) k where Q(θ) = max{θ 2 G 2 (2µθ 1) 1, x (1) x 2 2} if in addition F is Lipschitz continuous with constant L > 0, then E [ F(x (k) ) F ] L 2 E[ x (k) x 2 ] LQ(θ) 2 2k Stochastic and online optimization 6 10

11 Sensitivity to priori knowledge of µ example: let F(x) = x 2 /10, X = [ 1,1], µ = 0.2, and there is no noise if θ = 1 (which violates the condition θ > 1/(2µ)), then x (k+1) = x (k) 1 k F (x (k) ) = starting with x (1) = 1 leads to x (k) > 0.8k 1/5 error is larger than even after 10 9 iterations! ( 1 1 ) x (k) 5k if θ = 1/µ = 5, then x = 0 is obtained in one iteration step size t k = θ/k too small if F is not strongly convex Stochastic and online optimization 6 11

12 Online convex optimization explained as online game: for k = 1,2,3,..., player chooses x (k) X based on previous information adversary reveals cost function f k, and player encounters loss f k (x (k) ) assumptions: f k convex; X bounded, closed and convex player wants to minimize regret: R N N k=1 online subgradient method ( fk (x (k) ) ) min x X { N } f k (x) k=1 x (k+1) = π X ( x (k) t k g (k)), g (k) f k (x (k) ) with appropriate step size, can show R N O( N) Stochastic and online optimization 6 12

13 Connection to stochastic approximation a more general framework without stochastic assumptions suppose f k (x) def = f(x,ξ k ), and let x (N) = 1 N N k=1 x(k), then proof: E [ F( x (N) ) ] F 1 N E [ F( x (N) ) ] F 1 N E[R N] = 1 N N (E [ F(x (k) ) ] F ) k=1 N k=1 = 1 N E [ N ( E [ f(x (k),ξ k ) ] min x E [ f(x,ξ k ) ]) k=1 ( ) ] f(x (k),ξ k ) f(x,ξ k ) Stochastic and online optimization 6 13

14 Dual averaging method (Nesterov) initialize: choose x (1) R n and set s (0) = 0 iterate for k = 0,1,2, compute g (k) f k (x (k) ) and set 2. update: x (k+1) = argmin x X s (k) = s (k 1) +g (k) { s (k),x + β } k 2 x x(0) 2 2 = π X ( x (0) 1 β k s (k) ) choice of {β k }: e.g., β k = γ k with γ > 0 can also work with composite objectives: minimize x f(x)+ψ(x) Stochastic and online optimization 6 14

15 A soft support function for any β 0 and any x (0) X, define V β (s) = max x X { s,x x (0) β } 2 x x(0) 2 2 V β (s) 0 for any β 0; if β 2 β 1 > 0, then V β2 (s) V β1 (s) V β ( ) is convex and differentiable V β is Lipschitz continuous with constant 1/β V β (s 1 ) V β (s 2 ) 2 1 β s 1 s 2 2, s 1,s 2 R n therefore V β (s+δ) V β (s)+ δ, V β (s) + 1 2β δ 2 2 Stochastic and online optimization 6 15

16 lemma: let D = max x X x x (0) 2, then proof: max x X s,x x(0) βd2 2 +V β(s) { max x X s,x x(0) = max s,x x (0) : x X = max x X min β 0 min max β 0 x X max x X βd2 2 +V β(s) 1 2 x x(0) } 2 D2 { s,x x (0) + β 2 (D2 x x (0) 2 2) { s,x x (0) + β } 2 (D2 x x (0) 2 2) { s,x x (0) + β 2 (D2 x x (0) 2 2) } } Stochastic and online optimization 6 16

17 Convergence analysis V βk ( s (k) ) V βk 1 ( s (k) ) therefore V βk 1 ( s (k 1) )+ g (k), V βk 1 ( s (k 1) ) + 1 2β k 1 g (k) 2 2 = V βk 1 ( s (k 1) ) g (k),x (k) x (0) + 1 2β k 1 g (k) 2 2 g (k),x (k) x (0) V βk 1 ( s (k 1) ) V βk ( s (k) )+ 1 2β k 1 g (k) 2 2 summing over k = 2,...,N and choose x (0) = x (1) results in N g (k),x (k) x (0) V β1 ( s (1) ) V βn ( s (N) )+ k=1 N k=2 1 2β k 1 g (k) 2 2 Stochastic and online optimization 6 17

18 δ N def = max x X = = N g (k),x (k) x k=1 N g (k),x (k) x (0) +max k=1 x X N g (k),x (0) x k=1 N g (k),x (k) x (0) +max x X s(n),x x (0) k=1 V β1 ( s (1) ) V βn ( s (N) )+ 1 2β 1 g (1) 2 2+ β ND N 1 k=0 N k=2 N k=2 1 g (k) 2 2β 2+ β ND 2 k g (k) 2 2β 2+ β ND 2 k 1 2 G 2 2β k (for convenience, define β 0 = β 1 ) +V βn ( s (N) ) Stochastic and online optimization 6 18

19 by convexity, δ N def = max x X max x X = N k=1 k g (i),x (i) x i=1 N ( fk (x (k) ) f k (x) ) k=1 f k (x (k) ) min x X N f k (x) k=1 therefore, R N δ N, so def R N = N k=1 f k (x (k) ) min x X N f k (x) β ND 2 k=1 2 + N 1 k=0 G 2 2β k Stochastic and online optimization 6 19

20 choose parameters β k = γ k, k 1 and let β 0 = β 1, then N 1 k=0 G 2 = G2 2β k 2γ ( 1+ N 1 k=1 ) 1 k G2 2γ ( 2+ N 1 ) 1 dt t = G2 N γ finally, ) N R N (γ D2 2 + G2 γ upper bound is minimized by choosing γ = 2 G D which yields R N 2GD N Stochastic and online optimization 6 20

21 Minimizing finite average of convex functions problem minimize F(x) def = 1 n n f i (x) stochastic gradient method: pick i k {1,...,n} randomly and update i=1 x k+1 = x k η k f ik (x k ) two perspectives: stochastic optimization: a special case of minimizing E ξ f(x,ξ) deterministic optimization: a randomized incremental gradient method for a structured convex problem Stochastic and online optimization 6 21

22 Mind the problem structure stochastic optimization perspective: a general method used to solve a special case complexity theory: O( 1 ) or O( 1 ǫ 2 ǫ ) with strong convexity deterministic optimization perspective: a special method for solving a structured problem sanity test: should at least beat full gradient methods: complexity O(n L µ log 1 ǫ ) or O(n L µ log 1 ǫ ) recent progresses: SAG and SVRG by exploiting finite average structure Stochastic and online optimization 6 22

23 Stochastic average gradient (SAG) SAG method (Le Roux, Schmidt, Bach 2012) x k+1 = x k α k n n i=1 g (i) k where g (i) { k = fi (x k ) if i = i k g (i) otherwise k 1 a randomized variant of incremental aggregated gradient (IAG) of Blatt, Hero, & Gauchman (2007) complexity (# component gradient evaluations): O(max{n, L µ }log 1 ǫ ) cf. full gradient method: O(n L µ log 1 ǫ ), and stochastic gradient: O(1 ǫ ) need to store most recent gradient of each component, but can be avoided for some structured problems Stochastic and online optimization 6 23

24 Stochastic variance reduced gradient (SVRG) SVRG (Johnson & Zhang 2013, Mahdavi, Zhang & Jin 2013) x k+1 = x k η( f ik (x k ) f ik ( x)+ F( x)) and update x periodically (every few passes) still a stochastic gradient method E[ f ik (x k ) f ik ( x)+ F( x)] = F(x k ) F( x)+ F( x) = F(x k ) expected update direction is the same as Ef ik (x k ) variance can be diminishing if x updated periodically ( ) ( ) complexity: O (n+ L µ )log 1 ǫ, cf. SAG: O max{n, L µ }log 1 ǫ Stochastic and online optimization 6 24

25 Stochastic variance reduced gradient (SVRG) computational cost per iteration: unlike SAG, no need to store gradients for each components need to compute two gradients at each iteration, and also full gradient periodically (no more than three per epoch) for many structured problems, two gradients at each iteration can be reduced to only one intuition of variance reduction f ik ( x) f ik (x k ) x F( x) f ik ( x) x k F( x) f ik ( x) F( x) F(x k ) Stochastic and online optimization 6 25

26 Problem statement and assumptions assumptions: minimize x R d F(x) = 1 n n f i (x) i=1 each f i (x), for i = 1,...,n, is convex each f i (x) is smooth with Lipschitz constant L f i (x) f i (y) L x y (which implies that F(x) also has Lipschitz constant L) F(x) strongly convex: for all x,y R d, F(y) F(x)+ F(x) T (y x)+ µ y x 2 2 Stochastic and online optimization 6 26

27 SVRG method input: x 0, η, m iterate: for s = 1,2,... end x = x s 1 ṽ = F( x) x 0 = x iterate: for k = 1,2,...,m end pick i k {1,...,n} uniformly at random x k = x k 1 η ( f ik (x k 1 ) f ik ( x)+ṽ ) set x s = 1 m m k=1 x k 1 Stochastic and online optimization 6 27

28 Convergence analysis of SVRG theorem: suppose 0 < η 1/2L and m sufficiently large so that ρ = 1 µη(1 2Lη)m + 2Lη 1 2Lη < 1 then we have geometric convergence in expectation: EF( x s ) F(x ) ρ s [F( x 0 ) F(x )] more concretely, if η = θ/l, then ρ = L/µ θ(1 2θ)m + 2θ 1 2θ choosing θ = 0.1 and m = 50(L/µ) results in ρ = 1/2 overall complexity: O (( L µ +n ) log ( ) ) 1 ǫ Stochastic and online optimization 6 28

29 Proof let g k = f ik (x x 1 ) f ik ( x)+ F( x), then x k = x k 1 ηg k, and E ik [g k ] = F(x k 1 ) similar as in classical analysis of stochastic gradient methods E x k x 2 = E x k 1 ηg k x 2 = x k 1 x 2 2η(x k 1 x ) T E[g k ]+η 2 E[ g k 2 ] = x k 1 x 2 2η(x k 1 x ) T F(x k 1 )+η 2 E[ g k 2 ] x k 1 x 2 2η(F(x k 1 ) F(x ))+η 2 E[ g k 2 ] then need to bound E[ g k 2 ] carefully using the finite average structure Stochastic and online optimization 6 29

30 by smoothness of f i (x), fi (x) f i (x ) 2 2L [ f i (x) f i (x ) f i (x ) T (x x ) ] summing above inequalities over i = 1,...,n and using F(x ) = 0, 1 n n fi (x) f i (x ) 2 2L [ F(x) F(x )] i=1 E g k 2 = E fik (x k 1 ) f ik (x )+ f ik (x ) f ik ( x)+ F( x) 2 2E fik (x k 1 ) f ik (x ) 2 +2E fik ( x) f ik (x ) F( x) 2 = 2E fik (x k 1 ) f ik (x ) 2 +2E f ik ( x) f ik (x ) E[ f ik ( x) f ik (x )] 2 2E fik (x k 1 ) f ik (x ) 2 +2E fik ( x) f ik (x ) 2 4L [ F(x k 1 ) F(x )+F( x) F(x ) ] Stochastic and online optimization 6 30

31 continue derivation on page 6 29 E x k x 2 x k 1 x 2 2η(1 2Lη)[F(x k 1 ) F(x )]+4Lη 2[ F( x) F(x ) ] summing over k = 1,...,m, and take expectation w.r.t. whole history E x m x 2 +2η(1 2Lη) m 1 k=0 E x 0 x 2 +4Lmη 2 E[F(x 0 ) F(x )] E[F(x k ) F(x )] 2 µ E[F(x 0) F(x )]+4Lmη 2 E[F(x 0 ) F(x )] therefore, for each stage s E[F( x s ) F(x )] 1 m m 1 k=0 E[F(x k ) F(x )] 1 2η(1 2Lη)m ( ) 2 µ +4Lmη2 E[F(x 0 ) F(x )] Stochastic and online optimization 6 31

32 Numerical experiments binary classification: (a 1,b 1 ),...,(a n,b n ) with a i R d, b i {+1, 1} regularized logistic regression 1 minimize x R d n n i=1 log ( 1+exp( b i a T i x) ) + λ 2 2 x 2 2+λ 1 x 1 nonsmooth term x 1 handled by proximal gradient methods data sets and characteristics: data sets n d λ 2 λ 1 rcv1 20,242 47, covertype 581, sido0 12,678 4, Stochastic and online optimization 6 32

33 10 0 objective gap: P(xk) P η = 1.0/L η = 0.1/L η = 0.01/L number of effective passes SVRG on rcv1 dataset: varying step size η with m = 2n Stochastic and online optimization 6 33

34 objective gap: P(xk) P m = 1n m = 2n m = 4n m = 6n m = 8n number of effective passes SVRG on rcv1 dataset with λ 2 = 10 4 and stepsize η = 0.1/L: varying the period m between full gradient evaluations Stochastic and online optimization 6 34

35 objective gap: P(xk) P m = 1n m = 2n m = 4n m = 6n m = 8n number of effective passes SVRG on rcv1 dataset with λ 2 = 10 5 and stepsize η = 0.1/L: varying the period m between full gradient evaluations Stochastic and online optimization 6 35

36 10 0 objective gap: P(xk) P Prox-SG RDA Prox-FG Prox-AFG Prox-SAG Prox-SVRG Prox-SDCA number of effective passes comparison with related algorithms on rcv1 datasets Stochastic and online optimization 6 36

37 10 0 objective gap: P(xk) P Prox-FG Prox-AFG Prox-SG RDA Prox-SAG Prox-SDCA Prox-SVRG Prox-SVRG number of effective passes comparison with related algorithms on covertype datasets Stochastic and online optimization 6 37

38 objective gap: P(xk) P Prox-FG Prox-AFG Prox-SG RDA Prox-SAG Prox-SDCA Prox-SVRG Prox-SVRG number of effective passes comparison with related algorithms on sido0 datasets Stochastic and online optimization 6 38

39 References A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization (2009) Yu. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming (2009) L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research (2010) N. Le Roux, M. Schmidt and F. Bach, A stochastic gradient method with an exponential convergence rate for strongly convex optimization with finite training sets, NIPS (2012) R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction NIPS (2013) L. Xiao and T. Zhang, A proximal stochastic gradient method with progressive variance reduction, manuscript (2014) Stochastic and online optimization 6 39

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x),

1. Introduction. We consider the problem of minimizing the sum of two convex functions: = F (x)+r(x)}, f i (x), SIAM J. OPTIM. Vol. 24, No. 4, pp. 2057 2075 c 204 Society for Industrial and Applied Mathematics A PROXIMAL STOCHASTIC GRADIENT METHOD WITH PROGRESSIVE VARIANCE REDUCTION LIN XIAO AND TONG ZHANG Abstract.

More information

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

A Proximal Stochastic Gradient Method with Progressive Variance Reduction A Proximal Stochastic Gradient Method with Progressive Variance Reduction arxiv:403.4699v [math.oc] 9 Mar 204 Lin Xiao Tong Zhang March 8, 204 Abstract We consider the problem of minimizing the sum of

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Randomized Smoothing Techniques in Optimization

Randomized Smoothing Techniques in Optimization Randomized Smoothing Techniques in Optimization John Duchi Based on joint work with Peter Bartlett, Michael Jordan, Martin Wainwright, Andre Wibisono Stanford University Information Systems Laboratory

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

arxiv: v2 [stat.ml] 16 Jun 2015

arxiv: v2 [stat.ml] 16 Jun 2015 Semi-Stochastic Gradient Descent Methods Jakub Konečný Peter Richtárik arxiv:1312.1666v2 [stat.ml] 16 Jun 2015 School of Mathematics University of Edinburgh United Kingdom June 15, 2015 (first version:

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Minimizing finite sums with the stochastic average gradient

Minimizing finite sums with the stochastic average gradient Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Large-scale Machine Learning and Optimization

Large-scale Machine Learning and Optimization 1/84 Large-scale Machine Learning and Optimization Zaiwen Wen http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Dimitris Papailiopoulos and Shiqian Ma lecture notes

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

How hard is this function to optimize?

How hard is this function to optimize? How hard is this function to optimize? John Duchi Based on joint work with Sabyasachi Chatterjee, John Lafferty, Yuancheng Zhu Stanford University West Coast Optimization Rumble October 2016 Problem minimize

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Non-Smooth, Non-Finite, and Non-Convex Optimization

Non-Smooth, Non-Finite, and Non-Convex Optimization Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand,

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand, 1 Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server Arda Aytein Student Member, IEEE, Hamid Reza Feyzmahdavian, Student Member, IEEE, and Miael Johansson Member,

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

arxiv: v2 [cs.lg] 4 Oct 2016

arxiv: v2 [cs.lg] 4 Oct 2016 Appearing in 2016 IEEE International Conference on Data Mining (ICDM), Barcelona. Efficient Distributed SGD with Variance Reduction Soham De and Tom Goldstein Department of Computer Science, University

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

Fast proximal gradient methods

Fast proximal gradient methods L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information