Minimizing Finite Sums with the Stochastic Average Gradient Algorithm


1 Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia

2 Context: Machine Learning for Big Data
Large-scale machine learning: large N, large P.
N: number of observations (inputs); P: dimension of each observation.
Regularized empirical risk minimization: find x solution of
min_{x ∈ R^P} (1/N) ∑_{i=1}^N l(x^T a_i) + λ r(x)
(data-fitting term + regularizer).

3 Context: Machine Learning for Big Data
Large-scale machine learning: large N, large P.
N: number of observations (inputs); P: dimension of each observation.
Regularized empirical risk minimization: find x solution of
min_{x ∈ R^P} (1/N) ∑_{i=1}^N l(x^T a_i) + λ r(x)
(data-fitting term + regularizer).
Applications to any data-oriented field: vision, bioinformatics, speech, natural language, web.
Main practical challenges:
Choosing the regularizer r and the data-fitting term l.
Designing/learning good features a_i.
Efficiently solving the problem when N or P are very large.

4 This talk: Big-N Problems
We want to minimize the sum of a finite set of smooth functions:
min_{x ∈ R^P} g(x) := (1/N) ∑_{i=1}^N f_i(x).

5 This talk: Big-N Problems
We want to minimize the sum of a finite set of smooth functions:
min_{x ∈ R^P} g(x) := (1/N) ∑_{i=1}^N f_i(x).
We are interested in cases where N is very large.

6 This talk: Big-N Problems
We want to minimize the sum of a finite set of smooth functions:
min_{x ∈ R^P} g(x) := (1/N) ∑_{i=1}^N f_i(x).
We are interested in cases where N is very large.
We will focus on strongly-convex functions g.
The simplest example is l2-regularized least-squares,
f_i(x) := (a_i^T x - b_i)^2 + (λ/2) ||x||^2.
Other examples include any l2-regularized convex loss: logistic regression, Huber regression, smooth SVMs, CRFs, etc.
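To make the notation concrete, here is a minimal Python sketch of the l2-regularized least-squares example above; the function and variable names are illustrative, and only the formulas on the slide are assumed.

```python
import numpy as np

def f_i(x, a_i, b_i, lam):
    # One term of the finite sum: squared error on example i plus its share of the l2 regularizer.
    return (a_i @ x - b_i) ** 2 + (lam / 2.0) * np.dot(x, x)

def grad_f_i(x, a_i, b_i, lam):
    # Gradient of f_i: 2 (a_i^T x - b_i) a_i + lam x.
    return 2.0 * (a_i @ x - b_i) * a_i + lam * x

def g(x, A, b, lam):
    # g(x) = (1/N) sum_i f_i(x); the regularizer is identical in every term, so it averages to itself.
    return float(np.mean((A @ x - b) ** 2) + (lam / 2.0) * np.dot(x, x))
```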

7 Stochastic vs. Deterministic Gradient Methods
We consider minimizing g(x) = (1/N) ∑_{i=1}^N f_i(x).

8 Stochastic vs. Deterministic Gradient Methods
We consider minimizing g(x) = (1/N) ∑_{i=1}^N f_i(x).
Deterministic gradient method [Cauchy, 1847]:
x_{t+1} = x_t - α_t g'(x_t) = x_t - (α_t/N) ∑_{i=1}^N f_i'(x_t).
Linear convergence rate: O(ρ^t). Iteration cost is linear in N.
Fancier methods exist, but are still O(N).

9 Stochastic vs. Deterministic Gradient Methods
We consider minimizing g(x) = (1/N) ∑_{i=1}^N f_i(x).
Deterministic gradient method [Cauchy, 1847]:
x_{t+1} = x_t - α_t g'(x_t) = x_t - (α_t/N) ∑_{i=1}^N f_i'(x_t).
Linear convergence rate: O(ρ^t). Iteration cost is linear in N.
Fancier methods exist, but are still O(N).
Stochastic gradient method [Robbins & Monro, 1951]:
Random selection of i(t) from {1, 2, ..., N},
x_{t+1} = x_t - α_t f_{i(t)}'(x_t).
Iteration cost is independent of N. Sublinear convergence rate: O(1/t).

10 Stochastic vs. Deterministic Gradient Methods
Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = l(y_i, θ^T Φ(x_i)) + (µ/2) ||θ||^2.
Batch gradient descent: θ_t = θ_{t-1} - γ_t g'(θ_{t-1}) = θ_{t-1} - (γ_t/n) ∑_{i=1}^n f_i'(θ_{t-1}).
Stochastic gradient descent: θ_t = θ_{t-1} - γ_t f_{i(t)}'(θ_{t-1}).

11 Motivation for New Methods
The FG method has O(N) cost with an O(ρ^t) rate.
The SG method has O(1) cost with an O(1/t) rate.
Goal = best of both worlds: a linear rate with an O(1) iteration cost.
[Figure: log(excess cost) vs. time for the stochastic and deterministic methods.]

12 Motivation for New Methods
The FG method has O(N) cost with an O(ρ^t) rate.
The SG method has O(1) cost with an O(1/t) rate.
Goal = best of both worlds: a linear rate with an O(1) iteration cost.
[Figure: log(excess cost) vs. time, with a hybrid curve added to the stochastic and deterministic ones.]

13 Motivation for New Methods
The FG method has O(N) cost with an O(ρ^t) rate.
The SG method has O(1) cost with an O(1/t) rate.
Goal = best of both worlds: a linear rate with an O(1) iteration cost.
[Figure: log(excess cost) vs. time for the stochastic, deterministic, and hybrid methods.]
Goal is O(1) cost with an O(ρ^t) rate.

14 Prior Work on Speeding up SG Methods
A variety of methods have been proposed to speed up SG methods:
Step-size strategies, momentum, gradient/iterate averaging: Polyak & Juditsky (1992), Tseng (1998), Kushner & Yin (2003), Nesterov (2009), Xiao (2010), Hazan & Kale (2011), Rakhlin et al. (2012).
Stochastic versions of accelerated and Newton-like methods: Bordes et al. (2009), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010), Duchi et al. (2011).

15 Prior Work on Speeding up SG Methods
A variety of methods have been proposed to speed up SG methods:
Step-size strategies, momentum, gradient/iterate averaging: Polyak & Juditsky (1992), Tseng (1998), Kushner & Yin (2003), Nesterov (2009), Xiao (2010), Hazan & Kale (2011), Rakhlin et al. (2012).
Stochastic versions of accelerated and Newton-like methods: Bordes et al. (2009), Sunehag et al. (2009), Ghadimi and Lan (2010), Martens (2010), Xiao (2010), Duchi et al. (2011).
None of these methods improve on the O(1/t) rate.

16 Prior Work on Speeding up SG Methods
Existing linear convergence results:
Constant step-size SG, accelerated SG: Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000).
Linear convergence up to a fixed tolerance: O(ρ^t) + O(α).
Hybrid methods, incremental average gradient: Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012).
Linear rate, but iterations make full passes through the data.

17 Prior Work on Speeding up SG Methods
Existing linear convergence results:
Constant step-size SG, accelerated SG: Kesten (1958), Delyon and Juditsky (1993), Nedic and Bertsekas (2000).
Linear convergence up to a fixed tolerance: O(ρ^t) + O(α).
Hybrid methods, incremental average gradient: Bertsekas (1997), Blatt et al. (2007), Friedlander and Schmidt (2012).
Linear rate, but iterations make full passes through the data.
Special problem classes: Collins et al. (2008), Strohmer & Vershynin (2009), Schmidt and Le Roux (2012), Shalev-Shwartz and Zhang (2012).
Linear rate, but a limited choice for the f_i's.

18 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.

19 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost?

20 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost? YES!

21 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost? YES!
The stochastic average gradient (SAG) algorithm:
Randomly select i(t) from {1, 2, ..., N} and compute f_{i(t)}'(x_t).
x_{t+1} = x_t - (α_t/N) ∑_{i=1}^N f_i'(x_t)

22 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost? YES!
The stochastic average gradient (SAG) algorithm:
Randomly select i(t) from {1, 2, ..., N} and compute f_{i(t)}'(x_t).
x_{t+1} = x_t - (α_t/N) ∑_{i=1}^N y_i^t

23 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost? YES!
The stochastic average gradient (SAG) algorithm:
Randomly select i(t) from {1, 2, ..., N} and compute f_{i(t)}'(x_t).
x_{t+1} = x_t - (α_t/N) ∑_{i=1}^N y_i^t
Memory: y_i^t = f_i'(x_t) from the last iteration t where i was selected.

24 Stochastic Average Gradient
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and g is µ-strongly convex.
Is it possible to have an O(ρ^t) rate with an O(1) cost? YES!
The stochastic average gradient (SAG) algorithm:
Randomly select i(t) from {1, 2, ..., N} and compute f_{i(t)}'(x_t).
x_{t+1} = x_t - (α_t/N) ∑_{i=1}^N y_i^t
Memory: y_i^t = f_i'(x_t) from the last iteration t where i was selected.
Assumes that the gradients of the other examples don't change.
This assumption becomes accurate as ||x_{t+1} - x_t|| → 0.
Stochastic variant of the incremental aggregated gradient (IAG) method [Blatt et al., 2007].
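A minimal Python sketch of the SAG update described above, assuming a user-supplied grad_f(i, x) that returns f_i'(x); the function name and arguments are illustrative, not the authors' code.

```python
import numpy as np

def sag(grad_f, x0, N, alpha, iters, rng=None):
    # Keep the last gradient y[i] seen for each example and step along the
    # average of the stored gradients: x <- x - (alpha/N) * sum_i y[i].
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    y = np.zeros((N, x.size))   # y[i] = gradient of f_i at the last iterate where i was sampled
    d = np.zeros_like(x)        # running sum of the y[i]
    for _ in range(iters):
        i = rng.integers(N)
        g_i = grad_f(i, x)
        d += g_i - y[i]         # replace the old contribution of example i
        y[i] = g_i
        x -= (alpha / N) * d
    return x
```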

25 Convergence Rate of SAG: Attempt 1
Proposition 1. With α_t = 1/(2NL), the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - µ/(8LN))^t C.

26 Convergence Rate of SAG: Attempt 1
Proposition 1. With α_t = 1/(2NL), the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - µ/(8LN))^t C.
Convergence rate of O(ρ^t) with a cost of O(1). Is this useful?!?

27 Convergence Rate of SAG: Attempt 1
Proposition 1. With α_t = 1/(2NL), the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - µ/(8LN))^t C.
Convergence rate of O(ρ^t) with a cost of O(1). Is this useful?!?
This rate is very slow: performance similar to the cyclic method.

28 Convergence Rate of SAG: Attempt 2
Proposition 2. With α_t ∈ [1/(2Nµ), 1/(16L)] and N ≥ 8L/µ, the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - 1/(8N))^t C.

29 Convergence Rate of SAG: Attempt 2
Proposition 2. With α_t ∈ [1/(2Nµ), 1/(16L)] and N ≥ 8L/µ, the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - 1/(8N))^t C.
Much bigger step-sizes than 1/(2NL): 1/(2Nµ) since µ << L, and 1/(16L) since L << NL (step-sizes that cause the cyclic algorithm to diverge).

30 Convergence Rate of SAG: Attempt 2
Proposition 2. With α_t ∈ [1/(2Nµ), 1/(16L)] and N ≥ 8L/µ, the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - 1/(8N))^t C.
Much bigger step-sizes than 1/(2NL): 1/(2Nµ) since µ << L, and 1/(16L) since L << NL (step-sizes that cause the cyclic algorithm to diverge).
Gives a constant non-trivial reduction per pass:
(1 - 1/(8N))^N ≤ exp(-1/8) ≈ 0.88.
The condition N ≥ 8L/µ has been called the big data condition.

31 Convergence Rate of SAG: Attempt 3
Theorem. With α_t = 1/(16L) the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - min{µ/(16L), 1/(8N)})^t C.

32 Convergence Rate of SAG: Attempt 3
Theorem. With α_t = 1/(16L) the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - min{µ/(16L), 1/(8N)})^t C.
This rate is very fast:
Well-conditioned problems: constant non-trivial reduction per pass.

33 Convergence Rate of SAG: Attempt 3
Theorem. With α_t = 1/(16L) the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - min{µ/(16L), 1/(8N)})^t C.
This rate is very fast:
Well-conditioned problems: constant non-trivial reduction per pass.
Badly-conditioned problems: almost the same as the deterministic method,
g(x_t) - g(x*) ≤ (1 - µ/L)^{2t} C with α_t = 1/L,
but SAG is N times faster.

34 Convergence Rate of SAG: Attempt 3
Theorem. With α_t = 1/(16L) the SAG iterations satisfy
E[g(x_t) - g(x*)] ≤ (1 - min{µ/(16L), 1/(8N)})^t C.
This rate is very fast:
Well-conditioned problems: constant non-trivial reduction per pass.
Badly-conditioned problems: almost the same as the deterministic method,
g(x_t) - g(x*) ≤ (1 - µ/L)^{2t} C with α_t = 1/L,
but SAG is N times faster.
Still get a linear rate for any α_t ≤ 1/(16L).

35 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:

36 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2.

37 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)).

38 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N.

39 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.

40 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).

41 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).
Number of f_i evaluations to reach ε:

42 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).
Number of f_i evaluations to reach ε:
Stochastic: O((L/µ)(1/ε)).

43 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).
Number of f_i evaluations to reach ε:
Stochastic: O((L/µ)(1/ε)).
Gradient: O(N (L/µ) log(1/ε)).

44 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).
Number of f_i evaluations to reach ε:
Stochastic: O((L/µ)(1/ε)).
Gradient: O(N (L/µ) log(1/ε)).
Accelerated: O(N √(L/µ) log(1/ε)).

45 Rate of Convergence Comparison
Assume that N = , L = 0.25, µ = 1/N:
Gradient method has rate ((L - µ)/(L + µ))^2. Accelerated gradient method has rate (1 - √(µ/L)). SAG (N iterations) has rate (1 - min{µ/(16L), 1/(8N)})^N. Fastest possible first-order method: ((√L - √µ)/(√L + √µ))^2.
SAG beats two lower bounds: the stochastic gradient bound (of O(1/t)) and the deterministic gradient bound (for typical L, µ, and N).
Number of f_i evaluations to reach ε:
Stochastic: O((L/µ)(1/ε)).
Gradient: O(N (L/µ) log(1/ε)).
Accelerated: O(N √(L/µ) log(1/ε)).
SAG: O(max{N, L/µ} log(1/ε)).
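A small helper (hypothetical names) that computes the four contraction factors above for given N, L, and µ; the N used in the example call is an assumed placeholder, since the slide's specific value is not shown here.

```python
import math

def rates(N, L, mu):
    # Per-iteration contraction factors from the slides; SAG is raised to the
    # power N so that it is compared per pass over the data.
    gradient    = ((L - mu) / (L + mu)) ** 2
    accelerated = 1 - math.sqrt(mu / L)
    sag_pass    = (1 - min(mu / (16 * L), 1 / (8 * N))) ** N
    lower_bound = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2
    return gradient, accelerated, sag_pass, lower_bound

# Example values; N here is only a placeholder.
print(rates(N=1_000_000, L=0.25, mu=1 / 1_000_000))
```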

46 Proof Technique: Lyapunov Function
We define a Lyapunov function of the form
L(θ^t) = 2h [g(x^t + d e^T y^t) - g(x*)] + (θ^t - θ*)^T [A B; B^T C] (θ^t - θ*),
with θ^t = (y_1^t, ..., y_N^t, x^t), θ* = (f_1'(x*), ..., f_N'(x*), x*), e = (I, ..., I)^T,
A = a_1 e e^T + a_2 I, B = b e, C = c I.

47 Proof Technique: Lyapunov Function
We define a Lyapunov function of the form
L(θ^t) = 2h [g(x^t + d e^T y^t) - g(x*)] + (θ^t - θ*)^T [A B; B^T C] (θ^t - θ*),
with θ^t = (y_1^t, ..., y_N^t, x^t), θ* = (f_1'(x*), ..., f_N'(x*), x*), e = (I, ..., I)^T,
A = a_1 e e^T + a_2 I, B = b e, C = c I.
The proof involves finding {α, a_1, a_2, b, c, d, h, δ, γ} such that
E[L(θ^t) | F_{t-1}] ≤ (1 - δ) L(θ^{t-1}) and L(θ^t) ≥ γ [g(x^t) - g(x*)].
Apply this recursively; the initial Lyapunov function gives the constant.

48 Constants in Convergence Rate
What are the constants?

49 Constants in Convergence Rate
What are the constants?
If we initialize with y_i^0 = 0 we have
C = [g(x^0) - g(x*)] + (4L/N) ||x^0 - x*||^2 + σ^2/(16L).

50 Constants in Convergence Rate
What are the constants?
If we initialize with y_i^0 = 0 we have
C = [g(x^0) - g(x*)] + (4L/N) ||x^0 - x*||^2 + σ^2/(16L).
If we initialize with y_i^0 = f_i'(x^0) - g'(x^0) we have
C = (3/2) [g(x^0) - g(x*)] + (4L/N) ||x^0 - x*||^2.

51 Constants in Convergence Rate
What are the constants?
If we initialize with y_i^0 = 0 we have
C = [g(x^0) - g(x*)] + (4L/N) ||x^0 - x*||^2 + σ^2/(16L).
If we initialize with y_i^0 = f_i'(x^0) - g'(x^0) we have
C = (3/2) [g(x^0) - g(x*)] + (4L/N) ||x^0 - x*||^2.
If we initialize with N stochastic gradient iterations,
[g(x^0) - g(x*)] = O(1/N) and ||x^0 - x*||^2 = O(1/N).

52 Convergence Rate in Convex Case
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and some minimizer x* exists.

53 Convergence Rate in Convex Case
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and some minimizer x* exists.
Theorem. With α_t = 1/(16L) the SAG iterations satisfy E[g(x_t) - g(x*)] = O(1/t).
Faster than the SG lower bound of O(1/√t).

54 Convergence Rate in Convex Case
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and some minimizer x* exists.
Theorem. With α_t = 1/(16L) the SAG iterations satisfy E[g(x_t) - g(x*)] = O(1/t).
Faster than the SG lower bound of O(1/√t).
Same algorithm and step-size as the strongly-convex case:
The algorithm is adaptive to strong-convexity.
Faster convergence rate if µ is locally bigger around x*.

55 Convergence Rate in Convex Case
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and some minimizer x* exists.
Theorem. With α_t = 1/(16L) the SAG iterations satisfy E[g(x_t) - g(x*)] = O(1/t).
Faster than the SG lower bound of O(1/√t).
Same algorithm and step-size as the strongly-convex case:
The algorithm is adaptive to strong-convexity.
Faster convergence rate if µ is locally bigger around x*.
The same algorithm could be used in the non-convex case.

56 Convergence Rate in Convex Case
Assume only that: each f_i is convex, each f_i' is Lipschitz continuous with constant L, and some minimizer x* exists.
Theorem. With α_t = 1/(16L) the SAG iterations satisfy E[g(x_t) - g(x*)] = O(1/t).
Faster than the SG lower bound of O(1/√t).
Same algorithm and step-size as the strongly-convex case:
The algorithm is adaptive to strong-convexity.
Faster convergence rate if µ is locally bigger around x*.
The same algorithm could be used in the non-convex case.
Contrast with stochastic dual coordinate ascent:
Requires an explicit strongly-convex regularizer.
Not adaptive to µ, and does not allow µ = 0.

57 Comparing FG and SG Methods
Datasets: quantum (n = 50000, p = 78) and rcv1 (n = , p = 47236).
[Plots: objective minus optimum vs. effective passes for AFG, ASG, IAG, SG, and L-BFGS.]
Comparison of competitive deterministic and stochastic methods.

58 SAG Compared to FG and SG Methods
Datasets: quantum (n = 50000, p = 78) and rcv1 (n = , p = 47236).
[Plots: objective minus optimum vs. effective passes for AFG, ASG, IAG, SG, L-BFGS, and SAG-LS.]
SAG starts fast and stays fast.

59 SAG Compared to Coordinate-Based Methods
Datasets: quantum (n = 50000, p = 78) and rcv1 (n = , p = 47236).
[Plots: objective minus optimum vs. effective passes for DCA, PCD-L, ASG, L-BFGS, and SAG-LS.]
PCD/DCA are similar on some problems, much worse on others.

60 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.

61 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N?

62 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory?

63 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data?

64 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data? How should we set the step size?

65 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data? How should we set the step size? When should we stop?

66 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data? How should we set the step size? When should we stop? Can we use mini-batches?

67 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data? How should we set the step size? When should we stop? Can we use mini-batches? Can we handle constraints or non-smooth problems?

68 SAG Implementation Issues
while(1): Sample i from {1, 2, ..., N}; compute f_i'(x); d = d - y_i + f_i'(x); y_i = f_i'(x); x = x - (α/N) d.
Issues: Should we normalize by N? Can we reduce the memory? Can we handle sparse data? How should we set the step size? When should we stop? Can we use mini-batches? Can we handle constraints or non-smooth problems? Should we shuffle the data?

69 Implementation Issues: Normalization
Should we normalize by N in the early iterations?
The parameter update: x = x - (α/N) d.

70 Implementation Issues: Normalization
Should we normalize by N in the early iterations?
The parameter update: x = x - (α/M) d.
We normalize by the number of examples seen (M).
Better performance in the early iterations; similar to doing one pass of SG.
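As a hedged illustration of this normalization, here is a small Python variant of the earlier SAG sketch that divides by the number of distinct examples seen so far; the bookkeeping is one reasonable way to implement it, not the authors' exact code.

```python
import numpy as np

def sag_normalized(grad_f, x0, N, alpha, iters, rng=None):
    # Same as the basic SAG loop, but divide by M = number of distinct
    # examples sampled so far, which behaves like one pass of SG early on.
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    y = np.zeros((N, x.size))
    d = np.zeros_like(x)
    seen = np.zeros(N, dtype=bool)
    M = 0
    for _ in range(iters):
        i = rng.integers(N)
        if not seen[i]:
            seen[i] = True
            M += 1
        g_i = grad_f(i, x)
        d += g_i - y[i]
        y[i] = g_i
        x -= (alpha / M) * d    # normalize by examples seen, not by N
    return x
```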

71 Implementation Issues: Memory Requirements
Can we reduce the memory?
The memory update for f_i(a_i^T x):
Compute f_i'(a_i^T x); d = d + f_i'(a_i^T x) - y_i; y_i = f_i'(a_i^T x).

72 Implementation Issues: Memory Requirements
Can we reduce the memory?
The memory update for f_i(a_i^T x):
Compute f_i'(δ), with δ = a_i^T x; d = d + a_i (f_i'(δ) - y_i); y_i = f_i'(δ).
Uses that ∇f_i(a_i^T x) = a_i f_i'(δ).

73 Implementation Issues: Memory Requirements
Can we reduce the memory?
The memory update for f_i(a_i^T x):
Compute f_i'(δ), with δ = a_i^T x; d = d + a_i (f_i'(δ) - y_i); y_i = f_i'(δ).
Uses that ∇f_i(a_i^T x) = a_i f_i'(δ).
Only store the scalars f_i'(δ): reduces the memory from O(NP) to O(N).
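A minimal sketch, assuming linear-prediction losses f_i(x) = φ_i(a_i^T x) and a user-supplied loss_grad(i, delta) returning φ_i'(δ); storing only these scalars gives the O(N) memory described above. Names are illustrative.

```python
import numpy as np

def sag_linear(loss_grad, A, x0, alpha, iters, rng=None):
    # SAG for f_i(x) = phi_i(a_i^T x): grad f_i(x) = phi_i'(a_i^T x) * a_i,
    # so only the scalar phi_i'(delta) needs to be remembered per example.
    rng = np.random.default_rng() if rng is None else rng
    N, P = A.shape
    x = x0.copy()
    y = np.zeros(N)          # stored scalars phi_i'(delta)
    d = np.zeros(P)          # running sum of a_i * y[i]
    for _ in range(iters):
        i = rng.integers(N)
        delta = A[i] @ x
        s = loss_grad(i, delta)
        d += A[i] * (s - y[i])
        y[i] = s
        x -= (alpha / N) * d
    return x
```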

74 Implementation Issues: Memory Requirements
Can we reduce the memory in general?
We can re-write the SAG iteration as:
x_{t+1} = x_t - (α_t/N) [ f_i'(x_t) - f_i'(x_i) + ∑_{j=1}^N f_j'(x_j) ],
where x_j denotes the iterate at which example j was last selected.

75 Implementation Issues: Memory Requirements
Can we reduce the memory in general?
We can re-write the SAG iteration as:
x_{t+1} = x_t - (α_t/N) [ f_i'(x_t) - f_i'(x_i) + ∑_{j=1}^N f_j'(x_j) ],
where x_j denotes the iterate at which example j was last selected.
The SVRG/mixedGrad method ("memory-free" methods):
x_{t+1} = x_t - α [ f_i'(x_t) - f_i'(x̄) + (1/N) ∑_{j=1}^N f_j'(x̄) ],
where we occasionally update x̄.
[Mahdavi et al., 2013, Johnson and Zhang, 2013, Zhang et al., 2013, Konecny and Richtarik, 2013, Xiao and Zhang, 2014]

76 Implementation Issues: Memory Requirements
Can we reduce the memory in general?
We can re-write the SAG iteration as:
x_{t+1} = x_t - (α_t/N) [ f_i'(x_t) - f_i'(x_i) + ∑_{j=1}^N f_j'(x_j) ],
where x_j denotes the iterate at which example j was last selected.
The SVRG/mixedGrad method ("memory-free" methods):
x_{t+1} = x_t - α [ f_i'(x_t) - f_i'(x̄) + (1/N) ∑_{j=1}^N f_j'(x̄) ],
where we occasionally update x̄.
[Mahdavi et al., 2013, Johnson and Zhang, 2013, Zhang et al., 2013, Konecny and Richtarik, 2013, Xiao and Zhang, 2014]
Gets rid of the memory, at the cost of two f_i' evaluations per iteration.
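For comparison, here is a hedged Python sketch of the memory-free SVRG-style update above; grad_f(i, x) is assumed to return f_i'(x), and the fixed inner-loop length for refreshing the snapshot x̄ is a simplifying assumption rather than a choice prescribed by the cited papers.

```python
import numpy as np

def svrg(grad_f, x0, N, alpha, outer_iters, inner_iters, rng=None):
    # Occasionally compute the full gradient at a snapshot x_bar and use it to
    # correct each stochastic gradient, avoiding the O(N)-vector memory of SAG.
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(outer_iters):
        x_bar = x.copy()
        full_grad = sum(grad_f(j, x_bar) for j in range(N)) / N   # (1/N) sum_j f_j'(x_bar)
        for _ in range(inner_iters):
            i = rng.integers(N)
            x -= alpha * (grad_f(i, x) - grad_f(i, x_bar) + full_grad)
    return x
```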

77 Implementation Issues: Sparsity
Can we handle sparse data?
The parameter update for each variable j: x_j = x_j - (α/M) d_j.

78 Implementation Issues: Sparsity
Can we handle sparse data?
The parameter update for each variable j: x_j = x_j - k (α/M) d_j.
For sparse data, d_j is typically constant; apply the previous k updates only when it changes.

79 Implementation Issues: Sparsity
Can we handle sparse data?
The parameter update for each variable j: x_j = x_j - k (α/M) d_j.
For sparse data, d_j is typically constant; apply the previous k updates only when it changes.
Reduces the iteration cost from O(P) to O(||f_i'(x)||_0).
Standard tricks allow l2-regularization and l1-regularization.
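A minimal sketch of this "just-in-time" bookkeeping for sparse data, under the assumption that we track, for each coordinate, the iteration at which it was last brought up to date; the helper name and data layout are illustrative.

```python
import numpy as np

def catch_up(x, d, last_touched, t, coords, alpha, M):
    # d[j] has been constant since iteration last_touched[j], so the k skipped
    # updates on coordinate j can be applied in a single step.
    for j in coords:
        k = t - last_touched[j]
        if k > 0:
            x[j] -= k * (alpha / M) * d[j]
            last_touched[j] = t
```

Before processing example i at iteration t one would call catch_up on the nonzero coordinates of a_i (so x is current where it matters), update d on those coordinates, and call catch_up over all coordinates once before returning x.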

80 Implementation Issues: Step-Size
How should we set the step size?

81 Implementation Issues: Step-Size
How should we set the step size?
Theory: α = 1/(16L). Practice: α = 1/L.

82 Implementation Issues: Step-Size
How should we set the step size?
Theory: α = 1/(16L). Practice: α = 1/L.
What if L is unknown, or smaller near x*?

83 Implementation Issues: Step-Size
How should we set the step size?
Theory: α = 1/(16L). Practice: α = 1/L.
What if L is unknown, or smaller near x*?
Start with a small L. Increase L until we satisfy
f_i(x - (1/L) f_i'(x)) ≤ f_i(x) - (1/(2L)) ||f_i'(x)||^2
(assuming ||f_i'(x)||^2 ≥ ε). Decrease L between iterations.
(Lipschitz approximation procedure from FISTA.)

84 Implementation Issues: Step-Size
How should we set the step size?
Theory: α = 1/(16L). Practice: α = 1/L.
What if L is unknown, or smaller near x*?
Start with a small L. Increase L until we satisfy
f_i(x - (1/L) f_i'(x)) ≤ f_i(x) - (1/(2L)) ||f_i'(x)||^2
(assuming ||f_i'(x)||^2 ≥ ε). Decrease L between iterations.
(Lipschitz approximation procedure from FISTA.)
For f_i(a_i^T x), this costs O(1) in N and P: evaluate f_i(a_i^T x - (f_i'(δ)/L) ||a_i||^2).
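A hedged sketch of the backtracking Lipschitz estimate described above, assuming numpy arrays and user-supplied callables fi and grad_fi for a single example; the doubling factor and the small decrease between iterations are common choices, not values taken from the slides.

```python
def lipschitz_line_search(fi, grad_fi, x, L, eps=1e-8, max_doublings=50):
    # Increase L until f_i(x - (1/L) f_i'(x)) <= f_i(x) - ||f_i'(x)||^2 / (2L),
    # skipping the test when the gradient is already negligible.
    g = grad_fi(x)
    gsq = float(g @ g)
    if gsq < eps:
        return L
    fx = fi(x)
    for _ in range(max_doublings):
        if fi(x - g / L) <= fx - gsq / (2.0 * L):
            break
        L *= 2.0
    return L
```

Between iterations one would shrink L slightly (e.g., multiply by a constant just below 1) so the estimate can track a smaller local Lipschitz constant near x*.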

85 Implementation Issues: Termination Criterion
When should we stop?

86 Implementation Issues: Termination Criterion
When should we stop?
Normally we check the size of g'(x).

87 Implementation Issues: Termination Criterion
When should we stop?
Normally we check the size of g'(x).
And SAG has y_i ≈ f_i'(x).

88 Implementation Issues: Termination Criterion
When should we stop?
Normally we check the size of g'(x).
And SAG has y_i ≈ f_i'(x).
We can check the size of (1/N) d = (1/N) ∑_{i=1}^N y_i ≈ g'(x).

89 SAG Implementation Issues: Mini-Batches
Can we use mini-batches?

90 SAG Implementation Issues: Mini-Batches
Can we use mini-batches?
Yes: define each f_i to include more than one example.
Reduces memory requirements and allows parallelization.
But we must decrease L for good performance:
L_B ≤ (1/|B|) ∑_{i ∈ B} L_i ≤ max_{i ∈ B} {L_i}.

91 SAG Implementation Issues: Mini-Batches
Can we use mini-batches?
Yes: define each f_i to include more than one example.
Reduces memory requirements and allows parallelization.
But we must decrease L for good performance:
L_B ≤ (1/|B|) ∑_{i ∈ B} L_i ≤ max_{i ∈ B} {L_i}.
In practice, use the Lipschitz approximation procedure to determine L_B.

92 Implementation Issues: Constraints/Non-Smooth
For composite problems with non-smooth regularizers r,
min_x (1/N) ∑_{i=1}^N f_i(x) + r(x).

93 Implementation Issues: Constraints/Non-Smooth
For composite problems with non-smooth regularizers r,
min_x (1/N) ∑_{i=1}^N f_i(x) + r(x).
We can do a proximal-gradient variant when r is simple:
x = prox_{α r(·)} [x - (α/N) d].
[Mairal, 2013, 2014, Xiao and Zhang, 2014, Defazio et al., 2014]
E.g., with r(x) = λ ||x||_1: stochastic iterative soft-thresholding.
Same convergence rate as in the smooth case.

94 Implementation Issues: Constraints/Non-Smooth
For composite problems with non-smooth regularizers r,
min_x (1/N) ∑_{i=1}^N f_i(x) + r(x).
We can do a proximal-gradient variant when r is simple:
x = prox_{α r(·)} [x - (α/N) d].
[Mairal, 2013, 2014, Xiao and Zhang, 2014, Defazio et al., 2014]
E.g., with r(x) = λ ||x||_1: stochastic iterative soft-thresholding.
Same convergence rate as in the smooth case.
ADMM variants exist when r is simple or a linear composition, r(Ax). [Wong et al., 2013]

95 Implementation Issues: Constraints/Non-Smooth
For composite problems with non-smooth regularizers r,
min_x (1/N) ∑_{i=1}^N f_i(x) + r(x).
We can do a proximal-gradient variant when r is simple:
x = prox_{α r(·)} [x - (α/N) d].
[Mairal, 2013, 2014, Xiao and Zhang, 2014, Defazio et al., 2014]
E.g., with r(x) = λ ||x||_1: stochastic iterative soft-thresholding.
Same convergence rate as in the smooth case.
ADMM variants exist when r is simple or a linear composition, r(Ax). [Wong et al., 2013]
If the f_i are non-smooth, we could smooth them or use dual methods.
[Nesterov, 2005, Lacoste-Julien et al., 2013, Shalev-Schwartz and Zhang, 2013]
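A small sketch of the proximal step for r(x) = λ||x||_1, i.e., stochastic iterative soft-thresholding applied to the averaged-gradient direction d; only the standard prox formula for the l1 norm is assumed.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_sag_step(x, d, alpha, N, lam):
    # Take the averaged-gradient step, then apply prox_{alpha * r}.
    return soft_threshold(x - (alpha / N) * d, alpha * lam)
```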

96 SAG Implementation Issues: Non-Uniform Sampling Does re-shuffling and doing full passes work better?

97 SAG Implementation Issues: Non-Uniform Sampling Does re-shuffling and doing full passes work better? NO!

98 SAG Implementation Issues: Non-Uniform Sampling Does re-shuffling and doing full passes work better? NO! Performance is intermediate between IAG and SAG.

99 SAG Implementation Issues: Non-Uniform Sampling Does re-shuffling and doing full passes work better? NO! Performance is intermediate between IAG and SAG. Can non-uniform sampling help?

100 SAG Implementation Issues: Non-Uniform Sampling
Does re-shuffling and doing full passes work better? NO!
Performance is intermediate between IAG and SAG.
Can non-uniform sampling help?
Bias sampling towards the Lipschitz constants L_i.

101 SAG Implementation Issues: Non-Uniform Sampling
Does re-shuffling and doing full passes work better? NO!
Performance is intermediate between IAG and SAG.
Can non-uniform sampling help?
Bias sampling towards the Lipschitz constants L_i.
Justification: duplicate examples proportional to L_i,
(1/N) ∑_{i=1}^N f_i(x) = (1/(N L_mean)) ∑_{i=1}^N ∑_{j=1}^{L_i} (L_mean/L_i) f_i(x),
so the convergence rate depends on L_mean instead of L_max.

102 SAG Implementation Issues: Non-Uniform Sampling
Does re-shuffling and doing full passes work better? NO!
Performance is intermediate between IAG and SAG.
Can non-uniform sampling help?
Bias sampling towards the Lipschitz constants L_i.
Justification: duplicate examples proportional to L_i,
(1/N) ∑_{i=1}^N f_i(x) = (1/(N L_mean)) ∑_{i=1}^N ∑_{j=1}^{L_i} (L_mean/L_i) f_i(x),
so the convergence rate depends on L_mean instead of L_max.
Combine this with the line-search for adaptive sampling (see the paper/code for details).
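A minimal sketch of sampling indices with probability proportional to the (estimated) Lipschitz constants L_i, which is the biasing described above; maintaining and re-estimating the L_i is left to the line-search.

```python
import numpy as np

def sample_by_lipschitz(L_i, rng=None):
    # Sample an index i with probability proportional to its Lipschitz estimate L_i.
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(L_i, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), p=p)
```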

103 SAG with Non-Uniform Sampling
Datasets: protein (n = , p = 74) and sido (n = 12678, p = 4932).
[Plots: objective minus optimum vs. effective passes for AGD, L-BFGS, SG-C, ASG-C, IAG, PCD-L, DCA, and SAG.]
Datasets where SAG had the worst relative performance.

104 SAG with Non-Uniform Sampling
Datasets: protein (n = , p = 74) and sido (n = 12678, p = 4932).
[Plots: objective minus optimum vs. effective passes, now also including SAG-LS (Lipschitz).]
Lipschitz sampling helps a lot.

105 Newton-like Methods
Can we use a scaling matrix H?
x = x - (α/N) H^{-1} d.

106 Newton-like Methods
Can we use a scaling matrix H?
x = x - (α/N) H^{-1} d.
This didn't help in my experiments with a diagonal H, and a non-diagonal H will lose sparsity.

107 Newton-like Methods
Can we use a scaling matrix H?
x = x - (α/N) H^{-1} d.
This didn't help in my experiments with a diagonal H, and a non-diagonal H will lose sparsity.
A quasi-Newton variant has been proposed with empirically faster convergence, but much more overhead. [Sohl-Dickstein et al., 2014]

108 Conclusion and Discussion Faster theoretical convergence using only the sum structure.

109 Conclusion and Discussion Faster theoretical convergence using only the sum structure. Simple algorithm, empirically better than theory predicts.

110 Conclusion and Discussion Faster theoretical convergence using only the sum structure. Simple algorithm, empirically better than theory predicts. Black-box stochastic gradient algorithm: Adaptivity to problem difficulty, line-search, termination criterion.

111 Conclusion and Discussion Faster theoretical convergence using only the sum structure. Simple algorithm, empirically better than theory predicts. Black-box stochastic gradient algorithm: Adaptivity to problem difficulty, line-search, termination criterion. Recent/current developments: Memory-free variants. Non-smooth variants. Improved constants. Relaxed strong-convexity. Non-uniform sampling. Quasi-Newton variants.

112 Conclusion and Discussion Faster theoretical convergence using only the sum structure. Simple algorithm, empirically better than theory predicts. Black-box stochastic gradient algorithm: Adaptivity to problem difficulty, line-search, termination criterion. Recent/current developments: Memory-free variants. Non-smooth variants. Improved constants. Relaxed strong-convexity. Non-uniform sampling. Quasi-Newton variants. Future developments: Non-convex analysis. Parallel/distributed methods.

113 Conclusion and Discussion Faster theoretical convergence using only the sum structure. Simple algorithm, empirically better than theory predicts. Black-box stochastic gradient algorithm: Adaptivity to problem difficulty, line-search, termination criterion. Recent/current developments: Memory-free variants. Non-smooth variants. Improved constants. Relaxed strong-convexity. Non-uniform sampling. Quasi-Newton variants. Future developments: Non-convex analysis. Parallel/distributed methods. Thank you for the invitation.


More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Semistochastic quadratic bound methods

Semistochastic quadratic bound methods Semistochastic quadratic bound methods Aleksandr Aravkin IBM T.J. Watson Research Center Yorktown Heights, NY 10598 saravkin@us.ibm.com Tony Jebara Columbia University NY, USA jebara@cs.columbia.edu Anna

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning

Motivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Optimized first-order minimization methods

Optimized first-order minimization methods Optimized first-order minimization methods Donghwan Kim & Jeffrey A. Fessler EECS Dept., BME Dept., Dept. of Radiology University of Michigan web.eecs.umich.edu/~fessler UM AIM Seminar 2014-10-03 1 Disclosure

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Stochastic Optimization Part I: Convex analysis and online stochastic optimization

Stochastic Optimization Part I: Convex analysis and online stochastic optimization Stochastic Optimization Part I: Convex analysis and online stochastic optimization Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Lecture 8: February 9

Lecture 8: February 9 0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless

More information