Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

1 Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Davood Hajinezhad, Iowa State University

2 Co-Authors
Mingyi Hong (Iowa State University), Tuo Zhao (Johns Hopkins University), Zhaoran Wang (Princeton University)
To appear: NIPS 2016. Link to Online Version:

3 Outline
1. Introduction: Motivation, Related Works, Applications
2. The Proposed Approach: Convergence Analysis, Comments
3. Connections with Existing Works
4. Numerical Results

5 Introduction / Motivation: Problem Formulation

Consider
$$\min_{z \in Z} \; f(z) := \underbrace{\frac{1}{N}\sum_{i=1}^{N} g_i(z)}_{\text{empirical cost}} + \underbrace{g_0(z) + p(z)}_{\text{regularizer}} \qquad (1)$$
- $g_i$, $i = 1, \dots, N$ are cost functions; $g_0$ and $p$ are regularizers.
- $p(z)$: lower semi-continuous, convex, possibly nonsmooth.
- $Z$: closed and convex set.
- $g_i$: $L_i$-smooth, possibly nonconvex.

9 Introduction / Motivation

- When $N$ is very large, we would like to optimize one component function at a time, i.e., decompose across the functions.
- Questions:
  1. Can we find fast algorithms that achieve such decomposition for nonconvex problems?
  2. Can we rigorously characterize their convergence rates?
- Main idea: utilize both primal and dual information.

13 Introduction / Motivation: This Work

- We study the problem from a primal-dual perspective.
- We develop a stochastic algorithm that achieves sublinear convergence rates to first-order stationary solutions, with scaling constants up to $O(N)$ better than those of the (batch) gradient method.
- For special quadratic problems, the algorithm achieves linear convergence.
- We identify connections between the proposed primal-dual approach and a few "primal only" methods such as IAG/SAG/SAGA.

14 Introduction / Related Works: Related Works in the Convex Case

Extensive literature exists in the convex case.
- Gradient Descent (GD):
  $x^{r+1} = x^r - \alpha \, \frac{1}{N}\sum_{i=1}^{N} \nabla g_i(x^r)$
- Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]: randomly pick an index $i_r \in \{1,\dots,N\}$ and set
  $x^{r+1} = x^r - \frac{1}{\eta^r}\, \nabla g_{i_r}(x^r)$
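To make the two updates concrete, here is a minimal sketch (not the speaker's code); the least-squares components $g_i(z) = \frac{1}{2}(a_i^\top z - b_i)^2$ and all step-size choices are illustrative assumptions only.

```python
# Minimal sketch: batch GD vs. classical SGD on f(z) = (1/N) sum_i g_i(z),
# with stand-in components g_i(z) = 0.5*(a_i^T z - b_i)^2.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 10
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_gi(z, i):
    """Gradient of g_i(z) = 0.5*(a_i^T z - b_i)^2."""
    return (A[i] @ z - b[i]) * A[i]

def gradient_descent(z, alpha=0.05, iters=100):
    for _ in range(iters):
        full_grad = sum(grad_gi(z, i) for i in range(N)) / N  # one full pass per step
        z = z - alpha * full_grad
    return z

def sgd(z, eta0=0.5, iters=100 * 200):
    for r in range(iters):
        i = rng.integers(N)                      # pick one component at random
        z = z - eta0 / (1 + r) * grad_gi(z, i)   # diminishing step size
    return z

z_gd, z_sgd = gradient_descent(np.zeros(d)), sgd(np.zeros(d))
```

GD touches all $N$ gradients per iteration, while SGD touches one, which is the kind of decomposition the slides ask for.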

15 Introduction / Related Works: Related Works in the Convex Case

Stochastic Average Gradient (SAG) [Le Roux et al., 2012]: consider the special case of problem (1) with $p(x) = 0$ and $g_0(x) = 0$.
- Randomly pick an index $i_r \in \{1,\dots,N\}$.
- Set $y_{i_r}^{r} = x^r$ and $y_j^{r} = y_j^{r-1}$ for $j \neq i_r$.
- Update
  $x^{r+1} = x^r - \frac{1}{\eta N}\underbrace{\sum_{i=1}^{N} \nabla g_i(y_i^{r})}_{\text{average of past gradients}} = x^r - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y_i^{r-1}) + \frac{1}{\eta N}\left(\nabla g_{i_r}(y_{i_r}^{r-1}) - \nabla g_{i_r}(x^r)\right)$

Only the sum $\sum_{i=1}^{N} \nabla g_i(y_i^{r-1})$ and each of its components need to be tracked.

16 Introduction / Related Works: Related Works in the Convex Case

The SAGA algorithm [Defazio et al., 2014]:
- Randomly pick an index $i_k \in \{1,\dots,N\}$.
- Set $y_{i_k}^{k} = x^k$ and $y_j^{k} = y_j^{k-1}$ for $j \neq i_k$.
- Update
  $w^{k+1} = x^k - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y_i^{k-1}) + \frac{1}{\eta}\left(\nabla g_{i_k}(y_{i_k}^{k-1}) - \nabla g_{i_k}(x^k)\right)$
  $x^{k+1} = \operatorname{prox}^{1/\eta}_{h}(w^{k+1})$  (deals with the nonsmooth part)

It has better convergence rates, but again requires storage of the past $N$ gradients.
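A minimal sketch of the standard SAGA update may help; it is not the speaker's code, the step $\gamma$ plays the role of $1/\eta$ on the slide, and the nonsmooth part is assumed to be $h(x) = \mu\|x\|_1$ so that the prox is a soft-threshold.

```python
# Minimal SAGA sketch with an l1 prox; grad_gi(x, i) returns grad g_i(x).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def saga(grad_gi, N, d, gamma=0.01, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    table = np.stack([grad_gi(x, i) for i in range(N)])  # stored gradient per component
    avg = table.mean(axis=0)                             # running average of the table
    for _ in range(iters):
        i = rng.integers(N)
        g_new = grad_gi(x, i)
        # variance-reduced direction: fresh gradient - stale gradient + table average
        direction = g_new - table[i] + avg
        x = soft_threshold(x - gamma * direction, gamma * mu)
        avg = avg + (g_new - table[i]) / N               # keep the average in sync
        table[i] = g_new
    return x
```

The `table` array is exactly the $O(N)$ gradient storage that the next slide asks whether we can avoid.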

17 Introduction / Related Works: Related Works in the Convex Case

- Other related methods: MISO [Mairal, 2013], SDCA [Shalev-Shwartz and Zhang, 2013].
- But these all introduce memory requirements.
- Is it possible to achieve convergence rates similar to SAG, but without the additional storage? Yes: the stochastic variance reduced gradient (SVRG) method was proposed in [Johnson and Zhang, 2013].

18 Introduction / Related Works: Related Works in the Nonconvex Case

Classic algorithms:
- Incremental methods [Bertsekas, 2011]
- SGD-based algorithms [Ghadimi and Lan, 2013]

Recent fast algorithms:
- Nonconvex SVRG [Zhu et al., 2016; Reddi et al., 2016]
- Nonconvex SAGA [Reddi et al., 2016]

The latter two algorithms only handle the smooth case.

19 Introduction / Applications: High Dimensional Regression

- $y \in \mathbb{R}^M$: observation vector generated as $y = X\nu + \epsilon$.
- $X \in \mathbb{R}^{M \times P}$: covariate matrix; $\nu \in \mathbb{R}^P$: ground truth; $\epsilon \in \mathbb{R}^M$: noise.
- We only know a noisy estimate of $X$: $A = X + W$, where $W \in \mathbb{R}^{M \times P}$ is noise.
- Goal: estimate the ground truth $\nu$.
- Set $\hat{\Gamma} := \frac{1}{M} A^\top A - \Sigma_W$ and $\hat{\gamma} := \frac{1}{M} A^\top y$.
- Nonconvex optimization formulation [Loh and Wainwright, 2012]:
  $\min_{z} \; z^\top \hat{\Gamma} z - \hat{\gamma}^\top z \quad \text{s.t.} \; \|z\|_1 \le R.$
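The objective above is easy to assemble once $\Sigma_W$ is available; the sketch below is illustrative (the helper names are not from the paper), and it only builds the quadratic, leaving the $\ell_1$-ball constraint to the solver.

```python
# Minimal sketch: build (Gamma_hat, gamma_hat) for min_z z^T Gamma_hat z - gamma_hat^T z.
import numpy as np

def corrected_quadratic(A, y, Sigma_W):
    """A: noisy covariates (M x P), y: observations (M,), Sigma_W: noise covariance of W."""
    M = A.shape[0]
    Gamma_hat = (A.T @ A) / M - Sigma_W   # may be indefinite, hence a nonconvex objective
    gamma_hat = (A.T @ y) / M
    return Gamma_hat, gamma_hat

def objective(z, Gamma_hat, gamma_hat):
    # quadratic value; the constraint ||z||_1 <= R is enforced by the algorithm, not here
    return z @ Gamma_hat @ z - gamma_hat @ z
```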

20 Outline, Part 2: The Proposed Approach (Convergence Analysis, Comments)

22 The Proposed Approach: A Reformulation

- Start from a distributed optimization setting: split $z$ into $N$ new variables.
- Consider the following reformulation:
  $\min_{x,\, z \in \mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} g_i(x_i) + g_0(z) + h(z) \quad \text{s.t.} \; x_i = z, \; i = 1,\dots,N, \qquad (2)$
  where $h(z) := \iota_Z(z) + p(z)$ and $x := [x_1; \dots; x_N]$.
- $N$ distributed agents, one central controller.

26 The Proposed Approach: A Reformulation

Distributed setting:
[Figure: a central node handling $g_0$ and $h$, connected to $N$ agent nodes handling $g_1, g_2, g_3, \dots, g_N$.]

27 The Proposed Approach: Assumptions

- A-(a) $f(z)$ is bounded from below, i.e., $\underline{f} := \min_{z \in Z} f(z) > -\infty$.
- A-(b) Define $g(z) := \frac{1}{N}\sum_{i=1}^{N} g_i(z)$; suppose the $g_i$'s and $g$ have Lipschitz continuous gradients, i.e.,
  $\|\nabla g_i(y) - \nabla g_i(z)\| \le L_i \|y - z\|, \quad \forall\, y, z \in \operatorname{dom}(g_i), \; i = 0,\dots,N,$
  $\|\nabla g(y) - \nabla g(z)\| \le L \|y - z\|, \quad \forall\, y, z \in \operatorname{dom}(g).$

From Assumption A-(b): $L \le \frac{1}{N}\sum_{i=1}^{N} L_i$, and the equality can be achieved in the worst case.

The central node is able to broadcast to every node, and each node is able to send messages to the central node.

31 The Proposed Approach: The Proposed Algorithm, Preliminaries

Augmented Lagrangian:
$$L(x, z; \lambda) = \sum_{i=1}^{N}\left( \frac{1}{N} g_i(x_i) + \langle \lambda_i,\, x_i - z\rangle + \frac{\eta_i}{2}\|x_i - z\|^2 \right) + g_0(z) + h(z)$$
- $\lambda := \{\lambda_i\}_{i=1}^{N}$ is the set of dual variables.
- $\eta := \{\eta_i > 0\}_{i=1}^{N}$ are penalty parameters.
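For reference, a minimal sketch of evaluating this augmented Lagrangian; `g_list`, `g0`, `h` are user-supplied callables and the array shapes are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: evaluate L(x, z; lambda) for x, lam of shape (N, d) and z of shape (d,).
import numpy as np

def augmented_lagrangian(x, z, lam, g_list, g0, h, eta):
    N = len(g_list)
    val = g0(z) + h(z)
    for i in range(N):
        diff = x[i] - z
        val += g_list[i](x[i]) / N + lam[i] @ diff + 0.5 * eta[i] * diff @ diff
    return val
```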

32 The Proposed Approach: The Proposed Algorithm, Preliminaries

Consider $L(x, z; \lambda)$ above. For variable $x_i$, define
$$V_i(x_i, z; \lambda_i) = \frac{1}{N} g_i(z) + \frac{1}{N}\langle \nabla g_i(z),\, x_i - z\rangle + \langle \lambda_i,\, x_i - z\rangle + \frac{\alpha_i \eta_i}{2}\|x_i - z\|^2.$$
Remarks:
1. We approximate $\frac{1}{N} g_i(x_i)$ by its linear approximation around $z$ (the first two terms above).
2. The parameter $\alpha_i$ gives some freedom to the algorithm.

33 The Proposed Approach: The NESTT Algorithm

We propose a NonconvEx primal-dual SpliTTing algorithm (NESTT).

The NESTT-G Algorithm. For iterations $r = 1, \dots, R$:
- Pick $i_r \in \{1, 2, \dots, N\}$ with probability $p_{i_r}$ and update the primal-dual pairs:
  $x_{i_r}^{r+1} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda_{i_r}^r);$
  $\lambda_{i_r}^{r+1} = \lambda_{i_r}^r + \alpha_{i_r}\eta_{i_r}\,(x_{i_r}^{r+1} - z^r);$
  $\lambda_j^{r+1} = \lambda_j^r, \quad x_j^{r+1} = z^r, \quad j \neq i_r.$
- Update $z$:
  $z^{r+1} = \arg\min_{z \in Z} L(\{x_i^{r+1}\}, z; \lambda^r).$

Output: $(z^m, x^m, \lambda^m)$, where $m$ is randomly picked from $\{1, 2, \dots, R\}$.
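The following is one possible reading of NESTT-G in code, assuming $g_0 = 0$, $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$ so that both the $x$-step and the $z$-step have closed forms (the $z$-step becomes a soft-threshold). It is an illustrative sketch, not the authors' reference implementation; `grad_gi`, `p`, `alpha`, `eta` are supplied by the caller.

```python
# Minimal NESTT-G sketch under the assumptions stated above.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nestt_g(grad_gi, N, d, p, alpha, eta, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    x = np.zeros((N, d))
    lam = np.zeros((N, d))
    for _ in range(iters):
        i = rng.choice(N, p=p)
        g_i = grad_gi(z, i) / N
        # x-step: closed-form minimizer of the linearized local problem V_i;
        # all other blocks are reset to the current z.
        x[:] = z
        x[i] = z - (lam[i] + g_i) / (alpha[i] * eta[i])
        # z-step (uses the old duals lam^r): with g_0 = 0 and h = mu*||.||_1 it is the
        # prox of h at the eta-weighted average of x_i + lam_i / eta_i.
        u = (eta[:, None] * x + lam).sum(axis=0) / eta.sum()
        z_new = soft_threshold(u, mu / eta.sum())
        # dual step for the sampled block; it ends up storing -grad g_i(z^r)/N
        lam[i] = lam[i] + alpha[i] * eta[i] * (x[i] - z)
        z = z_new
    return z
```

The theorem later in the talk suggests choosing $p_i = \alpha_i \propto \sqrt{L_i/N}$; any such choice can simply be passed in through `p`, `alpha`, `eta`.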

34 The Proposed Approach: The NESTT Algorithm

- First, each node $i$ stores $(x_i, \lambda_i)$.
- When the algorithm starts, the central node broadcasts $z^r$ to every node.
[Figure: the central node sends $z^r$ to the nodes storing $(x_1, \lambda_1), (x_2, \lambda_2), (x_3, \lambda_3), \dots, (x_N, \lambda_N)$, including the sampled node $(x_{i_r}, \lambda_{i_r})$.]

35 The Proposed Approach: The NESTT Algorithm

Randomly pick $i_r$ and update $x_{i_r}$ and the associated $\lambda_{i_r}$:
  $x_{i_r}^{r+1} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda_{i_r}^r); \quad \lambda_{i_r}^{r+1} = \lambda_{i_r}^r + \alpha_{i_r}\eta_{i_r}\,(x_{i_r}^{r+1} - z^r)$
[Figure: only the sampled node $(x_{i_r}, \lambda_{i_r})$ is updated; the other nodes and the central node keep their variables.]

36 The Proposed Approach: The NESTT Algorithm

Aggregate $(x, \lambda)$ at the central node and update $z$:
  $z^{r+1} = \arg\min_{z \in Z} L(\{x_i^{r+1}\}, z; \lambda^r)$
[Figure: the nodes $(x_1, \lambda_1), \dots, (x_N, \lambda_N)$ send their information to the central node, which computes $z^{r+1}$.]

37 The Proposed Approach / Convergence Analysis: Main Steps of the Convergence Analysis of NESTT-G

Step 1) Build a potential function and show that it is decreasing:
$$Q^r := \sum_{i=1}^{N} \frac{1}{N} g_i(z^r) + g_0(z^r) + h(z^r) + \sum_{i=1}^{N} \frac{3 p_i \eta_i}{(\alpha_i \eta_i)^2}\left\| \frac{1}{N}\left(\nabla g_i(y_i^{r-1}) - \nabla g_i(z^r)\right)\right\|^2$$
Step 2) Show that a certain optimality gap decreases in a sublinear manner.

39 The Proposed Approach / Convergence Analysis: Convergence Analysis of NESTT

Lemma (Sufficient Descent). Suppose Assumption A holds, and pick
$$\alpha_i = p_i = \beta \eta_i, \quad \text{where } \beta := \frac{1}{\sum_{i=1}^{N} \eta_i}, \quad \text{and} \quad \eta_i \ge \frac{9 L_i}{N p_i}. \qquad (3)$$
Then the following descent estimate holds true for NESTT:
$$\mathbb{E}\left[Q^r - Q^{r-1} \,\middle|\, \mathcal{F}^{r-1}\right] \le -\sum_{i=1}^{N} \eta_i\, \mathbb{E}\|z^r - z^{r-1}\|^2 - \sum_{i=1}^{N} \frac{1}{\eta_i}\left\|\frac{1}{N}\left(\nabla g_i(z^{r-1}) - \nabla g_i(y_i^{r-2})\right)\right\|^2. \qquad (4)$$

40 The Proposed Approach / Convergence Analysis: Convergence Analysis of NESTT

Sublinear convergence of NESTT.

Definition. The proximal gradient of problem (1) is given by (for any $\gamma > 0$)
$$\tilde{\nabla} f_\gamma(z) := \gamma\left(z - \operatorname{prox}^{\gamma}_{p + \iota_Z}\!\left[z - \tfrac{1}{\gamma}\left(\nabla g(z) + \nabla g_0(z)\right)\right]\right), \quad \text{where} \quad \operatorname{prox}^{\gamma}_{p + \iota_Z}[z] := \arg\min_{u \in Z}\; p(u) + \frac{\gamma}{2}\|z - u\|^2.$$
Optimality gap:
$$\mathbb{E}[G^r] := \mathbb{E}\left[\left\|\tilde{\nabla} f_{1/\beta}(z^r)\right\|^2\right] = \frac{1}{\beta^2}\,\mathbb{E}\left[\left\|z^r - \operatorname{prox}^{1/\beta}_{h}\!\left[z^r - \beta\left(\nabla g(z^r) + \nabla g_0(z^r)\right)\right]\right\|^2\right].$$
$G = 0$ if and only if a first-order stationary solution is obtained.
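Under the simplifying assumptions $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$, the gap above reduces to one soft-threshold step and is cheap to monitor; the sketch below is illustrative only, with `grad_g` and `grad_g0` as user-supplied gradient oracles.

```python
# Minimal sketch of the proximal-gradient optimality gap ||tilde_grad f_{1/beta}(z)||^2.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def optimality_gap(z, grad_g, grad_g0, beta, mu):
    """Vanishes exactly when z is a first-order stationary solution."""
    forward = z - beta * (grad_g(z) + grad_g0(z))    # gradient step of length beta
    prox_point = soft_threshold(forward, beta * mu)  # prox of mu*||.||_1 with parameter 1/beta
    return float(np.sum((z - prox_point) ** 2)) / beta ** 2
```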

41 The Proposed Approach / Convergence Analysis: Convergence Rate Analysis of NESTT

Theorem (Main Theorem). Suppose Assumption A holds, and pick (for $i = 1,\dots,N$)
$$\alpha_i = p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^{N}\sqrt{L_j/N}}, \qquad \eta_i = 3\sqrt{L_i/N}\,\sum_{j=1}^{N}\sqrt{L_j/N}.$$
1) Every limit point of NESTT is a stationary solution of problem (2) w.p. 1.
2)
$$\underbrace{\mathbb{E}[G^m]}_{\text{optimality gap}} + \underbrace{\mathbb{E}\left[\sum_{i=1}^{N} 3\eta_i^2\, \|x_i^m - z^{m-1}\|^2\right]}_{\text{constraint violation}} \le \frac{80}{3}\Big(\sum_{i=1}^{N}\sqrt{L_i/N}\Big)^2 \frac{\mathbb{E}[Q^0 - Q^R]}{R}.$$

42 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT

Assumption B:
- B-(a) Each function $g_i(z)$ is a quadratic of the form $g_i(z) = \frac{1}{2} z^\top A_i z + \langle b, z\rangle$, where $A_i$ is a symmetric matrix but not necessarily positive semidefinite;
- B-(b) The feasible set $Z$ is a closed compact polyhedral set;
- B-(c) The nonsmooth function is $p(z) = \mu\|z\|_1$ for some $\mu \ge 0$.

43 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT

Theorem (Linear Convergence of NESTT). Suppose that Assumptions A and B are satisfied. Then the sequence $\{\mathbb{E}[Q^{r+1}]\}_{r=1}^{\infty}$ converges Q-linearly to some $Q^* = f(z^*)$, where $z^*$ is a stationary solution of problem (1). That is, there exist a finite $\bar{r} > 0$ and $\rho \in (0,1)$ such that for all $r \ge \bar{r}$,
$$\mathbb{E}[Q^{r+1} - Q^*] < \rho\, \mathbb{E}[Q^r - Q^*].$$

44 Outline, Part 3: Connections with Existing Works

45 Connections with Existing Works: Connections and Comparisons with Existing Works

NESTT can be written in the following compact form:
$$u^{r+1} := z^r - \beta\left(\underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla g_i(y_i^{r-1})}_{\text{average of the gradients}} + \underbrace{\frac{1}{N\alpha_{i_r}}\left(\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y_{i_r}^{r-1})\right)}_{\text{gradient update}}\right)$$
$$z^{r+1} = \arg\min_{z}\; h(z) + g_0(z) + \frac{1}{2\beta}\|z - u^{r+1}\|^2$$
Special cases (with $h(z) = 0$ and $g_0(z) = 0$):
- IAG [Blatt 07]: cyclic selection of $i_r$
- SAG [Schmidt 12]: $p_i = 1/N$
- SAGA [Defazio 14]: $\alpha_i = 1/N$, $p_i = 1/N$
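The compact form is what makes the comparison in the list above possible; here is an illustrative sketch of it (again assuming $g_0 = 0$ and $h(z) = \mu\|z\|_1$, not the authors' code), where the cached gradients play the role of the dual variables.

```python
# Minimal sketch of the compact ("primal-only") form of NESTT-G.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nestt_g_compact(grad_gi, N, d, alpha, p, beta, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    y_grads = np.zeros((N, d))   # (1/N) grad g_i at the last point where it was evaluated
    for _ in range(iters):
        i = rng.choice(N, p=p)
        fresh = grad_gi(z, i) / N
        # u^{r+1}: average of the cached gradients plus a rescaled correction term
        u = z - beta * (y_grads.sum(axis=0) + (fresh - y_grads[i]) / alpha[i])
        z = soft_threshold(u, beta * mu)   # the prox step replaces the argmin over z
        y_grads[i] = fresh                 # the cache (i.e. the dual variable) remembers this gradient
    return z
```

With $\alpha_i = p_i = 1/N$, the correction term becomes the full gradient difference $\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y_{i_r}^{r-1})$, which is exactly the SAGA-type direction noted in the list above.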

46 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations

- Based on the classical theory of GD, to reach an $\epsilon$-stationary solution GD must be run for $O(L/\epsilon)$ iterations.
- The number of gradient evaluations per GD step is $N$.
- In the worst case ($\frac{1}{N}\sum_{i=1}^{N} L_i = L$), the total number of gradient evaluations is $O\big(\sum_{i=1}^{N} L_i/\epsilon\big)$.

47 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations

Number of gradient evaluations: NESTT $O\big((\sum_{i=1}^{N}\sqrt{L_i/N})^2/\epsilon\big)$, GD $O\big(\sum_{i=1}^{N} L_i/\epsilon\big)$.

- Case I ($L_i = 1$ for all $i$): NESTT $O(N/\epsilon)$, GD $O(N/\epsilon)$.
- Case II ($O(N^{2/3})$ terms with $L_i = N^{2/3}$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^{4/3}/\epsilon)$.
- Case III ($O(\sqrt{N})$ terms with $L_i = N$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^{3/2}/\epsilon)$.
- Case IV ($O(1)$ terms with $L_i = N^2$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^2/\epsilon)$.
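As a quick check (not on the slides), the Case II entries follow directly from the header formulas: $\sum_{i=1}^{N}\sqrt{L_i/N} \approx N^{2/3}\sqrt{N^{2/3}/N} + N\sqrt{1/N} = N^{1/2} + N^{1/2} = O(\sqrt{N})$, so NESTT needs $O\big((\sqrt{N})^2/\epsilon\big) = O(N/\epsilon)$ gradient evaluations, while $\sum_{i=1}^{N} L_i \approx N^{2/3}\cdot N^{2/3} + N = O(N^{4/3})$, so GD needs $O(N^{4/3}/\epsilon)$.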

48 Outline, Part 4: Numerical Results

49 Numerical Results: High Dimensional Regression

Comparison of NESTT, SAGA, and SGD. The x-axis denotes the number of passes over the dataset (# Grad/N), the y-axis the optimality gap. Left: uniform sampling ($p_i = 1/N$) with $N_i = N_j$. Right: non-uniform sampling ($p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^{N}\sqrt{L_j/N}}$) with $N_i \neq N_j$.

[Figure: two log-scale plots of optimality gap vs. # Grad/N for SGD, NESTT-E($\alpha = 10$), NESTT-E($\alpha = 1$), NESTT-G, and SAGA; left panel uniform sampling, right panel non-uniform sampling.]

50 Numerical Results: High Dimensional Regression

[Table: optimality gap $\|\tilde{\nabla} f_{1/\beta}(z^r)\|^2$ for different algorithms (SGD, NESTT-E ($\alpha = 10$), NESTT, SAGA) after 100 passes over the datasets, under uniform and non-uniform sampling and several problem sizes $N$.]

In both the uniform and the non-uniform case, NESTT outperforms the other methods.

51 Numerical Results: Comparison with GD (Worst Case)

- Consider the distributed SPCA problem.
- Split a matrix into diagonal sub-matrices.
[Figure: comparison of NESTT and GD.]

52 Conclusion

- A primal-dual based algorithm for minimizing nonconvex finite sums.
- It converges sublinearly in general, and Q-linearly for certain nonconvex $\ell_1$-penalized quadratic problems.
- IAG/SAG/SAGA are closely connected with NESTT.
- Non-uniform sampling helps improve the convergence rates.
- Key insight: primal-dual based algorithms can offer faster convergence than primal-only algorithms.

53 Thank You!

54 Comments: Linear Convergence

- The linear convergence result relies on a key error bound condition from [Luo and Tseng, 1992].
- It establishes linear convergence to stationary solutions.
- This is different from a few recent works showing linear convergence for "nonconvex" problems [Zhu and Hazan, 2016], [Reddi et al., 2016], [Karimi and Schmidt, 2016]:
  1. They are based on certain quadratic growth conditions.
  2. For smooth unconstrained problems under such conditions, every stationary point is a global minimum.
  3. They do not cover our nonconvex QP setting.

55 Connections and Comparisons with Existing Works

One can verify that the $x_i$-update is given in closed form by
$$x_j^{r+1} = z^r - \frac{1}{\alpha_j \eta_j}\left(\lambda_j^r + \frac{1}{N}\nabla g_j(y_j^r)\right). \qquad (5)$$
The dual update is then given by
$$\lambda_i^{r+1} = -\frac{1}{N}\nabla g_i(y_i^r). \qquad (6)$$
(Substituting (5) into $\lambda_i^{r+1} = \lambda_i^r + \alpha_i\eta_i(x_i^{r+1} - z^r)$ cancels $\lambda_i^r$, leaving only the gradient term.)

Observation: the dual variables remember the past gradients of the $g_i$'s.
