Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

1 Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Davood Hajinezhad, Iowa State University

2 Co-Authors
Mingyi Hong (Iowa State University), Tuo Zhao (Johns Hopkins University), Zhaoran Wang (Princeton University)
To appear: NIPS 2016. Link to Online Version:

3 Outline
1. Introduction: Motivation, Related Works, Applications
2. The Proposed Approach: Convergence Analysis, Comments
3. Connections with Existing Works
4. Numerical Results

5 Introduction / Motivation: Problem Formulation

Consider
$$\min_{z \in Z} \; f(z) := \underbrace{\frac{1}{N}\sum_{i=1}^{N} g_i(z)}_{\text{empirical cost}} + \underbrace{g_0(z) + p(z)}_{\text{regularizer}} \qquad (1)$$
- $g_i$, $i = 1, \dots, N$ are cost functions; $g_0$ and $p$ are regularizers.
- $p(z)$: lower semi-continuous, convex, possibly nonsmooth.
- $Z$: closed and convex set.
- $g_i$: $L_i$-smooth, possibly nonconvex.

9 Introduction / Motivation

- When $N$ is very large, we would like to optimize one component function at a time, i.e., decompose across the functions.
- Questions:
  1. Can we find fast algorithms that achieve such decomposition for nonconvex problems?
  2. Can we rigorously characterize their convergence rates?
- Main idea: utilize both primal and dual information.

13 Introduction / Motivation: This Work

- We study the problem from a primal-dual perspective.
- We develop a stochastic algorithm that achieves sublinear convergence rates to first-order stationary solutions, with scaling constants up to $O(N)$ better than those of the (batch) gradient method.
- For special quadratic problems, the algorithm achieves linear convergence.
- We identify connections between the proposed primal-dual approach and a few "primal only" methods such as IAG/SAG/SAGA.

14 Introduction / Related Works: Related Works in the Convex Case

Extensive literature exists in the convex case.
- Gradient Descent (GD):
  $x^{r+1} = x^r - \alpha \, \frac{1}{N}\sum_{i=1}^{N} \nabla g_i(x^r)$
- Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]: randomly pick an index $i_r \in \{1,\dots,N\}$ and set
  $x^{r+1} = x^r - \frac{1}{\eta^r}\, \nabla g_{i_r}(x^r)$
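To make the two updates concrete, here is a minimal sketch (not the speaker's code); the least-squares components $g_i(z) = \frac{1}{2}(a_i^\top z - b_i)^2$ and all step-size choices are illustrative assumptions only.

```python
# Minimal sketch: batch GD vs. classical SGD on f(z) = (1/N) sum_i g_i(z),
# with stand-in components g_i(z) = 0.5*(a_i^T z - b_i)^2.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 10
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)

def grad_gi(z, i):
    """Gradient of g_i(z) = 0.5*(a_i^T z - b_i)^2."""
    return (A[i] @ z - b[i]) * A[i]

def gradient_descent(z, alpha=0.05, iters=100):
    for _ in range(iters):
        full_grad = sum(grad_gi(z, i) for i in range(N)) / N  # one full pass per step
        z = z - alpha * full_grad
    return z

def sgd(z, eta0=0.5, iters=100 * 200):
    for r in range(iters):
        i = rng.integers(N)                      # pick one component at random
        z = z - eta0 / (1 + r) * grad_gi(z, i)   # diminishing step size
    return z

z_gd, z_sgd = gradient_descent(np.zeros(d)), sgd(np.zeros(d))
```

GD touches all $N$ gradients per iteration, while SGD touches one, which is the kind of decomposition the slides ask for.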

15 Introduction / Related Works: Related Works in the Convex Case

Stochastic Average Gradient (SAG) [Le Roux et al., 2012]: consider the special case of problem (1) with $p(x) = 0$ and $g_0(x) = 0$.
- Randomly pick an index $i_r \in \{1,\dots,N\}$.
- Set $y_{i_r}^{r} = x^r$ and $y_j^{r} = y_j^{r-1}$ for $j \neq i_r$.
- Update
  $x^{r+1} = x^r - \frac{1}{\eta N}\underbrace{\sum_{i=1}^{N} \nabla g_i(y_i^{r})}_{\text{average of past gradients}} = x^r - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y_i^{r-1}) + \frac{1}{\eta N}\left(\nabla g_{i_r}(y_{i_r}^{r-1}) - \nabla g_{i_r}(x^r)\right)$

Only the sum $\sum_{i=1}^{N} \nabla g_i(y_i^{r-1})$ and each of its components need to be tracked.

16 Introduction / Related Works: Related Works in the Convex Case

The SAGA algorithm [Defazio et al., 2014]:
- Randomly pick an index $i_k \in \{1,\dots,N\}$.
- Set $y_{i_k}^{k} = x^k$ and $y_j^{k} = y_j^{k-1}$ for $j \neq i_k$.
- Update
  $w^{k+1} = x^k - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y_i^{k-1}) + \frac{1}{\eta}\left(\nabla g_{i_k}(y_{i_k}^{k-1}) - \nabla g_{i_k}(x^k)\right)$
  $x^{k+1} = \operatorname{prox}^{1/\eta}_{h}(w^{k+1})$  (deals with the nonsmooth part)

It has better convergence rates, but again requires storage of the past $N$ gradients.
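A minimal sketch of the standard SAGA update may help; it is not the speaker's code, the step $\gamma$ plays the role of $1/\eta$ on the slide, and the nonsmooth part is assumed to be $h(x) = \mu\|x\|_1$ so that the prox is a soft-threshold.

```python
# Minimal SAGA sketch with an l1 prox; grad_gi(x, i) returns grad g_i(x).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def saga(grad_gi, N, d, gamma=0.01, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    table = np.stack([grad_gi(x, i) for i in range(N)])  # stored gradient per component
    avg = table.mean(axis=0)                             # running average of the table
    for _ in range(iters):
        i = rng.integers(N)
        g_new = grad_gi(x, i)
        # variance-reduced direction: fresh gradient - stale gradient + table average
        direction = g_new - table[i] + avg
        x = soft_threshold(x - gamma * direction, gamma * mu)
        avg = avg + (g_new - table[i]) / N               # keep the average in sync
        table[i] = g_new
    return x
```

The `table` array is exactly the $O(N)$ gradient storage that the next slide asks whether we can avoid.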

17 Introduction / Related Works: Related Works in the Convex Case

- Other related methods: MISO [Mairal, 2013], SDCA [Shalev-Shwartz and Zhang, 2013].
- But these all introduce memory requirements.
- Is it possible to achieve convergence rates similar to SAG, but without the additional storage? Yes: the stochastic variance reduced gradient (SVRG) method was proposed in [Johnson and Zhang, 2013].

18 Introduction / Related Works: Related Works in the Nonconvex Case

Classic algorithms:
- Incremental methods [Bertsekas, 2011]
- SGD-based algorithms [Ghadimi and Lan, 2013]

Recent fast algorithms:
- Nonconvex SVRG [Zhu et al., 2016; Reddi et al., 2016]
- Nonconvex SAGA [Reddi et al., 2016]

The latter two algorithms only handle the smooth case.

19 Introduction / Applications: High Dimensional Regression

- $y \in \mathbb{R}^M$: observation vector generated as $y = X\nu + \epsilon$.
- $X \in \mathbb{R}^{M \times P}$: covariate matrix; $\nu \in \mathbb{R}^P$: ground truth; $\epsilon \in \mathbb{R}^M$: noise.
- We only know a noisy estimate of $X$: $A = X + W$, where $W \in \mathbb{R}^{M \times P}$ is noise.
- Goal: estimate the ground truth $\nu$.
- Set $\hat{\Gamma} := \frac{1}{M} A^\top A - \Sigma_W$ and $\hat{\gamma} := \frac{1}{M} A^\top y$.
- Nonconvex optimization formulation [Loh and Wainwright, 2012]:
  $\min_{z} \; z^\top \hat{\Gamma} z - \hat{\gamma}^\top z \quad \text{s.t.} \; \|z\|_1 \le R.$
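The objective above is easy to assemble once $\Sigma_W$ is available; the sketch below is illustrative (the helper names are not from the paper), and it only builds the quadratic, leaving the $\ell_1$-ball constraint to the solver.

```python
# Minimal sketch: build (Gamma_hat, gamma_hat) for min_z z^T Gamma_hat z - gamma_hat^T z.
import numpy as np

def corrected_quadratic(A, y, Sigma_W):
    """A: noisy covariates (M x P), y: observations (M,), Sigma_W: noise covariance of W."""
    M = A.shape[0]
    Gamma_hat = (A.T @ A) / M - Sigma_W   # may be indefinite, hence a nonconvex objective
    gamma_hat = (A.T @ y) / M
    return Gamma_hat, gamma_hat

def objective(z, Gamma_hat, gamma_hat):
    # quadratic value; the constraint ||z||_1 <= R is enforced by the algorithm, not here
    return z @ Gamma_hat @ z - gamma_hat @ z
```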

20 Outline, Part 2: The Proposed Approach (Convergence Analysis, Comments)

22 The Proposed Approach: A Reformulation

- Start from a distributed optimization setting: split $z$ into $N$ new variables.
- Consider the following reformulation:
  $\min_{x,\, z \in \mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} g_i(x_i) + g_0(z) + h(z) \quad \text{s.t.} \; x_i = z, \; i = 1,\dots,N, \qquad (2)$
  where $h(z) := \iota_Z(z) + p(z)$ and $x := [x_1; \dots; x_N]$.
- $N$ distributed agents, one central controller.

26 The Proposed Approach: A Reformulation

Distributed setting:
[Figure: a central node handling $g_0$ and $h$, connected to $N$ agent nodes handling $g_1, g_2, g_3, \dots, g_N$.]

27 The Proposed Approach: Assumptions

- A-(a) $f(z)$ is bounded from below, i.e., $\underline{f} := \min_{z \in Z} f(z) > -\infty$.
- A-(b) Define $g(z) := \frac{1}{N}\sum_{i=1}^{N} g_i(z)$; suppose the $g_i$'s and $g$ have Lipschitz continuous gradients, i.e.,
  $\|\nabla g_i(y) - \nabla g_i(z)\| \le L_i \|y - z\|, \quad \forall\, y, z \in \operatorname{dom}(g_i), \; i = 0,\dots,N,$
  $\|\nabla g(y) - \nabla g(z)\| \le L \|y - z\|, \quad \forall\, y, z \in \operatorname{dom}(g).$

From Assumption A-(b): $L \le \frac{1}{N}\sum_{i=1}^{N} L_i$, and the equality can be achieved in the worst case.

The central node is able to broadcast to every node, and each node is able to send messages to the central node.

31 The Proposed Approach: The Proposed Algorithm, Preliminaries

Augmented Lagrangian:
$$L(x, z; \lambda) = \sum_{i=1}^{N}\left( \frac{1}{N} g_i(x_i) + \langle \lambda_i,\, x_i - z\rangle + \frac{\eta_i}{2}\|x_i - z\|^2 \right) + g_0(z) + h(z)$$
- $\lambda := \{\lambda_i\}_{i=1}^{N}$ is the set of dual variables.
- $\eta := \{\eta_i > 0\}_{i=1}^{N}$ are penalty parameters.
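For reference, a minimal sketch of evaluating this augmented Lagrangian; `g_list`, `g0`, `h` are user-supplied callables and the array shapes are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: evaluate L(x, z; lambda) for x, lam of shape (N, d) and z of shape (d,).
import numpy as np

def augmented_lagrangian(x, z, lam, g_list, g0, h, eta):
    N = len(g_list)
    val = g0(z) + h(z)
    for i in range(N):
        diff = x[i] - z
        val += g_list[i](x[i]) / N + lam[i] @ diff + 0.5 * eta[i] * diff @ diff
    return val
```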

32 The Proposed Approach: The Proposed Algorithm, Preliminaries

Consider $L(x, z; \lambda)$ above. For variable $x_i$, define
$$V_i(x_i, z; \lambda_i) = \frac{1}{N} g_i(z) + \frac{1}{N}\langle \nabla g_i(z),\, x_i - z\rangle + \langle \lambda_i,\, x_i - z\rangle + \frac{\alpha_i \eta_i}{2}\|x_i - z\|^2.$$
Remarks:
1. We approximate $\frac{1}{N} g_i(x_i)$ by its linear approximation around $z$ (the first two terms above).
2. The parameter $\alpha_i$ gives some freedom to the algorithm.

33 The Proposed Approach: The NESTT Algorithm

We propose a NonconvEx primal-dual SpliTTing algorithm (NESTT).

The NESTT-G Algorithm. For iterations $r = 1, \dots, R$:
- Pick $i_r \in \{1, 2, \dots, N\}$ with probability $p_{i_r}$ and update the primal-dual pairs:
  $x_{i_r}^{r+1} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda_{i_r}^r);$
  $\lambda_{i_r}^{r+1} = \lambda_{i_r}^r + \alpha_{i_r}\eta_{i_r}\,(x_{i_r}^{r+1} - z^r);$
  $\lambda_j^{r+1} = \lambda_j^r, \quad x_j^{r+1} = z^r, \quad j \neq i_r.$
- Update $z$:
  $z^{r+1} = \arg\min_{z \in Z} L(\{x_i^{r+1}\}, z; \lambda^r).$

Output: $(z^m, x^m, \lambda^m)$, where $m$ is randomly picked from $\{1, 2, \dots, R\}$.
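The following is one possible reading of NESTT-G in code, assuming $g_0 = 0$, $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$ so that both the $x$-step and the $z$-step have closed forms (the $z$-step becomes a soft-threshold). It is an illustrative sketch, not the authors' reference implementation; `grad_gi`, `p`, `alpha`, `eta` are supplied by the caller.

```python
# Minimal NESTT-G sketch under the assumptions stated above.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nestt_g(grad_gi, N, d, p, alpha, eta, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    x = np.zeros((N, d))
    lam = np.zeros((N, d))
    for _ in range(iters):
        i = rng.choice(N, p=p)
        g_i = grad_gi(z, i) / N
        # x-step: closed-form minimizer of the linearized local problem V_i;
        # all other blocks are reset to the current z.
        x[:] = z
        x[i] = z - (lam[i] + g_i) / (alpha[i] * eta[i])
        # z-step (uses the old duals lam^r): with g_0 = 0 and h = mu*||.||_1 it is the
        # prox of h at the eta-weighted average of x_i + lam_i / eta_i.
        u = (eta[:, None] * x + lam).sum(axis=0) / eta.sum()
        z_new = soft_threshold(u, mu / eta.sum())
        # dual step for the sampled block; it ends up storing -grad g_i(z^r)/N
        lam[i] = lam[i] + alpha[i] * eta[i] * (x[i] - z)
        z = z_new
    return z
```

The theorem later in the talk suggests choosing $p_i = \alpha_i \propto \sqrt{L_i/N}$; any such choice can simply be passed in through `p`, `alpha`, `eta`.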

34 The Proposed Approach: The NESTT Algorithm

- First, each node $i$ stores $(x_i, \lambda_i)$.
- When the algorithm starts, the central node broadcasts $z^r$ to every node.
[Figure: the central node sends $z^r$ to the nodes storing $(x_1, \lambda_1), (x_2, \lambda_2), (x_3, \lambda_3), \dots, (x_N, \lambda_N)$, including the sampled node $(x_{i_r}, \lambda_{i_r})$.]

35 The Proposed Approach: The NESTT Algorithm

Randomly pick $i_r$ and update $x_{i_r}$ and the associated $\lambda_{i_r}$:
  $x_{i_r}^{r+1} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda_{i_r}^r); \quad \lambda_{i_r}^{r+1} = \lambda_{i_r}^r + \alpha_{i_r}\eta_{i_r}\,(x_{i_r}^{r+1} - z^r)$
[Figure: only the sampled node $(x_{i_r}, \lambda_{i_r})$ is updated; the other nodes and the central node keep their variables.]

36 The Proposed Approach: The NESTT Algorithm

Aggregate $(x, \lambda)$ at the central node and update $z$:
  $z^{r+1} = \arg\min_{z \in Z} L(\{x_i^{r+1}\}, z; \lambda^r)$
[Figure: the nodes $(x_1, \lambda_1), \dots, (x_N, \lambda_N)$ send their information to the central node, which computes $z^{r+1}$.]

37 The Proposed Approach / Convergence Analysis: Main Steps of the Convergence Analysis of NESTT-G

Step 1) Build a potential function and show that it is decreasing:
$$Q^r := \sum_{i=1}^{N} \frac{1}{N} g_i(z^r) + g_0(z^r) + h(z^r) + \sum_{i=1}^{N} \frac{3 p_i \eta_i}{(\alpha_i \eta_i)^2}\left\| \frac{1}{N}\left(\nabla g_i(y_i^{r-1}) - \nabla g_i(z^r)\right)\right\|^2$$
Step 2) Show that a certain optimality gap decreases in a sublinear manner.

39 The Proposed Approach / Convergence Analysis: Convergence Analysis of NESTT

Lemma (Sufficient Descent). Suppose Assumption A holds, and pick
$$\alpha_i = p_i = \beta \eta_i, \quad \text{where } \beta := \frac{1}{\sum_{i=1}^{N} \eta_i}, \quad \text{and} \quad \eta_i \ge \frac{9 L_i}{N p_i}. \qquad (3)$$
Then the following descent estimate holds true for NESTT:
$$\mathbb{E}\left[Q^r - Q^{r-1} \,\middle|\, \mathcal{F}^{r-1}\right] \le -\sum_{i=1}^{N} \eta_i\, \mathbb{E}\|z^r - z^{r-1}\|^2 - \sum_{i=1}^{N} \frac{1}{\eta_i}\left\|\frac{1}{N}\left(\nabla g_i(z^{r-1}) - \nabla g_i(y_i^{r-2})\right)\right\|^2. \qquad (4)$$

40 The Proposed Approach / Convergence Analysis: Convergence Analysis of NESTT

Sublinear convergence of NESTT.

Definition. The proximal gradient of problem (1) is given by (for any $\gamma > 0$)
$$\tilde{\nabla} f_\gamma(z) := \gamma\left(z - \operatorname{prox}^{\gamma}_{p + \iota_Z}\!\left[z - \tfrac{1}{\gamma}\left(\nabla g(z) + \nabla g_0(z)\right)\right]\right), \quad \text{where} \quad \operatorname{prox}^{\gamma}_{p + \iota_Z}[z] := \arg\min_{u \in Z}\; p(u) + \frac{\gamma}{2}\|z - u\|^2.$$
Optimality gap:
$$\mathbb{E}[G^r] := \mathbb{E}\left[\left\|\tilde{\nabla} f_{1/\beta}(z^r)\right\|^2\right] = \frac{1}{\beta^2}\,\mathbb{E}\left[\left\|z^r - \operatorname{prox}^{1/\beta}_{h}\!\left[z^r - \beta\left(\nabla g(z^r) + \nabla g_0(z^r)\right)\right]\right\|^2\right].$$
$G = 0$ if and only if a first-order stationary solution is obtained.
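Under the simplifying assumptions $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$, the gap above reduces to one soft-threshold step and is cheap to monitor; the sketch below is illustrative only, with `grad_g` and `grad_g0` as user-supplied gradient oracles.

```python
# Minimal sketch of the proximal-gradient optimality gap ||tilde_grad f_{1/beta}(z)||^2.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def optimality_gap(z, grad_g, grad_g0, beta, mu):
    """Vanishes exactly when z is a first-order stationary solution."""
    forward = z - beta * (grad_g(z) + grad_g0(z))    # gradient step of length beta
    prox_point = soft_threshold(forward, beta * mu)  # prox of mu*||.||_1 with parameter 1/beta
    return float(np.sum((z - prox_point) ** 2)) / beta ** 2
```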

41 The Proposed Approach / Convergence Analysis: Convergence Rate Analysis of NESTT

Theorem (Main Theorem). Suppose Assumption A holds, and pick (for $i = 1,\dots,N$)
$$\alpha_i = p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^{N}\sqrt{L_j/N}}, \qquad \eta_i = 3\sqrt{L_i/N}\,\sum_{j=1}^{N}\sqrt{L_j/N}.$$
1) Every limit point of NESTT is a stationary solution of problem (2) w.p. 1.
2)
$$\underbrace{\mathbb{E}[G^m]}_{\text{optimality gap}} + \underbrace{\mathbb{E}\left[\sum_{i=1}^{N} 3\eta_i^2\, \|x_i^m - z^{m-1}\|^2\right]}_{\text{constraint violation}} \le \frac{80}{3}\Big(\sum_{i=1}^{N}\sqrt{L_i/N}\Big)^2 \frac{\mathbb{E}[Q^0 - Q^R]}{R}.$$

42 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT

Assumption B:
- B-(a) Each function $g_i(z)$ is a quadratic of the form $g_i(z) = \frac{1}{2} z^\top A_i z + \langle b, z\rangle$, where $A_i$ is a symmetric matrix but not necessarily positive semidefinite;
- B-(b) The feasible set $Z$ is a closed compact polyhedral set;
- B-(c) The nonsmooth function is $p(z) = \mu\|z\|_1$ for some $\mu \ge 0$.

43 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT

Theorem (Linear Convergence of NESTT). Suppose that Assumptions A and B are satisfied. Then the sequence $\{\mathbb{E}[Q^{r+1}]\}_{r=1}^{\infty}$ converges Q-linearly to some $Q^* = f(z^*)$, where $z^*$ is a stationary solution of problem (1). That is, there exist a finite $\bar{r} > 0$ and $\rho \in (0,1)$ such that for all $r \ge \bar{r}$,
$$\mathbb{E}[Q^{r+1} - Q^*] < \rho\, \mathbb{E}[Q^r - Q^*].$$

44 Outline, Part 3: Connections with Existing Works

45 Connections with Existing Works: Connections and Comparisons with Existing Works

NESTT can be written in the following compact form:
$$u^{r+1} := z^r - \beta\left(\underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla g_i(y_i^{r-1})}_{\text{average of the gradients}} + \underbrace{\frac{1}{N\alpha_{i_r}}\left(\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y_{i_r}^{r-1})\right)}_{\text{gradient update}}\right)$$
$$z^{r+1} = \arg\min_{z}\; h(z) + g_0(z) + \frac{1}{2\beta}\|z - u^{r+1}\|^2$$
Special cases (with $h(z) = 0$ and $g_0(z) = 0$):
- IAG [Blatt 07]: cyclic selection of $i_r$
- SAG [Schmidt 12]: $p_i = 1/N$
- SAGA [Defazio 14]: $\alpha_i = 1/N$, $p_i = 1/N$
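The compact form is what makes the comparison in the list above possible; here is an illustrative sketch of it (again assuming $g_0 = 0$ and $h(z) = \mu\|z\|_1$, not the authors' code), where the cached gradients play the role of the dual variables.

```python
# Minimal sketch of the compact ("primal-only") form of NESTT-G.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nestt_g_compact(grad_gi, N, d, alpha, p, beta, mu=1e-3, iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    y_grads = np.zeros((N, d))   # (1/N) grad g_i at the last point where it was evaluated
    for _ in range(iters):
        i = rng.choice(N, p=p)
        fresh = grad_gi(z, i) / N
        # u^{r+1}: average of the cached gradients plus a rescaled correction term
        u = z - beta * (y_grads.sum(axis=0) + (fresh - y_grads[i]) / alpha[i])
        z = soft_threshold(u, beta * mu)   # the prox step replaces the argmin over z
        y_grads[i] = fresh                 # the cache (i.e. the dual variable) remembers this gradient
    return z
```

With $\alpha_i = p_i = 1/N$, the correction term becomes the full gradient difference $\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y_{i_r}^{r-1})$, which is exactly the SAGA-type direction noted in the list above.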

46 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations

- Based on the classical theory of GD, to reach an $\epsilon$-stationary solution GD must be run for $O(L/\epsilon)$ iterations.
- The number of gradient evaluations per GD step is $N$.
- In the worst case ($\frac{1}{N}\sum_{i=1}^{N} L_i = L$), the total number of gradient evaluations is $O\big(\sum_{i=1}^{N} L_i/\epsilon\big)$.

47 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations

Number of gradient evaluations: NESTT $O\big((\sum_{i=1}^{N}\sqrt{L_i/N})^2/\epsilon\big)$, GD $O\big(\sum_{i=1}^{N} L_i/\epsilon\big)$.

- Case I ($L_i = 1$ for all $i$): NESTT $O(N/\epsilon)$, GD $O(N/\epsilon)$.
- Case II ($O(N^{2/3})$ terms with $L_i = N^{2/3}$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^{4/3}/\epsilon)$.
- Case III ($O(\sqrt{N})$ terms with $L_i = N$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^{3/2}/\epsilon)$.
- Case IV ($O(1)$ terms with $L_i = N^2$, rest $L_i = 1$): NESTT $O(N/\epsilon)$, GD $O(N^2/\epsilon)$.
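As a quick check (not on the slides), the Case II entries follow directly from the header formulas: $\sum_{i=1}^{N}\sqrt{L_i/N} \approx N^{2/3}\sqrt{N^{2/3}/N} + N\sqrt{1/N} = N^{1/2} + N^{1/2} = O(\sqrt{N})$, so NESTT needs $O\big((\sqrt{N})^2/\epsilon\big) = O(N/\epsilon)$ gradient evaluations, while $\sum_{i=1}^{N} L_i \approx N^{2/3}\cdot N^{2/3} + N = O(N^{4/3})$, so GD needs $O(N^{4/3}/\epsilon)$.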

48 Outline, Part 4: Numerical Results

49 Numerical Results: High Dimensional Regression

Comparison of NESTT, SAGA, and SGD. The x-axis denotes the number of passes over the dataset (# Grad/N), the y-axis the optimality gap. Left: uniform sampling ($p_i = 1/N$) with $N_i = N_j$. Right: non-uniform sampling ($p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^{N}\sqrt{L_j/N}}$) with $N_i \neq N_j$.

[Figure: two log-scale plots of optimality gap vs. # Grad/N for SGD, NESTT-E($\alpha = 10$), NESTT-E($\alpha = 1$), NESTT-G, and SAGA; left panel uniform sampling, right panel non-uniform sampling.]

50 Numerical Results: High Dimensional Regression

[Table: optimality gap $\|\tilde{\nabla} f_{1/\beta}(z^r)\|^2$ for different algorithms (SGD, NESTT-E ($\alpha = 10$), NESTT, SAGA) after 100 passes over the datasets, under uniform and non-uniform sampling and several problem sizes $N$.]

In both the uniform and the non-uniform case, NESTT outperforms the other methods.

51 Numerical Results: Comparison with GD (Worst Case)

- Consider the distributed SPCA problem.
- Split a matrix into diagonal sub-matrices.
[Figure: comparison of NESTT and GD.]

52 Conclusion

- A primal-dual based algorithm for minimizing nonconvex finite sums.
- It converges sublinearly in general, and Q-linearly for certain nonconvex $\ell_1$-penalized quadratic problems.
- IAG/SAG/SAGA are closely connected with NESTT.
- Non-uniform sampling helps improve the convergence rates.
- Key insight: primal-dual based algorithms can offer faster convergence than primal-only algorithms.

53 Thank You!

54 Comments: Linear Convergence

- The linear convergence result relies on a key error bound condition from [Luo and Tseng, 1992].
- It establishes linear convergence to stationary solutions.
- This is different from a few recent works showing linear convergence for "nonconvex" problems [Zhu and Hazan, 2016], [Reddi et al., 2016], [Karimi and Schmidt, 2016]:
  1. They are based on certain quadratic growth conditions.
  2. For smooth unconstrained problems under such conditions, every stationary point is a global minimum.
  3. They do not cover our nonconvex QP setting.

55 Connections and Comparisons with Existing Works

One can verify that the $x_i$-update is given in closed form by
$$x_j^{r+1} = z^r - \frac{1}{\alpha_j \eta_j}\left(\lambda_j^r + \frac{1}{N}\nabla g_j(y_j^r)\right). \qquad (5)$$
The dual update is then given by
$$\lambda_i^{r+1} = -\frac{1}{N}\nabla g_i(y_i^r). \qquad (6)$$
(Substituting (5) into $\lambda_i^{r+1} = \lambda_i^r + \alpha_i\eta_i(x_i^{r+1} - z^r)$ cancels $\lambda_i^r$, leaving only the gradient term.)

Observation: the dual variables remember the past gradients of the $g_i$'s.
