Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
1 Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Davood Hajinezhad, Iowa State University
2 Co-Authors
Mingyi Hong (Iowa State University), Tuo Zhao (Johns Hopkins University), Zhaoran Wang (Princeton University).
To appear: NIPS 2016.
3 Introduction: Outline
1. Introduction (Motivation, Related Works, Applications)
2. The Proposed Approach (Convergence Analysis, Comments)
3. Connections with Existing Works
4. Numerical Results
5 Introduction / Motivation: Problem Formulation
Consider:
$$\min_{z\in Z} \; f(z) := \underbrace{\frac{1}{N}\sum_{i=1}^{N} g_i(z)}_{\text{empirical cost}} + \underbrace{g_0(z) + p(z)}_{\text{regularizer}} \qquad (1)$$
- $g_i$, $i = 1,\dots,N$, are cost functions; $g_0$ and $p$ are regularizers.
- $p(z)$: lower semi-continuous, convex, possibly nonsmooth.
- $Z$: closed and convex set.
- $g_i$: $L_i$-smooth, possibly nonconvex.
9 Introduction / Motivation: Motivation
- When $N$ is very large, we would like to optimize one component function at a time, i.e., to decompose across the functions.
Questions:
1. Can we find fast algorithms that achieve such decomposition for nonconvex problems?
2. Can we rigorously characterize their convergence rates?
Main idea: utilize both primal and dual information.
13 Introduction / Motivation: This Work
- We study the problem from a primal-dual perspective.
- We develop a stochastic algorithm that achieves sublinear convergence rates to first-order stationary solutions.
- Its scaling constants are up to $O(N)$ better than those of the (batch) gradient method.
- For special quadratic problems, it achieves linear convergence.
- We identify connections between the proposed primal-dual approach and a few "primal-only" methods such as IAG/SAG/SAGA.
14 Introduction / Related Works: Related Works in the Convex Case
Extensive literature in the convex case.
Gradient Descent (GD):
$$x^{r+1} = x^r - \alpha \frac{1}{N}\sum_{i=1}^{N} \nabla g_i(x^r)$$
Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]: randomly pick an index $i_r \in \{1,\dots,N\}$ and set
$$x^{r+1} = x^r - \eta_r \nabla g_{i_r}(x^r)$$
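The two updates above can be contrasted on a toy instance. Everything below is a hypothetical illustration, not from the slides: a 1-D least-squares finite sum $g_i(x) = (x - a_i)^2/2$, whose minimizer is the mean of the $a_i$.

```python
import random

# Hypothetical 1-D finite sum: g_i(x) = (x - a_i)^2 / 2, so grad g_i(x) = x - a_i
# and the minimizer of (1/N) sum_i g_i is mean(a) = 2.5.
a = [1.0, 2.0, 3.0, 4.0]
N = len(a)

def grad(i, x):
    return x - a[i]

# Batch gradient descent: every iteration touches all N gradients.
x = 0.0
for _ in range(100):
    x -= 0.5 * sum(grad(i, x) for i in range(N)) / N

# Stochastic gradient descent: one randomly sampled gradient per iteration,
# with a diminishing step size eta_r = 1/r.
random.seed(0)
y = 0.0
for r in range(1, 2001):
    i_r = random.randrange(N)
    y -= (1.0 / r) * grad(i_r, y)

print(round(x, 4))          # GD: essentially exact
print(abs(y - 2.5) < 0.2)   # SGD: close, but still noisy
```

This makes the cost trade-off visible: each GD step costs $N$ gradient evaluations, while SGD pays one per step at the price of noise.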
15 Introduction / Related Works: Related Works in the Convex Case
Stochastic Average Gradient (SAG) [Le Roux et al., 2012]: consider the special case of problem (1) with $p(x) = 0$, $g_0(x) = 0$. Randomly pick an index $i_r \in \{1,\dots,N\}$:
$$y^r_{i_r} = x^r, \qquad y^r_j = y^{r-1}_j, \; j \neq i_r$$
$$x^{r+1} = x^r - \frac{1}{\eta N}\underbrace{\sum_{i=1}^{N} \nabla g_i(y^r_i)}_{\text{average of past gradients}} = x^r - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y^{r-1}_i) + \frac{1}{\eta N}\big(\nabla g_{i_r}(y^{r-1}_{i_r}) - \nabla g_{i_r}(x^r)\big)$$
Only need to track the sum $\sum_{i=1}^N \nabla g_i(y^{r-1}_i)$ and each of its components.
16 Introduction / Related Works: Related Works in the Convex Case
The SAGA algorithm [Defazio et al., 2014]: randomly pick an index $i_k \in \{1,\dots,N\}$:
$$y^k_{i_k} = x^k, \qquad y^k_j = y^{k-1}_j, \; j \neq i_k$$
$$w^{k+1} = x^k - \frac{1}{\eta N}\sum_{i=1}^{N} \nabla g_i(y^{k-1}_i) + \frac{1}{\eta}\big(\nabla g_{i_k}(y^{k-1}_{i_k}) - \nabla g_{i_k}(x^k)\big)$$
$$x^{k+1} = \mathrm{prox}^{1/\eta}_h(w^{k+1}) \quad \text{(deals with the nonsmooth part)}$$
It has better convergence rates, but again requires storage of the past $N$ gradients.
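As a concrete sketch of the gradient-table mechanism, here is a hypothetical 1-D instance (not from the slides): $g_i(x) = (x - a_i)^2/2$ with nonsmooth part $\mu|x|$, whose prox is soft-thresholding; the stationary point is $\mathrm{mean}(a) - \mu = 1.5$.

```python
import random

def soft_threshold(v, t):
    # proximal operator of t * |.|
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

# Hypothetical instance: g_i(x) = (x - a_i)^2 / 2, nonsmooth part mu * |x|.
a = [1.0, 2.0, 3.0, 4.0]
N, mu, step = len(a), 1.0, 0.3      # "step" plays the role of 1/eta

x = 0.0
table = [x - ai for ai in a]        # stored gradients grad g_i(y_i)
avg = sum(table) / N                # their running average

random.seed(1)
for _ in range(3000):
    i = random.randrange(N)
    g_new = x - a[i]                             # fresh gradient of g_i
    w = x - step * (g_new - table[i] + avg)      # variance-reduced direction
    x = soft_threshold(w, step * mu)             # prox handles mu * |x|
    avg += (g_new - table[i]) / N                # keep the average in sync
    table[i] = g_new

print(abs(x - 1.5) < 0.01)          # converges to mean(a) - mu = 1.5
```

The point of the table is that `avg` is maintained in $O(1)$ per iteration, which is exactly what creates the $O(N)$ storage cost mentioned above.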
17 Introduction / Related Works: Related Works in the Convex Case
Other related methods: MISO [Mairal, 2013]; SDCA [Shalev-Shwartz and Zhang, 2013]. But these all introduce memory requirements.
Is it possible to achieve convergence rates similar to SAG, but without the additional storage? Yes: the stochastic variance reduced gradient (SVRG) method was proposed in [Johnson-Zhang 13].
18 Introduction / Related Works: Related Works in the Nonconvex Case
Classic algorithms: incremental methods [Bertsekas 11]; SGD-based algorithms [Ghadimi-Lan 13].
Recent fast algorithms: nonconvex SVRG [Zhu et al 16, Reddi et al 16]; nonconvex SAGA [Reddi et al 16].
The latter two algorithms only deal with smooth problems.
19 Introduction / Applications: High-Dimensional Regression
- $y \in \mathbb{R}^M$: observation vector generated as $y = X\nu + \epsilon$.
- $X \in \mathbb{R}^{M\times P}$: covariate matrix; $\nu \in \mathbb{R}^P$: ground truth; $\epsilon \in \mathbb{R}^M$: noise.
- Only an estimate of $X$ is known: $A = X + W$, where $W \in \mathbb{R}^{M\times P}$ is the noise.
- Goal: estimate the ground truth $\nu$.
Set $\hat\Gamma := \frac{1}{M} A^\top A - \Sigma_W$ and $\hat\gamma := \frac{1}{M} A^\top y$. Nonconvex optimization [Loh, Wainwright 12]:
$$\min_{z} \; z^\top \hat\Gamma z - \hat\gamma^\top z \quad \text{s.t.} \quad \|z\|_1 \le R.$$
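The nonconvexity comes from the bias correction: subtracting $\Sigma_W$ can make $\hat\Gamma$ indefinite (e.g. when $M < P$). A hypothetical 1-D illustration of the correction itself, with entirely synthetic data:

```python
import random

# Synthetic 1-D version: covariates observed through noise, a_m = x_m + w_m.
# The naive second moment (1/M) sum a_m^2 is biased upward by sigma_w^2;
# subtracting sigma_w^2 (the 1-D analogue of Gamma_hat = A'A/M - Sigma_W)
# recovers E[x^2].  In higher dimensions this subtraction can make the
# estimated matrix indefinite, which is the source of nonconvexity.
random.seed(0)
M, sigma_w = 5000, 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(M)]
a = [xm + random.gauss(0.0, sigma_w) for xm in xs]

naive = sum(v * v for v in a) / M          # roughly 1 + sigma_w^2 = 5
corrected = naive - sigma_w ** 2           # roughly E[x^2] = 1

print(naive > 4.0, abs(corrected - 1.0) < 0.5)
```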
20 The Proposed Approach: Outline
1. Introduction (Motivation, Related Works, Applications)
2. The Proposed Approach (Convergence Analysis, Comments)
3. Connections with Existing Works
4. Numerical Results
22 The Proposed Approach: A Reformulation
- Start from a distributed-optimization setting: split $z$ into $N$ new variables.
- Consider the following reformulation:
$$\min_{x, z\in\mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} g_i(x_i) + g_0(z) + h(z) \quad \text{s.t.} \quad x_i = z, \; i = 1,\dots,N \qquad (2)$$
where $h(z) := \iota_Z(z) + p(z)$ and $x := [x_1; \dots; x_N]$.
- $N$ distributed agents, one central controller.
26 The Proposed Approach: A Reformulation
Figure: distributed (star) setting, with $N$ agents holding $g_1, \dots, g_N$ connected to a central node holding $g_0$ and $h$.
27 The Proposed Approach: Assumptions
A-(a) $f(z)$ is bounded from below, i.e., $\underline{f} := \min_{z\in Z} f(z) > -\infty$.
A-(b) Define $g(z) := \frac{1}{N}\sum_{i=1}^{N} g_i(z)$. Suppose the $g_i$'s and $g$ have Lipschitz continuous gradients, i.e.,
$$\|\nabla g_i(y) - \nabla g_i(z)\| \le L_i \|y - z\|, \quad \forall\, y, z \in \mathrm{dom}(g_i), \; i = 0,\dots,N,$$
$$\|\nabla g(y) - \nabla g(z)\| \le L \|y - z\|, \quad \forall\, y, z \in \mathrm{dom}(g).$$
From Assumption A-(b), $L \le \frac{1}{N}\sum_{i=1}^{N} L_i$, and equality can be achieved in the worst case.
The central node can broadcast to every node, and each node can send messages to the central node.
31 The Proposed Approach: The Proposed Algorithm, Preliminaries
Augmented Lagrangian:
$$L(x, z; \lambda) = \sum_{i=1}^{N} \Big( \frac{1}{N} g_i(x_i) + \langle \lambda_i, x_i - z\rangle + \frac{\eta_i}{2}\|x_i - z\|^2 \Big) + g_0(z) + h(z)$$
- $\lambda := \{\lambda_i\}_{i=1}^N$ is the set of dual variables.
- $\eta := \{\eta_i > 0\}_{i=1}^N$ are penalty parameters.
32 The Proposed Approach: The Proposed Algorithm, Preliminaries
Consider $L(x, z; \lambda)$ above. For variable $x_i$, define
$$V_i(x_i, z; \lambda_i) = \underbrace{\frac{1}{N} g_i(z) + \frac{1}{N}\langle \nabla g_i(z), x_i - z\rangle}_{\text{an approximation of } \frac{1}{N} g_i(x_i)} + \langle \lambda_i, x_i - z\rangle + \frac{\alpha_i \eta_i}{2}\|x_i - z\|^2.$$
Remarks:
1. We have replaced $\frac{1}{N} g_i(x_i)$ by its linear approximation around $z$.
2. The parameter $\alpha_i$ gives some freedom to the algorithm.
33 The Proposed Approach: NESTT Algorithm
We propose a NonconvEx primal-dual SpliTTing algorithm (NESTT).
The NESTT-G Algorithm. For iterations $r = 1, \dots, R$:
Pick $i_r \in \{1, \dots, N\}$ with probability $p_{i_r}$ and update $(x_j, \lambda_j)$:
$$x^{r+1}_{i_r} = \arg\min_{x_{i_r}} V_{i_r}\big(x_{i_r}, z^r, \lambda^r_{i_r}\big);$$
$$\lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}\big(x^{r+1}_{i_r} - z^r\big);$$
$$\lambda^{r+1}_j = \lambda^r_j, \quad x^{r+1}_j = z^r, \quad \forall\, j \neq i_r.$$
Update $z$:
$$z^{r+1} = \arg\min_{z\in Z} L\big(\{x^{r+1}_i\}, z; \lambda^r\big).$$
Output: $(z^m, x^m, \lambda^m)$, where $m$ is picked at random from $\{1, 2, \dots, R\}$.
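The steps above can be sketched on a hypothetical 1-D instance of reformulation (2), with $g_i(x) = (x - a_i)^2/2$, $g_0 = 0$, $h(z) = \mu|z|$, and $Z = \mathbb{R}$; both subproblems then have closed forms (the $x$-step from the quadratic surrogate $V_i$, the $z$-step a soft-thresholding). The data, penalty choices, and iteration count below are illustrative assumptions, not the slides' experiments:

```python
import random

def soft_threshold(v, t):
    # prox of t * |.|
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

# Hypothetical 1-D instance of reformulation (2):
# g_i(x) = (x - a_i)^2 / 2 (so L_i = 1), g_0 = 0, h(z) = mu * |z|, Z = R.
# Stationary point: mean(a) - mu = 1.5.
a = [1.0, 2.0, 3.0, 4.0]
N, mu = len(a), 1.0

p = [1.0 / N] * N                          # uniform sampling
alpha = [1.0 / N] * N
eta = [9.0 * 1.0 / (N * pi) for pi in p]   # eta_i >= 9 L_i / (N p_i)
S = sum(eta)

x = [0.0] * N                              # local copies x_i
lam = [0.0] * N                            # dual variables lambda_i
z = 0.0

random.seed(2)
for _ in range(4000):
    i = random.randrange(N)
    # x-step: x_j = z^r for j != i_r; x_{i_r} minimizes the surrogate V_{i_r}
    x = [z] * N
    x[i] = z - (lam[i] + (z - a[i]) / N) / (alpha[i] * eta[i])
    # dual step: after this, lam[i] equals -grad g_i(z^r) / N
    lam[i] += alpha[i] * eta[i] * (x[i] - z)
    # z-step: minimize h(z) + sum_i [ <lam_i, x_i - z> + (eta_i/2)(x_i - z)^2 ]
    z = soft_threshold(sum(eta[j] * x[j] + lam[j] for j in range(N)) / S, mu / S)

print(abs(z - 1.5) < 0.01)
```

Only the sampled node touches its $(x_i, \lambda_i)$ pair each round; the untouched $\lambda_j$'s are what carry the memory of past gradients.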
34 The Proposed Approach: NESTT Algorithm
- First, each node $i$ stores $(x_i, \lambda_i)$.
- When the algorithm starts, the central node broadcasts $z^r$ to every node.
Figure: the central node sends $z^r$ to nodes $1, \dots, N$.
35 The Proposed Approach: NESTT Algorithm
Randomly pick $i_r$, then update $x_{i_r}$ and the associated $\lambda_{i_r}$:
$$x^{r+1}_{i_r} = \arg\min_{x_{i_r}} V_{i_r}\big(x_{i_r}, z^r, \lambda^r_{i_r}\big); \qquad \lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}\big(x^{r+1}_{i_r} - z^r\big)$$
Figure: only node $i_r$ updates its local pair $(x_{i_r}, \lambda_{i_r})$.
36 The Proposed Approach: NESTT Algorithm
Aggregate $(x, \lambda)$ at the central node and update $z$:
$$z^{r+1} = \arg\min_{z\in Z} L\big(\{x^{r+1}_i\}, z; \lambda^r\big)$$
Figure: the nodes' $(x_i, \lambda_i)$ pairs are aggregated to produce $z^{r+1}$.
37 The Proposed Approach / Convergence Analysis: Main Steps of the Convergence Analysis of NESTT-G
Step 1) Build a potential function and show that it is decreasing:
$$Q^r := \frac{1}{N}\sum_{i=1}^{N} g_i(z^r) + g_0(z^r) + h(z^r) + \sum_{i=1}^{N} \frac{3 p_i \eta_i}{(\alpha_i \eta_i)^2}\Big\|\frac{1}{N}\big(\nabla g_i(y^{r-1}_i) - \nabla g_i(z^r)\big)\Big\|^2$$
Step 2) Show that a certain optimality gap decreases at a sublinear rate.
39 The Proposed Approach / Convergence Analysis: Convergence Analysis of NESTT
Lemma (Sufficient Descent). Suppose Assumption A holds, and pick
$$\alpha_i = p_i = \beta \eta_i, \quad \text{where } \beta := \frac{1}{\sum_{i=1}^{N} \eta_i}, \quad \text{and} \quad \eta_i \ge \frac{9 L_i}{N p_i}. \qquad (3)$$
Then the following descent estimate holds for NESTT:
$$\mathbb{E}\big[Q^r - Q^{r-1} \mid \mathcal{F}^{r-1}\big] \le -\sum_{i=1}^{N} \eta_i\, \mathbb{E}\|z^r - z^{r-1}\|^2 - \sum_{i=1}^{N} \frac{1}{\eta_i}\Big\|\frac{1}{N}\big(\nabla g_i(z^{r-1}) - \nabla g_i(y^{r-2}_i)\big)\Big\|^2. \qquad (4)$$
40 The Proposed Approach / Convergence Analysis: Sublinear Convergence of NESTT
Definition. The proximal gradient of problem (1) is given by (for any $\gamma > 0$)
$$\tilde\nabla f_\gamma(z) := \gamma \Big( z - \mathrm{prox}^{\gamma}_{p+\iota_Z}\big[z - \tfrac{1}{\gamma}\big(\nabla g(z) + \nabla g_0(z)\big)\big] \Big)$$
where
$$\mathrm{prox}^{\gamma}_{p+\iota_Z}[u] := \arg\min_{z\in Z} \; p(z) + \frac{\gamma}{2}\|z - u\|^2.$$
Optimality gap:
$$\mathbb{E}[G^r] := \mathbb{E}\big[\|\tilde\nabla f_{1/\beta}(z^r)\|^2\big] = \frac{1}{\beta^2}\,\mathbb{E}\Big[\big\|z^r - \mathrm{prox}^{1/\beta}_h\big[z^r - \beta\big(\nabla g(z^r) + \nabla g_0(z^r)\big)\big]\big\|^2\Big].$$
$G = 0$ if and only if a first-order stationary solution is obtained.
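On a toy 1-D instance (hypothetical, not from the slides: $g(z)$ an average of quadratics, $p(z) = \mu|z|$, $Z = \mathbb{R}$), this measure can be computed directly, and it vanishes at the stationary point while staying positive elsewhere:

```python
def soft_threshold(v, t):
    # prox of t * |.|; here t = beta * mu, i.e. prox of p with gamma = 1/beta
    return max(abs(v) - t, 0.0) * (1.0 if v >= 0 else -1.0)

# Hypothetical instance: g(z) = (1/N) sum_i (z - a_i)^2 / 2, g_0 = 0,
# p(z) = mu * |z|, Z = R.  Stationary point: mean(a) - mu = 1.5.
a = [1.0, 2.0, 3.0, 4.0]
mu, beta = 1.0, 0.1

def gap(z):
    # (1/beta^2) || z - prox_h^{1/beta}[ z - beta (grad g(z) + grad g_0(z)) ] ||^2
    g = sum(z - ai for ai in a) / len(a)
    return ((z - soft_threshold(z - beta * g, beta * mu)) / beta) ** 2

print(gap(1.5) < 1e-20, gap(0.0) > 1.0)
```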
41 The Proposed Approach / Convergence Analysis: Convergence Rate Analysis of NESTT
Theorem (Main Theorem). Suppose Assumption A holds, and pick, for $i = 1, \dots, N$,
$$\alpha_i = p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^{N} \sqrt{L_j/N}}, \qquad \eta_i = 3\sqrt{L_i/N}\,\sum_{j=1}^{N} \sqrt{L_j/N}.$$
1) Every limit point of NESTT is a stationary solution of problem (2) w.p. 1.
2)
$$\underbrace{\mathbb{E}[G^m]}_{\text{optimality gap}} + \underbrace{\mathbb{E}\Big[\sum_{i=1}^{N} 3\eta_i^2\, \|x^m_i - z^{m-1}\|^2\Big]}_{\text{constraint violation}} \le \frac{80}{3}\Big(\sum_{i=1}^{N} \sqrt{L_i/N}\Big)^2 \frac{\mathbb{E}[Q^0 - Q^R]}{R}.$$
42 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT
Assumption B:
B-(a) Each function $g_i(z)$ is a quadratic of the form $g_i(z) = \frac{1}{2} z^\top A_i z + \langle b, z\rangle$, where $A_i$ is a symmetric matrix but not necessarily positive semidefinite.
B-(b) The feasible set $Z$ is a closed compact polyhedral set.
B-(c) The nonsmooth function is $p(z) = \mu\|z\|_1$, for some $\mu \ge 0$.
43 The Proposed Approach / Convergence Analysis: Linear Convergence of NESTT
Theorem (Linear Convergence of NESTT). Suppose Assumptions A and B are satisfied. Then the sequence $\{\mathbb{E}[Q^{r+1}]\}_{r=1}^{\infty}$ converges Q-linearly to some $Q^* = f(z^*)$, where $z^*$ is a stationary solution of problem (1). That is, there exist a finite $\bar r > 0$ and $\rho \in (0, 1)$ such that for all $r \ge \bar r$,
$$\mathbb{E}[Q^{r+1} - Q^*] < \rho\, \mathbb{E}[Q^r - Q^*].$$
45 Connections with Existing Works: Connections and Comparisons with Existing Works
NESTT can be written in the following compact form:
$$u^{r+1} := z^r - \beta\Big(\underbrace{\frac{1}{N}\sum_{i=1}^{N} \nabla g_i(y^{r-1}_i)}_{\text{average of the gradients}} + \underbrace{\frac{1}{N\alpha_{i_r}}\big(\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y^{r-1}_{i_r})\big)}_{\text{gradient update}}\Big)$$
$$z^{r+1} = \arg\min_{z} \; h(z) + g_0(z) + \frac{1}{2\beta}\|z - u^{r+1}\|^2$$
With $h \equiv 0$ and $g_0 \equiv 0$, particular choices of $\alpha_i$ and $p_i$ recover IAG [Blatt 07] (cyclic sampling), SAG [Schmidt 12], and SAGA [Defazio 14] (uniform sampling, $p_i = 1/N$).
46 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations
- By classical theory, to reach an $\epsilon$-stationary solution, GD must run for $O(L/\epsilon)$ iterations.
- The number of gradient evaluations in each step of GD is $N$.
- In the worst case ($\sum_{i=1}^{N} L_i/N = L$), the number of gradient evaluations is
$$O\Big(\sum_{i=1}^{N} L_i \big/ \epsilon\Big).$$
47 Connections with Existing Works: Comparison in Terms of the Number of Gradient Evaluations

| | NESTT | GD |
| --- | --- | --- |
| # of gradient evaluations | $O\big((\sum_{i=1}^{N} \sqrt{L_i/N})^2/\epsilon\big)$ | $O\big(\sum_{i=1}^{N} L_i/\epsilon\big)$ |
| Case I: $L_i = 1, \forall i$ | $O(N/\epsilon)$ | $O(N/\epsilon)$ |
| Case II: $O(N^{2/3})$ terms with $L_i = N^{2/3}$, rest $L_i = 1$ | $O(N/\epsilon)$ | $O(N^{4/3}/\epsilon)$ |
| Case III: $O(\sqrt{N})$ terms with $L_i = N$, rest $L_i = 1$ | $O(N/\epsilon)$ | $O(N^{3/2}/\epsilon)$ |
| Case IV: $O(1)$ terms with $L_i = N^2$, rest $L_i = 1$ | $O(N/\epsilon)$ | $O(N^2/\epsilon)$ |
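The worst-case entries of the table can be sanity-checked numerically: a quick hypothetical check that plugs the stated Lipschitz profiles into the leading constants of both bounds.

```python
import math

def nestt_const(Ls):
    # leading constant of NESTT's bound: (sum_i sqrt(L_i / N))^2
    N = len(Ls)
    return sum(math.sqrt(L / N) for L in Ls) ** 2

def gd_const(Ls):
    # leading constant of GD's worst-case bound: sum_i L_i
    return float(sum(Ls))

N = 10 ** 6
case_iv = [float(N) ** 2] + [1.0] * (N - 1)   # one component with L_i = N^2

print(nestt_const(case_iv) / N)       # O(N): ratio stays bounded (about 4)
print(gd_const(case_iv) / N ** 2)     # O(N^2): ratio about 1
```

The single ill-conditioned component dominates GD's constant completely, while NESTT's square-root averaging dilutes it, which is exactly the Case IV row.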
49 Numerical Results: High-Dimensional Regression
Figure: comparison of NESTT-G, NESTT-E ($\alpha = 1$ and $\alpha = 10$), SAGA, and SGD; the x-axis is the number of passes over the dataset (# Grad/$N$), the y-axis the optimality gap. Left: uniform sampling ($p_i = 1/N$) with $N_i = N_j$. Right: non-uniform sampling ($p_i = \sqrt{L_i/N}/\sum_{j=1}^{N}\sqrt{L_j/N}$) with $N_i \neq N_j$.
50 Numerical Results: High-Dimensional Regression
Table: optimality gap $\|\tilde\nabla f_{1/\beta}(z^r)\|^2$ for SGD, NESTT-E ($\alpha = 10$), NESTT, and SAGA, under uniform and non-uniform sampling, with 100 passes over the datasets, for several problem sizes $N$.
In both the uniform and the non-uniform case, NESTT outperforms the other methods.
51 Numerical Results: Comparison with GD (Worst Case)
- Consider the distributed SPCA problem.
- Split a matrix into diagonal sub-matrices.
Figure: comparison of NESTT and GD.
52 Numerical Results: Conclusion
- A primal-dual based algorithm for minimizing nonconvex finite sums.
- It converges sublinearly in general, and Q-linearly for certain nonconvex $\ell_1$-penalized quadratic problems.
- IAG/SAG/SAGA are closely connected to NESTT.
- Non-uniform sampling helps improve the convergence rates.
- Key insight: primal-dual based algorithms can offer faster convergence than primal-only algorithms.
53 Thank You!
54 Comments: Linear Convergence (backup)
- The linear convergence result utilizes a key error bound condition from [Luo-Tseng 92], and gives linear convergence to stationary solutions.
- This is different from a few recent works showing linear convergence for "nonconvex" problems [Zhu-Hazan 16], [Reddi et al 16], [Karimi-Schmidt 16], which:
  1. are based on certain quadratic growth conditions;
  2. cover smooth unconstrained problems in which every stationary point is a global minimum;
  3. do not cover our nonconvex QP.
55 Connections and Comparisons with Existing Works (backup)
One can verify that the $x_i$ update is given by
$$x^{r+1}_j = z^r - \frac{1}{\alpha_j \eta_j}\Big(\lambda^r_j + \frac{1}{N}\nabla g_j(y^r_j)\Big). \qquad (5)$$
The dual update is given by
$$\lambda^{r+1}_i = -\frac{1}{N}\nabla g_i(y^r_i). \qquad (6)$$
Observation: the dual variables remember the past gradients of the $g_i$'s.
Photograph from National Geographic, Sept 2008 An interior-point stochastic approximation method and an L1-regularized delta rule Peter Carbonetto, Mark Schmidt and Nando de Freitas University of British
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationStochastic optimization: Beyond stochastic gradients and convexity Part I
Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationProximal Methods for Optimization with Spasity-inducing Norms
Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology
More informationImproved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org
More informationCutting Plane Training of Structural SVM
Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs
More informationPerturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization
Noname manuscript No. (will be inserted by the editor Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization Davood Hajinezhad and Mingyi Hong Received: date / Accepted: date Abstract
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More informationAccelerating SVRG via second-order information
Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationBLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS
BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS Songtao Lu, Ioannis Tsaknakis, and Mingyi Hong Department of Electrical
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationAn Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal
An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve
More informationExpanding the reach of optimal methods
Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine
More informationPrimal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function
More informationA Universal Catalyst for Gradient-Based Optimization
A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine
More informationCoordinate descent methods
Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationDual Ascent. Ryan Tibshirani Convex Optimization
Dual Ascent Ryan Tibshirani Conve Optimization 10-725 Last time: coordinate descent Consider the problem min f() where f() = g() + n i=1 h i( i ), with g conve and differentiable and each h i conve. Coordinate
More informationDealing with Constraints via Random Permutation
Dealing with Constraints via Random Permutation Ruoyu Sun UIUC Joint work with Zhi-Quan Luo (U of Minnesota and CUHK (SZ)) and Yinyu Ye (Stanford) Simons Institute Workshop on Fast Iterative Methods in
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationLecture 3: Minimizing Large Sums. Peter Richtárik
Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors
More informationStochastic Composition Optimization
Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators
More informationA Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir
A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationHigh Order Methods for Empirical Risk Minimization
High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu Thanks to: Aryan Mokhtari, Mark
More informationarxiv: v2 [cs.na] 3 Apr 2018
Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationPrimal-dual Subgradient Method for Convex Problems with Functional Constraints
Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual
More informationALADIN An Algorithm for Distributed Non-Convex Optimization and Control
ALADIN An Algorithm for Distributed Non-Convex Optimization and Control Boris Houska, Yuning Jiang, Janick Frasch, Rien Quirynen, Dimitris Kouzoupis, Moritz Diehl ShanghaiTech University, University of
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationIncremental Quasi-Newton methods with local superlinear convergence rate
Incremental Quasi-Newton methods wh local superlinear convergence rate Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro Department of Electrical and Systems Engineering Universy of Pennsylvania Int. Conference
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationHigh Order Methods for Empirical Risk Minimization
High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless
More informationCoordinate Descent Faceoff: Primal or Dual?
JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University
More informationOptimization for Machine Learning (Lecture 3-B - Nonconvex)
Optimization for Machine Learning (Lecture 3-B - Nonconvex) SUVRIT SRA Massachusetts Institute of Technology MPI-IS Tübingen Machine Learning Summer School, June 2017 ml.mit.edu Nonconvex problems are
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationProximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization
Mach earn 207 06:595 622 DOI 0.007/s0994-06-5609- Proximal average approximated incremental gradient descent for composite penalty regularized empirical risk minimization Yiu-ming Cheung Jian ou Received:
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationFirst-order methods of solving nonconvex optimization problems: Algorithms, convergence, and optimality
Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2018 First-order methods of solving nonconvex optimization problems: Algorithms, convergence, and optimality
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationDual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationPrimal and Dual Variables Decomposition Methods in Convex Optimization
Primal and Dual Variables Decomposition Methods in Convex Optimization Amir Beck Technion - Israel Institute of Technology Haifa, Israel Based on joint works with Edouard Pauwels, Shoham Sabach, Luba Tetruashvili,
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationConvex Optimization. Newton s method. ENSAE: Optimisation 1/44
Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationOn the Convergence Rate of Incremental Aggregated Gradient Algorithms
On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More information