Efficient Methods for Large-Scale Optimization

1 Efficient Methods for Large-Scale Optimization
Aryan Mokhtari
Department of Electrical and Systems Engineering, University of Pennsylvania
Ph.D. Proposal
Advisor: Alejandro Ribeiro
Committee: Ali Jadbabaie (chair), Asuman Ozdaglar, and Gesualdo Scutari
November 21st, 2016

2 Introduction
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

3 Large-scale Optimization
Expected risk minimization problem
  \min_{w \in R^p} E_\theta[ f(w, \theta) ]
The distribution is unknown; we have access to N independent realizations of \theta
We settle for solving the Empirical Risk Minimization (ERM) problem
  \min_{w \in R^p} F(w) := \min_{w \in R^p} (1/N) \sum_{i=1}^N f(w, \theta_i)
Large-scale optimization or machine learning: large N, large p
  N: number of observations (inputs); p: number of parameters in the model
Applications: classification problems, wireless networks, sensor networks, ...
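To make the ERM objective concrete, here is a minimal Python sketch of F(w) and its gradient for a logistic loss; the choice of loss and the arrays X, y (features and labels in {-1, +1}) are illustrative assumptions, not part of the proposal.

```python
import numpy as np

# Minimal ERM example: logistic loss over N samples (X, y), with y in {-1, +1}.
# F(w) = (1/N) * sum_i log(1 + exp(-y_i * x_i^T w))
def erm_objective(w, X, y):
    margins = y * (X @ w)                      # y_i * x_i^T w for every sample
    return np.mean(np.log1p(np.exp(-margins)))

def erm_gradient(w, X, y):
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))       # derivative of each logistic loss term
    return (X.T @ coeff) / X.shape[0]
```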

4 Optimization methods
Stochastic methods: a subset of samples is used at each iteration
  SGD is the most popular; however, it is slow because of
    Noise of stochasticity: variance reduction (SAG, SAGA, SVRG, ...)
    Poor curvature approximation: stochastic QN (SGD-QN, RES, oLBFGS, ...)
Decentralized methods: samples are distributed over multiple processors
  Primal methods: DGD, Acc. DGD, NN, ...
  Dual methods: DDA, DADMM, DQM, EXTRA, ESOM, ...
Adaptive sample size methods: start with a subset of samples and increase the size of the training set at each iteration
  Ada Newton: the solutions are close when the numbers of samples are close

5 Stochastic quasi-Newton algorithms
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

6 Stochastic Gradient Descent
Objective function gradient
  s(w) := \nabla F(w) = (1/N) \sum_{i=1}^N \nabla f(w, \theta_i)
Gradient descent iteration: w_{t+1} = w_t - \epsilon_t s(w_t)
Gradient evaluation is not computationally affordable
Stochastic gradient: replace the expectation by a sample average over L samples
  \hat s(w, \tilde\theta) = (1/L) \sum_{l=1}^L \nabla f(w, \theta_l)
Define the aggregate sample random variable \tilde\theta = [\theta_1; ...; \theta_L]
Stochastic gradient descent iteration: w_{t+1} = w_t - \epsilon_t \hat s(w_t, \tilde\theta_t)
(Stochastic) gradient descent is (very) slow. Newton is impractical
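A minimal Python/numpy sketch of this SGD iteration; the mini-batch sampling, the diminishing step size \epsilon_t = \epsilon_0 t_0/(t_0 + t), and the oracle grad_f(w, batch) are illustrative assumptions.

```python
import numpy as np

def sgd(grad_f, w0, thetas, eps0=1.0, t0=10.0, L=5, iters=1000, seed=0):
    """Stochastic gradient descent on F(w) = (1/N) sum_i f(w, theta_i).

    grad_f(w, batch) returns the stochastic gradient averaged over the batch.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = len(thetas)
    for t in range(iters):
        batch = [thetas[i] for i in rng.integers(0, N, size=L)]   # sample L realizations
        eps_t = eps0 * t0 / (t0 + t)                               # diminishing step size
        w -= eps_t * grad_f(w, batch)       # w_{t+1} = w_t - eps_t * s_hat(w_t, theta_t)
    return w
```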

7 BFGS quasi-Newton method
Correct curvature with a Hessian approximation matrix B_t^{-1}
  w_{t+1} = w_t - \epsilon_t B_t^{-1} s(w_t)
Approximation B_t close to H(w_t) := \nabla^2 F(w_t). Broyden, DFP, BFGS
Variable variation: v_t = w_{t+1} - w_t. Gradient variation: r_t = s(w_{t+1}) - s(w_t)
Matrix B_{t+1} satisfies the secant condition B_{t+1} v_t = r_t. Underdetermined

8 BFGS quasi-Newton method
Resolve the indeterminacy by making B_{t+1} closest to the previous approximation B_t
  B_{t+1} = argmin_Z  tr(B_t^{-1} Z) - log det(B_t^{-1} Z) - n
            s.t.  Z v_t = r_t,  Z \succeq 0
Proximity measured in terms of differential entropy
Solve to find the Hessian approximation update
  B_{t+1} = B_t + r_t r_t^T / (v_t^T r_t) - B_t v_t v_t^T B_t / (v_t^T B_t v_t)
For positive definite B_0 \succ 0 and positive inner product v_t^T r_t > 0,
B_t stays positive definite for all iterations t
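The closed-form update above translates directly into code; this is a generic sketch of the BFGS matrix update, not tied to any particular implementation in the cited work.

```python
import numpy as np

def bfgs_update(B, v, r):
    """BFGS Hessian-approximation update.

    B_next = B + r r^T / (v^T r) - B v v^T B / (v^T B v),
    with v = w_{t+1} - w_t and r = s(w_{t+1}) - s(w_t).
    Keeps B positive definite as long as v^T r > 0.
    """
    Bv = B @ v
    return B + np.outer(r, r) / (r @ v) - np.outer(Bv, Bv) / (v @ Bv)
```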

9 Stochastic (online) BFGS
Stochastic BFGS substitutes all the gradients in BFGS by stochastic gradients
Correct curvature with the Hessian approximation matrix \hat B_t^{-1}
  w_{t+1} = w_t - \epsilon_t \hat B_t^{-1} \hat s(w_t, \tilde\theta_t)
Variable variation: v_t = w_{t+1} - w_t
Stochastic gradient variation: \hat r_t = \hat s(w_{t+1}, \tilde\theta_t) - \hat s(w_t, \tilde\theta_t)
Update the Hessian approximation
  \hat B_{t+1} = \hat B_t + \hat r_t \hat r_t^T / (v_t^T \hat r_t) - \hat B_t v_t v_t^T \hat B_t / (v_t^T \hat B_t v_t)
Problem: \hat B_t can become arbitrarily close to positive semidefinite
  Eigenvalues of \hat B_t^{-1} are unbounded, so there is no guarantee of convergence

10 Regularized BFGS
Modify BFGS to resolve the issue of unbounded eigenvalues
The secant condition is fundamental; the proximity criterion is somewhat arbitrary
Restrict the search to matrices Z whose smallest eigenvalue is at least \delta
  B_{t+1} = argmin_Z  tr[B_t^{-1}(Z - \delta I)] - log det[B_t^{-1}(Z - \delta I)] - n
            s.t.  Z v_t = r_t,  Z \succeq 0
Solve to find the regularized Hessian approximation update
  B_{t+1} = B_t + \tilde r_t \tilde r_t^T / (v_t^T \tilde r_t) - B_t v_t v_t^T B_t / (v_t^T B_t v_t) + \delta I
Modified gradient variation \tilde r_t := r_t - \delta v_t
  Replaced the gradient variation r_t by the modified gradient variation \tilde r_t := r_t - \delta v_t
  Added the regularization term \delta I

11 Regularized Stochastic BFGS (RES)
Replace the gradient by the stochastic gradient in the descent iteration and the secant condition
Matrix \hat B_{t+1} satisfies the instantaneous secant condition and is closest to \hat B_t
  \hat B_{t+1} = argmin_Z  tr[\hat B_t^{-1}(Z - \delta I)] - log det[\hat B_t^{-1}(Z - \delta I)] - n
                 s.t.  Z v_t = \hat r_t,  Z \succeq 0
Solve in closed form to find the Hessian approximation update (if \hat B_t \succeq \delta I)
  \hat B_{t+1} = \hat B_t + \tilde r_t \tilde r_t^T / (v_t^T \tilde r_t) - \hat B_t v_t v_t^T \hat B_t / (v_t^T \hat B_t v_t) + \delta I
Stochastic modified gradient variation \tilde r_t = \hat r_t - \delta v_t
  Replaced the gradient variation r_t by the stochastic gradient variation \hat r_t
RES descent step: w_{t+1} = w_t - \epsilon_t (\hat B_t^{-1} + \Gamma I) \hat s(w_t, \tilde\theta_t)
  Added \Gamma I to the Hessian inverse approximation \hat B_t^{-1}

12 RES
Require: initial variable w_0 and Hessian approximation \hat B_0 \succeq \delta I
for t = 0, 1, 2, ... do
  1: Collect L realizations of the random variable \tilde\theta_t = [\theta_{t1}, ..., \theta_{tL}]
  2: Compute the stochastic gradient \hat s(w_t, \tilde\theta_t) = (1/L) \sum_{l=1}^L \nabla f(w_t, \theta_{tl})
  3: Update the variable w_{t+1} = w_t - \epsilon_t (\hat B_t^{-1} + \Gamma I) \hat s(w_t, \tilde\theta_t)
  4: Compute the variable variation v_t = w_{t+1} - w_t
  5: Compute the stochastic gradient \hat s(w_{t+1}, \tilde\theta_t) = (1/L) \sum_{l=1}^L \nabla f(w_{t+1}, \theta_{tl})
  6: Compute the modified stochastic gradient variation \tilde r_t = \hat r_t - \delta v_t = \hat s(w_{t+1}, \tilde\theta_t) - \hat s(w_t, \tilde\theta_t) - \delta v_t
  7: Update the Hessian approximation
     \hat B_{t+1} = \hat B_t + \tilde r_t \tilde r_t^T / (v_t^T \tilde r_t) - \hat B_t v_t v_t^T \hat B_t / (v_t^T \hat B_t v_t) + \delta I
A sketch of this loop is given below.
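A compact Python sketch of the seven steps above; the gradient oracle grad_f, the default values of delta, Gamma, and L, and the step-size constants are placeholders chosen for illustration, not the settings used in the experiments.

```python
import numpy as np

def res(grad_f, w0, thetas, eps0=1.0, t0=10.0, L=5, delta=1e-3, Gamma=1e-4,
        iters=1000, seed=0):
    """Sketch of the RES (regularized stochastic BFGS) loop on the slide."""
    rng = np.random.default_rng(seed)
    p, N = w0.size, len(thetas)
    w = w0.copy()
    B = delta * np.eye(p)                                  # B_0 satisfies B_0 >= delta * I
    for t in range(iters):
        batch = [thetas[i] for i in rng.integers(0, N, size=L)]            # step 1
        g_old = grad_f(w, batch)                                            # step 2
        eps_t = eps0 * t0 / (t0 + t)
        w_new = w - eps_t * (np.linalg.inv(B) + Gamma * np.eye(p)) @ g_old  # step 3
        v = w_new - w                                                       # step 4
        r_hat = grad_f(w_new, batch) - g_old                                # step 5
        r = r_hat - delta * v                                               # step 6
        Bv = B @ v
        # step 7: assumes v^T r > 0, guaranteed under the assumptions when delta < m
        B = B + np.outer(r, r) / (r @ v) - np.outer(Bv, Bv) / (v @ Bv) + delta * np.eye(p)
        w = w_new
    return w
```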

13 Assumptions
Eigenvalues of the instantaneous Hessians H(w, \theta) := \nabla^2 f(w, \theta) are bounded:
  m I \preceq H(w, \theta) \preceq M I
Second moment of the norm of the stochastic gradient is bounded:
  E[ \|\hat s(w_t, \tilde\theta_t)\|^2 ] \le S^2
Regularization parameter \delta is smaller than m: \delta < m
  Guarantees positivity of the inner product \tilde r_t^T v_t > 0

14 Almost sure convergence
Theorem. Consider the regularized stochastic BFGS algorithm as defined. If the stepsize sequence satisfies \sum_{t=1}^\infty \epsilon_t = \infty and \sum_{t=1}^\infty \epsilon_t^2 < \infty, the squared distance to optimality \|w_t - w^*\|^2 satisfies
  lim_{t \to \infty} \|w_t - w^*\|^2 = 0,  w.p. 1

15 Convergence rate
Theorem. Consider the regularized stochastic BFGS algorithm as defined and let the stepsize sequence be \epsilon_t = \epsilon_0 t_0 / (t_0 + t) with 2 \epsilon_0 t_0 \Gamma > 1. Then the gap between the expected objective value E[F(w_t)] and the optimal objective F(w^*) satisfies
  E[F(w_t)] - F(w^*) \le O(1/t)

16 Numerical results
SVM problem with N = 10^3 samples and p = 40
RES outperforms the other methods by exploiting curvature information
[Figure: objective function value F(w_t) versus number of processed feature vectors (Lt) for SAA, SGD, S2GD, SAG, and RES]
RES has the fastest convergence time to reach accuracy F(w_t) = 10^{-4}:
  SGD 520 ms, SAG 350 ms, S2GD 280 ms, SAA > 690 ms (the RES time was not preserved in the transcription)

17 Making RES more efficient
B_t^{-1} depends on B_{t-1}^{-1} and the curvature information pair {v_{t-1}, r_{t-1}}
  Hence it depends on B_0^{-1} and all previous curvature information {v_u, r_u}_{u=0}^{t-1}
Old curvature information pairs should have less influence on B_t^{-1}
LBFGS uses only the last \tau curvature information pairs {v_u, r_u}_{u=t-\tau}^{t-1}
The computational complexity of implementing LBFGS is O(\tau p) instead of O(p^2)
Stochastic method (oLBFGS): substitute gradients with stochastic gradients
  No regularization is required for the stochastic version
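The O(\tau p) cost comes from applying the limited-memory approximation with the standard two-loop recursion; the sketch below is the textbook recursion, shown only to illustrate the complexity claim, with v_list and r_list holding the last \tau variable and gradient variations.

```python
import numpy as np

def lbfgs_direction(grad, v_list, r_list):
    """Two-loop recursion: apply the L-BFGS inverse-Hessian approximation to grad.

    Cost is O(tau * p) for tau stored pairs {v_u, r_u}, instead of O(p^2).
    """
    q = grad.copy()
    alphas = []
    for v, r in zip(reversed(v_list), reversed(r_list)):   # newest pair first
        rho = 1.0 / (r @ v)
        alpha = rho * (v @ q)
        q -= alpha * r
        alphas.append(alpha)
    if v_list:                                              # scaled identity as initial matrix
        v, r = v_list[-1], r_list[-1]
        q *= (v @ r) / (r @ r)
    for (v, r), alpha in zip(zip(v_list, r_list), reversed(alphas)):  # oldest pair first
        rho = 1.0 / (r @ v)
        beta = rho * (r @ q)
        q += (alpha - beta) * v
    return q                                                # approx B_t^{-1} grad
```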

18 Bounded eigenvalues without regularization
Lemma. Consider the oLBFGS algorithm as defined. Then the trace and determinant of the approximate Hessian \hat B_t are bounded as
  tr(\hat B_t) \le (p + \tau) M,   det(\hat B_t) \ge m^{p+\tau} / [(p + \tau) M]^{\tau}
Lemma. Consider the oLBFGS algorithm as defined. Then the eigenvalues of the approximate Hessian \hat B_t are bounded as
  ( m^{p+\tau} / [(p + \tau) M]^{p+\tau-1} ) I =: c I \preceq \hat B_t \preceq C I := (p + \tau) M I
Consequently, the eigenvalues of the approximate Hessian inverse \hat B_t^{-1} are bounded

19 Large-scale Click-Through Rate (CTR) prediction
Soso search engine data set from Tencent [Sun, 2014]
We use N = [?] feature vectors with dimension p = [?] (values not preserved in the transcription)
[Feature table not fully preserved: components, maximum count, and mean count per feature group for Gender, Age, Ad URL, Depth, Position, Depth and Position, Advertisement ID, Advertiser ID, Query ID, Keyword ID, Title ID, Description ID, User ID, Query Tokens, Ad Keyword Tokens, Ad Title Tokens, Ad Description Tokens, Cosine Similarities, numerical Depth and Position, and the numbers of Query/Ad Keyword/Ad Title/Ad Description Tokens; about 56 million components in total]

20 Comparison in terms of convergence rate
oLBFGS and SGD compared for the same budget of processed feature vectors
[Figures: relative objective function value F(w) - F(w^*) and prediction error versus processed feature vectors (Lt) for SGD and oLBFGS]
SGD achieves F(w_t) - F(w^*) = 0.78 after processing the full budget of feature vectors
oLBFGS achieves F(w_t) - F(w^*) = 0.13 after processing the same number of feature vectors
The prediction error of oLBFGS is lower than that of SGD (SGD error = 0.38)

21 Comparison in terms of convergence time
oLBFGS and SGD running for up to 9 hours
[Figures: relative objective function value F(w) - F(w^*) and prediction error versus run time (hours) for SGD and oLBFGS]
oLBFGS outperforms SGD despite its larger per-iteration computational cost

22 Proposed research: Incremental aggregated BFGS
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

23 Proposed research
SQN methods improve the curvature approximation in SGD
  They outperform SGD in ill-conditioned problems
  However, they cannot improve the convergence rate of SGD
    The convergence rate is sublinear; they cannot recover the superlinear rate of quasi-Newton methods
  Reason? The noise of stochasticity does not vanish with a constant stepsize
Solution: incremental aggregated algorithms (variance reduction methods)
  Track the information of all the functions f_1, ..., f_N in a memory
  Update the information of only one function f_i at each iteration
  Use the most recent information of all functions f_1, ..., f_N to update the model

24 Incremental aggregated gradient method
[Figure: a table of stored gradients \nabla f_1^t, ..., \nabla f_N^t; only the entry of the chosen function f_{i_t} is refreshed with \nabla f_{i_t}(w_{t+1})]
Consider i_t as the index of the function chosen at step t
Step 1: Compute w_{t+1} = w_t - (\alpha/N) \sum_{i=1}^N \nabla f_i^t using the stored gradients
Step 2: Update the stored gradient corresponding to function f_{i_t}
A sketch of this bookkeeping is given below.
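A Python sketch of the table bookkeeping described above (stored per-function gradients, one refresh per iteration); the random index selection and the oracle interface grads[i](w) are illustrative assumptions.

```python
import numpy as np

def iag(grads, w0, alpha, iters, seed=0):
    """Incremental aggregated gradient sketch.

    grads is a list of per-function gradient oracles grad_i(w); a table of the
    most recent gradient of every f_i is kept and only one entry is refreshed
    per iteration, while the update uses the average of the whole table.
    """
    rng = np.random.default_rng(seed)
    N = len(grads)
    w = w0.copy()
    table = np.array([g(w) for g in grads])     # stored gradients grad f_i^t
    avg = table.mean(axis=0)
    for t in range(iters):
        w = w - alpha * avg                     # step 1: aggregate of stored gradients
        i = rng.integers(0, N)                  # step 2: refresh only function f_{i_t}
        new_g = grads[i](w)
        avg += (new_g - table[i]) / N           # keep the running average up to date
        table[i] = new_g
    return w
```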

25 Incremental BFGS method (Part I)
[Figure: per-function memory of variable copies z_i^t, Hessian approximations B_i^t, and gradients \nabla f_i^t; only the entries of the chosen function f_{i_t} are refreshed via a BFGS update at w_{t+1}]
B_i approximates the Hessian \nabla^2 f_i; z_i is a copy of the variable w corresponding to function f_i
Consider i_t as the index of the function chosen at step t
Step 1: Compute w_{t+1} = w_t - ( (1/N) \sum_{i=1}^N B_i^t )^{-1} ( (1/N) \sum_{i=1}^N \nabla f_i^t )
Step 2: Update the information corresponding to function f_{i_t}
It doesn't work: it could be as slow as the incremental gradient method

26 Incremental BFGS method (Part II)
Approximate f_i(w) by its (BFGS) quadratic approximation around z_i^t
The global loss f(w) := (1/N) \sum_{i=1}^N f_i(w) is therefore approximated by
  f(w) \approx (1/N) \sum_{i=1}^N [ f_i(z_i^t) + \nabla f_i(z_i^t)^T (w - z_i^t) + (1/2)(w - z_i^t)^T B_i^t (w - z_i^t) ]
Solving this quadratic program yields the update
  w^{t+1} = ( (1/N) \sum_{i=1}^N B_i^t )^{-1} [ (1/N) \sum_{i=1}^N B_i^t z_i^t - (1/N) \sum_{i=1}^N \nabla f_i(z_i^t) ]
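A sketch of how the aggregated quadratic model above can be maintained incrementally, with per-function copies z_i, matrices B_i, and gradients refreshed one index at a time; this is an illustrative rendering of the update, not the exact proposed incremental BFGS implementation (which would also use efficient rank-one updates to avoid forming and solving the full system at every iteration).

```python
import numpy as np

def incremental_bfgs(grads, w0, iters, seed=0):
    """Sketch of the aggregated quadratic-model update described above.

    grads[i](w) returns grad f_i(w). Each function keeps a variable copy z_i,
    a BFGS Hessian approximation B_i, and a stored gradient; only the entry of
    the sampled index is refreshed per iteration.
    """
    rng = np.random.default_rng(seed)
    N, p = len(grads), w0.size
    z = np.tile(w0, (N, 1))                       # variable copies z_i
    B = np.array([np.eye(p) for _ in range(N)])   # Hessian approximations B_i
    g = np.array([grads[i](w0) for i in range(N)])
    w = w0.copy()
    for t in range(iters):
        B_sum = B.sum(axis=0)
        rhs = np.einsum('ijk,ik->j', B, z) - g.sum(axis=0)
        w = np.linalg.solve(B_sum, rhs)           # minimizer of the aggregated quadratic model
        i = rng.integers(0, N)                    # refresh the information of f_i only
        new_g = grads[i](w)
        v, r = w - z[i], new_g - g[i]             # variable and gradient variations
        if v @ r > 1e-12:                         # standard BFGS curvature check
            Bv = B[i] @ v
            B[i] = B[i] + np.outer(r, r) / (v @ r) - np.outer(Bv, Bv) / (v @ Bv)
        z[i], g[i] = w, new_g
    return w
```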

27 Numerical results
Quadratic programming: f(w) := (1/N) \sum_{i=1}^N ( w^T A_i w / 2 + b_i^T w )
  A_i \in R^{p \times p} is a diagonal positive definite matrix
  b_i \in R^p is a random vector from the box [0, 10^3]^p
N = 1000, p = 100, and condition numbers 10^2 and 10^4
Relative error \|w^t - w^*\| / \|w^0 - w^*\| of SAG, SAGA, IAG, and IQN
[Figures: normalized error versus number of effective passes for (a) small condition number and (b) large condition number]

28 Distributed methods
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

29 Distributed optimization
Stochastic methods attempt to distribute samples (functions) over time
The alternative approach is to distribute samples over processors
Network with n nodes; each node i has access to a local function f_i(x)
Nodes collaborate to minimize the global objective f(x) = \sum_{i=1}^n f_i(x)
[Figure: a connected network of 10 nodes with local functions f_1(x), ..., f_10(x)]
Recursive exchanges with neighbors j \in N_i aggregate global information

30 Methods for distributed optimization
Replicate the common variable at each node: f(x_1, ..., x_n) = \sum_{i=1}^n f_i(x_i)
Enforce equality between neighbors, x_i = x_j (thus between all nodes)
[Figure: the same 10-node network with local copies x_1, ..., x_10 at the nodes]
Operate recursively to enforce equality asymptotically. Methods differ on how:
  Distributed gradient descent, recursive averaging [Nedic, Ozdaglar 09]
  Distributed dual ascent, prices [Rabbat et al 05]
  Distributed ADMM, prices [Schizas et al 08]
All are first-order methods, so convergence times are not always reasonable

32 (Approximate) Network Newton (NN)
Reinterpret distributed gradient descent (DGD) as a penalty method
A Newton step for the objective plus penalty requires global coordination
Approximate it with local operations by truncating the Taylor series of the Hessian inverse
  The kth term of the series is k-hop-neighbor sparse
  NN-K aggregates information from the K-hop neighborhood (NN-1, NN-2, NN-3, ...)
[Figure: the 10-node network with the 1-, 2-, and 3-hop neighborhoods of a node highlighted]
NN-K always converges linearly and exhibits a quadratic phase in a range of iterations

36 Decentralized Gradient Descent (DGD)
Problem in distributed form
  min_{x_1, ..., x_n} \sum_{i=1}^n f_i(x_i),   s.t. x_i = x_j for all j \in N_i
With nonnegative doubly stochastic weights w_{ij}, the DGD update at node i is
  x_{i,t+1} = \sum_{j \in N_i} w_{ij} x_{j,t} - \alpha \nabla f_i(x_{i,t})
Average of local and neighboring variables plus a local gradient descent step (sketch below)
Rewrite DGD in vector form (aggregate variable y := [x_1; ...; x_n]):
  y_{t+1} = Z y_t - \alpha h(y_t)
Weight matrix W, Z := W \otimes I. Gradient h(y) := [\nabla f_1(x_1); ...; \nabla f_n(x_n)]
Reorder terms in the vector form of DGD:
  y_{t+1} = y_t - [ (I - Z) y_t + \alpha h(y_t) ]
This is gradient descent on the function F(y) := (1/2) y^T (I - Z) y + \alpha \sum_{i=1}^n f_i(x_i)
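A minimal sketch of the node-level DGD update, written in matrix form; the weight matrix W, the gradient oracles grads[i], and the shape conventions are assumed inputs.

```python
import numpy as np

def dgd(grads, W, x0, alpha, iters):
    """Decentralized gradient descent sketch.

    grads[i](x) is node i's local gradient, W is a doubly stochastic weight
    matrix respecting the graph (w_ij = 0 for non-neighbors), and x0 has
    shape (n, p) with one row per node.
    """
    X = x0.copy()
    for t in range(iters):
        G = np.array([grads[i](X[i]) for i in range(len(grads))])
        X = W @ X - alpha * G     # x_{i,t+1} = sum_j w_ij x_{j,t} - alpha * grad f_i(x_{i,t})
    return X
```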

37 DGD as a penalty method (reconsidering the mystery of DGD)
Why do gradient descent on F(y) := (1/2) y^T (I - Z) y + \alpha \sum_{i=1}^n f_i(x_i)?
(I - Z) y = 0 if and only if x_i = x_j for all i, j. The same is true of (I - Z)^{1/2}
The problem in distributed form is therefore equivalent to
  min_y \sum_{i=1}^n f_i(x_i),   s.t. (I - Z)^{1/2} y = 0
DGD is a penalty method that solves this (equivalent) problem
  Squared norm penalty (1/2) \|(I - Z)^{1/2} y\|^2 with coefficient 1/\alpha
  It converges to the wrong solution, but not far from the right one if \alpha is small
Gradient descent on F(y) works. Why not use Newton steps on F(y)?

38 Newton method for the penalized objective function
Penalized objective function: F(y) = (1/2) y^T (I - Z) y + \alpha \sum_{i=1}^n f_i(x_i)
To implement Newton on F(y) we need the Hessians H_t = I - Z + \alpha G_t
  G_t is block diagonal with blocks G_{ii,t} = \nabla^2 f_i(x_{i,t})
Hessian H_t has the sparsity pattern of Z, i.e., the sparsity pattern of the graph
  It can be computed with local information plus exchanges with neighbors
The Newton step depends on the Hessian inverse: d_t := -H_t^{-1} g_t
  The inverse of H_t is, in general, neither block sparse nor locally computable

39 Network Newton Hessian approximation
Define the block diagonal matrix D_t := \alpha G_t + 2 (I - diag(Z))
Define the block graph-sparse matrix B := I - 2 diag(Z) + Z
Split the Hessian as H_t = D_t - B = D_t^{1/2} ( I - D_t^{-1/2} B D_t^{-1/2} ) D_t^{1/2}
Use the Taylor series (I - X)^{-1} = \sum_{k=0}^\infty X^k to write the Hessian inverse as
  H_t^{-1} = D_t^{-1/2} \sum_{k=0}^\infty ( D_t^{-1/2} B D_t^{-1/2} )^k D_t^{-1/2}
Define the NN-K step d_t^{(K)} := -\hat H_t^{(K)-1} g_t by truncating the sum at the Kth term:
  \hat H_t^{(K)-1} := D_t^{-1/2} \sum_{k=0}^K ( D_t^{-1/2} B D_t^{-1/2} )^k D_t^{-1/2}
D_t^{-1/2} B D_t^{-1/2} is graph sparse, so ( D_t^{-1/2} B D_t^{-1/2} )^k is k-hop-neighborhood sparse
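A small numerical sketch of the truncated series, treating D_t as diagonal for simplicity (the scalar case p = 1; in general D_t is block diagonal and the square roots are taken blockwise); it only illustrates how adding terms improves the approximation of H_t^{-1}.

```python
import numpy as np

def nn_hessian_inverse_approx(D, B, K):
    """Truncated Taylor-series approximation of H^{-1} for H = D - B.

    Computes D^{-1/2} * sum_{k=0}^{K} (D^{-1/2} B D^{-1/2})^k * D^{-1/2}.
    D is assumed diagonal here; the k-th power of the inner matrix is
    k-hop-neighborhood sparse, so larger K uses a wider neighborhood.
    """
    D_isqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    M = D_isqrt @ B @ D_isqrt
    partial = np.zeros_like(M)
    Mk = np.eye(M.shape[0])
    for _ in range(K + 1):
        partial += Mk
        Mk = Mk @ M
    return D_isqrt @ partial @ D_isqrt
```

As K grows (and with the spectral radius of D_t^{-1/2} B D_t^{-1/2} below one, which holds by construction), the truncated sum converges to H_t^{-1}; NN-K keeps only the first K + 1 terms.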

40 Distributed computation of the NN-K step
The descent direction is d_t^{(K)} := -\hat H_t^{(K)-1} g_t
If we define d_{i,t}^{(K)} as the local part of d_t^{(K)} at node i, then
  d_{i,t}^{(k+1)} = D_{ii,t}^{-1} \sum_{j \in N_i \cup \{i\}} B_{ij} d_{j,t}^{(k)} - D_{ii,t}^{-1} g_{i,t}
The local piece of the NN-(k+1) step is computed as a function of
  Local matrices, local gradient components, and the local piece of the NN-k step
  Pieces of the NN-k step of neighboring nodes, which can be exchanged

41 NN-K Algorithm at node i
(0) Initialize at x_{i,0}. Repeat for times t = 0, 1, ...
(1) Exchange local iterates x_{i,t} with neighboring nodes j \in N_i
(2) Compute local gradient components
    g_{i,t} = (1 - w_{ii}) x_{i,t} - \sum_{j \in N_i} w_{ij} x_{j,t} + \alpha \nabla f_i(x_{i,t})
(3) Initialize the NN step computation with the NN-0 step d_{i,t}^{(0)} = -D_{ii,t}^{-1} g_{i,t}
(4) Repeat for k = 0, 1, ..., K - 1
(5)   Exchange local elements d_{i,t}^{(k)} of the NN-k step with neighbors j \in N_i
(6)   Compute the local component of the NN-(k+1) step
      d_{i,t}^{(k+1)} = D_{ii,t}^{-1} \sum_{j \in N_i \cup \{i\}} B_{ij} d_{j,t}^{(k)} - D_{ii,t}^{-1} g_{i,t}
(7) Update the local iterate: x_{i,t+1} = x_{i,t} + \epsilon d_{i,t}^{(K)}
A sketch of steps (3)-(6) is given below.
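A centralized simulation of steps (3)-(6) above; the blockwise containers D_blocks, B, g, and neighbors are illustrative data structures, and in the actual distributed method each inner iteration corresponds to one exchange with the neighbors.

```python
import numpy as np

def nn_k_direction(D_blocks, B, g, neighbors, K):
    """Compute one NN-K descent direction with the local recursion of the slide.

    D_blocks[i] is the p x p block D_{ii,t}; B[i][j] is the p x p block of the
    graph-sparse matrix B for the pair (i, j), nonzero only for j in N_i or
    j = i; g[i] is node i's local gradient; neighbors[i] lists j in N_i.
    """
    n = len(D_blocks)
    D_inv = [np.linalg.inv(Dii) for Dii in D_blocks]
    d = [-D_inv[i] @ g[i] for i in range(n)]                 # NN-0 step
    for k in range(K):
        d_new = []
        for i in range(n):
            acc = B[i][i] @ d[i]                             # own block, j = i
            for j in neighbors[i]:                           # one exchange with neighbors
                acc += B[i][j] @ d[j]
            d_new.append(D_inv[i] @ acc - D_inv[i] @ g[i])   # d_i^(k+1)
        d = d_new
    return d
```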

42 Assumptions
Assumption 1: the local objective functions f_i(x) are twice differentiable and the Hessians \nabla^2 f_i(x) have bounded eigenvalues
  m I \preceq \nabla^2 f_i(x) \preceq M I
Assumption 2: the local Hessians are Lipschitz continuous
  \| \nabla^2 f_i(x) - \nabla^2 f_i(\hat x) \| \le L \| x - \hat x \|

43 Linear convergence of NN-K
Theorem. For a specific choice of stepsize \epsilon, the sequence F(y_t) converges to the optimal objective value F(y^*) at least linearly with constant 0 < 1 - \zeta_\alpha < 1, i.e.,
  F(y_t) - F(y^*) \le (1 - \zeta_\alpha)^t ( F(y_0) - F(y^*) )

44 Superlinear convergence
Theorem. For specific values of \Gamma_1 and \Gamma_2, the sequence of weighted gradient norms satisfies
  \| D_t^{1/2} g_{t+1} \| \le (1 - \epsilon + \epsilon \rho^{K+1}) [ 1 + \Gamma_1 (1 - \zeta)^{(t-1)/4} ] \| D_{t-1}^{1/2} g_t \| + \epsilon \Gamma_2 \| D_{t-1}^{1/2} g_t \|^2
where \rho < 1.
\| D_t^{1/2} g_{t+1} \| is upper bounded by linear and quadratic terms of \| D_{t-1}^{1/2} g_t \|
  Similar to the convergence analysis of Newton's method with constant stepsize
For t large enough, \Gamma_1 (1 - \zeta)^{(t-1)/4} vanishes; for K large enough and \epsilon close to 1, the linear term becomes smaller
There is an interval in which the quadratic term dominates the linear term
  The rate of convergence is quadratic in that interval

45 Numerical results
Convergence path for f(x) := \sum_{i=1}^{100} ( x^T A_i x / 2 + b_i^T x )
  The sum \sum_{i=1}^{100} A_i is ill-conditioned, the graph is 4-regular, and \alpha = 10^{-2}
Error metric: e_t = (1/n) \sum_{i=1}^n \| x_{i,t} - x^* \|^2 / \| x^* \|^2
[Figure: error versus number of iterations for DGD, NN-0, NN-1, and NN-2]
DGD is slower than the different versions of NN-K
NN-K with larger K converges faster in terms of the number of iterations

46 Numerical results
Number of required information exchanges to achieve accuracy e = 10^{-2}
  n = 100, node degrees d \in {4, ..., 10}, and condition numbers in {10^2, 10^3, 10^4}
[Figure: empirical distributions of the number of information exchanges for DGD, NN-0, NN-1, and NN-2]
Different versions of NN-K have almost similar performance
DGD is slower than all versions of NN-K by an order of magnitude

47 Proposed research: Quasi-Newton extension
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

48 Proposed research
NN methods improve the curvature approximation in DGD
  They accelerate the convergence rate of DGD (quadratic phase)
However, they suffer from two major issues:
  (I-1) They require computation of the Hessian
  (I-2) They converge to a neighborhood of the solution
Solution to (I-1): decentralized quasi-Newton methods
  QN methods only require computation of the objective gradients
Solution to (I-2): dual domain methods
  Dual methods converge to the exact solution using the Lagrangian idea
Goal: develop a distributed QN method that operates in the dual domain

49 Proposed research: Stochastic decentralized methods
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

50 Proposed research
Decentralized methods reduce the computational complexity and storage
  By a factor proportional to the number of processors (nodes)
However, the number of samples assigned to each node could still be massive
  Nodes cannot afford the computational complexity of deterministic methods
Solution: stochastic approximation
  Approach 1: use stochastic gradients in lieu of gradients
    The noise of stochastic approximation does not vanish, so convergence is slow
  Approach 2: use variance reduction methods instead of gradients
    The noise of stochastic approximation vanishes, which keeps the rate of deterministic methods
Goal: develop stochastic distributed methods that leverage variance reduction techniques

51 Numerical results
Comparison of a saddle point method (EXTRA) and its stochastic approximations
Logistic regression problem with n = 20 nodes and N = 500 sample points
[Figures: error \|x_t - x^*\| versus number of iterations t and versus number of gradient evaluations for stochastic EXTRA (two stepsizes), EXTRA, and DSA]
DSA is the only stochastic decentralized method with a linear convergence rate
DSA requires fewer samples than EXTRA to achieve a target accuracy
  For e_t = 10^{-10}, each node in DSA requires about 460 gradient computations
  For e_t = 10^{-10}, each node in EXTRA requires about 2000 gradient computations

52 Adaptive sample size algorithms
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

53 Back to the ERM problem
Our original goal was to solve the statistical loss problem
  w^* := argmin_{w \in R^p} L(w) = argmin_{w \in R^p} E[ f(w, Z) ]
But since the distribution of Z is unknown, we settle for the ERM problem
  w_N^\dagger := argmin_{w \in R^p} L_N(w) = argmin_{w \in R^p} (1/N) \sum_{k=1}^N f(w, z_k)
where the samples z_k are drawn from a common distribution
ERM approximates the actual problem, so we don't need a perfect solution

54 Regularized ERM problem
From statistical learning we know that there exists a constant V_N such that, w.h.p.,
  sup_w | L(w) - L_N(w) | \le V_N
V_N = O(1/\sqrt{N}) from the CLT; V_N = O(1/N) in some settings [Bartlett et al 06]
There is no need to minimize L_N(w) beyond accuracy O(V_N)
This is well known. In fact, this is why we can add regularizers to the ERM problem:
  w_N^* := argmin_w R_N(w) = argmin_w L_N(w) + (c V_N / 2) \|w\|^2
Adding the term (c V_N / 2) \|w\|^2 moves the optimum of the ERM problem
But the optimum w_N^* is still in a ball of order V_N around w^*

55 Adaptive sample size methods
Consider the ERM problem R_n for a subset of n \le N uniformly chosen samples
  w_n^* := argmin_w R_n(w) = argmin_w L_n(w) + (c V_n / 2) \|w\|^2
The solutions w_m^* and w_n^* for m and n samples are close if m and n are close
  The samples are chosen from the same distribution
Find an approximate solution w_m for the ERM problem R_m with m samples
Increase the sample size to n samples
Use the solution w_m as a warm start to compute an approximate solution w_n for R_n

56 Adaptive sample size Newton method
Ada Newton is a specific adaptive sample size method using Newton steps
Find w_m that solves R_m to within its statistical accuracy V_m
Apply a single Newton iteration
  w_n = w_m - \nabla^2 R_n(w_m)^{-1} \nabla R_n(w_m)
If m and n are close, we have w_n within the statistical accuracy of R_n
This works if the statistical accuracy ball of R_m is within the Newton quadratic convergence ball of R_n:
  Then w_m is within the Newton quadratic convergence ball of R_n
  A single Newton iteration yields w_n within the statistical accuracy of R_n
A sketch of the resulting loop is given below.
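A minimal sketch of the adaptive sample size Newton loop described above; the oracles grad_Rn(w, n) and hess_Rn(w, n) for the regularized risk R_n (with the c V_n/2 \|w\|^2 term included) and the default parameter values are illustrative assumptions.

```python
import numpy as np

def ada_newton(grad_Rn, hess_Rn, w0, m0, N, alpha=2.0):
    """Adaptive sample size Newton sketch.

    grad_Rn(w, n) and hess_Rn(w, n) return the gradient and Hessian of the
    regularized empirical risk R_n built from the first n samples. Starting
    from w0, assumed to solve R_{m0} to within its statistical accuracy, the
    sample size grows by the factor alpha and a single Newton step is taken
    at each stage.
    """
    w, m = w0.copy(), m0
    while m < N:
        n = min(int(alpha * m), N)              # enlarge the training set
        step = np.linalg.solve(hess_Rn(w, n), grad_Rn(w, n))
        w = w - step                            # one Newton iteration on R_n
        m = n
    return w
```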

57 Numerical results
Logistic regression problem on the protein homology dataset (KDD Cup 2004)
Number of samples N = 145,751, dimension p = 74
Parameters: V_n = 1/n, c = 20, m_0 = 124, and \alpha = 2
[Figure: R_N(w) - R_N^* versus number of passes over the dataset for SGD, SAGA, Newton, and Ada Newton]
Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset
Can we show that \alpha = 2 is feasible in theory?

58 Proposed research: Ada Newton
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

59 Proposed research
If \alpha is chosen properly, w_m will be in the quadratic neighborhood of Newton's method for the risk R_n
  Then w_n will stay within the statistical accuracy of R_n
Question: how should we choose \alpha?
Computational complexity of Ada Newton: O(n p^2 + p^3)
How can we reduce the computational complexity of Ada Newton?
  There is room for error; we don't need an exact solution
  Hessian computation O(n p^2): sub-sampled Newton reduces it to O(\tau p^2)
  Hessian inverse computation O(p^3): SVD truncation reduces it to O(r p^2)

60 Conclusions
Outline: Introduction; Stochastic quasi-Newton algorithms; Proposed research: Incremental aggregated BFGS; Distributed methods; Proposed research: Quasi-Newton extension; Proposed research: Stochastic decentralized methods; Adaptive sample size algorithms; Proposed research: Ada Newton; Conclusions

61 Conclusions
We studied different approaches to solving large-scale ERM problems
  Stochastic methods: distribute samples over time
  Distributed (decentralized) methods: distribute samples over processors
  Distributed stochastic methods: distribute samples over time and processors
  Adaptive sample size methods: leverage the closeness of solutions
The presented SQN methods improve the convergence time of SGD
  The convergence rate is still slow (sublinear)
  Proposed research: incorporate the incremental aggregation technique to reduce the noise of stochastic approximation

62 Conclusions
NN methods converge faster (quadratic plus linear phases) than DGD (linear)
  (I-1) The convergence is inexact
  (I-2) Implementation of NN requires Hessian computation
    Proposed research: develop a quasi-Newton method that operates in the dual domain
  (I-3) Local gradient computation could be costly (same as other distributed methods)
    Proposed research: explore the use of variance reduction methods
Adaptive sample size version of Newton's method
  In practice we can double the size of the training set (\alpha = 2)
  Proposed research: what is the largest valid choice of \alpha in theory?
  How can we reduce the computational complexity of Ada Newton?
    Proposed research: sub-sampled Newton reduces the cost of the Hessian computation
    Proposed research: SVD truncation reduces the cost of the Hessian inverse computation

63 References
A. Mokhtari and A. Ribeiro, "RES: Regularized Stochastic BFGS Algorithm," IEEE Transactions on Signal Processing (TSP), vol. 62, no. 23, December 2014.
A. Mokhtari and A. Ribeiro, "Global Convergence of Online Limited Memory BFGS," Journal of Machine Learning Research (JMLR), vol. 16, 2015.
A. Mokhtari, Q. Ling, and A. Ribeiro, "Network Newton Distributed Optimization Methods," IEEE Transactions on Signal Processing (TSP), vol. 65, no. 1, January 2017.
A. Mokhtari and A. Ribeiro, "DSA: Decentralized Double Stochastic Averaging Gradient Algorithm," Journal of Machine Learning Research (JMLR), vol. 17, no. 61, pp. 1-35, 2016.


Minimizing finite sums with the stochastic average gradient

Minimizing finite sums with the stochastic average gradient Math. Program., Ser. A (2017) 162:83 112 DOI 10.1007/s10107-016-1030-6 FULL LENGTH PAPER Minimizing finite sums with the stochastic average gradient Mark Schmidt 1 Nicolas Le Roux 2 Francis Bach 3 Received:

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Variable Metric Stochastic Approximation Theory

Variable Metric Stochastic Approximation Theory Variable Metric Stochastic Approximation Theory Abstract We provide a variable metric stochastic approximation theory. In doing so, we provide a convergence theory for a large class of online variable

More information

WE consider the problem of estimating a time varying

WE consider the problem of estimating a time varying 450 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 61, NO 2, JANUARY 15, 2013 D-MAP: Distributed Maximum a Posteriori Probability Estimation of Dynamic Systems Felicia Y Jakubiec Alejro Ribeiro Abstract This

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

Algorithms for Constrained Optimization

Algorithms for Constrained Optimization 1 / 42 Algorithms for Constrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University April 19, 2015 2 / 42 Outline 1. Convergence 2. Sequential quadratic

More information

Distributed Consensus Optimization

Distributed Consensus Optimization Distributed Consensus Optimization Ming Yan Michigan State University, CMSE/Mathematics September 14, 2018 Decentralized-1 Backgroundwhy andwe motivation need decentralized optimization? I Decentralized

More information

Quasi-Newton Methods

Quasi-Newton Methods Newton s Method Pros and Cons Quasi-Newton Methods MA 348 Kurt Bryan Newton s method has some very nice properties: It s extremely fast, at least once it gets near the minimum, and with the simple modifications

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Math 273a: Optimization Netwon s methods

Math 273a: Optimization Netwon s methods Math 273a: Optimization Netwon s methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 some material taken from Chong-Zak, 4th Ed. Main features of Newton s method Uses both first derivatives

More information