High Order Methods for Empirical Risk Minimization

Size: px
Start display at page:

Download "High Order Methods for Empirical Risk Minimization"

Transcription

1 High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania IPAM Workshop of Emerging Wireless Systems February 7th, 2017 Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 1

2 Introduction Introduction Incremental quasi-newton algorithms Adaptive sample size algorithms Conclusions Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 2

3 Optimal resource allocation in wireless systems Wireless channels characterized by random fading coefficients h Want to assign power p(h) as a function of fading to: Satisfy prescribed constraints, e.g., average power, target SINRs Optimize given criteria, e.g., maximize capacity, minimize power Two challenges Resultant optimization problems are infinite dimensional (h is) In most cases problems are not convex However, duality gap is null under mild conditions (Ribeiro-Giannakis 10) And in the dual domain the problem is finite dimensional and convex Motivate use of stochastic optimization algorithms that Have manageable computational complexity per iteration Use sample channel realizations instead of channel distributions Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 3

4 Large-scale empirical risk minimization We aim to solve expected risk minimization problem min w R p E θ [f (w, θ)] The distribution is unknown We have access to N independent realizations of θ We settle for solving the Empirical Risk Minimization (ERM) problem 1 min F (w) := min w Rp w R p N N f (w, θ i ) i=1 Large-scale optimization or machine learning: large N, large p N: number of observations (inputs) p: number of parameters in the model Not just wireless Many (most) machine learning algorithms reduce to ERM problems Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 4

5 Optimization methods Stochastic methods: a subset of samples is used at each iteration SGD is the most popular; however, it is slow because of Noise of stochasticity Variance reduction (SAG, SAGA, SVRG,...) Poor curvature approx. Stochastic QN (SGD-QN, RES, olbfgs,...) Decentralized methods: samples are distributed over multiple processors Primal methods: DGD, Acc. DGD, NN,... Dual methods: DDA, DADMM, DQM, EXTRA, ESOM,... Adaptive sample size methods: start with a subset of samples and increase the size of training set at each iteration Ada Newton The solutions are close when the number of samples are close Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 5

6 Incremental quasi-newton algorithms Introduction Incremental quasi-newton algorithms Adaptive sample size algorithms Conclusions Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 6

7 Incremental Gradient Descent Objective function gradients s(w) := F (w) = 1 N N f (w, θ i ) (Deterministic) gradient descent iteration w t+1 = w t ɛ t s(w t) Evaluation of (deterministic) gradients is not computationally affordable i=1 Incremental/Stochastic gradient Sample average in lieu of expectations ŝ(w, θ) = 1 L f (w, θ l ) θ = [θ 1;...; θ L ] L l=1 Functions are chosen cyclically or at random with or without replacement Incremental gradient descent iteration w t+1 = w t ɛ t ŝ(w t, θ t) (Incremental) gradient descent is (very) slow. Newton is impractical Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 7

8 BFGS quasi-newton method Approximate function s curvature with Hessian approximation matrix B 1 t w t+1 = w t ɛ t B 1 t s(w t) Make B t close to H(w t) := 2 F (w t). Broyden, DFP, BFGS Variable variation: v t = w t+1 w t. Gradient variation: r t = s(w t+1) s(w t) Matrix B t+1 satisfies secant condition B t+1v t = r t. Underdetermined Resolve indeterminacy making B t+1 closest to previous approximation B t Using Gaussian relative entropy as proximity condition yields update B t+1 = B t + rtrt t v T Btvtvt T B t t r t v T t B tv t Superlinear convergence Close enough to quadratic rate of Newton BFGS requires gradients Use stochastic/incremental gradients Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 8

9 Stochastic/Incremental quasi-newton methods Online (o)bfgs & Online Limited-Memory (ol)bfgs [Schraudolph et al 07] obfgs may diverge because Hessian approximation gets close to singular Regularized Stochastic BFGS (RES) [Mokhtari-Ribeiro 14] olbfgs does (surprisingly) converge [Mokhtari-Ribeiro 15] Problem solved? Alas. RES and olbfgs have sublinear convergence Same as stochastic gradient descent. Asymptotically not better Variance reduced stochastic L-BFGS (SVRG+oLBFGS) [Mortiz et al 16] Linear convergence rate. But this is not better than SAG, SAGA, SVRG Computationally feasible quasi Newton method with superlinear convergence Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 9

10 Incremental aggregated gradient method Utilize memory to reduce variance of stochastic gradient approximation f t 1 f t it f t N f i t (wt+1 ) f t+1 1 f t+1 it f t+1 N Descend along incremental gradient w t+1 = w t α N N i=1 f t i Select update index i t cyclically. Uniformly at random is similar = w t αg t i Update gradient corresponding to function f it f t+1 i t = f it (w t+1 ) Sum easy to compute g t+1 i = gi t f t+1 i t + f t+1 i t. Converges linearly Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 10

11 Incremental BFGS method Keep memory of variables z t i, Hessian approximations B t i, and gradients fi t Functions indexed by i. Time indexed by t. Select function f it at time t z t 1 z t it z t N B t 1 B t it B t N f t 1 f t it f t N w t+1 All gradients, matrices, and variables used to update w t+1 Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 11

12 Incremental BFGS method Keep memory of variables z t i, Hessian approximations B t i, and gradients fi t Functions indexed by i. Time indexed by t. Select function f it at time t z t 1 z t it z t N B t 1 B t it B t N f t 1 f t it f t N w t+1 f i t (wt+1 ) Updated variable w t+1 used to update gradient f t+1 i t = f it (w t+1 ) Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 12

13 Incremental BFGS method Keep memory of variables z t i, Hessian approximations B t i, and gradients fi t Functions indexed by i. Time indexed by t. Select function f it at time t z t 1 z t it z t N B t 1 B t it B t N f t 1 f t it f t N w t+1 B t+1 it f i t (wt+1 ) Update Bi t t to satisfy secant condition for function f it for variable variation z t i t w t+1 and gradient variation f t+1 i t fi t t (more later) Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 13

14 Incremental BFGS method Keep memory of variables z t i, Hessian approximations B t i, and gradients fi t Functions indexed by i. Time indexed by t. Select function f it at time t z t 1 z t it z t N B t 1 B t it B t N f t 1 f t it f t N w t+1 B t+1 it f i t (wt+1 ) z t+1 1 z t+1 it z t+1 N B t+1 1 B t+1 it B t+1 N f t+1 1 f t+1 it f t+1 N Update variable, Hessian approximation, and gradient memory for function f it Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 14

15 Update of Hessian approximation matrices Variable variation at time t for function f i = f it vi t := z t+1 i Gradient variation at time t for function f i = f it ri t := f t+1 i t z t i Update B t i = B t i t to satisfy secant condition for variations v t i and r t i B t+1 i = B t i + rt i r t i T r t i T v t i Bt i v t i v t i T B t i v t i T B t i vt i f t i t We want B t i to approximate the Hessian of the function f i = f it Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 15

16 A naive (in hindsight) incremental BFGS method The key is in the update of w t. Use memory in stochastic quantities ( w t+1 = w t 1 N N i=1 B t i ) 1 ( 1 N N i=1 f t i It doesn t work Better than incremental gradient but not superlinear ) Optimization updates are solutions of function approximations In this particular update we are minimizing the quadratic form f (w) 1 n [ f i (z t i ) + f i (z t i ) T (w w t) + 1 ] n 2 (w wt ) T B t i (w w t ) i=1 Gradients evaluated at z t i. Secant condition verified at z t i The quadratic form is centered at w t. Not a reasonable Taylor series Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 16

17 A proper Taylor series expansion Each individual function f i is being approximated by the quadratic f i (w) f i (z t i ) + f i (z t i ) T (w w t) (w wt ) T B t i (w w t ) To have a proper expansion we have to recenter the quadratic form at z t i f i (w) f i (z t i ) + f i (z t i ) T (w z t i ) (w zt i ) T B t i (w z t i ) I.e., we approximate f (w) with the aggregate quadratic function f (w) 1 N N i=1 [ f i (z t i ) + f i (z t i ) T (w z t i ) + 1 ] 2 (w zt i ) T B t i (w z t i ) This is now a reasonable Taylor series that we use to derive an update Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 17

18 Incremental BFGS Solving this quadratic program yields the update for the IQN method ( w t+1 1 N = N i=1 B t i ) 1 [ 1 N B t i z t i 1 N N i=1 ] N f i (z t i ) i=1 Looks difficult to implement but it is more similar to BFGS than apparent As in BFGS, it can be implemented with O(p 2 ) operations Write as rank-2 update, use matrix inversion lemma Independently of N. True incremental method. Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 18

19 Superlinear convergence rate The functions f i are m-strongly convex. The gradients f i are M-Lipschitz continuous. The Hessians f i are L-Lipschitz continuous Theorem The sequence of residuals w t w in the IQN method converges to zero at a superlinear rate, lim t w t w (1/N)( w t 1 w + + w t N w ) = 0. Incremental method with small cost per iteration converging at superlinear rate Resulting from the use of memory to reduce stochastic variances Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 19

20 Numerical results Quadratic programming f (w) := (1/N) N i=1 wt A i w/2 + b T i w A i R p p is a diagonal positive definite matrix b i R p is a random vector from the box [0, 10 3 ] p N = 1000, p = 100, and condition number (10 2, 10 4 ) Relative error w t w / w 0 w of SAG, SAGA, IAG, and IQN Normalized Error SAG SAGA IAG IQN Normalized Error SAG SAGA IAG IQN Number of Effective Passes (a) small condition number Number of Effective Passes (b) large condition number Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 20

21 Adaptive sample size algorithms Introduction Incremental quasi-newton algorithms Adaptive sample size algorithms Conclusions Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 21

22 Back to the ERM problem Our original goal was to solve the statistical loss problem w := argmin w R p L(w) = argmin E [f (w, Z)] w R p But since the distribution of Z is unknown we settle for the ERM problem w N 1 := argmin L N (w) = argmin w R p w R p N N f (w, z k ) Where the samples z k are drawn from a common distribution ERM approximates actual problem Don t need perfect solution k=1 Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 22

23 Regularized ERM problem From statistical learning we know that there exists a constant V N such that sup L(w) L N (w) V N, w w.h.p. V N = O(1/ N) from CLT. V N = O(1/N) sometimes [Bartlett et al 06] There is no need to minimize L N (w) beyond accuracy O(V N ) This is well known. In fact, this is why we can add regularizers to ERM w N := argmin w R N (w) = argmin L N (w) + cv N w 2 w 2 Adding the term (cv N /2) w 2 moves the optimum of the ERM problem But the optimum w N is still in a ball of order V N around w Goal: Minimize the risk R N within its statistical accuracy V N Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 23

24 Adaptive sample size methods ERM problem R n for subset of n N unif. chosen samples w n := argmin w R n(w) = argmin L n(w) + cvn w 2 w 2 Solutions wm for m samples and wn for n samples are close Find approx. solution w m for the risk R m with m samples Increase sample size to n > m samples Use w m as a warm start to find approx. solution w n for R n If m < n, it is easier to solve R m comparing to R n since The condition number of R m is smaller than R n The required accuracy V m is larger than V n The computation cost of solving R m is lower than R n Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 24

25 Adaptive sample size Newton method Ada Newton is a specific adaptive sample size method using Newton steps Find w m that solves R m to its statistical accuracy V m Apply single Newton iteration w n = w m 2 R n(w m) 1 R n(w m) If m and n close, we have w n within statistical accuracy of R n w n w m This works if statistical accuracy ball of R m is within Newton quadratic convergence ball of R n. Then, w m is within Newton quadratic convergence ball of R n A single Newton iteration yields w n within statistical accuracy of R n Question: How should we choose α? Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 25

26 Assumptions The functions f (w, z) are convex The gradients f (w, z) are M-Lipschitz continuous f (w, z) f (w, z) M w w, for all z. The functions f (w, z) are self-concordant with respect to w for all z Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 26

27 Theoretical result Theorem Consider w m as a V m-optimal solution of R m, i.e., R m(w m) R m(wm) V m, and let n = αm. If the inequalities [ ] 1 2(M +cvm)v 2 m 2(n m) + + ((2+ 2)c c w )(V m V n) cv n nc 1 2 (cv n) , [ 2(n m) 144 V m + (V n m + V m) c w 2 2 (V m V n)] V n n 2 are satisfied, then w n has sub-optimality error V n w.h.p., i.e., R n(w n) R n(w n ) V n, w.h.p. Condition1 w m is in the Newton quadratic convergence ball of R n Condition 2 w n is in the statistical accuracy of R n Condition 2 becomes redundant for large m Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 27

28 Doubling the size of training set Proposition Consider a learning problem in which the statistical accuracy satisfies V m αv n for n = αm and lim n V n = 0. If c is chosen so that ( ) 1/2 2αM + c 2(α 1) αc 1/2 1 4, then, there exists a sample size m such that the conditions in Theorem 1 are satisfied for all m > m and n = αm. We can double the size of training set α = 2 If the size of training set is large enough If the constant c satisfies c > 16(2 M + 1) 2 We achieve the S.A. of the full training set in about 2 passes over the data After inversion of about 3.32 log 10 N Hessians Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 28

29 Adaptive sample size Newton (Ada Newton) Parameters: α 0 = 2 and 0 < β < 1. Initialize: n = m 0 and w n = w m0 with R n(w n) R n(w n ) V n while n N do Update w m = w n and m = n. Reset factor α = α 0 repeat [sample size backtracking loop] 1: Increase sample size: n = min{αm, N}. 2: Comp. gradient: R n(w m) = 1 n n k=1 f (wm, z k) + cv nw m 3: Comp. Hessian: 2 R n(w m) = 1 n n k=1 2 f (w m, z k ) + cv ni 4: Update the variable: w n = w m 2 R n(w m) 1 R n(w m) 5: Backtrack sample size increase α = βα. until R n(w n) R n(w n ) V n end while Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 29

30 Numerical results LR problem Protein homology dataset provided (KDD cup 2004) Number of samples N = 145, 751, dimension p = 74 Parameters V n = 1/n, c = 20, m 0 = 124, and α = SGD SAGA Newton Ada Newton RN(w) R N Number of passes Ada Newton achieves the statistical accuracy of the full training set with about two passes over the dataset Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 30

31 Numerical results We use A9A and SUSY datasets to train a LR problem A9A: N = 32, 561 samples with dimension p = 123 SUSY: N = 5, 000, 000 samples with dimension p = 18 The green line shows the iteration at which Ada Newton reached convergence on the test set Ada Newton Newton SAGA Ada Newton Newton SAGA RN(w) R N RN(w) R N Number of passes Number of passes Figure: Suboptimality vs No. of effective passes. A9A (left) and SUSY (right) Ada Newton achieves the accuracy of R N (w) R N < 1/N by less than 2.3 passes over the full training set Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 31

32 Numerical results We use A9A and SUSY datasets to train a LR problem A9A: N = 32, 561 samples with dimension p = 123 SUSY: N = 5, 000, 000 samples with dimension p = 18 The green line shows the iteration at which Ada Newton reached convergence on the test set Ada Newton Newton SAGA Ada Newton Newton SAGA RN(w) R N RN(w) R N Run time Run time Figure: Suboptimality vs runtime. A9A (left) and SUSY (right) Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 32

33 Numerical results We use A9A and SUSY datasets to train a LR problem A9A: N = 32, 561 samples with dimension p = 123 SUSY: N = 5, 000, 000 samples with dimension p = Ada Newton Ada Newton 0.65 Newton SAGA 0.65 Newton SAGA Error 0.5 Error Number of passes Number of passes Figure: Test error vs No. of effective passes. A9A (left) and SUSY (right) Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 33

34 Ada Newton and the challenges for Newton in ERM There are four reasons why it is impractical to use Newton s method in ERM It is costly to compute Hessians (and gradients). Order O(Np 2 ) operations It is costly to invert Hessians. Order O(p 3 ) operations. A line search is needed to moderate stepsize outside of quadratic region Quadratic convergence is advantageous close to the optimum but we don t want to optimize beyond statistical accuracy Ada Newton (mostly) overcomes these four challenges Compute Hessians for a subset of samples. Two passes over dataset Hessians are inverted in a logarithmic number of steps. But still There is no line search We enter quadratic regions without going beyond statistical accuracy Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 34

35 Conclusions Introduction Incremental quasi-newton algorithms Adaptive sample size algorithms Conclusions Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 35

36 Conclusions We studied different approaches to solve large-scale ERM problems An incremental quasi-newton BFGS method (IQN) was presented IQN only computes the information of a single function at each step Low computation cost IQN aggregates variable, gradient, and BFGS approximation Reduce the noise Superlinear convergence Ada Newton resolves the Newton-type methods drawbacks Unit stepsize No line search Not sensitive to initial point less Hessian inversions Exploits quadratic convergence of Newton s method at each iteration Ada Newton achieves statistical accuracy with about two passes over the data Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 36

37 References N. N. Schraudolph, J. Yu, and S. Gunter, A Stochastic Quasi-Newton Method for Online Convex Optimization, In AISTATS, vol. 7, pp , A. Mokhtari and A. Ribeiro, RES: Regularized Stochastic BFGS Algorithm, IEEE Trans. on Signal Processing (TSP), vol. 62, no. 23, pp , December A. Mokhtari and A. Ribeiro, Global Convergence of Online Limited Memory BFGS, Journal of Machine Learning Research (JMLR), vol. 16, pp , P. Moritz, R. Nishihara, and M. I. Jordan, A Linearly-Convergent Stochastic L-BFGS Algorithm, Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp , 2016 A. Mokhtari, M. Eisen, and A. Ribeiro, An Incremental Quasi-Newton Method with a Local Superlinear Convergence Rate, in Proc. Int. Conf. Acoustics Speech Signal Process. (ICASSP), New Orleans, LA, March A. Mokhtari, H. Daneshmand, A. Lucchi, T. Hofmann, and A. Ribeiro, Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy, In Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, Alejandro Ribeiro High Order Methods for Empirical Risk Minimization 37

High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu Thanks to: Aryan Mokhtari, Mark

More information

High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless

More information

Incremental Quasi-Newton methods with local superlinear convergence rate

Incremental Quasi-Newton methods with local superlinear convergence rate Incremental Quasi-Newton methods wh local superlinear convergence rate Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro Department of Electrical and Systems Engineering Universy of Pennsylvania Int. Conference

More information

Efficient Methods for Large-Scale Optimization

Efficient Methods for Large-Scale Optimization Efficient Methods for Large-Scale Optimization Aryan Mokhtari Department of Electrical and Systems Engineering University of Pennsylvania aryanm@seas.upenn.edu Ph.D. Proposal Advisor: Alejandro Ribeiro

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, 2014 6089 RES: Regularized Stochastic BFGS Algorithm Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE Abstract RES, a regularized

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725

Quasi-Newton Methods. Javier Peña Convex Optimization /36-725 Quasi-Newton Methods Javier Peña Convex Optimization 10-725/36-725 Last time: primal-dual interior-point methods Consider the problem min x subject to f(x) Ax = b h(x) 0 Assume f, h 1,..., h m are convex

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy Adaptive Neton Method for Empirical Risk Minimization to Statistical Accuracy Aryan Mokhtari? University of Pennsylvania aryanm@seas.upenn.edu Hadi Daneshmand? ETH Zurich, Sitzerland hadi.daneshmand@inf.ethz.ch

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)

More information

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS)

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS) Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno (BFGS) Limited memory BFGS (L-BFGS) February 6, 2014 Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb

More information

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Improving the Convergence of Back-Propogation Learning with Second Order Methods

Improving the Convergence of Back-Propogation Learning with Second Order Methods the of Back-Propogation Learning with Second Order Methods Sue Becker and Yann le Cun, Sept 1988 Kasey Bray, October 2017 Table of Contents 1 with Back-Propagation 2 the of BP 3 A Computationally Feasible

More information

ABSTRACT 1. INTRODUCTION

ABSTRACT 1. INTRODUCTION A DIAGONAL-AUGMENTED QUASI-NEWTON METHOD WITH APPLICATION TO FACTORIZATION MACHINES Aryan Mohtari and Amir Ingber Department of Electrical and Systems Engineering, University of Pennsylvania, PA, USA Big-data

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University

More information

DECENTRALIZED algorithms are used to solve optimization

DECENTRALIZED algorithms are used to solve optimization 5158 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 19, OCTOBER 1, 016 DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Aryan Mohtari, Wei Shi, Qing Ling,

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Second-Order Methods for Stochastic Optimization

Second-Order Methods for Stochastic Optimization Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Decentralized Quasi-Newton Methods

Decentralized Quasi-Newton Methods 1 Decentralized Quasi-Newton Methods Mark Eisen, Aryan Mokhtari, and Alejandro Ribeiro Abstract We introduce the decentralized Broyden-Fletcher- Goldfarb-Shanno (D-BFGS) method as a variation of the BFGS

More information

Why should you care about the solution strategies?

Why should you care about the solution strategies? Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the

More information

5 Quasi-Newton Methods

5 Quasi-Newton Methods Unconstrained Convex Optimization 26 5 Quasi-Newton Methods If the Hessian is unavailable... Notation: H = Hessian matrix. B is the approximation of H. C is the approximation of H 1. Problem: Solve min

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

LEARNING STATISTICALLY ACCURATE RESOURCE ALLOCATIONS IN NON-STATIONARY WIRELESS SYSTEMS

LEARNING STATISTICALLY ACCURATE RESOURCE ALLOCATIONS IN NON-STATIONARY WIRELESS SYSTEMS LEARIG STATISTICALLY ACCURATE RESOURCE ALLOCATIOS I O-STATIOARY WIRELESS SYSTEMS Mark Eisen Konstantinos Gatsis George J. Pappas Alejandro Ribeiro Department of Electrical and Systems Engineering, University

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex

More information

Variable Metric Stochastic Approximation Theory

Variable Metric Stochastic Approximation Theory Variable Metric Stochastic Approximation Theory Abstract We provide a variable metric stochastic approximation theory. In doing so, we provide a convergence theory for a large class of online variable

More information

Learning in Non-Stationary Wireless Control Systems via Newton s Method

Learning in Non-Stationary Wireless Control Systems via Newton s Method Learning in on-stationary Wireless Control Systems via ewton s Method Mark Eisen, Konstantinos Gatsis, George J. Pappas and Alejandro Ribeiro Abstract This paper considers wireless control systems over

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

Learning in Non-Stationary Wireless Control Systems via Newton s Method

Learning in Non-Stationary Wireless Control Systems via Newton s Method Learning in on-stationary Wireless Control Systems via ewton s Method Mark Eisen, Konstantinos Gatsis, George Pappas and Alejandro Ribeiro Abstract This paper considers wireless control systems over an

More information

Lecture 18: November Review on Primal-dual interior-poit methods

Lecture 18: November Review on Primal-dual interior-poit methods 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Lecturer: Javier Pena Lecture 18: November 2 Scribes: Scribes: Yizhu Lin, Pan Liu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

MATH 4211/6211 Optimization Quasi-Newton Method

MATH 4211/6211 Optimization Quasi-Newton Method MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0 Quasi-Newton Method Motivation:

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy

Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy Efficient Distributed Hessian Free Algorithm for Large-scale Empirical Risk Minimization via Accumulating Sample Strategy Majid Jahani Xi He Chenxin Ma Aryan Mokhtari Lehigh University Lehigh University

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Non-convex optimization. Issam Laradji

Non-convex optimization. Issam Laradji Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Optimization for neural networks

Optimization for neural networks 0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng

Optimization 2. CS5240 Theoretical Foundations in Multimedia. Leow Wee Kheng Optimization 2 CS5240 Theoretical Foundations in Multimedia Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore Leow Wee Kheng (NUS) Optimization 2 1 / 38

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Programming, numerics and optimization

Programming, numerics and optimization Programming, numerics and optimization Lecture C-3: Unconstrained optimization II Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428

More information

Regression with Numerical Optimization. Logistic

Regression with Numerical Optimization. Logistic CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

J. Sadeghi E. Patelli M. de Angelis

J. Sadeghi E. Patelli M. de Angelis J. Sadeghi E. Patelli Institute for Risk and, Department of Engineering, University of Liverpool, United Kingdom 8th International Workshop on Reliable Computing, Computing with Confidence University of

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Online Limited-Memory BFGS for Click-Through Rate Prediction

Online Limited-Memory BFGS for Click-Through Rate Prediction Online Limited-Memory BFGS for Click-Through Rate Prediction Mitchell Stern Department of Computer and Information Science University of Pennsylvania Philadelphia, PA 19104, USA Aryan Mokhtari Department

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

WE consider the problem of estimating a time varying

WE consider the problem of estimating a time varying 450 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 61, NO 2, JANUARY 15, 2013 D-MAP: Distributed Maximum a Posteriori Probability Estimation of Dynamic Systems Felicia Y Jakubiec Alejro Ribeiro Abstract This

More information

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania

More information

Second-Order Stochastic Optimization for Machine Learning in Linear Time

Second-Order Stochastic Optimization for Machine Learning in Linear Time Journal of Machine Learning Research 8 (207) -40 Submitted 9/6; Revised 8/7; Published /7 Second-Order Stochastic Optimization for Machine Learning in Linear Time Naman Agarwal Computer Science Department

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Accelerated Gradient Temporal Difference Learning

Accelerated Gradient Temporal Difference Learning Accelerated Gradient Temporal Difference Learning Yangchen Pan, Adam White, and Martha White {YANGPAN,ADAMW,MARTHA}@INDIANA.EDU Department of Computer Science, Indiana University Abstract The family of

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

A Distributed Newton Method for Network Utility Maximization, I: Algorithm

A Distributed Newton Method for Network Utility Maximization, I: Algorithm A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order

More information