Optimized first-order minimization methods

1 Optimized first-order minimization methods
Donghwan Kim & Jeffrey A. Fessler
EECS Dept., BME Dept., Dept. of Radiology, University of Michigan
web.eecs.umich.edu/~fessler
UM AIM Seminar

2 Disclosure
Research support from GE Healthcare
Research support to GE Global Research
Supported in part by NIH grants R01 HL and P01 CA
Equipment support from Intel Corporation

3 Low-dose X-ray CT image reconstruction
Thin-slice FBP (seconds), ASIR (a bit longer), Statistical (much longer); same sinogram, so all at the same dose.
Image reconstruction as an optimization problem:
\hat{x} = \arg\min_x \|y - A x\|_W^2 + R(x)
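As a rough illustration of the data term above, here is a minimal NumPy sketch of the gradient of the weighted least-squares cost plus a smooth regularizer; the names (A, W, y, reg_grad) are illustrative placeholders, not the actual CT system model used in the talk.

```python
import numpy as np

def wls_cost_grad(x, A, W, y, reg_grad):
    """Gradient of f(x) = ||y - A x||_W^2 + R(x) for a smooth regularizer R.

    A: system matrix, W: diagonal statistical weights (stored as a vector), y: sinogram.
    reg_grad: callable returning the gradient of R at x. All names are illustrative.
    """
    residual = A @ x - y
    return 2.0 * (A.T @ (W * residual)) + reg_grad(x)
```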

4 Outline
Motivation (done)
Problem definition
Existing algorithms
  Gradient descent
  Nesterov's optimal first-order methods
  General first-order methods
Optimizing first-order minimization methods
  Drori & Teboulle's numerical bounds
  Donghwan Kim's analytically optimized ("more optimal") first-order methods
Examples: logistic regression for machine learning; CT image reconstruction
Summary / future work

5 Problem setting

6 Optimization problem setting
\hat{x} \in \arg\min_x f(x)
Unconstrained
Large-scale (Hessian too big to store): image reconstruction, big-data / machine learning, ...
Cost function assumptions (throughout):
f : \mathbb{R}^M \to \mathbb{R} convex (need not be strictly convex)
non-empty set of global minimizers: \hat{x} \in X_* = \{ x_* \in \mathbb{R}^M : f(x_*) \le f(x), \forall x \in \mathbb{R}^M \}
smooth (differentiable with L-Lipschitz gradient): \|\nabla f(x) - \nabla f(z)\|_2 \le L \|x - z\|_2, \forall x, z \in \mathbb{R}^M

7 Example: Machine learning
To learn weights x of a binary classifier given feature vectors \{v_i\} and labels \{y_i\}, where y_i = \pm 1:
f(x) = \sum_i \psi(y_i \langle x, v_i \rangle)
Loss functions (surrogates for the 0-1 loss):
0-1: \psi(t) = I\{t \le 0\}
exponential: \psi(t) = \exp(-t)
logistic: \psi(t) = \log(1 + \exp(-t))
hinge: \psi(t) = \max\{0, 1 - t\}
Which of these fit our conditions? (see the sketch below)
[Figure: the four loss functions \psi(t) plotted versus t]
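To make the comparison concrete, here is a small NumPy sketch of the four surrogates, with t = y_i \langle x, v_i \rangle; of the four, only the logistic loss is both convex and has an L-Lipschitz gradient, so it is the one that fits the assumptions above.

```python
import numpy as np

# t = y_i * <x, v_i>; the four surrogate losses from the slide.
def loss_01(t):        return (t <= 0).astype(float)      # non-convex, non-smooth
def loss_exp(t):       return np.exp(-t)                   # convex, but gradient not globally Lipschitz
def loss_logistic(t):  return np.logaddexp(0.0, -t)        # log(1 + exp(-t)): convex, 1/4-Lipschitz gradient
def loss_hinge(t):     return np.maximum(0.0, 1.0 - t)     # convex, non-smooth
```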

8 Algorithms

9 Gradient descent
Iteration with step size 1/L ensures monotonic descent of f:
x_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)
Stacking the iterates:
[x_1; x_2; \ldots; x_{N-1}; x_N] = [x_0; x_1; \ldots; x_{N-2}; x_{N-1}] - \frac{1}{L} H_{GD} [\nabla f(x_0); \nabla f(x_1); \ldots; \nabla f(x_{N-2}); \nabla f(x_{N-1})], with H_{GD} = I.
Note: the N x N coefficient matrix H_{GD} is diagonal (a special case of lower triangular).
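A minimal NumPy sketch of the plain gradient descent iteration above; grad_f, x0, and the Lipschitz constant L are supplied by the user.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, n_iter):
    """GD with the fixed step 1/L: x_{n+1} = x_n - (1/L) grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - grad_f(x) / L
    return x
```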

10 Gradient descent convergence rate
Classic O(1/n) convergence rate of cost function descent (the left side is the "inaccuracy"):
f(x_n) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{2n}
Drori & Teboulle (2013) derive the tightest inaccuracy bound:
f(x_n) - f(x_*) \le \frac{L \|x_0 - x_*\|_2^2}{4n + 2}
They construct a Huber-like function f for which GD achieves that bound, so the case is closed for GD.
The O(1/n) rate is undesirably slow.

11 Heavy ball method
Iteration (Polyak, 1987), for implementation:
x_{n+1} = x_n - \frac{\alpha}{L} \nabla f(x_n) + \beta (x_n - x_{n-1})    (the last term is the momentum!)
Stacking, for analysis:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} \alpha \beta^{n-k} \nabla f(x_k),
i.e., the N x N coefficient matrix H_{HB} is lower triangular with entries h_{n+1,k} = \alpha \beta^{n-k} (last row: \alpha\beta^{N-1}, \ldots, \alpha\beta^2, \alpha\beta, \alpha).
How to choose \alpha and \beta? How to optimize the N x N coefficient matrix H more generally?
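A sketch of the heavy-ball recursion in NumPy, using the implementation form rather than the stacked H_HB form; alpha and beta are left as user-chosen parameters, which is exactly the open question posed on the slide.

```python
import numpy as np

def heavy_ball(grad_f, x0, L, alpha, beta, n_iter):
    """Polyak heavy ball: x_{n+1} = x_n - (alpha/L) grad f(x_n) + beta (x_n - x_{n-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(n_iter):
        x_next = x - (alpha / L) * grad_f(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```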

12 General first-order method class
General first-order (FO) iteration:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Stacking: the N x N coefficient matrix H_{FO} = (h_{n+1,k}) is lower triangular ("causal"), with row n+1 holding h_{n+1,0}, \ldots, h_{n+1,n}.
Primary goals:
Analyze the convergence rate of FO for any given H
Optimize the N x N lower-triangular step-size coefficient matrix H for fast convergence, an efficient recursive implementation, and universality (design prior to iterating).
(A sketch of the general FO iteration follows below.)
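The general FO class can be run directly from a given lower-triangular coefficient matrix H; this NumPy sketch stores every past gradient, which is precisely the O(N)-memory / O(N^2)-work "analysis" form (efficient recursions for specific H come later).

```python
import numpy as np

def first_order_method(grad_f, x0, L, H):
    """General FO iteration: x_{n+1} = x_n - (1/L) sum_{k<=n} H[n, k] * grad f(x_k).

    H is an N x N lower-triangular array; row n holds the coefficients h_{n+1,k}.
    """
    x = np.asarray(x0, dtype=float)
    grads = []
    for n in range(H.shape[0]):
        grads.append(grad_f(x))
        x = x - sum(H[n, k] * grads[k] for k in range(n + 1)) / L
    return x
```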

13 Not: Barzilai-Borwein gradient method
Barzilai & Borwein, 1988:
g^{(n)} \triangleq \nabla f(x_n),
\alpha_n = \frac{\|x_n - x_{n-1}\|_2^2}{\langle x_n - x_{n-1},\, g^{(n)} - g^{(n-1)} \rangle},
x_{n+1} = x_n - \alpha_n \nabla f(x_n).
Not in the first-order class FO (the step size depends on the iterates). Neither are methods like steepest descent (with line search), conjugate gradient, quasi-Newton, ...
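For contrast, here is a sketch of the BB1 iteration; note that alpha_n is computed from the iterates and gradients themselves, which is why this method (like line-search and quasi-Newton methods) falls outside the fixed-coefficient FO class.

```python
import numpy as np

def barzilai_borwein(grad_f, x0, L, n_iter):
    """Barzilai-Borwein (BB1) step sizes; one initial 1/L step provides the first difference pair."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - g_prev / L
    for _ in range(n_iter):
        g = grad_f(x)
        s, r = x - x_prev, g - g_prev
        alpha = np.dot(s, s) / np.dot(s, r)   # alpha_n = ||s||^2 / <s, g^(n) - g^(n-1)>
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x
```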

14 Nesterov's fast gradient method (FGM1)
Nesterov (1983) iteration:
Initialize: t_0 = 1, z_0 = x_0
z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)    (usual GD update)
t_{n+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4 t_n^2}\right)    (magic momentum factors)
x_{n+1} = z_{n+1} + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n)    (update with momentum)
Reverts to GD if t_n = 1 for all n.
FGM1 is in class FO: x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), with
h_{n+1,k} = \frac{t_n - 1}{t_{n+1}} h_{n,k}, k = 0, \ldots, n-2;
h_{n+1,k} = \frac{t_n - 1}{t_{n+1}} (h_{n,n-1} - 1), k = n-1;
h_{n+1,k} = 1 + \frac{t_n - 1}{t_{n+1}}, k = n.
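A sketch of FGM1 in its efficient two-sequence form (the h_{n+1,k} recursion above is only needed for analysis, not for implementation).

```python
import numpy as np

def fgm1(grad_f, x0, L, n_iter):
    """Nesterov's fast gradient method (FGM1)."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    for _ in range(n_iter):
        z_new = x - grad_f(x) / L                         # usual GD update
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t**2))   # momentum factor
        x = z_new + ((t - 1.0) / t_new) * (z_new - z)     # momentum update
        z, t = z_new, t_new
    return x
```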

15 Nesterov FGM1 optimal convergence rate
Shown by Nesterov to be O(1/n^2) for the auxiliary sequence:
f(z_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+1)^2}.
Nesterov also constructed a function f for which any first-order method satisfies
\frac{3 L \|x_0 - x_*\|_2^2}{32 (n+1)^2} \le f(x_n) - f(x_*).
Thus the O(1/n^2) rate of FGM1 is optimal (up to a constant).
New results (Donghwan Kim, 2014): bound on the convergence rate of the primary sequence \{x_n\}:
f(x_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+2)^2}.
This verifies a (numerically inspired) conjecture of Drori & Teboulle (2013).

16 Overview
General first-order (FO) iteration:
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Analyze (i.e., bound) the convergence rate as a function of:
the number of iterations N
the Lipschitz constant L
the step-size coefficients H = \{h_{n+1,k}\}
the distance to a solution R = \|x_0 - x_*\|
Then optimize H by minimizing the bound.

17 Ideal universal bound for first-order methods
For a given number of iterations N, Lipschitz constant L, step-size coefficients H = \{h_{n+1,k}\}, and distance to a solution R = \|x_0 - x_*\|, bound the worst-case convergence rate of the FO algorithm:
B_1(H, R, L, N) \triangleq \max_{f \in F_L} \max_{x_0, x_1, \ldots, x_N \in \mathbb{R}^M} \max_{x_* \in X_*(f)} \; f(x_N) - f(x_*)
such that \|x_0 - x_*\| \le R and
x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), n = 0, \ldots, N-1.
Clearly, for any FO method: f(x_N) - f(x_*) \le B_1(H, R, L, N).

18 Towards practical bounds for first-order methods
For convex functions with L-Lipschitz gradients:
\frac{1}{2L} \|\nabla f(x) - \nabla f(z)\|^2 \le f(x) - f(z) - \langle \nabla f(z), x - z \rangle, \forall x, z \in \mathbb{R}^M.
Drori & Teboulle (2013) use this inequality to propose a more tractable bound, where g_n = \frac{1}{LR} \nabla f(x_n) and \delta_n = \frac{1}{L R^2} (f(x_n) - f(x_*)):
B_2(H, R, L, N) \triangleq \max_{g_0, \ldots, g_N \in \mathbb{R}^M} \max_{\delta_0, \ldots, \delta_N \in \mathbb{R}} \max_{x_0, x_1, \ldots, x_N \in \mathbb{R}^M} \max_{x_* : \|x_0 - x_*\| \le R} \; L R^2 \delta_N
such that x_{n+1} = x_n - R \sum_{k=0}^{n} h_{n+1,k} g_k, n = 0, \ldots, N-1, and
\frac{1}{2} \|g_i - g_j\|^2 \le \delta_i - \delta_j - \frac{1}{R} \langle g_j, x_i - x_j \rangle, i, j = 0, \ldots, N.
For any FO method: f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N).
However, even B_2 has no known closed-form solution.

19 Numerical bounds for first-order methods
Drori & Teboulle (2013) further relax the bound, leading eventually to a still simpler optimization problem (with no known closed-form solution):
f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N) \le B_3(H, R, L, N).
For given step-size coefficients H and a given number of iterations N, they use a semidefinite program (SDP) to compute B_3 numerically.
They find numerically that, for the FGM1 choice of H, the convergence bound B_3 is slightly tighter than \frac{2 L \|x_0 - x_*\|_2^2}{(N+1)^2}.

20 Optimizing step-size coefficients numerically
Drori & Teboulle (2013) also compute numerically the minimizer over H of their relaxed bound for a given N, again using a semidefinite program (SDP):
H_* = \arg\min_H B_3(H, R, L, N).
[Figure: numerical solution for H_* for N = 5 iterations, from Drori & Teboulle (2013)]
Drawbacks:
Must choose N in advance
Requires O(N) memory for all gradient vectors \{\nabla f(x_n)\}_{n=1}^{N} and O(N^2) computation for N iterations
Benefit: the convergence bound (for the specific N) is about 2x lower than for Nesterov's FGM1.

21 New analytical solution
Analytical solution H_* for the optimized step-size coefficients (Donghwan Kim, 2014):
h_{n+1,k} = \frac{\theta_n - 1}{\theta_{n+1}} h_{n,k}, k = 0, \ldots, n-2;
h_{n+1,k} = \frac{\theta_n - 1}{\theta_{n+1}} (h_{n,n-1} - 1), k = n-1;
h_{n+1,k} = 1 + \frac{2\theta_n - 1}{\theta_{n+1}}, k = n;
where \theta_0 = 1, \theta_n = \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right) for n = 1, \ldots, N-1, and \theta_N = \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{N-1}^2}\right).
Analytical convergence bound for these optimized step-size coefficients:
f(x_N) - f(x_*) \le B_3(H_*, R, L, N) = \frac{L \|x_0 - x_*\|_2^2}{(N+1)(N+1+\sqrt{2})}.
Of course the bound is O(1/N^2), but the constant is twice better than Nesterov's.
No numerical SDP needed, so it is feasible for large N.
(History: we first sought a banded / structured lower-triangular form.)
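The theta sequence is all that is needed to generate the optimized coefficients; a small NumPy sketch of the recursion as written above (note the modified final step):

```python
import numpy as np

def ogm_theta(N):
    """theta_0, ..., theta_N: the FGM-style recursion for n < N and a modified final factor at n = N."""
    theta = np.empty(N + 1)
    theta[0] = 1.0
    for n in range(1, N):
        theta[n] = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta[n - 1]**2))
    theta[N] = 0.5 * (1.0 + np.sqrt(1.0 + 8.0 * theta[N - 1]**2))  # final step uses 8 instead of 4
    return theta
```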

22 Optimized gradient method (OGM1)
Donghwan Kim (2014) found an efficient recursive iteration:
Initialize: \theta_0 = 1, z_0 = x_0
z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)    (usual GD update)
\theta_n = \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right) for n = 1, \ldots, N-1; \theta_N = \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{N-1}^2}\right)    (momentum factors)
x_{n+1} = z_{n+1} + \frac{\theta_n - 1}{\theta_{n+1}} (z_{n+1} - z_n) + \frac{\theta_n}{\theta_{n+1}} (z_{n+1} - x_n)    (the last term is the new momentum)
Reverts to Nesterov's FGM1 if the new terms are removed.
Very simple modification of existing Nesterov code.
No need to choose N in advance (or solve an SDP); use your favorite stopping rule, then run one last decreased-momentum step.
Factor-of-2 better upper bound than Nesterov's optimal FGM1. (Proofs omitted.)
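A sketch of OGM1 as stated above: FGM1 plus one extra momentum term, with the modified momentum factor applied on the last iteration only (so N matters only at the final step).

```python
import numpy as np

def ogm1(grad_f, x0, L, N):
    """OGM1: Nesterov FGM1 plus the extra (theta_n / theta_{n+1}) (z_{n+1} - x_n) term."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    theta = 1.0
    for n in range(N):
        z_new = x - grad_f(x) / L                                   # usual GD update
        factor = 4.0 if n < N - 1 else 8.0                          # last iteration uses the modified factor
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + factor * theta**2))
        x = (z_new
             + ((theta - 1.0) / theta_new) * (z_new - z)            # Nesterov momentum
             + (theta / theta_new) * (z_new - x))                   # new OGM momentum term
        z, theta = z_new, theta_new
    return x
```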

23 Numerical Example(s)

24 Machine learning (logistic regression)
To learn weights x of a binary classifier given feature vectors \{v_i\} and labels \{y_i\}, where y_i = \pm 1:
\hat{x} = \arg\min_x f(x),    f(x) = \sum_i \psi(y_i \langle x, v_i \rangle) + \frac{\beta}{2} \|x\|_2^2,
logistic loss: \psi(t) = \log(1 + e^{-t}),    \dot\psi(t) = -\frac{1}{e^t + 1},    \ddot\psi(t) = \frac{e^t}{(e^t + 1)^2}
Gradient: \nabla f(x) = \sum_i y_i v_i \dot\psi(y_i \langle x, v_i \rangle) + \beta x
Hessian is positive definite, so f is strictly convex:
\nabla^2 f(x) = \sum_i v_i v_i' \ddot\psi(y_i \langle x, v_i \rangle) + \beta I \preceq \frac{1}{4} \sum_i v_i v_i' + \beta I, since \ddot\psi(t) \in (0, \tfrac{1}{4}].
L = \frac{1}{4} \rho\left(\sum_i v_i v_i'\right) + \beta \ge \max_x \rho\left(\nabla^2 f(x)\right)
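A sketch of the gradient and Lipschitz constant for the regularized logistic loss above; V stacks the feature vectors v_i as rows and y holds the +/-1 labels (names are illustrative).

```python
import numpy as np

def logistic_grad_and_L(x, V, y, beta):
    """Gradient of f(x) = sum_i psi(y_i <x, v_i>) + (beta/2) ||x||^2 and a Lipschitz constant."""
    t = y * (V @ x)                               # t_i = y_i <x, v_i>
    psi_dot = -1.0 / (np.exp(t) + 1.0)            # psi'(t) = -1 / (e^t + 1)
    grad = V.T @ (y * psi_dot) + beta * x
    L = 0.25 * np.linalg.norm(V.T @ V, 2) + beta  # psi'' <= 1/4, so (1/4) rho(V'V) + beta bounds the Hessian
    return grad, L
```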

25 Numerical Results: logistic regression
[Figure: training data in the (v_1, v_2) feature plane, with the initial decision boundary (red) and the final decision boundary (magenta)]

26 Numerical Results: convergence rates
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov's FGM1, and OGM1]

27 Numerical Results: adaptive restart
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov (restart), and OGM1 (restart)]
Adaptive restart: O'Donoghue & Candès, 2015.
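For reference, one common form of adaptive restart (the function-value rule of O'Donoghue & Candès) wrapped around FGM1; this is a sketch of the general idea, and the talk's experiments may use a different restart criterion.

```python
import numpy as np

def fgm1_function_restart(f, grad_f, x0, L, n_iter):
    """FGM1 with function-value adaptive restart: reset the momentum (t = 1) whenever f increases."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    f_prev = f(x)
    for _ in range(n_iter):
        z_new = x - grad_f(x) / L
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t**2))
        x = z_new + ((t - 1.0) / t_new) * (z_new - z)
        f_cur = f(z_new)
        if f_cur > f_prev:       # objective increased: discard the momentum and restart
            t_new = 1.0
            x = z_new
        z, t, f_prev = z_new, t_new, f_cur
    return x
```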

28 Low-dose 2D X-ray CT image reconstruction simulation
[Figure: RMSD [HU] versus iteration for GD, FGM, and OGM]

29 Summary
New optimized first-order minimization algorithm:
Simple implementation akin to Nesterov's FGM
Analytical convergence rate bound
Bound is 2x better than Nesterov's
Future work:
Constraints
Non-smooth cost functions, e.g., \ell_1
Tighter bounds
Strongly convex case
Asymptotic / local convergence rates
Incremental gradients
Stochastic gradient descent
Adaptive restart
Low-dose 3D X-ray CT image reconstruction

30 Bibliography
[1] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145(1-2):451-482, June 2014.
[2] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. Akad. Nauk. USSR, 269(3):543-547, 1983.
[3] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, May 2005.
[4] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming. Submitted.
[5] D. Böhning and B. G. Lindsay. Monotonicity of quadratic approximation algorithms. Ann. Inst. Stat. Math., 40(4):641-663, December 1988.
[6] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found. Computational Math., 15(3):715-732, June 2015.
