Optimized first-order minimization methods
1 Optimized first-order minimization methods
Donghwan Kim & Jeffrey A. Fessler
EECS Dept., BME Dept., Dept. of Radiology, University of Michigan
web.eecs.umich.edu/~fessler
UM AIM Seminar
2 Disclosure
Research support from GE Healthcare
Research support to GE Global Research
Supported in part by NIH grants R01 HL and P01 CA
Equipment support from Intel Corporation
3 Low-dose X-ray CT image reconstruction
Thin-slice FBP: seconds | ASIR: a bit longer | Statistical: much longer
Image reconstruction as an optimization problem:
    \hat{x} = \arg\min_x \|y - Ax\|_W^2 + R(x)
(Same sinogram, so all at the same dose.)
4 Outline
- Motivation (done)
- Problem definition
- Existing algorithms
  - Gradient descent
  - Nesterov's optimal first-order methods
  - General first-order methods
- Optimizing first-order minimization methods
  - Drori & Teboulle's numerical bounds
  - Donghwan Kim's analytically optimized ("more optimal") first-order methods
- Examples: logistic regression for machine learning; CT image reconstruction
- Summary / Future work
5 Problem setting
6 Optimization problem setting
    \hat{x} \in \arg\min_x f(x)
- Unconstrained
- Large-scale (Hessian too big to store): image reconstruction, big-data / machine learning, ...
Cost function assumptions (throughout):
- f : R^M \to R is convex (need not be strictly convex)
- non-empty set of global minimizers:
    X_* = \{ x_* \in R^M : f(x_*) \le f(x), \ \forall x \in R^M \}
- smooth (differentiable with L-Lipschitz gradient):
    \|\nabla f(x) - \nabla f(z)\|_2 \le L \|x - z\|_2, \ \forall x, z \in R^M
7 Example: Machine learning
To learn weights x of a binary classifier given feature vectors {v_i} and labels {y_i}:
    f(x) = \sum_i \psi(y_i \langle x, v_i \rangle),
where y_i = \pm 1. Loss functions (surrogates):
- 0-1: \psi(t) = I_{\{t \le 0\}}
- exponential: \psi(t) = \exp(-t)
- logistic: \psi(t) = \log(1 + \exp(-t))
- hinge: \psi(t) = \max\{0, 1 - t\}
[Figure: the four surrogate loss functions \psi(t) versus t]
Which of these fit our conditions?
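As a concrete aid, here is a minimal Python sketch (not from the slides) of the four surrogate losses; the comments note which satisfy the convexity and L-Lipschitz-gradient conditions above.

```python
import numpy as np

def zero_one(t):
    return (t <= 0).astype(float)   # not convex (and discontinuous): fails the conditions

def exponential(t):
    return np.exp(-t)               # convex and smooth, but its gradient is not globally Lipschitz

def logistic(t):
    return np.logaddexp(0.0, -t)    # log(1 + e^{-t}): convex with Lipschitz gradient -- fits

def hinge(t):
    return np.maximum(0.0, 1.0 - t) # convex but non-smooth at t = 1: fails smoothness

t = np.linspace(-3.0, 3.0, 7)
print(logistic(t))
```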
8 Algorithms
9 Gradient descent
Gradient descent iteration with step size 1/L ensures monotonic descent of f:
    x_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)
Stacking all N updates:
    \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}
    = \begin{bmatrix} x_0 \\ \vdots \\ x_{N-1} \end{bmatrix}
    - \frac{1}{L} \underbrace{I}_{H_{GD}}
      \begin{bmatrix} \nabla f(x_0) \\ \vdots \\ \nabla f(x_{N-1}) \end{bmatrix}
Note: the N x N coefficient matrix H_GD is diagonal (a special case of lower triangular).
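A minimal Python sketch of this iteration; the function handle grad_f, the quadratic test problem, and the iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_iters):
    """Gradient descent x_{n+1} = x_n - (1/L) grad f(x_n) with fixed step 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad_f(x) / L
    return x

# Example: f(x) = 0.5 ||Ax - b||^2 has gradient A'(Ax - b) and
# Lipschitz constant L = largest eigenvalue of A'A.
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), L, 200)
```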
10 Gradient descent convergence rate
Classic O(1/n) convergence rate of cost function descent:
    \underbrace{f(x_n) - f(x_*)}_{\text{inaccuracy}} \le \frac{L \|x_0 - x_*\|^2}{2n}
Drori & Teboulle (2013) derive the tightest inaccuracy bound:
    f(x_n) - f(x_*) \le \frac{L \|x_0 - x_*\|^2}{4n + 2}
They construct a Huber-like function f for which GD achieves that bound. Case closed for GD.
The O(1/n) rate is undesirably slow.
11 Heavy ball method
Iteration (Polyak, 1987):
    x_{n+1} = x_n - \frac{\alpha}{L} \nabla f(x_n) + \underbrace{\beta (x_n - x_{n-1})}_{\text{momentum!}}
This recursion is used for implementation; expanding it gives the form used for analysis:
    x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} \underbrace{\alpha \beta^{n-k}}_{\text{coefficients}} \nabla f(x_k)
Stacking, the N x N coefficient matrix H_HB is lower triangular:
    H_{HB} = \begin{bmatrix}
        \alpha & & & \\
        \alpha\beta & \alpha & & \\
        \vdots & & \ddots & \\
        \alpha\beta^{N-1} & \alpha\beta^{N-2} & \cdots & \alpha
    \end{bmatrix}
How to choose \alpha and \beta? How to optimize the N x N coefficient matrix H more generally?
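A minimal sketch of the heavy ball recursion, assuming the same grad_f and L as in the gradient descent sketch; the default alpha and beta are illustrative placeholders, since the slide leaves their choice open.

```python
import numpy as np

def heavy_ball(grad_f, x0, L, num_iters, alpha=1.0, beta=0.5):
    """Polyak's x_{n+1} = x_n - (alpha/L) grad f(x_n) + beta (x_n - x_{n-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_iters):
        x_next = x - (alpha / L) * grad_f(x) + beta * (x - x_prev)  # gradient step + momentum
        x_prev, x = x, x_next
    return x
```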
12 General first-order method class
General first-order (FO) iteration:
    x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Stacking:
    \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}
    = \begin{bmatrix} x_0 \\ \vdots \\ x_{N-1} \end{bmatrix}
    - \frac{1}{L} \underbrace{\begin{bmatrix}
        h_{1,0} & & & \\
        h_{2,0} & h_{2,1} & & \\
        \vdots & & \ddots & \\
        h_{N,0} & h_{N,1} & \cdots & h_{N,N-1}
      \end{bmatrix}}_{H_{FO}}
      \begin{bmatrix} \nabla f(x_0) \\ \vdots \\ \nabla f(x_{N-1}) \end{bmatrix}
Primary goals:
- Analyze the convergence rate of FO for any given H
- Optimize the N x N lower-triangular ("causal") step-size coefficient matrix H for:
  - fast convergence
  - efficient recursive implementation
  - universality (designed prior to iterating)
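A direct, memory-hungry rendering of the general FO class, useful for checking small cases; H is any N x N lower-triangular array, with H[n, k] playing the role of h_{n+1,k}. This is a sketch for exposition: practical members of the class use recursions instead of storing all gradients.

```python
import numpy as np

def first_order_method(grad_f, x0, L, H):
    """Run x_{n+1} = x_n - (1/L) sum_{k<=n} h_{n+1,k} grad f(x_k) for N = H.shape[0] steps."""
    x = np.asarray(x0, dtype=float)
    grads = []
    for n in range(H.shape[0]):
        grads.append(grad_f(x))               # O(N) memory: keeps every gradient
        x = x - sum(H[n, k] * grads[k] for k in range(n + 1)) / L
    return x

# Gradient descent is the special case H = I:
# x_hat = first_order_method(grad_f, x0, L, np.eye(N))
```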
13 Not: Barzilai-Borwein gradient method
Barzilai & Borwein, 1988:
    g^{(n)} \triangleq \nabla f(x_n)
    \alpha_n = \frac{\|x_n - x_{n-1}\|_2^2}{\langle x_n - x_{n-1}, \ g^{(n)} - g^{(n-1)} \rangle}
    x_{n+1} = x_n - \alpha_n \nabla f(x_n).
Not in the first-order class FO (the step size \alpha_n depends on the iterates, rather than being designed prior to iterating). Neither are methods like steepest descent (with line search), conjugate gradient, quasi-Newton, ...
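A sketch of the BB iteration (same assumed grad_f and L; the first step is bootstrapped with one plain gradient step). Note how alpha_n is computed from past iterates, which is exactly why BB falls outside the FO class of coefficients fixed before iterating.

```python
import numpy as np

def barzilai_borwein(grad_f, x0, L, num_iters):
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad_f(x_prev)
    x = x_prev - g_prev / L                 # bootstrap with one 1/L gradient step
    for _ in range(num_iters - 1):
        g = grad_f(x)
        s, y = x - x_prev, g - g_prev
        alpha = (s @ s) / (s @ y)           # BB step size computed from the iterates
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x
```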
14 Nesterov's fast gradient method (FGM1)
Nesterov (1983) iteration. Initialize t_0 = 1, z_0 = x_0:
    z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)                    (usual GD update)
    t_{n+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4 t_n^2}\right)     (magic momentum factors)
    x_{n+1} = z_{n+1} + \frac{t_n - 1}{t_{n+1}} (z_{n+1} - z_n)  (update with momentum).
Reverts to GD if t_n = 1 for all n. FGM1 is in class FO:
    x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), \quad
    h_{n+1,k} = \begin{cases}
        \frac{t_n - 1}{t_{n+1}} h_{n,k}, & k = 0, \dots, n-2 \\
        \frac{t_n - 1}{t_{n+1}} (h_{n,n-1} - 1), & k = n-1 \\
        1 + \frac{t_n - 1}{t_{n+1}}, & k = n.
    \end{cases}
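A minimal sketch of FGM1, again assuming grad_f and L as in the earlier sketches:

```python
import numpy as np

def fgm1(grad_f, x0, L, num_iters):
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    for _ in range(num_iters):
        z_next = x - grad_f(x) / L                          # usual GD update
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0    # momentum factor
        x = z_next + ((t - 1.0) / t_next) * (z_next - z)    # update with momentum
        z, t = z_next, t_next
    return x
```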
15 Nesterov FGM1 optimal convergence rate
Shown by Nesterov to be O(1/n^2) for the auxiliary sequence:
    f(z_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+1)^2}.
Nesterov constructed a function f for which any first-order method satisfies
    \frac{3 L \|x_0 - x_*\|_2^2}{32 (n+1)^2} \le f(x_n) - f(x_*).
Thus the O(1/n^2) rate of FGM1 is optimal.
New results (Donghwan Kim, 2014): bound on the convergence rate of the primary sequence {x_n}:
    f(x_n) - f(x_*) \le \frac{2 L \|x_0 - x_*\|_2^2}{(n+2)^2}.
Verifies the (numerically inspired) conjecture of Drori & Teboulle (2013).
16 Overview
General first-order (FO) iteration:
    x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k)
Analyze (i.e., bound) the convergence rate as a function of:
- number of iterations N
- Lipschitz constant L
- step-size coefficients H = {h_{n+1,k}}
- distance to a solution: R = \|x_0 - x_*\|
Then optimize H by minimizing the bound.
17 Ideal universal bound for first-order methods
For a given number of iterations N, Lipschitz constant L, step-size coefficients H = {h_{n+1,k}}, and distance to a solution R = \|x_0 - x_*\|, bound the worst-case convergence rate of the FO algorithm:
    B_1(H, R, L, N) \triangleq \max_{f \in F_L} \; \max_{x_0, \dots, x_N \in R^M} \; \max_{x_* \in X_*(f) : \|x_0 - x_*\| \le R} \; f(x_N) - f(x_*)
such that
    x_{n+1} = x_n - \frac{1}{L} \sum_{k=0}^{n} h_{n+1,k} \nabla f(x_k), \quad n = 0, \dots, N-1.
Clearly, for any FO method:
    f(x_N) - f(x_*) \le B_1(H, R, L, N).
18 Towards practical bounds for first-order methods
For convex functions with L-Lipschitz gradients:
    \frac{1}{2L} \|\nabla f(x) - \nabla f(z)\|^2 \le f(x) - f(z) - \langle \nabla f(z), x - z \rangle, \quad \forall x, z \in R^M.
Drori & Teboulle (2013) use this inequality to propose a more tractable bound:
    B_2(H, R, L, N) \triangleq \max_{g_0, \dots, g_N \in R^M} \; \max_{\delta_0, \dots, \delta_N \in R} \; \max_{x_0, \dots, x_N \in R^M} \; \max_{x_* : \|x_0 - x_*\| \le R} \; L R^2 \delta_N
such that
    x_{n+1} = x_n - R \sum_{k=0}^{n} h_{n+1,k} g_k, \quad n = 0, \dots, N-1,
    \frac{1}{2} \|g_i - g_j\|^2 \le \delta_i - \delta_j - \frac{1}{R} \langle g_j, x_i - x_j \rangle, \quad i, j = 0, \dots, N,
where g_n \triangleq \frac{1}{LR} \nabla f(x_n) and \delta_n \triangleq \frac{1}{L R^2} (f(x_n) - f(x_*)).
For any FO method:
    f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N).
However, even B_2 remains unsolved.
19 Numerical bounds for first-order methods
Drori & Teboulle (2013) further relax the bound, leading eventually to a still simpler optimization problem (with no known closed-form solution):
    f(x_N) - f(x_*) \le B_1(H, R, L, N) \le B_2(H, R, L, N) \le B_3(H, R, L, N).
For given step-size coefficients H and a given number of iterations N, they use a semidefinite program (SDP) to compute B_3 numerically.
They find numerically that for the FGM1 choice of H, the convergence bound B_3 is slightly tighter than 2 L \|x_0 - x_*\|_2^2 / (N+1)^2.
20 Optimizing step-size coefficients numerically
Drori & Teboulle (2013) also compute numerically the minimizer over H of their relaxed bound for given N using a semidefinite program (SDP):
    H_* = \arg\min_H B_3(H, R, L, N).
[Fig. from Drori & Teboulle (2013): numerical solution for H_* for N = 5 iterations]
Drawbacks:
- Must choose N in advance
- Requires O(N) memory for all gradient vectors \{\nabla f(x_n)\}_{n=1}^{N}
- O(N^2) computation for N iterations
Benefit: convergence bound (for specific N) is 2x lower than for Nesterov's FGM1.
21 New analytical solution
Analytical solution H_* for the optimized step-size coefficients (Donghwan Kim, 2014):
    h_{n+1,k} = \begin{cases}
        \frac{\theta_n - 1}{\theta_{n+1}} h_{n,k}, & k = 0, \dots, n-2 \\
        \frac{\theta_n - 1}{\theta_{n+1}} (h_{n,n-1} - 1), & k = n-1 \\
        1 + \frac{2\theta_n - 1}{\theta_{n+1}}, & k = n,
    \end{cases}
where
    \theta_n = \begin{cases}
        1, & n = 0 \\
        \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right), & n = 1, \dots, N-1 \\
        \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{N-1}^2}\right), & n = N.
    \end{cases}
Analytical convergence bound for these optimized step-size coefficients:
    f(x_N) - f(x_*) \le B_3(H_*, R, L, N) = \frac{L \|x_0 - x_*\|_2^2}{(N+1)(N+1+\sqrt{2})}.
Of course the bound is O(1/N^2), but the constant is twice better than Nesterov's. No numerical SDP is needed, so the method is feasible for large N.
(History: first sought banded / structured lower-triangular forms.)
22 Optimized gradient method (OGM1)
Donghwan Kim (2014) found an efficient recursive iteration. Initialize \theta_0 = 1, z_0 = x_0:
    z_{n+1} = x_n - \frac{1}{L} \nabla f(x_n)                    (usual GD update)
    \theta_n = \begin{cases}
        \frac{1}{2}\left(1 + \sqrt{1 + 4\theta_{n-1}^2}\right), & n = 1, \dots, N-1 \\
        \frac{1}{2}\left(1 + \sqrt{1 + 8\theta_{n-1}^2}\right), & n = N
    \end{cases}                                                   (momentum factors)
    x_{n+1} = z_{n+1} + \frac{\theta_n - 1}{\theta_{n+1}} (z_{n+1} - z_n)
              + \underbrace{\frac{\theta_n}{\theta_{n+1}} (z_{n+1} - x_n)}_{\text{new momentum}}.
Reverts to Nesterov's FGM1 if the new term is removed.
- Very simple modification of existing Nesterov code
- No need to choose N in advance (or solve an SDP); use your favorite stopping rule, then run one last decreased-momentum step.
- Factor-of-2 better upper bound than Nesterov's "optimal" FGM1. (Proofs omitted.)
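A minimal sketch of OGM1 (same assumed grad_f and L); compare with the fgm1 sketch above: the changes are the extra momentum term and the modified factor at the final iteration.

```python
import numpy as np

def ogm1(grad_f, x0, L, num_iters):
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    theta = 1.0
    for n in range(num_iters):
        z_next = x - grad_f(x) / L                                  # usual GD update
        if n < num_iters - 1:
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
        else:                                                       # final decreased-momentum step
            theta_next = (1.0 + np.sqrt(1.0 + 8.0 * theta**2)) / 2.0
        x = (z_next
             + ((theta - 1.0) / theta_next) * (z_next - z)          # Nesterov momentum
             + (theta / theta_next) * (z_next - x))                 # new OGM momentum term
        z, theta = z_next, theta_next
    return x
```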
23 Numerical Example(s)
24 Machine learning (logistic regression)
To learn weights x of a binary classifier given feature vectors {v_i} and labels {y_i}, where y_i = \pm 1:
    \hat{x} = \arg\min_x f(x), \quad f(x) = \sum_i \psi(y_i \langle x, v_i \rangle) + \frac{\beta}{2} \|x\|_2^2,
with the logistic loss
    \psi(t) = \log(1 + e^{-t}), \quad \dot\psi(t) = \frac{-1}{e^t + 1}, \quad \ddot\psi(t) = \frac{e^t}{(e^t + 1)^2}.
Gradient:
    \nabla f(x) = \sum_i y_i v_i \dot\psi(y_i \langle x, v_i \rangle) + \beta x.
The Hessian is positive definite, so f is strictly convex:
    \nabla^2 f(x) = \sum_i v_i \ddot\psi(y_i \langle x, v_i \rangle) v_i' + \beta I
                  \preceq \frac{1}{4} \sum_i v_i v_i' + \beta I,
since \ddot\psi(t) \in (0, 1/4]. Hence
    L = \frac{1}{4} \rho\Big(\sum_i v_i v_i'\Big) + \beta \ge \max_x \rho\big(\nabla^2 f(x)\big).
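A sketch of this setup with synthetic training data (the data, the regularizer weight beta, and the seed are illustrative assumptions, not the slides' experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((100, 2))            # rows are feature vectors v_i
y = np.sign(V @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(100))
beta = 0.1                                    # regularizer weight (illustrative)

def f(x):
    t = y * (V @ x)                           # margins y_i <x, v_i>
    return np.sum(np.logaddexp(0.0, -t)) + 0.5 * beta * (x @ x)

def grad_f(x):
    t = y * (V @ x)
    psi_dot = -1.0 / (np.exp(t) + 1.0)        # derivative of log(1 + e^{-t})
    return V.T @ (y * psi_dot) + beta * x

# Lipschitz constant from the slide: L = (1/4) rho(sum_i v_i v_i') + beta
L = 0.25 * np.linalg.eigvalsh(V.T @ V).max() + beta

# e.g., run any of the sketches above: x_hat = ogm1(grad_f, np.zeros(2), L, 100)
```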
25 Numerical Results: logistic regression
[Figure: training data with initial decision boundary (red) and final decision boundary (magenta); axes v_1, v_2]
26 Numerical Results: convergence rates
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov, and OGM1]
27 Numerical Results: adaptive restart
[Figure: \|x_n - x_*\| versus iteration for GD, Nesterov (restart), and OGM1 (restart)]
(O'Donoghue & Candès, 2015)
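A plausible rendering (not the authors' code) of O'Donoghue & Candès' function-value restart scheme wrapped around the fgm1 sketch: whenever the cost increases, the momentum factor is reset to 1.

```python
import numpy as np

def fgm1_restart(f, grad_f, x0, L, num_iters):
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    t = 1.0
    f_prev = f(x)
    for _ in range(num_iters):
        z_next = x - grad_f(x) / L
        f_cur = f(z_next)
        if f_cur > f_prev:
            t = 1.0                                      # restart: discard accumulated momentum
        f_prev = f_cur
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        x = z_next + ((t - 1.0) / t_next) * (z_next - z)
        z, t = z_next, t_next
    return x
```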
28 Low-dose 2D X-ray CT image reconstruction simulation
[Figure: RMSD [HU] versus iteration for GD, FGM, and OGM]
29 Summary
New optimized first-order minimization algorithm:
- Simple implementation akin to Nesterov's FGM
- Analytical convergence rate bound
- Bound is 2x better than Nesterov's
Future work:
- Constraints
- Non-smooth cost functions, e.g., l_1
- Tighter bounds
- Strongly convex case
- Asymptotic / local convergence rates
- Incremental gradients
- Stochastic gradient descent
- Adaptive restart
- Low-dose 3D X-ray CT image reconstruction
30 Bibliography
[1] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145(1-2):451-482, June 2014.
[2] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. Akad. Nauk. USSR, 269(3):543-547, 1983.
[3] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, May 2005.
[4] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, submitted.
[5] D. Böhning and B. G. Lindsay. Monotonicity of quadratic approximation algorithms. Ann. Inst. Stat. Math., 40(4):641-663, December 1988.
[6] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found. Computational Math., 15(3):715-732, June 2015.