
Lecture Outline

- Gradient projection algorithm
- Constant step length, varying step length, diminishing step length
- Complexity issues
- Gradient projection with exploration
- Projection
- Solving QPs: active-set method and ADMM
- Approximating the constraint set

You should be able to...

- Recognise and formulate a gradient projection algorithm;
- Select the step length;
- Extend the idea to exploratory moves;
- Understand the complexity of the method;
- Compute and approximate projections.

Gradient Projection

We consider

    $\min f(x)$, s.t. $x \in X$,

where $f$ is continuously differentiable and $X$ is closed and convex. We have the following iterative method:

    $x_{k+1} = P_X(x_k - \alpha_k \nabla f(x_k))$,

where $\alpha_k > 0$ is the step length. Define the projection arc for $\alpha > 0$:

    $x_k(\alpha) = P_X(x_k - \alpha \nabla f(x_k))$.

[Figure: $x_k$ is moved along $-\nabla f(x_k)$ and projected back onto $X$; sweeping $\alpha$ traces the projection arc $x_k(\alpha)$.]
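
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the update $x_{k+1} = P_X(x_k - \alpha \nabla f(x_k))$, assuming a simple quadratic objective and a box constraint set, for which the projection is entrywise clipping:

```python
import numpy as np

def gradient_projection(grad, project, x0, alpha, iters=500):
    """Iterate x_{k+1} = P_X(x_k - alpha * grad(x_k))."""
    x = x0
    for _ in range(iters):
        x = project(x - alpha * grad(x))
    return x

# Example: minimise f(x) = 0.5 * ||x - c||^2 over the box [0, 1]^2 (here L = 1).
c = np.array([2.0, -0.5])
grad = lambda x: x - c                     # gradient of f
project = lambda x: np.clip(x, 0.0, 1.0)  # P_X for X = [0, 1]^2
print(gradient_projection(grad, project, np.zeros(2), alpha=0.5))  # -> [1. 0.]
```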

Gradient Projection

The projection arc is the set of all possible next iterates, parameterised by $\alpha$. Next we show that unless $x_k(\alpha) = x_k$ (which is a condition for optimality of $x_k$), the vector $x_k(\alpha) - x_k$ is a feasible descent direction. We first need an important result.

Theorem (Projection Theorem): Let $X$ be a nonempty closed convex subset of $R^n$. There is a unique vector $x^*$ that minimises $\|z - x\|$ over $x \in X$, called the projection of $z$ on $X$. Furthermore, $x^*$ is the projection of $z$ on $X$ iff

    $(z - x^*)^T (x - x^*) \le 0$, $\forall x \in X$.

Gradient Projection

Theorem (Descent Properties of Gradient Projection):
(i) If $x_k(\alpha) \ne x_k$, then $x_k(\alpha) - x_k$ is a feasible descent direction; in particular,

    $\nabla f(x_k)^T (x_k(\alpha) - x_k) \le -\frac{1}{\alpha} \|x_k(\alpha) - x_k\|^2$, $\forall \alpha > 0$.

(ii) If $x_k(\alpha) = x_k$ for some $\alpha > 0$, then $x_k$ satisfies the necessary condition for minimising $f(x)$ over $X$, i.e.

    $\nabla f(x_k)^T (x - x_k) \ge 0$, $\forall x \in X$.

Proof: From the Projection Theorem, with $z = x_k - \alpha \nabla f(x_k)$,

    $(x_k - \alpha \nabla f(x_k) - x_k(\alpha))^T (x - x_k(\alpha)) \le 0$, $\forall x \in X$.

Setting $x = x_k$ and rearranging yields (i). If $x_k(\alpha) = x_k$ for some $\alpha > 0$, the same inequality yields (ii).

Gradient Projection

An (often) important assumption for constant step-size convergence is Lipschitz continuity of the gradient:

    $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$, $\forall x, y \in X$.

This condition yields the following important inequality (the descent lemma), where $\ell(y; x) := f(x) + \nabla f(x)^T (y - x)$:

    $f(y) \le \ell(y; x) + \frac{L}{2} \|y - x\|^2$, $\forall x, y \in X$.

Proof:

    $f(y) - f(x) = \int_0^1 (y - x)^T \nabla f(x + t(y - x)) \, dt$
    $= \int_0^1 (y - x)^T \nabla f(x) \, dt + \int_0^1 (y - x)^T (\nabla f(x + t(y - x)) - \nabla f(x)) \, dt$
    $\le (y - x)^T \nabla f(x) + \|y - x\| \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\| \, dt$
    $\le (y - x)^T \nabla f(x) + \|y - x\| \int_0^1 L t \|y - x\| \, dt = (y - x)^T \nabla f(x) + \frac{L}{2} \|y - x\|^2$.
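
To make the lemma concrete, the following small check (an illustration added here, not from the slides) verifies the quadratic upper bound numerically for $f(x) = \frac{1}{2} x^T Q x$, for which $L = \lambda_{\max}(Q)$:

```python
import numpy as np

# Verify f(y) <= f(x) + grad(x)^T (y - x) + (L/2)||y - x||^2 on random pairs.
rng = np.random.default_rng(0)
Q = np.diag([10.0, 1.0])
L = 10.0                                   # L = lambda_max(Q)
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    bound = f(x) + grad(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    assert f(y) <= bound + 1e-12
print("descent lemma verified on 1000 random pairs")
```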

Gradient Projection: Constant Step Length

Theorem (Constant Step Length Convergence): Assume the gradient is Lipschitz continuous and $\alpha_k = \alpha$ with $\alpha \in (0, 2/L)$. Then every limit point $\bar{x}$ of the generated sequence $\{x_k\}$ satisfies the necessary optimality condition $\nabla f(\bar{x})^T (x - \bar{x}) \ge 0$, $\forall x \in X$.

Proof sketch: the descent lemma with $y = x_{k+1}$ and $x = x_k$ gives

    $f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2$.

Moreover, by the descent property, $\nabla f(x_k)^T (x_{k+1} - x_k) \le -\frac{1}{\alpha} \|x_{k+1} - x_k\|^2$. Thus,

    $f(x_{k+1}) \le f(x_k) - \left( \frac{1}{\alpha} - \frac{L}{2} \right) \|x_{k+1} - x_k\|^2$.

Since $\alpha \in (0, 2/L)$, the coefficient is positive and the cost function is reduced.

Gradient Projection: Varying Step Length

Thus, for any limit point $\bar{x}$ of a subsequence $\{x_k\}_{k \in K}$, $f(x_k) \to f(\bar{x})$ and consequently $\|x_{k+1} - x_k\| \to 0$. Consequently,

    $\|P_X(\bar{x} - \alpha \nabla f(\bar{x})) - \bar{x}\| = \lim_{k \to \infty,\, k \in K} \|x_{k+1} - x_k\| = 0$.

This (by the earlier descent result) implies that $\bar{x}$ satisfies the necessary optimality condition.

Theorem (Convergence for Convex Cost Function): Let $f$ be convex, and let the step lengths $\alpha_k \ge \bar{\alpha} > 0$ be selected via any rule such that, for all $k$,

    $f(x_{k+1}) \le \ell(x_{k+1}; x_k) + \frac{1}{2 \alpha_k} \|x_{k+1} - x_k\|^2$.

Then $\{x_k\}$ converges to a minimiser $x^*$ and

    $f(x_k) - f^* \le \frac{\min_{x^* \in X^*} \|x_0 - x^*\|^2}{2 k \bar{\alpha}}$, $k \ge 1$.

Gradient Projection: Constant Step Length and Strong Convexity

- The Lipschitz condition only needs to hold on the level set $\mathcal{L} = \{x \in X \mid f(x) \le f(x_0)\}$, which depends on the initial condition.
- The error converges to 0 with order $O(1/k)$. The convergence is linear when $f$ is strongly convex.

Theorem (Convergence for Strongly Convex Cost Function): Let $\alpha \in (0, 2/L)$ and let $f$ be strongly convex with modulus $\sigma$. Then

    $\|x_{k+1} - x^*\| \le \max(|1 - \alpha L|, |1 - \alpha \sigma|) \, \|x_k - x^*\|$.

The bound is minimised if $\alpha = 2/(\sigma + L)$.¹ $L/\sigma$ is the condition number of the problem ($L \ge \sigma$).

¹ I have spent some part of my research finding such optimum step lengths for different optimisation problems.
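
The contraction factor in the theorem is easy to inspect numerically. A small sketch (with illustrative values $\sigma = 1$, $L = 10$) showing that $\alpha = 2/(\sigma + L)$ equalises the two terms and gives the best rate $(L - \sigma)/(L + \sigma)$:

```python
# Contraction factor max(|1 - alpha*L|, |1 - alpha*sigma|) for several step lengths.
sigma, L = 1.0, 10.0
rate = lambda a: max(abs(1 - a * L), abs(1 - a * sigma))
for a in (1.0 / L, 2.0 / (sigma + L), 1.5 / L):
    print(f"alpha = {a:.4f}: contraction factor = {rate(a):.4f}")
# The middle choice, alpha = 2/(sigma + L), gives (L - sigma)/(L + sigma) = 9/11.
```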

Gradient Projection: Diminishing Step Length

Consider a diminishing step size:

    $\lim_{k \to \infty} \alpha_k = 0$, $\quad \sum_{k=0}^{\infty} \alpha_k = \infty$, $\quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty$.

If there is a scalar $\gamma$ such that

    $\|\nabla f(x_k)\|^2 \le \gamma \left( 1 + \min_{x^* \in X^*} \|x_k - x^*\|^2 \right)$, $\forall k \ge 0$,

then the gradient projection method converges even without Lipschitz continuity of the gradient. For example, with $f(x) = |x|^{3/2}$ (whose gradient is not Lipschitz at 0), gradient projection converges to 0 with a diminishing step size, but not with a constant step size. The convergence rate is sublinear.
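
The $f(x) = |x|^{3/2}$ example can be reproduced directly; this sketch (not from the slides) compares a constant step, which gets trapped in a two-point oscillation away from 0, with a diminishing step, which drives the iterate towards 0:

```python
import numpy as np

# f(x) = |x|^{3/2}; grad f(x) = 1.5 * sign(x) * sqrt(|x|) is not Lipschitz at 0.
grad = lambda x: 1.5 * np.sign(x) * np.sqrt(abs(x))

def run(step, x0=1.0, iters=5000):
    x = x0
    for k in range(iters):
        x = x - step(k) * grad(x)      # X = R, so the projection is the identity
    return x

print(run(lambda k: 0.5))              # constant step: stalls near +/- 0.14
print(run(lambda k: 1.0 / (k + 1)))    # diminishing step: tends to 0
```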

Gradient Projection: Step Length via an Armijo-like Rule

Algorithm: Gradient Projection via an Armijo-like rule

given $\beta \in (0, 1)$, $\sigma \in (0, 1)$, $\alpha > 0$, $x_0$; $k \leftarrow 0$
while $\|x_{k+1} - x_k\| > \tau$ do        (or any other termination condition)
    $d_k \leftarrow P_X(x_k - \alpha \nabla f(x_k)) - x_k$
    $m_k \leftarrow 0$
    while $f(x_k) - f(x_k + \beta^{m_k} d_k) < -\sigma \beta^{m_k} \nabla f(x_k)^T d_k$ do
        $m_k \leftarrow m_k + 1$
    end while
    $x_{k+1} \leftarrow x_k + \beta^{m_k} d_k$
    $k \leftarrow k + 1$
end while

At each step, we search along the line $\{x_k + \gamma d_k \mid \gamma > 0\}$ by checking step sizes $\gamma = 1, \beta, \beta^2, \dots$ until sufficient decrease is obtained; the parameter $\sigma \in (0, 1)$ sets how much decrease counts as sufficient. For convex $f$ this algorithm converges to a solution without the gradient Lipschitz condition.
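
A direct Python transcription of the algorithm, with illustrative defaults for alpha, beta, and sigma (the problem data below reuses the earlier box example and is purely for demonstration):

```python
import numpy as np

def gp_armijo(f, grad, project, x0, alpha=1.0, beta=0.5, sigma=0.1,
              tol=1e-8, iters=1000):
    """Gradient projection with Armijo-like backtracking along d_k."""
    x = x0
    for _ in range(iters):
        d = project(x - alpha * grad(x)) - x      # feasible descent direction
        if np.linalg.norm(d) <= tol:              # x_k(alpha) = x_k: (near-)optimal
            break
        t, slope = 1.0, grad(x) @ d               # slope <= 0 by the descent property
        while f(x) - f(x + t * d) < -sigma * t * slope:
            t *= beta                             # try gamma = 1, beta, beta^2, ...
        x = x + t * d                             # convex combination stays feasible
    return x

# Minimise 0.5 * ||x - c||^2 over the box [0, 1]^2.
c = np.array([2.0, -0.5])
print(gp_armijo(lambda x: 0.5 * np.sum((x - c) ** 2), lambda x: x - c,
                lambda x: np.clip(x, 0.0, 1.0), np.zeros(2)))  # -> [1. 0.]
```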

Some Complexity Discussions

How many iterations are required to achieve a solution whose cost is within $\epsilon > 0$ of the optimum?

- A method has iteration complexity $O(1/\epsilon^p)$ if we can show, for some $M, p > 0$,

    $\min_{k \le M/\epsilon^p} f(x_k) \le f^* + \epsilon$.

- A method has cost function error of order $O(1/k^q)$ if we can show, for some $M, q > 0$,

    $\min_{j \le k} f(x_j) \le f^* + \frac{M}{k^q}$.

If $M$ does not depend on $n$, the method is well suited to large problems (gradient vs. Newton). For gradient methods, $k = O(1/\epsilon)$ iterations are required, i.e. the error is of order $O(1/k)$. However, these bounds do not take advantage of any special structure of the problem.

Gradient Projection With Exploration: Heavy Ball

    $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1})$,

with step length $\alpha > 0$ and momentum parameter $\beta \in [0, 1)$.
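
A minimal unconstrained heavy-ball sketch (not from the slides), using the classical Polyak tuning for a strongly convex quadratic as an assumed, illustrative choice of $\alpha$ and $\beta$:

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters=300):
    """x_{k+1} = x_k - alpha * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0, x0
    for _ in range(iters):
        x_prev, x = x, x - alpha * grad(x) + beta * (x - x_prev)
    return x

# f(x) = 0.5 x^T Q x with sigma = 1, L = 10; classical Polyak tuning for quadratics.
Q = np.diag([10.0, 1.0])
sigma, L = 1.0, 10.0
alpha = 4.0 / (np.sqrt(L) + np.sqrt(sigma)) ** 2
beta = ((np.sqrt(L / sigma) - 1) / (np.sqrt(L / sigma) + 1)) ** 2
print(heavy_ball(lambda x: Q @ x, np.array([1.0, 1.0]), alpha, beta))  # -> ~[0, 0]
```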

Gradient Projection With Exploration: Optimal Iteration Complexity

Heavy ball takes advantage of memory ($x_{k-1}$) to improve performance; adding more memory is not necessarily useful. Assume $f(x)$ is convex and has a Lipschitz continuous gradient. The iterations become (with $x_{-1} = x_0$ and $\beta_k \in (0, 1)$):

    $y_k = x_k + \beta_k (x_k - x_{k-1})$,      (exploration step)
    $x_{k+1} = P_X(y_k - \alpha \nabla f(y_k))$,      (gradient projection step)

where

    $\beta_k = \frac{\theta_k (1 - \theta_{k-1})}{\theta_{k-1}}$,

and $\{\theta_k\}$ is such that $\theta_0 = \theta_1 \in (0, 1]$ and

    $\frac{1 - \theta_{k+1}}{\theta_{k+1}^2} \le \frac{1}{\theta_k^2}$.

Gradient Projection With Exploration: Optimal Iteration Complexity

Example: $\beta_k$ and $\theta_k$:

    $\beta_k = \begin{cases} 0, & k = 0 \\ \frac{k-1}{k+2}, & k \ge 1 \end{cases}$, $\qquad \theta_k = \frac{2}{k+2}$, $k \ge 0$.

Theorem: Let $\alpha = 1/L$ and let $\beta_k$ be chosen as above. Then $\lim_{k \to \infty} \|x_k - x^*\| = 0$ and

    $f(x_k) - f^* \le \frac{2L}{(k+1)^2} \|x_0 - x^*\|^2$.

The error is $O(1/k^2)$, and equivalently the iteration complexity is $O(1/\sqrt{\epsilon})$.
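
A Python sketch of the accelerated scheme with the example $\beta_k$ above ($\alpha = 1/L$; the quadratic and box below are illustrative data, not from the slides):

```python
import numpy as np

def accelerated_gp(grad, project, x0, L, iters=500):
    """Exploration step y_k, then gradient projection step with alpha = 1/L."""
    x_prev, x = x0, x0                           # convention x_{-1} = x_0
    for k in range(iters):
        beta = 0.0 if k == 0 else (k - 1) / (k + 2)
        y = x + beta * (x - x_prev)              # exploration step
        x_prev, x = x, project(y - grad(y) / L)  # gradient projection step
    return x

# Badly conditioned quadratic f(x) = 0.5 x^T Q x - c^T x over the box [0, 1]^2.
Q, c = np.diag([100.0, 1.0]), np.array([10.0, -1.0])
grad = lambda x: Q @ x - c
project = lambda x: np.clip(x, 0.0, 1.0)
print(accelerated_gp(grad, project, np.zeros(2), L=100.0))  # -> ~[0.1, 0.]
```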

Gradient Projection: Projection,... what projection?

Recall $P_X(x) = \xi$, where

    $\xi \in \arg\min_z \|x - z\|$, s.t. $z \in X$.

So, do we need to solve another optimisation problem at each iteration of an optimisation problem?!

Example: Box Constraints

A simple box constraint: $X = \{x \mid l \le x \le u\}$, where $u$ and $l$ are respectively the upper and lower bounds on the entries of $x$. The projection is computed entrywise:

    $\xi_i = \begin{cases} l_i, & x_i < l_i \\ x_i, & l_i \le x_i \le u_i \\ u_i, & u_i < x_i \end{cases}$
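
In code, the entrywise formula is exactly a clip (a trivial sketch for completeness):

```python
import numpy as np

def project_box(x, l, u):
    """Entrywise projection onto the box {x : l <= x <= u}."""
    return np.minimum(np.maximum(x, l), u)  # equivalently np.clip(x, l, u)

print(project_box(np.array([-1.0, 0.5, 3.0]), 0.0, 1.0))  # -> [0.  0.5 1. ]
```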

Gradient Projection: What Projection?

Example: Affine Set $Ax = b$

$P_X(x) = \xi$, where $\xi = \arg\min_z \|x - z\|$ s.t. $z \in X = \{x \mid Ax = b\}$. Assuming $A$ has full row rank,

    $\xi = (I - A^T (A A^T)^{-1} A) x + A^T (A A^T)^{-1} b$.

Example: Constraint Set Defined by Inequalities

$P_X(x) = \xi$, where $\xi \in \arg\min_z \|x - z\|$ s.t. $z \in X = \{x \mid c_i(x) \ge 0, i \in I\}$, with each $c_i$ concave and $I$ the index set of the inequality constraints. This in effect means solving a quadratic problem with constraints.
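
A short sketch of the affine-set formula (assuming, as above, that $A$ has full row rank); rather than forming $(A A^T)^{-1}$ explicitly, it solves the corresponding linear system:

```python
import numpy as np

def project_affine(x, A, b):
    """Projection onto {z : A z = b}: xi = x - A^T (A A^T)^{-1} (A x - b)."""
    y = np.linalg.solve(A @ A.T, A @ x - b)
    return x - A.T @ y

A = np.array([[1.0, 1.0, 1.0]])           # the plane x1 + x2 + x3 = 1
b = np.array([1.0])
print(project_affine(np.zeros(3), A, b))  # -> [1/3, 1/3, 1/3]
```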

Solving Quadratic Programs

We will consider the case where the constraints $c_i(x)$ are linear. Two different approaches to solving QPs will be considered:

- the primal active-set method;
- the alternating direction method of multipliers (ADMM).

In primal active-set methods, some of the inequality constraints (and all the equalities, if any) are imposed as equalities. This subset is referred to as the working set, $W_k$. It is required that the constraints in the working set be linearly independent; let us assume all constraints are linearly independent. The problem is

    $\min \frac{1}{2} x^T Q x + q^T x$, s.t. $a_i^T x \ge b_i$, $i \in I$.

Algorithm: Active-Set Method for Convex QP

Choose a feasible $x_0$ and $W_0 \leftarrow A(x_0)$        (the constraints active at $x_0$)
for $k = 0, 1, \dots$ do
    $p_k \leftarrow \arg\min_p \frac{1}{2} p^T Q p + (Q x_k + q)^T p$, s.t. $a_i^T p = 0$, $i \in W_k$
    if $p_k = 0$ then
        find $\hat{\lambda}_i$ solving $\sum_{i \in W_k} a_i \hat{\lambda}_i = Q x_k + q$
        if $\hat{\lambda}_i \ge 0$, $\forall i \in W_k \cap I$ then
            $x^* \leftarrow x_k$; stop
        else
            $j \leftarrow \arg\min_{j \in W_k \cap I} \hat{\lambda}_j$; $x_{k+1} \leftarrow x_k$; $W_{k+1} \leftarrow W_k \setminus \{j\}$
        end if
    else
        $B \leftarrow \arg\min_{i \notin W_k,\, a_i^T p_k < 0} (b_i - a_i^T x_k)/(a_i^T p_k)$
        $\alpha_k \leftarrow \min\left(1, (b_j - a_j^T x_k)/(a_j^T p_k)\right)$, $j \in B$
        $x_{k+1} \leftarrow x_k + \alpha_k p_k$
        if $\alpha_k < 1$ then $W_{k+1} \leftarrow W_k \cup \{j\}$, $j \in B$, else $W_{k+1} \leftarrow W_k$
    end if
end for

The method converges in finitely many steps for $Q \succ 0$; the projection problem is the special case $Q = I$, with $q = -x$ for the point $x$ being projected.
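
Each iteration's direction $p_k$ is itself an equality-constrained QP, which can be solved through its KKT system. A minimal sketch (sign conventions for the multipliers vary; this is an illustration, not the slides' code):

```python
import numpy as np

def eqp_step(Q, g, A_w):
    """Solve min_p 0.5 p^T Q p + g^T p  s.t.  A_w p = 0 via the KKT system."""
    n, m = len(g), A_w.shape[0]
    K = np.block([[Q, A_w.T], [A_w, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]  # step p_k and working-set multipliers (up to sign)

Q = np.eye(2)
g = np.array([1.0, -2.0])    # g = Q x_k + q at the current iterate
A_w = np.array([[1.0, 0.0]]) # working set: first coordinate held fixed
p, lam = eqp_step(Q, g, A_w)
print(p, lam)                # p -> [0., 2.]
```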

Alternating Direction Method of Multipliers

Consider the following problem:

    $\min_{x,z} f_1(x) + f_2(z)$, s.t. $Ax = z$.

Define the augmented Lagrangian

    $L_A(x, z, \lambda; \mu) = f_1(x) + f_2(z) + \lambda^T (Ax - z) + \frac{\mu}{2} \|Ax - z\|_2^2$.

The iterations become:

    $x_{k+1} \in \arg\min_x L_A(x, z_k, \lambda_k; \mu)$
    $z_{k+1} \in \arg\min_z L_A(x_{k+1}, z, \lambda_k; \mu)$
    $\lambda_{k+1} = \lambda_k + \mu (A x_{k+1} - z_{k+1})$

Each of the two inner minimisations involves only one block of variables. The method converges for convex $f_1$ and $f_2$ for any $\mu > 0$. It is related to augmented Lagrangian methods and to Douglas-Rachford splitting.

Gradient Projection: ADMM

Algorithm: ADMM for QP with linear inequality constraints²

Writing the QP $\min \frac{1}{2} x^T Q x + q^T x$, s.t. $Ax \ge b$, in ADMM form with a slack variable $z \ge 0$ and the constraint $Ax - b = z$, the iterations are:

Choose $x_0$, $z_0$, $u_0$, and $\mu > 0$        ($u = \lambda/\mu$ is the scaled Lagrange multiplier)
while a termination condition is not satisfied do
    $x_{k+1} \leftarrow -(Q + \mu A^T A)^{-1} \left[ q - \mu A^T (z_k - u_k + b) \right]$
    $z_{k+1} \leftarrow \max\{0, A x_{k+1} - b + u_k\}$
    $u_{k+1} \leftarrow u_k + A x_{k+1} - b - z_{k+1}$
end while

The algorithm converges R-linearly to the solution for $Q \succ 0$ (for projection: $Q = I$, with $q = -x$ for the point $x$ being projected). The optimum step length is

    $\mu^* = \left( \lambda_{\max}(A Q^{-1} A^T) \, \lambda_{\min}(A Q^{-1} A^T) \right)^{-1/2}$.

² For more detail see: Ghadimi, E., Teixeira, A., Shames, I. and Johansson, M., 2015. Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems. IEEE Transactions on Automatic Control, 60(3), pp. 644-658.
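
A numpy sketch of the loop above (illustrative data, assuming the slack formulation just described; the last lines recover a projection by setting $Q = I$, $q = -p$ for a point $p$):

```python
import numpy as np

def admm_qp(Q, q, A, b, mu, iters=2000):
    """ADMM for min 0.5 x^T Q x + q^T x s.t. A x >= b (slack form Ax - b = z, z >= 0)."""
    n, m = Q.shape[0], A.shape[0]
    x, z, u = np.zeros(n), np.zeros(m), np.zeros(m)
    M = np.linalg.inv(Q + mu * A.T @ A)          # factor once, reuse every iteration
    for _ in range(iters):
        x = -M @ (q - mu * A.T @ (z - u + b))
        z = np.maximum(0.0, A @ x - b + u)
        u = u + A @ x - b - z
    return x

# Projection of the point p onto {x : x1 + x2 >= 1} as a QP: Q = I, q = -p.
p = np.array([0.0, 0.0])
A, b = np.array([[1.0, 1.0]]), np.array([1.0])
print(admm_qp(np.eye(2), -p, A, b, mu=1.0))      # -> ~[0.5, 0.5]
```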

Projection: Approximating X

We know projection onto boxes is easy. So why not approximate the constraint set with a box? A box is an $\infty$-norm ball centred at $x_c$ and with radius $R$:

    $B(x_c, R) = \{x \mid \|x - x_c\|_\infty \le R\}$.

Let $X = \{x \mid a_i^T x \le b_i, i \in I\}$. The goal is to find the largest ball (square box) contained in $X$; its centre $x_c$ is called the Chebyshev centre. The problem of finding the largest ball is

    $\max_{x_c, R} R$, s.t. $c_i(x_c, R) \le 0$, $i \in I$; $R \ge 0$,

where

    $c_i(x_c, R) = \sup_{\|u\|_\infty \le 1} a_i^T (x_c + R u) - b_i = a_i^T x_c + R \left( \sup_{\|u\|_\infty \le 1} a_i^T u \right) - b_i = a_i^T x_c + R \|a_i\|_1 - b_i$,

since the dual of the $\infty$-norm is the 1-norm ($\|a_i\|_* = \|a_i\|_1$). This is a linear program in $(x_c, R)$.
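
Since the Chebyshev-centre problem is a linear program in $(x_c, R)$, it can be handed to any LP solver. A sketch using scipy.optimize.linprog (the region and function name are illustrative, not from the slides):

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_box(A, b):
    """Largest inf-norm ball B(x_c, R) inside {x : A x <= b}: an LP in (x_c, R)."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # maximise R, i.e. minimise -R
    # One row per constraint: a_i^T x_c + R * ||a_i||_1 <= b_i.
    A_ub = np.hstack([A, np.linalg.norm(A, ord=1, axis=1, keepdims=True)])
    bounds = [(None, None)] * n + [(0, None)]      # x_c free, R >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=bounds)
    return res.x[:n], res.x[n]

# Triangle x >= 0, x1 + x2 <= 1: largest box has centre (0.25, 0.25), R = 0.25.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
xc, R = chebyshev_box(A, b)
print(xc, R)
```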