Unconstrained Optimization


1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015


5 / 36
1. Preliminaries
   1.1 Local approximation
   1.2 Optimality conditions
   1.3 Convexity
2. Gradient descent
3. Newton's method
4. Stabilization
5. Trust regions

6 / 36 Taylor series. Assuming the function $f(x)$ has derivatives of every order, the Taylor series expansion of $f$ about the point $x_0$ is
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n,$$
where $f^{(n)}(x_0)$ is the $n$th-order derivative at $x_0$.

7 / 36 Taylor's theorem. Let $N \ge 1$ be an integer and let the function $f(x)$ be $N$ times differentiable at the point $x_0$. Then
$$f(x) = f(x_0) + \sum_{n=1}^{N} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n + o\left(|x - x_0|^N\right).$$
The remainder notation $o(|x - x_0|^N)$ means that the remainder is small compared to $|x - x_0|^N$. Formally,
$$\lim_{x \to x_0} \frac{o(|x - x_0|^N)}{|x - x_0|^N} = 0.$$
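The remainder behavior can be checked numerically. A minimal Python sketch, assuming $f(x) = e^x$ (whose derivatives of every order are again $e^x$) and $N = 2$; `taylor_remainder` is an illustrative helper, not a library function:

```python
import math

def taylor_remainder(f, derivs, x0, x, N):
    """Remainder of the N-term Taylor expansion of f about x0."""
    approx = sum(derivs[n](x0) / math.factorial(n) * (x - x0) ** n
                 for n in range(N + 1))
    return f(x) - approx

# f(x) = e^x: every derivative is e^x, so one list of three entries suffices
derivs = [math.exp] * 3          # f, f', f''
for h in [1e-1, 1e-2, 1e-3]:
    r = taylor_remainder(math.exp, derivs, 0.0, h, 2)
    print(h, r / h**2)           # the ratio shrinks: the remainder is o(h^2)
```

The printed ratios shrink roughly linearly in $h$, consistent with a remainder of order $h^3 = o(h^2)$.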

8 / 36 Local approximation. Approximations in $\mathbb{R}$:
(linear) $f(x) \approx f(x_0) + \frac{df(x_0)}{dx}(x - x_0)$;
(quadratic) $f(x) \approx f(x_0) + \frac{df(x_0)}{dx}(x - x_0) + \frac{1}{2}\frac{d^2 f(x_0)}{dx^2}(x - x_0)^2$.
Approximations in $\mathbb{R}^n$:
(linear) $f(x) \approx f(x_0) + \sum_{i=1}^{n} \frac{\partial f(x_0)}{\partial x_i}(x_i - x_{i0})$;
(quadratic) $f(x) \approx f(x_0) + \sum_{i=1}^{n} \frac{\partial f(x_0)}{\partial x_i}(x_i - x_{i0}) + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \frac{\partial^2 f(x_0)}{\partial x_i \partial x_j}(x_i - x_{i0})(x_j - x_{j0})$.

9 / 36 Local approximation. The vector form of the approximations:
linear: $f(x) \approx f(x_0) + g_{x_0}^T (x - x_0)$;
quadratic: $f(x) \approx f(x_0) + g_{x_0}^T (x - x_0) + \frac{1}{2}(x - x_0)^T H_{x_0} (x - x_0)$.
Here $g_{x_0}$ and $H_{x_0}$ are the gradient and Hessian matrix of $f(x)$ at $x_0$; $H$ is symmetric. We call $\Delta f = f(x) - f(x_0)$ the function perturbation and $\Delta x = x - x_0$ the perturbation vector (at $x_0$).
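The vector-form quadratic approximation can be verified numerically. A sketch assuming a particular smooth test function with hand-coded gradient and Hessian:

```python
import numpy as np

def f(x):
    return x[0]**4 - 2 * x[0]**2 * x[1] + x[1]**2   # a smooth test function

def grad(x):
    return np.array([4 * x[0]**3 - 4 * x[0] * x[1], -2 * x[0]**2 + 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2 - 4 * x[1], -4 * x[0]],
                     [-4 * x[0],               2.0]])

x0 = np.array([1.0, 2.0])

def quad_model(x):
    """Quadratic model f(x0) + g^T dx + (1/2) dx^T H dx about x0."""
    dx = x - x0
    return f(x0) + grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx

x = x0 + np.array([0.1, -0.05])
print(f(x), quad_model(x))   # close but not identical: the model drops o(||dx||^2) terms
```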

10 / 36 Local approximation. Exercise 1: Find the second-order approximation for $f(x) = (3 - x_1)^2 + (4 - x_2)^2$. How many local minima do we have? What is special about the Hessian?
Exercise 2: Find the quadratic approximation of the function $f(x) = 2x_1 + x_1^2 + 2x_2 + x_2^2$, $x \in \mathbb{R}^2$, at $x_0 = (0, 0)^T$. Is the Hessian positive definite? Is the problem bounded?

11 / 36 Figure: the function has a positive definite Hessian everywhere in its feasible domain, but its function value is unbounded.


13 / 36 Necessary and sufficient conditions
First-order necessary condition: If $f(x)$, $x \in X \subseteq \mathbb{R}^n$, has a local minimum at an interior point $x^*$ of the set $X$ and if $f(x)$ is continuously differentiable at $x^*$, then $g_{x^*} = 0$.
Second-order optimality conditions: Let $f(x)$ be twice differentiable at the point $x^*$.
1. (Necessity) If $x^*$ is a local solution, then $g_{x^*} = 0$ and the Hessian at $x^*$ is positive semidefinite.
2. (Sufficiency) If the Hessian of $f(x)$ is positive definite at a stationary point $x^*$, i.e., $g_{x^*} = 0$, then $x^*$ is a local minimum.
Exercise 3: Find the solution(s) for $\min_{x \in \mathbb{R}^2} f(x) = 4x_1^2 - 4x_1 x_2 + x_2^2 - 4x_1 + 2x_2$. What about a first-order sufficient condition?
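The second-order conditions translate directly into an eigenvalue test on the Hessian. A sketch of such a checker, applied here to the function from Exercise 1; `check_second_order` is an illustrative helper, not a standard routine:

```python
import numpy as np

def check_second_order(grad, hess, x, tol=1e-8):
    """Classify a candidate point using the optimality conditions above."""
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) > tol:
        return "not stationary"
    eigs = np.linalg.eigvalsh(H)     # H is symmetric
    if np.all(eigs > tol):
        return "local minimum"       # sufficiency: g = 0 and H positive-definite
    if np.all(eigs >= -tol):
        return "inconclusive (H only positive-semidefinite)"
    return "not a local minimum"     # second-order necessity violated

# f(x) = (3 - x1)^2 + (4 - x2)^2 from Exercise 1
grad = lambda x: np.array([-2 * (3 - x[0]), -2 * (4 - x[1])])
hess = lambda x: 2 * np.eye(2)
print(check_second_order(grad, hess, np.array([3.0, 4.0])))   # local minimum
```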

14 / 36 Proof of the first-order necessary condition
From the first-order approximation at $x^*$ we have
$$f(x) = f(x^*) + g_{x^*}^T (x - x^*) + o(\|x - x^*\|). \quad (1)$$
Let $x = x^* - t g_{x^*}$. (Here we deliberately pick $-g_{x^*}$ as the direction.) Since $x^*$ is a local solution, we have $f(x^* - t g_{x^*}) - f(x^*) \ge 0$ for all sufficiently small $t > 0$. Substituting Equation (1):
$$0 \le \frac{f(x^* - t g_{x^*}) - f(x^*)}{t} = -\|g_{x^*}\|^2 + \frac{o(t \|g_{x^*}\|)}{t}.$$
Taking $t \to 0$, we have $0 \le -\|g_{x^*}\|^2 \le 0$, requiring $g_{x^*} = 0$.

15 / 36 Proof of the second-order necessary condition
From the second-order approximation at $x^*$ we have
$$f(x) = f(x^*) + g_{x^*}^T (x - x^*) + \frac{1}{2}(x - x^*)^T H_{x^*} (x - x^*) + o(\|x - x^*\|^2). \quad (2)$$
Let $x = x^* + t d$, where $d$ is a unit direction, i.e., $\|d\| = 1$. Using the first-order necessary condition ($g_{x^*} = 0$) and the fact that $x^*$ is a local solution, we have
$$0 \le \frac{f(x^* + t d) - f(x^*)}{t^2} = \frac{1}{2} d^T H_{x^*} d + \frac{o(t^2)}{t^2}.$$
Taking $t \to 0$, we have $0 \le d^T H_{x^*} d$. Since $d$ is arbitrarily chosen, $H_{x^*}$ is positive semidefinite.

16 / 36 Figure: the Hessian at the stationary point is positive semidefinite, but the stationary point is not a local minimum.


18 / 36 Convex sets and convex functions
Definition (convex set): A set $S \subseteq \mathbb{R}^n$ is convex if and only if, for every pair of points $x_1, x_2 \in S$, the point $x(\lambda) = \lambda x_2 + (1 - \lambda) x_1$, $0 \le \lambda \le 1$, also belongs to the set.

19 / 36 Convex sets and convex functions
Definition (convex function): A function $f: X \to \mathbb{R}$, $X \subseteq \mathbb{R}^n$, defined on a nonempty convex set $X$ is called convex on $X$ if and only if, for every $x_1, x_2 \in X$ and $0 \le \lambda \le 1$,
$$f(\lambda x_2 + (1 - \lambda) x_1) \le \lambda f(x_2) + (1 - \lambda) f(x_1).$$
Exercise 4: Show that the intersection of convex sets is convex; show that $f_1 + f_2$ is convex on the set $S$ if $f_1, f_2$ are convex on $S$.
Exercise 5: Show that $f(x_1) \ge f(x_0) + g_{x_0}^T (x_1 - x_0)$ for a convex differentiable function.

20 / 36 Convex sets and convex functions
A twice-differentiable function is convex if and only if its Hessian is positive semidefinite on its entire convex domain. (Hint: use Taylor's theorem to write $f(x_1) = f(x_0) + g_{x_0}^T (x_1 - x_0) + \frac{1}{2}(x_1 - x_0)^T H_{x(\lambda)} (x_1 - x_0)$ for some $x(\lambda) = \lambda x_1 + (1 - \lambda) x_0$, $\lambda \in [0, 1]$.)
A positive definite Hessian implies strict convexity, but the converse is generally not true. Example? ($f(x) = x^4$ is strictly convex, yet $f''(0) = 0$.)
First-order sufficient condition: If a differentiable convex function with a convex open domain has a stationary point, this point is the global minimum. If the function is strictly convex, the minimum is unique.
If the function is convex but not strictly convex, will the minimum be unique? If the function is strictly convex, can the minimum fail to be unique?
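The defining inequality of convexity is easy to spot-check numerically. A sketch assuming a particular quadratic whose Hessian $\begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}$ is positive definite, hence convex:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # convex: positive-definite Hessian everywhere
    return x[0]**2 + 2 * x[1]**2 + x[0] * x[1]

# spot-check f(lam*x2 + (1-lam)*x1) <= lam*f(x2) + (1-lam)*f(x1)
ok = True
for _ in range(1000):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    lhs = f(lam * x2 + (1 - lam) * x1)
    rhs = lam * f(x2) + (1 - lam) * f(x1)
    ok &= lhs <= rhs + 1e-12   # small slack for floating-point error
print(ok)
```

A random spot-check cannot prove convexity, of course; it can only fail to refute it.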

21 / 36 Gradient descent
In practice it is hard to solve for the optimal solution $x^*$ by hand because (i) the system of equations from the first-order necessary condition may not be easy to solve, or (ii) the objective may not have an analytical form. Therefore, we need an iterative procedure that produces a sequence $x_0, x_1, \ldots, x_k$ converging to $x^*$. One naive choice is
$$x_{k+1} = x_k - g_k.$$
Why?
Exercise 6: Try this method on the problem $\min_x f(x) = x_1^4 - 2x_1^2 x_2 + x_2^2$ with $x_0 = (1.1, 1)^T$. Explain your observation.
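The naive update can be tried directly on Exercise 6's function, $f(x) = (x_1^2 - x_2)^2$. A sketch:

```python
import numpy as np

def f(x):
    return x[0]**4 - 2 * x[0]**2 * x[1] + x[1]**2   # = (x1^2 - x2)^2

def grad(x):
    r = x[0]**2 - x[1]
    return np.array([4 * x[0] * r, -2 * r])

x = np.array([1.1, 1.0])
vals = [f(x)]
for _ in range(6):
    x = x - grad(x)            # full gradient step, no step-size control
    vals.append(f(x))
print(vals)                    # the values grow rapidly: the iteration diverges
```

The direction $-g_k$ is a descent direction, but the implicit step size of 1 overshoots, which is exactly the observation the exercise asks for.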

22 / 36 Gradient descent. Figure: results from Exercise 6. The gradient steps point in the correct directions, but their step sizes are not desirable.

23 / 36 Gradient descent. Figure: setting the step size to 0.001 allows the algorithm to converge, but only slowly (it takes more than 1000 steps to meet the target tolerance $\|g\| \le 10^{-6}$).

24 / 36 Armijo line search
In Armijo line search, we construct the function $\varphi(\alpha) = f_k + \alpha t\, g_k^T s_k$ for some $t \in [0.01, 0.3]$, where $s_k$ is the search direction, and denote $f(\alpha) := f(x_k + \alpha s_k)$. Starting from a large value (default $\alpha = 1$), $\alpha$ is reduced by $\alpha \leftarrow b\alpha$ for some $b \in [0.1, 0.8]$ until $f(\alpha) < \varphi(\alpha)$, at which point it is guaranteed that $f(\alpha) < f_k$, since $\varphi(\alpha) < f_k$ for any descent direction ($g_k^T s_k < 0$).
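A sketch of the backtracking loop, assuming the parameter choices $t = 0.1$ and $b = 0.5$; `armijo` is a hypothetical helper name:

```python
import numpy as np

def armijo(f, fk, gk, xk, sk, t=0.1, b=0.5, alpha0=1.0, max_iter=50):
    """Shrink alpha until f(xk + alpha*sk) falls below the Armijo
    bound phi(alpha) = fk + alpha * t * gk.sk."""
    alpha = alpha0
    for _ in range(max_iter):
        if f(xk + alpha * sk) < fk + alpha * t * (gk @ sk):
            return alpha
        alpha *= b                  # reduce the step and try again
    return alpha

# example: quadratic bowl, steepest-descent direction sk = -gk
f = lambda x: x @ x
xk = np.array([2.0, -1.0])
gk = 2 * xk
alpha = armijo(f, f(xk), gk, xk, -gk)
print(alpha, f(xk + alpha * (-gk)) < f(xk))   # accepted step decreases f
```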

25 / 36 Gradient algorithm with line search
Algorithm 1 Gradient algorithm
1: Select $x_0$, $\varepsilon > 0$. Compute $g_0$. Set $k = 0$.
2: while $\|g_k\| \ge \varepsilon$ do
3: Compute $\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha g_k)$.
4: Set $x_{k+1} = x_k - \alpha_k g_k$. Compute $g_{k+1}$. Set $k = k + 1$.
5: end while
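Algorithm 1 can be sketched in Python, with the exact minimization in Step 3 replaced by a backtracking search (an assumption; any sufficient-decrease rule would do):

```python
import numpy as np

def gradient_descent(f, grad, x0, eps=1e-6, max_iter=10000):
    """Gradient algorithm with a backtracking line search standing in
    for the exact arg-min of Step 3."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            return x, k
        alpha = 1.0
        while f(x - alpha * g) >= f(x) - 0.1 * alpha * (g @ g):
            alpha *= 0.5           # backtrack until sufficient decrease
        x = x - alpha * g
    return x, max_iter

f = lambda x: (3 - x[0])**2 + (4 - x[1])**2
grad = lambda x: np.array([-2 * (3 - x[0]), -2 * (4 - x[1])])
x_star, iters = gradient_descent(f, grad, [0.0, 0.0])
print(x_star, iters)               # converges to (3, 4)
```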

26 / 36 Figure: α = 0.001, w/o line search

27 / 36 Figure: line search w/ t = 0.3, b = 0.2

28 / 36 Newton's method
Instead of using the second-order approximation in the line search, we can use it to find the direction:
$$f_{k+1} = f_k + g_k^T \Delta x_k + \frac{1}{2} \Delta x_k^T H_k \Delta x_k.$$
The first-order necessary condition for minimizing this approximation of $f_{k+1}$ requires
$$x_{k+1} = x_k - H_k^{-1} g_k.$$
If the function is locally strictly convex, this iteration will yield a lower function value. Newton's method moves efficiently in the neighborhood of a local minimum, where local convexity is present.

29 / 36 Newton's method
Newton's method still requires a line search, since the second-order approximation may not capture the actual function.
Algorithm 2 Newton's method
1: Select $x_0$, $\varepsilon > 0$. Compute $g_0$ and $H_0$. Set $k = 0$.
2: while $\|g_k\| \ge \varepsilon$ do
3: Compute $\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha H_k^{-1} g_k)$.
4: Set $x_{k+1} = x_k - \alpha_k H_k^{-1} g_k$. Compute $g_{k+1}$ and $H_{k+1}$. Set $k = k + 1$.
5: end while
Exercise 7: Try this method on the problem $\min_x f(x) = \frac{1}{3}x_1^3 + x_1 x_2 + \frac{1}{2}x_2^2 + 2x_2$ with $x_0 = (1, 1)^T$, $(-1, 1)^T$, $(-3, 0)^T$.
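A sketch of the Newton iteration applied to Exercise 7, here without the line search to keep it short; the gradient $g = (x_1^2 + x_2,\; x_1 + x_2 + 2)^T$ and Hessian are hand-derived from $f$:

```python
import numpy as np

def newton(grad, hess, x0, eps=1e-10, max_iter=50):
    """Plain Newton iteration x <- x - H^{-1} g (no line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - np.linalg.solve(hess(x), g)   # solve H dx = g rather than inverting H
    return x

# Exercise 7: f = (1/3)x1^3 + x1*x2 + (1/2)x2^2 + 2*x2
grad = lambda x: np.array([x[0]**2 + x[1], x[0] + x[1] + 2.0])
hess = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 1.0]])
x_a = newton(grad, hess, [1.0, 1.0])     # approaches (2, -4), a local minimum
x_b = newton(grad, hess, [-1.0, 1.0])    # lands on (-1, -1), a saddle point
print(x_a, x_b)
```

Note that the second start converges to the saddle $(-1, -1)$: a zero gradient is all the plain iteration seeks, which illustrates the caveat on the next slide.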

30 / 36 Newton's method. Exercise 7 cont.: different starting points lead to different solutions. Newton's method does not guarantee a descent direction when the objective function is nonconvex.

31 / 36 Exercise (1/3)
4.6 Using the methods of this chapter, find the minimum of the function $f = (1 - x_1)^2 + 100(x_2 - x_1^2)^2$. This is the well-known Rosenbrock banana function, a test function for numerical optimization algorithms.
4.8 Prove by completing the square that if a function $f(x_1, x_2)$ has a stationary point, then this point is
(a) a local minimum, if $(\partial^2 f/\partial x_1^2)(\partial^2 f/\partial x_2^2) - (\partial^2 f/\partial x_1 \partial x_2)^2 > 0$ and $\partial^2 f/\partial x_1^2 > 0$;
(b) a local maximum, if $(\partial^2 f/\partial x_1^2)(\partial^2 f/\partial x_2^2) - (\partial^2 f/\partial x_1 \partial x_2)^2 > 0$ and $\partial^2 f/\partial x_1^2 < 0$;
(c) a saddle point, if $(\partial^2 f/\partial x_1^2)(\partial^2 f/\partial x_2^2) - (\partial^2 f/\partial x_1 \partial x_2)^2 < 0$.

32 / 36 Exercise (2/3)
4.10 Show that the stationary point of the function $f = 2x_1^2 - 4x_1 x_2 + 1.5x_2^2 + x_2$ is a saddle. Find the directions of downslopes away from the saddle using the differential quadratic form.
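Exercise 4.10 can be checked numerically: since $f$ is quadratic, its Hessian is constant and the stationary point solves a linear system. A sketch:

```python
import numpy as np

# Exercise 4.10: f = 2x1^2 - 4x1x2 + 1.5x2^2 + x2, so g(x) = Hx + (0, 1)^T
H = np.array([[4.0, -4.0], [-4.0, 3.0]])          # constant Hessian
x_s = np.linalg.solve(H, np.array([0.0, -1.0]))   # stationary point: Hx = -(0, 1)
eigvals, eigvecs = np.linalg.eigh(H)
print(x_s, eigvals)    # one negative eigenvalue: the point is a saddle

# the eigenvector of the negative eigenvalue gives a downslope direction
d = eigvecs[:, 0]
print(d @ H @ d)       # negative quadratic form along d
```

The differential quadratic form $d^T H d$ is negative exactly along directions with a component in this eigenvector, which answers the downslope part of the exercise.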

33 / 36 Exercise (3/3)
4.12 Find the point in the plane $x_1 + 2x_2 + 3x_3 = 1$ in $\mathbb{R}^3$ that is nearest to the point $(-1, 0, 1)^T$.
4.15 Consider the function $f = -x_2 + 2x_1 x_2 + x_1^2 + x_2^2 - 3x_1^2 x_2 - 2x_1^3 + 2x_1^4$.
(a) Show that the point $(1, 1)^T$ is stationary and that the Hessian is positive semidefinite there.
(b) Find a straight line along which the second-order perturbation $\Delta f$ is zero.

34 / 36 Stabilization
Given a symmetric matrix $M_k$, the gradient and Newton methods can be classed together by the general iteration
$$x_{k+1} = x_k - \alpha M_k g_k,$$
where $M_k = I$ for gradient descent and $M_k = H_k^{-1}$ for Newton's method. By the first-order Taylor approximation, $f_{k+1} - f_k \approx -\alpha g_k^T M_k g_k$, so descent is accomplished for positive definite $M_k$. To ensure descent, we can use $M_k = (H_k + \mu_k I)^{-1}$ and select a suitable positive scalar $\mu_k$.
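A sketch of the stabilization rule, assuming $\mu_k$ is grown by doubling until a Cholesky factorization of $H_k + \mu_k I$ succeeds (one common heuristic; the function name is illustrative):

```python
import numpy as np

def stabilized_direction(g, H, mu0=1e-3):
    """Return d = -(H + mu*I)^{-1} g with mu increased until H + mu*I
    is positive-definite (detected via Cholesky)."""
    mu = 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + mu * np.eye(len(g)))
            break
        except np.linalg.LinAlgError:   # not positive-definite yet
            mu = max(2 * mu, mu0)       # grow the shift and retry
    y = np.linalg.solve(L, -g)          # forward substitution
    return np.linalg.solve(L.T, y), mu  # back substitution

g = np.array([1.0, -2.0])
H = np.array([[-2.0, 1.0], [1.0, 1.0]])   # indefinite Hessian
d, mu = stabilized_direction(g, H)
print(mu, g @ d)    # mu > 0 and g.d < 0: a descent direction despite indefinite H
```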

35 / 36 Trust region (1/2)
Newton's method is faster if $x_k$ is close to $x^*$. If the search step $-H_k^{-1} g_k$ is really short, or if $H_k$ is very different from the Hessian at $x^*$, then $-g_k$ may be a better direction to search.
Trust region algorithm: search using a quadratic approximation, but restrict the search step to the trust region with radius $\Delta$ at $x_k$:
$$\min_s \quad \Delta f \approx g_k^T s + \frac{1}{2} s^T H_k s \quad \text{subject to} \quad \|s\| \le \Delta.$$
If $\|H_k^{-1} g_k\| < \Delta$, then $s_k = -H_k^{-1} g_k$; otherwise $s_k = -(H_k + \mu I)^{-1} g_k$ with $\mu > 0$ chosen so that $\|s_k\| = \Delta$.

36 / 36 Trust region (2/2)
Given $\Delta$, we evaluate $f(x_k + s_k)$ and calculate the ratio
$$r_k = \frac{f(x_k) - f(x_k + s_k)}{-(g_k^T s_k + \frac{1}{2} s_k^T H_k s_k)}.$$
The trust region radius is increased when $r_k > 0.75$ (indicating a good step) and decreased when $r_k < 0.25$ (indicating a bad step).
Algorithm 3 Trust region algorithm
1: Start with $x_0$ and $\Delta_0 > 0$. Set $k = 0$.
2: Calculate the step $s_k$.
3: Calculate the value $f(x_k + s_k)$ and the ratio $r_k$.
4: If $f(x_k + s_k) \ge f(x_k)$, then set $\Delta_{k+1} = \Delta_k / 2$, $x_{k+1} = x_k$, $k = k + 1$ and go to Step 2.
5: Else, set $x_{k+1} = x_k + s_k$. If $r_k < 0.25$, set $\Delta_{k+1} = \Delta_k / 2$; if $r_k > 0.75$, set $\Delta_{k+1} = 2\Delta_k$; otherwise set $\Delta_{k+1} = \Delta_k$. Set $k = k + 1$ and go to Step 2.
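Algorithm 3 can be sketched end to end. The step rule follows the previous slide; finding $\mu$ so that $\|s_k\| = \Delta$ is done here by bisection, which is an implementation choice, not part of the slides. Tried on the Rosenbrock function from Exercise 4.6:

```python
import numpy as np

def tr_step(g, H, delta):
    """Full Newton step when H is positive-definite and the step fits in
    the region; otherwise shift H by mu*I, with mu found by bisection so
    the step length matches the radius (an implementation choice)."""
    eigs = np.linalg.eigvalsh(H)
    if eigs[0] > 0:
        s = -np.linalg.solve(H, g)
        if np.linalg.norm(s) <= delta:
            return s
    I = np.eye(len(g))
    lo = max(0.0, -eigs[0]) + 1e-10        # make H + mu*I positive-definite
    hi = lo + 1.0
    while np.linalg.norm(np.linalg.solve(H + hi * I, g)) > delta:
        hi *= 2
    for _ in range(80):                    # bisection on mu
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(H + mid * I, g)) > delta:
            lo = mid
        else:
            hi = mid
    return -np.linalg.solve(H + hi * I, g)

def trust_region(f, grad, hess, x0, delta=1.0, eps=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < eps:
            break
        s = tr_step(g, H, delta)
        pred = -(g @ s + 0.5 * s @ H @ s)  # predicted decrease of the model
        if f(x + s) >= f(x):               # bad step: shrink the region, retry
            delta /= 2
            continue
        r = (f(x) - f(x + s)) / pred
        x = x + s
        delta = delta / 2 if r < 0.25 else (2 * delta if r > 0.75 else delta)
    return x

# Rosenbrock function from Exercise 4.6
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400 * x[1] + 1200 * x[0]**2, -400 * x[0]],
                           [-400 * x[0], 200.0]])
x_opt = trust_region(f, grad, hess, [-1.2, 1.0])
print(x_opt)    # approaches the minimum (1, 1)
```

The rejection branch mirrors Step 4 of Algorithm 3: the radius is halved and a new, shorter step is computed from the same point.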