Lecture 7: CS395T Numerical Optimization for Graphics and AI Trust Region Methods

Qixing Huang, The University of Texas at Austin, huangqx@cs.utexas.edu

1 Disclaimer

This note is adapted from Chapter 4 of: Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer Series in Operations Research and Financial Engineering, Springer, New York, NY, 2nd edition, 2006.

2 Introduction

Line search methods and trust-region methods both generate steps with the help of a quadratic model of the objective function, but they use this model in different ways. Line search methods use it to generate a search direction, and then focus their efforts on finding a suitable step length along this direction. Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. In effect, they choose the direction and the length of the step simultaneously. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.

Remark 2.1. Trust-region methods have proven to be very effective in various applications. However, they are used less often than line search methods, partly because they are more complicated to understand and implement. Nevertheless, below is a list of recent papers that use trust-region ideas:

- Trust Region Policy Optimization. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889-1897.
- Fast Trust Region for Segmentation. Lena Gorelick, Frank R. Schmidt, and Yuri Boykov. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1714-1721.
- Discriminative Clustering for Image Co-segmentation. Armand Joulin, Francis Bach, and Jean Ponce. CVPR, 2010.
The size of the trust region is critical to the effectiveness of each step. If the region is too small, the algorithm misses an opportunity to take a substantial step that would move it much closer to the minimizer of the objective function. If it is too large, the minimizer of the model may be far from the minimizer of the objective function within the region, so we may have to reduce the size of the region and try again. In practical algorithms, we choose the size of the region according to the performance of the algorithm during previous iterations. If the model is consistently reliable, producing good steps and accurately predicting the behavior of the objective function along these steps, the size of the trust region may be increased to allow longer, more ambitious steps to be taken. A failed step is an indication that our model is an inadequate representation of the objective function over the current trust region; after such a step, we reduce the size of the region and try again.

The difference between trust-region methods and Newton's method is that in many instances the second-order approximation is accurate only within a certain region, while the Newton step may lie outside this region.

In this section, we will assume that the model function $m_k$ used at each iterate $x_k$ is quadratic. Moreover, $m_k$ is based on the Taylor-series expansion of $f$ around $x_k$, which is
$$f(x_k + p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T \nabla^2 f(x_k + tp)\, p,$$
where $t$ is some scalar in the interval $(0, 1)$. By using an approximation $B_k$ to the Hessian in the second-order term, $m_k$ is defined as follows:
$$m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p,$$
where $B_k$ is some symmetric matrix. For example, the choice $B_k = \nabla^2 f(x_k)$ leads to the trust-region Newton method. In this lecture, we will carry out the discussion assuming only that $B_k$ is symmetric and uniformly bounded. More precisely, we will consider the following subproblem:
$$\min_{p \in \mathbb{R}^n} \; m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p \quad \text{s.t.} \quad \|p\| \le \Delta_k, \qquad (1)$$
where $\Delta_k > 0$ is the trust-region radius. As we will see later, (1) can be solved without too much computational cost. Moreover, in most cases, we only need an approximate solution to obtain convergence and good practical behavior.

Outline of the Trust-Region Approach

One of the key ingredients in a trust-region algorithm is the strategy for choosing the trust-region radius $\Delta_k$ at each iteration. We base this choice on the agreement between the model function $m_k$ and the objective function $f$ at previous iterations. Given a step $p_k$, we define the ratio
$$\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}.$$
The numerator is called the actual reduction, and the denominator is the predicted reduction (that is, the reduction in $f$ predicted by the model function). Note that since the step $p_k$ is obtained by minimizing the model $m_k$ over a region that includes $p = 0$, the predicted reduction will always be nonnegative. Hence, if $\rho_k$ is negative, the new objective value $f(x_k + p_k)$ is greater than the current value $f(x_k)$, so the step must be rejected. On the other hand, if $\rho_k$ is close to 1, there is good agreement between the model $m_k$ and the function $f$ over this step, so it is safe to expand the trust region for the next iteration. If $\rho_k$ is positive but significantly smaller than 1, we do not alter the trust region, but if it is close to zero or negative, we shrink the trust region by reducing $\Delta_k$ at the next iteration.

Exercise 1. Convert the paragraph above into a formal algorithm.

3 Main Theorem Regarding the Subproblem

A trust-region method amounts to solving the following optimization problem at each iteration:
$$\min_{p} \; f + g^T p + \frac{1}{2} p^T B p \quad \text{s.t.} \quad \|p\| \le \Delta. \qquad (2)$$
(Here we drop the iteration index and write $f = f(x_k)$, $g = \nabla f(x_k)$, $B = B_k$, and $\Delta = \Delta_k$.) The following theorem characterizes the optimal solution to this problem.
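Before analyzing the subproblem, the radius-update strategy described in the outline above can be assembled into a complete loop. Below is a minimal Python sketch (not part of the original notes): it approximately minimizes the model using the Cauchy point, i.e., the minimizer of $m_k$ along the steepest-descent direction within the region, and the thresholds 1/4 and 3/4 and the acceptance parameter $\eta$ are conventional illustrative choices. The function names are hypothetical.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Approximate minimizer of m(p) = g^T p + 1/2 p^T B p over ||p|| <= delta,
    taken along the steepest-descent direction -g (the Cauchy point)."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = g @ B @ g
    # Unrestricted minimizer along -g has step ||g||^2 / (g^T B g); clip to the region.
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def trust_region(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                 eta=0.15, tol=1e-6, max_iter=200):
    """Basic trust-region loop following the radius-update rules above."""
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = cauchy_point(g, B, delta)
        pred = -(g @ p + 0.5 * p @ B @ p)        # predicted reduction m(0) - m(p)
        actual = f(x) - f(x + p)                 # actual reduction
        rho = actual / pred if pred > 0 else -1.0
        if rho < 0.25:                           # poor agreement: shrink the region
            delta *= 0.25
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)  # good agreement on the boundary: expand
        if rho > eta:                            # accept the step only if it reduces f enough
            x = x + p
    return x
```

On a convex quadratic with the exact Hessian as $B_k$, the model agrees with $f$ exactly, so $\rho_k \approx 1$ at every iteration and the region expands until the steps are unrestricted.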

Theorem 3.1. The vector $p^*$ is a global solution of (2) if and only if $p^*$ is feasible and there is a scalar $\lambda \ge 0$ such that the following conditions are satisfied:
$$(B + \lambda I) p^* = -g, \qquad \lambda(\Delta - \|p^*\|) = 0, \qquad B + \lambda I \text{ is positive semidefinite}. \qquad (3)$$

Proof Sketch. The first step is to simplify the problem: diagonalizing $B$ and changing variables converts (2) into the following simplified form:
$$\min_{y_1, \dots, y_n} \; \frac{1}{2} \sum_i \lambda_i (y_i - o_i)^2 \quad \text{subject to} \quad \sum_i y_i^2 \le \Delta^2, \qquad (4)$$
where the $\lambda_i$ are the eigenvalues of $B$ and the $o_i$ are determined by $g$. Without loss of generality, we can assume $o_i \ge 0$, since otherwise we may flip the sign of the corresponding $y_i$. So the global solution either occurs at $y_i = o_i$, $1 \le i \le n$ (which requires $\lambda_i > 0$ for all $i$ and $\sum_i o_i^2 \le \Delta^2$), or lies on the boundary of the constraint. If it is on the boundary, the level sets of $\sum_i y_i^2 = \Delta^2$ and $\sum_i \lambda_i (y_i - o_i)^2$ must have a common tangent plane at the solution $y^*$, with gradients pointing in opposite directions. In other words, there exists a number $\lambda \ge 0$ such that
$$\lambda_i (y_i^* - o_i) = -\lambda y_i^*, \quad 1 \le i \le n,$$
which means
$$y_i^* = \frac{\lambda_i}{\lambda + \lambda_i}\, o_i, \quad 1 \le i \le n.$$
Now consider the augmented objective function
$$\hat{f}(y) = \sum_i \left( \lambda y_i^2 + \lambda_i (y_i - o_i)^2 \right).$$
For any other $y$ with $\Delta^2 = y^T y = (y^*)^T y^*$, we have $\hat{f}(y) \ge \hat{f}(y^*)$, and
$$\hat{f}(y) - \hat{f}(y^*) = \sum_i \lambda_i \left( (y_i - o_i)^2 - (y_i^* - o_i)^2 \right) = \sum_i (\lambda + \lambda_i)(y_i - y_i^*)^2 \ge 0.$$
Since the directions $y - y^*$ cover a dense subset of all directions, it follows that $\lambda + \lambda_i \ge 0$ for every $i$, i.e., $B + \lambda I$ must be positive semidefinite.

Formal Proof. The proof relies on the following technical lemma, which deals with the unconstrained minimizers of quadratics and is particularly interesting in the case where the Hessian is positive semidefinite.

Lemma 3.1. Let $m$ be the quadratic function defined by
$$m(p) = g^T p + \frac{1}{2} p^T B p,$$
where $B$ is any symmetric matrix. Then the following statements are true:

(i) $m$ attains a minimum if and only if $B$ is positive semidefinite and $g$ is in the range of $B$.
(ii) If $B$ is positive semidefinite, then every $p$ satisfying $Bp = -g$ is a global minimizer of $m$.
(iii) $m$ has a unique minimizer if and only if $B$ is positive definite.

Proof. We prove each of the three claims in turn. We start with the "if" part of (i). Since $g$ is in the range of $B$, there is a $p$ with $Bp = -g$. For all $w \in \mathbb{R}^n$, we have
$$m(p + w) = g^T (p + w) + \frac{1}{2} (p + w)^T B (p + w) = \left( g^T p + \frac{1}{2} p^T B p \right) + g^T w + (Bp)^T w + \frac{1}{2} w^T B w = m(p) + \frac{1}{2} w^T B w \ge m(p),$$
since $B$ is positive semidefinite (note that $g^T w + (Bp)^T w = g^T w - g^T w = 0$). Hence, $p$ is a minimizer of $m$, which also establishes (ii). For the "only if" part of (i), let $p$ be a minimizer of $m$. Since $\nabla m(p) = Bp + g = 0$, we have that $g$ is in the range of $B$. Also, $\nabla^2 m(p) = B$ is positive semidefinite, giving the result.

For the "if" part of (iii), the same argument as above suffices with the additional point that $w^T B w > 0$ whenever $w \ne 0$. For the "only if" part, we proceed as in (i) to deduce that $B$ is positive semidefinite. If $B$ is not positive definite, there is a vector $w \ne 0$ such that $Bw = 0$. Hence, we have $m(p + w) = m(p)$, so the minimizer is not unique, giving a contradiction.

Now let us return to the proof of the theorem. Assume first that there is $\lambda \ge 0$ such that the conditions (3) are satisfied. Lemma 3.1 implies that $p^*$ is a global minimizer of the quadratic function
$$\hat{m}(p) = g^T p + \frac{1}{2} p^T (B + \lambda I) p = m(p) + \frac{\lambda}{2} p^T p.$$
Since $\hat{m}(p) \ge \hat{m}(p^*)$, we have
$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left( (p^*)^T p^* - p^T p \right). \qquad (5)$$
Because $\lambda(\Delta - \|p^*\|) = 0$ and therefore $\lambda(\Delta^2 - (p^*)^T p^*) = 0$, we have
$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left( \Delta^2 - p^T p \right).$$
Hence, from $\lambda \ge 0$, we have $m(p) \ge m(p^*)$ for all $p$ with $\|p\| \le \Delta$. Therefore, $p^*$ is a global minimizer of the subproblem.

For the converse, we assume that $p^*$ is a global solution of (2) and show that there is a $\lambda \ge 0$ that satisfies (3). In the case $\|p^*\| < \Delta$, $p^*$ is an unconstrained minimizer of $m$, and so
$$\nabla m(p^*) = Bp^* + g = 0, \qquad \nabla^2 m(p^*) = B \text{ positive semidefinite},$$
and so the properties (3) hold for $\lambda = 0$. Assume for the remainder of the proof that $\|p^*\| = \Delta$.
Then (3)(b) is immediately satisfied, and $p^*$ also solves the constrained problem
$$\min m(p) \quad \text{subject to} \quad \|p\| = \Delta.$$
By applying optimality conditions for constrained optimization to this problem (a simpler version of what we have discussed above), we find that there is a $\lambda$ such that the Lagrangian function defined by
$$\mathcal{L}(p, \lambda) := m(p) + \frac{\lambda}{2} \left( p^T p - \Delta^2 \right)$$
has a stationary point at $p^*$. By setting $\nabla_p \mathcal{L}(p^*, \lambda) = 0$, we obtain
$$Bp^* + g + \lambda p^* = 0 \;\Longrightarrow\; (B + \lambda I) p^* = -g,$$
so that (3)(a) holds. Since $m(p) \ge m(p^*)$ for any $p$ with $p^T p = (p^*)^T p^* = \Delta^2$, we have for such vectors $p$ that
$$m(p) \ge m(p^*) + \frac{\lambda}{2} \left( (p^*)^T p^* - p^T p \right).$$
If we substitute the expression $(B + \lambda I) p^* = -g$ into this inequality, we obtain after some rearrangement that
$$\frac{1}{2} (p - p^*)^T (B + \lambda I)(p - p^*) \ge 0.$$
Since the set of directions $\{ (p - p^*)/\|p - p^*\| : \|p\| = \Delta \}$ is dense on the unit sphere, $B + \lambda I$ must be positive semidefinite.

It remains to show that $\lambda \ge 0$. Because (3)(a) and (3)(c) are satisfied by $p^*$, we have from Lemma 3.1 that $p^*$ minimizes $\hat{m}$, so (5) holds. Suppose that there are only negative values of $\lambda$ that satisfy (3)(a) and (3)(c). Then we have from (5) that $m(p) \ge m(p^*)$ whenever $\|p\| \ge \|p^*\| = \Delta$. Since we already know that $p^*$ minimizes $m$ for $\|p\| \le \Delta$, it follows that $p^*$ is in fact a global, unconstrained minimizer of $m$. From Lemma 3.1 it follows that $Bp^* = -g$ and $B$ is positive semidefinite. Therefore conditions (3)(a) and (3)(c) are satisfied by $\lambda = 0$, which contradicts our assumption that only negative values of $\lambda$ can satisfy the conditions. We conclude that $\lambda \ge 0$, completing the proof.

This theorem suggests that the subproblem can be reformulated as follows: define
$$p(\lambda) = -(B + \lambda I)^{-1} g$$
for $\lambda$ sufficiently large that $B + \lambda I$ is positive definite, and seek a value $\lambda > 0$ such that $\|p(\lambda)\| = \Delta$. This is a one-dimensional root-finding problem in the variable $\lambda$. Using the eigendecomposition of $B$, with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$ and corresponding orthonormal eigenvectors $u_1, \dots, u_n$, it is easy to see that
$$\|p(\lambda)\|^2 = \sum_{j=1}^{n} \frac{(u_j^T g)^2}{(\lambda_j + \lambda)^2}.$$
Let $\lambda_n$ be the smallest eigenvalue of $B$; then it is clear that $\|p(\lambda)\|$ is a continuous, nonincreasing function of $\lambda$ on the interval $(-\lambda_n, \infty)$. In fact, we have
$$\lim_{\lambda \to +\infty} \|p(\lambda)\| = 0.$$
When $u_n^T g \ne 0$, the root of $\|p(\lambda)\| - \Delta = 0$ can be found using Newton's method. A numerically more stable strategy is to find the root of
$$\frac{1}{\Delta} - \frac{1}{\|p(\lambda)\|} = 0.$$
When $u_n^T g = 0$, it can be shown that the optimal $\lambda = -\lambda_n$.
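As a concrete illustration of this root-finding strategy, here is a minimal Python sketch (assuming the easy case $u_n^T g \ne 0$; the function name and tolerances are illustrative) that solves the subproblem via the eigendecomposition of $B$ and Newton's method on the secular equation $1/\|p(\lambda)\| - 1/\Delta = 0$:

```python
import numpy as np

def solve_tr_subproblem(g, B, delta, max_iter=100):
    """Solve min g^T p + 1/2 p^T B p s.t. ||p|| <= delta using the
    characterization (B + lam*I) p = -g of Theorem 3.1."""
    evals, U = np.linalg.eigh(B)           # eigenvalues in ascending order
    c = U.T @ g                            # coefficients u_j^T g
    # Interior case: B positive definite and the unconstrained step fits in the region.
    if evals[0] > 0:
        p = -U @ (c / evals)
        if np.linalg.norm(p) <= delta:
            return p
    # Boundary case: Newton's method on phi(lam) = 1/||p(lam)|| - 1/delta,
    # starting just to the right of -lambda_n (assumes u_n^T g != 0).
    lam = max(0.0, -evals[0]) + 1e-12
    for _ in range(max_iter):
        d = evals + lam
        pnorm = np.sqrt(np.sum((c / d) ** 2))
        phi = 1.0 / pnorm - 1.0 / delta    # secular equation
        if abs(phi) < 1e-12:
            break
        dphi = np.sum(c**2 / d**3) / pnorm**3   # phi'(lam)
        lam -= phi / dphi
        lam = max(lam, -evals[0] + 1e-14)  # keep B + lam*I positive definite
    return -U @ (c / (evals + lam))
```

The function $1/\|p(\lambda)\|$ is nearly linear in $\lambda$, which is why Newton's method on this form converges much more reliably than on $\|p(\lambda)\| - \Delta = 0$; the safeguard clamp keeps the iterate inside $(-\lambda_n, \infty)$.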