Optimization
Charles J. Geyer, School of Statistics, University of Minnesota
Stat 8054 Lecture Notes


One-Dimensional Optimization

Look at a graph. Grid search.

One-Dimensional Zero Finding

Zero finding means solving g(x) = 0 for x. If f(x) is the objective function for optimization and g(x) = f′(x), a zero of g is a stationary point of f, not necessarily an optimum. Bisection method. Secant method. Newton's method. R function uniroot (on-line help).
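The notes use the R function uniroot for this; as a language-neutral illustration, here is a minimal Python sketch of the bisection method. The test function x² − 2 and the bracketing interval are my own illustrative choices, not from the notes.

```python
def bisect(g, lo, hi, tol=1e-10):
    """Find a zero of g in [lo, hi], assuming g(lo) and g(hi) have opposite signs."""
    if g(lo) * g(hi) > 0:
        raise ValueError("g(lo) and g(hi) must bracket a zero")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Keep the half-interval whose endpoints still bracket the zero.
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# A stationary point of f(x) = x**3/3 - 2*x is a zero of g(x) = f'(x) = x**2 - 2.
root = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
```

Bisection halves the bracket each step (linear convergence), which is why uniroot-style solvers combine it with faster secant-type steps.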

Multi-Dimensional Optimization

Minimize f : S → ℝ.

Global minimum: a point x such that f(y) ≥ f(x) for all y ∈ S.

Local minimum: a point x such that f(y) ≥ f(x) for all y ∈ W, where W is a neighborhood of x in S.

Convex Optimization

A function f : ℝᵈ → (−∞, +∞] is convex if

    f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y),  for all x, y ∈ ℝᵈ and 0 < t < 1.

Theorem: every local minimum of a convex function is a global minimum.

A convex function is strictly convex if f(x) < ∞, f(y) < ∞, x ≠ y, and 0 < t < 1 imply

    f(tx + (1 − t)y) < t f(x) + (1 − t) f(y).

Theorem: a strictly convex function has at most one local minimum.

Convex Optimization (cont.)

The minimum need not be achieved. Consider f(x) = exp(−x), x ∈ ℝ: f(x) → 0 as x → ∞, but f(x) > 0 for all x.

One-Dimensional Derivative Tests

Necessary and sufficient for convexity:

(a) f′(x) ≥ f′(y) whenever x > y
(b) f(y) ≥ f(x) + f′(x)(y − x) whenever x ≠ y
(c) f″(x) ≥ 0 for all x

With ≥ replaced by >, (a) and (b) are necessary and sufficient for strict convexity, and (c) is sufficient but not necessary.

Multi-Dimensional Derivative Tests

Necessary and sufficient for convexity:

(a) ⟨y − x, ∇f(y) − ∇f(x)⟩ ≥ 0 whenever x ≠ y
(b) f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ whenever x ≠ y
(c) ∇²f(x) positive semidefinite for all x

With ≥ replaced by >, (a) and (b) are necessary and sufficient for strict convexity. With positive semidefinite replaced by positive definite, (c) is sufficient but not necessary.
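Condition (c) can be checked numerically for a quadratic. A Python sketch, with illustrative matrices of my own choosing: for a symmetric matrix A, the function f(x) = ½⟨Ax, x⟩ has Hessian A, so f is convex exactly when A is positive semidefinite, i.e. when the smallest eigenvalue of A is nonnegative.

```python
import numpy as np

def is_convex_quadratic(A):
    """Condition (c) for f(x) = (1/2) x'Ax: the Hessian A is positive
    semidefinite iff its smallest eigenvalue is >= 0 (up to rounding)."""
    A = np.asarray(A, dtype=float)
    return np.linalg.eigvalsh((A + A.T) / 2.0).min() >= -1e-12

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3: convex
B = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues -1 and 3: not convex
```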

Concave and Convex, Maximization and Minimization

f is convex if and only if −f is concave. In order to avoid lots of duplication, optimization theory only discusses minimization. Minimizing −f maximizes f. Switch from minimization to maximization by standing on your head.

Global Optimization

Except for the easy case of convexity, global optimization is hard. Deterministic algorithms are only available for very special problems, e.g., differences of convex functions, and computer code is not widely available. Many adaptive random search algorithms are advertised to do global optimization, but this is only a hope. They don't stop at the first local minimum they find, but....

Adaptive Random Search

IMHO an area rife with charlatanry. Widely used methods are justified by metaphors, not math:

simulated annealing
genetic algorithms

Scientists are people too. They like stories. Metaphors do not make optimization algorithms effective.

Charlie's Rules for Adaptive Random Search

When looking in a sequence of random places x₁, x₂, ..., xₙ:

1. Keep track of the xₖ that minimizes: f(xₖ) ≤ f(xᵢ), 1 ≤ i ≤ n.
2. Search intensively near the lowest point found so far.
3. Then search elsewhere.

Don't stop.
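The three rules above can be sketched as a toy Python search. The objective, proposal scale, mixing probability, and search range are all illustrative assumptions, not anything prescribed by the notes.

```python
import random

def adaptive_random_search(f, x0, n_iter=2000, seed=42):
    """Toy search following the three rules: track the best point (rule 1),
    mostly propose near it (rule 2), but sometimes jump elsewhere (rule 3)."""
    rng = random.Random(seed)
    best_x, best_f = x0, f(x0)
    for _ in range(n_iter):
        if rng.random() < 0.8:                  # rule 2: propose near the best
            x = best_x + rng.gauss(0.0, 0.1)
        else:                                   # rule 3: propose elsewhere
            x = rng.uniform(-10.0, 10.0)
        fx = f(x)
        if fx < best_f:                         # rule 1: keep the minimizer
            best_x, best_f = x, fx
    return best_x, best_f

x, fx = adaptive_random_search(lambda x: (x - 3.0) ** 2, x0=0.0)
```

Note there is no stopping rule, per the slide: in practice you stop at a computational budget, not at a claimed optimum.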

Charlie's Rules (cont.)

You can do things you can call simulated annealing or genetic algorithms that follow these rules. You can do things you can call simulated annealing or genetic algorithms that don't follow these rules. For any particular problem, if you bother to really think about it, you can probably invent an adaptive random search algorithm specifically for it that beats the metaphors.

Newton's Method

Method for solving simultaneous non-linear equations g(x) = 0, where g : ℝᵈ → ℝᵈ. Linearize:

    g(y) ≈ g(x) + ∇g(x)(y − x)

where ∇g(x) is the linear operator represented by the matrix of partial derivatives ∂gᵢ(x)/∂xⱼ. Set the right-hand side equal to zero and solve:

    y = x − (∇g(x))⁻¹ g(x)

Iterate and hope for convergence.

Newton's Method (cont.)

In optimization, Newton minimizes the quadratic approximation to the objective function

    m(y) = f(x) + ⟨∇f(x), y − x⟩ + ½ ⟨∇²f(x)(y − x), y − x⟩

if ∇²f(x) is positive semidefinite. The Newton update is

    xₙ₊₁ = xₙ − (∇²f(xₙ))⁻¹ ∇f(xₙ)
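The Newton update translates directly into code. A minimal Python sketch; the test function, its hand-coded gradient and Hessian, and the iteration count are my own illustrative choices.

```python
import numpy as np

def newton(grad, hess, x0, n_iter=20):
    """Newton iteration x_{n+1} = x_n - hess(x_n)^{-1} grad(x_n).
    Solves the linear system rather than forming the inverse."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x) = exp(x1) - x1 + x2**2, minimized where exp(x1) = 1: x1 = 0, x2 = 0.
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])
xstar = newton(grad, hess, [1.0, 3.0])
```

This bare iteration has none of the safeguarding discussed later in the notes; it simply "iterates and hopes for convergence."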

Big Oh and Little Oh

aₙ = O(bₙ) means aₙ/bₙ is bounded. aₙ = o(bₙ) means aₙ/bₙ converges to zero. Also used for continuous limits, e.g., in the definition of the derivative

    f(y) = f(x) + ⟨∇f(x), y − x⟩ + o(‖y − x‖)

Rates of Convergence

Let x₁, x₂, ... be the iterates of an optimization algorithm. Suppose xₙ → x, and let εₙ = ‖xₙ − x‖.

Linear convergence if εₙ₊₁ = O(εₙ).
Superlinear convergence if εₙ₊₁ = o(εₙ).
Quadratic convergence if εₙ₊₁ = O(εₙ²).
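Quadratic convergence can be seen numerically. A Python sketch applying Newton's method to g(x) = x² − 2 (my example, not from the notes): the ratio εₙ₊₁/εₙ² stays bounded, roughly doubling the number of correct digits per iteration.

```python
# Newton iteration for g(x) = x**2 - 2, so x* = sqrt(2); the error roughly
# squares at every step, the signature of quadratic convergence.
x, errors = 2.0, []
for _ in range(5):
    x = x - (x * x - 2.0) / (2.0 * x)
    errors.append(abs(x - 2.0 ** 0.5))

# Bounded ratios errors[n+1] / errors[n]**2 indicate quadratic convergence.
ratios = [errors[i + 1] / errors[i] ** 2 for i in range(3)]
```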

Newton's Method Theorem

Suppose ∇f(y) and ∇²f(y) are continuous in a neighborhood of a solution x and ∇²f(x) is positive definite; then Newton's method converges superlinearly (if it converges). If in addition

    ∇f(y) = ∇f(x) + ∇²f(x)(y − x) + O(‖y − x‖²)
    ∇²f(y) = ∇²f(x) + O(‖y − x‖)

then Newton's method converges quadratically (if it converges). These conditions hold when f has continuous third partial derivatives in a neighborhood of x.

Dennis-Moré Theorem

Let x₁, x₂, ... be the iterates of an optimization algorithm, and let

    Δₙ = −(∇²f(xₙ))⁻¹ ∇f(xₙ)

be the Newton step at xₙ. If the algorithm converges superlinearly, then

    xₙ₊₁ − xₙ = Δₙ + o(‖Δₙ‖)

Dennis-Moré Theorem (cont.)

Every superlinearly convergent optimization algorithm is asymptotically equivalent to Newton's method. (Its steps are equal to Newton steps plus a negligible amount.) Every superlinearly convergent optimization algorithm is asymptotically equivalent to any other superlinearly convergent optimization algorithm. To get quadratic convergence, we must have

    xₙ₊₁ − xₙ = Δₙ + O(‖Δₙ‖²)

Quasi-Newton

The update is

    xₙ₊₁ = xₙ − Bₙ ∇f(xₙ)    (1)

where

    Bₙ ≈ (∇²f(xₙ))⁻¹

If the approximation in (1) is good enough, then the algorithm will converge superlinearly (see Wikipedia page).
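In one dimension the simplest such Bₙ comes from a secant approximation to f″: this is the secant method applied to f′. A Python sketch, with an illustrative test function of my own choosing (practical multi-dimensional quasi-Newton methods such as BFGS build Bₙ from similar secant conditions).

```python
def secant_minimize(fprime, x0, x1, n_iter=30):
    """1-D quasi-Newton: approximate f''(x_n) by the secant slope
    (f'(x_n) - f'(x_{n-1})) / (x_n - x_{n-1}), so B_n is its reciprocal."""
    for _ in range(n_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if g1 == g0:
            break  # flat or converged: no further progress possible
        x0, x1 = x1, x1 - g1 * (x1 - x0) / (g1 - g0)
    return x1

# Minimize f(x) = x**4/4 - x, so f'(x) = x**3 - 1 and the minimum is at x = 1.
xstar = secant_minimize(lambda x: x ** 3 - 1.0, 0.0, 2.0)
```

The secant method converges superlinearly (order about 1.6), consistent with the Dennis-Moré characterization above.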

Why Newton Sucks

Although Newton is asymptotically most wonderful, it is not guaranteed to converge when started an appreciable distance from the solution. The least a minimization routine can do is take downhill steps. Newton isn't even guaranteed to do that (Rweb demo). Newton or quasi-Newton needs modification to be practically useful. Such modification is called safeguarding.

Line Search

When at xₙ with a search direction pₙ that points downhill,

    ⟨∇f(xₙ), pₙ⟩ < 0,

find a local minimum sₙ of the function g(s) = f(xₙ + s pₙ) and take the step

    xₙ₊₁ = xₙ + sₙ pₙ

A sloppy solution sₙ to the line search subproblem is o. k. (see Wolfe conditions).
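One standard way to accept a "sloppy" sₙ is backtracking under the Armijo sufficient-decrease condition, which is one half of the Wolfe conditions. A minimal Python sketch; the constants c and shrink and the example function are conventional illustrative choices, not values from the notes.

```python
def backtracking_step(f, grad_dot_p, x, p, s=1.0, c=1e-4, shrink=0.5):
    """Shrink the step until the Armijo sufficient-decrease condition
    f(x + s p) <= f(x) + c s <grad f(x), p> holds (p must point downhill,
    so grad_dot_p = <grad f(x), p> < 0)."""
    fx = f(x)
    while f(x + s * p) > fx + c * s * grad_dot_p:
        s *= shrink
    return s

# Minimize f(x) = x**2 from x = 3 along the downhill direction p = -1;
# here <grad f(3), p> = 6 * (-1) = -6.
f = lambda x: x * x
s = backtracking_step(f, grad_dot_p=-6.0, x=3.0, p=-1.0)
```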

Line Search (cont.)

If the search direction pₙ converges to the Newton direction, then the steps will be nearly Newton for large n, and the algorithm will have superlinear convergence. However, this algorithm is safeguarded: it is guaranteed to take only downhill steps.

Line Search (cont.)

Convergence to a local minimum cannot be guaranteed. In order for convergence to a stationary point to be guaranteed, the search directions pₙ must satisfy the angle criterion: if gₙ = ∇f(xₙ), then

    −⟨pₙ, gₙ⟩ / (‖pₙ‖ ‖gₙ‖)

(the cosine of the angle between pₙ and −gₙ) must be bounded away from zero.

Trust Region Methods

Trust region methods work by repeatedly solving the trust region subproblem

    minimize mₙ(p) = fₙ + ⟨gₙ, p⟩ + ½ ⟨Bₙ p, p⟩
    subject to ‖p‖ ≤ Δₙ

where

    fₙ = f(xₙ)
    gₙ = ∇f(xₙ)
    Bₙ = ∇²f(xₙ)

Like Newton, we minimize the quadratic approximation (Taylor series up to quadratic terms), but we only trust the approximation in a ball of radius Δₙ.

Trust Region Methods (cont.)

Let pₙ be the solution of the trust region subproblem (see the design document for the trust package for details). Let

    ρₙ = (f(xₙ) − f(xₙ + pₙ)) / (mₙ(0) − mₙ(pₙ))

The numerator is the actual decrease in the objective function if we take the step pₙ (positive if the step is downhill). The denominator is the decrease in the objective function of the trust region subproblem for the step pₙ (always positive).

Trust Region Methods (cont.)

Accept the proposed step only if there is an appreciable decrease in the objective function: if ρₙ ≥ 1/4, then xₙ₊₁ = xₙ + pₙ; otherwise xₙ₊₁ = xₙ.

Adjust the trust region radius based on the quality of the proposed step: if ρₙ < 1/4, then Δₙ₊₁ = ‖pₙ‖/4. If ρₙ > 3/4 and ‖pₙ‖ = Δₙ, then Δₙ₊₁ = min(2Δₙ, Δₘₐₓ). Otherwise Δₙ₊₁ = Δₙ.
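The acceptance and radius rules above translate directly into code. A Python sketch of just this bookkeeping step (the delta_max default is an assumed parameter, not a value from the notes):

```python
def trust_region_update(rho, step_norm, delta, delta_max=10.0):
    """Radius update from the slides: shrink if rho < 1/4, expand if
    rho > 3/4 and the step hit the boundary, otherwise keep delta.
    Returns (accept_step, new_delta)."""
    accept = rho >= 0.25
    if rho < 0.25:
        delta = step_norm / 4.0
    elif rho > 0.75 and step_norm == delta:
        delta = min(2.0 * delta, delta_max)
    return accept, delta
```

A full trust region routine would wrap this around a subproblem solver, as the trust package does in R.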

Trust Region Properties

Any cluster point x of the sequence of iterates satisfies the first and second order necessary conditions for optimality:

    ∇f(x) = 0
    ∇²f(x) is positive semidefinite

Converges quadratically if the solution satisfies the first and second order sufficient conditions (same as above except positive definite). Not bothered by a restricted domain of the objective function, so long as the solution is in the interior. Just works (trust package web page).

Constrained Optimization

    minimize f(x)
    subject to gᵢ(x) = 0,  i ∈ E
               gᵢ(x) ≤ 0,  i ∈ I    (2)

where E and I are disjoint finite sets. Say x is feasible if the constraints hold.

Lagrange Multipliers

The Lagrangian function is

    L(x, λ) = f(x) + Σ_{i ∈ E ∪ I} λᵢ gᵢ(x)

Theorem: if there exist x* and λ such that

(a) x* minimizes x ↦ L(x, λ),
(b) gᵢ(x*) = 0, i ∈ E, and gᵢ(x*) ≤ 0, i ∈ I,
(c) λᵢ ≥ 0, i ∈ I, and
(d) λᵢ gᵢ(x*) = 0, i ∈ I,

then x* solves the constrained problem (2).

Lagrange Multipliers (cont.)

Proof: By (a),

    L(x*, λ) ≤ L(x, λ)    (3)

so for feasible x

    f(x*) = f(x*) + Σ_{i ∈ E ∪ I} λᵢ gᵢ(x*)
          ≤ f(x) + Σ_{i ∈ E ∪ I} λᵢ gᵢ(x)
          ≤ f(x)

The equality is (d) (and (b) for i ∈ E), the first inequality is (3), and the last inequality is (c) and feasibility of x.

Kuhn-Tucker Conditions

(a) ∇f(x*) + Σ_{i ∈ E ∪ I} λᵢ ∇gᵢ(x*) = 0,
(b) gᵢ(x*) = 0, i ∈ E, and gᵢ(x*) ≤ 0, i ∈ I,
(c) λᵢ ≥ 0, i ∈ I, and
(d) λᵢ gᵢ(x*) = 0, i ∈ I.

Only (a) changed; the rest stay the same. Now there is no theorem: (a) is too weak.

Existence of Lagrange Multipliers

Lagrange multipliers such that the theorem applies always exist if

- all constraint functions are affine, or
- the objective function is convex, the inequality constraint functions are convex, and the equality constraint functions are affine, or
- the set { ∇gᵢ(x*) : i ∈ E ∪ I and gᵢ(x*) = 0 } is linearly independent.

(They may exist in other cases, but this is not guaranteed.)

R packages

Problem (2) is called linear programming if all the functions are affine. The R packages linprog (on-line help), lpSolve (on-line help), and rcdd (on-line help) do that.

Problem (2) is called quadratic programming if the objective function is quadratic and the constraint functions are affine. The R package quadprog (on-line help) does that.

Problem (2) is called nonlinear programming if general functions are allowed. The R package npsol (not installed, not free software, obtain from me) does that.

Isotonic Regression

Suppose we observe pairs (yᵢ, xᵢ), i = 1, ..., n, and we wish to estimate the regression function E(Y | X) subject to the condition that it is monotone. How?

Treat the xᵢ as fixed (condition on them). Assume

    Yᵢ ~ Normal(μᵢ, σ²)  and  xᵢ ≤ xⱼ implies μᵢ ≤ μⱼ

What is the MLE of the vector μ = (μ₁, ..., μₙ)?

Isotonic Regression (cont.)

The problem was solved in 1955 independently by two different groups of statisticians. Let z₁, ..., zₖ be the unique x values in sorted order, and define

    wⱼ = Σ_{i : xᵢ = zⱼ} 1
    Vⱼ = (1/wⱼ) Σ_{i : xᵢ = zⱼ} Yᵢ

Isotonic Regression (cont.)

Then the equivalent weighted least squares problem is the following. Assume

    Vⱼ ~ Normal(μⱼ, σ²/wⱼ)

and

    μ₁ ≤ μ₂ ≤ ··· ≤ μₖ

What is the MLE of the vector μ = (μ₁, ..., μₖ)?

Isotonic Regression (cont.)

The Lagrangian function is

    Σ_{j=1}^{k} wⱼ (vⱼ − μⱼ)² − Σ_{j=1}^{k−1} λⱼ (μⱼ₊₁ − μⱼ)

Isotonic Regression (cont.)

The Kuhn-Tucker conditions are

(a) 2wⱼ(vⱼ − μⱼ) − λⱼ + λⱼ₋₁ = 0, j = 1, ..., k,
(b) μᵢ ≤ μⱼ, 1 ≤ i ≤ j ≤ k,
(c) λⱼ ≥ 0, j = 1, ..., k, and
(d) λⱼ(μⱼ₊₁ − μⱼ) = 0, j = 1, ..., k − 1,

where to make (a) simple we define λ₀ = λₖ = 0.

Isotonic Regression (cont.)

The solution must be a step function: there are blocks of equal μ's. Let j_l, l = 1, ..., m, be the starting indices of the blocks, so

    μ_{j_l − 1} < μ_{j_l} = ··· = μ_{j_{l+1} − 1} < μ_{j_{l+1}}

where to make this simple we define μ₀ = −∞ and μₖ₊₁ = +∞. Complementary slackness implies λ_{j_l − 1} = λ_{j_{l+1} − 1} = 0. Hence

    0 = Σ_{j=j_l}^{j_{l+1}−1} [ 2wⱼ(vⱼ − μⱼ) − λⱼ + λⱼ₋₁ ] = 2 Σ_{j=j_l}^{j_{l+1}−1} wⱼ(vⱼ − μ̄)

where μ̄ = μ_{j_l} = ··· = μ_{j_{l+1} − 1}.

Isotonic Regression (cont.)

Hence

    μ̄ = ( Σ_{j=j_l}^{j_{l+1}−1} wⱼ vⱼ ) / ( Σ_{j=j_l}^{j_{l+1}−1} wⱼ )

is the value for this block that satisfies the first and fourth Kuhn-Tucker conditions. The same goes for every block. The estimate for the block is the average of the data values for the block.

So that gives us an algorithm: try all possible partitions into blocks. There is exactly one such partition that satisfies primal feasibility (makes an increasing function), and that is the solution.

Isotonic Regression (cont.)

The pool adjacent violators algorithm (PAVA) is much more efficient.

1. Initialize: set μ = v. Every point is a block by itself.
2. Test: if μ satisfies primal feasibility, stop; we have the solution.
3. Pool: since the test fails, there exist adjacent blocks that violate primal feasibility. Pool them, setting the estimate for the new block to be the average of all the data for it.
4. Go to 2.
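R provides isotonic regression in, e.g., the stats function isoreg; as a self-contained illustration, here is a Python sketch of PAVA. It represents each block by its (weighted sum, total weight, count) and pools adjacent violators as in step 3, so the block estimate is always the weighted average from the previous slide.

```python
def pava(v, w=None):
    """Pool adjacent violators: weighted isotonic (nondecreasing) fit to v."""
    if w is None:
        w = [1.0] * len(v)
    # Each block stores [weighted sum, total weight, number of points pooled].
    blocks = []
    for vj, wj in zip(v, w):
        blocks.append([wj * vj, wj, 1])
        # Pool while the last two block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, t, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += t
            blocks[-1][2] += n
    # Expand blocks back to one fitted value per observation.
    out = []
    for s, t, n in blocks:
        out.extend([s / t] * n)
    return out

fit = pava([1.0, 3.0, 2.0, 4.0])
```

This single left-to-right pass with pooling is equivalent to the loop above: a pooled block can only need further pooling with the block to its left, which the inner while loop handles.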

Exponential Families

A one-parameter exponential family has log likelihood and derivatives

    l(θ) = xθ − c(θ)
    l′(θ) = x − c′(θ)
    l″(θ) = −c″(θ)

The differentiation-under-the-integral-sign identities

    E_θ{l′(θ)} = 0
    var_θ{l′(θ)} = −E_θ{l″(θ)}

prove

    E_θ(X) = c′(θ)
    var_θ(X) = c″(θ)

Exponential Families (cont.)

From this it follows that the mean value parameter

    μ = E_θ(X) = c′(θ)

is a strictly increasing function of the natural parameter θ, because the derivative of the map c′ : θ ↦ μ is c″, which is positive, being a variance. Hence if the θᵢ are the natural parameters and the μᵢ the mean value parameters, we have θᵢ ≤ θⱼ if and only if μᵢ ≤ μⱼ.

Isotonic Regression and Exponential Families

Now we assume the Yᵢ all have distributions in the same one-parameter exponential family, and the θᵢ are the natural parameters. We do isotonic regression: maximum likelihood subject to

    θ₁ ≤ θ₂ ≤ ··· ≤ θₙ

The Lagrangian is

    Σ_{i=1}^{n} [ yᵢθᵢ − c(θᵢ) ] + Σ_{i=1}^{n−1} λᵢ (θᵢ₊₁ − θᵢ)

The first Kuhn-Tucker condition is

    0 = yⱼ − c′(θⱼ) − λⱼ + λⱼ₋₁ = yⱼ − μⱼ − λⱼ + λⱼ₋₁

Isotonic Regression and Exponential Families (cont.)

The rest of the Kuhn-Tucker conditions are the same as when we assumed the response was normal. When we write μⱼ = c′(θⱼ), as on the preceding slide, the Kuhn-Tucker conditions only involve the mean value parameters and are the same regardless of the exponential family involved. Hence the same algorithm, PAVA, solves all exponential family isotonic regression problems!