Optimization and Root Finding. Kurt Hornik

Basics. Root finding and unconstrained smooth optimization are closely related: solving f(x) = 0 can be accomplished via minimizing ‖f(x)‖². Unconstrained optima of f must be critical points, i.e., solve ∇f(x) = 0. (Note: f is scalar for optimization and typically vector-valued for root finding.) Linear equations and linear least squares problems can be solved exactly using techniques from numerical linear algebra. Otherwise: solve approximately, as limits of iterations x_{k+1} = g(x_k). Slide 2

Fixed Point Iteration. Consider the iteration x_{k+1} = g(x_k) with limit x* and g smooth. Then x* = g(x*) and
x_{k+1} = g(x_k) ≈ g(x*) + J_g(x*)(x_k − x*) = x* + J_g(x*)(x_k − x*),
where J_g(x) = [∂g_i(x)/∂x_j] is the Jacobian of g at x (sometimes also written as (∇g)(x)). Thus: x_{k+1} − x* ≈ J_g(x*)(x_k − x*). Slide 3

Fixed Point Iteration. In general, for local convergence it is necessary and sufficient that ρ(J_g(x*)) < 1, where ρ(A) = max{|λ| : λ is an eigenvalue of A} is the spectral radius of A. In this case, we get (at least) linear (local) convergence, i.e., ‖x_{k+1} − x*‖ ≤ C ‖x_k − x*‖ for some 0 < C < 1 and k sufficiently large. Slide 4
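To make the contraction condition concrete, here is a minimal sketch (Python purely for illustration; the helper name `fixed_point` is ours, not from the slides). For g(x) = cos(x), |g'(x*)| = |sin(x*)| ≈ 0.674 < 1, so the iteration converges linearly to the unique fixed point:

```python
import math

def fixed_point(g, x0, tol=1e-12, maxit=1000):
    """Iterate x_{k+1} = g(x_k) until successive iterates agree to tol."""
    x = x0
    for _ in range(maxit):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# g(x) = cos(x): "spectral radius" |g'(x*)| = |sin(x*)| ~ 0.674 < 1,
# so the iteration converges linearly with C ~ 0.674.
x_star = fixed_point(math.cos, 1.0)
```

With C ≈ 0.674 each iteration gains roughly 0.17 decimal digits, which is why some 70 iterations are needed for 12-digit agreement; this is the "linear convergence" of the slide.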

Outline Root Finding Unconstrained Optimization Constrained Optimization Optimization with R Slide 5

Newton s Method. Suppose we have an approximation x_k to x* and that in a neighborhood of x_k, the linearization L_k(x) = f(x_k) + J_f(x_k)(x − x_k) is a good approximation to f. An obvious candidate for the next approximation is obtained by solving L_k(x) = 0, i.e.,
x_{k+1} = x_k − J_f(x_k)^{-1} f(x_k) = g(x_k), g(x) = x − J_f(x)^{-1} f(x).
This is the mathematical form; the computational one is: solve J_f(x_k) s_k = −f(x_k) for s_k; x_{k+1} = x_k + s_k. Slide 6
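In one dimension the computational form reduces to x_{k+1} = x_k − f(x_k)/f'(x_k). A small sketch (Python for illustration only; `newton_scalar` is a hypothetical helper name):

```python
def newton_scalar(f, fprime, x0, tol=1e-12, maxit=50):
    """Scalar Newton iteration: s_k = -f(x_k)/f'(x_k), x_{k+1} = x_k + s_k."""
    x = x0
    for _ in range(maxit):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Solving x^2 - 2 = 0 from x_0 = 1; convergence is quadratic near the root.
r = newton_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

The quadratic convergence is visible in the iterates 1, 1.5, 1.41667, 1.4142157, ...: the number of correct digits roughly doubles per step.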

Newton s Method. The iteration function g has components and partials
g_i(x) = x_i − Σ_l [J_f(x)^{-1}]_{il} f_l(x),
∂g_i(x)/∂x_j = δ_{ij} − Σ_l (∂[J_f(x)^{-1}]_{il}/∂x_j) f_l(x) − Σ_l [J_f(x)^{-1}]_{il} ∂f_l(x)/∂x_j.
As f(x*) = 0,
∂g_i(x*)/∂x_j = δ_{ij} − [J_f(x*)^{-1} J_f(x*)]_{ij} = 0,
so that J_g(x*) vanishes! Thus, local convergence is super-linear (and can in fact be shown to be at least quadratic) in the sense that
‖x_{k+1} − x*‖ ≤ α_k ‖x_k − x*‖, lim_{k→∞} α_k = 0. Slide 7

Newton s Method: Discussion. Due to super-linear convergence, works very nicely once we get close enough to a root where the above approximations apply. However (with n the dimension of the problem): in each step, O(n²) derivatives need to be computed (exactly or numerically), which can be costly (in particular if function and/or derivative evaluations are costly); in each step, an n × n linear system needs to be solved, which takes O(n³) operations. For super-linear convergence the exact Jacobian is not needed. Slide 8

Super-Linear Convergence. Attractive because faster than linear. Also,
‖x_{k+1} − x_k‖ ≥ ‖x_k − x*‖ − ‖x_{k+1} − x*‖,
so that with super-linear convergence,
‖x_{k+1} − x_k‖ / ‖x_k − x*‖ ≥ 1 − ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 1 − α_k → 1.
I.e., the relative change in the k-th update is a good approximation for the relative error in the approximation of x* by x_k (which is commonly used for stopping criteria). Slide 9

Super-Linear Convergence. One can show: for sequences x_{k+1} = x_k − B_k^{-1} f(x_k) with suitable non-singular matrices B_k, one has super-linear (local) convergence to x* with f(x*) = 0 if and only if
lim_{k→∞} ‖(B_k − J_f(x*))(x_{k+1} − x_k)‖ / ‖x_{k+1} − x_k‖ = 0,
i.e., if B_k converges to J_f(x*) along the directions s_k = x_{k+1} − x_k of the iterative method. Suggests investigating iterative quasi-Newton methods with such B_k which are less costly to compute and/or invert. Slide 10

Broyden s Method. Note that Newton s method is based on the approximation f(x_{k+1}) ≈ f(x_k) + J_f(x_k)(x_{k+1} − x_k). Suggests considering approximations B_{k+1} which exactly satisfy the secant equation f(x_{k+1}) = f(x_k) + B_{k+1}(x_{k+1} − x_k). Of all such B_{k+1}, the one with the least change to B_k seems particularly attractive. I.e., solutions to
‖B − B_k‖_F → min, y_k = f(x_{k+1}) − f(x_k) = B s_k
(where ‖·‖_F is the Frobenius norm). Slide 11

Broyden s Method. We have
‖B − B_k‖_F² = ‖vec(B) − vec(B_k)‖², vec(B s_k − y_k) = (s_k^T ⊗ I) vec(B) − y_k.
Thus, writing b = vec(B), b_k = vec(B_k), A_k = s_k^T ⊗ I, the optimization problem is
‖b − b_k‖² → min, A_k b = y_k.
A convex quadratic optimization problem with linear constraints: can solve using geometric ideas (orthogonal projections onto affine subspaces) or using Lagrange s method. Slide 12

Broyden s Method. Lagrangian: L(b, λ) = ½ ‖b − b_k‖² + λ^T (A_k b − y_k). For a critical point:
∇_b L = b − b_k + A_k^T λ = 0, ∇_λ L = A_k b − y_k = 0.
Thus b = b_k − A_k^T λ with y_k = A_k b = A_k (b_k − A_k^T λ), so
λ = (A_k A_k^T)^{-1} (A_k b_k − y_k) and b = b_k − A_k^T (A_k A_k^T)^{-1} (A_k b_k − y_k). Slide 13

Broyden s Method. As A_k A_k^T = (s_k^T ⊗ I)(s_k ⊗ I) = (s_k^T s_k) I and A_k b_k = (s_k^T ⊗ I) vec(B_k) = B_k s_k,
A_k^T (A_k A_k^T)^{-1} (A_k b_k − y_k) = (1/(s_k^T s_k)) (s_k ⊗ I)(B_k s_k − y_k) = (1/(s_k^T s_k)) vec((B_k s_k − y_k) s_k^T)
and hence
vec(B) = b = vec(B_k) − vec((B_k s_k − y_k) s_k^T / (s_k^T s_k))
and thus
B = B_k + (y_k − B_k s_k) s_k^T / (s_k^T s_k). Slide 14

Broyden s Method. Based on a suitable guess x_0 and a suitable full rank approximation B_0 for the Jacobian (e.g., I), iterate using:
Solve B_k s_k = −f(x_k) for s_k;
x_{k+1} = x_k + s_k;
y_k = f(x_{k+1}) − f(x_k);
B_{k+1} = B_k + (y_k − B_k s_k) s_k^T / (s_k^T s_k).
Again, this is the mathematical form. Computationally, we can improve! Note that the iteration for B_k performs rank-one updates of the form B_{k+1} = B_k + u_k v_k^T, and all we need is the inverse H_k = B_k^{-1}! Slide 15

Broyden s Method. Remember the Sherman-Morrison formula for the inverse of a rank-one update:
(A + uv^T)^{-1} = A^{-1} − (1/σ) A^{-1} u v^T A^{-1}, σ = 1 + v^T A^{-1} u ≠ 0.
Thus,
H_{k+1} = B_{k+1}^{-1} = (B_k + u_k v_k^T)^{-1} = H_k − (1/(1 + v_k^T H_k u_k)) H_k u_k v_k^T H_k
with u_k = y_k − B_k s_k and v_k = s_k / (s_k^T s_k), so that
H_k u_k = H_k y_k − s_k
and
1 + v_k^T H_k u_k = 1 + s_k^T (H_k y_k − s_k) / (s_k^T s_k) = s_k^T H_k y_k / (s_k^T s_k). Slide 16

Broyden s Method. Computational form: based on a suitable guess x_0 and a suitable full rank approximation H_0 for the inverse of the Jacobian (e.g., I), iterate using:
s_k = −H_k f(x_k);
x_{k+1} = x_k + s_k;
y_k = f(x_{k+1}) − f(x_k);
H_{k+1} = H_k − (H_k y_k − s_k) s_k^T H_k / (s_k^T H_k y_k)
(of course this needs s_k^T H_k y_k ≠ 0). Slide 17
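The computational form fits in a few lines (Python for illustration; the slides' own tooling for this is R's nleqslv). Broyden's method is only locally convergent, so both the start x_0 and the initial H_0 below are deliberately chosen close to the truth; the test system and these choices are ours, not the slides':

```python
import numpy as np

def broyden(f, x0, H0, tol=1e-10, maxit=100):
    """Broyden's method, computational form: maintain H_k ~ J_f(x_k)^{-1}."""
    x = np.asarray(x0, dtype=float)
    H = np.asarray(H0, dtype=float)
    fx = f(x)
    for _ in range(maxit):
        if np.linalg.norm(fx) < tol:
            break
        s = -H @ fx                      # s_k = -H_k f(x_k)
        x_new = x + s
        f_new = f(x_new)
        y = f_new - fx                   # y_k = f(x_{k+1}) - f(x_k)
        Hy = H @ y
        # H_{k+1} = H_k - (H_k y_k - s_k) s_k' H_k / (s_k' H_k y_k)
        H = H - np.outer(Hy - s, s @ H) / (s @ Hy)
        x, fx = x_new, f_new
    return x

# Decoupled test system f(x) = (x_1^2 - 2, x_2^2 - 3), root (sqrt 2, sqrt 3);
# H0 is a crude diagonal guess at the inverse Jacobian near the root.
f = lambda x: np.array([x[0] ** 2 - 2.0, x[1] ** 2 - 3.0])
sol = broyden(f, [1.4, 1.7], np.diag([1.0 / 2.8, 1.0 / 3.4]))
```

Note that only matrix-vector products and one rank-one correction appear per step: no O(n³) linear solve, which is the whole point of tracking H_k instead of B_k.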

Spectral Methods. Used for very large n. E.g., Barzilai-Borwein methods. One variant (DF-SANE) is based on the idea of using very simple B_k or H_k proportional to the identity matrix. However: when B_k = β_k I, the secant condition B_{k+1} s_k = y_k typically cannot be achieved exactly. Instead, determine β_{k+1} to solve
‖y_k − B_{k+1} s_k‖² = ‖y_k − β_{k+1} s_k‖² → min
with solution β_{k+1} = y_k^T s_k / (s_k^T s_k). Gives the iteration
x_{k+1} = x_k − f(x_k)/β_k, β_{k+1} = y_k^T s_k / (s_k^T s_k),
or equivalently (with s_k and y_k as before),
x_{k+1} = x_k − α_k f(x_k), α_{k+1} = s_k^T s_k / (y_k^T s_k). Slide 18

Spectral Methods. Alternatively, when using H_k = α_k I, the exact secant condition would be H_{k+1} y_k = s_k, and one determines α_{k+1} to solve
‖s_k − H_{k+1} y_k‖² = ‖s_k − α_{k+1} y_k‖² → min
with solution α_{k+1} = s_k^T y_k / (y_k^T y_k). Gives the iteration
x_{k+1} = x_k − α_k f(x_k), α_{k+1} = s_k^T y_k / (y_k^T y_k). Slide 19
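A sketch of this second spectral iteration (Python for illustration; the example matrix is ours). Applied to the linear residual f(x) = Ax − b with A symmetric positive definite, it is exactly a Barzilai-Borwein gradient iteration for the quadratic x'Ax/2 − b'x, for which convergence is known:

```python
import numpy as np

def spectral_solve(f, x0, alpha0=1.0, tol=1e-10, maxit=500):
    """x_{k+1} = x_k - alpha_k f(x_k), alpha_{k+1} = s_k'y_k / y_k'y_k."""
    x = np.asarray(x0, dtype=float)
    alpha = alpha0
    fx = f(x)
    for _ in range(maxit):
        if np.linalg.norm(fx) < tol:
            break
        s = -alpha * fx
        x_new = x + s
        f_new = f(x_new)
        y = f_new - fx
        alpha = (s @ y) / (y @ y)        # spectral step from the secant fit
        x, fx = x_new, f_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
sol = spectral_solve(lambda x: A @ x - b, [0.0, 0.0])
```

Nothing but vectors is stored, which is what makes these methods attractive for very large n.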

Root Finding with R. Methods are only available in add-on packages. Package nleqslv has function nleqslv() providing the Newton and Broyden methods with a variety of global strategies. Package BB has function BBsolve() providing Barzilai-Borwein solvers, and multistart() for starting solvers from multiple starting values. Slide 20

Examples. System
x_1² + x_2² = 2, e^{x_1 − 1} + x_2³ = 2.
Has one solution at x_1 = x_2 = 1. Slide 21
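Checking this numerically with a plain Newton iteration (Python for illustration; in the slides this system would be handed to nleqslv()). Pure Newton lacks the global strategies mentioned above, so the start must be reasonably close to the solution; the start used below is our choice:

```python
import numpy as np

def newton(f, J, x0, tol=1e-12, maxit=50):
    """Solve J_f(x_k) s_k = -f(x_k), set x_{k+1} = x_k + s_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        x = x + np.linalg.solve(J(x), -fx)
    return x

# The system x_1^2 + x_2^2 = 2, exp(x_1 - 1) + x_2^3 = 2, written as f(x) = 0:
f = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 2.0,
                        np.exp(x[0] - 1.0) + x[1] ** 3 - 2.0])
J = lambda x: np.array([[2.0 * x[0],          2.0 * x[1]],
                        [np.exp(x[0] - 1.0),  3.0 * x[1] ** 2]])
sol = newton(f, J, [1.2, 0.8])
```

From starts far from (1, 1) the bare iteration can diverge, which is exactly why the R solvers above come with global strategies.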

Outline Root Finding Unconstrained Optimization Constrained Optimization Optimization with R Slide 22

Theory. Consider the unconstrained minimization problem with objective function f. If f is smooth:
f(x* + s) = f(x*) + ∇f(x*)^T s + ½ s^T H_f(x* + αs) s
for some 0 < α < 1, with H_f(x) = ∇²f(x) the Hessian of f at x. Hence:
x* local minimum ⇒ ∇f(x*) = 0 (x* is a critical point of f) and H_f(x*) non-negative definite (necessary first and second order conditions);
x* a critical point of f and H_f(x*) positive definite (sufficient second order condition) ⇒ x* local minimum. Slide 23

Steepest Descent. The linear approximation f(x + αs) ≈ f(x) + α ∇f(x)^T s suggests looking for a good α along the steepest descent direction s = −∇f(x). Typically performed as:
s_k = −∇f(x_k);
choose α_k to minimize f(x_k + α s_k);
x_{k+1} = x_k + α_k s_k
(method of steepest descent [with line search]). Slide 24
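For a strictly convex quadratic f(x) = x'Ax/2 − b'x the line search can be done exactly (α_k = s_k's_k / s_k'A s_k), which gives a compact sketch of the method (Python for illustration; the example data are ours):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, maxit=10000):
    """Steepest descent with exact line search for f(x) = x'Ax/2 - b'x, A SPD."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        s = b - A @ x                    # s_k = -grad f(x_k)
        if np.linalg.norm(s) < tol:
            break
        alpha = (s @ s) / (s @ (A @ s))  # exact minimizer of f(x_k + alpha s_k)
        x = x + alpha * s
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
xmin = steepest_descent(A, b, [0.0, 0.0])
```

Convergence is linear with a rate governed by the condition number of A, which is the classical motivation for the Newton and conjugate gradient methods that follow.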

Newton s Method. Use the local quadratic approximation
f(x + s) ≈ f(x) + ∇f(x)^T s + ½ s^T H_f(x) s
(equivalently, a linear approximation for ∇f) and minimize the RHS over s to obtain Newton s method
x_{k+1} = x_k − H_f(x_k)^{-1} ∇f(x_k)
with computational form: solve H_f(x_k) s_k = −∇f(x_k) for s_k; x_{k+1} = x_k + s_k. Slide 25

Quasi-Newton Methods. Similarly to the root-finding case, there are many modifications of the basic scheme, with the general form
x_{k+1} = x_k − α_k B_k^{-1} ∇f(x_k)
referred to as quasi-Newton methods: adding the α_k allows one to improve global performance by controlling the step size. (Note however that away from a critical point, the Newton direction s does not necessarily result in a local decrease of the objective function.) Slide 26

Quasi-Newton Methods. Similar to the root-finding case, one can look for simple updating rules for Hessian approximations B_k subject to the secant constraint
B_{k+1} s_k = y_k = ∇f(x_{k+1}) − ∇f(x_k),
additionally taking into account that the B_k should be symmetric (as approximations to the Hessians). A particularly simple symmetric update satisfying the secant equation (the unique rank-one such update) is
B_k + (y_k − B_k s_k)(y_k − B_k s_k)^T / ((y_k − B_k s_k)^T s_k):
the symmetric rank-one (SR1) update formula (Davidon, 1959). Slide 27

Quasi-Newton Methods. SR1 has numerical difficulties: one thus considers least-change problems with different norms, such as weighted Frobenius norms
‖B − B_k‖_{F,W} = ‖W^{1/2} (B − B_k) W^{1/2}‖_F.
One can show: for any (symmetric) W for which W y_k = s_k (such as the inverse of the average Hessian ∫₀¹ H_f(x_k + τ s_k) dτ featuring in Taylor s theorem), which makes the approximation problem non-dimensional,
‖B − B_k‖_{F,W} → min, B symmetric, y_k = B s_k
has the unique solution
(I − ρ_k y_k s_k^T) B_k (I − ρ_k s_k y_k^T) + ρ_k y_k y_k^T, ρ_k = 1/(y_k^T s_k).
This is the DFP (Davidon-Fletcher-Powell) update formula. Slide 28

Quasi-Newton Methods. Average Hessian: we have
∇f(x + s) − ∇f(x) = [∇f(x + τs)]_{τ=0}^{τ=1} = ∫₀¹ (d/dτ) ∇f(x + τs) dτ = (∫₀¹ H_f(x + τs) dτ) s
and hence (ignoring α_k for simplicity)
y_k = ∇f(x_k + s_k) − ∇f(x_k) = (∫₀¹ H_f(x_k + τ s_k) dτ) s_k. Slide 29

Quasi-Newton Methods. The most effective quasi-Newton formula is obtained by applying the corresponding approximation argument to H_k = B_k^{-1}, i.e., by using the unique solution to
‖H − H_k‖_{F,W} → min, H symmetric, s_k = H y_k,
with W any matrix satisfying W s_k = y_k (such as the average Hessian matrix), which is given by
(I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T, ρ_k = 1/(y_k^T s_k).
This is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update formula. Slide 30

Quasi-Newton Methods. Using the Sherman-Morrison-Woodbury formula
(A + UV^T)^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
and lengthy computations, one can show:
the BFGS update for H_k corresponds to B_{k+1} = B_k + y_k y_k^T / (y_k^T s_k) − B_k s_k s_k^T B_k / (s_k^T B_k s_k);
the DFP update for B_k corresponds to H_{k+1} = H_k + s_k s_k^T / (s_k^T y_k) − H_k y_k y_k^T H_k / (y_k^T H_k y_k).
Rank-two updates. (Note the symmetry.) Slide 31
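A sketch of a quasi-Newton iteration carrying the inverse update H_k (Python for illustration; real implementations use a Wolfe line search that guarantees y_k's_k > 0, the crude gradient-norm backtracking below is a simplification of ours):

```python
import numpy as np

def bfgs(grad, x0, tol=1e-8, maxit=200):
    """Quasi-Newton minimization with the inverse BFGS update of H_k."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    I = np.eye(n)
    H = np.eye(n)
    g = grad(x)
    for _ in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                       # quasi-Newton direction
        alpha = 1.0
        # crude backtracking: shrink until the gradient norm decreases
        while np.linalg.norm(grad(x + alpha * d)) >= np.linalg.norm(g) and alpha > 1e-10:
            alpha *= 0.5
        s = alpha * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        rho = 1.0 / (y @ s)
        # H_{k+1} = (I - rho s y') H_k (I - rho y s') + rho s s'
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Minimize (x_1 - 2)^2 + 2 (x_2 + 1)^2, minimum at (2, -1).
grad = lambda x: np.array([2.0 * (x[0] - 2.0), 4.0 * (x[1] + 1.0)])
xmin = bfgs(grad, [0.0, 0.0])
```

Only the gradient is evaluated; the curvature information accumulates in H_k through the rank-two updates.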

Conjugate Gradient Methods. An alternative to quasi-Newton methods. Have the form
x_{k+1} = x_k + α_k s_k, s_{k+1} = −∇f(x_{k+1}) + β_{k+1} s_k
where the β are chosen such that s_{k+1} and s_k are suitably conjugate. E.g., the Fletcher-Reeves variant uses
β_{k+1} = g_{k+1}^T g_{k+1} / (g_k^T g_k), g_{k+1} = ∇f(x_{k+1}).
Advantage: requires no matrix storage (uses only gradients) and is faster than steepest descent, hence suitable for large-scale optimization problems. Slide 32
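On a strictly convex quadratic with exact line search, conjugate directions terminate in n steps; a Fletcher-Reeves sketch (Python for illustration; the example data are ours):

```python
import numpy as np

def cg_fletcher_reeves(A, b, x0, tol=1e-10, maxit=100):
    """Fletcher-Reeves CG for f(x) = x'Ax/2 - b'x with exact line search."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                        # gradient of f
    s = -g
    for _ in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ s) / (s @ (A @ s))     # exact line search along s
        x = x + alpha * s
        g_new = A @ x - b
        beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves beta
        s = -g_new + beta * s
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
xcg = cg_fletcher_reeves(A, b, [2.0, 1.0])
```

For this 2-by-2 system the iteration reaches the minimizer after two line searches (up to rounding), illustrating the finite-termination property on quadratics.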

Gauss-Newton Method. Minimizing objective functions of the form
ϕ(x) = ½ Σ_i r_i(x)² = ‖r(x)‖²/2
is known (roughly) as a non-linear least squares problem, corresponding to minimizing the sum of squares of residuals r_i(x) = t_i − f(e_i, x) (in statistics, we usually write x and y for the inputs and targets, and θ instead of x for the unknown parameter). Clearly,
∇ϕ(x) = J_r(x)^T r(x), H_ϕ(x) = J_r(x)^T J_r(x) + Σ_i r_i(x) H_{r_i}(x). Slide 33

Gauss-Newton Method. The straightforward Newton method is typically replaced by the Gauss-Newton method
(J_r(x_k)^T J_r(x_k)) s_k = −J_r(x_k)^T r(x_k),
which drops the terms involving the H_{r_i} from the Hessian, based on the idea that at the minimum the residuals r_i should be small. This replaces a non-linear least squares problem by a sequence of linear least squares problems with the right limit. However, the simplification may lead to ill-conditioned or rank deficient linear problems: if so, the Levenberg-Marquardt method
(J_r(x_k)^T J_r(x_k) + μ_k I) s_k = −J_r(x_k)^T r(x_k)
with suitably chosen μ_k > 0 can be employed. Slide 34
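A Gauss-Newton sketch on a zero-residual fit (Python for illustration; the data and model are ours). Since the normal equations can be ill-conditioned, the linear subproblem ‖J_r s + r‖² → min is solved by QR-based least squares rather than by forming J_r'J_r explicitly:

```python
import numpy as np

def gauss_newton(r, Jr, theta0, tol=1e-10, maxit=100):
    """Gauss-Newton: each step solves the linear least squares problem
    ||Jr(theta) s + r(theta)||^2 -> min, then updates theta <- theta + s."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(maxit):
        s, *_ = np.linalg.lstsq(Jr(theta), -r(theta), rcond=None)
        theta = theta + s
        if np.linalg.norm(s) < tol:
            break
    return theta

# Zero-residual example: data from y = 2 exp(-x), model theta_1 exp(theta_2 x).
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-x)
r = lambda th: th[0] * np.exp(th[1] * x) - y
Jr = lambda th: np.column_stack([np.exp(th[1] * x),
                                 th[0] * x * np.exp(th[1] * x)])
theta = gauss_newton(r, Jr, [1.0, 0.0])
```

Because the residuals vanish at the solution, the dropped Σ r_i H_{r_i} term really is negligible here, and Gauss-Newton behaves essentially like full Newton.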

Outline Root Finding Unconstrained Optimization Constrained Optimization Optimization with R Slide 35

Theory. Local optima are relative to neighborhoods in the feasible set X. A suitably interior point x* in X is a local minimum iff t ↦ f(x(t)) has a local minimum at t = 0 for all smooth curves t ↦ x(t) in X defined on some open interval containing 0 with x(0) = x*. (Alternatively, if J_g(x*) has full rank, use the implicit mapping theorem to obtain a local parametrization x = x(s) of the feasible set with, say, x(0) = x*, and consider the conditions for f(x(s)) to have an unconstrained local minimum at s = 0.) Slide 36

Theory. We have
df(x(t))/dt = Σ_j (∂f/∂x_j)(x(t)) ẋ_j(t),
d²f(x(t))/dt² = Σ_{j,k} (∂²f/∂x_j∂x_k)(x(t)) ẋ_j(t) ẋ_k(t) + Σ_j (∂f/∂x_j)(x(t)) ẍ_j(t).
From the first: for a local minimum, necessarily ⟨∇f(x*), s⟩ = 0 for all feasible directions s at x*. Slide 37

Theory. Suppose that X is defined by equality constraints g(x) = [g_1(x), ..., g_m(x)]^T = 0. Then g_i(x(t)) ≡ 0 and hence
Σ_j (∂g_i/∂x_j)(x(t)) ẋ_j(t) = 0.
Hence: the set of feasible directions at x* is given by Null(J_g(x*)). The necessary first order condition translates into
∇f(x*) ∈ (Null(J_g(x*)))^⊥ = Range(J_g(x*)^T),
i.e., there exist Lagrange multipliers λ* such that ∇f(x*) = −J_g(x*)^T λ*. Slide 38

Theory. Differentiating once more,
Σ_{j,k} (∂²g_i/∂x_j∂x_k)(x(t)) ẋ_j(t) ẋ_k(t) + Σ_j (∂g_i/∂x_j)(x(t)) ẍ_j(t) = 0.
Multiply by the λ*_i from above, sum over i, and add to the expression for d²f(x(t))/dt² at t = 0:
d²f(x(t))/dt² |_{t=0} = Σ_{j,k} (∂²f/∂x_j∂x_k (x*) + Σ_i λ*_i ∂²g_i/∂x_j∂x_k (x*)) s_j s_k + Σ_j (∂f/∂x_j (x*) + Σ_i λ*_i ∂g_i/∂x_j (x*)) ẍ_j(0)
= s^T (H_f(x*) + Σ_i λ*_i H_{g_i}(x*)) s = s^T B(x*, λ*) s
(the first-order term vanishes by the first order condition). Slide 39

Theory. For a local minimum, we must thus also have s^T B(x*, λ*) s ≥ 0 for all feasible directions s at x*, or equivalently, for all s ∈ Null(J_g(x*)). Let Z be a basis for Null(J_g(x*)). Then:
x* constrained local minimum ⇒ ∇f(x*) = −J_g(x*)^T λ* and the projected (reduced) Hessian Z^T B(x*, λ*) Z non-negative definite;
∇f(x*) = −J_g(x*)^T λ* and Z^T B(x*, λ*) Z positive definite ⇒ x* constrained local minimum. Slide 40

Theory. The Lagrangian L(x, λ) = f(x) + λ^T g(x) has
∇L(x, λ) = [∇f(x) + J_g(x)^T λ; g(x)], H_L(x, λ) = [B(x, λ), J_g(x)^T; J_g(x), O].
The above first-order conditions conveniently translate into (x*, λ*) being a critical point of L. However, it is a saddle point: H_L cannot be positive definite. Need to check Z^T B(x*, λ*) Z as discussed above. If there are also inequality constraints h_k(x) ≤ 0: the corresponding λ_k ≥ 0 and h_k(x*) λ_k = 0 (the k with λ_k > 0 give the active set, where h_k(x*) = 0). Slide 41

Sequential Quadratic Programming. Consider the minimization problem with equality constraints. Suppose we use Newton s method to determine the critical points of the Lagrange function, i.e., to solve
∇L(x, λ) = [∇f(x) + J_g(x)^T λ; g(x)] = 0.
Gives the system
[B(x_k, λ_k), J_g(x_k)^T; J_g(x_k), O] [s_k; δ_k] = −[∇f(x_k) + J_g(x_k)^T λ_k; g(x_k)],
which are the first-order conditions for the constrained optimization problem
min_s ½ s^T B(x_k, λ_k) s + s^T (∇f(x_k) + J_g(x_k)^T λ_k), J_g(x_k) s + g(x_k) = 0. Slide 42

Sequential Quadratic Programming. Why? Dropping arguments, we want to solve
½ s^T B s + s^T u → min, J s + c = 0.
The Lagrangian is L(s, δ) = ½ s^T B s + s^T u + δ^T (J s + c), for which the first-order conditions are
B s + u + J^T δ = 0, J s + c = 0.
Equivalently,
[B, J^T; J, O] [s; δ] = −[u; c]. Slide 43
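One such QP step is just a saddle-point linear solve. A sketch (Python for illustration; the tiny problem, min s'Bs/2 subject to s_1 + s_2 = 1, is ours):

```python
import numpy as np

def kkt_step(B, J, u, c):
    """Solve min_s s'Bs/2 + u's subject to Js + c = 0 via the KKT system
    [[B, J'], [J, 0]] [s; delta] = -[u; c]."""
    n, m = B.shape[0], J.shape[0]
    K = np.block([[B, J.T], [J, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([u, c]))
    return sol[:n], sol[n:]              # step s and multiplier delta

B = np.array([[2.0, 0.0], [0.0, 2.0]])
J = np.array([[1.0, 1.0]])
u = np.zeros(2)
c = np.array([-1.0])                     # constraint s_1 + s_2 - 1 = 0
s, delta = kkt_step(B, J, u, c)
```

By symmetry the minimizer is s = (1/2, 1/2) with multiplier δ = −1, which the KKT solve reproduces; in SQP this solve is repeated at each outer iterate (x_k, λ_k).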

Sequential Quadratic Programming. I.e., when using Newton s method to determine the critical points of the Lagrange function, each Newton step amounts to solving a quadratic optimization problem under linear constraints: hence, the method is known as Sequential Quadratic Programming. One does not necessarily solve the linear system directly (remember that the Hessian of the Lagrangian is symmetric but indefinite, so Cholesky factorization is not possible). Slide 44

Penalty Methods. Idea: convert the constrained problem into a sequence of unconstrained problems featuring a penalty function to achieve (approximate) feasibility. E.g., for the equality-constrained problem
f(x) → min, g(x) = 0,
one considers
ϕ_ρ(x) = f(x) + ρ ‖g(x)‖²/2 → min.
Under appropriate conditions it can be shown that the solutions x*(ρ) converge to the solution of the original problem as ρ → ∞. Slide 45

Augmented Lagrangian Methods. For the penalty method, letting ρ increase without bound often leads to problems, which motivates instead using an augmented Lagrangian function
L_ρ(x, λ) = f(x) + λ^T g(x) + ρ ‖g(x)‖²/2
for which the Lagrange multiplier estimates can be manipulated in ways that keep them bounded while converging to the constrained optimum: augmented Lagrangian methods. Illustration where a fixed ρ > 0 suffices: convex minimization with linear constraints. Slide 46

Augmented Lagrangian Methods. Primal problem
f(x) → min, Ax = b,
with f convex. Duality theory: with Lagrangian L(x, λ) = f(x) + λ^T (Ax − b), the dual function is
inf_x L(x, λ) = −f*(−A^T λ) − b^T λ
(where f* is the convex conjugate), and the optimality conditions are
Ax − b = 0, ∇f(x) + A^T λ = 0
(primal and dual feasibility). Slide 47

Augmented Lagrangian Methods. Consider the augmented Lagrangian
L_ρ(x, λ) = f(x) + λ^T (Ax − b) + (ρ/2) ‖Ax − b‖².
Can be viewed as the unaugmented Lagrangian for
f(x) + (ρ/2) ‖Ax − b‖² → min, Ax = b.
Consider the iteration
x_{k+1} = argmin_x L_ρ(x, λ_k), λ_{k+1} = λ_k + ρ (A x_{k+1} − b). Slide 48

Augmented Lagrangian Methods. By definition, x_{k+1} minimizes L_ρ(x, λ_k), hence
0 = ∇_x L_ρ(x_{k+1}, λ_k) = ∇f(x_{k+1}) + A^T (λ_k + ρ(A x_{k+1} − b)) = ∇f(x_{k+1}) + A^T λ_{k+1}.
Thus, the iteration ensures dual feasibility, and convergence implies primal feasibility, hence altogether optimality. The general case uses the same idea, but may need increasing ρ = ρ_k. (For additional inequality constraints, add slack variables and use bound constraints.) Slide 49
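For a quadratic objective the inner argmin is available in closed form, so the whole iteration fits in a few lines (Python for illustration; the problem min ‖x‖²/2 subject to x_1 + x_2 = 1 is ours):

```python
import numpy as np

def augmented_lagrangian(A, b, rho=10.0, iters=50):
    """min ||x||^2/2 s.t. Ax = b. The inner minimizer of L_rho(., lam)
    solves (I + rho A'A) x = A'(rho b - lam); then lam <- lam + rho (Ax - b)."""
    m, n = A.shape
    lam = np.zeros(m)
    x = np.zeros(n)
    M = np.eye(n) + rho * A.T @ A
    for _ in range(iters):
        x = np.linalg.solve(M, A.T @ (rho * b - lam))
        lam = lam + rho * (A @ x - b)    # multiplier update
    return x, lam

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, lam = augmented_lagrangian(A, b)
```

The iterates converge to x = (1/2, 1/2) and λ = −1/2 with ρ held fixed at 10: no unbounded penalty parameter is needed, which is exactly the advantage over the pure penalty method.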

Barrier Methods. Also known as interior point methods. Use barrier functions which ensure feasibility and increasingly penalize feasible points as they approach the boundary of the feasible set. E.g., with constraints h(x) ≤ 0,
ϕ_μ(x) = f(x) − μ Σ_i 1/h_i(x), ϕ_μ(x) = f(x) − μ Σ_i log(−h_i(x))
are the inverse and logarithmic barrier functions, respectively. Idea: for μ > 0, the unconstrained minimum x*_μ of ϕ_μ is necessarily feasible; as μ → 0, x*_μ → x* (the constrained minimum). Need feasible starting values and must keep all approximations in the interior of the feasible set, but allow convergence to points on the boundary. Slide 50
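A one-dimensional sketch of the barrier idea (Python for illustration; the toy problem, minimize (x − 2)² subject to x − 1 ≤ 0, is ours). Newton's method is applied to φ_μ, with the step damped so the iterates stay strictly feasible; the minimizers x_μ approach the constrained minimum x* = 1 as μ → 0:

```python
def barrier_min(mu, x=0.0, iters=100):
    """Newton's method on phi_mu(x) = (x - 2)^2 - mu*log(1 - x)."""
    for _ in range(iters):
        g = 2.0 * (x - 2.0) + mu / (1.0 - x)   # phi_mu'
        h = 2.0 + mu / (1.0 - x) ** 2          # phi_mu'' > 0: phi_mu is convex
        step = -g / h
        while x + step >= 1.0:                 # stay strictly inside x < 1
            step *= 0.5
        x = x + step
    return x

# Barrier minimizers for decreasing mu approach the boundary solution x* = 1.
xs = [barrier_min(mu) for mu in (1.0, 0.1, 0.001)]
```

From φ_μ'(x_μ) = 0 one gets 1 − x_μ ≈ μ/2 for small μ, so x_μ traces a path inside the feasible region ending at the constrained optimum on the boundary.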

Outline Root Finding Unconstrained Optimization Constrained Optimization Optimization with R Slide 51

Optimization with R: Base Packages. nlm() does a Newton-type algorithm for unconstrained minimization. optim() provides several methods, including the Nelder-Mead simplex algorithm (default: derivative-free but slow, not discussed above), quasi-Newton (BFGS), conjugate gradients, as well as simulated annealing (a stochastic global optimization method, not discussed above). The quasi-Newton method can also handle box constraints of the form l ≤ x ≤ r (where l and r may be −∞ and ∞, respectively), and is then known as L-BFGS-B. nlminb(), now documented to be for historical compatibility, provides code from the PORT library by Gay for unconstrained or box-constrained optimization using trust region methods. Slide 52

Optimization with R: Base Packages. constrOptim() performs minimization under linear inequality constraints using an adaptive (logarithmic) barrier algorithm in combination with the Nelder-Mead or BFGS minimizers. Note that this is an interior point method, so there must be interior points! I.e., linear equality constraints cannot be handled. Slide 53

Optimization with R: Contributed Packages. Several add-on packages perform linear programming. Package quadprog performs quadratic programming (the strictly convex case, with linear equality and/or inequality constraints). Package BB provides BBoptim(), which itself wraps around spg(), which implements the spectral projected gradient method for large-scale optimization with convex constraints. Needs a function for projecting onto the feasible set. Package nloptr provides an interface to NLopt (http://ab-initio.mit.edu/nlopt), a general purpose open source library for non-linear unconstrained and constrained optimization. One needs to consult the NLopt web page to learn about the available methods (see http://ab-initio.mit.edu/wiki/index.php/nlopt_algorithms). Slide 54

Optimization with R: Contributed Packages Package alabama provides auglag(), an Augmented Lagrangian routine for smooth constrained optimization, which conveniently does not require feasible starting values. Slide 55

Example. Rosenbrock banana function
f(x) = 100 (x_2 − x_1²)² + (1 − x_1)².
Clearly has a global minimum at x* = (1, 1). Multi-dimensional generalization (e.g., http://en.wikipedia.org/wiki/rosenbrock_function): extended Rosenbrock function
Σ_{i=1}^{n} (100 (x_{2i} − x_{2i−1}²)² + (1 − x_{2i−1})²)
(a sum of n uncoupled 2-d Rosenbrock functions). Slide 56
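As a numerical check (Python for illustration, using the analytic gradient and Hessian): a pure Newton iteration, which needs a start near the minimum since no step-size control is used; the start below, on the valley floor x_2 = x_1², is our choice:

```python
import numpy as np

def rosen_grad(x):
    """Gradient of f(x) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2."""
    return np.array([-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
                     200.0 * (x[1] - x[0] ** 2)])

def rosen_hess(x):
    """Hessian of the banana function."""
    return np.array([[-400.0 * (x[1] - 3.0 * x[0] ** 2) + 2.0, -400.0 * x[0]],
                     [-400.0 * x[0], 200.0]])

def newton_min(x0, tol=1e-10, maxit=100):
    """Pure Newton: solve H_f(x_k) s_k = -grad f(x_k), x_{k+1} = x_k + s_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        g = rosen_grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(rosen_hess(x), -g)
    return x

xmin = newton_min([0.9, 0.81])
```

Away from the curved valley the Hessian can be indefinite and the bare iteration fails, which is why the function is a standard stress test for the globalized methods (line searches, trust regions) discussed above.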