Interior-Point and Augmented Lagrangian Algorithms for Optimization and Control

Stephen Wright, University of Wisconsin-Madison, May 2014

In This Section...

My first talk was about optimization formulations, optimality conditions, and duality for LP, QP, LCP, and nonlinear optimization. This section will review some algorithms, in particular:

- Primal-dual interior-point (PDIP) methods
- Augmented Lagrangian (AL) methods

Both are useful in control applications. We'll say something about PDIP methods for model-predictive control, and how they exploit the structure in that problem.

Recapping Gradient Methods

Consider unconstrained minimization $\min_x f(x)$, where $f$ is smooth and convex, or the constrained version in which $x$ is restricted to the set $\Omega$, usually closed and convex. First-order or gradient methods take steps of the form
$$x_{k+1} = x_k - \alpha_k g_k,$$
where $\alpha_k > 0$ is a steplength and $g_k$ is a search direction, constructed from knowledge of the gradient $\nabla f(x)$ at the current iterate $x_k$ and possibly previous iterates $x_{k-1}, x_{k-2}, \ldots$. Can extend to nonsmooth $f$ by using the subgradient $\partial f(x)$. Extend to constrained minimization by projecting the search line onto the convex set $\Omega$, or (similarly) minimizing a linear approximation to $f$ over $\Omega$.
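To make the step formula concrete, here is a minimal sketch of the basic gradient method with a fixed steplength (the callable grad_f and the tiny test problem are illustrative; practical codes choose the steplength adaptively):

```python
import numpy as np

def gradient_method(grad_f, x0, alpha=0.1, iters=100):
    """Basic first-order method: x_{k+1} = x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = 0.5*||x||^2 has gradient x; the minimizer is the origin.
x_star = gradient_method(lambda x: x, x0=[3.0, -2.0])
```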

Prox Interpretation of Line Search

Can view the gradient method step $x_{k+1} = x_k - \alpha_k g_k$ as the minimization of a first-order model of $f$ plus a prox-term which prevents the step from being too long:
$$x_{k+1} = \arg\min_x f(x_k) + g_k^T (x - x_k) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2.$$
Taking the gradient of the quadratic and setting it to zero, we obtain
$$g_k + \frac{1}{\alpha_k} (x_{k+1} - x_k) = 0,$$
which gives the formula for $x_{k+1}$. This viewpoint is the key to several extensions.

Extensions: Constraints

When a constraint set $\Omega$ is present we can simply minimize the quadratic model function over $\Omega$:
$$x_{k+1} = \arg\min_{x \in \Omega} f(x_k) + g_k^T (x - x_k) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2.$$
Gradient Projection has this form. We can replace the $\ell_2$-norm measure of distance with some other measure $\varphi(x; x_k)$:
$$x_{k+1} = \arg\min_{x \in \Omega} f(x_k) + g_k^T (x - x_k) + \frac{1}{2\alpha_k} \varphi(x; x_k).$$
Could choose $\varphi$ to match $\Omega$. For example, a measure derived from the entropy function is a good match for the simplex $\Omega := \{x \mid x \ge 0, \, e^T x = 1\}$.
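For instance, when $\Omega$ is a box the projection is a componentwise clip, and gradient projection reduces to a two-line loop. A sketch under that assumption (grad_f, lo, hi are illustrative names):

```python
import numpy as np

def gradient_projection_box(grad_f, x0, lo, hi, alpha=0.1, iters=100):
    """Gradient projection for min f(x) over the box lo <= x <= hi:
    gradient step, then projection onto Omega (here just np.clip)."""
    x = np.clip(np.asarray(x0, dtype=float), lo, hi)
    for _ in range(iters):
        x = np.clip(x - alpha * grad_f(x), lo, hi)
    return x
```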

Extensions: Regularizers

In many modern applications of optimization, $f$ has the form
$$f(x) = \underbrace{\ell(x)}_{\text{smooth function}} + \underbrace{\tau \psi(x)}_{\text{simple nonsmooth function}}.$$
Can extend the prox approach above by
- choosing $g_k$ to contain gradient information from $\ell(x)$ only;
- including $\tau \psi(x)$ explicitly in the subproblem.

Subproblems are thus:
$$x_{k+1} = \arg\min_x \ell(x_k) + g_k^T (x - x_k) + \frac{1}{2\alpha_k} \|x - x_k\|^2 + \tau \psi(x).$$
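For the common special case $\psi(x) = \|x\|_1$, this subproblem has a closed-form solution, the componentwise soft-threshold. A sketch of the resulting proximal-gradient iteration (grad_l is an assumed callable for the gradient of the smooth part):

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form solution of min_x t*||x||_1 + 0.5*||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_l, x0, tau, alpha=0.1, iters=200):
    """Prox-gradient for min l(x) + tau*||x||_1: gradient step on the
    smooth part l, then the prox step on the nonsmooth part tau*||.||_1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = soft_threshold(x - alpha * grad_l(x), alpha * tau)
    return x
```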

Extensions: Explicit Trust Regions

Rather than penalizing distance moved from the current $x_k$, we can enforce an explicit constraint: a trust region.
$$x_{k+1} = \arg\min_x f(x_k) + g_k^T (x - x_k) + I_{\|x - x_k\| \le \Delta_k}(x),$$
where $I_\Lambda(x)$ denotes an indicator function with
$$I_\Lambda(x) = \begin{cases} 0 & \text{if } x \in \Lambda \\ \infty & \text{otherwise.} \end{cases}$$
Adjust the trust-region radius $\Delta_k$ to ensure progress, e.g. descent in $f$.

Extension: Proximal Point

Could use the original $f$ in the subproblem rather than a simpler model function:
$$x_{k+1} = \arg\min_x f(x) + \frac{1}{2\alpha_k} \|x - x_k\|_2^2.$$
Although the subproblem seems just as hard to solve as the original, the prox-term may make it easier by introducing strong convexity, and may stabilize progress. Can extend to constrained and regularized cases also.

Quadratic Models: Newton's Method

We can extend the iterative strategy further by adding a quadratic term to the model, instead of (or in addition to) the simple prox-term above. Taylor's Theorem suggests basing this term on the Hessian (second-derivative) matrix. That is, obtain the step from
$$x_{k+1} := \arg\min_x f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac12 (x - x_k)^T \nabla^2 f(x_k) (x - x_k).$$
Can reformulate to solve for the step $d_k$:
$$d_k := \arg\min_d f(x_k) + \nabla f(x_k)^T d + \frac12 d^T \nabla^2 f(x_k) d,$$
then $x_{k+1} = x_k + d_k$. See immediately that this model won't have a bounded solution if $\nabla^2 f(x_k)$ is not positive definite. It usually is positive definite near a strict local solution $x^*$, but we need something that works more globally.

Practical Newton Method

One obvious strategy is to add the prox-term to the quadratic model:
$$d_k := \arg\min_d f(x_k) + \nabla f(x_k)^T d + \frac12 d^T \left( \nabla^2 f(x_k) + \frac{1}{\alpha_k} I \right) d,$$
choosing $\alpha_k$ so that
- the quadratic term is positive definite;
- some other desirable property holds, e.g. descent: $f(x_k + d_k) < f(x_k)$.

We can also impose the trust region explicitly:
$$d_k := \arg\min_d f(x_k) + \nabla f(x_k)^T d + \frac12 d^T \nabla^2 f(x_k) d + I_{\|d\| \le \Delta_k}(d),$$
or alternatively
$$d_k := \arg\min_{d : \|d\| \le \Delta_k} f(x_k) + \nabla f(x_k)^T d + \frac12 d^T \nabla^2 f(x_k) d.$$
But these are equivalent: for any $\Delta_k$, there exists $\alpha_k$ such that the solutions of the prox form and the trust-region form are identical.
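A sketch of one such practical Newton step: it adds $(1/\alpha_k)I$ to the Hessian, halving $\alpha_k$ until the modified model is positive definite (detected via Cholesky) and the step gives descent. The callables f, grad_f, hess_f are assumptions, and x is a NumPy array:

```python
import numpy as np

def practical_newton_step(f, grad_f, hess_f, x, alpha=1e6):
    """One Newton step with a prox-term: solve (H + (1/alpha) I) d = -g,
    shrinking alpha until the modified Hessian is positive definite
    (Cholesky succeeds) and the step gives descent in f."""
    g, H = grad_f(x), hess_f(x)
    n = x.size
    for _ in range(60):
        try:
            L = np.linalg.cholesky(H + (1.0 / alpha) * np.eye(n))
            d = np.linalg.solve(L.T, np.linalg.solve(L, -g))
            if f(x + d) < f(x):
                return x + d
        except np.linalg.LinAlgError:
            pass
        alpha *= 0.5          # heavier prox-term: shorter, more cautious step
    return x                  # no descent found; x is (near-)stationary
```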

Quasi-Newton Methods

Another disadvantage of Newton is that the Hessian may be difficult to evaluate or otherwise work with. The quadratic model is still useful when we use first-order information to learn about the Hessian. Key observation (from Taylor's theorem), the secant condition:
$$\nabla^2 f(x_k)(x_{k+1} - x_k) \approx \nabla f(x_{k+1}) - \nabla f(x_k).$$
The difference of gradients tells us how the Hessian behaves along the direction $x_{k+1} - x_k$. By aggregating such information over multiple steps, we can build up an approximation to the Hessian that is valid along multiple directions. Quasi-Newton methods maintain an approximation $B_k$ to $\nabla^2 f(x_k)$ that respects the secant condition. The approximation may be implicit rather than explicit, and we may store an approximation to the inverse Hessian instead.

L-BFGS

A particularly popular quasi-Newton method, suitable for large-scale problems, is the limited-memory BFGS method (L-BFGS), which stores the Hessian or inverse Hessian approximation implicitly. L-BFGS stores the last 5-10 update pairs
$$s_j := x_{j+1} - x_j, \quad y_j := \nabla f(x_{j+1}) - \nabla f(x_j), \quad j = k, k-1, \ldots, k-m.$$
Can implicitly construct $H_{k+1}$ that satisfies $H_{k+1} y_j = s_j$. In fact, an efficient recursive formula is available for evaluating the next search direction $d_{k+1} := -H_{k+1} \nabla f(x_{k+1})$ directly from the $(s_j, y_j)$ pairs and from some initial estimate of the form $(1/\alpha_{k+1}) I$.
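The efficient recursive formula is the standard two-loop recursion. A sketch (s_list and y_list hold the stored pairs in chronological order; gamma is the initial inverse-Hessian scaling, e.g. $1/\alpha_{k+1}$):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma):
    """Two-loop recursion: returns d = -H*grad, where H is the inverse-Hessian
    approximation defined implicitly by the stored (s_j, y_j) pairs and the
    initial estimate gamma*I."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        a = s.dot(q) / y.dot(s)
        alphas.append(a)
        q = q - a * y
    r = gamma * q
    for s, y, a in zip(s_list, y_list, reversed(alphas)):  # oldest to newest
        b = y.dot(r) / y.dot(s)
        r = r + (a - b) * s
    return -r
```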

Newton for Nonlinear Equations

There is also a variant of Newton's method for nonlinear equations: find $x$ such that $F(x) = 0$, where $F : \mathbb{R}^n \to \mathbb{R}^n$ ($n$ equations in $n$ unknowns). Newton's method forms a linear approximation to this system, based on another variant of Taylor's Theorem, which says
$$F(x + d) = F(x) + J(x) d + \int_0^1 [J(x + td) - J(x)] d \, dt,$$
where $J(x)$ is the Jacobian matrix of first partial derivatives:
$$J(x) = \begin{bmatrix} \dfrac{\partial F_1}{\partial x_1} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial F_n}{\partial x_1} & \cdots & \dfrac{\partial F_n}{\partial x_n} \end{bmatrix}$$
(usually not symmetric).

When $F$ is continuously differentiable, we have $F(x_k + d) \approx F(x_k) + J(x_k) d$, so the Newton step is the one that makes the right-hand side zero:
$$d_k := -J(x_k)^{-1} F(x_k).$$
The basic Newton method takes steps $x_{k+1} := x_k + d_k$. Its effectiveness can be improved by
- doing a line search: $x_{k+1} := x_k + \alpha_k d_k$ for some $\alpha_k > 0$;
- a Levenberg strategy: add $\lambda I$ to $J$ and set $d_k := -(J(x_k) + \lambda I)^{-1} F(x_k)$;
- guiding progress via a merit function, usually $\phi(x) := \frac12 \|F(x)\|_2^2$.

Achtung! Can get stuck in a local min of $\phi$ that's not a solution of $F(x) = 0$.
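A sketch combining the basic Newton step with a backtracking line search on the merit function $\phi(x) = \frac12\|F(x)\|^2$; F and J are assumed callables returning the residual vector and Jacobian matrix. Note the caveat above: this can still stall at a local minimum of $\phi$:

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-10, max_iter=50):
    """Newton's method for F(x) = 0, guarded by backtracking on the merit
    function phi(x) = 0.5*||F(x)||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        d = np.linalg.solve(J(x), -Fx)          # basic Newton step
        alpha = 1.0
        while np.linalg.norm(F(x + alpha * d)) >= np.linalg.norm(Fx):
            alpha *= 0.5                         # backtrack until phi decreases
            if alpha < 1e-10:                    # stalled: maybe a local min of phi
                break
        x = x + alpha * d
    return x
```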

Homotopy

Tries to avoid the local-min issue with the merit function. Start with an easier set of nonlinear equations, and gradually deform it to the system $F(x) = 0$, tracking changes to the solution as you go:
$$F(x, \lambda) := \lambda F(x) + (1 - \lambda) F_0(x), \quad \lambda \in [0, 1].$$
Assume that $F(x, 0) = F_0(x) = 0$ has solution $x_0$. Homotopy methods trace the curve of solutions $(x, \lambda)$ until $\lambda = 1$ is reached; the corresponding value of $x$ then solves the original problem. Many variants. Some supporting theory. Typically more expensive than enhanced Newton methods, but better at finding solutions to $F(x) = 0$. We mention homotopy mostly because of its connection to interior-point methods.

Interior-Point Methods

Recall the monotone LCP: find $z \in \mathbb{R}^n$ such that
$$0 \le z \perp Mz + q \ge 0,$$
where $M \in \mathbb{R}^{n \times n}$ is positive semidefinite and $q \in \mathbb{R}^n$. Recall too that the monotone LCP is a generalization of LP and convex QP. Rewrite the LCP as
$$w = Mz + q, \quad (w, z) \ge 0, \quad w_i z_i = 0, \; i = 1, 2, \ldots, n,$$
which is a constrained system of nonlinear equations: $F(z, w) = 0$, $(z, w) \ge 0$, where $F : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ is defined by
$$F(z, w) = \begin{bmatrix} Mz + q - w \\ WZe \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$
where $W = \mathrm{diag}(w_1, w_2, \ldots, w_n)$, $Z = \mathrm{diag}(z_1, z_2, \ldots, z_n)$, and $e = (1, 1, \ldots, 1)^T$.

Interior-Point Methods

Interior-point methods generate iterates that satisfy the nonnegativity constraints strictly, that is, $(z^k, w^k) > 0$ for all $k$. An obvious strategy is to search along Newton directions for $F$, with a line search to maintain positivity of $(z^{k+1}, w^{k+1})$. The Newton equations are
$$\begin{bmatrix} M & -I \\ W & Z \end{bmatrix} \begin{bmatrix} \Delta z \\ \Delta w \end{bmatrix} = -\begin{bmatrix} Mz + q - w \\ WZe \end{bmatrix}.$$
This affine-scaling approach can actually work, and convergence can be proved, but it needs very precise conditions on the steplength to avoid getting jammed at the boundary. Much better performance is obtained from methods that use homotopy, in particular, the idea of following a central path.

The Central Path

The central path is a critical object in primal-dual interior-point methods. It's defined by the parametric equations
$$F(z, w; \tau) := \begin{bmatrix} Mz + q - w \\ WZe - \tau e \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$
where $\tau \ge 0$ is the central path parameter, along with the conditions $(z, w) > 0$. The second block of conditions says that $w_i z_i = \tau$ for all $i = 1, 2, \ldots, n$, so if $\tau > 0$, we must have all components of $w$ and $z$ strictly positive, hence the term interior point. (Actually, interior-point methods don't force the first condition $Mz + q - w = 0$ to hold except in the limit, so the iterates are not really interior in the sense of being strictly feasible.)

Primal-Dual Interior-Point Methods

The central path guides iterates to the solution. The basic primal-dual interior-point method works as follows:
- Compute Newton steps for $F(z, w; \tau_k)$ for some value of $\tau_k$.
- Take a step $(z^{k+1}, w^{k+1}) = (z^k, w^k) + \alpha_k (\Delta z^k, \Delta w^k)$, choosing $\alpha_k$ so that $(z^{k+1}, w^{k+1})$ remains sufficiently positive.
- Possibly enhance with some auxiliary corrector steps.
- Choose a smaller value $\tau_{k+1} < \tau_k$ and repeat.

The effect is that the method chases a moving target along the central path, staying reasonably close to the central path to avoid getting jammed near the boundary of the region $(w, z) \ge 0$.

A Simple PDIP Method

Usually works, and simple to code (one page!). A sketch is given below.
- Choose $\tau_{k+1} = 0.9 \, \tau_k$;
- Choose $\alpha_k$ to be the largest $\alpha$ for which
$$(z_i^k + \alpha \Delta z_i^k)(w_i^k + \alpha \Delta w_i^k) \ge 0.01 \, (z^k + \alpha \Delta z^k)^T (w^k + \alpha \Delta w^k) / n, \quad i = 1, \ldots, n.$$
The second condition ensures that iterates stay in a loose neighborhood of the central path: none of the variables go to zero prematurely.
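A sketch of this simple PDIP method for the monotone LCP, under the stated rules ($\tau$ cut by 0.9 each iteration, step length found by backtracking until every product $z_i w_i$ stays above 1% of the average product). For clarity it solves the full $2n \times 2n$ Newton system with a dense factorization rather than the reduced form discussed later:

```python
import numpy as np

def simple_pdip_lcp(M, q, iters=60):
    """Simple primal-dual interior-point method for the monotone LCP
    0 <= z  (perp)  w = Mz + q >= 0, with M positive semidefinite."""
    n = len(q)
    z = np.ones(n)
    w = np.maximum(M @ z + q, 1.0)          # any strictly positive start
    tau = z.dot(w) / n
    for _ in range(iters):
        tau *= 0.9
        # Newton system for F(z, w; tau) = [Mz + q - w; ZWe - tau e] = 0.
        J = np.block([[M, -np.eye(n)],
                      [np.diag(w), np.diag(z)]])
        rhs = -np.concatenate([M @ z + q - w, z * w - tau])
        d = np.linalg.solve(J, rhs)
        dz, dw = d[:n], d[n:]
        # Backtrack to the largest alpha keeping the iterate in a loose
        # neighborhood of the central path.
        alpha = 1.0
        while True:
            zp, wp = z + alpha * dz, w + alpha * dw
            if (zp > 0).all() and (wp > 0).all() and \
               (zp * wp >= 0.01 * zp.dot(wp) / n).all():
                break
            alpha *= 0.9
        z, w = zp, wp
    return z, w
```

For example, simple_pdip_lcp(np.array([[2., 1.], [1., 2.]]), np.array([-1., -1.])) drives both the residual and the complementarity products toward zero.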

A More Elaborate PDIP Method (Mehrotra)

Chooses $\tau_{k+1}$ by a heuristic based on the performance of the affine-scaling step $(\Delta z^k_{\mathrm{aff}}, \Delta w^k_{\mathrm{aff}})$, which is computed by solving the Newton equations with $\tau = 0$.
- If we can take a long step along the affine-scaling direction before reaching the boundary, choose $\tau_{k+1}$ much smaller than $\tau_k$: an aggressive choice.
- Otherwise, if the affine-scaling step quickly jams at the boundary, make a more conservative choice: $\tau_{k+1}$ only a little smaller than $\tau_k$.

Mehrotra's method also includes a corrector step that corrects for the nonlinearity revealed by the affine-scaling step. The combined centering-corrector step is obtained from
$$\begin{bmatrix} M & -I \\ W & Z \end{bmatrix} \begin{bmatrix} \Delta z_{cc} \\ \Delta w_{cc} \end{bmatrix} = \begin{bmatrix} 0 \\ \tau_{k+1} e - \Delta Z^k_{\mathrm{aff}} \Delta W^k_{\mathrm{aff}} e \end{bmatrix}.$$
This is added to the affine-scaling step to obtain the search direction.

Solving the Linear Equations

At each iteration we need to solve two or more linear systems of the form
$$\begin{bmatrix} M & -I \\ W & Z \end{bmatrix} \begin{bmatrix} \Delta z \\ \Delta w \end{bmatrix} = \begin{bmatrix} r_f \\ r_{zw} \end{bmatrix},$$
where $W$ and $Z$ are the diagonal matrices formed from $w^k$ and $z^k$, which contain all positive numbers, and $r_f$ and $r_{zw}$ are the different right-hand sides (affine-scaling, centering-corrector). Note that there is a lot of structure in this system (three of the blocks are $n \times n$ nonsingular diagonal matrices) and that the structure is the same at every interior-point iteration.

Reduced Form

Can do block elimination (of $\Delta w$) to obtain
$$(M + Z^{-1} W) \Delta z = r_f + Z^{-1} r_{zw}.$$
Note that when $M$ is positive semidefinite (as it is for monotone LCP), $M + Z^{-1} W$ is positive definite. If we can identify a good strategy for reordering rows and columns of $M$ to solve this system efficiently, we can re-use the ordering at every iteration, because $M + Z^{-1} W$ has the same nonzero pattern (only the diagonals change). Since some elements of $Z^{-1} W$ go to $\infty$ as the iterates near a solution while others go to zero, this system can become highly ill-conditioned, but in practice it doesn't seem to matter much.

Application to LP

When we apply this technique to LCPs arising from LP, $M$ has a particular form, and the linear system that we solve has the form
$$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \\ \Delta s \end{bmatrix} = \begin{bmatrix} r_p \\ r_d \\ r_{xs} \end{bmatrix},$$
where $X = \mathrm{diag}(x_1, x_2, \ldots, x_n)$ and $S = \mathrm{diag}(s_1, s_2, \ldots, s_n)$. By eliminating $\Delta s$ we obtain a reduced form:
$$\begin{bmatrix} -X^{-1} S & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ \Delta \lambda \end{bmatrix} = \begin{bmatrix} r_p - X^{-1} r_{xs} \\ r_d \end{bmatrix}.$$
Since $X^{-1} S$ is positive diagonal, we can go a step further and eliminate $\Delta x$:
$$A (S^{-1} X) A^T \, \Delta \lambda = r_d + A S^{-1} X \left( r_p - X^{-1} r_{xs} \right).$$

Thus we have a coefficient matrix of the form $A D A^T$, where $D = S^{-1} X$ is positive diagonal. Some elements of $D$ blow up as the iterates converge (those for which $s_i = 0$ at the solution) while others go to zero (those for which $x_i = 0$). Hence the system MAY be ill conditioned. In practical codes, this matrix product is actually formed and factored at every iteration, using a variant of the Cholesky factorization. This factorization is stable regardless of the ordering of rows/columns, so we are free to use orderings that create the least fill-in during Cholesky. Good codes: MOSEK, Gurobi, CPLEX. Freebie: PCx.
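A dense NumPy sketch of the reduced solve: form $A D A^T$ with $D = S^{-1} X$ kept as a vector (it is diagonal), factor with Cholesky, and solve by two triangular substitutions. In a real code $A$ would be sparse and the ordering chosen to limit fill-in:

```python
import numpy as np

def normal_equations_solve(A, x, s, rhs):
    """Solve A (S^{-1} X) A^T dlam = rhs for the LP step, where x and s are
    the current strictly positive primal and dual-slack iterates."""
    D = x / s                       # diagonal of S^{-1} X, as a vector
    ADAT = (A * D) @ A.T            # A @ diag(D) @ A.T without forming diag
    L = np.linalg.cholesky(ADAT)    # stable for any row/column ordering
    return np.linalg.solve(L.T, np.linalg.solve(L, rhs))
```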

Optimal Control (Open Loop)

Given the current state $x_0$, choose a time horizon $N$ (long), and solve the optimization problem for $x = \{x_k\}_{k=0}^N$, $u = \{u_k\}_{k=0}^{N-1}$:
$$\min L(x, u) \quad \text{subject to } x_0 \text{ given}, \; x_{k+1} = F_k(x_k, u_k), \; k = 0, 1, \ldots, N-1,$$
plus other constraints on $x, u$. Then apply the controls $u_0, u_1, u_2, \ldots$ blindly, without monitoring the state.
- Flexible with respect to nonlinearity and constraints;
- if the model $F_k$ is inaccurate, the solution may be bad;
- doesn't account for system disturbances during the time horizon;
- never used in industrial practice!

Linear-Quadratic Regulator (Closed Loop)

A canonical control problem. For given $x_0$, solve
$$\min_{x,u} \Phi(x, u) := \frac12 \sum_{k=0}^{\infty} x_k^T Q x_k + u_k^T R u_k \quad \text{s.t. } x_{k+1} = A x_k + B u_k.$$
From the KKT conditions, the dependence of the optimal values of $x_1, x_2, \ldots$ and $u_0, u_1, \ldots$ on the initial $x_0$ is linear. We can substitute these variables out to obtain
$$\min_{x,u} \Phi(x, u) = \frac12 x_0^T \Pi x_0,$$
for some symmetric positive definite matrix $\Pi$. By using this dynamic programming principle, isolating the first stage, we can write the problem as
$$\min_{x_1, u_0} \frac12 \left( x_0^T Q x_0 + u_0^T R u_0 \right) + \frac12 x_1^T \Pi x_1 \quad \text{s.t. } x_1 = A x_0 + B u_0.$$

By substituting for $x_1$, we get an unconstrained quadratic problem in $u_0$. The minimizer is
$$u_0 = K x_0, \quad \text{where } K = -(R + B^T \Pi B)^{-1} B^T \Pi A,$$
so that
$$x_1 = A x_0 + B u_0 = (A + BK) x_0.$$
By substituting for $u_0$ and $x_1$ in
$$\frac12 x_0^T \Pi x_0 = \frac12 \left( x_0^T Q x_0 + u_0^T R u_0 \right) + \frac12 x_1^T \Pi x_1,$$
we obtain the Riccati equation:
$$\Pi = Q + A^T \Pi A - A^T \Pi B (R + B^T \Pi B)^{-1} B^T \Pi A.$$
There are well-known techniques to solve this equation for $\Pi$, hence $K$. Hence we have a feedback control law $u = Kx$ that is optimal for the LQR problem. This is a closed-loop control strategy that can respond to changes in state.
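One well-known technique is simple fixed-point iteration on the Riccati equation itself, which converges under standard stabilizability/detectability assumptions. A sketch (a more robust alternative would be a dedicated solver such as scipy.linalg.solve_discrete_are):

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=1000, tol=1e-12):
    """Iterate Pi <- Q + A'Pi A - A'Pi B (R + B'Pi B)^{-1} B'Pi A to a
    fixed point, then return Pi and the feedback gain K with u = K x."""
    Pi = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ Pi @ B, B.T @ Pi @ A)
        Pi_new = Q + A.T @ Pi @ A + A.T @ Pi @ B @ K   # the Riccati map
        if np.max(np.abs(Pi_new - Pi)) < tol:
            break
        Pi = Pi_new
    return Pi, K
```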

Model Predictive Control (MPC)

Closed-loop control has the highly desirable property of being able to respond to system disturbances and modeling errors, but closed-form solutions can be found only for simple models such as LQR. Open-loop control allows for a rich variety of models and constraints, and yields a highly structured optimization problem that can often be solved efficiently, but it can't fix disturbances and errors. Set and forget at your peril! Model Predictive Control (MPC) is a way of doing closed-loop control using open-loop / optimal-control techniques:
- Set up an optimal control problem, initialized at the current state, with a finite planning horizon.
- Solve this open-loop problem (quickly!) and implement the control $u_0$.
- At the next decision point, repeat the process. Can use the previous solution (shifted forward one period) as a warm start.

Linear MPC

A generalization of the linear-quadratic regulator (LQR) problem includes constraints:
$$\min_{x,u} \frac12 \sum_{k=0}^{\infty} x_k^T Q x_k + u_k^T R u_k,$$
subject to
$$x_{k+1} = A x_k + B u_k, \quad x_k \in X, \quad u_k \in U, \quad k = 0, 1, 2, \ldots,$$
possibly also mixed constraints, and constraints on $u_{k+1} - u_k$. Assuming that $0 \in \mathrm{int}(X)$, $0 \in \mathrm{int}(U)$, and that the system is stabilizable, we expect that $u_k \to 0$ and $x_k \to 0$ as $k \to \infty$. Therefore, for large enough $k$, the non-model constraints become inactive.

Hence, for $N$ large enough, the problem is equivalent to the following (finite) problem:
$$\min_{x,u} \sum_{k=0}^{N-1} \frac12 \left( x_k^T Q x_k + u_k^T R u_k \right) + \frac12 x_N^T \Pi x_N,$$
subject to
$$x_{k+1} = A x_k + B u_k, \quad x_k \in X, \quad u_k \in U, \quad k = 0, 1, \ldots, N-1,$$
where $\Pi$ is the solution of the Riccati equation. In the tail of the sequence ($k > N$) we simply apply the unconstrained LQR feedback law derived above. When the constraints are linear, we need to solve a finite, structured, convex quadratic program.

Interior-Point Method for MPC
$$\min_{u,x,\epsilon} \sum_{k=0}^{N-1} \frac12 \left( x_k^T Q x_k + u_k^T R u_k + 2 x_k^T M u_k \right) + \sum_{k=1}^{N} \left( \frac12 \epsilon_k^T Z \epsilon_k + z^T \epsilon_k \right) + \frac12 x_N^T \Pi x_N,$$
subject to
$$x_0 = \hat{x}_j \; (\text{fixed}), \quad x_{k+1} = A x_k + B u_k, \quad k = 0, 1, \ldots, N-1,$$
$$D u_k + G x_k \le d, \quad k = 0, 1, \ldots, N-1,$$
$$H x_k - \epsilon_k \le h, \quad \epsilon_k \ge 0, \quad k = 1, 2, \ldots, N,$$
$$F x_N = 0.$$
Soft constraints on the state (involving $\epsilon_k$); also general hard constraints on $(x_k, u_k)$. Can use these to implement constraints on control changes $u_{k+1} - u_k$.

Introduce dual variables and use a stagewise ordering. The primal-dual interior-point method then yields a block-banded system at each iteration. The block of rows for stage $k$ has the form
$$\begin{bmatrix} Q & M & G^T & A^T \\ M^T & R & D^T & B^T \\ G & D & -\Sigma^D_k & \\ A & B & & -I \end{bmatrix}$$
acting on $(\Delta x_k, \Delta u_k, \Delta \lambda_k, \Delta p_{k+1})$ with right-hand side $(r^x_k, r^u_k, r^\lambda_k, r^p_{k+1})$, together with additional rows for the soft-constraint quantities $(\xi_{k+1}, \eta_{k+1}, \epsilon_{k+1})$ involving $H$, $Z$, and the diagonal blocks $\Sigma^\epsilon_{k+1}$, $\Sigma^H_{k+1}$; the stages are coupled only through the dynamics. Here $\Sigma^D_k$, $\Sigma^\epsilon_{k+1}$, etc. are diagonal.

By performing block elimination, we get the reduced system
$$\begin{bmatrix}
R_0 & B^T & & & & & & \\
B & & -I & & & & & \\
& -I & Q_1 & M_1 & A^T & & & \\
& & M_1^T & R_1 & B^T & & & \\
& & A & B & & -I & & \\
& & & & & \ddots & & \\
& & & & & & Q_N & F^T \\
& & & & & & F &
\end{bmatrix}
\begin{bmatrix} \Delta u_0 \\ \Delta p_0 \\ \Delta x_1 \\ \Delta u_1 \\ \Delta p_1 \\ \vdots \\ \Delta x_N \\ \beta \end{bmatrix}
=
\begin{bmatrix} r^u_0 \\ r^p_0 \\ r^x_1 \\ r^u_1 \\ r^p_1 \\ \vdots \\ r^x_N \\ r^\beta \end{bmatrix},$$
which has the same structure as the KKT system of a problem without side constraints (soft or hard).

Can solve by applying a banded linear solver: $O(N)$ operations. Alternatively, seek matrices $\Pi_k$ and vectors $\pi_k$ such that the following relationship is satisfied between $p_{k-1}$ and $x_k$:
$$p_{k-1} + \Pi_k x_k = \pi_k, \quad k = N, N-1, \ldots, 1.$$
By substituting in the linear system, we find a recurrence relation:
$$\Pi_N = Q_N, \quad \pi_N = r^x_N,$$
$$\Pi_{k-1} = Q_{k-1} + A^T \Pi_k A - (A^T \Pi_k B + M_{k-1})(R_{k-1} + B^T \Pi_k B)^{-1} (B^T \Pi_k A + M_{k-1}^T),$$
$$\pi_{k-1} = r^x_{k-1} + A^T \Pi_k r^p_{k-1} + A^T \pi_k - (A^T \Pi_k B + M_{k-1})(R_{k-1} + B^T \Pi_k B)^{-1} \left( r^u_{k-1} + B^T \Pi_k r^p_{k-1} + B^T \pi_k \right).$$
The recurrence for $\Pi_k$ is the discrete time-varying Riccati equation!

Duality for Nonlinear Programming (Refresher)

Recall yesterday's conversation about duality for general constrained problems:
$$\min f(x) \quad \text{subject to } c_i(x) \ge 0, \; i = 1, 2, \ldots, m,$$
with the Lagrangian defined by
$$L(x, \lambda) := f(x) - \lambda^T c(x) = f(x) - \sum_{i=1}^m \lambda_i c_i(x).$$
Remember that we defined primal and dual objectives
$$r(x) = \sup_{\lambda \ge 0} L(x, \lambda), \qquad q(\lambda) = \inf_x L(x, \lambda),$$
so the primal and dual problems are
$$\min_x r(x), \qquad \max_{\lambda \ge 0} q(\lambda).$$

Augmented Lagrangian

Can motivate it crudely as proximal point applied to the dual. Consider the dual:
$$\max_{\lambda \ge 0} q(\lambda) = \max_{\lambda \ge 0} \inf_x f(x) - \lambda^T c(x).$$
Add a proximal-point term, a quadratic penalty on the distance moved from the last dual iterate $\lambda^k$:
$$\max_{\lambda \ge 0} \inf_x f(x) - \lambda^T c(x) - \frac{1}{2\alpha_k} \|\lambda - \lambda^k\|_2^2.$$
Note that the objective here is simple in $\lambda$. Now switch the max and inf:
$$\inf_x \max_{\lambda \ge 0} \left\{ f(x) - \lambda^T c(x) - \frac{1}{2\alpha_k} \|\lambda - \lambda^k\|_2^2 \right\}.$$
We can solve the max problem to get an explicit value for $\lambda$! This is easy because the components of $\lambda$ are separated in the objective; we can solve for each one individually.

The explicit solution is
$$\lambda_i = \begin{cases} 0 & \text{if } \lambda_i^k - \alpha_k c_i(x) \le 0; \\ \lambda_i^k - \alpha_k c_i(x) & \text{otherwise.} \end{cases}$$
Substitute this result into the previous expression to get
$$\min_x f(x) + \sum_{i : c_i(x) \le \lambda_i^k / \alpha_k} \left( \frac{\alpha_k}{2} c_i(x)^2 - \lambda_i^k c_i(x) \right) - \sum_{i : c_i(x) > \lambda_i^k / \alpha_k} \frac{1}{2\alpha_k} (\lambda_i^k)^2.$$
Thus the basic augmented Lagrangian process is (iteration $k$):
- Minimize the augmented Lagrangian function above for $x$ (approximately); call the result $x^{k+1}$;
- Plug this $x$ into the explicit-max formula for $\lambda$ to get $\lambda^{k+1}$;
- Choose $\alpha_{k+1}$ for the next iteration.

Equality Constraints

For the equality-constrained case, the formulae simplify a lot. We have
$$\min_x f(x) \quad \text{subject to } d_j(x) = 0, \; j = 1, 2, \ldots, p.$$
The dual objective is $\inf_x L(x, \mu) := f(x) - \mu^T d(x)$. The augmented Lagrangian subproblems are:
$$x^{k+1} = \arg\min_x f(x) - (\mu^k)^T d(x) + \frac{\alpha_k}{2} \|d(x)\|_2^2,$$
$$\mu^{k+1} = \mu^k - \alpha_k d(x^{k+1}).$$
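A sketch of this method of multipliers for equality constraints, with the inner minimization delegated to a quasi-Newton solver (here SciPy's BFGS; f and d are assumed callables, and a fixed penalty $\alpha$ is used rather than an adaptive one):

```python
import numpy as np
from scipy.optimize import minimize

def aug_lagrangian_equality(f, d, x0, m, alpha=10.0, outer=20):
    """Method of multipliers for min f(x) s.t. d(x) = 0 (d maps R^n -> R^m):
    approximately minimize the augmented Lagrangian in x, then take the
    first-order multiplier update mu <- mu - alpha * d(x)."""
    x = np.asarray(x0, dtype=float)
    mu = np.zeros(m)
    for _ in range(outer):
        aug = lambda x: f(x) - mu.dot(d(x)) + 0.5 * alpha * np.sum(d(x)**2)
        x = minimize(aug, x, method="BFGS").x     # inexact inner solve
        mu = mu - alpha * d(x)
    return x, mu

# Example: min x1 + x2 s.t. x1^2 + x2^2 = 2; the solution is (-1, -1).
x, mu = aug_lagrangian_equality(lambda x: x[0] + x[1],
                                lambda x: np.array([x[0]**2 + x[1]**2 - 2.0]),
                                x0=[2.0, 0.0], m=1)
```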

Augmented Lagrangian: History and Practice

How accurately to solve the subproblem for $x$, which is unconstrained but nonlinear? How to adjust $\alpha_k$? (Can use a different $\alpha_k$ value for each constraint, and increase it when that constraint is not becoming feasible rapidly enough.)

Historical sketch:
- Dates from 1969: Hestenes, Powell.
- Developments in the 1970s to early 1980s by Rockafellar, Bertsekas, others.
- Lancelot code for nonlinear programming: Conn, Gould, Toint, 1990.
- Largely lost favor as an approach for general nonlinear programming during the next 15 years.
- Recent revival in the context of sparse optimization and its many applications, in conjunction with splitting / coordinate descent.

Separable Objectives: ADMM

The Alternating Direction Method of Multipliers (ADMM) arises when the objective in the basic linearly constrained problem is separable:
$$\min_{(x,z)} f(x) + h(z) \quad \text{subject to } Ax + Bz = c,$$
for which
$$L(x, z, \lambda; \alpha) := f(x) + h(z) - \lambda^T (Ax + Bz - c) + \frac{\alpha}{2} \|Ax + Bz - c\|_2^2.$$
The standard augmented Lagrangian method would minimize $L(x, z, \lambda; \alpha)$ over $(x, z)$ jointly, but these variables are coupled through the quadratic term, so the advantage of separability is lost. Instead, minimize over $x$ and $z$ separately and sequentially:
$$x^k = \arg\min_x L(x, z^{k-1}, \lambda^{k-1}; \alpha_k);$$
$$z^k = \arg\min_z L(x^k, z, \lambda^{k-1}; \alpha_k);$$
$$\lambda^k = \lambda^{k-1} - \alpha_k (A x^k + B z^k - c).$$
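A sketch of these updates for the lasso problem, $f(x) = \frac12\|Cx - b\|^2$, $h(z) = \tau\|z\|_1$, with constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$ in the template above; the data matrix is named C here to avoid a clash). The $x$-subproblem is a linear system whose matrix never changes, so it is factored once:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(C, b, tau, alpha=1.0, iters=300):
    """ADMM for min 0.5*||Cx - b||^2 + tau*||z||_1  s.t.  x - z = 0,
    with L = f + h - lam'(x - z) + (alpha/2)*||x - z||^2."""
    n = C.shape[1]
    x = z = lam = np.zeros(n)
    L = np.linalg.cholesky(C.T @ C + alpha * np.eye(n))  # factor once
    Ctb = C.T @ b
    for _ in range(iters):
        # x-step: (C'C + alpha I) x = C'b + lam + alpha z
        x = np.linalg.solve(L.T, np.linalg.solve(L, Ctb + lam + alpha * z))
        # z-step: prox of (tau/alpha)*||.||_1, i.e. the shrink operator
        z = soft_threshold(x - lam / alpha, tau / alpha)
        # multiplier step
        lam = lam - alpha * (x - z)
    return z
```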

ADMM

Basically does a round of block-coordinate descent in $(x, z)$. The minimizations over $x$ and $z$ add only a quadratic term to $f$ and $h$, respectively, which does not alter the cost much. Can perform these minimizations inexactly. Convergence is often slow, but sufficient for many applications. Many recent applications to compressed sensing, image processing, matrix completion, sparse principal components analysis. ADMM has a rich collection of antecedents. A nice recent survey, including a diverse collection of machine learning applications, is Boyd et al. (2011).

ADMM for Consensus Optimization

Given $\min_x \sum_{i=1}^m f_i(x)$, form $m$ copies of $x$, with the original $x$ as a master variable:
$$\min_{x, x^1, x^2, \ldots, x^m} \sum_{i=1}^m f_i(x^i) \quad \text{subject to } x^i = x, \; i = 1, 2, \ldots, m.$$
Applying ADMM with $z = (x^1, x^2, \ldots, x^m)$ gives
$$x^i_k = \arg\min_{x^i} f_i(x^i) - (\lambda^i_{k-1})^T (x^i - x_{k-1}) + \frac{\alpha_k}{2} \|x^i - x_{k-1}\|_2^2, \quad \forall i,$$
$$x_k = \frac{1}{m} \sum_{i=1}^m \left( x^i_k - \frac{1}{\alpha_k} \lambda^i_{k-1} \right),$$
$$\lambda^i_k = \lambda^i_{k-1} - \alpha_k (x^i_k - x_k), \quad \forall i.$$
Obvious parallel possibilities in the $x^i$ updates. Synchronize for the $x$ update.

ADMM for Awkward Intersections

The feasible set is sometimes an intersection of two or more convex sets that are easy to handle separately (e.g. projections are easily computable), but whose intersection is more difficult to work with. Example: optimization over the cone of doubly nonnegative matrices:
$$\min_X f(X) \quad \text{s.t. } X \succeq 0, \; X \ge 0.$$
General form:
$$\min f(x) \quad \text{s.t. } x \in \Omega_i, \; i = 1, 2, \ldots, m.$$
This is just consensus optimization, with indicator functions for the sets:
$$x_k = \arg\min_x f(x) + \sum_{i=1}^m \left( -(\lambda^i_{k-1})^T (x - x^i_{k-1}) + \frac{\alpha_k}{2} \|x - x^i_{k-1}\|_2^2 \right),$$
$$x^i_k = \arg\min_{x^i \in \Omega_i} -(\lambda^i_{k-1})^T (x_k - x^i) + \frac{\alpha_k}{2} \|x_k - x^i\|_2^2, \quad \forall i,$$
$$\lambda^i_k = \lambda^i_{k-1} - \alpha_k (x_k - x^i_k), \quad \forall i.$$

ADMM and Prox-Linear

Given
$$\min_x f(x) + \tau \psi(x),$$
reformulate as the equality-constrained problem
$$\min_{x,z} f(x) + \tau \psi(z) \quad \text{subject to } x = z.$$
The ADMM form:
$$x_k := \arg\min_x f(x) + \tau \psi(z_{k-1}) + (\lambda_k)^T (x - z_{k-1}) + \frac{\alpha_k}{2} \|z_{k-1} - x\|_2^2,$$
$$z_k := \arg\min_z f(x_k) + \tau \psi(z) + (\lambda_k)^T (x_k - z) + \frac{\alpha_k}{2} \|z - x_k\|_2^2,$$
$$\lambda_{k+1} := \lambda_k + \alpha_k (x_k - z_k).$$
The minimization over $z$ is the shrink operator, often inexpensive. The minimization over $x$ can be performed approximately, using an algorithm suited to the form of $f$.

The subproblems are not too different from those obtained in prox-linear algorithms:
- $\lambda_k$ is asymptotically similar to the gradient term in prox-linear, that is, $\lambda_k \approx -\nabla f(x_k)$;
- thus, the minimization over $z$ is quite similar to the prox-linear step.

References

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, second edition.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122.

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York, second edition.

Rao, C. V., Wright, S. J., and Rawlings, J. B. (1998). Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications, 99:723-757.

Wright, S. J. (1997a). Applying new optimization algorithms to model predictive control. In Kantor, J. C., editor, Chemical Process Control-V, volume 93 of AIChE Symposium Series, pages 147-155. CACHE Publications.

Wright, S. J. (1997b). Primal-Dual Interior-Point Methods. SIAM, Philadelphia, PA.