High Performance Nonlinear Solvers

Michael McCourt
Division, Argonne National Laboratory
IIT Meshfree Seminar, September 19, 2011
mccomic@mcs.anl.gov (Argonne), Newton-Krylov-Schwarz

What is a nonlinear system?

Every nonlinear system of equations can be described as $F(u) = 0$ for $u \in \mathbb{R}^N$ and $F : \mathbb{R}^N \to \mathbb{R}^N$. $F$ is often referred to as a residual function.

This includes
- $x + 2 = 3$
- $Ax = b$
- $x^3 = 3^x$

This does not include
- $x + 2 < 3$ (an inequality)
- $\min_t Ax = b(t)$ (an optimization problem)
- $x^3 = 3^x$, $x \in \mathbb{Z}$ (an integer constraint)

What can become a nonlinear system?

Consider the problem
$$F(u) = \alpha - \int_0^u e^{-t^2}\,dt = 0, \quad \alpha \in \mathbb{R}.$$
This is a nonlinear equation, but because $e^{-t^2}$ has no elementary antiderivative, there is no closed form with which to compute $F(u)$.

Solution: Approximate the integral with, e.g., the trapezoid rule, Gauss quadrature, or Monte Carlo, and call that discretization $\tilde{I}$. Then define $\tilde{F}(u) = \alpha - \tilde{I}(u)$.

What can become a nonlinear system?

Consider the problem
$$u_t(t) - f(u, t) = 0, \quad u(0) = u_0.$$
In trying to solve for $u$, what does it mean to apply $d/dt$?

Solution: Among other possible options, we could discretize the solution on a grid and solve for $u(t)$ at specific $t$ (labeled $u_{k+1}$), with a finite difference approximation to $u_t(t)$ built from the previous value $u_k$:
$$\frac{1}{\Delta t}(u_{k+1} - u_k) - f(u_{k+1}, t_{k+1}) = 0, \quad k = 0, 1, \ldots$$
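
As a rough sketch of how the time-discretized problem above turns into a residual function, the snippet below builds $F(v) = (v - u_k)/\Delta t - f(v, t_{k+1})$ for a single implicit step. The right-hand side $f(u, t) = -u^3$, the step size, and all helper names are assumptions made for illustration and do not come from the talk. The scalar Newton loop is only there to show that the residual can be driven to zero.

```python
def make_residual(f, u_k, t_next, dt):
    """Residual of one implicit step: F(v) = (v - u_k)/dt - f(v, t_next)."""
    def F(v):
        return (v - u_k) / dt - f(v, t_next)
    return F

# Hypothetical right-hand side, chosen only for illustration: u_t = -u^3.
def f(u, t):
    return -u ** 3

def df_du(u, t):
    return -3.0 * u ** 2

def implicit_step(u_k, t_next, dt, tol=1e-12, max_iter=50):
    """Solve F(v) = 0 for the new value v with a scalar Newton iteration."""
    F = make_residual(f, u_k, t_next, dt)
    v = u_k  # previous value as the initial guess
    for _ in range(max_iter):
        r = F(v)
        if abs(r) < tol:
            break
        dF = 1.0 / dt - df_du(v, t_next)  # derivative of the residual in v
        v -= r / dF
    return v

u0, dt = 1.0, 0.1
u1 = implicit_step(u0, dt, dt)
residual_at_u1 = make_residual(f, u0, dt, dt)(u1)
```

Here `u1` plays the role of $u_{k+1}$, and `residual_at_u1` should be close to zero once the iteration has converged.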

What can become a nonlinear system?

Consider the problem
$$\min_{u \in \Omega \subset \mathbb{R}^N} G(u).$$
As mentioned earlier, optimization problems are not nonlinear systems because there is no residual function to evaluate.

Solution: A technique referred to as Quasi-Newton leverages the fact that local minima are reached when $\nabla G(u) = 0$. By discretizing the gradient as $\tilde{\nabla}$ we can define $F(u) = (\tilde{\nabla} G)(u)$.

How do we solve nonlinear systems?

Picard iteration: $u_{k+1} = f(u_k)$
Also called fixed point iteration, or nonlinear Richardson.
(Pictured: Charles Émile Picard)

Limitations to Picard include
- Must be able to write $F(u) = u - f(u)$ such that $\|J(f)(u)\| < 1$ near the solution.
- May need a good initial guess $u_0$.
- Convergence may be slow.

How do we solve nonlinear systems?

Stochastic search: $F(u) = 0 \;\Rightarrow\; \min_u \|F(u)\|$
Reformulate the nonlinear system as an optimization problem and solve it with optimization techniques.
(Pictured: Nicholas Metropolis)

Limitations to stochastic search include
- Produces a solution in distribution
- Computationally costly; may require extra memory
- Less rigorous mathematics ($\|F(u)\|$ may not have smooth derivatives)

How do we solve nonlinear systems?

Newton's Method: $u_{k+1} = u_k - J(F)(u_k)^{-1} F(u_k)$
Quadratically convergent algorithm from back in the day.
(Pictured: Sir Isaac Newton)

Limitations to Newton's method include
- Good initial guess needed
- Requires Jacobian knowledge
- Linear solve required at each nonlinear iteration
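
To make the contrast between Picard and Newton concrete, here is a small sketch that applies both to the same scalar problem. The test problem $u = \cos(u)$, the starting point, and the tolerances are assumptions picked for illustration. The point is only that both reach the same root while Newton needs far fewer iterations.

```python
import math

def picard(f, u0, tol=1e-12, max_iter=1000):
    """Fixed-point iteration u_{k+1} = f(u_k); returns (u, iterations)."""
    u = u0
    for k in range(1, max_iter + 1):
        u_new = f(u)
        if abs(u_new - u) < tol:
            return u_new, k
        u = u_new
    return u, max_iter

def newton(F, dF, u0, tol=1e-12, max_iter=50):
    """Newton iteration u_{k+1} = u_k - F(u_k)/F'(u_k); returns (u, iterations)."""
    u = u0
    for k in range(1, max_iter + 1):
        step = -F(u) / dF(u)
        u += step
        if abs(step) < tol:
            return u, k
    return u, max_iter

# Illustrative problem (an assumption): u = cos(u), i.e. F(u) = u - cos(u).
u_picard, n_picard = picard(math.cos, 1.0)
u_newton, n_newton = newton(lambda u: u - math.cos(u),
                            lambda u: 1.0 + math.sin(u), 1.0)
```

Near the root $|f'(u)| = |\sin(u)| \approx 0.67 < 1$, so Picard converges, but only linearly. Comparing `n_picard` with `n_newton` shows the gap.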

Derivation of Newton's Method

Where does the iteration
$$u_{k+1} = u_k - J(F)(u_k)^{-1} F(u_k), \quad k = 0, 1, \ldots$$
to solve $F(u) = 0$ come from?

Taylor series: Assume you are at step $u_k$ and the solution is $u^*$, meaning $\Delta u_k = u^* - u_k$.
$$\underbrace{F(u_k + \Delta u_k)}_{F(u^*) = 0} = F(u_k) + J(F)(u_k)\,\Delta u_k + \underbrace{O(\|\Delta u_k\|^2)}_{\approx 0}$$
$$0 \approx F(u_k) + J(F)(u_k)\,\Delta u_k$$
Solving for $\Delta u_k$ and setting $u_{k+1} = u_k + \Delta u_k$ gives the iteration. When the steps $\Delta u_k$ get small enough, $u_k \approx u^*$.

Making Newton's method practical

Quadratic convergence makes Newton's method the optimal choice, if we can circumvent the limitations. For Newton's method to be practical we need
- Globalization - How bad can our initial guess be to still see convergence?
- Linear solvers - Can we efficiently invert the Jacobian?
- Jacobian computation - How can we efficiently evaluate the Jacobian? Can we make do with a cheap approximation to the Jacobian?

Globalization

Why does a bad initial guess prevent convergence? Recall the Taylor expansion
$$F(u_k + \Delta u_k) = F(u_k) + J(F)(u_k)\,\Delta u_k + O(\|\Delta u_k\|^2)$$
$$-J(F)(u_k)^{-1} F(u_k) = \Delta u_k + O(\|J(F)(u_k)^{-1}\|\,\|\Delta u_k\|^2)$$
Newton's method converges quadratically with a decent initial guess. If $\Delta u_k$ is too large, the assumption that $O(\|\Delta u_k\|^2)$ is negligible is invalid. This means that the linear system solution $-J(F)(u_k)^{-1} F(u_k)$ is a poor approximation to $\Delta u_k$.

Globalization

How do we implement Newton's method for a bad initial guess?

Line search - take a shorter step in the Newton direction and make sure to reduce the residual norm $\|F(u)\|$.

Why does that make sense? As long as we are reducing the norm, we will eventually get close enough for Newton's method to converge as it should.
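
The line search idea above can be sketched in a few lines. The example below is an illustration under stated assumptions: the test function $F(u) = \arctan(u)$, the starting guess $u_0 = 3$, and the simple step-halving rule are all choices made here, not taken from the talk. Practical line searches usually use a sufficient-decrease condition instead of plain halving.

```python
import math

def newton(F, dF, u0, line_search, tol=1e-12, max_iter=50):
    """Scalar Newton iteration; optionally backtrack until |F| decreases."""
    u = u0
    for _ in range(max_iter):
        r = F(u)
        if abs(r) < tol:
            break
        step = -r / dF(u)  # full Newton step
        lam = 1.0
        if line_search:
            # Halve the step until the residual norm is reduced.
            while abs(F(u + lam * step)) >= abs(r) and lam > 1e-10:
                lam *= 0.5
        u += lam * step
    return u

F = math.atan

def dF(u):
    return 1.0 / (1.0 + u * u)

# From u0 = 3 the full Newton step overshoots the root at 0 more each time.
u_plain = newton(F, dF, 3.0, line_search=False, max_iter=5)
u_damped = newton(F, dF, 3.0, line_search=True)
```

With the full step, the iterates move farther from the root on every iteration. With backtracking, the first step is shortened until $|F|$ drops, after which the undamped Newton step is accepted and convergence is fast.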

Globalization

How do we implement Newton's method for a bad initial guess?

Trust region - prevent the iteration from entering a region with unacceptable values.

Why does that make sense? If you have physical knowledge about the system, use it to restrict the steps when possible.

Example - Pressure cannot be negative, so if the iteration produces a negative value, take a smaller step.

Globalization

How do we implement Newton's method for a bad initial guess?

Pseudotransient continuation - solve an equivalent time-dependent system and march it to steady state.

Why does that make sense? This one is a little more difficult to understand. In trying to solve $F(u) = 0$, we can find the steady-state solution to
$$u_t(x, t) = F(u(x, t)), \quad u(x, 0) = u_0(x).$$
This time-dependent system at steady state is independent of the initial condition. It is much better conditioned, although we're not interested in why here.

Linear Solvers

How do we find the Newton step $J(F)(u_k)\,\Delta u_k = -F(u_k)$ efficiently?

Question: Do we even need the exact solution $-J(F)(u_k)^{-1} F(u_k)$?

Actually, no. It turns out that Inexact Newton can also converge quadratically, provided the tolerance $\epsilon$ is tightened as the residual shrinks:
$$\|J(F)(u_k)\,\Delta u_k + F(u_k)\| < \epsilon$$
This means an iterative solver can be used. Furthermore, what's the point in exactly solving the linear system if a globalization technique (e.g., line search) is being used?

Linear Solvers

Now that we know an iterative solver can be used to find the Newton step, new opportunities are available: The Jacobian no longer needs to be computed - only the action $J(F)(u)v$. How can we take advantage of this?

Finite differences
$$F(u + hv) = F(u) + h\,J(F)(u)v + O(h^2)$$
$$J(F)(u)v \approx \frac{1}{h}\left(F(u + hv) - F(u)\right)$$
Jacobian-vector products can be approximated by finite differences at the cost of 1 additional function evaluation, since $F(u)$ is already available. This does not require computing the full Jacobian.
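
A minimal sketch of the finite difference Jacobian-vector product described above follows. The two-equation residual `F`, the hand-written `jacobian_times` used for comparison, and the step `h = 1e-7` are assumptions chosen for illustration.

```python
import math

def F(u):
    """Hypothetical 2x2 residual used only for illustration."""
    x, y = u
    return [x * x + y - 3.0, math.sin(x) + y ** 3]

def jacobian_times(u, v):
    """Exact J(F)(u) v, written out by hand for comparison."""
    x, y = u
    return [2.0 * x * v[0] + v[1],
            math.cos(x) * v[0] + 3.0 * y * y * v[1]]

def fd_jvp(F, u, v, Fu=None, h=1e-7):
    """Approximate J(F)(u) v by (F(u + h v) - F(u)) / h.

    If F(u) is already known (as it is inside a Newton iteration), pass it
    as Fu so that only one extra function evaluation is needed.
    """
    if Fu is None:
        Fu = F(u)
    shifted = F([ui + h * vi for ui, vi in zip(u, v)])
    return [(s - f0) / h for s, f0 in zip(shifted, Fu)]

u = [1.0, 2.0]
v = [0.5, -1.0]
approx = fd_jvp(F, u, v, Fu=F(u))
exact = jacobian_times(u, v)
error = max(abs(a - e) for a, e in zip(approx, exact))
```

The error is limited by the choice of `h`: too large and the $O(h)$ truncation term dominates, too small and subtractive cancelation takes over.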

Linear Solvers

Now that we know an iterative solver can be used to find the Newton step, new opportunities are available: The Jacobian no longer needs to be computed - only the action $J(F)(u)v$. How can we take advantage of this?

Complex derivatives: To avoid cancelation from finite differences,
$$F(u + \imath h v) = F(u) + \imath h\,J(F)(u)v + O(h^2)$$
$$\mathrm{Re}\left(F(u + \imath h v)\right) \approx F(u)$$
$$\frac{1}{h}\,\mathrm{Im}\left(F(u + \imath h v)\right) \approx J(F)(u)v$$
Function evaluations and Jacobian-vector products can be computed simultaneously given a real function $F$ if it is overloaded to accept complex arguments.

Linear Solvers

After choosing a linear solver tolerance $\epsilon$,
$$\|J(F)(u_k)\,\Delta u_k + F(u_k)\| < \epsilon$$
can be solved via GMRES or some other iterative method without ever computing the true Jacobian. This introduces the Krylov into Newton-Krylov-Schwarz.

Unfortunately, most problems of interest are rather ill-conditioned, meaning that an iterative solver will converge very slowly.

Preconditioning: To combat this, it is common to use a preconditioner. Unfortunately, since we don't have the true Jacobian, we have no idea what a good preconditioner looks like.

Jacobian Computation

Recall the iterative approach to solving systems: unpreconditioned methods for $Ax = b$ form the Krylov space
$$K_n = \mathrm{span}\{b, Ab, \ldots, A^{n-1} b\}.$$
We only have the ability to conduct matrix-vector products and do not have access to the true Jacobian. Since the Jacobian-vector products are being approximated via finite differences, the true Jacobian is not necessary.

Recall the structure of a preconditioned Krylov subspace for the problem $(AM^{-1})(Mx) = b$:
$$K_n = \mathrm{span}\{b, AM^{-1} b, \ldots, (AM^{-1})^{n-1} b\}$$
How can we approximate a Jacobian matrix with which to create a preconditioner? (Hint: it doesn't need to be perfect...)
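
The complex-step idea from the Linear Solvers discussion above can be tried out directly in Python, whose arithmetic and `cmath` functions already accept complex arguments. This is a sketch under assumptions: the residual `F`, the vectors `u` and `v`, and the very small step `h = 1e-20` are illustrative choices. Because no subtraction of nearly equal numbers occurs, `h` can be taken far smaller than in a finite difference.

```python
import cmath
import math

def F(u):
    """Hypothetical residual written so that it also accepts complex input."""
    x, y = u
    return [x * x + y - 3.0, cmath.sin(x) + y * y * y]

def complex_step_jvp(F, u, v, h=1e-20):
    """Return (F(u), J(F)(u) v) from a single evaluation of F(u + i h v)."""
    shifted = F([ui + 1j * h * vi for ui, vi in zip(u, v)])
    value = [s.real for s in shifted]     # Re(F(u + i h v)) ~ F(u)
    jvp = [s.imag / h for s in shifted]   # Im(F(u + i h v)) / h ~ J(F)(u) v
    return value, jvp

u = [1.0, 2.0]
v = [0.5, -1.0]
value, jvp = complex_step_jvp(F, u, v)

# Hand-computed reference values for comparison.
exact_value = [u[0] ** 2 + u[1] - 3.0, math.sin(u[0]) + u[1] ** 3]
exact_jvp = [2.0 * u[0] * v[0] + v[1],
             math.cos(u[0]) * v[0] + 3.0 * u[1] ** 2 * v[1]]
```

One caveat of this approach: every operation inside `F` has to be complex-safe, so constructs such as `abs` or comparisons on the perturbed variables would need special care.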
Jacobian Computation

Approximating the Jacobian can be done via finite differences:
$$J(F)(u)v \approx \frac{1}{h}\left(F(u + hv) - F(u)\right)$$
If $v$ is set to the $k$th column of the identity matrix $I_N$, $J(F)(u)v$ will be the $k$th column of $J(F)(u)$:
$$J(F)(u)\,I_N = J(F)(u)$$
Approximating the Jacobian with this approach will require $N$ additional function evaluations, which is unacceptably high for large $N$.
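
The column-by-column construction above might look like the following sketch. The three-equation residual and the evaluation counter are hypothetical, added only so that the cost of $N$ extra evaluations becomes visible.

```python
def fd_jacobian(F, u, h=1e-7):
    """Build J(F)(u) column by column; column k uses v = e_k.

    Costs len(u) function evaluations on top of F(u).
    """
    n = len(u)
    Fu = F(u)
    J = [[0.0] * n for _ in range(n)]
    for k in range(n):
        shifted = list(u)
        shifted[k] += h  # u + h * e_k
        Fk = F(shifted)
        for i in range(n):
            J[i][k] = (Fk[i] - Fu[i]) / h
    return J

counter = {"calls": 0}

def F(u):
    """Hypothetical residual that counts how often it is evaluated."""
    counter["calls"] += 1
    x, y, z = u
    return [x * y - 1.0, y * y + z, x + z ** 3]

J = fd_jacobian(F, [1.0, 2.0, 3.0])
```

For this $N = 3$ example the counter ends at 4: one evaluation of $F(u)$ plus one per column.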

Jacobian Computation

Approximating the Jacobian via finite differences is practical when working with a sparse matrix. The nonzero structure of the matrix may produce columns which are structurally orthogonal, meaning they share no nonzero rows. These columns can be computed with a single function evaluation. Grouping the columns this way is known as coloring.

Jacobian Computation

Approximating the Jacobian can be done via automatic differentiation (AD). This will compute derivatives of functions without loss of accuracy from cancelation or truncation, as was present in finite differences.

AD likely requires access to the source code, which may be unreasonable in some cases.

Where are we now?

We have the following steps to solve $F(u) = 0$:
1. Use Newton's method to iterate from an initial guess $u_0$ to the solution $u^*$.
2. Find the next iterate by solving $\|J(F)(u_k)\,\Delta u_k + F(u_k)\| < \epsilon$ iteratively.
3. ** Precondition the iterative method using an approximate Jacobian.
4. Apply line search to the Newton iterate to improve convergence.

Preconditioners

Now that we have an approximate Jacobian via coloring, how can we precondition our system?

There are literally thousands of preconditioners that exist for solving systems. There is a cottage industry for every application where a specialized preconditioner could exist.

The most common preconditioners are:
- LU - Use the full inverse of $M$.
- ILU - Cheaply approximate the full inverse while controlling memory costs.
- Multigrid - Multilevel solvers are much more complicated but helpful for many problems.
- Schwarz - Domain decomposition techniques help reduce parallel communication and improve scalability.
- FFT - Some systems respond well to transforms.
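
To illustrate the sparse finite difference (coloring) idea from the Jacobian Computation discussion above, here is a sketch for a residual whose Jacobian is tridiagonal. The residual, the hard-coded three-color grouping, and the step `h` are assumptions made for this example. A general code would derive the grouping from the sparsity pattern with a graph coloring algorithm.

```python
def F(u):
    """Hypothetical residual with a tridiagonal Jacobian (three-point stencil)."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(left - 2.0 * u[i] + right + u[i] ** 2)
    return out

def colored_fd_jacobian(F, u, h=1e-7):
    """Finite difference Jacobian using 3 colors.

    Columns k, k+3, k+6, ... never share a nonzero row, so a single
    perturbed evaluation recovers all of them at once.
    """
    n = len(u)
    Fu = F(u)
    J = [[0.0] * n for _ in range(n)]
    evaluations = 0
    for color in range(3):
        cols = range(color, n, 3)
        shifted = list(u)
        for k in cols:
            shifted[k] += h
        Fc = F(shifted)
        evaluations += 1
        for k in cols:
            # Rows touched by column k in a tridiagonal matrix.
            for i in range(max(0, k - 1), min(n, k + 2)):
                J[i][k] = (Fc[i] - Fu[i]) / h
    return J, evaluations

u = [0.1 * i for i in range(9)]
J, evaluations = colored_fd_jacobian(F, u)
```

The count stays at 3 extra evaluations no matter how large $N$ is, compared with $N$ for the column-by-column approach.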

Preconditioners

What does "I'll use preconditioner [pick one]" mean?

When we compute $M$ via coloring we get a matrix $M \approx J(F)(u_k)$. This matrix is not necessarily the matrix which is inverted in $AM^{-1}b$. What's going on here?

In order to make $M^{-1}$ easier to compute, some values are often discarded from $M$ before computing $M^{-1}$.

Components in preconditioned GMRES
- $J(F)(u_k)v$ products are approximated via finite differences
- $M$ is an approximate Jacobian computed via finite differences with coloring
- $M^{-1}$ is applied efficiently by dumping some values in $M$

Note that $(M^{-1})^{-1} \neq M$ because some values are lost.

Preconditioners

For example, consider a simple Schwarz preconditioner called block Jacobi on 2 processors. Each processor retains only the $M$ values which it owns, and ignores the rest. The blocks of $M$ are inverted by LU.
$$M = \begin{pmatrix} M_1 & M_2 \\ M_3 & M_4 \end{pmatrix}, \qquad M^{-1} \approx \begin{pmatrix} M_1^{-1} & 0 \\ 0 & M_4^{-1} \end{pmatrix}$$
Even though the full matrix $M$ may have been computed, some terms were dumped to speed up the computation and application of $M^{-1}$.

Preconditioners

To allow for a speedy solve, the preconditioners have to be tailored to the physics of the system:
1. If the system is well-conditioned, ILU may be used in place of LU.
2. If the system is elliptic, Multigrid will be effective.
3. If you need a large system solved, Schwarz methods will allow you to reduce communication between processors.
4. When the system is very ill-conditioned, sometimes all you can use is LU.

The more you know about the system, the better your preconditioner can be...

Example of preconditioning

The neutral terms make the system so ill-conditioned that the LU preconditioner needs to be used.
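
The block Jacobi example above can be sketched on a tiny matrix. Everything concrete here is an assumption made for illustration: the 4x4 matrix `M`, the closed-form 2x2 inverse standing in for a per-processor LU solve, and the use of a preconditioned Richardson iteration as a simpler stand-in for preconditioned GMRES.

```python
def inv2(B):
    """Inverse of a 2x2 block (stands in for a per-processor LU solve)."""
    (a, b), (c, d) = B
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

# Hypothetical 4x4 approximate Jacobian, viewed as 2x2 blocks M1..M4.
M = [[4.0, 1.0, 0.5, 0.0],
     [1.0, 4.0, 0.0, 0.5],
     [0.5, 0.0, 4.0, 1.0],
     [0.0, 0.5, 1.0, 4.0]]
M1_inv = inv2([row[:2] for row in M[:2]])
M4_inv = inv2([row[2:] for row in M[2:]])

def apply_block_jacobi(r):
    """Apply diag(M1^-1, M4^-1); the off-diagonal blocks M2, M3 are ignored."""
    return matvec(M1_inv, r[:2]) + matvec(M4_inv, r[2:])

def preconditioned_richardson(b, tol=1e-12, max_iter=200):
    """Iterate x <- x + P^-1 (b - M x) with P^-1 the block Jacobi operator."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        r = [bi - mi for bi, mi in zip(b, matvec(M, x))]
        if max(abs(ri) for ri in r) < tol:
            return x, k
        x = [xi + di for xi, di in zip(x, apply_block_jacobi(r))]
    return x, max_iter

b = [1.0, 2.0, 3.0, 4.0]
x, iterations = preconditioned_richardson(b)
```

Each block solve touches only locally owned data, which is what makes the approach attractive in parallel. The price is that the discarded coupling blocks must be recovered by the outer iteration.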

Example of preconditioning

The LU preconditioner shows poor scalability. What can we do?

Example of preconditioning

What if we used a targeted approach of solving the ill-conditioned neutral velocity terms with LU, elliptic neutral density terms with Multigrid, and well-conditioned plasma terms with a Schwarz method?

Example of preconditioning

By targeting the preconditioning, the solver can be sped up significantly because unnecessary work is removed from the process.

Conclusion

Today we have gone over techniques to make Newton's method a practical solver for nonlinear systems $F(u) = 0$:
- Line search is a common approach to allow for bad initial guesses.
- Iterative solvers may be used to find Newton directions.
- Jacobian-vector products can be approximated via finite differences.
- A preconditioning matrix can be computed with graph coloring.
- Targeting your preconditioner to your system can speed it up significantly.

Other cool stuff

There are other things which may be important in speeding up your nonlinear solver, including
- Jacobian lagging - Recompute the preconditioner less frequently, since matrix-vector products are independent of the $M$ matrix.
- Variable linear tolerance (Eisenstat-Walker trick) - Some of your linear solves can be crummy and you can still reach the solution.
- Nonlinear preconditioning - Is there an $\tilde{F}$ which you can apply as $\tilde{F}(F(u)) = 0$ to make your system easier to solve?
- High order finite differences - Will more accurate Jacobian-vector products speed the solution?

Other bad stuff

There are problems I didn't talk about today
- Jacobian coloring - How does your choice of coloring hurt the accuracy of the finite difference approximation?
- Line search - Can this trap you in a local minimum?
- Preconditioning - How do I pick a good preconditioner? Note: This is the main impediment that keeps people from using implicit methods.
- Storage - Newton-Krylov-Schwarz can demand a lot of memory that simpler nonlinear schemes don't demand.
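
As a rough illustration of the variable linear tolerance mentioned under "Other cool stuff", the sketch below computes a relative tolerance $\eta_k = \gamma\,(\|F(u_k)\| / \|F(u_{k-1})\|)^{\alpha}$, one commonly quoted form of the Eisenstat-Walker forcing term. The parameter values, the cap `eta_max`, and the residual history are assumptions for illustration, and the safeguards used in practice are omitted.

```python
def eisenstat_walker_eta(norm_F, norm_F_prev, gamma=0.9, alpha=2.0,
                         eta_max=0.9):
    """Relative linear tolerance eta_k = gamma * (|F_k| / |F_{k-1}|)^alpha,
    capped at eta_max. The linear solve then only needs to satisfy
    |J du + F_k| <= eta_k * |F_k|."""
    eta = gamma * (norm_F / norm_F_prev) ** alpha
    return min(eta, eta_max)

# Hypothetical residual norms from a converging Newton iteration.
residual_norms = [1.0, 0.5, 0.1, 1e-3]
etas = [eisenstat_walker_eta(cur, prev)
        for prev, cur in zip(residual_norms, residual_norms[1:])]
```

Early on, when the residual drops slowly, the linear solves are allowed to be loose. The tolerance tightens automatically as the nonlinear iteration starts converging quickly.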