Numerical Methods for Differential Equations

Gustaf Söderlind
Numerical Methods for Differential Equations
An Introduction to Scientific Computing
November 29, 2017
Springer

Contents

Part I Scientific Computing: An Orientation
1 Why numerical methods?; Concepts and problems in analysis; Concepts and problems in algebra; Computable problems; Principles of numerical analysis; The First Principle: Discretization; The Second Principle: Polynomials and linear algebra; The Third Principle: Iteration; The Fourth Principle: Linearization; Correspondence Principles; Accuracy, residuals and errors; Differential equations and applications; Initial value problems; Two-point boundary value problems; Partial Differential Equations; Summary: Objectives and methodology

Part II Initial Value Problems
5 First Order Initial Value Problems; Existence and uniqueness; Stability; Stability theory; Linear stability theory; Logarithmic norms; Inner product norms

6.4 Matrix categories; Nonlinear stability theory; Stability in discrete systems; The Explicit Euler method; Convergence; Alternative bounds; The Lipschitz assumption; The Implicit Euler Method; Convergence; Numerical stability; Stiff problems; Simple methods of order; Conclusions; Runge-Kutta methods; An elementary explicit RK method; Explicit Runge-Kutta methods; Taylor series expansions and elementary differentials; Order conditions and convergence; Implicit Runge-Kutta methods; Stability of Runge-Kutta methods; Embedded Runge-Kutta methods; Adaptive step size control; Stiffness detection; Linear Multistep methods; Adams methods; BDF methods; Operator theory; General theory of linear multistep methods; Stability and convergence; Stability regions; Adaptive multistep methods; Special Second Order Problems; Standard methods for the harmonic oscillator; The symplectic Euler method; Hamiltonian systems; The Störmer-Verlet method; Time symmetry, reversibility and adaptivity

Part III Boundary Value Problems

12 Two-point Boundary Value Problems; The Poisson equation in 1D. Boundary conditions; Existence and uniqueness; Notation: Spaces and norms; Integration by parts and ellipticity; Self-adjoint operators; Sturm-Liouville eigenvalue problems; Finite difference methods; FDM for the 1D Poisson equation; Toeplitz matrices; The Lax Principle; Other types of boundary conditions; FDM for general 2pBVPs; Higher order methods. Cowell's difference correction; Higher order problems. The biharmonic equation; Singular problems and boundary layer problems; Nonuniform grids; Finite element methods; The weak form; The cg(1) finite element method in 1D; Convergence; Neumann conditions; cg(1) FEM on nonuniform grids

Part I Scientific Computing: An Orientation

Chapter 1
Why numerical methods?

Numerical computing is the continuation of mathematics by other means

Science and engineering rely on both qualitative and quantitative aspects of mathematical models. Qualitative insight is usually gained from simple model problems that may be solved using analytical methods. Quantitative insight, on the other hand, typically requires more advanced models and numerical solution techniques. With increasingly powerful computers, ever larger and more complex mathematical models can be studied. Results are often analyzed by visualizing the solution, sometimes for a large number of cases defined by varying some problem parameter.

Scientific computing is the systematic use of highly specialized numerical methods for solving specific classes of mathematical problems on a computer. But are numerical methods different from just solving the mathematical problem and then inserting the data to evaluate the solution? The answer is yes. Most problems of interest do not have a closed form solution at all. There is no formula to evaluate. The problems are often nonlinear and almost always too complex to be solved by analytical techniques. In such cases numerical methods allow us to use the powers of a computer to obtain quantitative results. All important problems in science and engineering are solved in this manner.

It is important to note that a numerical solution is approximate. As we cannot obtain an exact solution to our problem, we construct an approximating problem that is amenable to automatic computation. The construction and analysis of computational methods is a mathematical science in its own right. It deals with questions such as how to obtain accurate results, and whether they can be computed efficiently.

This cannot be taken for granted. As a simple example, let us consider the problem of solving a linear system of equations, $Ax = b$, on a computer using standard Gaussian elimination. Let us assume that $A$ is large, with (say) $N = 10{,}000$ equations, and that $A$ is a dense matrix. Because Gaussian elimination has an operation count of $O(N^3)$, the total number of operations in solving the problem is on the order of $N^3 = 10^{12}$. Not only do we need a fast computer and a large memory, but we might also ask whether it is at all possible to obtain any accuracy. After all, in standard IEEE arithmetic, operations are only carried out to 16 digit precision. Can we trust the computed result, given that our computational sequence is so long?
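The question raised above can be probed empirically on a smaller system. The following MATLAB sketch is only an illustration under stated assumptions: the size N = 200 and the diagonally dominant test matrix are hypothetical choices, not taken from the text, and the built-in backslash operator stands in for Gaussian elimination.

```matlab
% Sketch: solve a dense linear system and inspect the attainable accuracy.
% N and the test matrix are illustrative assumptions, not from the text.
N = 200;
[I, J] = meshgrid(1:N);
A = cos(I .* J) + 2*N*eye(N);      % dense, strongly diagonally dominant matrix
x_true = ones(N, 1);               % known solution, so the error can be measured
b = A * x_true;

x = A \ b;                         % elimination-based direct solve

rel_err = norm(x - x_true) / norm(x_true);
rel_res = norm(A*x - b) / norm(b);
fprintf('N = %d, on the order of N^3 = %.1e operations\n', N, N^3);
fprintf('relative error    = %.2e\n', rel_err);
fprintf('relative residual = %.2e\n', rel_res);
```

For a well-conditioned matrix such as this one, the error stays close to the roundoff level. An ill-conditioned matrix of the same size could behave very differently, so the experiment says nothing about the general case.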

This is a nontrivial issue, and the answer depends both on the problem's mathematical properties and on the numerical algorithms used to solve the problem. It typically requires a high level of mathematical and numerical skill to deal with such problems successfully. Nevertheless, it can be done on a routine basis, provided that one masters central notions and techniques in scientific computing.

Mathematical problems can roughly be divided into four categories. On the one hand, we distinguish problems in algebra from problems in analysis. On the other hand, we distinguish between linear problems and nonlinear problems, see Table 1.1. The distinction between these categories is very important. In algebra, all constructs are finite, but in analysis, one is allowed to use transfinite constructs such as limits. Thus analysis involves infinite processes, and whether these converge or not. In scientific computing this presents an issue, because on a computer we can only carry out finite computations. Therefore all computational methods are based on algebraic methodology, and it becomes a central issue whether we can devise algebraic problems that (in one sense or another) have solutions that approximate the solutions of analysis problems.

Table 1.1 Categories of mathematical problems. Only problems in linear algebra are computable.

    Category     Algebra          Analysis
    Linear       computable       not computable
    Nonlinear    not computable   not computable

In a similar way, the distinction between linear and nonlinear problems is of importance. Most classical mathematical techniques deal with linear problems, which share a lot of structure, while nonlinear problems usually have to be approached on a case-by-case basis, if they can be solved by analytical techniques at all.

Borrowing a notion from computer science, we say that a solution is computable if it can be obtained by a finite algorithm. However, a computer is limited to finite combinations of the four arithmetic operations, $+$, $-$, $\times$, $\div$, and logical operations. Therefore, in general, only problems in the linear algebra category are computable, with the standard examples being linear systems of equations, linear least squares problems, linear programming problems in optimization, and some problems in discrete Fourier analysis.

There are a few problems in analysis or nonlinear problems that can be solved by analytical techniques, but the vast majority cannot. For such problems, the only way to obtain quantitative results is by using numerical methods to obtain approximate results. This is where scientific computing plays its central role, and as we shall see, computational methods almost invariably work by, in one way or another, approximating the problem at hand by (a sequence of) problems in linear algebra, since this is where we possess finite, or terminating, algorithms.

However, in practice it is also impossible to solve linear systems exactly, unless they are very small. It is not uncommon to encounter large linear systems, perhaps having millions or even billions of unknowns. In such cases it is again only possible to obtain approximate results. This may perhaps sound disheartening to the mathematician, but the interesting challenge for the numerical analyst is to construct algorithms that have the capacity of computing approximations to any prescribed accuracy. Better still, if we can construct fast converging algorithms, such accurate approximations can be obtained with a moderate computational effort. Needless to say, there is a trade-off; more accuracy will almost always come at a cost, although on a good day, a good numerical analyst may occasionally enjoy a free lunch.

This lays out some of the interesting issues in scientific computing. The objective is to solve complex mathematical problems on a computer, to construct reliable mathematical software for given classes of problems, and to make sure that quantitative results can be obtained with both accuracy and efficiency. In order to further outline some basic thoughts, we have to investigate what mathematical concepts we have to deal with in the different problem categories above.

1.1 Concepts and problems in analysis

The central ideas of analysis that distinguish it from algebra are limits and convergence. The two most important concepts from analysis are derivatives and integrals. Let us start with the derivative, classically defined by the limit

$$\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \tag{1.1}$$

If the function $f$ is differentiable, the difference quotient converges to the limit $f'(x)$ as $h$ goes to zero. Later, we will consider several alternative expressions to that of (1.1) for approximating derivatives. In practical computations, mere convergence is rarely enough; we would also like fast convergence, so that the trade-off between accuracy and computational effort becomes favorable.

The derivatives of elementary functions, such as polynomials, trigonometric functions and exponentials, are well known. Nevertheless, if a function is complex enough, or an explicit expression of the function is unknown, the derivative may be impossible to compute analytically. However, it can always be approximated. At first, this may not seem to be an important application, but as we shall see later, it is important in scientific computing to approximate derivatives to a high order of accuracy, as this is key to solving differential equations.

Let us consider the problem of computing an algebraic approximation to (1.1). Since we cannot compute the limit in a finite process, we consider a finite difference approximation,

$$\frac{df}{dx} \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}, \tag{1.2}$$

where $\Delta x > 0$ is small, but nonzero. Thus the finite difference approximation simply skips taking the limit in (1.1), replacing the limit by (1.2). Let us see how accurate this approximation is.

Example. Consider the function $f(x) = e^x$ with derivative $f'(x) = e^x$, and consider computing a finite difference approximation to $f'(1)$, i.e.,

$$f'(1) \approx \frac{f(1 + \Delta x) - f(1)}{\Delta x}. \tag{1.3}$$

Since $f'(1) = e$, the relative error in the approximation is, by Taylor series expansion,

$$r = \frac{f(1 + \Delta x) - f(1)}{e\,\Delta x} - 1 = \frac{e^{\Delta x} - 1}{\Delta x} - 1 = \frac{\Delta x}{2} + O(\Delta x^2). \tag{1.4}$$

Thus we can expect to obtain an accuracy proportional to $\Delta x$. For example, in order to obtain six-digit accuracy, we should take $\Delta x \approx 10^{-6}$. In order to check this result, we compute the difference quotient (1.3) for a large number of values of $\Delta x$ and compare to the exact result, $f'(1) = e$, to evaluate the error.

Fig. 1.1 Finite difference approximation to a derivative. The graph plots the relative error $r$ in the finite difference approximation (1.3) vs. $\Delta x$ in a log-log diagram. The blue straight line on the right represents $r$ as described by (1.4). A nearby dashed reference line of slope $+1$ shows that $r = O(\Delta x)$. For $\Delta x < 10^{-8}$, however, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like $O(\Delta x^{-1})$, as indicated by the second dashed reference line on the left, of slope $-1$. Thus the maximum accuracy that can be obtained by (1.3) is about $10^{-8}$.

It is important to note that the approximation (1.2) is convergent, as shown by (1.4). Convergence means that the error can be made arbitrarily small by reducing $\Delta x$. However, in the numerical experiment in Figure 1.1, we see that we cannot obtain arbitrarily high accuracy. How is this discrepancy consistent with the claim that the approximation (1.2) is convergent?

The answer is that convergence can only take place on a dense set of numbers, such as the set of real numbers, which allows us to use the standard epsilon-delta arguments of analysis. However, the real number system cannot be implemented in computer arithmetic (since it requires an infinite amount of information), and in numerical computation we have to make do with computer representable numbers, usually those of the IEEE 754 standard. Because this is a finite set of numbers, the notion of convergence does not exist in computer arithmetic, which explains why there will be a breakdown, sooner or later, if we take $\Delta x$ too small.

Even so, the set of computer numbers is dense enough for almost all practical purposes, and in the right part of Figure 1.1, we can clearly see the beginning of convergence, as the error behaves exactly as expected according to (1.4). Thus we can in practice usually observe the correct order of convergence. In this case the order of convergence is $p = 1$, meaning that the error is $r = O(\Delta x^p)$.

In this example the results were plotted in a log-log diagram. This is a standard technique used whenever the plotted quantity obeys a power law. A power law is a relation of the form

$$\eta = C \cdot \xi^p.$$

Assuming that $\xi$ and $\eta$ are positive and taking logarithms, we obtain

$$\log \eta = \log C + p \log \xi.$$

This is the equation of a straight line. Thus $\log \eta$ is a linear function of $\log \xi$, with an easily identifiable slope of $p$, which is the power in the power law. In our example, $\eta$ is the error $r$, and $\xi$ is the step size $\Delta x$. Consequently, we have the power law $r \approx C \cdot \Delta x^p$, where we have neglected the higher order terms in the expansion (1.4). We plot $\log r$ as a function of $\log \Delta x$ to be able to identify $p$.
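The procedure described above, computing the difference quotient (1.3) for several step sizes and reading off the slope in a log-log diagram, might be sketched in MATLAB as follows. The range of step sizes and the use of `polyfit` for the straight-line fit are assumptions made for this illustration, not part of the original experiment.

```matlab
% Sketch: estimate the order p and constant C in r ~ C*dx^p for the
% forward difference quotient (1.3) applied to f(x) = exp(x) at x = 1.
f  = @(x) exp(x);
dx = 10.^(-(1:6))';                   % step sizes where truncation error dominates

dq = (f(1 + dx) - f(1)) ./ dx;        % difference quotient (1.3)
r  = abs(dq - exp(1)) / exp(1);       % relative error, cf. (1.4)

c = polyfit(log(dx), log(r), 1);      % fit log r = log C + p*log dx
p = c(1);                             % slope: observed order of convergence
C = exp(c(2));                        % intercept: leading error constant

fprintf('observed order p = %.3f, error constant C = %.3f\n', p, C);
```

The step sizes are deliberately kept above the roundoff-dominated range; including much smaller values of `dx` would contaminate the fit.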
In this case we have $p = 1$ and $C = 1/2$, which represents the leading term of the error. If $\Delta x$ is not too large, the higher order terms can be neglected and the order of convergence clearly observed.

A first order approximation, such as the one above, is rarely satisfactory in advanced scientific computing. We are therefore interested in whether it is possible to improve accuracy, and, in particular, to improve the order of convergence. This will play a central role in our analysis later, because it will allow us to construct highly accurate methods that require only a moderate computational effort. That sounds like a free lunch, but we shall see that if we develop our techniques well, it is indeed possible to obtain a vastly improved performance.

Example. We shall again consider the function $f(x) = e^x$ with derivative $f'(x) = e^x$. This time, however, we are going to use a symmetric finite difference approximation to $f'(1)$, of the form

$$f'(1) \approx \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2\,\Delta x}. \tag{1.5}$$

The limit, as $\Delta x \to 0$, equals the derivative, but as we shall see both theoretically and practically, the convergence is much faster. By expanding both $f(1 + \Delta x)$ and $f(1 - \Delta x)$ in a Taylor series around $x = 1$, we find the relative error

$$r = \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2e\,\Delta x} - 1 = \frac{e^{\Delta x} - e^{-\Delta x}}{2\,\Delta x} - 1 = \frac{\Delta x^2}{6} + O(\Delta x^4). \tag{1.6}$$

Once again we would observe a power law, but this time $p = 2$, which makes the symmetric finite difference approximation convergent of order $p = 2$. Again, we compare with real computations, and obtain the result in Figure 1.2. Evidently we can obtain considerably higher accuracy, while we still only use a finite difference quotient requiring two function evaluations.

Fig. 1.2 Symmetric finite difference approximation. Relative error $r$ in the symmetric difference quotient (1.5) is plotted vs. $\Delta x$ in a log-log diagram. The green straight line on the right represents $r$ as described by (1.6). A nearby dashed reference line of slope $+2$ shows that $r = O(\Delta x^2)$. For $\Delta x < 10^{-5}$, roundoff becomes dominant, as demonstrated by the red erratic part of the graph on the left. Roundoff error grows like $O(\Delta x^{-1})$, as indicated by the second dashed reference line on the left. The maximum accuracy that can be obtained by (1.5) is roughly $10^{-11}$, the value of (1.6) at $\Delta x = 10^{-5}$.

It is of some interest to make a closer comparison, and by superimposing the two plots we obtain Figure 1.3. This shows that a higher order approximation will produce much higher accuracy, and obtain that higher accuracy at a relatively large value of $\Delta x$. We also see that the roundoff error is largely unaffected.

To see how the roundoff error occurs, we assume that the function $f(x)$ can be evaluated to a relative accuracy of $\varepsilon \approx 10^{-16}$, which approximately equals the IEEE 754 roundoff unit. This means that we obtain an approximate value $\tilde{f}(x)$ satisfying

$$\frac{|\tilde{f}(x) - f(x)|}{|f(x)|} \le \varepsilon.$$
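The comparison of the two difference quotients, and the level at which roundoff takes over, can be explored with a short MATLAB experiment. This is a minimal sketch: the grid of step sizes is an assumed choice, and the reported minima depend on it.

```matlab
% Sketch: compare the forward quotient (1.3) and the symmetric quotient (1.5)
% for f(x) = exp(x) at x = 1 over a range of step sizes.
f  = @(x) exp(x);
dx = 10.^(-(1:12))';                                        % assumed step sizes

r1 = abs((f(1+dx) - f(1))    ./ dx     - exp(1)) / exp(1);  % first order
r2 = abs((f(1+dx) - f(1-dx)) ./ (2*dx) - exp(1)) / exp(1);  % second order

[m1, i1] = min(r1);                   % best accuracy reached by (1.3)
[m2, i2] = min(r2);                   % best accuracy reached by (1.5)

fprintf('forward:   min error %.1e at dx = %.0e\n', m1, dx(i1));
fprintf('symmetric: min error %.1e at dx = %.0e\n', m2, dx(i2));
fprintf('error ratio at dx = 1e-5: %.1e\n', r1(5) / r2(5));
```

Both error curves eventually grow again as `dx` decreases, which is the roundoff effect analyzed next.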

As a consequence, when we evaluate the difference quotient (1.5) we obtain a perturbed value,

$$f'(1) \approx \frac{\tilde{f}(1 + \Delta x) - \tilde{f}(1 - \Delta x)}{2\,\Delta x} = \frac{f(1 + \Delta x) - f(1 - \Delta x)}{2\,\Delta x} + \rho, \tag{1.7}$$

where we will have a maximum roundoff error of

$$|\rho| \le \hat{\rho} = \frac{\varepsilon f(1)}{\Delta x} = \frac{\varepsilon e}{\Delta x}.$$

Likewise, in (1.3) the maximum roundoff is $2\hat{\rho}$. Upon close inspection, the graphs also reveal that the roundoff is slightly larger in (1.3). However, in both cases $\hat{\rho} = O(\Delta x^{-1})$, meaning that the effect of the roundoff is inversely proportional to $\Delta x$ when $\Delta x \to 0$. In other words, the approximation eventually deteriorates when $\Delta x$ is reduced, explaining the negative unit slope observed in the three plots above.

Fig. 1.3 Comparison of finite difference approximations. Relative errors in first and second order finite difference approximations are compared as functions of $\Delta x$ in a log-log diagram. The straight lines correspond to order $p = 1$ (blue) and $p = 2$ (green), respectively. At $\Delta x = 10^{-5}$, the second order approximation is more than five orders of magnitude more accurate. Roundoff errors remain similar for both approximations.

The brief glimpse above of problems in analysis demonstrates that computations (and problem solving) require approximate techniques. These are constructed in such a way that the approximating quantities converge, in a mathematical sense, to the original analysis quantities. However, because convergence itself is a notion from analysis, the successive improvement of the approximations is limited by the accuracy of the computer number system. This means that, while it is possible to obtain highly accurate approximations, one must be aware of the shortcomings of finite computations. Some difficulties can be overcome by constructing fast converging approximations, but the bottom line is still that the approximations will be in error, and that it is important to be able to estimate and bound the remaining errors. This requires considerable mathematical skill as well as a thorough knowledge of the main principles of scientific computing.

1.2 Concepts and problems in algebra

The key concepts in linear algebra are vectors and matrices. The central problems relate vectors and matrices, and the most important problem is linear systems of equations,

$$Ax = b, \tag{1.8}$$

where $A$ is an $n \times n$ matrix, $x$ is the unknown $n$-vector, and $b$ is a given $n$-vector. In mathematics the solution (if it exists) is often expressed as

$$x = A^{-1} b, \tag{1.9}$$

where $A^{-1}$ is the inverse of $A$. In practice the inverse is rarely computed. Instead we employ some computational technique, such as Gaussian elimination, to solve the problem. This standard numerical method obtains the exact solution in a finite number of steps, i.e., the solution is computable. More precisely, it takes about $2n^3/3$ arithmetic operations to compute the solution in this way.

Although the expression $A^{-1}b$ is frequently used in theoretical discussions, Gaussian elimination avoids computing the inverse, which is more expensive to compute. In fact, the elimination procedure is equivalent to a matrix factorization, $A = LU$, where $L$ and $U$ are lower and upper triangular matrices. This transforms the original system to

$$LUx = b, \tag{1.10}$$

which is solved in two steps: first the triangular system $Ly = b$ is solved for $y$, and then the system $Ux = y$ is solved for $x$. This procedure is used for the simple reason that it is faster to solve triangular systems than full systems, thus economizing the total work for obtaining $x$.
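The two-step solution via triangular factors can be sketched in MATLAB as below. Note one deviation from the text: the built-in `lu` uses row pivoting and returns a permutation matrix `P` with $PA = LU$, rather than the plain factorization $A = LU$. The test matrix and right hand side are arbitrary illustrative choices.

```matlab
% Sketch: solve A*x = b via an LU factorization and two triangular solves.
% lu returns P*A = L*U (partial pivoting), a variant of A = L*U in the text.
n = 5;
A = magic(n) + n*eye(n);      % illustrative nonsingular test matrix
b = (1:n)';                   % illustrative right hand side

[L, U, P] = lu(A);            % the expensive step: factorize once

y = L \ (P*b);                % forward substitution:  L*y = P*b
x = U \ y;                    % back substitution:     U*x = y

res = norm(A*x - b) / norm(b);
fprintf('relative residual = %.2e\n', res);
```

Once `L`, `U` and `P` are available, further right hand sides only require the two cheap triangular solves.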
In summary, Gaussian elimination first factorizes the matrix $A = LU$ (this is the expensive operation), after which the equation solving is reduced to solving triangular systems. In scientific computing it is very common that one has to solve a sequence of systems,

$$A x^k = b^k, \tag{1.11}$$

where the right hand sides $b^k$ often depend on the previous solution $x^{k-1}$. Here the superscript denotes the recursion index, not to be confused with the vector component index. In such situations the LU factorization of $A$ is only carried out once, and the factors $L$ and $U$ are then used to solve each system (1.11) cheaply. These forward and back substitutions only have an operation count of $2n^2$ arithmetic operations.

Most computational algorithms in linear algebra are built on some matrix factorization technique, chosen to provide suitable benefits. For example, in linear least squares problems it is standard procedure to factorize the matrix $A = QR$, where $Q$ is an orthogonal matrix and $R$ is upper triangular. This again reduces the amount of work involved, and the use of orthogonal matrices is beneficial to maintaining stability and accuracy in long computations.

For very large systems (say problems involving millions of unknowns) the computational complexity of matrix factorization can become prohibitive. It is then common to employ iterative methods for solving linear systems. This means that we sacrifice obtaining an exact solution (up to roundoff), in favor of obtaining an approximate solution at a much reduced computational effort. In fact, large-scale problems in linear partial differential equations may be too large for factorization methods, leaving us with no choice other than iterative methods.

Another important standard problem in linear algebra is to find the eigenvalues and eigenvectors of a matrix $A$. They satisfy the equation

$$Ax = \lambda x. \tag{1.12}$$

As both $\lambda$ and $x$ are unknowns, this problem is technically nonlinear, although it is referred to as the linear eigenvalue problem. This designation stems from the fact that $A$ is a linear operator. The eigenvalues are the (complex) roots $\lambda$ of the characteristic equation

$$\det(A - \lambda I) = 0. \tag{1.13}$$

If $A$ is $n \times n$, then $P(\lambda) := \det(A - \lambda I)$ is a polynomial of degree $n$, so we seek the $n$ roots of the polynomial equation $P(\lambda) = 0$. Polynomial equations are obviously nonlinear whenever the degree of the polynomial is 2 or higher.
The major complication is, however, that there are in general no closed form expressions for the roots of (1.13) if $n > 4$. For higher order matrices, therefore, numerical methods are necessary. In practice eigenvalues are always computed by iterative methods, producing a sequence of approximations $\lambda_k$ converging to the eigenvalue $\lambda$. These computational techniques are nontrivial, and usually employ various matrix factorizations as part of the computation. If the matrix $A$ is large, the computational effort can easily become overwhelming, and most computational techniques for such problems offer the possibility of only computing a few (dominant) eigenvalues.

Thus we see that convergence becomes an issue also for computational methods in some standard linear algebra problems. Although this may appear counterintuitive, it shows that numerical methods are necessary in the vast majority of mathematical problems. Computational methods are constructive approximations, and are not a matter of merely inserting data into a mathematical formula in order to obtain the solution to the problem.
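As an illustration of how an iterative method produces a sequence of approximations converging to an eigenvalue, the following MATLAB sketch applies the power iteration to a small symmetric matrix. The power iteration is chosen here only because it is the simplest such method; it is not necessarily what production eigenvalue software uses, and the matrix, starting vector and tolerance are assumptions.

```matlab
% Sketch: power iteration for the dominant eigenvalue of a small matrix.
% Matrix, starting vector and tolerance are illustrative assumptions.
A = [2 1 0; 1 3 1; 0 1 4];
x = ones(3, 1);                       % starting vector
lambda = 0;

for k = 1:200
    y = A * x;
    lambda_new = (x' * y) / (x' * x); % Rayleigh quotient approximation
    x = y / norm(y);                  % normalize to avoid overflow
    converged = abs(lambda_new - lambda) < 1e-12;
    lambda = lambda_new;
    if converged
        break;
    end
end

lam_ref = max(eig(A));                % reference value from the built-in solver
fprintf('power iteration: %.10f after %d steps (eig: %.10f)\n', lambda, k, lam_ref);
```

Only the dominant eigenvalue is found this way, which mirrors the remark that many techniques compute just a few eigenvalues.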

It is fair to say that almost every single problem in applied mathematics involves linear algebra problems at some level. Therefore linear algebra techniques are at the core of scientific computing, although we will see that real problems in science and engineering often have to be approached by a number of preparatory techniques before the task has been brought to a linear algebra problem where standard techniques may be employed.

1.3 Computable problems

We have seen that mathematical problems can be divided into problems from algebra and analysis, and that these two categories can further be divided into linear and nonlinear problems. That leaves us with four categories of problems. We say that a problem is computable if its exact solution can be obtained in a finite number of operations. Unfortunately, only some of the problems in linear algebra are computable.

It is important to note that the notion of computable problems refers to exact arithmetic; it does not account for inexact computer arithmetic. For example, Gaussian elimination will in theory solve a linear system exactly in a finite number of operations, but on the computer it is subject to roundoff errors, since it is impossible to implement the real number system on a computer. Instead we have to make do with the IEEE 754 computer arithmetic. This system only contains a (small) subset of the rational numbers, and further, each arithmetic operation will in general incur additional errors, since the set of computer representable numbers is not closed under the standard arithmetical operations. As a consequence, not even computable problems are necessarily solved exactly on a computer.

This is why numerical methods are necessary, even for the smallest of problems. They are not an alternative to analytical techniques, but rather the only way to obtain quantitative results.
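That the computer representable numbers are not closed under arithmetic can be seen in a few lines of MATLAB. This is a small sketch using the standard double precision type; the specific numbers are illustrative choices.

```matlab
% Sketch: IEEE 754 double precision is not closed under addition.
a = 0.1 + 0.2;                 % neither operand nor the sum is exactly representable
is_equal = (a == 0.3);         % comparison with the nearest double to 0.3
gap = abs(a - 0.3);            % size of the discrepancy

absorbed = (1 + eps/4 == 1);   % a small term is lost entirely when added to 1

fprintf('0.1 + 0.2 == 0.3 ?  %d  (difference %.2e)\n', is_equal, gap);
fprintf('1 + eps/4 == 1   ?  %d  (eps = %.2e)\n', absorbed, eps);
```

Here `eps` is the built-in spacing between 1 and the next larger double, so a perturbation of `eps/4` is rounded away.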
In spite of the necessity of using computational methods in almost all of applied mathematics, the analytical tools of pure mathematics are no less important in numerical analysis. The construction and analysis of computational methods rely on basic as well as advanced concepts in mathematics, which the numerical analyst must master in order to devise stable, accurate and efficient computational methods.

Numerical analysis is the continuation of mathematics by other means. Its goal is to construct computational methods and provide software for the efficient approximate solution of mathematical problems of all kinds, using the computer as its main tool. The aim is to compute approximations to a prescribed accuracy, preferably in tandem with error bounds or error estimates. In other words, we want to design methods that produce accurate approximations that converge as fast as possible to the exact solution. This convergence will only be observed in part, as the exact solution is generally both unknown and non-computable.

The design of such computational methods is a nontrivial task. Although their aim is to address the most challenging problems in nonlinear analysis, computational methods are usually assessed by applying them to well known problems from applied mathematics, where there are known analytical expressions for the exact solution. It is of course not necessary to solve such problems numerically, but they remain the best benchmarks for new computational methods. Thus, a method which cannot excel at solving a standard problem in applied mathematics is almost bound to fail on real-life problems.

Chapter 2
Principles of numerical analysis

The subject of numerical analysis revolves around how to construct computable problems that approximate the mathematical problem of interest. There is a very large number of computational methods, superficially having rather little in common. But it is important to realize that all computational methods are constructed from, and rely on, only four basic principles of numerical analysis. These are:

    1. The principle of discretization
    2. Linear algebra, including polynomial spaces
    3. The principle of iteration
    4. The principle of linearization.

Because the computable problems are essentially those of linear algebra, almost all numerical methods will at the core work with linear algebra techniques. The purpose of the other three principles listed above is to construct various approximations that bring the original problem to computable form. Below we shall outline what these principles entail and why these ideas are so important. They will be encountered throughout the book in many different shapes.

2.1 The First Principle: Discretization

The purpose of discretization is to reduce the amount of information in a problem to a finite set, making way for algebraic computations. It is used as a general technique for converting analysis problems into (approximating) algebra problems, see Table 1.1, and is the key to numerical methods for differential equations.

Consider an interval $\Omega = [0,1]$ and select from it a subset $\Omega_N = \{x_0, x_1, \dots, x_N\}$ of distinct points, ordered so that $x_0 < x_1 < \dots < x_N$. The set $\Omega_N$ is called discrete, to distinguish it from the continuous set (or rather the continuum) $\Omega$. We refer to $\Omega_N$ as the grid.

Let a continuous function $f$ be defined on $\Omega$. Discretization (of $f$) means that we convert $f$ into a vector, by the map $f \mapsto F = f(\Omega_N)$, i.e.,

$$F = \begin{pmatrix} f(x_0) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.1}$$

The function $f$ has the independent variable $x \in [0,1]$, meaning that $f(x)$ is a particular value of that function. Likewise, the vector $F$ has an index as its independent variable, and $F_k = f(x_k)$ is a particular value of the vector, with $0 \le k \le N$. As $F$ is only defined on the grid, it is also called a grid function, see Figure 2.1.

Fig. 2.1 Discretization of a function. A continuous function $f$ on $[0,1]$ (top). A grid $\Omega_N$ is selected on the x-axis and samples from $f$ are drawn (center). The grid function $F$ (bottom) is a vector with a finite number of components (red), plotted vs. the grid points $\Omega_N$ (black).

As the grid function is obtained by drawing equidistant samples of the function $f$, we may in effect think of the process as an analog-to-digital conversion, akin to recording an analog audio signal to a digital format.

For theoretical reasons, one may also wish to consider the case $N \to \infty$. In such situations the grid as well as grid functions are sequences rather than vectors. In practical computations, however, the number is always finite, meaning that computational methods work with vectors as discrete representations of functions.

To see how discretization can be used in differential equations, consider the simple radioactive decay problem

$$\dot{u} = \alpha u; \quad u(0) = u_0, \tag{2.2}$$

with exact solution $u(t) = e^{\alpha t} u_0$, and suppose we want to solve this equation on $[0,T]$. To make the problem computable, we will turn it into a linear algebra problem by using a discrete approximation to the derivative. Introduce a grid $\Omega_N = \{t_0, \dots, t_N\}$ with $t_k = k\,\Delta t$ and $N \Delta t = T$. Noting that

$$\dot{u}(t) = \lim_{\Delta t \to 0} \frac{u(t + \Delta t) - u(t)}{\Delta t}, \tag{2.3}$$

we introduce a grid function $U$ approximating $u$, i.e., $U_k \approx u(t_k)$. We then have

$$\dot{u}(t_k) \approx \frac{u(t_k + \Delta t) - u(t_k)}{\Delta t} \approx \frac{U_{k+1} - U_k}{\Delta t}. \tag{2.4}$$

This is referred to as a finite difference approximation. Next, we will use this to replace the derivative in (2.2). Thus the original, non-computable problem is replaced by a computable, discrete problem,

$$\frac{U_{k+1} - U_k}{\Delta t} = \alpha U_k. \tag{2.5}$$

This can be rewritten as a linear system of equations,

$$\begin{pmatrix} 1 & & & \\ -(1 + \alpha \Delta t) & 1 & & \\ & \ddots & \ddots & \\ & & -(1 + \alpha \Delta t) & 1 \end{pmatrix} \begin{pmatrix} U_0 \\ U_1 \\ \vdots \\ U_N \end{pmatrix} = \begin{pmatrix} u_0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{2.6}$$

showing that the approximating problem is indeed a problem of linear algebra. As the matrix is lower triangular, the system is easily solved using forward substitution, meaning that we can compute successive approximations to $u(\Delta t), u(2\Delta t), \dots$ by repetitive use of the formula

$$U_{k+1} = (1 + \alpha \Delta t)\, U_k.$$

We then find that

$$u(k\,\Delta t) \approx U_k = (1 + \alpha \Delta t)^k u_0.$$

All approximations are obtained using elementary arithmetic operations. If, in particular, we want to find an approximation to $u(T)$ in $N$ steps, we take $\Delta t = T/N$, and get

$$u(T) \approx U_N = \left(1 + \frac{\alpha T}{N}\right)^N u_0.$$

From analysis we recognize the limit

$$\lim_{N \to \infty} \left(1 + \frac{\alpha T}{N}\right)^N = e^{\alpha T},$$

so apparently the method is convergent: the numerical approximation approaches the exact solution $u(T) = e^{\alpha T} u_0$ as $N \to \infty$.

Formally, the method needs an infinite number of steps to generate the exact solution. Although this may at first seem to be a drawback, it is the very idea of convergence that saves the situation. Thus, for every $\varepsilon > 0$, an $\varepsilon$-accurate approximation is computable in $N(\varepsilon)$ steps. This means that we can obtain a solution to any desired accuracy with a finite effort. Although $N(\varepsilon)$ is finite for every $\varepsilon > 0$, it need not be bounded as $\varepsilon \to 0$; the numerical problem is still computable, although the original analysis problem is not.

We noted above that the discretization is built on the finite difference approximation (2.4), which incidentally is the same as the approximation (1.2) we used to approximate a derivative numerically. The resulting method, (2.5), is known as the explicit Euler method. Since the finite difference approximation (2.4) of the derivative is of first order, we might expect that the resulting method for the differential equation is also first order convergent. Indeed, one can show that

\[ |U_N - u(T)| \le C \cdot N^{-1} = O(\Delta t). \]

Because the error is proportional to $\Delta t^p$ with $p = 1$, the method is first order convergent. This is a slow convergence; if we want ten times higher accuracy, we will have to use a ten times shorter time step $\Delta t$, or, equivalently, ten times more steps (work) to reach the end point $T$, since $\Delta t = T/N$.

This is demonstrated below using a MATLAB implementation of the function eulertest(alpha, u0, t0, tf), where alpha is the parameter $\alpha$ in the problem, u0 is the initial value $u(0)$, and t0 and tf are the initial time and terminal time of the integration. The problem was run with $\alpha = -1$ on the time interval $[0,1]$ with initial value $u(0) = 1$. It was further run for $N = 10, 100, 1000$ and $10000$, corresponding to step sizes $\Delta t = 1/N$. The results are found in Figures 2.2 and 2.3.
```matlab
function eulertest(alpha, u0, t0, tf)
% Test of explicit Euler method. Written (c)

for k = 1:4
    N = 10^(5-k);                        % Number of steps
    dt = (tf-t0)/N;                      % Time step
    u = u0;                              % Initial value
    t = t0;
    sol = u;
    time = t;
    err = 0;

    for i = 1:N                          % Time stepping loop
        u = u + dt*alpha*u;              % A forward Euler step
        t = t + dt;                      % Update time
        sol = [sol u];                   % Collect solution data
        time = [time t];
    end

    uexact = exp(alpha*(time - t0))*u0;  % Exact solution
    error = sol - uexact;                % Numerical error
    h(k) = dt;                           % Save step size
    r(k) = abs(error(end));              % Save endpoint error

    figure(1)
    subplot(2,1,1)
    plot(time,sol,'r')                   % Numerical solution (red)
    hold on
    plot(time,uexact,'b')                % Exact solution (blue)
    grid on
    % axis([ ])                          % Axis limits are not legible in the source
    xlabel('t')
    ylabel('u(t)')
    hold off
    subplot(2,1,2)
    semilogy(time,abs(error))            % Error vs. time
    grid on
    hold on
    axis([0 1 1e-6 1e-1])
    xlabel('t')
    ylabel('error')
end

figure(2)
loglog(h,r,'b')                          % Endpoint error vs dt
xlabel('dt')
ylabel('error')
grid on
hold on
xref = [1e-4 1e-1];
yref = [1e-5 1e-2];
loglog(xref,yref,'k--')                  % Reference line of slope 1
hold off
```

The explicit Euler method performs as expected. Although the results are textbook perfect, the convergence is slow and the accuracy is less impressive. Even at $N = 10^4$ we do not obtain more than four-digit accuracy.

However, when approximating derivatives, we also used a second order approximation, (1.5). It would then be of interest to try out that approximation too, for the differential equation $\dot u = \alpha u$. This leads to a different discretization,

\[ \frac{V_{k+1} - V_{k-1}}{2\Delta t} = \alpha V_k, \tag{2.7} \]

where $V_k \approx u(t_k)$ as before, and $t_k = k\,\Delta t$. By Taylor series expansion, we find that

\[ \frac{u(t_{k+1}) - u(t_{k-1})}{2\Delta t} = \frac{u(t_k+\Delta t) - u(t_k-\Delta t)}{2\Delta t} = \dot u(t_k) + O(\Delta t^2), \]

so we would expect the method (2.7) to be of second order and therefore more accurate than the explicit Euler method. The method is known as the explicit midpoint method. We note that there is one minor complication in using this method: it requires two starting values instead of one, since we need both $V_0$ and $V_1$ to get the recursion

\[ V_{k+1} = V_{k-1} + 2\alpha\Delta t\, V_k \tag{2.8} \]

started. We shall take $V_0 = u(0)$ and $V_1 = e^{\alpha\Delta t} u(0)$. This corresponds to taking initial values from the exact solution.

Fig. 2.2 Test of the explicit Euler method. Top panel shows the exact solution (blue) and the numerical solution (red) using $N = 10$ steps. The step size is coarse enough to make the error readily visible. Bottom panel shows the error $|U_k - u(t_k)|$ along the solution in a lin-log diagram. From top to bottom, the individual graphs correspond to $N = 10, 100, 1000$ and $N = 10000$. At each point in time, the error decreases by a factor of 10 when $N$ increases by a factor of 10, demonstrating that the error is $O(\Delta t)$.

The previous code can easily be modified to test the explicit midpoint method. When this was done, the code was tested in a slightly more challenging setting, by taking $\alpha = -4.5$ but otherwise solving the same problem as before. A wider range of step sizes was chosen, by taking $N = 10, 32, 100, \dots, 32000, 100000$, so that $\Delta t$ varies between $0.1$ and $10^{-5}$.

This time the results are not in line with expectations. As we see in Figures 2.4 and 2.5, the method appears to be second order convergent, but the error is not as regular as one would have hoped for. This turns out to depend on the construction of the method, and it illustrates that advanced methods in scientific computing cannot, in general, be constructed by intuitive techniques. The loss of performance for this

method is due to some stability issues; in fact, this method is unsuitable for the radioactive decay problem, although it excels for other classes of problems, such as problems in celestial mechanics and in hyperbolic partial differential equations. Thus, one needs a deep understanding of both the mathematical problem and the discretization method in order to match the two and obtain reliable results. For the time being, and until the computational behavior of the explicit midpoint method has been sorted out and analyzed, we have to consider the method a potential failure, in spite of the unquestionable success we observed before, when using exactly the same difference quotient for approximating derivatives.

Fig. 2.3 Test of the explicit Euler method. The endpoint error $|U_N - u(1)|$ is plotted in a log-log diagram for $N = 10, 100, 1000$ and $N = 10000$ (blue). A dashed reference line of slope 1 shows that the error is $O(\Delta t)$, i.e., the method is first order convergent.

We have seen above that a discretization method only computes approximations at discrete points to a continuous function. In the simplest case, we approximate derivatives at distinct points, and in the more advanced cases we compute grid functions (vectors) that approximate a continuous function solving a differential equation. The distinctive feature is that discretization uses algebraic computations to approximate problems from analysis. Thus the computations can be carried out in finite time, at the price of being approximate. Even so, we have seen that it is a nontrivial task to find such approximations; intuitive techniques do not always produce the desired results. For this reason, we need to carefully examine how approximate methods are constructed, and how the approximate methods differ in character from the exact solutions to problems of analysis.
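The modified test code for the explicit midpoint method is not reproduced in the text. The following is a minimal sketch of the recursion $V_{k+1} = V_{k-1} + 2\alpha\Delta t\,V_k$ with exact starting values, assuming $\alpha = -4.5$, $u(0) = 1$ and the interval $[0,1]$; the variable names and the short list of $N$ values are illustrative choices, not the author's.

```matlab
% Sketch of the explicit midpoint recursion for u' = alpha*u on [0,T].
% alpha, u0, T and Nlist are assumptions chosen for illustration.
alpha = -4.5;  u0 = 1;  T = 1;
Nlist = [10 100 1000 10000];
errs  = zeros(size(Nlist));

for j = 1:numel(Nlist)
    N  = Nlist(j);
    dt = T/N;
    V  = zeros(1, N+1);
    V(1) = u0;                               % V_0 = u(0)
    V(2) = exp(alpha*dt)*u0;                 % V_1 taken from the exact solution
    for k = 2:N
        V(k+1) = V(k-1) + 2*alpha*dt*V(k);   % two-step midpoint recursion
    end
    errs(j) = abs(V(end) - exp(alpha*T)*u0); % endpoint error |V_N - u(T)|
end

disp([Nlist; errs])                          % N in the first row, error in the second
```

For the coarsest grid the endpoint error is far larger than the exact value $e^{-4.5}$ itself, which is the instability described above; only for small step sizes does the error become small.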

Fig. 2.4 Test of the explicit midpoint method. Top panel shows the exact solution (blue) and the numerical solution (red) using $N = 10$ steps. The numerical solution has an undesirable oscillatory behavior of growing amplitude, indicating instability. The method eventually produces negative values, in spite of the exact solution always remaining positive, being an exponential. Bottom panel shows the error $|V_k - u(t_k)|$ along the solution in a lin-log diagram. From top to bottom, the individual graphs correspond to $N = 10, 320, 1000, \dots$ Initially, the error is $O(\Delta t^2)$, but here, too, we observe an oscillatory behavior, and a faster error growth when $t$ grows.

The examples above are very simple. In particular, the differential equation is linear. It is chosen simply to illustrate the errors of the approximate methods. Linear problems usually have solutions that can be expressed in terms of elementary functions. By contrast, most nonlinear differential equations of interest cannot be solved in terms of elementary functions. For example, the van der Pol equation,

\[ \begin{aligned} \dot u &= v \\ \dot v &= \mu\,(1 - u^2)\,v - u \end{aligned} \]

with $u(0) = 2$ and $v(0) = 0$, is a system of ordinary differential equations, which is nonlinear if $\mu \neq 0$. An analytical solution can only be found for $\mu = 0$; then $u(t) = 2\cos t$ and $v(t) = -2\sin t$. Nonlinearity is therefore an added difficulty, and for most nonlinear problems, there is no alternative to computing approximate numerical solutions.

In a linear problem the unknowns enter only in their first power. There are no other powers, like $u^2$ or $v^{1/2}$, nor any nonlinear functions, such as $u^2$, $\sin u$ or $\log v$, occurring in the equation. In the van der Pol equation above, we see that the cubic

term $u^2 v$ occurs. Note that a nonlinearity can take the form of a product of first-degree terms, such as $uv$, which is a quadratic term.

Fig. 2.5 Test of the explicit midpoint method. The endpoint error $|V_N - u(1)|$ is plotted in a log-log diagram for $N = 10, \dots, 100000$ (blue). A dashed reference line of slope 2 shows that the error is $O(\Delta t^2)$, but only if $\Delta t$ is small enough (left part of graph, corresponding to $N > 3200$). Thus the method is second order convergent, but for larger $\Delta t$, the error grows rapidly and the proper convergence order is no longer observed (right part of graph).

Discretization methods are in general aimed at solving nonlinear problems, provided that the original problem has a unique solution. A necessary condition for solving nonlinear problems successfully is that we are able to solve linear problems in an accurate and robust way. In addition, many classical problems in applied mathematics are linear, but still have to be solved numerically due to various complications, such as problem size or complex geometry. Discretization therefore remains one of the most important principles in scientific computing.

2.2 The Second Principle: Polynomials and linear algebra

Linear algebra is synonymous with matrix problems of various kinds, such as solving linear systems of equations and eigenvalue problems. But to fully appreciate the role of linear algebra in numerical analysis, it must be recognized that most computational problems that give rise to algebraic computations do not come ready-made in matrix-vector form.

Problems in analysis typically involve continuous functions, for instance the computation of integrals. It may sometimes be difficult to solve such problems analytically, as one cannot always find primitive functions in terms of elementary functions. But polynomials are different. It is straightforward to work with polynomials in analysis: the primitive function of a polynomial is again a polynomial, and conversely, the derivative of a polynomial is also a polynomial. This makes them attractive and convenient tools in numerical analysis, and a large number of numerical methods are therefore based on polynomial approximation.

The simplest (but far from the best) representation of a polynomial is

\[ P(x) = c_0 + c_1 x + c_2 x^2 + \dots + c_N x^N. \tag{2.9} \]

Thus every polynomial is completely defined by a finite set of information, the coefficient vector, $(c_0, c_1, \dots, c_N)^T$. Consequently, many problems involving polynomials lead directly to matrix-vector computations, or, in other words, linear algebra.

The interpolation problem is one of the most important uses of polynomials in numerical analysis. In interpolation, given a grid function $F$ on a grid $\Omega_N$, one constructs a polynomial $P$ on the interval $\Omega$, such that $P(\Omega_N) = F$. The purpose is the opposite of discretization. Interpolation aims at generating a continuous function from discrete data. If we think of discretization as an analog-to-digital conversion, interpolation is the opposite, a digital-to-analog conversion.

Example. Let us consider approximating the function $f(x) = 1 + 0.8\sin 2\pi x$ by a polynomial $P$ on the interval $[0,1]$, such that the polynomial reproduces the correct values of $f$ at some selected points, say at $x_0 = 0$, $x_1 = 1/4$, $x_2 = 1/2$, $x_3 = 3/4$ and $x_4 = 1$. At those points, $f(x)$ takes the values $1$, $1.8$, $1$, $0.2$ and $1$, respectively.
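Using the ansatz (2.9) with $N = 4$ and imposing the five conditions

\[ c_0 + c_1 x_k + c_2 x_k^2 + c_3 x_k^3 + c_4 x_k^4 = f(x_k); \qquad k = 0, 1, \dots, 4 \]

we obtain the linear system of equations

\[ \begin{pmatrix} 1 & x_0 & x_0^2 & x_0^3 & x_0^4 \\ 1 & x_1 & x_1^2 & x_1^3 & x_1^4 \\ 1 & x_2 & x_2^2 & x_2^3 & x_2^4 \\ 1 & x_3 & x_3^2 & x_3^3 & x_3^4 \\ 1 & x_4 & x_4^2 & x_4^3 & x_4^4 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \end{pmatrix} = \begin{pmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ f(x_3) \\ f(x_4) \end{pmatrix}. \tag{2.10} \]

This determines the unknown coefficients $c_0, \dots, c_4$ that uniquely characterize the interpolation polynomial. Inserting the data and solving for the coefficient vector $c$ gives, with coefficients rounded to three decimals,

\[ P(x) = 1 + 8.533\,x - 25.6\,x^2 + 17.067\,x^3 + 0 \cdot x^4. \]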
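The computation in this example can be sketched in a few lines. The snippet below assembles the matrix of the $5 \times 5$ interpolation system column by column from the monomials and solves it with the backslash operator; the variable names are illustrative, and the use of backslash rather than any particular factorization is an assumption made for brevity.

```matlab
% Sketch: set up and solve the interpolation system for the five data points.
x = [0; 1/4; 1/2; 3/4; 1];            % interpolation points x_0, ..., x_4
f = 1 + 0.8*sin(2*pi*x);              % samples of f(x) = 1 + 0.8 sin(2 pi x)

A = [x.^0, x.^1, x.^2, x.^3, x.^4];   % columns are the monomials 1, x, ..., x^4
c = A \ f;                            % coefficient vector (c_0, ..., c_4)^T

residual = max(abs(A*c - f));         % checks the conditions P(x_k) = f(x_k)
disp(c.')
```

Up to rounding errors the last coefficient vanishes, so the computed interpolant is a cubic polynomial.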

Fig. 2.6 Interpolation of a function. A grid function $F$, indicated by blue markers, represents samples of a trigonometric function $f$ on $[0,1]$, indicated by dashed curve (top panel). The grid function is interpolated by a polynomial $P$ of degree 3 (solid blue curve, center panel). As $P$ deviates visibly from $f$, interpolation does not recover the original function $f$ from the grid function $F$. The error $P(x) - f(x)$ is plotted as a function of $x \in [0,1]$ (red curve, lower panel). The error is zero only at the interpolation points, indicated by red markers.

We note that there is no fourth degree term due to the partial symmetry of the function $f$. In general, having to interpolate at five points, we would expect the polynomial to have five coefficients, i.e., $P$ would have to be a degree four polynomial. Here, however, the interpolant is a polynomial of degree 3.

Let us leave aside the question of whether this is a good approximation; the point here is rather to recognize that although both $f(x)$ and the approximating polynomial $P(x)$ are nonlinear functions, the interpolation problem is linear and therefore computable. The unknown coefficients $c_k$ enter the problem linearly, and we obtained the linear system of equations (2.10). The interpolation problem therefore falls in the linear algebra category.

As the example in Figure 2.6 shows, interpolation generates a continuous function from a grid function, but it is not the inverse operation of discretization. Thus, discretization maps a function $f$ to a grid function $F$, and interpolation maps $F$ to a polynomial $P$. This does not recover $f$ unless $f$ originally was identical to $P$. Therefore, in general, it holds that $f(x) \neq P(x)$, except at the interpolation points.

In scientific computing it is often preferable to use more general functions than the standard polynomial (2.9).
Just as linear algebra uses basis vectors, it is common to choose a set of basis functions, $\varphi_0(x), \varphi_1(x), \dots, \varphi_N(x)$. These are often, but

not always, polynomials that are carefully chosen for their computational advantages, or to reflect some special properties of the problem at hand. Given a function $f(x)$, we can then try to find the best approximation among all linear combinations of the basis functions $\varphi_k(x)$. In other words, we want to find the best choice of coefficients $c_k$ such that

\[ \sum_k c_k \varphi_k(x) \approx f(x), \tag{2.11} \]

for $x$ in some set. This could either be an interval, say $\Omega = [0,1]$, or a grid $\Omega_N = \{x_0, x_1, \dots, x_N\}$. The interpolation problem above was a particular case, where we used the well-known monomial basis,

\[ \varphi_k(x) = x^k; \qquad k = 0, \dots, N \tag{2.12} \]

together with the grid $\Omega_N$. With general basis functions, that system would become

\[ \begin{pmatrix} \varphi_0(x_0) & \varphi_1(x_0) & \dots & \varphi_N(x_0) \\ \varphi_0(x_1) & \varphi_1(x_1) & \dots & \varphi_N(x_1) \\ \vdots & & & \vdots \\ \varphi_0(x_N) & \varphi_1(x_N) & \dots & \varphi_N(x_N) \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_N \end{pmatrix} = \begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.13} \]

The important observation here is that the basis functions enter the matrix as its column vectors, and that the system remains linear.

A natural question is whether the interpolation is improved by letting $N \to \infty$. In general this is not the case; it depends in a complicated way on the choice of basis functions, as well as on the grid points $\Omega_N$ and the regularity of $f$. Because the grid points are often defined by the application (e.g. in image processing the grid points correspond to pixels with fixed locations), the basis functions usually play the more important role. An example is the monomial basis (2.12), a choice that is fraught with prohibitive shortcomings for large $N$. It is often better to use piecewise low-degree interpolation, a technique frequently used in connection with differential equations and the Finite Element Method, where $N$ is often very large.

Let $\varphi_k$ and $F$ denote the grid functions

\[ \varphi_k = \begin{pmatrix} \varphi_k(x_0) \\ \vdots \\ \varphi_k(x_N) \end{pmatrix}, \qquad F = \begin{pmatrix} f(x_0) \\ \vdots \\ f(x_N) \end{pmatrix}. \tag{2.14} \]

Then (2.13) can be written

\[ \sum_{k=0}^{N} c_k \varphi_k = F. \tag{2.15} \]

It expresses the vector $F$ as a linear combination of the basis vectors $\varphi_k$, while (2.11) expresses the function $f(x)$ as a linear combination of the basis functions $\varphi_k(x)$.

The least-squares problem is quite closely related to the interpolation problem. There the number of basis functions, $M + 1$, is small compared to the number of elements $N + 1$ in $\Omega_N$ (not to mention the case when $\Omega$ is an interval). For example, fitting a straight line (a first-degree polynomial) to a data set means that $M = 1$, i.e., we have but two coefficients, $c_0$ and $c_1$, to determine, while the data set, defined by $f$-values on $\Omega_N$, may be arbitrarily large. The system

\[ \sum_{k=0}^{M} c_k \varphi_k = F \tag{2.16} \]

is then an overdetermined system; it has more equations than unknowns. This corresponds to (2.13) but with a rectangular matrix containing only a few columns. In such a situation, we form the scalar product of (2.16) with any $\varphi_j$, to get

\[ \sum_{k=0}^{M} c_k\, \varphi_j^T \varphi_k = \varphi_j^T F; \qquad j = 0, \dots, M. \tag{2.17} \]

This is an $(M+1) \times (M+1)$ linear system of equations, referred to as the normal equations. It can be written $Ac = g$, with matrix elements $a_{jk} = \varphi_j^T \varphi_k$, and where $g_j = \varphi_j^T F$. If the basis vectors $\{\varphi_k\}$ are linearly independent, it can be solved for the vector $c$ of unknown coefficients $c_k$.

An even better approach is to first select the basis functions $\varphi_k(x)$ so that the vectors $\varphi_k$ are orthogonal on $\Omega_N$, which means that $\varphi_j^T \varphi_k = 0$ if $j \neq k$. Then the sum on the left hand side of (2.17) has only one term; the system reads $c_j\, \varphi_j^T \varphi_j = \varphi_j^T F$. Thus we can immediately solve for the coefficient $c_j$, to get

\[ c_j = \frac{\varphi_j^T F}{\varphi_j^T \varphi_j}. \tag{2.18} \]

This means that the coefficient vector is easily computed in terms of scalar products alone, one of the basic techniques in numerical linear algebra. A most important special case is Fourier analysis, which is exclusively based on this technique, making it very useful and efficient in practical computation.
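As a small illustration of the two routes just described, the following sketch fits a straight line to samples on a grid, first through the normal equations $Ac = g$ and then with an orthogonal pair of basis vectors, where each coefficient is a quotient of scalar products. The data (the earlier function $f$ sampled at 21 equidistant points) and all variable names are assumptions made for the example.

```matlab
% Sketch: least-squares fit of a straight line c_0 + c_1*x (M = 1).
% The data set is an illustrative assumption.
x = linspace(0, 1, 21).';
F = 1 + 0.8*sin(2*pi*x);              % grid function to be fitted

Phi = [ones(size(x)), x];             % columns: basis vectors phi_0, phi_1
A = Phi.'*Phi;                        % a_jk = phi_j^T phi_k
g = Phi.'*F;                          % g_j  = phi_j^T F
c = A \ g;                            % normal equations A c = g

% Orthogonal alternative: replace phi_1 by x - mean(x), so that p0^T p1 = 0
% and each coefficient is a quotient of scalar products.
p0 = ones(size(x));
p1 = x - mean(x);
d0 = (p0.'*F)/(p0.'*p0);
d1 = (p1.'*F)/(p1.'*p1);

% Both describe the same line: slope d1, intercept d0 - d1*mean(x).
disp([c.'; d0 - d1*mean(x), d1])
```

With the orthogonal basis no linear system has to be solved, which is the point of the construction.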
The coefficients $c_j$ in (2.18) are commonly referred to as Fourier coefficients. In Fourier analysis, the basis functions are typically chosen as trigonometric functions. By Euler's formula, $e^{i\omega n} = \cos\omega n + i\sin\omega n$. Therefore, trigonometric functions are polynomials in the variable $x = e^{i\omega}$, motivating the commonly used term trigonometric polynomials.

More interestingly, but less obviously, the Finite Element Method (FEM) for differential equations is based on similar ideas, using scalar products to construct the linear system that needs to be solved. In effect, the best approximation is then found by solving a system of the form (2.17), although we then have $M = N$. There

are many different FEM approaches, consisting in choosing the basis functions $\{\varphi_j\}$ to carefully represent the desired properties of the solution, in terms of the degree of the polynomial basis functions as well as in order to satisfy boundary and continuity conditions in a proper way. All these methods lead to linear systems of special structure, and provide strong examples of how numerical analysis prefers to work in various polynomial settings to make efficient use of linear algebra for the determination of the best linear combination of the chosen basis functions.

Thus, it is fair to say that approximation by polynomials is a cornerstone of scientific computing, allowing us to employ the full range of linear algebra techniques in the process of finding approximate solutions to problems from analysis. An added difficulty is that in many applications, not least in FEM, the linear systems are extremely large and sparse. It is not uncommon to have millions of unknowns or more. In such cases it may no longer be possible to use standard factorization methods to solve the system. Instead, other approximate techniques have to be considered. These are iterative.

2.3 The Third Principle: Iteration

Nonlinear problems can rarely be solved by analytical techniques. The computational techniques used are iterative. This means that they are repetitive, in the sense that the computational method generates a sequence of successively improved approximations that converge to the solution of the nonlinear problem. Since convergence is usually an infinite process, the iteration will in practice be terminated after a finite number of steps, but if convergence is observed it is possible to terminate the process when an acceptable accuracy has been achieved.

The very simplest example is the problem of computing the square root of a number, say 2.
As this is an irrational number, we only have numerical approximations to it, although today's software allows us to compute such approximations at the touch of a button. The square root of 2, symbolized by $\sqrt{2}$, is the positive root of the nonlinear equation $x^2 = 2$. The root is obviously greater than 1 and less than 2, so a simple guess is $x \approx 1.5$. Since the root is now in the interval $[1, 1.5]$, one could repeat halving the interval to improve the accuracy. This process, called bisection, is however unacceptably slow. Much better and more general techniques are needed.

Two thousand years ago, Heron of Alexandria noted that if an approximation $x$ was a bit too large (like 1.5), then $2/x$ would be a bit too small, and vice versa. He thus suggested that one take the average of $x$ and $2/x$ to obtain a better approximation. This looks deceptively similar to bisection, but it was an enormous advance, which 1,700 years later became known as Newton's method. If iterated, the computation is

\[ x^{k+1} = \frac{1}{2}\left(x^k + \frac{2}{x^k}\right), \]

where the superscript $k$ is the iteration index. To verify that $\lim x^k = \sqrt{2}$, and to analyze the iterative process, let $\varepsilon^k = (x^k - \sqrt{2})/\sqrt{2}$ denote the relative error of the approximation $x^k$, so that $x^k = \sqrt{2}\,(1 + \varepsilon^k)$. Then, by expanding in a Taylor series, we get

\[ x^{k+1} = \frac{1}{2}\left(x^k + \frac{2}{x^k}\right) = \frac{1}{2}\left(\sqrt{2}\,(1+\varepsilon^k) + \frac{2}{\sqrt{2}\,(1+\varepsilon^k)}\right) \approx \frac{1}{\sqrt{2}}\left(1 + \varepsilon^k + \bigl(1 - \varepsilon^k + (\varepsilon^k)^2 - \dots\bigr)\right) \approx \sqrt{2}\left(1 + \frac{(\varepsilon^k)^2}{2}\right). \]

Hence it follows that the relative error in $x^{k+1}$ is

\[ \varepsilon^{k+1} = \frac{x^{k+1} - \sqrt{2}}{\sqrt{2}} \approx \frac{(\varepsilon^k)^2}{2}. \]

This means that the number of correct digits more than doubles in each iteration! If the relative error of $x^0$ is $10^{-2}$ (or 1%, corresponding to two correct digits, as in $x^0 = 1.4$), then the relative error of $x^1$ is approximately $5 \cdot 10^{-5}$. In fact, the third iterate, $x^3$, is correct to 16 digits, implying that full IEEE precision has been attained. By contrast, simple bisection would require 50 iterations to achieve the same. This shows the power of well-designed iterative methods.

In scientific computing there are two basic types of iterative methods for solving nonlinear equations. These are fixed-point iteration and Newton's method. The iteration demonstrated above is an example of both methods. In fact, they solve slightly different problems. Newton's method solves problems of the form $f(x) = 0$, while fixed-point iteration solves problems of the form $x = g(x)$. The functions $f$ and $g$ are both nonlinear. We shall leave Newton's method for the next section, and only analyze fixed-point iteration here.

Given a nonlinear function $g : D \subset \mathbb{R}^m \to \mathbb{R}^m$, fixed-point iteration starts from an initial approximation $x^0$ and computes the next approximation from $x^1 = g(x^0)$. The iteration becomes

\[ x^{k+1} = g(x^k). \tag{2.19} \]

The name fixed-point iteration comes from the fact that if the iteration converges, we have $\lim x^k = x^*$, where

\[ x^* = g(x^*), \tag{2.20} \]

i.e., $x^*$ is left unchanged by the map $g$. Thus $x^*$ maps to itself, and is termed a fixed point of $g$. Whether there exist fixed points and whether the iteration converges depend on the function $g$.
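The rapid convergence of Heron's iteration for $\sqrt{2}$ is easy to observe numerically. The sketch below starts from $x^0 = 1.4$ and records the relative error after each step; the built-in sqrt is used only as the reference value, and the variable names are illustrative.

```matlab
% Sketch: Heron's iteration x^{k+1} = (x^k + 2/x^k)/2 for the square root of 2.
x = 1.4;                                   % starting value x^0
relerr = abs(x - sqrt(2))/sqrt(2);         % relative error of x^0
for k = 1:4
    x = (x + 2/x)/2;                       % average of x and 2/x
    relerr(end+1) = abs(x - sqrt(2))/sqrt(2);
end
fprintf('%.3e\n', relerr);                 % one relative error per line
```

Each printed error is roughly half the square of the previous one, until rounding errors in double precision take over.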
Subtracting (2.20) from (2.19), we get

\[ x^{k+1} - x^* = g(x^k) - g(x^*). \tag{2.21} \]

Taking norms (for the time being, any vector norm will do), we find that

\[ \|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\|. \]

Next, we shall assume that the map $g$ is Lipschitz continuous.

Definition 2.1. Let $g : D \subset \mathbb{R}^m \to \mathbb{R}^m$. The Lipschitz constant of $g$ on $D$ is defined by

\[ L[g] = \sup_{u \neq v} \frac{\|g(u) - g(v)\|}{\|u - v\|} \tag{2.22} \]

for all $u, v \in D$.

Using this definition, it holds that

\[ \|x^{k+1} - x^*\| = \|g(x^k) - g(x^*)\| \le L[g] \cdot \|x^k - x^*\|. \]

Letting $\varepsilon^k = \|x^k - x^*\|$ denote the norm of the absolute error in $x^k$, we note that $\varepsilon^{k+1} \le L[g] \cdot \varepsilon^k$. Hence, if $L[g] < 1$, the error decreases in every iteration, and, in fact, $\varepsilon^k \to 0$. A map with $L[g] < 1$ is called a contraction map, as the distance between two points (here $x^k$ and $x^*$) decreases under the map $g$. For contraction maps the fixed point iteration is therefore convergent, provided that $x^0, x^* \in D$.

The fixed point theorem (also known as the contraction mapping theorem) is one of the most important theorems in nonlinear analysis, and it states the following:

Theorem 2.1. (Fixed point theorem) Let $D$ be a closed set and assume that $g$ is a Lipschitz continuous map satisfying $g : D \subset \mathbb{R}^m \to D$. Then there exists a fixed point $x^* \in D$. If, in addition, $L[g] < 1$ on $D$, then the fixed point $x^*$ is unique, and the fixed point iteration (2.19) converges to $x^*$ for every starting value $x^0 \in D$.

We shall not give a complete proof. The key points here are that $g$ maps the closed set $D$ into itself (existence) and that $g$ is a contraction, $L[g] < 1$ (uniqueness). Proving existence is the hard part (20th century mathematics), while uniqueness and convergence are quite simple. Naturally, existence is always of fundamental importance, but in computational practice it is the contraction that matters most, since the theorem offers a constructive way of obtaining successively improved approximations, converging to the proper limit $x^*$.

It is worth noting that Lipschitz continuity is quite close to differentiability. In fact, we can rewrite (2.21) as

\[ x^{k+1} - x^* = g(x^k) - g(x^*) = g\bigl(x^* + (x^k - x^*)\bigr) - g(x^*) \approx g'(x^*) \cdot (x^k - x^*), \tag{2.23} \]

provided that $g$ is differentiable. Thus

\[ \|x^{k+1} - x^*\| \lesssim \|g'(x^*)\| \cdot \|x^k - x^*\|. \]
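The contraction property can be watched at work on a scalar example. The sketch below assumes the map $g(x) = \cos x$ on $D = [0,1]$, which is not taken from the text: it maps $D$ into itself and has $L[g] = \sin 1 \approx 0.84 < 1$ there. The script iterates $x^{k+1} = g(x^k)$ and compares the observed error ratios with $|g'(x^*)| = \sin x^*$, using the final iterate as a stand-in for the unknown fixed point.

```matlab
% Sketch: fixed-point iteration x^{k+1} = g(x^k) for the assumed map g = cos.
g = @(x) cos(x);
x = 1;                                    % starting value x^0 in D = [0,1]
xs = x;
for k = 1:60
    x = g(x);
    xs(end+1) = x;                        % store the iterates
end
xstar  = xs(end);                         % converged value, used as reference
err    = abs(xs(1:20) - xstar);           % errors of the first 20 iterates
ratios = err(2:end)./err(1:end-1);        % observed contraction factors
fprintf('fixed point %.12f, last ratio %.4f, |g''(x*)| = %.4f\n', ...
        xstar, ratios(end), abs(sin(xstar)));
```

Every ratio stays below 1, as the Lipschitz bound requires, and the later ratios settle near the derivative of $g$ at the fixed point.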
The linearization (2.23) indicates that the error is diminishing provided that $\|g'(x^*)\| < 1$. In fact, if $g$ is differentiable and the set $D$ is convex, then one can show that

\[ L[g] = \sup_{x \in D} \|g'(x)\|. \]

Returning to the classical (but trivial) problem of computing the square root of 2, we note that

\[ g(x) = \frac{1}{2}\left(x + \frac{2}{x}\right). \]

It follows that

\[ g'(x) = \frac{1}{2}\left(1 - \frac{2}{x^2}\right). \]

Therefore, $g'(x^*) = g'(\sqrt{2}) = 0$. This means that $g$ has a very small Lipschitz constant in a neighborhood of the root, explaining the fast convergence of Heron's formula.

In scientific computing, it is always of interest to find error estimates. Again, we can rewrite (2.21) as

\[ x^{k+1} - x^* = g(x^k) - g(x^*) = g(x^k) - g(x^{k+1}) + g(x^{k+1}) - g(x^*). \tag{2.24} \]

By the triangle inequality, we have

\[ \varepsilon^{k+1} \le \|g(x^k) - g(x^{k+1})\| + \|g(x^{k+1}) - g(x^*)\| \le L[g] \cdot \|x^k - x^{k+1}\| + L[g] \cdot \varepsilon^{k+1}. \]

Hence we have $(1 - L[g])\,\varepsilon^{k+1} \le L[g] \cdot \|x^k - x^{k+1}\|$. Solving for $\varepsilon^{k+1}$, we obtain the error bound

\[ \varepsilon^{k+1} \le \frac{L[g]}{1 - L[g]}\, \|x^k - x^{k+1}\|. \tag{2.25} \]

Thus, while the true error remains unknown, it can be bounded in terms of the computable quantity $\|x^k - x^{k+1}\|$, provided that we know $L[g]$. Unfortunately, the Lipschitz constant is rarely known, but a rough lower estimate can be obtained during the iteration process from

\[ L[g] \gtrsim \frac{\|g(x^k) - g(x^{k-1})\|}{\|x^k - x^{k-1}\|} = \frac{\|x^{k+1} - x^k\|}{\|x^k - x^{k-1}\|}. \]

This makes it possible to compute a reasonably accurate error estimate from (2.25).

While iterative methods are necessary for nonlinear problems, iterative methods are also of interest for large-scale linear systems. For example, linear systems arising in partial differential equations may have hundreds of millions of equations, which excludes the use of conventional direct methods such as Gaussian elimination. The remaining option is then iterative methods. If the problem is $Ax = b$, one usually tries to split the matrix so that $A = M - N$, and rewrite the system

\[ Mx = Nx + b. \]

If one can find a splitting such that $M$ is inexpensive to invert (for example if $M$ is diagonal), we can use a fixed point iteration

\[ x^{k+1} = M^{-1}(N x^k + b). \]

Here, if $x^k = x + \delta x^k$, with $\delta x^k$ representing the absolute error, we see that

\[ \delta x^{k+1} = M^{-1} N\, \delta x^k. \]

This iteration will be convergent ($\delta x^k \to 0$) if all eigenvalues of $M^{-1}N$ are located strictly inside the unit circle; this then becomes a measure of a well designed iterative method. Interestingly, whether this can be achieved depends on the choice of discretization as well as on the properties of the differential equation.

It is easily seen that the iteration above can be rewritten

\[ x^{k+1} = x^k - M^{-1}(A x^k - b). \]

Here the quantity $r^k = A x^k - b$ is known as the residual. In many large-scale problems it is easy to compute the residual, but the matrix $A$ itself may not be available for separate processing. The iterative method then attempts to successively improve the approximation $x^k$ by only evaluating the residual, but changing its direction by the matrix $M$ in order to speed up convergence. For such a technique to be successful, much work goes into the construction of $M$, using all available knowledge of the problem at hand. Depending on the construction of the iterative method, the matrix $M$ is often referred to as a preconditioner. The ideal choice of $M$ would be to take $M = A$, but this choice requires that $A$ is inverted or factorized by conventional techniques. Therefore $M$ is always some approximation, for example an incomplete factorization of $A$.

As mentioned before, there are many other kinds of problems that require iterative methods, e.g. eigenvalue problems. It is impossible to give a comprehensive overview in a limited space, but suffice it to say that without iterative methods, scientific computing would not be able to address important classes of problems in applied mathematics.
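A minimal sketch of such a splitting iteration is given below, assuming the simplest choice $M = \operatorname{diag}(A)$ (the Jacobi iteration) and a small, diagonally dominant tridiagonal test matrix. Both the matrix and the stopping tolerance are illustrative assumptions; the script uses only residual evaluations and divisions by the diagonal.

```matlab
% Sketch: x^{k+1} = x^k - M^{-1}(A*x^k - b) with M = diag(A) (Jacobi).
% The test matrix and the tolerance are illustrative assumptions.
n = 50;
e = ones(n, 1);
A = spdiags([-e 4*e -e], -1:1, n, n);     % sparse tridiagonal test matrix
b = A*e;                                  % right-hand side with known solution x = e
d = full(diag(A));                        % diagonal of A, trivial to invert

x = zeros(n, 1);
for k = 1:100
    r = A*x - b;                          % residual r^k = A*x^k - b
    if norm(r) <= 1e-12*norm(b), break, end
    x = x - r./d;                         % correct x by the scaled residual
end
fprintf('stopped at k = %d, error %.2e\n', k, norm(x - e, inf));
```

For this matrix the eigenvalues of $M^{-1}N$ have modulus below $1/2$, so the error is at least halved in every sweep. A less favorable matrix would need a better preconditioner than its diagonal.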
2.4 The Fourth Principle: Linearization

The last principle that is a key element in many numerical methods is linearization. This simply means that one converts a nonlinear problem to a linear one. Often, this implies that one considers small variations around a given point, using differentiation as an approximation.

Consider a differentiable map $f : \mathbb{R}^m \to \mathbb{R}^n$, mapping $x \in \mathbb{R}^m$ to $y \in \mathbb{R}^n$, so that $y = f(x)$. Analysis then allows us to write

\[ dy = f'(x)\,dx. \tag{2.26} \]

This expresses that the differential $dy$ is a linear function of the differential $dx$. It is valid for any differentiable function $f$, and the derivative $f'(x)$ can be a scalar (when $n = m = 1$), a gradient (when $n = 1$ with $m$ arbitrary but finite), a Jacobian matrix (with both $n$ and $m$ arbitrary, but finite), or a linear operator (both $n$ and $m$ possibly infinite), as the case may be; the formalism is always the same.

In numerical analysis, it is common to approximate the infinitesimal differentials by finite differences, such as (1.2). We then write

\[ \Delta y \approx f'(x)\,\Delta x. \tag{2.27} \]

This is again expressing (small) variations in $y$ as a linear function of $\Delta x$. That is, we consider the effect $\Delta y$, due to small variations $\Delta x$ in the independent variable $x$, to be proportional to $\Delta x$. This is the essential idea of linearization.

To take this further, let $f$ be a differentiable nonlinear map $f : \mathbb{R}^m \to \mathbb{R}^m$. Given any fixed vector $x$ and a small, varying offset $\Delta x$, we can expand $f$ in a Taylor series about $x$, to obtain

\[ f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x, \tag{2.28} \]

by retaining the first two terms. This is equivalent to linearization because it approximates $f(x + \Delta x)$ by a linear function of $\Delta x$; the approximation on the right hand side of (2.28) only contains $\Delta x$ to its first power.

Note that if the problem is scalar ($m = 1$) this is a standard Taylor series, with all quantities involved being scalar. One can then graph the right hand side of (2.28) and get a straight line with slope $f'(x)$. In the vector case, however, $x$ is an $m$-vector, $f(x)$ is an $m$-vector, and each component of $f(x)$ depends on all components of $x$, i.e.,

\[ f_i(x) = f_i(x_1, \dots, x_j, \dots, x_m). \]

Hence each component $f_i$ can be differentiated with respect to every component of $x$, to obtain its gradient $f_i'(x)$. In other words,

\[ f_i'(x) = \operatorname{grad}_x f_i(x) = \left( \frac{\partial f_i}{\partial x_1}, \dots, \frac{\partial f_i}{\partial x_j}, \dots, \frac{\partial f_i}{\partial x_m} \right). \tag{2.29} \]

If we write down the derivatives of all components of $f$ simultaneously, we obtain the $m \times m$ Jacobian matrix,

\[
f'(x) = \begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_m} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \dfrac{\partial f_m}{\partial x_2} & \cdots & \dfrac{\partial f_m}{\partial x_m}
\end{pmatrix}. \tag{2.30}
\]

The Jacobian matrix enters as $f'(x)$ in the Taylor series expansion (2.28), which remains a linear approximation.

Let us now turn to the problem of solving a nonlinear equation $f(x) = 0$. If one has some approximation $x^0$ of the solution, we can expand around $x^0$ to get

\[ f(x) = f\bigl(x^0 + (x - x^0)\bigr) \approx f(x^0) + f'(x^0)\,(x - x^0), \]

retaining the first two terms of the expansion, corresponding to a linear approximation. Because nonlinear problems are not computable but linear ones are, we may consider the linear system of equations

\[ f(x^0) + f'(x^0)\,(x - x^0) = 0, \tag{2.31} \]

as an approximation to the nonlinear system $f(x) = 0$. In (2.31), $f(x^0)$ is an $m$-vector and $f'(x^0)$ an $m \times m$ matrix, so this is a linear system that can be used to determine $x$. The formal solution is

\[ x = x^0 - \bigl(f'(x^0)\bigr)^{-1} f(x^0), \]

where all expressions on the right hand side are computable, provided that the Jacobian matrix is available and nonsingular. It is obvious, however, that this is not the solution to the problem $f(x) = 0$, as that problem was replaced by a linear approximation. If the linearization is in good agreement with the nonlinear function, we may obtain an improved solution compared to $x^0$. Naturally, this process can be repeated, leading to the iterative method known as Newton's method,

\[ x^{k+1} = x^k - \bigl(f'(x^k)\bigr)^{-1} f(x^k), \tag{2.32} \]

where the superscript is used as the iteration index as before, to distinguish it from the vector component subscript index used above.

As we have seen above, Newton's method is based on approximating the original problem $f(x) = 0$ by a sequence of linear, computable problems, whose solutions are expected to produce successively better approximations, converging to the true solution.

We encountered Newton's method already in the previous section, applied to the quadratic equation $x^2 - 2 = 0$. Thus, taking $f(x) = x^2 - 2$, we have $f'(x) = 2x$, and Newton's method becomes

\[ x^{k+1} = x^k - \frac{(x^k)^2 - 2}{2x^k} = \frac{x^k}{2} + \frac{1}{x^k} = \frac{1}{2}\left( x^k + \frac{2}{x^k} \right). \]

Thus we have obtained Heron's formula for computing square roots. Newton's method is far more general, however, and is one of the most frequently used methods for nonlinear equations, of equal importance to the fixed point iteration of the previous section. It is by far the most important example of the principle of linearization.

Nevertheless, the convergence of Newton's method remains an extremely difficult matter. To summarize, a first requirement is that $f'(x)$ is nonsingular in the vicinity of the solution $\bar{x}$ to $f(x) = 0$. This is equivalent to requiring that $\|(f'(x))^{-1}\| \le C_1$ in a neighborhood $B(r) = \{x : \|x - \bar{x}\| \le r\}$ of $\bar{x}$. In addition, we have to require that the second derivative (a 3-tensor) is bounded, i.e., $\|f''(x)\| \le C_2$ for $x \in B(r)$. Finally we need an initial approximation $x^0 \in B(r)$ and $\|f(x)\| \le C_0$ on $B(r)$. Success depends on further conditions on $f$ and on the relation between the bounds $C_0$, $C_1$ and $C_2$. In practice, it is impossible to verify these conditions mathematically, and as a consequence, one often experiences considerable difficulties or even failures in computational practice. This is not necessarily due to any shortcomings of the method; it is more of an indication that strong skills in mathematics and numerical analysis are needed.

On the positive side, if one is successful, Newton's method is quadratically convergent, as we already saw in the case of Heron's formula. Thus, when all conditions of convergence are fulfilled, the error behaves like $\varepsilon^{k+1} = O\bigl((\varepsilon^k)^2\bigr)$, which means that Newton's method offers extraordinarily fast convergence. By contrast, fixed point iteration is in general only linearly convergent, i.e., $\varepsilon^{k+1} = O(\varepsilon^k)$, thus calling for a much larger number of iterations before an acceptable approximation can be obtained.
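The quadratic convergence described above can be observed directly. The following is a minimal sketch in Python; the language, the function name `newton_scalar` and the stopping rule are assumptions made for illustration and are not taken from the text. It applies the scalar version of Newton's method (2.32) to $f(x) = x^2 - 2$, which reproduces Heron's formula, and records the error of every iterate.

```python
import math

def newton_scalar(f, fprime, x0, tol=1e-14, max_iter=50):
    """Scalar Newton iteration x^{k+1} = x^k - f(x^k)/f'(x^k).

    Returns the list of all iterates. The iteration stops when the
    update is smaller than tol (an illustrative stopping rule) or
    after max_iter steps.
    """
    xs = [x0]
    for _ in range(max_iter):
        x = xs[-1]
        dx = f(x) / fprime(x)      # solution of the linearized problem
        xs.append(x - dx)
        if abs(dx) < tol:
            break
    return xs

# f(x) = x^2 - 2, so each step is Heron's formula x <- (x + 2/x)/2
iterates = newton_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
errors = [abs(x - math.sqrt(2.0)) for x in iterates]
for k, err in enumerate(errors):
    print(k, err)
```

In the printed list the number of correct digits roughly doubles from one iterate to the next, which is what the estimate $\varepsilon^{k+1} = O((\varepsilon^k)^2)$ expresses.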
These convergence issues are hardly noticed for small systems, but many mathematical models in science and engineering lead to large-scale systems, where they cause considerable difficulties.

As an illustration of what Newton's method does in other special cases, consider solving the problem $f(x) = 0$, with $f(x) = Ax - b$. (This means that we are going to use Newton's method to solve a system of linear equations.) Then $f'(x) = A$, and Newton's method becomes (cf. (2.32))

\[ x^{k+1} = x^k - A^{-1}(Ax^k - b) = x^k - x^k + A^{-1}b = A^{-1}b. \]

Hence Newton's method solves the problem in a single iteration. This comes as no surprise; the idea of linearization is wasted on a linear problem, as the linear problem is its own linearization and the method is designed to solve the linear problem exactly.

However, we also note that Newton's method is expensive to use, as it requires the Jacobian matrix and the solution of a full linear system in every iteration. This again means that if the nonlinear problem is very large, a conventional factorization method is not affordable, and Newton's method will be modified to some less expensive variant, e.g.

\[ x^{k+1} = x^k - M^{-1} f(x^k), \tag{2.33} \]

where $M \approx f'(x^k)$ is inexpensive to invert. The matrix $M$ is often kept constant for many successive iterations to further reduce computational effort. This is referred to as modified Newton iteration. While each iteration is cheaper to carry out, the savings come at a price: the fast, quadratic convergence is lost, and convergence (if the iteration converges at all) will be reduced to linear convergence.

Naturally, (2.33) may be viewed as a fixed point iteration, with

\[ g : x \mapsto x - M^{-1} f(x) \]

and Jacobian matrix

\[ g'(x) = I - M^{-1} f'(x). \]

In the fixed point iteration, we need $\|g'(x)\| < 1$ in a neighborhood of the solution $\bar{x}$. This obviously requires $M^{-1} f'(x) \approx I$, i.e., $M^{-1}$ must approximate $(f'(x))^{-1}$ well enough. Needless to say, finding such an $M$, which is also cheap to invert, is a tall order. Even so, it is a standard task for the numerical analyst who works with advanced, large-scale problems in applied mathematics.

2.5 Correspondence Principles

Out of the four principles outlined above, three aim to turn non-computable problems into computable problems, that is, into linear algebra. Discretization is the most important principle; it reduces the information in an analysis problem to a finite set and enables computations to be completed in finite time. Linearization is a tool for approximating a nonlinear problem by a linear problem; because the latter is not the same as the original problem, it must be combined with iteration to create a sequence of problems, whose solutions converge to the solution of the original problem. Finally, the principle of linear algebra and polynomials represents core methodology, allowing us to approximate and compute in finite time.

On top of this methodology, it is important to understand that analysis problems are set in the continuous world of functions, derivatives and integrals. By contrast, the algebraic problems are set in the discrete world of vectors, polynomials and linear systems.
There is a strong correspondence between the continuous world and the discrete world. Both have rich theories that are largely parallel, with similar structure. The numerical analyst must be well acquainted with both worlds, and be able to move freely between them, as all numerical methods work with finite data sets but are intended to solve problems from analysis.

As an example, consider a linear first-order initial value problem

\[ \dot{u} = Au. \tag{2.34} \]

Its discrete-world counterpart is

\[ v_{k+1} = Bv_k. \tag{2.35} \]

While the first problem is a differential equation, the second problem is a difference equation. If we discretize the differential equation (2.34) in a way similar to (2.5) and let $v_k$ denote an approximation to $u(t_k)$, where $t_k = k\,\Delta t$, we get the difference equation

\[ v_{k+1} = (I + \Delta t\,A)\,v_k, \tag{2.36} \]

corresponding to taking $B = I + \Delta t\,A$. The latter equation links the two equations above.

Now, we are often interested in questions such as whether $u(t) \to 0$ as $t \to \infty$. This will happen if all eigenvalues of $A$ are located strictly inside the left half plane. But if we have discretized the system, we are also interested in knowing whether $v_k \to 0$ as $k \to \infty$. This will happen in (2.35) if all eigenvalues of $B$ are located strictly inside the unit circle. Thus we see that the conditions we arrive at in the continuous and discrete worlds are different, although the two theories are largely analogous.

The real problem in numerical analysis, however, is to relate the two. Thus we would like our discretization (2.36) to behave in a way that replicates the behavior of the continuous problem (2.34). As $B = I + \Delta t\,A$ we can find out whether this is the case. Let $x$ be an eigenvector of $A$, i.e., $Ax = \lambda x$. We then have

\[ Bx = (I + \Delta t\,A)x = x + \Delta t\,Ax = (1 + \Delta t\,\lambda)x, \]

so we conclude that the eigenvalues of $B$ are $1 + \Delta t\,\lambda$. Now, if $\operatorname{Re}\lambda < 0$, will $|1 + \Delta t\,\lambda|$ be less than 1? Interestingly, this puts a condition on $\Delta t$, which must be chosen small enough. In a broader context, it implies that not every discretization will do.

Thus, in order to find out how to solve a problem successfully, we need to have a strong foundation both in the classical continuous world and in the discrete world, as well as knowing how problems, concepts and theorems from one world correspond to those of the other. It is of particular importance to learn to recognize the structural similarity of the continuous and discrete worlds.
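The condition on the step size can be illustrated with a scalar test case. The sketch below is illustrative only: Python, the helper names `amplification` and `iterate`, and the sample values of $\lambda$ and $\Delta t$ are assumptions and not part of the text. It evaluates the eigenvalue $1 + \Delta t\,\lambda$ of $B$ and runs the recursion (2.36) for step sizes on both sides of the unit circle condition.

```python
def amplification(lam, dt):
    """Eigenvalue 1 + dt*lam of B = I + dt*A, for an eigenvalue lam of A."""
    return 1.0 + dt * lam

def iterate(lam, dt, v0=1.0, steps=50):
    """Run the scalar recursion v_{k+1} = (1 + dt*lam) v_k; return the last value."""
    v = v0
    for _ in range(steps):
        v = amplification(lam, dt) * v
    return v

lam = -10.0   # Re(lam) < 0, so the exact solution exp(lam*t) decays
for dt in (0.05, 0.19, 0.21, 0.5):
    mu = amplification(lam, dt)
    v_end = iterate(lam, dt)
    print(f"dt = {dt}: |1 + dt*lam| = {abs(mu):.2f}, |v_50| = {abs(v_end):.3e}")
```

For a real, negative $\lambda$ the requirement $|1 + \Delta t\,\lambda| < 1$ reduces to $\Delta t < 2/|\lambda|$, which is $0.2$ for the value chosen here. The discrete solution decays for the first two step sizes and grows for the last two, although the continuous solution decays in every case.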
Above we have seen the similarity and connections between $\dot{u} = Au$ and $v_{k+1} = Bv_k$. The beginner in numerical analysis is probably better acquainted with the analysis side, but one quickly recognizes how the discrete world works. In fact, for almost every continuous principle, there is a discrete principle, and vice versa.

For example, the usual scalar product $f^{\mathrm{T}} g$ between two vectors $f$ and $g$ is

\[ f^{\mathrm{T}} g = \sum_{k=0}^{N} f_k\,g_k. \]

Is there a continuous scalar product? The answer is yes, and we will use it in particular in connection with the finite element method. Thus the corresponding operation in the continuous case is

\[ \langle f, g \rangle = \int_0^1 f(x)\,g(x)\,\mathrm{d}x, \]

where $\langle \cdot, \cdot \rangle$ denotes the inner product of two functions. Here we recognize the pointwise product of the two functions, integrated ("summed") over the entire range of the independent variable, $x$.

Many operations in algebra have similar counterparts in analysis. Another important example is linear systems of equations, $Ax = b$, which, when written in component form, are

\[ \sum_j a_{i,j}\,x_j = b_i. \]

There are many different types of linear operators in analysis. A direct counterpart to the linear system above is the integral equation

\[ \int_0^1 k(x, y)\,u(y)\,\mathrm{d}y = v(x). \]

Here we see that a function $u$ on $[0,1]$ is mapped to another function $v$. Again this happens by multiplying the function $u$ at the point $y$ by a function of two independent variables, $x$ and $y$, and integrating ("summing") over the entire range of $y$, leaving us with a function of the remaining independent variable, $x$. This operation is often written, just like in the linear algebra case, as $Ku = v$. If the operator $K$ and the function $v$ are given, we need to solve the integral equation $Ku = v$. Such a problem can be solved using discretization, which takes us back to a linear algebraic system of equations.

The simplest example of an integral equation, as well as a differential equation, is an equation of the form

\[ \dot{x} = f(t), \qquad x(t) = \int_0^t f(\tau)\,\mathrm{d}\tau. \]

Obviously, this is a problem from analysis, and we would approximate its solution in numerical analysis by using a discretization method. For example, we may choose a grid $\Omega$ and sample the function $f$ on $\Omega$. Because it may in general be impossible to find an elementary primitive function of $f$, we convert it to a grid function $F$. This, in turn, is reconverted to an approximating polynomial $P$, with $|f(t) - P(t)| \le \varepsilon$ on the entire interval. Replacing $f$ in the integral by the polynomial $P$, we are now ready to find a numerical approximation to the integral. We have obtained a computable problem.
As primitive functions of polynomials are polynomials, the integral can easily be computed. Thus

\[ x(t) \approx \int_0^t P(\tau)\,\mathrm{d}\tau. \]

In the section on linear algebra and polynomials, we have seen that an interpolation polynomial is obtained by solving a linear system of equations, where the discrete samples $F$ of the continuous function $f$ form the data vector. As a result, the interpolation polynomial is a linear combination of the data samples, i.e., it can be written

\[ P(t) = \sum_k F_k\,\varphi_k(t), \]

where the basis functions $\varphi_k(t)$ are also polynomials. Therefore,

\[ \int_0^t P(\tau)\,\mathrm{d}\tau = \int_0^t \sum_k F_k\,\varphi_k(\tau)\,\mathrm{d}\tau = \sum_k F_k \int_0^t \varphi_k(\tau)\,\mathrm{d}\tau. \]

By introducing the notation $w_k = \int_0^t \varphi_k(\tau)\,\mathrm{d}\tau$, we arrive at

\[ \int_0^t f(\tau)\,\mathrm{d}\tau \approx \int_0^t P(\tau)\,\mathrm{d}\tau = \sum_k F_k\,w_k. \]

This means that the integral in the continuous world can be approximated by a weighted sum in the discrete world; the latter is easily computable. In fact, we recognize this computation as a simple scalar product of the grid function vector $F$ and the weight vector $w$. We further note that if $|f(t) - P(t)| \le \varepsilon$, then

\[ \left| \int_0^t \bigl(f(\tau) - P(\tau)\bigr)\,\mathrm{d}\tau \right| \le \int_0^t |f(\tau) - P(\tau)|\,\mathrm{d}\tau \le \varepsilon t. \]

This means that not only can we compute an approximation in the discrete world, but we can also obtain an error bound, provided that we master the interpolation problem.

In a similar way, it is of fundamental importance to master the four basic principles of numerical analysis, the correspondence between the discrete and the continuous worlds, and how well we can approximate solutions to problems in analysis by problems in algebra.

In this book, we are going to focus on differential equations, and therefore the principle of discretization is of primary importance. The quality of a computed solution is a matter of two questions: first stability (which we have not yet been able to explore), and second accuracy, which is usually a matter of how fine we make the discretization. Naturally there is a trade-off. The finer the discretization, the higher the computational cost. On the other hand, a finer discretization offers better accuracy. Can we obtain high accuracy at a low cost? The answer is yes, provided that we can construct stable methods of a high order of convergence. This is not an easy task, but it is exactly what makes scientific computing an interesting and challenging field of study.

For reasons of efficiency, as well as for computability, we need to stay away from infinity, i.e., from infinitely fine discretizations. However, accuracy is obtained near infinity. We obviously need to strike a balance.
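The weighted-sum formula of this section, in which the integral of $f$ is approximated by the scalar product of the samples $F_k$ and the weights $w_k$, can be sketched in a few lines. The code below is an illustration under stated assumptions: Python, all function names, the choice of Lagrange basis polynomials for $\varphi_k$, the three-point grid and the test function $f(t) = t^2$ are hypothetical choices, not taken from the text.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def lagrange_basis(nodes, k):
    """Coefficients of the basis polynomial phi_k, with phi_k(t_j) = 1 if j == k, else 0."""
    coeffs = [1.0]
    for j, tj in enumerate(nodes):
        if j != k:
            denom = nodes[k] - tj
            # multiply by the linear factor (t - t_j) / (t_k - t_j)
            coeffs = poly_mul(coeffs, [-tj / denom, 1.0 / denom])
    return coeffs

def integrate_poly(coeffs, t):
    """Integral from 0 to t of a polynomial, using its exact primitive function."""
    return sum(c * t ** (i + 1) / (i + 1) for i, c in enumerate(coeffs))

def quadrature_weights(nodes, t):
    """Weights w_k = integral from 0 to t of phi_k."""
    return [integrate_poly(lagrange_basis(nodes, k), t) for k in range(len(nodes))]

nodes = [0.0, 0.5, 1.0]                 # the grid (illustrative choice)
w = quadrature_weights(nodes, 1.0)      # weight vector for the interval [0, 1]
f = lambda t: t ** 2                    # sample integrand
F = [f(t) for t in nodes]               # grid function: samples of f on the grid
approx = sum(Fk * wk for Fk, wk in zip(F, w))   # scalar product of F and w
print(w, approx)
```

With three equidistant nodes the weights coincide with those of Simpson's rule on $[0,1]$, namely $1/6$, $2/3$ and $1/6$. Because the sample integrand is itself a polynomial of degree two, the interpolation error $\varepsilon$ is zero and the weighted sum reproduces the exact value $1/3$ up to rounding.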

2.6 Accuracy, residuals and errors

Chapter 3
Differential equations and applications

Differential equations are perhaps the most common type of mathematical model in science and engineering. Our objective is to give an introduction to the basic principles of computational methods for differential equations. This entails the construction and analysis of such methods, as well as many other aspects linked to computing. Thus, one needs to have an understanding of

• the application
• the mathematical model and what it represents
• the mathematical properties of the problem
• what properties a computational method needs to solve the problem
• how to verify method properties
• qualitative discrepancies between numerical and exact solutions
• how to obtain accuracy and estimate errors
• how to interpret computational results.

We are going to consider three different but basic types of problems. The first is initial value problems (IVP) of the form

\[ \dot{y} = f(t, y), \]

with initial condition $y(0) = y_0$. In general this is a system of equations, and the dot represents differentiation with respect to the independent variable $t$, which is interpreted as time. Thus the differential operator of interest is

\[ \frac{\mathrm{d}}{\mathrm{d}t}. \tag{3.1} \]

The second type of problem is a boundary value problem (BVP) of the form

\[ u'' = f(x), \]

with boundary conditions $u(0) = u_L$ and $u(1) = u_R$. Here prime denotes differentiation with respect to the independent variable $x$, which is interpreted as space. The differential operator of interest is now

\[ \frac{\mathrm{d}^2}{\mathrm{d}x^2}. \tag{3.2} \]

In connection with BVPs, we will also consider other types of boundary conditions. Occasionally, we will also consider first order derivatives, i.e., the operator $\mathrm{d}/\mathrm{d}x$. We will find that there are very significant differences in properties and computational techniques between dealing with initial and boundary value problems.

Interestingly, these theories are combined in the third type of problem we will consider: time-dependent partial differential equations (PDE). This is a very large field, and we limit ourselves to some simple standard problems in a single space dimension. Thus, we are going to combine the differential operators (3.1) and (3.2). The simplest example is the diffusion equation,

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}, \]

which requires initial as well as boundary conditions. We will follow standard notation in PDE theory, and rewrite this equation as

\[ u_t = u_{xx}, \]

where the subscript $t$ denotes partial derivative with respect to time $t$, and the subscript $x$ denotes partial derivative with respect to space $x$. Both initial conditions and boundary conditions are needed to have a well-defined solution, and the solution $u$ is a function of two independent variables, $t$ and $x$.

Apart from combining $\partial/\partial t$ with $\partial^2/\partial x^2$, we shall also be interested in combining it with the first order space derivative $\partial/\partial x$, as in the advection equation,

\[ u_t = u_x. \]

The two PDEs above have completely different properties. Both are standard model problems in mathematical physics. The advection equation has wave solutions, while the diffusion equation is dissipative with strong damping. The two problems cannot be solved using the same numerical methods, since the methods must be constructed to match the special properties of each problem.
We will also see that there are many variants of the equations above, and we will consider various nonlinear versions of the equations, as well as other types of boundary conditions. In addition, we will consider eigenvalue problems for the differential operators mentioned above, and build a comprehensive theory from the elementary understanding that can be derived from the IVP and BVP ordinary differential equations.

3.1 Initial value problems

As mentioned above, the general form of a first order IVP is

\[ \dot{y} = f(t, y), \]

with $f : \mathbb{R} \times \mathbb{R}^m \to \mathbb{R}^m$, and an initial condition $y(0) = y_0$. The task is to compute the evolution of the dependent variable, $y(t)$, for $t > 0$. In practice, the computational problem is always set on a finite (compact) interval, $[0, t_{\mathrm{end}}]$.

Almost all software for initial value problems addresses this problem. This means that one has to provide the function $f$ and the initial condition in order to solve the problem numerically. Most problems of interest have a nonlinear function $f$, but even in the linear case numerical methods are usually necessary, due to the size of the problem.

A simple example of an IVP is a second-order equation, describing the motion of a pendulum of length $L$, subject to gravitation, as characterized by the constant $g$,

\[ \ddot{\varphi} + \frac{g}{L}\sin\varphi = 0, \qquad \varphi(0) = \varphi_0, \quad \dot{\varphi}(0) = \omega_0. \]

Here $\varphi$ represents the angle of the pendulum. The problem is nonlinear, since the term $\sin\varphi$ occurs in the equation. Another issue is that the equation is second order, modeling Newtonian mechanics. Since standard software solves first order problems, we need to rewrite the system in first order form by a variable transformation. Thus we introduce the additional variable $\omega = \dot{\varphi}$, representing the angular velocity. Then we have

\[ \dot{\varphi} = \omega, \qquad \dot{\omega} = -\frac{g}{L}\sin\varphi. \]

This is an initial value problem for a first order system. The general principle is that any scalar differential equation of order $d$ can be transformed in a similar manner to a system of $d$ first order differential equations.

Another nonlinear system of equations is the predator-prey model

\[ \dot{y}_1 = k_1 y_1 - k_2 y_1 y_2, \qquad \dot{y}_2 = k_3 y_1 y_2 - k_4 y_2, \]

where $y_1$ represents the prey population and $y_2$ the predator species. The coefficients $k_i$ are supposed to be positive, and the variables $y_1$ and $y_2$ are non-negative.
This system is a classical model known as the Lotka-Volterra equation, and it exhibits an interesting behavior with periodic solutions. The problem is nonlinear, because it contains the quadratic terms $y_1 y_2$. The model can be used to describe the interaction of rabbits ($y_1$) and foxes ($y_2$).

Let us consider the case where there are no foxes, i.e., $y_2 = 0$. The problem then reduces to $\dot{y}_1 = k_1 y_1$, which has exponentially growing solutions. Thus, without predators, the rabbits multiply and the population grows exponentially. On the other hand, if there are no rabbits ($y_1 = 0$), the system reduces to $\dot{y}_2 = -k_4 y_2$. The solution is then a negative exponential, going to zero, representing the fact that the foxes starve without food supply.

If there are both rabbits and foxes, the product term $y_1 y_2$ represents the chance of an encounter between a fox and a rabbit. The term $-k_2 y_1 y_2$ in the first equation is negative, representing the fact that the encounter has negative consequences for the rabbit. Likewise, the positive term $k_3 y_1 y_2$ in the second equation models the fact that the same encounter has positive consequences for the fox. When these interactions are accounted for, one gets an interesting dynamical system, with periodic variations over time in the fox and rabbit populations. The system does not reach an equilibrium.

Initial value problems have applications in a large number of areas. Some well-known examples are

• Mechanics, where Newton's second law is $M\ddot{q} = F(q)$. Here $M$ is a mass matrix, and $F$ represents external forces. The dependent variable $q$ represents position, and its second derivative represents acceleration. Thus mass times acceleration equals force. This is a second order equation, and it is usually transformed to a first order system before it is solved numerically. Applications range from celestial mechanics to vehicle dynamics and robotics.
• Electrical circuits, where Kirchhoff's law is $C\dot{v} = I(v)$, and where $C$ is a capacitance matrix, $v$ represents nodal voltages, and $I$ represents currents. Applications are found in circuit design and VLSI simulation.
• Chemical reaction kinetics, where the mass action law reads $\dot{c} = f(c)$. Here the vector $c$ represents the concentrations of the reactants in a perfectly mixed reaction vessel, and the function $f$ represents the actual reaction interactions. The same type of equation is used in biochemistry as well as in the chemical process industry.

We note, in the last case, that many other applications lead to equations that are structurally similar to the reaction equations. Thus the Lotka-Volterra predator-prey model has a similar structure. Other examples of this type of equation are epidemiological models, where the spreading of an infectious disease has a similar dynamical form. The interactions describe the number of infected people, given that some are susceptible, while others are infected and some have recovered (or are vaccinated) and are immune to further infections. This is the classical SIR model, developed by Kermack and McKendrick in 1927; it was a breakthrough in the understanding of epidemiology and vaccination.

Equations such as those above are all important for modeling complex, nonlinear interactions, whose outcome cannot be foreseen by intuitive or analytical techniques.
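The two model problems of this section, the pendulum rewritten as a first order system and the predator-prey equations, can both be handed to the same solver once they are written in the form $\dot{y} = f(t, y)$. The following sketch is an illustration under stated assumptions: Python, the function names, the use of the classical fourth order Runge-Kutta method with a fixed step size, and all parameter and initial values are hypothetical choices made here, not taken from the text.

```python
import math

def rk4_step(f, t, y, h):
    """One step of the classical fourth order Runge-Kutta method for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def solve(f, y0, t_end, n):
    """Integrate from t = 0 to t_end in n equal steps; return the final state."""
    h, t, y = t_end / n, 0.0, list(y0)
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y

def pendulum(t, y, g=9.81, L=1.0):
    """Pendulum as a first order system with y = [phi, omega]."""
    phi, omega = y
    return [omega, -(g / L) * math.sin(phi)]

def lotka_volterra(t, y, k1=1.0, k2=0.5, k3=0.2, k4=0.8):
    """Predator-prey model with y = [prey, predators]; the k_i are sample values."""
    y1, y2 = y
    return [k1 * y1 - k2 * y1 * y2, k3 * y1 * y2 - k4 * y2]

print(solve(pendulum, [0.1, 0.0], t_end=2.0, n=2000))
print(solve(lotka_volterra, [2.0, 1.0], t_end=10.0, n=10000))
print(solve(lotka_volterra, [4.0, 0.0], t_end=1.0, n=1000))  # no foxes
print(solve(lotka_volterra, [0.0, 2.0], t_end=1.0, n=1000))  # no rabbits
```

The user only supplies the right hand side function and the initial condition, as described at the beginning of the section. The last two calls reproduce the special cases discussed above: without foxes the rabbit population grows like $e^{k_1 t}$, and without rabbits the fox population decays like $e^{-k_4 t}$.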

3.2 Two-point boundary value problems

The two-point boundary value problems (2p-BVP) we will consider are mostly second order ordinary differential equations. A classical problem takes the form

\[ u'' = f(x), \tag{3.3} \]

on $[0,1]$, with Dirichlet boundary conditions $u(0) = u_L$ and $u(1) = u_R$. Here the function $f$ is a source term, which only depends on the independent variable $x$. Occasionally one of the boundary conditions will be replaced by a Neumann condition, $u'(0) = u'_L$. The Neumann condition can also be imposed at the right endpoint of the interval, but in order to have a unique solution, at least one boundary condition has to be of Dirichlet type.

The task is to compute (an approximation to) a function $u \in C^2[0,1]$ that satisfies the boundary conditions and the differential equation on $[0,1]$. Because we cannot compute functions in general, we will have to introduce a discretization, to compute a discrete approximation to $u(x)$.

The equation above is a one-dimensional version of the Poisson equation. Understanding this 2p-BVP is key to understanding the Laplace and Poisson equations in 2D or 3D. The latter cases are certainly more complicated, but many basic properties concerning solvability and error bounds are built on similar theories.

An example of a fourth order problem is the beam equation

\[ M'' = f(x), \qquad u'' = \frac{M}{EI}, \]

where $M$ represents the bending moment of a slender beam supported at both ends, subject to a distributed load $f(x)$ on $[0,1]$. If the supports do not sustain any bending moment, the boundary conditions for the first equation are of homogeneous Dirichlet type, $M(0) = M(1) = 0$. In the second equation, which is structurally identical to the first, $u$ represents the deflection of the beam, under the bending moment $M$. Here, too, the boundary conditions are Dirichlet conditions, $u(0) = u(1) = 0$.
Finally, $E$ is a material constant and $I$ (which may depend on $x$) is the cross-section moment of inertia of the beam, which depends on the shape of the beam's cross section.

This is a linear problem, as the dependent variables $M$ and $u$ only enter linearly. Many solvers for 2p-BVPs are written for second order problems, and are often designed especially for Poisson-type equations. In this particular example, one first solves the moment equation, and once the moment has been computed, it is used as data for solving the second, deflection equation.

Whether such an approach is possible or not depends on the boundary conditions. Thus, there is a variant of the beam equation, where the beam is clamped at both ends. This means that the supports do sustain bending moments, and that the boundary conditions take the form $u'(0) = u'(1) = 0$ and $u(0) = u(1) = 0$. In this case, one cannot solve the problem in two steps, because the problem is a genuine fourth order equation, known as the biharmonic equation. If $I$ is constant, the equation is equivalent to

\[ u^{IV} = \frac{f(x)}{EI}, \]

and will require its own dedicated numerical solver.

In connection with 2p-BVPs, we will also consider problems containing first order derivatives, e.g.

\[ u'' + u' + u = f(x), \]

or nonlinear problems, such as

\[ u'' + uu' = f(x), \]

or

\[ u'' + u = f(u). \]

In the last case, the function $f$ is a function of the solution $u$, not just a source term. The structure of these equations will become clear in connection with PDEs, as these operators will correspond to the spatial operators of time-dependent PDEs.

Another very important, and distinct, type of BVP is eigenvalue problems for differential operators. The simplest example is

\[ u'' = \lambda u, \tag{3.4} \]

with homogeneous boundary conditions $u(0) = u(1) = 0$. Since an algebraic eigenvalue problem is usually written $Au = \lambda u$, where $\lambda$ corresponds to the eigenvalues of the matrix $A$, we see that (3.4) is in fact an analytical eigenvalue problem for the differential operator $\mathrm{d}^2/\mathrm{d}x^2$. Problems of this type are usually referred to as Sturm-Liouville problems. Here we will construct discretization methods that approximate this analysis problem by a linear algebra problem; in other words, we will be able to approximate the eigenvalues of (3.4) by solving a linear algebra eigenvalue problem.

Two-point boundary value problems occur in a large number of applications. A short list of examples includes

• Materials science and structural analysis, as exemplified by the beam equation above.
• Microphysics, as exemplified by the (stationary) Schrödinger equation

\[ -\frac{\hbar^2}{2m}\psi'' + V(x)\psi = E\psi, \]

where $\psi$ is the wave function, $V(x)$ is the potential, and $E$ is the energy of a particle in the state $\psi$.
• Eigenmode analysis, such as in $u'' = \lambda u$, which may describe buckling of structures, or eigenfrequency oscillation modes of e.g. a bridge or a musical instrument.

The last two examples are Sturm-Liouville problems, which in many cases are directly related to Fourier analysis. Thus, we find new links between computational methods and advanced analysis.

One of the great insights of applied mathematics is that there are applications in vastly differing areas that still share the same structure of their equations. For example, it is by no means obvious that buckling problems, first solved by Euler in the middle of the 18th century, satisfy the same (or nearly the same) equation as Schrödinger's particle-in-a-box problems of the 20th century, or could be used in the design of musical instruments. The applications in macroscopic materials science and in quantum mechanics appear to have nothing in common, but mathematics tells us otherwise. This makes it possible to develop common techniques in scientific computing, with only minor changes in details. The overall methodology will still be the same, having an impact in all of these areas.

3.3 Partial Differential Equations

Partial differential equations are characterized by having two or more independent variables. Here we shall limit ourselves to the simplest case, where the two independent variables are time and space (in 1D). This is simply the combination of initial and boundary value problems, without having to approach the difficulty of representing geometry in space.

While IVPs and BVPs have their own special interests, it is in PDEs that differential equations and scientific computing get really exciting. The particular difficulties of initial and boundary value problems are present simultaneously, and conspire to create a whole new range of difficulties, often far harder than one would have expected.
As mentioned above, we are mainly going to consider two equations, the parabolic diffusion equation

\[ u_t = u_{xx}, \tag{3.5} \]

and the hyperbolic advection equation,

\[ u_t = u_x. \tag{3.6} \]

The latter is also often referred to as the convection equation. The space operators can be combined, and the equation

\[ u_t = u_{xx} + u_x \tag{3.7} \]

is usually referred to as the convection-diffusion equation. It has two differential operators in space, $\partial^2/\partial x^2$ (giving rise to diffusion), and $\partial/\partial x$ (creating convection/advection). Convection and advection are transport phenomena, usually associated with some material flow, like in hydrodynamics or gas dynamics, and model wave propagation. By contrast, diffusion does not require a mass flow, and (3.5) is the standard model for heat transfer. The combined equation, possibly including further terms, is the simplest model equation for problems in heat and mass transfer.

We have previously seen that in IVPs, equations of the form $\dot{u} = f(u)$ can be used to model chemical reactions. For this reason, if we add such a term to the diffusion equation (3.5), we obtain

\[ u_t = u_{xx} + f(u), \]

known as the reaction-diffusion equation. Here $f$ depends on the dependent variable $u$, but in case it only depends on the independent variable $x$, it is a source term, and the equation becomes

\[ u_t = u_{xx} + f(x). \]

This is a plain diffusion equation with source term.

Many time-dependent problems involving diffusion have stationary solutions. This means that after a long time, equilibrium is reached, and there is no longer any time evolution. The stationary solution is independent of time, implying that $u_t = 0$. In the last equation, the equilibrium state therefore satisfies

\[ 0 = u_{xx} + f(x). \]

We already encountered this equation in the context of 2p-BVPs. Thus $-u'' = f(x)$, which up to the sign convention for the source term is the Poisson equation (3.3). While the time dependent problems considered above are either parabolic (both $u_t$ and $u_{xx}$ are present) or hyperbolic ($u_t$ is present, but the highest order derivative in space is $u_x$), the stationary equation is different; the Poisson equation is elliptic. We will later have a closer look at the classification of these equations.
For the time being, let us just note that there are elliptic, parabolic and hyperbolic equations, and that the 2p-BVP (3.3) represents elliptic problems, so long as there is only one space dimension.

With the exception of special equations such as Poisson's, the names of the equations we consider usually only list the terms included in the right hand side, provided that the left hand side only contains the time derivative $u_t$. For example, by including the first derivative $u_x$ in the diffusion equation, we obtained the convection-diffusion equation. Likewise, if we also include the reaction term $f(u)$, we get
$$u_t = u_{xx} + u_x + f(u),$$

known as the convection-diffusion-reaction equation. Such names are practical, as the different terms often put different requirements on the solution techniques. Thus the name of the equation tells us about what difficulties we expect to encounter.

Elliptic problems have characteristics that are similar to plain 2p-BVPs. Parabolic and hyperbolic equations, on the other hand, are both time- and space-dependent, and are more complicated. Parabolic equations are perhaps somewhat simpler, while hyperbolic equations are particularly difficult because they represent conservation laws. They have wave solutions without damping, and one of the main difficulties is to create discretization methods that replicate the conservation properties. As very few numerical methods conserve energy, qualitative differences between the exact and numerical solutions soon become obvious.

There are also second order equations in time, with the most obvious example being the classical wave equation,
$$u_{tt} = u_{xx}.$$
This equation is closely related to the advection equation, $u_t = u_x$, and both equations are hyperbolic.

The most interesting problems, however, are nonlinear. A famous test problem is the inviscid Burgers equation,
$$u_t + u u_x = 0,$$
which again is a hyperbolic conservation law with wave solutions. However, due to the nonlinearity, this equation may develop discontinuous solutions, corresponding to shock waves (cf. "sonic booms"). Such solutions are extremely difficult to approximate, and we shall have to introduce a new notion, that of weak solutions. These do not have to be differentiable at all points in time and space, and therefore only satisfy the equation in a generalized sense.

The situation is somewhat relaxed if diffusion is also present, such as in the viscous Burgers equation,
$$u_t + u u_x = u_{xx}.$$
The diffusion term on the right will then introduce dissipation, representing viscosity, meaning that wave energy will dissipate over distance.
(The inviscid Burgers equation has no viscosity term.) The solution may initially be discontinuous, but over time it becomes increasingly regular and eventually becomes smooth. The viscous Burgers equation is a simple model in several applications, for example modeling seismic waves. Because of the presence of the diffusion term, this equation is no longer hyperbolic but parabolic. However, if the diffusion is weak, wave phenomena will still be observed over a considerable range in time and space. For this reason, the viscous Burgers equation is sometimes referred to by the oxymoron "parabolic wave equation".

The study of PDEs is an extremely rich field, with applications e.g. in materials science, fluid flow, electromagnetics, potential problems, field theory, and multiphysics. Multiphysics is any area that combines two or more fields of application,

as has been suggested by the many combinations of terms above. One example is fluid-structure interaction, such as air flow over an elastic wing, or blood flow inside the heart. Because of the different nature of the equations, there is typically a need for dedicated numerical methods, even though many components of such methods are common for special classes of problems. The area is a highly active field of research.

From the vast variety of applications, it is clear that some difficulties will be encountered in the numerical solution of differential equations. Although basic principles covering e.g. discretization and polynomial approximation apply, it is still necessary to construct methods with a special focus on the class of problems we want to address. Even so, the basic principles are few, and the challenge is to find the right combination of techniques in each separate case.

Chapter 4 Summary: Objectives and methodology

One of the pioneers in scientific computing, Peter D. Lax, has said that "The computer is to mathematics what the telescope is to astronomy, and the microscope is to biology." Thus ever faster computers, with ever larger memory capacity, allow us to consider more and more advanced mathematical models, and it is possible to explore by numerical simulation how complex mathematical models behave. Naturally, this leads to large-scale computations that cannot be carried out by analytical methods. The role of scientific computing is to bridge the gap between mathematics proper and computable approximations.

Although our focus is on differential equations, the previous chapters have introduced a variety of basic principles in scientific computing, and outlined why they are needed in order to construct approximate solutions to mathematical problems. In particular, we noted that only problems in linear algebra are computable, in the sense that an exact solution can be constructed in a finite number of steps. By contrast, nonlinear problems, and problems from analysis (such as differential equations) can only be solved by approximate methods.

The purpose of scientific computing is
• to construct and analyze methods for solving specific problems in applied mathematics
• to construct software implementing such methods for solving applied mathematics problems on a computer.

Scientific computing is a separate field of study because conventional, analytical mathematical techniques have a very limited range. Few problems of interest can be solved analytically. Instead, we must use approximate methods. This can be justified by taking great care to prove that the computational methods are convergent, and therefore (at least in principle) able to find approximate solutions to any prescribed accuracy. This extends the analytical techniques by a systematic use of numerical

computing, which, from the scientific point of view, is subject to no less rigorous standards than conventional mathematics.

The methodology of scientific computing is based on conventional mathematics, with the usual proof requirements built on classical results in continuous as well as discrete mathematics. A few principles stand out as the main building blocks in all computational methods. These are
• Standard techniques in linear algebra
• A systematic use of polynomials, including trigonometric polynomials
• The principle of discretization, which brings a problem from analysis to a problem in algebra
• The principle of linearization, which approximates a nonlinear problem locally by a linear problem
• The principle of iteration, which constructs a sequence of approximations converging to the final result.

All numerical methods are constructed by using elements of these principles. This does not mean that the methods are similar, or that a trick that works in one problem context also works in another. In fact, the variety of techniques used is very large; without recognizing the basic principles, one can easily be overwhelmed by the technicalities and miss the broader patterns that are characteristic of a general approach.

On top of the approximation techniques mentioned above, all computations are carried out in finite precision arithmetic, usually defined by the IEEE 754 standard. This will lead to roundoff errors. It is a common misunderstanding about scientific computing that the results are erroneous due to roundoff. However, most computational algorithms are little affected by roundoff, as other errors, such as discretization errors, linearization errors and iteration errors, typically dominate. The only problems that are truly affected by roundoff are problems from linear algebra solved by finite algorithms.
Such problems are unaffected by discretization, linearization and iteration errors, meaning that only roundoff remains. Apart from computing approximate solutions, scientific computing is also concerned with error bounds and error estimates. Thus we are not only concerned with obtaining an approximation, but also the accuracy of the computation. Finally we are interested in obtaining these results reasonably fast. In order of importance, computational methods must be 1. Stable 2. Accurate 3. Reliable 4. Efficient 5. Easy to use.

Stability is priority number one, as unstable algorithms are unable to obtain any accuracy at all; instability will ruin the results. If stable algorithms can be devised, we are interested in obtaining accuracy. Accurate algorithms should also be reliable and robust, and be able to solve broad classes of problems; high accuracy on a few selected test problems is not good enough. As part of the reliability, we usually also want reliable error estimates, indicating the accuracy obtained. Once these criteria have been met, we want to address efficiency. Efficiency is a very broad issue, ranging from data structures allowing efficient memory use, to adaptivity, which means that the software tunes itself automatically to the mathematical properties of the problem at hand. Finally, when efficient software has been constructed, we want ease of use, meaning that various types of interfaces are needed to set up problems (entering or generating data, e.g. the geometry of the computational domain), as well as postprocessing tools, such as visualization software.

In this introduction to scientific computing, we focus less on a rigorous mathematical treatment of method construction and analysis, and more on an understanding of elementary computational methods, and how they interact with standard problems in applied mathematics. This entails understanding what the original equations represent, meaning that we will emphasize the links from physics, via mathematics, to methods and actual computational results. These will also be investigated in model implementations. We need to understand the basic mathematics of the problems, the properties of our computational methods, and how to assess their performance.

Assessing performance is a difficult matter. It is invariably done by trying out the method on various well-known test problems and benchmarks, but the assessment is never complete.
Ideally, we want to infer some general conclusions, but have to keep in mind that any numerical test only yields results for that particular problem. Since we are going to use standard test problems and benchmarks, which may often have a known analytical solution, we work in an idealized setting. In real computational problems, the exact solution is never known. However, unless we can demonstrate that standard benchmarks can be solved correctly, with both stability and accuracy, there is little hope that the method would work for the advanced problems. Many of our benchmarks will use simple tools of visualization. For example, we typically want to demonstrate that an implementation is correct by verifying its theoretical convergence order. This is usually done in graphs, like those used in the section on discretization above. There we saw use of lin-lin graphs, lin-log graphs and log-log graphs. This may appear to be a simple remark, but part of the skill in scientific computing is in visualizing the results in a proper way. Therefore it is important to master these techniques, and carefully try out the best tools in each situation. As we have remarked previously, log-log diagrams are used for power laws. Likewise, lin-log diagrams are preferred for exponentials, but there are numerous other situations where scaling is key to revealing the relevant information.
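Verifying a convergence order, as described above, amounts to reading off the slope of the error in a log-log diagram. The following is a minimal sketch under assumptions that are not in the text: the test problem is $\dot y = -y$, $y(0) = 1$ on $[0,1]$, the method is the explicit Euler method (treated later in the book), and the helper name `euler_error` is hypothetical. Halving the step size and comparing errors gives the observed order, which is the slope between two neighboring points in the log-log graph.

```python
import math

def euler_error(n_steps):
    """Error at t = 1 of explicit Euler applied to y' = -y, y(0) = 1."""
    h = 1.0 / n_steps
    y = 1.0
    for _ in range(n_steps):
        y += h * (-y)
    return abs(y - math.exp(-1.0))

step_counts = [10, 20, 40, 80, 160]
errors = [euler_error(n) for n in step_counts]

# Observed order between consecutive runs: log(e_N / e_2N) / log(2).
orders = [math.log(errors[i] / errors[i + 1]) / math.log(2.0)
          for i in range(len(errors) - 1)]

for n, e in zip(step_counts, errors):
    print(f"N = {n:4d}   error = {e:.3e}")
print("observed orders:", [round(p, 3) for p in orders])
```

For a first order method the observed orders should approach 1 as the step size decreases; a power law error $C h^p$ shows up as a straight line of slope $p$ in the log-log diagram.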

Part II Initial Value Problems

Chapter 5 First Order Initial Value Problems

Initial value problems occur in a large number of applications, from rigid-body mechanics, via electrical circuits to chemical reactions. They describe the time evolution of some process, and the task is usually to predict how a given system will evolve. Most standard software is written for initial value problems of the form
$$\dot y = f(t,y); \quad y(0) = y_0, \tag{5.1}$$
where $f : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ is a given function, and where the initial condition $y_0$ is also given.

Before constructing computational methods, it is often a good idea to verify, through mathematical means, whether there exists a solution and whether the solution is unique. A somewhat more ambitious approach is to verify that the initial value problem is well posed. This means that we need to demonstrate that there is a unique solution, which depends continuously on the data. This usually means that a small change in the initial value, or a small change in a forcing function, will only have a minor effect on the solution to the problem. While such a theory exists for initial value problems in ordinary differential equations, much less is known in partial differential equations. Thus, fortunately, the theory of initial value problems is quite comprehensive compared to other areas in differential equations.

5.1 Existence and uniqueness

The classical result on existence and uniqueness is built on the continuity properties of the function $f$.

Definition 5.1. Let $f : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ be a given function, and define its Lipschitz constant with respect to the second argument by

$$L[f] = \sup_{u \neq v} \frac{\|f(t,u) - f(t,v)\|}{\|u - v\|}. \tag{5.2}$$
Obviously, the Lipschitz constant depends on $t$, and we shall also assume that the function $f$ is continuous with respect to time $t$.

The classical existence and uniqueness result on (5.1) is the following.

Theorem 5.1. Let $f : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ be a given function, and assume that it is continuous with respect to time $t$, and that it is Lipschitz continuous with respect to $y$, with $L[f] < \infty$. Then there is a unique solution to (5.1) for $t \geq 0$.

If $f$ is a linear, constant coefficient system, i.e., $\dot y = Ay$, then
$$L[f] = \sup_{u \neq v} \frac{\|Au - Av\|}{\|u - v\|} = \sup_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \|A\|, \tag{5.3}$$
where $\|A\|$ is the norm of the matrix $A$ induced by the vector norm. Thus, for a linear map $A$, the Lipschitz constant is just the operator norm.

But nonlinear functions are harder to deal with. In general, very few nonlinear maps satisfy a Lipschitz condition on all of $\mathbb{R}^d$. If $f$ satisfies such a condition on a bounded domain $D$, then one can guarantee the existence of a solution up to the point where the solution $y$ reaches the boundary of $D$. A classical example is the following.

Example. Consider the IVP
$$\dot y = y^2; \quad y(0) = y_0 > 0,$$
with analytical solution
$$y(t) = \frac{y_0}{1 - t y_0}.$$
The function $f(t,y) = y^2$ obviously does not satisfy a Lipschitz condition on all of $\mathbb{R}$, but only on a finite domain. In fact, inspecting the solution, we see that the solution remains bounded only up to $t = 1/y_0$, when the solution blows up. Initial value problems having this property are said to have finite escape time. Thus, no matter how large the region $D$ is where the Lipschitz condition holds, the solution will reach the boundary of $D$ in finite time and escape.

There are also other examples, when the Lipschitz condition does not hold.

Example. Consider the IVP
$$\dot y = \sqrt{y}; \quad y(0) = 0.$$
This problem does not have a unique solution. The solution may be chosen as $y(t) = 0$ on the interval $t \in [0, 2\tau]$, with $\tau \geq 0$ an arbitrary number, followed by $y(t) = (t/2 - \tau)^2$ for $t > 2\tau$.
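The lack of uniqueness in the last example can be checked numerically. The sketch below is an illustration only; the helper names `delayed_solution` and `max_residual`, the sample values of $\tau$, and the centered difference used for the derivative are all assumptions made here. It confirms that every member of the family, which stays at zero until $t = 2\tau$ and then follows $(t/2 - \tau)^2$, satisfies $\dot y = \sqrt{y}$ with the same initial value $y(0) = 0$.

```python
import math

def delayed_solution(t, tau):
    """y(t) = 0 for t <= 2*tau, followed by (t/2 - tau)^2."""
    return 0.0 if t <= 2.0 * tau else (0.5 * t - tau) ** 2

def max_residual(tau, t_end=5.0, n=500, h=1e-5):
    """Largest value of |y'(t) - sqrt(y(t))| on a grid, with y' from a centered difference."""
    worst = 0.0
    for i in range(1, n):
        t = t_end * i / n
        dydt = (delayed_solution(t + h, tau) - delayed_solution(t - h, tau)) / (2.0 * h)
        worst = max(worst, abs(dydt - math.sqrt(delayed_solution(t, tau))))
    return worst

residuals = {tau: max_residual(tau) for tau in (0.0, 0.5, 1.0, 2.0)}
for tau, r in residuals.items():
    print(f"tau = {tau}: max residual = {r:.2e}, y(5) = {delayed_solution(5.0, tau)}")
```

All residuals are at the level of the difference approximation, while the values at $t = 5$ differ, so these are genuinely different solutions of the same initial value problem.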
While this might seem like a contrived problem, it can actually arise, largely due to poor mathematical modeling. Thus, assume that we want to model the free fall of a particle of mass $m$. Its potential energy is $U = -mgy$ (with $y$ measured downward), where $g$ is the gravitational acceleration, and the kinetic energy is $T = m\dot y^2/2$. The total energy is the sum of the two, i.e.,

$$E = -mgy + \frac{m\dot y^2}{2}.$$
Let us assume (a matter of normalization) that $E = 0$. Then, obviously $\dot y^2 = 2gy$, and, because $y \geq 0$, $\dot y = \sqrt{2gy}$. To obtain the original equation, let us choose units so that $g = 1/2$. Then $\dot y = \sqrt{y}$.

The model is, at least in principle, correct. However, it is unfortunate, since we choose the initial condition $y(0) = 0$ and the constant $E = 0$. Although this choice may seem natural, it just happens to be at the singularity of the right-hand side function $f$ of the differential equation, and we do not necessarily get the expected solution, $y(t) = t^2/4$.

Thus, when $f(y) = \sqrt{y}$, we have $f'(y) = 1/(2\sqrt{y})$. Hence $f'(y)$ is not defined at the initial value $y = 0$, and since the Lipschitz constant $L[f] \geq \max |f'(y)|$, the Lipschitz condition is not satisfied at the starting point either. Therefore, we are not guaranteed a unique solution.

What this example shows is that even if one uses sensible mathematical modeling, one may end up with a poor mathematical description, lacking unique solutions. This appears to be less understood; even so, one needs to be aware that not all models, and not all normalizations of coordinate systems and initial values, will lead to proper models.

Example. A variant of the same problem is obtained by modeling a leaky bucket. Assuming that a cylindrical bucket initially contains water up to a level $y(0) = y_0$ and that the water runs out of a hole in the bottom, we apply Bernoulli's law, according to which the flow rate $v$ out of the hole satisfies $v^2 \propto p$, where $p$ is the pressure at the opening. Because of the cylindrical shape of the bucket, the pressure is proportional to the level $y$ of water. Likewise, if the flow rate is $v$, then $\dot y \propto v$. Dropping all proportionality constants, it follows that $\dot y^2 = y$, so that, since the level decreases,
$$\dot y = -\sqrt{y}; \quad y(0) = y_0.$$
Even though this equation is similar to the former, we do not face the same problem, since we start at a different initial condition.
The problem can be solved analytically, and
$$y(t) = \left( \sqrt{y_0} - \frac{t}{2} \right)^2.$$
Thus we find that the bucket will be empty ($y = 0$) at time $t = 2\sqrt{y_0}$ (where, again, proportionality constants have been neglected).

This problem has an interesting connection to the former. The previous problem, of modeling the free fall of a mass, leads to essentially the same equation. While the free fall problem did not have a unique solution, the leaky bucket problem does. However, the relation between the two problems is that the free fall problem is equivalent to the leaky bucket problem in reverse time. Thus, if the bucket is currently empty, we cannot answer the question of when it was last full. Such a question is not well posed. This is evident in the leaky bucket problem, but less so in the free fall problem.

As the examples above demonstrate, questions of existence, uniqueness and well-posedness are of great importance not only to mathematics, but also to the applied mathematician and the numerical analyst. In general it is of importance to obtain a good understanding of whether the problems we want to solve have unique solutions that depend continuously on the data. If this is not the case, we can hardly expect our computational methods to be successful.
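The emptying time of the leaky bucket can also be estimated by direct simulation. The following is a rough sketch under assumptions made here, not a method from the text: the equation $\dot y = -\sqrt{y}$ is stepped forward with the explicit Euler method (introduced later in the book) using a small fixed step, and the function name `emptying_time` is hypothetical. The computed time is compared with the analytical value $2\sqrt{y_0}$.

```python
import math

def emptying_time(y0, h=1e-3, max_steps=10_000_000):
    """Step y' = -sqrt(y) forward with explicit Euler until the level reaches zero."""
    y, n = y0, 0
    while y > 0.0 and n < max_steps:
        y -= h * math.sqrt(y)   # y > 0 here, so the square root is defined
        n += 1
    return n * h

for y0 in (1.0, 4.0):
    print(f"y0 = {y0}: simulated {emptying_time(y0):.4f}, analytical {2.0 * math.sqrt(y0):.4f}")
```

The forward problem is well behaved: the simulated emptying time agrees with $2\sqrt{y_0}$ up to the discretization error. Running the same recursion backwards from $y = 0$ would leave the level at zero forever, which reflects the ill-posed reverse-time question discussed above.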

Even so, it is often impossible to verify that Lipschitz conditions are satisfied on sufficiently large domains. In fact, this is often neglected in practice, but one must then be aware of the risk of the odd surprise. Failures to solve problems are not uncommon, in which case it is also important to understand why there is a failure. Is it because of problem properties, or because of an unsuitable choice of computational method? Mastering theory is never a waste of time. Nothing is as practical as a good theory.

5.2 Stability

Stability is a key concept in all types of differential equations. It will appear numerous times in this book. It will refer to the stability of the original mathematical problem, as well as to the stability of the discrete problem, and, which is more complex, to the numerical method itself. By and large, these concepts are broadly related in the following way: if the mathematical problem is stable, then the discrete problem will be stable as well, provided that the numerical method is stable. This relation will, however, not hold without qualifications. For this reason, we will see that stability plays the central role in the numerical solution of differential equations.

For initial value problems, stability refers to the question whether two neighboring solutions will remain close when $t \to \infty$. Here our original IVP is
$$\frac{d}{dt} y = f(t,y); \quad y(0) = y_0,$$
and we will consider a perturbed problem
$$\frac{d}{dt}(y + \Delta y) = f(t, y + \Delta y); \quad \Delta y(0) = \Delta y_0.$$
It then follows that the perturbation satisfies
$$\frac{d}{dt} \Delta y = f(t, y + \Delta y) - f(t,y); \quad \Delta y(0) = \Delta y_0.$$
In classical Lyapunov stability theory, we ask whether for every $\varepsilon > 0$ there is a $\delta > 0$ such that
$$\|\Delta y_0\| \leq \delta \;\Rightarrow\; \|\Delta y(t)\| \leq \varepsilon,$$
for all $t > 0$. If this holds, the solution to the IVP obviously depends continuously on the data, i.e., the initial value.
The interpretation is that, even if we perturb the initial value by a small amount, the perturbed solution will never deviate by more than a small amount from the solution of the original problem, even when $t \to \infty$. We then say that the solution $y(t)$ is stable. If, in addition, $\|\Delta y(t)\| \to 0$ as $t \to \infty$, then we say that $y(t)$ is asymptotically stable. And if $\|\Delta y(t)\| \leq K e^{-\alpha t}$ for some positive constants $K$ and $\alpha$, then $y(t)$ is exponentially stable. There are also further qualifications that offer additional stability notions.
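As a worked illustration of these definitions (an added example, not part of the original text), consider the scalar problem $\dot y = \lambda y$ with real $\lambda$. It is chosen here only because the perturbation can be written in closed form, so that the three stability notions can be read off directly.

```latex
\begin{align*}
\frac{d}{dt}\Delta y &= \lambda (y + \Delta y) - \lambda y = \lambda\,\Delta y,
\qquad \Delta y(0) = \Delta y_0
\quad\Longrightarrow\quad
\Delta y(t) = e^{\lambda t}\,\Delta y_0, \\[4pt]
|\Delta y(t)| &= e^{\lambda t}\,|\Delta y_0|
\begin{cases}
\le K e^{-\alpha t}, \quad K = |\Delta y_0|,\ \alpha = -\lambda > 0,
  & \lambda < 0 \quad \text{(exponentially stable)}, \\
= |\Delta y_0| \le \varepsilon \ \text{ whenever } \delta = \varepsilon,
  & \lambda = 0 \quad \text{(stable, not asymptotically stable)}, \\
\to \infty \ \text{ as } t \to \infty \ \text{ for every } \Delta y_0 \neq 0,
  & \lambda > 0 \quad \text{(unstable)}.
\end{cases}
\end{align*}
```

In the first two cases the choice $\delta = \varepsilon$ works in the Lyapunov definition; in the third, no $\delta > 0$ can keep the perturbation below a given $\varepsilon$ for all $t > 0$.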

Chapter 6 Stability theory

Because stability plays such a central role in the numerical treatment of differential equations, we shall devote a chapter to laying down some foundations of stability theory. The details vary between initial value problems and boundary value problems; between differential equations and their discretizations (usually some form of difference equations); and between linear systems and nonlinear systems. However, the stability notions have something in common: the solutions to our problem must have a continuous dependence on the data, even though what constitutes "data" also varies.

The continuous data dependence would not be such a special issue if it did not also include some transfinite process. Thus we are interested in how solutions behave as $t \to \infty$ in the initial value case, or in how a discretization behaves as the number of discretization points $N \to \infty$. As long as we study ordinary differential equations, we are interested in single parameter limits, but in partial differential equations we face multiparametric limits, which makes stability an intricate matter.

As we cannot deal with all these issues at once, this chapter will focus on the immediate needs for first order initial value problems, but we will nevertheless develop tools that allow extensions also to other applications.

6.1 Linear stability theory

By considering a linear, constant coefficient problem, the stability notions become clearer, since linear systems have a much simpler behavior. Thus, if
$$\frac{d}{dt} y = Ay; \quad y(0) = y_0$$
and
$$\frac{d}{dt}(y + \Delta y) = A(y + \Delta y); \quad \Delta y(0) = \Delta y_0,$$

it follows that
$$\frac{d}{dt} \Delta y = A \Delta y; \quad \Delta y(0) = \Delta y_0.$$
We note that the perturbed differential equation is identical to the original problem. Thus we may consider the original linear system directly, and whether it has solutions depending continuously on $y_0$. In particular, no matter how we choose $y_0$, the solution $y(t)$ must remain bounded. Since $y = 0$ is a solution of the problem (for $y_0 = 0$), we say that the zero solution is stable if $y(t)$ remains bounded for all $t > 0$. But as this applies to every solution, we can speak of a stable linear system.

We begin by considering a linear system with constant coefficients,
$$\dot y = Ay; \quad y(0) = y_0.$$
Does the solution grow or decay? The exact solution is
$$y(t) = e^{tA} y_0,$$
and it follows that
$$\|y(t)\| \leq \|e^{tA}\| \, \|y_0\|.$$
Therefore, $\|y(t)\| \to 0$ as $\|y_0\| \to 0$, provided that, for all $t \geq 0$, it holds that
$$\|e^{tA}\| \leq C. \tag{6.1}$$
This is the crucial stability condition, and the question is: for what matrices $A$ is $\|e^{tA}\|$ bounded? The answer is well known, and is formulated in terms of the eigenvalues of $A$. Let us therefore make a brief excursion into eigenvalue theory, which will play a central role not only for initial value problems but also in boundary value problems and partial differential equations.

Definition 6.1. The set of all eigenvalues of $A \in \mathbb{R}^{d \times d}$ will be denoted by
$$\lambda[A] = \{\lambda \in \mathbb{C} : Au = \lambda u \text{ for some } u \neq 0\},$$
and is referred to as the spectrum of $A$.

Whenever the eigenvalues and eigenvectors are numbered, we write
$$A u_k = \lambda_k[A] \, u_k$$
for $k = 1 : d$, where $u_k$ is the $k$th eigenvector, associated with the $k$th eigenvalue $\lambda_k[A]$. For $A \in \mathbb{R}^{d \times d}$ there are always $d$ eigenvalues (counted with multiplicity). If the eigenvalues are distinct, then there are also $d$ linearly independent eigenvectors, and the matrix can be diagonalized. Thus, writing the eigenproblem
$$AU = U\Lambda,$$
with $\Lambda = \operatorname{diag} \lambda_k$, and the eigenvectors arranged as the $d$ columns of the $d \times d$ matrix $U$, we have $U^{-1} A U = \Lambda$. Now, if $Au = \lambda u$, it follows that $A^2 u = \lambda A u = \lambda^2 u$, and in general,

$$\lambda_k[A^p] = (\lambda_k[A])^p,$$
for every power $p \geq 0$. This motivates that we write $\lambda[A^p] = \lambda^p[A]$. From this it follows that if $P$ is any polynomial, we have $\lambda[P(A)] = P(\lambda[A])$. Hence:

Lemma 6.1. Let $P$ be a polynomial of any degree, and let $\lambda[A]$ denote the spectrum of a matrix $A \in \mathbb{R}^{d \times d}$. Then
$$\lambda[P(A)] = P(\lambda[A]). \tag{6.2}$$

If the matrix is diagonalizable, i.e., $U^{-1} A U = \Lambda$, then $U^{-1} A^p U = \Lambda^p$. For the exponential function, which is defined by
$$e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!},$$
it holds that $\lambda[e^A] = e^{\lambda[A]}$, even though the exponential is a power series rather than a polynomial. This follows from
$$e^A u_k = \sum_{n=0}^{\infty} \frac{A^n u_k}{n!} = \sum_{n=0}^{\infty} \frac{\lambda_k^n[A] \, u_k}{n!} = e^{\lambda_k[A]} u_k.$$
Although this result does not rely on $A$ being diagonalizable, it also holds that if $U^{-1} A U = \Lambda$, then
$$U^{-1} e^A U = e^{U^{-1} A U} = e^{\Lambda}.$$
In fact, if $f(z)$ is any analytical function, then $U^{-1} f(A) U = f(U^{-1} A U) = f(\Lambda)$. More generally, we have

Lemma 6.2. Let $A \in \mathbb{R}^{d \times d}$, and let $\lambda[A]$ denote the spectrum of a matrix. Then $\lambda[e^A] = e^{\lambda[A]}$, and
$$|\lambda[e^{tA}]| = |e^{t\lambda[A]}| = e^{t \operatorname{Re} \lambda[A]}. \tag{6.3}$$

It immediately follows that $e^{tA}$ is bounded as $t \to \infty$ if $\lambda[A] \subset \mathbb{C}^-$, i.e., if the eigenvalues of $A$ are located in the left half plane. Eigenvalues with zero real part are also acceptable if they are simple.

The result of Lemma 6.1 can also be extended. We note that if $A$ is nonsingular, it holds that
$$Au = \lambda u \;\Rightarrow\; \lambda^{-1} u = A^{-1} u.$$
Therefore $\lambda_k[A^p] = (\lambda_k[A])^p$ holds also for negative powers, if only $A$ is invertible, so that no eigenvalue is zero. It follows that if a polynomial $Q(z)$ has the property that $\lambda[Q(A)] = Q(\lambda[A]) \neq 0$, then $Q^{-1}(A)$ exists. This implies that Lemma 6.1 can be extended to rational functions. This will be useful in connection with Runge-Kutta methods. Thus we have the following extension:

Lemma 6.3. Let $R(z) = P(z)/Q(z)$ be a rational function, where the degrees of the polynomials $P$ and $Q$ are arbitrary. Let $A \in \mathbb{R}^{d \times d}$, and let $\lambda[A]$ denote the spectrum of a matrix. If $Q(\lambda[A]) \neq 0$, then

$$\lambda[R(A)] = R(\lambda[A]), \tag{6.4}$$
where $R(A) = P(A) \, Q^{-1}(A) = Q^{-1}(A) \, P(A)$.

Although a linear system $\dot y = Ay$ is stable whenever the eigenvalues are located in the left half plane, this eigenvalue theory does not extend to non-autonomous linear systems $\dot y = A(t) y$ or to nonlinear problems $\dot y = f(t,y)$. As more powerful tools are needed, the standard approach is to analyze stability in terms of some norm condition on the vector field.

6.2 Logarithmic norms

Let us begin by considering the linear constant coefficient system $\dot y = Ay$ once more, and find an equation for the time evolution of $\|y\|$. It satisfies
$$\frac{d\|y\|}{dt^+} = \limsup_{h \to 0+} \frac{\|y(t+h)\| - \|y(t)\|}{h} = \lim_{h \to 0+} \frac{\|y(t) + hAy(t) + O(h^2)\| - \|y(t)\|}{h} \leq \lim_{h \to 0+} \frac{\|I + hA\| - 1}{h} \, \|y(t)\|,$$
where $d/dt^+$ denotes the right-hand derivative. This is used as we are interested in the forward time evolution of $\|y(t)\|$.

Definition 6.2. Let $A \in \mathbb{R}^{d \times d}$. The upper logarithmic norm of $A$ is defined by
$$M[A] = \lim_{h \to 0+} \frac{\|I + hA\| - 1}{h}. \tag{6.5}$$

This limit can be shown to exist for all matrices. From the derivation above, we have obtained the differential inequality
$$\frac{d\|y\|}{dt^+} \leq M[A] \, \|y\|. \tag{6.6}$$
This differential inequality is easily solved. Note that
$$\frac{d}{dt^+} \left( \|y\| \, e^{-tM[A]} \right) = e^{-tM[A]} \left( \frac{d\|y\|}{dt^+} - M[A] \, \|y\| \right) \leq 0.$$
Hence $\|y(t)\| \, e^{-tM[A]} \leq \|y(0)\|$, and it follows that $\|y(t)\| \leq e^{tM[A]} \|y(0)\|$ for all $t \geq 0$. Since $y(t) = e^{tA} y(0)$, we have:

Theorem 6.1. For every $A \in \mathbb{R}^{d \times d}$, for any vector norm, and for $t \geq 0$, it holds that
$$\|e^{tA}\| \leq e^{tM[A]}. \tag{6.7}$$

The reason why this is of interest is that we have the following result on stability:

Corollary 6.1. Let $A \in \mathbb{R}^{d \times d}$ and assume that $M[A] \leq 0$. Then
$$\|e^{tA}\| \leq 1, \tag{6.8}$$
for all $t \geq 0$.

Therefore, if the logarithmic norm is non-positive, the system is stable. Note that the logarithmic norm is not a norm in the proper sense of the word. Unlike a true norm, the logarithmic norm can be negative, which makes it especially interesting in connection with stability theory.

Table 6.1 shows how the logarithmic norm is calculated for the most common norms.

$$
\begin{array}{lll}
\text{Vector norm} & \text{Matrix norm} & \text{Log norm } M[A] \\
\hline
\|x\|_1 = \sum_i |x_i| & \max_j \sum_i |a_{ij}| & \max_j \bigl[ \operatorname{Re} a_{jj} + \sum_{i \neq j} |a_{ij}| \bigr] \\
\|x\|_2 = \sqrt{\sum_i |x_i|^2} & \sqrt{\rho[A^* A]} & \alpha[(A + A^*)/2] \\
\|x\|_\infty = \max_i |x_i| & \max_i \sum_j |a_{ij}| & \max_i \bigl[ \operatorname{Re} a_{ii} + \sum_{j \neq i} |a_{ij}| \bigr]
\end{array}
$$

Table 6.1 Computation of matrix and logarithmic norms. The functions $\rho[\cdot]$ and $\alpha[\cdot]$ refer to the spectral radius and spectral abscissa, respectively. The matrix $A^*$ is the (possibly complex conjugate) transpose of $A$.

Note that the logarithmic norm is easily computed for the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$, but that it is harder to compute it for the standard Euclidean norm, $\|\cdot\|_2$. We then need the spectral radius $\rho[\cdot]$, and the spectral abscissa, $\alpha[\cdot]$, defined by
$$\rho[A] = \max_k |\lambda_k[A]|, \qquad \alpha[A] = \max_k \operatorname{Re} \lambda_k[A]. \tag{6.9}$$

In order to use norms and logarithmic norms in the analysis that follows, we recall the properties of these norms.

Definition 6.3. A vector norm $\|\cdot\|$ satisfies the following axioms:
1. $\|x\| \geq 0$; $\|x\| = 0 \Leftrightarrow x = 0$
2. $\|\gamma x\| = |\gamma| \, \|x\|$
3. $\|x + y\| \leq \|x\| + \|y\|$.
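The formulas in the table are straightforward to evaluate. The sketch below is an illustration under assumptions made here: the $2 \times 2$ real test matrix is arbitrary, all function names are hypothetical, the Euclidean case uses the closed form for the largest eigenvalue of a symmetric $2 \times 2$ matrix, and the matrix exponential is approximated by a truncated Taylor series. It computes the three logarithmic norms, compares the maximum-norm value with the limit definition (6.5), and checks the bound (6.7) at a few times.

```python
import math

def norm_inf(A):
    """Operator norm induced by the maximum norm: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def log_norm_inf(A):
    n = len(A)
    return max(A[i][i] + sum(abs(A[i][j]) for j in range(n) if j != i) for i in range(n))

def log_norm_1(A):
    n = len(A)
    return max(A[j][j] + sum(abs(A[i][j]) for i in range(n) if i != j) for j in range(n))

def log_norm_2_real_2x2(A):
    """Spectral abscissa of (A + A^T)/2 for a real 2x2 matrix, in closed form."""
    a, d = A[0][0], A[1][1]
    b = 0.5 * (A[0][1] + A[1][0])
    return 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)

def expm_taylor(A, t, terms=60):
    """Truncated Taylor series for exp(tA); adequate for this small example only."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][m] * A[m][j] for m in range(n)) * t / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

A = [[-3.0, 2.0], [0.5, -2.0]]      # arbitrary test matrix
mu_inf = log_norm_inf(A)
mu_1 = log_norm_1(A)
mu_2 = log_norm_2_real_2x2(A)

# Difference quotient from definition (6.5), in the maximum norm.
h = 1e-6
I_plus_hA = [[(1.0 if i == j else 0.0) + h * A[i][j] for j in range(2)] for i in range(2)]
limit_estimate = (norm_inf(I_plus_hA) - 1.0) / h

# Check ||exp(tA)|| <= exp(t M[A]) in the maximum norm, with a small rounding margin.
bound_holds = all(norm_inf(expm_taylor(A, t)) <= math.exp(t * mu_inf) + 1e-12
                  for t in (0.1, 0.5, 1.0, 2.0))

print(mu_1, mu_2, mu_inf, limit_estimate, bound_holds)
```

For this matrix the logarithmic norm is negative in the maximum and Euclidean norms but zero in the 1-norm, which illustrates that the sign of the logarithmic norm depends on the chosen norm, whereas the eigenvalues do not.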

Definition 6.4. The operator norm induced by the vector norm $\|\cdot\|$ is defined by
$$\|A\| = \sup_{x \neq 0} \frac{\|Ax\|}{\|x\|}. \tag{6.10}$$

The operator norm is a matrix norm, and hence satisfies the same rules as the vector norm. However, being an operator norm, it has an additional property. By construction, it satisfies
$$\|Ax\| \leq \|A\| \, \|x\|, \tag{6.11}$$
from which it follows that the operator norm is submultiplicative. This means that
$$\|AB\| \leq \|A\| \, \|B\|. \tag{6.12}$$
This follows directly from $\|ABx\| \leq \|A\| \, \|Bx\| \leq \|A\| \, \|B\| \, \|x\|$.

The logarithmic norm has already been defined above by (6.5), by the limit
$$M[A] = \lim_{h \to 0+} \frac{\|I + hA\| - 1}{h}.$$
Thus it is defined in terms of the operator norm, which in turn is defined in terms of the vector norm. All three are thus connected, and the computation of these quantities is linked as shown in Table 6.1.

The logarithmic norm has a wide range of applications in both initial value and boundary value problems, as well as in algebraic equations. Later on, we shall also see that it can be extended to nonlinear maps and to differential operators. Like the operator norm, it has a number of useful properties that play an important role in deriving error and perturbation bounds.

Theorem 6.2. The upper logarithmic norm $M[A]$ of a matrix $A \in \mathbb{R}^{d \times d}$ has the following properties:
1. $M[A] \leq \|A\|$
2. $M[A + zI] = M[A] + \operatorname{Re} z$
3. $M[\gamma A] = \gamma \, M[A]$, $\gamma \geq 0$
4. $M[A + B] \leq M[A] + M[B]$
5. $\|e^{tA}\| \leq e^{tM[A]}$, $t \geq 0$

It is also easily demonstrated that the operator norm and the logarithmic norm are related to the spectral bounds (6.9). Thus
$$\alpha[A] \leq M[A]; \qquad \rho[A] \leq \|A\|. \tag{6.13}$$
Consequently, if $M[A] < 0$, then all eigenvalues $\lambda[A] \subset \mathbb{C}^-$ and $A$ is invertible.

We are now in a position to address more important stability issues. Let us consider a perturbed linear constant coefficient problem,

$$\dot y = Ay + p(t); \quad y(0) = y_0, \tag{6.14}$$
where $p$ is a perturbation function. This satisfies the differential inequality
$$\frac{d\|y\|}{dt^+} \leq M[A] \, \|y\| + \|p\|,$$
with solution
$$\|y(t)\| \leq e^{tM[A]} \|y_0\| + \int_0^t e^{(t-\tau)M[A]} \|p(\tau)\| \, d\tau.$$
We have already treated the case $p \equiv 0$, so let us instead consider the case $p \not\equiv 0$ and $y_0 = 0$, and answer the question of how large $\|y(t)\|$ can become given the perturbation $p$. Thus, letting
$$\|p\|_\infty = \sup_{t \geq 0} \|p(t)\|,$$
we have
$$\|y(t)\| \leq \|p\|_\infty \int_0^t e^{(t-\tau)M[A]} \, d\tau.$$
Assuming that $M[A] \neq 0$, evaluating the integral, we find the perturbation bound
$$\|y(t)\| \leq \frac{e^{tM[A]} - 1}{M[A]} \, \|p\|_\infty. \tag{6.15}$$
In case $M[A] > 0$ the bound grows exponentially. More interestingly, if $M[A] < 0$, the exponential term decays, and
$$\|y\|_\infty \leq -\frac{\|p\|_\infty}{M[A]}. \tag{6.16}$$
Then $\|y(t)\|$ can never exceed the bound given by (6.16). We summarize in a theorem:

Theorem 6.3. Let $\dot y = Ay + p(t)$ with $y(0) = 0$. Let $\|p\|_\infty = \sup_{t \geq 0} \|p(t)\|$. Assume that $M[A] \neq 0$. Then
$$\|y(t)\| \leq \frac{e^{tM[A]} - 1}{M[A]} \, \|p\|_\infty; \quad t \geq 0. \tag{6.17}$$
If $M[A] = 0$, then
$$\|y(t)\| \leq \|p\|_\infty \, t; \quad t \geq 0. \tag{6.18}$$
If $M[A] < 0$ it holds that
$$\|y\|_\infty \leq -\frac{\|p\|_\infty}{M[A]}. \tag{6.19}$$

Note that this theorem allows an exponentially growing bound in (6.17), a linearly growing bound in (6.18), and a uniform upper bound in (6.19). The latter is

the limit in (6.17) as $t \to \infty$ if $M[A] < 0$. Note that the different bounds are primarily distinguished by the sign of $M[A]$, which governs stability.

Let us simplify the problem further and assume that $p$ is constant in (6.14). Then, as $M[A] < 0$ implies that the exponential goes to zero (since $\lambda[A] \in \mathbb{C}^-$ by (6.13)), there must be a unique stationary solution $\bar y$ to (6.14), satisfying

$$0 = A\bar y + p.$$

Thus $\bar y = -A^{-1}p$, and it follows that

$$\|\bar y\| = \|A^{-1}p\| \le -\frac{\|p\|}{M[A]}, \qquad \text{so} \qquad \frac{\|A^{-1}p\|}{\|p\|} \le -\frac{1}{M[A]}.$$

Because $p$ is an arbitrary constant vector,

$$\sup_{p \neq 0} \frac{\|A^{-1}p\|}{\|p\|} = \|A^{-1}\| \le -\frac{1}{M[A]}.$$

Thus we have derived the following important, but less obvious result:

Theorem 6.4. Let $A \in \mathbb{R}^{d \times d}$ and assume that $M[A] < 0$. Then $A$ is invertible, with

$$\|A^{-1}\| \le -\frac{1}{M[A]}. \tag{6.20}$$

This result will be used numerous times in various situations, both in initial value problems and in boundary value problems. It is the simplest version of a more general result, known as the uniform monotonicity theorem.

The theory above collects some of the most important results in linear stability theory, both in terms of eigenvalues and in terms of norms and logarithmic norms. The theory is general and is valid for all norms. However, if we specialize to inner product norms (Hilbert space) we obtain stronger results, in the sense that they can be directly extended beyond elementary matrix theory.

6.3 Inner product norms

Inner products generate (and generalize) the Euclidean norm. They are defined as follows.

Definition 6.5. An inner product is a bilinear form $\langle \cdot, \cdot \rangle : \mathbb{C}^d \times \mathbb{C}^d \to \mathbb{C}$ satisfying
1. $\langle u, u \rangle \ge 0$; $\langle u, u \rangle = 0 \Leftrightarrow u = 0$

2. $\langle u, v \rangle = \overline{\langle v, u \rangle}$
3. $\langle u, \alpha v \rangle = \alpha \langle u, v \rangle$
4. $\langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle$,

generating the Euclidean norm by $\langle u, u \rangle = \|u\|_2^2$.

Above, the bar denotes the complex conjugate. In most cases we will only consider real inner products, in which case the bar can be neglected. However, the complex notation above enables us to also discuss operators with complex eigenvalues. An inner product generalizes the notion of scalar product. Apart from the properties listed above, it has a few more essential properties. One of the most important is the following:

Theorem 6.5. (Cauchy–Schwarz inequality) For all $u, v \in \mathbb{C}^d$, it holds that

$$-\|u\|_2 \|v\|_2 \le \mathrm{Re}\, \langle u, v \rangle \le \|u\|_2 \|v\|_2.$$

For $u, v \in \mathbb{R}^d$, it holds that

$$-\|u\|_2 \|v\|_2 \le \langle u, v \rangle \le \|u\|_2 \|v\|_2.$$

This distinguishes between the case of real and complex vector spaces. Note that when we have operators with complex conjugate eigenvalues, the corresponding eigenvectors are also complex, which more or less necessitates the use of complex vector spaces. Whether the vector space is real or complex, however, we always have the following:

Definition 6.6. The operator norm associated with $\langle \cdot, \cdot \rangle$ is

$$\|A\|_2^2 = \sup_{u \neq 0} \frac{\langle Au, Au \rangle}{\langle u, u \rangle} = \sup_{u \neq 0} \frac{\|Au\|_2^2}{\|u\|_2^2}.$$

For vectors in finite dimensions, we may denote the inner product by $\langle u, v \rangle = u^* v$, where $u^*$ denotes the transpose, or, in the complex case, the complex conjugate transpose. With this notation, we have $u^* u = \|u\|_2^2$, from which it follows that

$$\langle Au, Au \rangle \le \|A\|_2^2 \cdot \|u\|_2^2.$$

This is easily seen to be equivalent to the standard definition of the operator norm for a general choice of norm. For the logarithmic norm, the situation is similar, but we give an alternative definition for an inner product norm, since this is not only convenient, but also turns out to allow the logarithmic norm to be defined for a somewhat wider class of vector fields. In addition, we obtain a natural definition of a lower as well as the upper logarithmic norm.

Definition 6.7. The lower and upper logarithmic norms associated with the inner product $\langle \cdot, \cdot \rangle$ are defined as

$$m_2[A] = \inf_{u \neq 0} \frac{\mathrm{Re}\, \langle u, Au \rangle}{\|u\|_2^2}, \qquad M_2[A] = \sup_{u \neq 0} \frac{\mathrm{Re}\, \langle u, Au \rangle}{\|u\|_2^2}. \tag{6.21}$$

This implies that $m_2[A] = -M_2[-A]$, and that we have the following bounds,

$$m_2[A] \cdot \|u\|_2^2 \le \mathrm{Re}\, \langle u, Au \rangle \le M_2[A] \cdot \|u\|_2^2. \tag{6.22}$$

Again, if we write the inner product as $\langle u, v \rangle = u^* v$, we find that

$$\|A\|_2^2 = \sup_{u \neq 0} \frac{u^* A^* A u}{u^* u} \qquad \text{and} \qquad M_2[A] = \sup_{u \neq 0} \frac{\mathrm{Re}\, u^* A u}{u^* u}.$$

Thus the norm of a matrix, as well as the lower and upper logarithmic norms, are extrema of two quadratic forms, albeit with different matrices, $A^* A$ and $A$, respectively. It is therefore in order to investigate how these extrema are found. Let $C$ be a given matrix, and let

$$q(u) = \frac{\mathrm{Re}\, u^* C u}{u^* u}$$

denote the Rayleigh quotient formed by $C$ and the vector $u$. We will find its extrema by finding its stationary points, i.e., by solving the equation $\mathrm{grad}_u\, q = 0$. Now,

$$\mathrm{grad}_u\, q = \frac{\mathrm{Re}\left[ u^* u \, \mathrm{grad}_u (u^* C u) - u^* C u \, \mathrm{grad}_u (u^* u) \right]}{u^* u \cdot u^* u} = \frac{\mathrm{Re}\left[ u^* u \, (u^* C + u^* C^*) - u^* C u \, (2u^*) \right]}{u^* u \cdot u^* u} := 0,$$

which, upon (conjugate) transposition, gives the equation

$$\frac{C + C^*}{2}\, u = q(u)\, u$$

for the determination of the stationary points. This is obviously an eigenvalue problem, where $q(u)$ is an eigenvalue of the symmetric matrix $(C + C^*)/2$. Thus, in case we take $C = A^* A$, we obtain the eigenvalue problem

$$A^* A u = \lambda u,$$

showing that $\|A\|_2^2$ is the largest eigenvalue of $A^* A$. This eigenvalue is real and non-negative, and $\sigma^2 := \lambda_{\max}[A^* A]$ is the square of the maximal singular value of $A$. On the other hand, if we take $C = A$ (which is not a priori symmetric), we still end up with a symmetric eigenvalue problem for the stationary points,

$$\frac{A + A^*}{2}\, u = \lambda u.$$

The eigenvalues of $(A + A^*)/2$ are real, but they are not necessarily positive. In fact, we have just demonstrated that the logarithmic norm is given by

$$M_2[A] = \max_k \lambda_k \left[ \frac{A + A^*}{2} \right],$$

as was indicated in Table 6.2. The Euclidean norm is sometimes referred to as the spectral norm, as operator norms and logarithmic norms are determined by the spectrum of symmetrized operators associated with $A$. We summarize:

Theorem 6.6. In terms of the spectral radius and spectral abscissa, it holds that

$$\|A\|_2 = \sqrt{\rho[A^* A]}; \qquad M_2[A] = \alpha\left[ \frac{A + A^*}{2} \right]. \tag{6.23}$$

It remains to show that the alternative definition (6.21) is compatible with the previous definition (6.5) whenever $\|A\|_2 < \infty$ (which corresponds to a Lipschitz condition). Note that, as $h \to 0+$,

$$\begin{aligned}
\|I + hA\|_2 &= \sup_{u \neq 0} \sqrt{\frac{\|(I + hA)u\|_2^2}{\|u\|_2^2}} = \sup_{u \neq 0} \sqrt{\frac{u^* (I + hA)^* (I + hA) u}{u^* u}} \\
&= \sup_{u \neq 0} \sqrt{\frac{u^* u + h\, u^* (A + A^*) u + O(h^2)}{u^* u}} = \sup_{u \neq 0} \sqrt{1 + h\, \frac{u^* (A + A^*) u}{u^* u} + O(h^2)} \\
&= 1 + h \sup_{u \neq 0} \frac{u^* (A + A^*) u}{2\, u^* u} + O(h^2) = 1 + h\, M_2[A] + O(h^2).
\end{aligned}$$

Hence it follows that (6.21) and (6.5) represent the same limit, in case $\|A\|_2 < \infty$. However, we shall see later that (6.21) applies also in the case when $A$ is an unbounded operator.

A more important aspect of using inner products is that, since $\|u\|_2^2 = u^* u$ is differentiable,

$$\|u\|_2 \, \frac{\mathrm{d}\|u\|_2}{\mathrm{d}t} = \frac{1}{2} \frac{\mathrm{d}\|u\|_2^2}{\mathrm{d}t} = \frac{1}{2} \frac{\mathrm{d}(u^* u)}{\mathrm{d}t} = \frac{\dot u^* u + u^* \dot u}{2} = \mathrm{Re}\,(u^* \dot u).$$
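The eigenvalue characterizations in Theorem 6.6 can be tried out on a small example. The following is a minimal sketch, assuming a real $2 \times 2$ matrix so that the eigenvalues of the symmetric matrices $(A + A^{\mathrm{T}})/2$ and $A^{\mathrm{T}} A$ have a closed form. The helper names `sym_eigs_2x2` and `euclidean_measures`, and the test matrix itself, are illustrative choices and do not come from the text.

```python
import math

def sym_eigs_2x2(s11, s12, s22):
    """Eigenvalues (min, max) of the real symmetric matrix [[s11, s12], [s12, s22]]."""
    mean = 0.5 * (s11 + s22)
    radius = math.hypot(0.5 * (s11 - s22), s12)
    return mean - radius, mean + radius

def euclidean_measures(a):
    """Return (m2, M2, norm2) for a real 2x2 matrix a = [[a11, a12], [a21, a22]]."""
    (a11, a12), (a21, a22) = a
    # The symmetric part (A + A^T)/2 gives the lower and upper logarithmic norms.
    m2, M2 = sym_eigs_2x2(a11, 0.5 * (a12 + a21), a22)
    # The largest eigenvalue of A^T A is the squared Euclidean operator norm.
    g11 = a11 * a11 + a21 * a21
    g12 = a11 * a12 + a21 * a22
    g22 = a12 * a12 + a22 * a22
    _, lam_max = sym_eigs_2x2(g11, g12, g22)
    return m2, M2, math.sqrt(lam_max)

A = [[-2.0, 1.0], [0.0, -3.0]]   # illustrative matrix with eigenvalues -2 and -3
m2, M2, norm2 = euclidean_measures(A)
print("m2 =", m2, " M2 =", M2, " norm2 =", norm2)
```

For this matrix the spectral abscissa is $-2$ and the spectral radius is $3$, so the printed values can be compared against the inequalities $\alpha[A] \le M_2[A] \le \|A\|_2$ and $\rho[A] \le \|A\|_2$.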

Therefore, if we consider the linear system $\dot u = Au$, we can assess stability by investigating the projection of the derivative $\dot u$ on $u$, i.e.,

$$\mathrm{Re}\, u^* \dot u = \mathrm{Re}\, u^* A u \le M_2[A] \cdot \|u\|_2^2,$$

and it follows that

$$m_2[A] \cdot \|u\|_2 \le \frac{\mathrm{d}\|u\|_2}{\mathrm{d}t} \le M_2[A] \cdot \|u\|_2. \tag{6.24}$$

The upper bound is the same differential inequality as we had before, when the concept was introduced for general norms. The technique above is of special importance because it is a standard technique in partial differential equations, when $A$ represents a differential operator in space.

6.4 Matrix categories

Although general norms have their place in matrix theory and in the analysis of differential equations, inner product norms are particularly useful. The mathematics of Hilbert space plays a central role in most of applied mathematics, and will be the preferred setting in this book.

Inner products allow the notion of orthogonality. Thus two vectors are orthogonal if $\langle u, v \rangle = 0$. In line with the notation used above, we will write this $u^* v = 0$. Orthogonality is the key idea behind some of the best known methods in applied mathematics, such as the least squares method, and, in the present context, the finite element method. These methods find a best approximation by requiring that the residual is orthogonal to the span of the basis functions; hence the residual cannot be made any smaller in the inner product norm.

Although there are several ways of constructing inner product norms, we will let $\|u\|_2^2 = u^* u$ denote the associated norm unless a special construction is emphasized. In this section, however, the norm refers to the standard Euclidean norm.

Just like there is a (conjugate) transpose of a vector, there is a (conjugate) transpose of a matrix. The definition is

$$\langle u, Av \rangle = \langle A^* u, v \rangle$$

for all vectors $u, v$. Now, since $\langle u, v \rangle = u^* v$, we have

$$u^* A v = \langle u, Av \rangle = \langle A^* u, v \rangle = (A^* u)^* v,$$

so $(A^* u)^* = u^* A$, and $(A^*)^* = A$. We shall return to this in connection with differential operators, where $A^*$ is known as the adjoint of $A$.

| Property | Name | $\lambda_k[A]$ | $m_2[A]$ | $M_2[A]$ | $\|A\|_2$ |
|---|---|---|---|---|---|
| $A^* = A$ | symmetric | real | $-\alpha[-A]$ | $\alpha[A]$ | $\rho[A]$ |
| $A^* = -A$ | skew-symmetric | $i\omega_k$ | $0$ | $0$ | $\rho[A]$ |
| $A^* = A^{-1}$ | orthogonal | $e^{i\varphi_k}$ | $-\alpha[-A]$ | $\alpha[A]$ | $1$ |
| $A^* A = A A^*$ | normal | complex | $-\alpha[-A]$ | $\alpha[A]$ | $\rho[A]$ |
| | positive definite | | $> 0$ | | |
| | negative definite | | | $< 0$ | |
| | indefinite | | $< 0$ | $> 0$ | |
| | contraction | | | | $< 1$ |

Table 6.2 Matrix categories and logarithmic norms. $A^*$ is the (conjugate) transpose of $A$. All listed categories of matrices have (or can be arranged to have) orthogonal eigenvectors. The most general class is normal matrices; all categories above are normal

Within this framework, there are several important classes of matrices that we will encounter many times. These classes are characterized in Table 6.2. In addition, for each class, the spectral properties and logarithmic norms are given.

The names of the classes of matrices vary depending on the context. The names given in Table 6.2 refer to standard terminology for real matrices, $A \in \mathbb{R}^{d \times d}$. For complex matrices $A \in \mathbb{C}^{d \times d}$, the corresponding terms are, respectively, Hermitian; skew-Hermitian; unitary; and normal. For more general linear operators, such as linear differential operators, the terms are self-adjoint; anti-selfadjoint; unitary; and normal.

For example, in a linear system $\dot u = Au$, where $A$ is skew-symmetric, we have

$$\frac{\mathrm{d}}{\mathrm{d}t} \log \|u\|_2 = \frac{1}{2\|u\|_2^2} \frac{\mathrm{d}\|u\|_2^2}{\mathrm{d}t} = \frac{\mathrm{Re}\, \langle u, \dot u \rangle}{\langle u, u \rangle} = \frac{\mathrm{Re}\, \langle u, Au \rangle}{\langle u, u \rangle} = \frac{\mathrm{Re}\, u^* A u}{u^* u} = 0.$$

Thus it follows that $\|u(t)\|_2$ remains constant in such a system; problems of this type are referred to as conservation laws, and occur e.g. in transport equations in partial differential equations. They require special numerical methods that replicate the conservation law, keeping the norm of the solution constant. For other classes of matrices, there may be similar concerns whether we can construct methods that have a proper qualitative behavior. By contrast, if $M_2[A] < 0$, it follows that $\|u(t)\|_2 \to 0$. Thus the magnitude of the solution will decrease as $t \to \infty$, and $\|u(t)\|_2 \le \|u(0)\|_2$ for all $t \ge 0$.

We also see that definiteness can be characterized in terms of the logarithmic norms. Thus a positive definite matrix has $m_2[A] > 0$, corresponding to

$$0 < m_2[A] = \inf_{u \neq 0} \frac{\mathrm{Re}\, u^* A u}{u^* u},$$

so the quadratic form only takes values in the right half-plane. Likewise, a negative definite matrix is characterized by $M_2[A] < 0$. The upper and lower logarithmic norms provide additional quantitative information, however, as the actual values of the logarithmic norms tell us how positive (or negative) definite a matrix is; this allows us to find stability margins. Note, however, that there are matrices that are neither positive nor negative definite.

6.5 Nonlinear stability theory

The stability of nonlinear systems is considerably more complicated than for linear systems. Yet there are strong similarities, even though the stability of a solution must be considered on a case-by-case basis. Let $u$ and $v$ be two solutions to the same IVP, with initial conditions $u(0) = u_0$ and $v(0) = v_0$. Then $u - v$ satisfies

$$\frac{\mathrm{d}}{\mathrm{d}t}(u - v) = f(t, u) - f(t, v).$$

Taking the inner product with $u - v$, we find the differential inequality

$$\frac{1}{2} \frac{\mathrm{d}}{\mathrm{d}t} \|u - v\|_2^2 = \langle u - v, f(t, u) - f(t, v) \rangle \le M_2[f] \cdot \|u - v\|_2^2,$$

where the upper logarithmic norm of $f(t, \cdot)$ is

$$M_2[f] = \sup_{u \neq v} \frac{\langle u - v, f(t, u) - f(t, v) \rangle}{\|u - v\|_2^2}, \tag{6.25}$$

where $u, v$ are in the domain of $f(t, \cdot)$. In a similar way, taking the infimum instead, we obtain the lower logarithmic norm $m_2[f]$. Consequently, letting $\Delta u = u - v$ denote the difference between $u$ and $v$, we have the differential inequalities

$$m_2[f] \cdot \|\Delta u\|_2 \le \frac{\mathrm{d}}{\mathrm{d}t} \|\Delta u\|_2 \le M_2[f] \cdot \|\Delta u\|_2,$$

which are completely analogous to those we obtained in the linear case. Thus we can bound the growth rate of $\|\Delta u\|_2$ in terms of $M_2[f]$. In fact, if $f(t, \cdot)$ is Lipschitz with respect to its second argument, we have

$$L_2^2[f] = \sup_{u \neq v} \frac{\langle f(t, u) - f(t, v), f(t, u) - f(t, v) \rangle}{\|u - v\|_2^2} = \sup_{u \neq v} \frac{\|f(t, u) - f(t, v)\|_2^2}{\|u - v\|_2^2}.$$

We can easily verify that $L[\cdot]$ is an operator (semi) norm; in fact, if $f(t, u) = Au$ is a linear map, we find that $L_2[A] = \|A\|_2$. This part of the theory is easily extended

to general norms, so that we can define

$$M[f] = \lim_{h \to 0+} \frac{L[I + h f(t, \cdot)] - 1}{h}.$$

In case of the Euclidean norm, the logarithmic norm defined above is identical to the expression (6.25). No matter what norm we choose, we obtain differential inequalities and perturbation bounds of the same structure as in the linear case. This extends linear theory to nonlinear systems. Let us therefore state two results of great importance in the analysis that follows.

Theorem 6.7. Let $\dot u = f(u) + p(t)$ and $\dot v = f(v)$ with $u(0) - v(0) = 0$. Let $\|p\|_\infty = \sup_{t \ge 0} \|p(t)\|$. Assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M[f] \neq 0$ and let $\Delta u = u - v$. Then

$$\|\Delta u(t)\| \le \|p\|_\infty \, \frac{e^{tM[f]} - 1}{M[f]}; \qquad t \ge 0. \tag{6.26}$$

If $M[f] = 0$, then

$$\|\Delta u(t)\| \le \|p\|_\infty \cdot t; \qquad t \ge 0. \tag{6.27}$$

If $M[f] < 0$ it holds that

$$\|\Delta u\|_\infty \le -\frac{\|p\|_\infty}{M[f]}. \tag{6.28}$$

We note that this is a nonlinear version of the perturbation bounds obtained in the linear case. Here $v(t)$ represents the unperturbed solution, and $u(t)$ the solution obtained when a forcing perturbation term $p(t)$ drives the solution away from $v(t)$. Whether the perturbed solution grows or not is primarily governed by $M[f]$. Note that this result is valid for any norm, even though we will give preference to the Euclidean norm.

In the linear case we saw that if $M[A] < 0$, then any stationary solution is stable. In the nonlinear case we have a similar result. First we note that if $p = 0$ then $u = v$; this means that the solution to the system is unique. Now, in the theorem above, assume that $M[f] < 0$ and that $p \neq 0$ is constant. Then there is a unique stationary solution $\bar u$, satisfying

$$0 = f(\bar u) + p.$$

We shall see that we can write $\bar u = f^{-1}(-p)$, i.e., we want to show that the inverse map $f^{-1}$ exists. To this end, note that, by the Cauchy–Schwarz inequality,

$$-\|f(u) - f(v)\|_2 \cdot \|u - v\|_2 \le \langle u - v, f(u) - f(v) \rangle \le M_2[f] \cdot \|u - v\|_2^2 < 0$$

for any distinct vectors $u, v \in \mathbb{R}^d$. Simplifying, we have

$$-\|f(u) - f(v)\|_2 \le M_2[f] \cdot \|u - v\|_2 < 0.$$

This means that if $u \neq v$, then necessarily $f(u) \neq f(v)$.

Hence $f$ is one-to-one, and we may write $f(u) = x$ and $f(v) = y$, with $u = f^{-1}(x)$ and $v = f^{-1}(y)$. It follows

that

$$\frac{\|f^{-1}(x) - f^{-1}(y)\|_2}{\|x - y\|_2} \le -\frac{1}{M_2[f]}$$

holds for all $x \neq y$. By taking the supremum of the left hand side we arrive at the following theorem:

Theorem 6.8. (Uniform Monotonicity Theorem) Assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M_2[f] < 0$. Then $f$ is invertible on $\mathbb{R}^d$ with a Lipschitz inverse, and

$$L_2[f^{-1}] \le -\frac{1}{M_2[f]}. \tag{6.29}$$

The derivation above is only for inner product norms, but the result also holds for general norms. Likewise, it holds if we replace the condition $M[f] < 0$ by $m[f] > 0$. This can be compared to the linear case. Thus, if $A$ is a positive definite matrix (i.e., $m_2[A] > 0$) it has a bounded inverse, with $\|A^{-1}\|_2 \le 1/m_2[A]$. Similarly, a negative definite matrix has a bounded inverse. The uniform monotonicity theorem above generalizes those classical results to nonlinear maps.

An interesting consequence of the uniform monotonicity theorem is the following:

Corollary 6.2. Let $h > 0$ and assume that $f : \mathbb{R}^d \to \mathbb{R}^d$ with $M_2[hf] < 1$. Then $I - hf$ is invertible on $\mathbb{R}^d$ with a Lipschitz inverse, and

$$L_2[(I - hf)^{-1}] \le \frac{1}{1 - M_2[hf]}. \tag{6.30}$$

Proof. This result is obtained from the elementary properties of $M[\cdot]$ in Theorem 6.2. Thus we note that $M_2[hf - I] = M_2[hf] - 1$. By assumption $M_2[hf - I] < 0$, so $(I - hf)^{-1}$ is Lipschitz, with the constant given by (6.30).

This corollary will be seen to guarantee existence and uniqueness of solutions in implicit time-stepping methods for initial value problems.

6.6 Stability in discrete systems

There is a strong analogy between differential equations and difference equations. Corresponding to the linear and nonlinear initial value problems

$$\dot y = Ay, \qquad \dot y = f(y),$$

we have the discrete systems

$$y_{n+1} = A y_n, \qquad y_{n+1} = f(y_n).$$

Beginning with the linear system, we saw that in the continuous case, stability was governed by having all eigenvalues in the left half-plane, $\alpha[A] < 0$, or, if norms were used, by having a non-positive upper logarithmic norm, $M[A] \le 0$. In the discrete linear case, the stability conditions require that we have the eigenvalues inside the unit circle, $\rho[A] < 1$, or, in terms of norms, that $\|A\| \le 1$. Thus the role of the left half-plane is replaced by the unit circle in the discrete case; the spectral abscissa by the spectral radius; and the logarithmic norm by the matrix norm.

In the discrete, linear case, the solution is $y_n = A^n y_0$, and the solution is bounded (stable) if

$$\|A^n\| \le C$$

for all $n \ge 1$. A matrix satisfying this condition is called power bounded. Power boundedness is necessary for stability, but as it depends on the spectrum of $A$ as well as the eigenvectors, it may often be difficult to establish. On the other hand, using the submultiplicativity of the operator norm, we have

$$\|A^n\| \le \|A\|^n.$$

Therefore $A$ is power bounded if $\|A\| \le 1$, and the latter condition is often much easier to establish. Since $A$ may be power bounded even if $\|A\| > 1$, the condition $\|A\| \le 1$ is sufficient for stability, but not necessary.

In an analogous way, the solution of the continuous system is $y(t) = e^{tA} y(0)$, and the solution is bounded (stable) if

$$\|e^{tA}\| \le C.$$

This, too, is more difficult to establish than the result obtained by using norms. Thus we have seen, by using differential inequalities, that $\|e^{tA}\| \le e^{tM[A]}$ for $t \ge 0$, and it follows that the system is stable if $M[A] \le 0$. Again, the latter is a sufficient but not a necessary condition.

In the nonlinear continuous case, we need to investigate the difference of two solutions, $u$ and $v$, and whether they remain close as $t \to \infty$. This is the case if $M[f] \le 0$. The situation is similar in the discrete case. We then have

$$u_{n+1} = f(u_n), \qquad v_{n+1} = f(v_n).$$

The difference between the solutions satisfies

$$\|u_{n+1} - v_{n+1}\| = \|f(u_n) - f(v_n)\| \le L[f] \cdot \|u_n - v_n\|,$$

where $L[f]$ is the Lipschitz constant of $f$. Thus, if $L[f] < 1$, the distance between the solutions decreases. We have already seen in Theorem 2.1 that, provided that $f : D \to f(D) \subset D$ and $L[f] < 1$, there is a unique fixed point $\bar u$ solving the equation $\bar u = f(\bar u)$. This is then a stationary solution to the discrete dynamical system. In addition, we saw in (2.25) that $(I - f)^{-1}$ exists and is Lipschitz. In fact, in view of Theorem 6.2, the inverse exists under a slightly relaxed condition, $M[f] < 1$, and the error estimate (2.25) can be sharpened.

We shall derive an improved bound, but restrict the derivation to inner product norms. Note that

$$u_{n+1} - \bar u = f(u_n) - f(u_{n+1}) + f(u_{n+1}) - f(\bar u).$$

Hence

$$\begin{aligned}
\|u_{n+1} - \bar u\|_2^2 &= \langle u_{n+1} - \bar u, u_{n+1} - \bar u \rangle \\
&= \langle u_{n+1} - \bar u, f(u_n) - f(u_{n+1}) \rangle + \langle u_{n+1} - \bar u, f(u_{n+1}) - f(\bar u) \rangle \\
&\le \|u_{n+1} - \bar u\|_2 \cdot \|f(u_n) - f(u_{n+1})\|_2 + M_2[f] \cdot \|u_{n+1} - \bar u\|_2^2 \\
&\le L_2[f] \cdot \|u_{n+1} - \bar u\|_2 \cdot \|u_n - u_{n+1}\|_2 + M_2[f] \cdot \|u_{n+1} - \bar u\|_2^2.
\end{aligned}$$

Simplifying, we obtain

$$\|u_{n+1} - \bar u\|_2 \le L_2[f] \cdot \|u_n - u_{n+1}\|_2 + M_2[f] \cdot \|u_{n+1} - \bar u\|_2.$$

Note that $M_2[f] \le L_2[f]$ for all $f$, and that we have assumed $L_2[f] < 1$. Therefore $M_2[f] < 1$ and

$$\|u_{n+1} - \bar u\|_2 \le \frac{L_2[f]}{1 - M_2[f]} \cdot \|u_{n+1} - u_n\|_2.$$

This bound is always preferable to (2.25) since $M_2[f] \le L_2[f]$. In particular, if $M_2[f] \le 0$, it follows that

$$\|u_{n+1} - \bar u\|_2 \le \|u_{n+1} - u_n\|_2,$$

expressing that the error is less than the computable difference between the last two iterates. Such a bound cannot be obtained using the Lipschitz constant alone, as in (2.25). We restate the fixed point theorem in improved form, for general norms, even though only the Euclidean norm was used above:

Theorem 6.9. (Fixed point theorem) Let $D$ be a closed set and assume that $f$ is a Lipschitz continuous map satisfying $f : D \subset \mathbb{R}^d \to D$. Then there exists a fixed point

$\bar u \in D$. If, in addition, $L[f] < 1$ on $D$, then the fixed point $\bar u$ is unique, and the fixed point iteration converges to $\bar u$ for every starting value $u_0 \in D$, with the error estimate

$$\|u_{n+1} - \bar u\| \le \frac{L[f]}{1 - M[f]} \cdot \|u_{n+1} - u_n\|. \tag{6.31}$$

Thus, in connection with discrete dynamical systems, there are also links to iterative methods for solving equations; such iterations are also discrete time dynamical systems, to which stability and contractivity apply.

Returning to the stability interpretation, we have collected some elementary stability conditions in Table 6.3. While the conditions on the eigenvalues (spectrum) only apply to linear constant coefficient systems, the norm conditions apply to linear as well as nonlinear systems, but are only sufficient conditions; a system could be stable and yet fail to fulfill the norm condition.

| System type | $\dot y = Ay$ | $\dot y = f(y)$ | $y_{n+1} = A y_n$ | $y_{n+1} = f(y_n)$ |
|---|---|---|---|---|
| Spectral condition | $\alpha[A] < 0$ | | $\rho[A] < 1$ | |
| Norm condition | $M[A] \le 0$ | $M[f] \le 0$ | $\|A\| \le 1$ | $L[f] \le 1$ |

Table 6.3 Stability conditions for linear and nonlinear systems. Elementary stability conditions are given in terms of the spectrum and norms in the linear constant coefficient case. The spectrum may reach the boundary of the left half plane or the unit circle, provided that a multiple eigenvalue has a full set of eigenvectors. The norm conditions are sufficient, but not necessary

Another analogy is found in linear differential and difference equations.

Example. Consider the linear differential equation

$$\ddot y + 3\dot y + y = 0$$

with suitable initial condition. The standard procedure is to insert the ansatz $y = e^{t\lambda}$ into the equation, to obtain the characteristic equation from which the possible values of $\lambda$ are determined. Thus

$$\lambda^2 + 3\lambda + 1 = 0, \tag{6.32}$$

and it follows that $\lambda_{1,2} = (-3 \pm \sqrt{9 - 4})/2 = (-3 \pm \sqrt{5})/2$. The general solution is

$$y(t) = C_1 e^{t\lambda_1} + C_2 e^{t\lambda_2},$$

where the constants of integration $C_1$ and $C_2$ are determined by the initial condition. Obviously $y(t) = 0$ is a solution.

It is stable, since both roots $\lambda_{1,2} \in \mathbb{C}^-$.

Example. The corresponding example in difference equations is

$$y_{n+2} + 3y_{n+1} + y_n = 0.$$

This time, however, we make the ansatz $y_n = \lambda^n$, which is an exponential function in discrete time. Upon insertion, we get

$$\lambda^{n+2} + 3\lambda^{n+1} + \lambda^n = 0,$$

leading to the same characteristic equation (6.32) as before, since $\lambda \neq 0$ when we seek a nonzero solution. Naturally, we have the same roots $\lambda_{1,2}$. The general solution is

$$y_n = C_1 \lambda_1^n + C_2 \lambda_2^n,$$

where $C_1$ and $C_2$ are determined by the initial conditions. The zero solution $y_n = 0$ is now unstable, since one root is outside the unit circle:

$$|\lambda_1| = \left| \frac{-3 + \sqrt{5}}{2} \right| \approx 0.382 < 1, \qquad |\lambda_2| = \left| \frac{-3 - \sqrt{5}}{2} \right| \approx 2.618 > 1.$$

Thus, in linear differential and difference equations, stability is once more determined by having the roots in the left half-plane, or in the unit circle. This is in fact the same result as we saw before:

Example. By putting $\dot y = z$ in the second order problem above, it is transformed into a system of first order equations. Thus

$$\dot y = z, \qquad \dot z = -y - 3z,$$

with initial conditions $y(0) = y_0$ and $z(0) = \dot y(0) = \dot y_0$. In matrix vector form we have

$$\frac{\mathrm{d}}{\mathrm{d}t} \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -1 & -3 \end{pmatrix} \begin{pmatrix} y \\ z \end{pmatrix}.$$

The matrix thus obtained is referred to as the companion matrix of the differential equation, and for stability its eigenvalues must be located in the left half plane. To determine its eigenvalues $\lambda_k[A]$, we need the characteristic equation,

$$\det(A - \lambda I) = \det \begin{pmatrix} -\lambda & 1 \\ -1 & -3 - \lambda \end{pmatrix} = \lambda(\lambda + 3) + 1 = \lambda^2 + 3\lambda + 1 := 0.$$

This is the same characteristic equation as before, showing that spectral stability conditions are identical to those derived directly for higher order differential or difference equations.
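The contrast between the two stability criteria can be checked numerically. The sketch below computes the roots of the characteristic equation (6.32), applies the left half-plane test and the unit circle test, and iterates the difference equation to show the growth. The starting values $y_0 = y_1 = 1$ and the number of iterations are arbitrary assumptions made for illustration.

```python
import math

# Roots of the characteristic equation lambda^2 + 3*lambda + 1 = 0, see (6.32).
disc = 3.0 ** 2 - 4.0
lam1 = (-3.0 + math.sqrt(disc)) / 2.0
lam2 = (-3.0 - math.sqrt(disc)) / 2.0

# Continuous time: stable if every root has a negative real part.
stable_ode = max(lam1, lam2) < 0.0
# Discrete time: stable if every root lies strictly inside the unit circle.
stable_recurrence = max(abs(lam1), abs(lam2)) < 1.0

# Iterate y_{n+2} = -3*y_{n+1} - y_n from assumed starting values y_0 = y_1 = 1.
y_prev, y_curr = 1.0, 1.0
for _ in range(20):
    y_prev, y_curr = y_curr, -3.0 * y_curr - y_prev

print("roots:", lam1, lam2)
print("differential equation stable:", stable_ode)
print("difference equation stable:", stable_recurrence)
print("|y_21| =", abs(y_curr))
```

Both roots are real and negative, so the differential equation passes its test, while the root near $-2.618$ makes the iterates of the difference equation grow geometrically.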

Chapter 7
The Explicit Euler method

The first discrete method for solving initial value problems was devised by Euler in the mid 18th century. One of the greatest mathematicians of all time, Euler realized that many of the emerging problems of analysis could only be solved approximately. In the problem $\dot y = f(t, y)$, the issue is the derivative. Thus we have already noted that derivatives need to be approximated by finite differences in order to construct computable approximations.

In the differential equation

$$\dot y = f(t, y), \tag{7.1}$$

we start from the simplest approximation. We will compute a sequence of approximations, $y_n \approx y(t_n)$, such that

$$\frac{y_{n+1} - y_n}{\Delta t} = f(t_n, y_n). \tag{7.2}$$

This follows the pattern from the standard definition of the derivative. Since

$$\lim_{\Delta t \to 0} \frac{y(t_n + \Delta t) - y(t_n)}{\Delta t} = \dot y(t_n),$$

the finite difference approximation (7.2) is obtained by replacing the derivative in (7.1), using a finite time step $\Delta t > 0$. From (7.2) we create a recursion,

$$y_{n+1} = y_n + \Delta t \cdot f(t_n, y_n), \tag{7.3}$$

starting from the initial value $y_0 = y(0)$. This is the Explicit Euler method. It is the original time-stepping method, and all other types of time-stepping method constructions include the explicit Euler method as the simplest case.
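The recursion (7.3) translates almost line by line into code. The following is a minimal sketch for a scalar problem with a constant step size; the function name `explicit_euler`, its argument list, and the test problem $\dot y = -y$, $y(0) = 1$ are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def explicit_euler(f, t0, y0, T, N):
    """Take N steps of y_{n+1} = y_n + dt*f(t_n, y_n) on [t0, T].

    Returns the lists of time points t_n and approximations y_n.
    """
    dt = (T - t0) / N
    ts, ys = [t0], [y0]
    for n in range(N):
        t_n, y_n = ts[-1], ys[-1]
        ys.append(y_n + dt * f(t_n, y_n))   # one explicit Euler step
        ts.append(t0 + (n + 1) * dt)        # avoids accumulating rounding in t
    return ts, ys

# Assumed test problem: y' = -y, y(0) = 1, with exact solution exp(-t).
ts, ys = explicit_euler(lambda t, y: -y, 0.0, 1.0, 1.0, 10)
print("Euler:", ys[-1], " exact:", math.exp(-1.0))
```

For this linear problem each step multiplies the solution by $1 - \Delta t$, so the value at $t = 1$ after ten steps is $0.9^{10}$, to be compared with $e^{-1}$.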

Fig. 7.1 Leonhard Euler (1707–1783). Portrait by J.E. Handmann (1753), Kunstmuseum Basel

7.1 Convergence

The Euler recursion implies that we sample the vector field $f(t, y)$ at the current point of approximation, i.e., at $(t_n, y_n)$, and then take one step of size $\Delta t$ in the direction of the tangent. Naturally, the exact solution will not satisfy this recursion. As before, we let $y(t_n)$ denote the exact solution of the differential equation at time $t_n$. Inserting the exact solution into the recursion (7.3), we obtain

$$y(t_{n+1}) = y(t_n) + \Delta t \cdot f(t_n, y(t_n)) - r_n, \tag{7.4}$$

where the local residual $r_n \neq 0$ signifies that the exact solution does not satisfy the recursion.

The first question is, how large is $r_n$? Let us assume that the exact solution $y$ is twice continuously differentiable on $[0, T]$. Expanding in a Taylor series, we obtain

$$y(t_{n+1}) = y(t_n) + \Delta t \, \dot y(t_n) + \frac{\Delta t^2}{2} \ddot y(\theta_n),$$

for some $\theta_n \in [t_n, t_{n+1}]$. Since $\dot y(t_n) = f(t_n, y(t_n))$, we can compare to (7.4) and conclude that

$$r_n = -\frac{\Delta t^2}{2} \ddot y(\theta_n). \tag{7.5}$$

Hence $\|r_n\| = O(\Delta t^2)$ as $\Delta t \to 0$, provided that $y \in C^2[0, T]$.

Lemma 7.1. Let $\{y(t_n)\}$ denote the exact solution to (7.1) at the time points $\{t_n\}$. Assume that $y \in C^2[0, T]$, and that $f$ is Lipschitz. When the exact solution is inserted into the explicit Euler recursion (7.3), the local residual is

$$\|r_n\| \le \frac{\Delta t^2}{2} \max_{t \in [0, T]} \|\ddot y(t)\|. \tag{7.6}$$

This means that the difference equation is consistent with the differential equation, and that the approximation improves as $\Delta t \to 0$. We will return to this notion later; for the time being, we say that the method has order of consistency $p$, if

$$\|r_n\| = O(\Delta t^{p+1})$$

as $\Delta t \to 0$ (or, equivalently, $N \to \infty$).

Now, because the exact solution does not satisfy the recursion, it follows that the numerical solution will deviate from the exact solution. We introduce the following definition.

Definition 7.1. Let the sequence $\{y_n\}$ denote the numerical solution generated by the Euler method (7.3) and let $\{y(t_n)\}$ denote the exact solution to (7.1) at the time points $\{t_n\}$. Then the difference

$$e_n = y_n - y(t_n)$$

is called the global error at time $t_n$.

The next question is, therefore, how large is $e_n$? Now, the objective of all time stepping methods is to generate a numerical solution $\{y_n\}$ whose global error can be bounded. In fact, we want more: the method must be convergent. This means that given any prescribed error tolerance $\varepsilon$, we must be able to choose the step size $\Delta t$ accordingly, so that the numerical solution attains the prescribed accuracy $\varepsilon$.

Let us see how this is done. The local residual is related to the global error. Subtracting (7.4) from (7.3), we get

$$e_{n+1} = e_n + \Delta t \cdot f(t_n, y(t_n) + e_n) - \Delta t \cdot f(t_n, y(t_n)) + r_n. \tag{7.7}$$

This is a recursion for the global error, where the local residual is the forcing function. It should be noted that the terminology varies in the literature, and that $r_n$ is often called the local truncation error, or the local error. The reason for this discrepancy will become clear later, in connection with implicit methods.

Taking norms in (7.7), using the triangle inequality and the Lipschitz condition for $f$, yields

$$\|e_{n+1}\| \le \|e_n\| + \Delta t \, L[f] \cdot \|e_n\| + \|r_n\| = (1 + \Delta t \, L[f]) \cdot \|e_n\| + \|r_n\|. \tag{7.8}$$

This is a difference inequality, and from it we are going to derive a bound on the global error. To this end, we need the following lemma.

Lemma 7.2. Let the sequence $\{u_n\}_0^\infty$ satisfy

$$u_{n+1} \le (1 + \mu) u_n + v_n,$$

with $u_0 = 0$, $v_n > 0$ and $\mu > -1$, but $\mu \neq 0$. Then

$$u_n \le \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{\mu}. \tag{7.9}$$

In case $\mu = 0$, it holds that $u_n \le n \cdot \max_{k<n} v_k$.

Proof. The case $\mu = 0$ is trivial. For $\mu \neq 0$, we first prove by induction that $u_n \le U_n$, where

$$U_n = \sum_{k=0}^{n-1} (1 + \mu)^{n-k-1} v_k. \tag{7.10}$$

This obviously holds for $n = 1$. Assume that $u_n \le U_n$ holds for a given $n$. Then

$$u_{n+1} \le (1 + \mu) u_n + v_n \le (1 + \mu) U_n + v_n = \sum_{k=0}^{n} (1 + \mu)^{n-k} v_k = U_{n+1},$$

so $u_n \le U_n$ holds for all $n \ge 1$. From (7.10), it now follows that

$$U_n \le \max_{k<n} v_k \sum_{k=0}^{n-1} e^{(n-k-1)\mu} = \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{e^{\mu} - 1} \le \max_{k<n} v_k \cdot \frac{e^{n\mu} - 1}{\mu},$$

since $1 + \mu \le e^{\mu}$ and the sum is a geometric series. The proof is complete.

We are now ready to construct a global error bound for the explicit Euler method, by applying (7.9) to the error recursion (7.8). Thus we obtain the following result.

Theorem 7.1. Let the initial value problem $\dot y = f(t, y)$ be given with $y(0) = y_0$ on a compact interval $[0, T]$. Assume that $L[f] < \infty$ for $y \in \mathbb{R}^d$, and that the solution $y \in C^2[0, T]$. When this problem is solved, taking $N$ steps with the explicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by

$$\|y_N - y(T)\| \le \Delta t \cdot \max_{t \in [0, T]} \frac{\|\ddot y\|}{2} \cdot \frac{e^{T L[f]} - 1}{L[f]}. \tag{7.11}$$

Proof. The result follows immediately from identifying $\mu = \Delta t \, L[f]$ in (7.8), and $v_k = \|r_k\| = \Delta t^2 \|\ddot y(\theta_k)\|/2$ in (7.5), and noting that $n \cdot \Delta t = t_n$ in general, and $N \cdot \Delta t = T$ in particular.
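To see how the bound of Theorem 7.1 compares with an actual computation, the sketch below evaluates both for one scalar problem. The problem $\dot y = -y$, $y(0) = 1$ on $[0, 1]$ is an assumption chosen for illustration, because its data are known in closed form: $L[f] = 1$ and $\max |\ddot y| = 1$. The function name `euler_error_and_bound` is hypothetical.

```python
import math

def euler_error_and_bound(N, T=1.0):
    """Explicit Euler for y' = -y, y(0) = 1.

    Returns the actual global error at t = T and the right hand side of (7.11).
    """
    dt = T / N
    y = 1.0
    for _ in range(N):
        y += dt * (-y)                      # explicit Euler step
    error = abs(y - math.exp(-T))
    lipschitz = 1.0                         # L[f] for f(t, y) = -y
    max_ydd = 1.0                           # max |y''| on [0, T], as y'' = exp(-t)
    bound = dt * 0.5 * max_ydd * (math.exp(T * lipschitz) - 1.0) / lipschitz
    return error, bound

for N in (10, 20, 40):
    err, bnd = euler_error_and_bound(N)
    print(N, err, bnd)
```

Both columns shrink in proportion to $\Delta t$ as $N$ doubles, and for this problem the observed error stays below the bound by a factor of roughly four.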

This classical result proves that the explicit Euler method is convergent, i.e., by choosing the step size $\Delta t = T/N$ small enough, the method can generate approximations that are arbitrarily accurate. Thus the structure of the bound (7.11) is

$$\|y_N - y(T)\| \le C(T) \cdot \Delta t. \tag{7.12}$$

An alternative formulation of (7.12) uses $\Delta t = T/N$, to express the bound as

$$\|y_N - y(T)\| \le \frac{K(T)}{N}, \tag{7.13}$$

with $K(T) = T \cdot C(T)$.

Definition 7.2. If a time stepping method generates a sequence of approximations $y_n \approx y(t_n)$ at the time points $t_n = n \cdot \Delta t$ with $\Delta t = T/N$, and the exact solution $y(t)$ is sufficiently smooth, the method is said to be of convergence order $p$, if there is a constant $C$ and an $N^*$ such that

$$\|y_N - y(T)\| \le C \cdot N^{-p}$$

for all $N > N^*$.

Here we note that, by Theorem 7.1, the explicit Euler method is of convergence order $p = 1$. This implies that the global error is $\|e_N\| = O(\Delta t)$, and that given any accuracy requirement $\varepsilon$, we can pick $\Delta t$ (or $N$) such that $\|e_N\| \le \varepsilon$.

Example. The theory is easily illustrated on a simple test problem. We choose

$$\dot y = \lambda (y - g(t)) + \dot g(t); \qquad y(0) = y_0. \tag{7.14}$$

The exact solution is

$$y(t) = e^{\lambda t} (y_0 - g(0)) + g(t),$$

and is composed of a homogeneous solution (the exponential) and the particular solution $g(t)$. We will choose $g(t) = \sin \pi t$ and $y_0 = 0$ so that the exact solution is $y(t) = \sin \pi t$, and we will solve the problem on $[0, 1]$. We deliberately choose $N = 5$ and $N = 10$ steps, to obtain large errors, visible to the naked eye. Such a computational setup is obviously exaggerated for the purpose of creating insight. The results are seen in Figure 7.2, where we have taken $\lambda = -0.2$. In spite of using so few steps, we still observe the expected behavior, with a global error $O(\Delta t)$ and local residuals $O(\Delta t^2)$. This corroborates the first order convergence.

The convergence proof for the explicit Euler method is key to a broader understanding of time stepping methods.

We shall elaborate on the numerical tests by varying the parameters in the test problem, and also compare the results to those we obtain for the implicit Euler method, to be studied next. However, before we go on, it is important to discuss the interpretation of the convergence proof, as well as some important, critical remarks on what has been achieved so far.
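The experiment with the test problem (7.14) is simple to reproduce. The sketch below takes $g(t) = \sin \pi t$, $y_0 = 0$ and $N = 5$ and $N = 10$ steps on $[0, 1]$, and estimates the observed convergence order from the two endpoint errors. The value `LAM = -0.2` is an assumption: the sign of $\lambda$ is not certain from the text, and the negative sign is chosen here because it reproduces endpoint errors near 0.6 and 0.3.

```python
import math

LAM = -0.2   # assumed value of lambda in (7.14); the sign is an assumption

def f(t, y):
    """Right hand side of (7.14) with g(t) = sin(pi*t)."""
    return LAM * (y - math.sin(math.pi * t)) + math.pi * math.cos(math.pi * t)

def global_error(N, T=1.0, y0=0.0):
    """Explicit Euler with N steps on [0, T]; return the global error at t = T."""
    dt = T / N
    y = y0
    for n in range(N):
        y += dt * f(n * dt, y)
    return abs(y - math.sin(math.pi * T))

e5, e10 = global_error(5), global_error(10)
observed_order = math.log(e5 / e10) / math.log(2.0)
print("N = 5: ", e5)
print("N = 10:", e10)
print("observed order:", observed_order)
```

Halving the step size roughly halves the endpoint error, so the estimated order comes out close to 1, in line with $p = 1$ for the explicit Euler method.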

92 32 7 The Explicit Euler method t Fig. 7.2 Demonstration of the Explicit Euler method. The simple test problem (7.14) is solved on [0,1] with N = 5 steps (top) and then N = 10 steps (bottom). The exact solution y(t) = sinπt is indicated by emphasized blue curve, while the explicit Euler method generates the red, polygonal, discrete solution. Each step is taken in the tangential directions of local solutions to the differential equation (green intermediate curves). The global error at the endpoint is approximately 0.6 for t = 1/5 and half as large, 0.3, for t = 1/10, in agreement with an O( t) global error. The local residuals approximately correspond to the distance between the green curves. Since there are twice as many green curves in the lower graph, with only half the distance between them, the local residual is O( t 2 ) Remark 1 (Convergence) The notion of a convergent method applies to a general class of Lipschitz continuous problems whose solutions are smooth enough, in this case with y C 2 [0,T ]. Thus convergence is a nominal method property, and the convergence order is the best generic performance the method will exhibit. However, there are exceptions. For example, if the solution y is a polynomial of degree 1 (a straight line), then ÿ 0 and the local residual vanishes. The explicit Euler method then generates the exact solution. Conversely, if a given problem fails to satisfy the Lipschitz condition, or if the solution is not in C 2 [0,T ], the convergence order may drop, or the method may fail altogether. In practice, one rarely verifies the theoretical assumptions, and occasional failures are encountered. Using a convergent method does not guarantee unconditional success. Finally, convergence is a notion from analysis, requiring a dense vector space, such as R d. In computer arithmetic, there is no true convergence, since machine representable numbers are few and far between. 
Even so, it is rare to experience difficulties due to roundoff, and in most cases the nominal convergence order is observed. In the explicit Euler case, this means that if $\Delta t$ is reduced by a factor of 10, we will typically observe a global error that is 10 times smaller.
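The first-order behavior described in Remark 1 can be checked numerically. The following is a minimal sketch, not the author's test code: it assumes the test problem has the form $\dot y = \lambda(y - \sin \pi t) + \pi \cos \pi t$, $y(0) = 0$, with exact solution $y(t) = \sin \pi t$, and the value $\lambda = -0.2$ as well as the helper name `explicit_euler` are illustrative choices.

```python
import math

def explicit_euler(f, t0, y0, T, N):
    """Take N explicit Euler steps of size dt = (T - t0)/N and return y_N."""
    dt = (T - t0) / N
    t, y = t0, y0
    for _ in range(N):
        y = y + dt * f(t, y)
        t = t + dt
    return y

# Assumed test problem: y' = lam*(y - sin(pi t)) + pi*cos(pi t), y(0) = 0,
# whose exact solution is y(t) = sin(pi t). lam = -0.2 is an illustrative choice.
lam = -0.2

def f(t, y):
    return lam * (y - math.sin(math.pi * t)) + math.pi * math.cos(math.pi * t)

errors = {}
for N in (10, 100, 1000):
    y_N = explicit_euler(f, 0.0, 0.0, 1.0, N)
    errors[N] = abs(y_N - math.sin(math.pi * 1.0))
    print(f"N = {N:5d}   dt = {1.0 / N:.4f}   global error = {errors[N]:.3e}")

# Each tenfold reduction of dt should reduce the error by roughly a factor 10.
ratio_1 = errors[10] / errors[100]
ratio_2 = errors[100] / errors[1000]
print(f"error ratios: {ratio_1:.2f}, {ratio_2:.2f}")
```

Both ratios should come out close to 10, which is what a global error of $O(\Delta t)$ predicts.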

Remark 2 (Consistency and stability imply convergence) The error bound has the form
\[ \|y_N - y(T)\| \le C(T)\,\Delta t, \]
and it has two essential components. First, the single power of $\Delta t$ is due to consistency. Second, $C(T)$ must be bounded; this is referred to as stability, and implies that the bound depends continuously on $\Delta t$. Here $C(T)$ is sometimes referred to as the stability constant. Let us have a closer look at where these concepts were employed. We used the inequality
\[ u_n \le \sum_{k=0}^{n-1} (1+\mu)^{n-k-1} v_k, \]
and we need $u_N \to 0$ as $N \to \infty$. We then need $(1+\mu)^N$ to be bounded. This appears to require $\mu \le 0$, but in fact we have a little bit of leeway. In our case
\[ (1+\mu)^N = (1 + \Delta t\, L[f])^N = \left(1 + \frac{T L[f]}{N}\right)^N \le e^{T L[f]}, \]
so the exponential term is bounded for fixed $T$, even though $\mu > 0$. That is where stability entered. Without stability, $C(T)$ would not have been bounded. Thus stability is necessary. Second, for the error to go to zero, we needed $v_k \to 0$. This is where consistency entered. Above we saw that $v_k \sim O(\Delta t^2) \to 0$, since the local residual is $r_n \sim O(\Delta t^2)$. Without consistency, the upper bound of the global error $\|y_N - y(T)\| \le C(T)\,\Delta t$ would not have contained the factor $\Delta t$. Thus consistency is necessary, too.

Remark 3 (What is a stability constant?) The convergence proof above is a mathematical proof. It is sharp in the sense that equality could be attained, but it is far too weak for numerical purposes. Consider a plain initial value problem
\[ \dot y = -10y; \quad y(0) = 1, \]
to be solved on $[0,T]$ with $T = 10$. The exact solution is $y(t) = e^{-10t}$, with $\ddot y = 100y$, implying that $\max |\ddot y| = 100$. The Lipschitz constant is $L[f] = 10$. Inserting these data into (8.7), we get
\[ \|y_N - y(T)\| \le \Delta t \max_{t\in[0,T]} \frac{\|\ddot y\|}{2} \cdot \frac{e^{T L[f]} - 1}{L[f]} = \Delta t \cdot 50 \cdot \frac{e^{100} - 1}{10} \approx 1.3 \cdot 10^{44}\,\Delta t. \]
Thus $C(T)$ is stupendous; such constants do not belong in numerical analysis. In real computations, stability constants must have a reasonable magnitude, keeping in mind that computations are carried out in finite precision, and need to finish in finite time.
Fortunately, the real error is much smaller. Suppose we take $\Delta t = 0.01$ and $N = 10^3$ steps to complete the integration from $t = 0$ to $T = 10$. Because $y(t)$ is convex, it is easily seen that $0 < y_n < y(t_n)$ during the entire integration. Therefore the error at $T$ is certainly less than $y(T)$. Now, $y(T) = e^{-10T} = e^{-100} \approx 3.7 \cdot 10^{-44}$, so $\|y_N - y(T)\| \le 3.7 \cdot 10^{-44}$. Thus the error bound overestimates the error by 88 orders of magnitude. This is unacceptable, especially when the differential equation poses no special problems at all.
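The gap between the classical bound and the actual error in Remark 3 can be reproduced in a few lines. This is a sketch under the data stated in the remark ($\lambda = -10$, $T = 10$, $\Delta t = 0.01$, $L[f] = 10$, $\max|\ddot y| = 100$); the variable names are illustrative.

```python
import math

lam, T, dt = -10.0, 10.0, 0.01
N = round(T / dt)                      # 1000 steps

y = 1.0
for _ in range(N):
    y += dt * lam * y                  # explicit Euler step for y' = lam*y

exact = math.exp(lam * T)              # y(T) = e^{-100}
actual_error = abs(y - exact)

# Classical stability constant C(T) = max|y''|/2 * (e^{T L} - 1)/L, with
# L[f] = 10 and max|y''| = 100 taken from the remark.
L = 10.0
C_T = 100.0 / 2.0 * (math.exp(T * L) - 1.0) / L
bound = C_T * dt

print(f"y_N          = {y:.3e}")
print(f"y(T)         = {exact:.3e}")
print(f"actual error = {actual_error:.3e}")
print(f"C(T)         = {C_T:.3e}")
print(f"bound        = {bound:.3e}")
print(f"log10(C(T)/error)   = {math.log10(C_T / actual_error):.1f}")
print(f"log10(bound/error)  = {math.log10(bound / actual_error):.1f}")
```

The computed $y_N$ stays between 0 and $y(T)$, as the convexity argument predicts. The overestimate is between roughly 86 and 88 orders of magnitude, depending on whether the factor $\Delta t = 0.01$ is counted with the constant.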

To summarize the analysis above, we note that stability and consistency are two distinct necessary conditions for convergence. Later on, we will simplify the convergence proofs, reducing them to a matter of verifying the order of consistency, and stability. Proving consistency is usually easy, only requiring a Taylor series expansion. Stability is more difficult, but once established, it holds that the order of convergence equals the order of consistency. Bridging the gap from consistency to convergence, stability plays the key role. It will turn up in many different forms depending on the problem type and method construction. Because the pattern remains the same, we will discover that the Lax Principle is the most important idea in numerical analysis: consistency and stability imply convergence.

7.2 Alternative bounds

The convergence proof derived above is only a mathematical proof, and we need to find better estimates. The weakness of the derivation above stems from two things: a reckless use of the triangle inequality, and a consequently damaging use of the Lipschitz constant. As a result, we obtained a stability constant $C(T)$ which is so large as to suggest that accurate numerical solution of differential equations is impossible over reasonably long time intervals. However, this is not the case. By using logarithmic norms, we will be able to derive much improved error bounds that support the observation from computational practice, that most initial value problems can be solved to a very high accuracy.

Going back to the recursion (7.7) for the global error, we had
\[ e_{n+1} = e_n + \Delta t\, f(t_n, y(t_n) + e_n) - \Delta t\, f(t_n, y(t_n)) + r_n. \]
Again, we take norms and use the triangle inequality, without splitting the first three terms. Thus we get
\[ \|e_{n+1}\| \le L[I + \Delta t f]\,\|e_n\| + \|r_n\| \approx (1 + \Delta t\, M[f])\,\|e_n\| + \|r_n\|, \]
where the approximation is derived from $L[I + \Delta t f] = 1 + \Delta t\, M[f] + O(\Delta t^2)$, in accordance with Definition 6.2.
Dropping the $O(\Delta t^2)$ term, we have effectively just replaced the Lipschitz constant $L[f]$ in (7.8) by the logarithmic norm $M[f]$. Otherwise, everything remains the same. The convergence proof now only offers an approximate global error bound, but it is much improved due to the fact that $M[f] \le L[f]$.

Theorem 7.2. Let the initial value problem $\dot y = f(t,y)$ be given with $y(0) = y_0$ on a compact interval $[0,T]$. Assume that $L[f] < \infty$ for $y \in \mathbb{R}^d$, and that the solution

$y \in C^2[0,T]$. When this problem is solved, taking $N$ steps with the explicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by
\[ \|y_N - y(T)\| \lesssim \Delta t \max_{t\in[0,T]} \frac{\|\ddot y\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]}. \tag{7.15} \]

Remark 1 (Perturbation bound) We note that the error bound (7.15) has the same structure as the perturbation bound (6.26). While the latter was derived for the differential equation, the global error bound was derived for the discretization. The shared structure of the bounds shows that the error accumulation in the discrete recursion is similar to the effect of a continuous perturbation $p(t)$ in the differential equation.

Remark 2 (The stability constant, revisited) Let us again consider the problem
\[ \dot y = -10y; \quad y(0) = 1, \]
to be solved on $[0,T]$ with $T = 10$. The exact solution is $y(t) = e^{-10t}$, with $\ddot y = 100y$, implying that $\max |\ddot y| = 100$. While the Lipschitz constant is $L[f] = 10$, the logarithmic norm is $M[f] = -10$. This gives a completely different error bound. Inserting the data into (7.15), we get
\[ \|y_N - y(T)\| \lesssim \Delta t \max_{t\in[0,T]} \frac{\|\ddot y\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]} = \Delta t \cdot 50 \cdot \frac{e^{-100} - 1}{-10} \approx 5\,\Delta t. \]
Thus the stability constant $C(T) \approx 5$ is moderate in size. The error bound is still an overestimate, but the new error bound shows that the numerical method can achieve realistic accuracy.

Even with the logarithmic norm, the error bound usually overestimates the error. However, the main difference is that the Lipschitz constant, being nonnegative, cannot pick up any information on the stability of the solutions of the equation. By contrast, the logarithmic norm distinguishes between forward and reverse time integration, and therefore contains some information about stability. This is necessary in order to have realistic error bounds.
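To see how differently the two constants behave as $T$ grows, one can evaluate the factor $C(T) = \max|\ddot y|/2 \cdot (e^{T\rho} - 1)/\rho$ for $\rho = L[f] = 10$ and $\rho = M[f] = -10$. The sketch below uses the data of Remark 2; the function name `stability_constant` is an illustrative choice.

```python
import math

def stability_constant(T, ydd_max, rate):
    """C(T) = ydd_max/2 * (exp(T*rate) - 1)/rate, for rate != 0."""
    return ydd_max / 2.0 * math.expm1(T * rate) / rate

ydd_max = 100.0                         # max |y''| for y(t) = exp(-10 t)
for T in (1.0, 10.0, 50.0):
    C_lip = stability_constant(T, ydd_max, rate=10.0)    # rate = L[f] = 10
    C_log = stability_constant(T, ydd_max, rate=-10.0)   # rate = M[f] = -10
    print(f"T = {T:5.1f}   Lipschitz C(T) = {C_lip:.3e}   log-norm C(T) = {C_log:.5f}")
```

With the logarithmic norm the constant never exceeds 5, no matter how large $T$ is, whereas the Lipschitz-based constant grows like $e^{10T}$.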
7.3 The Lipschitz assumption

While a much better error bound could be obtained when the logarithmic norm replaced the Lipschitz constant in the derivation, the classical assumption for establishing existence of solutions on some compact interval $[0,T]$ is still that the vector field is Lipschitz with respect to its second argument, i.e., $L[f(t,\cdot)] < \infty$. Noting that it always holds that $M[f] \le L[f]$ (see the basic properties in Theorem 6.2, which apply also to nonlinear maps), the Lipschitz assumption automatically

implies that $M[f]$ exists and is bounded. Therefore it is always possible to work with the logarithmic norm instead of the Lipschitz constant in the estimates, although this occasionally leads to approximate upper bounds. More importantly, it may happen that $M[f] \ll L[f]$, implying that vastly improved error bounds can be obtained with the logarithmic norm. This is of particular importance in connection with stiff differential equations, which will be studied later. Since the error bounds typically contain a factor $e^{T L[f]}$, which can be replaced by $e^{T M[f]}$ (see Theorem 6.7), the difference is enormous. In case $T$ is also large, the classical bound, based on the Lipschitz constant, loses its computational relevance altogether. Bounds and estimates have to be as tight as possible.

Unlike the Lipschitz constant, the logarithmic norm may be negative. Thus, in cases where $M[f] < 0$, we can obtain uniform error bounds, also when $T \to \infty$, which is otherwise impossible in the classical setting. In fact, stiff problems have $T L[f] \gg 1$ but $T M[f]$ small or even negative. Such problems cannot be dealt with in a meaningful way without using the logarithmic norm. Typical examples are found in parabolic partial differential equations, such as in the diffusion equation. For this reason, we shall in the sequel start our derivations from the (weaker) assumption that $M[f(t,\cdot)] < \infty$, keeping in mind that this is compatible with classical existence theory for ordinary differential equations, no matter how large $L[f(t,\cdot)]$ is.
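For a linear vector field $f(y) = Ay$ the two quantities can be computed directly. The sketch below assumes the Euclidean norm, for which the standard formulas are $L[A] = \|A\|_2$ and $M_2[A] = \lambda_{\max}\big((A + A^{\mathrm T})/2\big)$; these formulas are not restated in this section, and the $2 \times 2$ matrix and the helper names are illustrative.

```python
import math

def log_norm_2(A):
    """Euclidean logarithmic norm of a 2x2 matrix: the largest eigenvalue
    of its symmetric part (A + A^T)/2."""
    (a, b), (c, d) = A
    s = 0.5 * (b + c)                    # off-diagonal entry of the symmetric part
    return 0.5 * (a + d) + math.hypot(0.5 * (a - d), s)

def spectral_norm(A):
    """Euclidean operator norm of a 2x2 matrix: the square root of the
    largest eigenvalue of A^T A."""
    (a, b), (c, d) = A
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    return math.sqrt(0.5 * (p + r) + math.hypot(0.5 * (p - r), q))

# Illustrative matrix with one fast and one slow decaying mode.
A = [[-1000.0, 1.0],
     [0.0, -1.0]]

M2 = log_norm_2(A)
L2 = spectral_norm(A)
T = 1.0
print(f"M_2[A] = {M2:.5f}   ->  exponent T*M = {T * M2:.3f}")
print(f"L[A]   = {L2:.5f}   ->  exponent T*L = {T * L2:.3f}")
```

Here $M_2[A]$ is close to $-1$ while $L[A]$ is slightly above 1000, so the factor $e^{T M[f]}$ decays while $e^{T L[f]}$ is astronomically large already for $T = 1$.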

Chapter 8 The Implicit Euler Method

The explicit Euler method is
\[ \frac{y_{n+1} - y_n}{\Delta t} = f(t_n, y_n), \]
and is obtained from the finite difference approximation to the derivative,
\[ \frac{y(t_n + \Delta t) - y(t_n)}{\Delta t} \approx \dot y(t_n). \]
In the Implicit Euler Method, we instead interpret the difference quotient as an approximation to $\dot y(t_{n+1})$, which is the right endpoint of the interval $[t_n, t_{n+1}]$ rather than the left. This leads to the discretization
\[ \frac{y_{n+1} - y_n}{\Delta t} = f(t_{n+1}, y_{n+1}). \]
The method is implicit, because given the point $(t_n, y_n)$, we cannot compute $y_{n+1}$ directly. To use this method, we have to solve the (nonlinear) equation
\[ y_{n+1} - \Delta t\, f(t_{n+1}, y_{n+1}) = y_n \tag{8.1} \]
on every single step. This leads to a number of questions:

1. Under what conditions can this equation be solved?
2. Which method should be used to solve this equation?
3. Can the added cost of equation solving be justified?

Let us start with existence of solutions. In ordinary differential equations we generally assume that $f$ is Lipschitz with respect to the second argument. For simplicity, let us assume that $L[f(t,\cdot)] \le L[f] < \infty$ on all of $\mathbb{R}^d$. This implies that $M[f] \le L[f] < \infty$. By Corollary 6.2 we have the following result:
\[ M[\Delta t f] < 1 \quad \Rightarrow \quad L[(I - \Delta t f)^{-1}] \le \frac{1}{1 - \Delta t\, M[f]}. \]
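To make equation (8.1) concrete, here is a minimal sketch of a single implicit Euler step for a scalar problem, solving (8.1) with Newton's method. The function name `implicit_euler_step`, the user-supplied derivative `dfdy`, the stopping rule and the example $f(t,y) = -y^3$ are all assumptions made for illustration, not part of the text.

```python
def implicit_euler_step(f, dfdy, t_next, y_n, dt, tol=1e-12, max_iter=20):
    """Solve y - dt*f(t_next, y) = y_n for y by Newton's method (scalar case)."""
    y = y_n                                    # initial guess: previous value
    for _ in range(max_iter):
        residual = y - dt * f(t_next, y) - y_n
        slope = 1.0 - dt * dfdy(t_next, y)     # derivative of the residual w.r.t. y
        step = residual / slope
        y -= step
        if abs(step) <= tol * (1.0 + abs(y)):
            return y
    raise RuntimeError("Newton iteration did not converge")

# Illustrative nonlinear problem y' = -y^3, whose derivative df/dy = -3y^2 <= 0.
def f(t, y):
    return -y ** 3

def dfdy(t, y):
    return -3.0 * y ** 2

y_small = implicit_euler_step(f, dfdy, 0.5, 1.0, 0.5)      # moderate step
y_large = implicit_euler_step(f, dfdy, 100.0, 1.0, 100.0)  # very large step
print(y_small, y_large)
```

For the large step $\Delta t = 100$ the solution of $y + 100y^3 = 1$ is $y = 0.2$. At that point $|\Delta t\, \partial f/\partial y| = 12$, so a plain fixed point iteration $y \leftarrow y_n + \Delta t f(y)$ would not contract, while the Newton iteration still converges.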

Throughout the analysis, we shall assume that $M[\Delta t f] < 1$. We note that this is a considerably weaker assumption than assuming $L[\Delta t f] < 1$, in which case the fixed point theorem would apply. Thus, a solution to (8.1) exists for (possibly) much larger step sizes $\Delta t$ than the fixed point theorem would indicate.

This brings us to the second question. If we were to use step sizes $\Delta t$ such that $M[\Delta t f] < 1$ but $L[\Delta t f] \gg 1$, then obviously we cannot solve the equation by fixed point iteration. We will see that these conditions are typical. For this reason, we need to use Newton's method, which may converge in the operative conditions defined by $M[\Delta t f] < 1$.

As for the third question, being implicit, the implicit Euler method is going to be more expensive per step than the explicit method. But using Newton's method is going to make it far more expensive per step. Can this extra cost be justified? There are two possible benefits: if the method is more accurate or has improved stability properties, it might be possible to employ larger steps $\Delta t$ than in the explicit method. Then the implicit method would compensate for the inefficiency of the explicit method. It turns out that the advantage is in improved stability, and that there are cases when the implicit Euler method can use step sizes $\Delta t$ that are orders of magnitude greater than those of the explicit method. These conditions depend on the differential equation, and do not violate the solvability issues raised in the first question. Whenever these conditions are at hand, the implicit method easily makes up for its more expensive computations. The issue is not the cost per step, but the cost per integrated unit of time, often referred to as the cost per unit step.

8.1 Convergence

Let us begin by investigating consistency.
We follow standard procedure and insert the exact solution $y(t)$ into the discretization, to obtain
\[ y(t_{n+1}) = y(t_n) + \Delta t\, f(t_{n+1}, y(t_{n+1})) - r_n, \tag{8.2} \]
where we want to find the magnitude of the local residual $r_n$. We assume that the exact solution $y$ is twice continuously differentiable and expand in a Taylor series. Here we note that we need to expand both $y(t_{n+1})$ and $\dot y(t_{n+1}) = f(t_{n+1}, y(t_{n+1}))$ around $t_n$. Thus we have
\[ y(t_{n+1}) = y(t_n) + \Delta t\, \dot y(t_n) + \frac{\Delta t^2}{2} \ddot y(t_n) + O(\Delta t^3) \]
\[ \dot y(t_{n+1}) = \dot y(t_n) + \Delta t\, \ddot y(t_n) + O(\Delta t^2). \]
Inserting into (8.2) we conclude that
\[ r_n = \frac{\Delta t^2\, \ddot y(t_n)}{2} + O(\Delta t^3). \tag{8.3} \]

Hence the order of consistency of the implicit Euler method is $p = 1$, just like for the explicit method. The only difference is that the local residual in the implicit Euler method has the opposite sign from that of the explicit Euler method.

Lemma 8.1. Let $\{y(t_n)\}$ denote the exact solution to (7.1) at the time points $\{t_n\}$. Assume that $y \in C^2[0,T]$, and that $f$ is Lipschitz. When the exact solution is inserted into the implicit Euler recursion (8.2), the local residual is
\[ \|r_n\| \le \frac{\Delta t^2}{2} \max_{t\in[0,T]} \|\ddot y(t)\| \tag{8.4} \]
as $\Delta t \to 0$.

As for the global error and convergence, the analysis is now slightly more complicated because the method is implicit. Assuming that $\Delta t\, M[f] < 1$, the inverse map $(I - \Delta t f)^{-1}$ exists and is Lipschitz on account of Theorem 6.2. We now have
\[ (I - \Delta t f)(y_{n+1}) = y_n \]
\[ (I - \Delta t f)(y(t_{n+1})) = y(t_n) - r_n. \]
Inverting $I - \Delta t f$, subtracting and taking norms, it follows that
\[ \|e_{n+1}\| \le L[(I - \Delta t f)^{-1}] \cdot \|e_n + r_n\|, \tag{8.5} \]
where $e_n = y_n - y(t_n)$ is the global error at $t_n$. Again, by Theorem 6.2, we obtain
\[ \|e_{n+1}\| \le \frac{\|e_n + r_n\|}{1 - \Delta t\, M[f]} \le \frac{\|e_n\| + \|r_n\|}{1 - \Delta t\, M[f]} \approx (1 + \Delta t\, M[f])\,\|e_n\| + \|r_n\| + O(\Delta t^3). \tag{8.6} \]
Thus we have the approximate difference inequality $\|e_{n+1}\| \lesssim (1 + \Delta t\, M[f])\,\|e_n\| + \|r_n\|$, conforming to Lemma 7.2. Identifying $\mu = \Delta t\, M[f]$ and $v_k = \|r_k\|$, we obtain the following standard convergence result.

Theorem 8.1. Let the initial value problem $\dot y = f(t,y)$ be given with $y(0) = y_0$ on a compact interval $[0,T]$. Assume that $\Delta t\, M[f] < 1$ for $y \in \mathbb{R}^d$, and that the solution $y \in C^2[0,T]$. When this problem is solved, taking $N$ steps with the implicit Euler method using step size $\Delta t = T/N$, the global error at $t = T$ is bounded by
\[ \|y_N - y(T)\| \lesssim \Delta t \max_{t\in[0,T]} \frac{\|\ddot y\|}{2} \cdot \frac{e^{T M[f]} - 1}{M[f]}. \tag{8.7} \]

Since this conforms to the bound obtained for the explicit method, there appears to be little new to learn from the implicit method. Before we run some tests,

comparing the explicit and implicit Euler methods, let us note that there is another way of deriving the error estimate above. Thus, starting all over, we note that if the method in a single step starts from a point $y(t_n)$ on the exact solution, it produces an approximation
\[ \hat y_{n+1} = y(t_n) + \Delta t\, f(t_{n+1}, \hat y_{n+1}). \tag{8.8} \]
Because $\hat y_{n+1} \ne y(t_{n+1})$, it is warranted to introduce a special notation for the discrepancy. Thus we introduce the local error $l_{n+1}$, defined by
\[ \hat y_{n+1} = y(t_{n+1}) + l_{n+1}. \tag{8.9} \]
The local error is, naturally, related to the local residual. Subtracting (8.2) from (8.8), we have
\[ l_{n+1} = \Delta t\, f(t_{n+1}, y(t_{n+1}) + l_{n+1}) - \Delta t\, f(t_{n+1}, y(t_{n+1})) + r_n. \]
Let us for simplicity assume that $f(t,y) = Jy$, i.e., that the vector field $f$ is a linear constant coefficient system. Then
\[ l_{n+1} = (I - \Delta t J)^{-1} r_n. \tag{8.10} \]
The inverse of the matrix exists, since we assumed $\Delta t\, M[f] < 1$, and $M[J] = M[f]$ if $f = J$. Obviously, if $\|\Delta t J\| \to 0$ as $\Delta t \to 0$, we have $l_{n+1} \to r_n$. Thus, if the step sizes are small enough to make $\|\Delta t J\| \ll 1$, it holds that $l_{n+1} \approx r_n$. However, the point in using the implicit Euler method is to employ large step sizes for which $\|\Delta t J\| \gg 1$, while it still holds that $\Delta t\, M[J] \ll 1$. This is the case in stiff differential equations, where the local error $l_{n+1}$ can be much smaller than the local residual $r_n$.

For the explicit Euler method, the local residual and the local error are identical; thus $l_{n+1} = r_n$. For implicit methods, however, especially in realistic operational conditions, the difference is significant. In practical computations, it is therefore more important to control the magnitude of the local error than the local residual. For this reason, we emphasize the local error perspective in the sequel.

Let us now turn to comparing the computational behavior of the explicit and implicit Euler methods. This will be done by considering a few simple test problems that illustrate both stability and accuracy.
It is important to note that stability is the highest priority; without stability no accuracy can be obtained.

Example We shall compare the explicit and implicit Euler methods using the same test problem (7.14) as before, specifically
\[ \dot y = \lambda(y - \sin \omega t) + \omega \cos \omega t; \quad y(0) = y_0, \tag{8.11} \]
with exact solution
\[ y(t) = e^{\lambda t} y_0 + \sin \omega t. \]
We shall take $y_0 = 0$ so that the homogeneous solution is not present in the exact solution $y(t) = \sin \omega t$, but the exponential term will show up in the local solutions passing through the points $\{y_n\}$ generated by the numerical methods. We will take $\omega = \pi$ and solve the problem on $[0,1]$, and just like before we will use $N = 5$ and 10 steps, respectively. The prime motivation for the test is to investigate the effect of varying $\lambda$, which controls the damping of the exponential. We will use three different values, $\lambda = -0.2$, $-2$ and $-20$, and otherwise use the same computational setup for both methods. The results are shown in Figures 8.1 to 8.3.

Fig. 8.1 Comparing the Euler methods. Panels: Explicit Euler, $N = 5$ (top left); Implicit Euler, $N = 5$ (top right); Explicit Euler, $N = 10$ (bottom left); Implicit Euler, $N = 10$ (bottom right); horizontal axis $t$. Test problem (8.11) is solved for $\lambda = -0.2$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). Blue curve is the exact solution $y(t)$; red polygons represent the numerical solutions; green curves represent local solutions through numerically computed points. The local residuals have opposite signs, since explicit solutions proceed above $y(t)$, and below it in the implicit case. Going from 5 to 10 steps, the global error is $O(\Delta t)$ for both methods, with local residuals $O(\Delta t^2)$.

For $\lambda = -0.2$ the damping is weak and local solutions almost run parallel to the exact solution. The test verifies that the global error is of the same magnitude for both methods.

When $\lambda = -2$ there is more exponential damping. This has the interesting effect that the global error becomes smaller, due to the fact that the stability constant $C(T)$ is smaller when the damping rate increases. In addition, local solutions now show a moderate damping, but we still observe how the global error is similar in both methods, and still $O(\Delta t)$.

For $\lambda = -20$, there is strong exponential damping, as is evident from the fast contracting local solutions. The big surprise, however, is that for $N = 5$ (or, equivalently, $\Delta t = 0.2$), the explicit method goes unstable, with a numerical solution exhibiting growing oscillations diverging from the exact solution, even though the initial value was taken on the exact solution. This effect is due to numerical instability, and will be investigated in detail. The instability may at first seem surprising, since we do have a convergence proof for the method, but we note that $\lambda \Delta t = -4$, which is too large in magnitude for the method to remain stable. By contrast, there is no sign of instability in the implicit method, which remains stable. Nor is there any instability in the explicit method when $N = 10$ and $\Delta t = 0.1$ put $\lambda \Delta t$ at $-2$.
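The experiment for $\lambda = -20$ is easy to repeat. The sketch below applies both Euler methods to the test problem (8.11) with $\omega = \pi$ and $y_0 = 0$; because the problem is linear in $y$, the implicit equation is solved in closed form here rather than by Newton's method. The function name `euler_errors` is an illustrative choice.

```python
import math

def euler_errors(lam, N, omega=math.pi, T=1.0):
    """Solve y' = lam*(y - sin(omega t)) + omega*cos(omega t), y(0) = 0, on [0, T]
    with N explicit and N implicit Euler steps; return both endpoint errors."""
    dt = T / N

    def g(t):
        return math.sin(omega * t)

    def dg(t):
        return omega * math.cos(omega * t)

    y_ex, y_im = 0.0, 0.0
    for n in range(N):
        t, t1 = n * dt, (n + 1) * dt
        # Explicit Euler: the vector field is evaluated at (t_n, y_n).
        y_ex = y_ex + dt * (lam * (y_ex - g(t)) + dg(t))
        # Implicit Euler: y1 = y + dt*(lam*(y1 - g(t1)) + dg(t1)), solved for y1.
        y_im = (y_im + dt * (dg(t1) - lam * g(t1))) / (1.0 - dt * lam)
    exact = g(T)
    return abs(y_ex - exact), abs(y_im - exact)

for N in (5, 10):
    err_ex, err_im = euler_errors(-20.0, N)
    print(f"N = {N:2d}  dt*lam = {-20.0 / N:5.1f}  "
          f"explicit error = {err_ex:.4f}  implicit error = {err_im:.4f}")
```

With $N = 5$ the explicit error at $t = 1$ is far larger than the implicit one, while for $N = 10$ both errors are small, in line with the behavior described above.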

Fig. 8.2 Comparing the Euler methods. Panels: Explicit Euler, $N = 5$ (top left); Implicit Euler, $N = 5$ (top right); Explicit Euler, $N = 10$ (bottom left); Implicit Euler, $N = 10$ (bottom right); horizontal axis $t$. Test problem (8.11) is solved for $\lambda = -2$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). When the exponential damping increases, the global error decreases, but otherwise the results remain similar, with the exception of local solutions clearly displaying a faster damping rate.

However, we have seen that in the convergence proof stability depends on $(1 + \Delta t\, L[f])^N$ being bounded as $N \to \infty$ and $\Delta t \to 0$. While this is still true, we need to recognize that in a practical computation, we fix a $\Delta t > 0$ and then take a large number of steps with that step size. In such a case we have a new situation; the successive powers $(1 + \mu)^n$ will naturally grow unless $|1 + \mu| \le 1$. We will see that this condition has been violated in our situation when the method goes unstable.

The main discovery in the comparison of the explicit and implicit Euler methods is that the explicit method suddenly goes unstable when the product of the step size $\Delta t$ and the problem parameter $\lambda$ is too large in magnitude. This means that we need to develop a stability theory for the methods. It is not sufficient to investigate the stability of the mathematical problem and the discrete problem; we need to establish under what conditions stability of the mathematical problem carries over to the discrete problem. This will pose (possibly) restrictive conditions on the magnitude of the step size $\Delta t$.

Fig. 8.3 Comparing the Euler methods. Panels: Explicit Euler, $N = 5$ (top left); Implicit Euler, $N = 5$ (top right); Explicit Euler, $N = 10$ (bottom left); Implicit Euler, $N = 10$ (bottom right); horizontal axis $t$. Test problem (8.11) is solved for $\lambda = -20$ with $N = 5$ (top) and $N = 10$ steps (bottom), comparing explicit Euler (left panels) to implicit Euler (right panels). For $N = 5$, the explicit method shows numerical instability (top left) as indicated by growing oscillations, diverging from the exact solution. When the step size is shortened ($N = 10$, lower left), the method regains stability. In the implicit method, there is no instability at all; the implicit Euler method can use larger steps than the explicit Euler method. Finally, due to the strong exponential damping, the global error is very small whenever the computation is stable.

8.2 Numerical stability

Numerical stability is investigated in terms of the linear test equation,
\[ \dot x = \lambda x, \quad x(0) = 1, \tag{8.12} \]
where $\lambda \in \mathbb{C}$. This requires a special motivation.

Motivation Consider a linear constant coefficient system
\[ \dot y = Ay, \tag{8.13} \]
where $A$ is diagonalizable by the transformation $U^{-1}AU = \Lambda$, and $\Lambda$ is the diagonal matrix containing the eigenvalues of $A$. The transformation $y = Ux$ then implies $\dot y = U\dot x$, and
\[ U\dot x = AUx \quad \Rightarrow \quad \dot x = U^{-1}AUx = \Lambda x. \]
Since $A$ may have $\lambda[A] \in \mathbb{C}$, we take $\lambda \in \mathbb{C}$ in (8.12). If (8.13) is discretized by (say) the explicit Euler method, we obtain

\[ y_{n+1} = (I + \Delta t A)\, y_n. \]
Putting $y_n = Ux_n$, we get
\[ Ux_{n+1} = (I + \Delta t A)Ux_n \quad \Rightarrow \quad x_{n+1} = U^{-1}(I + \Delta t A)Ux_n = (I + \Delta t \Lambda)\, x_n. \]
But this is the explicit Euler method applied to the diagonalized problem $\dot x = \Lambda x$. Thus, diagonalization and discretization commute; it does not matter in which order these operations are carried out, see Figure 8.4. Therefore (8.13) can be analyzed eigenvalue by eigenvalue; this is what the linear test equation (8.12) does, with $\lambda \in \lambda[A]$. This justifies the interest in (8.12) as a standard test problem.

Fig. 8.4 Commutative diagram. [Diagram: Euler discretization maps $\dot y = Ay$ to $y_{n+1} = (I + hA)y_n$, and $\dot x = \Lambda x$ to $x_{n+1} = (I + h\Lambda)x_n$; diagonalization maps the upper row to the lower row. In the diagram $h$ denotes the step size $\Delta t$.] Diagonalization $U^{-1}AU = \Lambda$ of the vector field $A$ commutes with the Euler discretization, justifying the linear test equation for investigating numerical stability.

Now, let us consider the mathematical stability of (8.12). Since the solution is $x(t) = e^{t\lambda}$, it follows that
\[ |x(t)| = |e^{t\lambda}| = e^{t\,\mathrm{Re}\,\lambda}. \]
Hence $x(t)$ remains bounded for $t \ge 0$ if $\mathrm{Re}\,\lambda \le 0$. For a system $\dot y = Ay$, this corresponds to having $\lambda[A] \subset \mathbb{C}^-$, or equivalently $\alpha[A] \le 0$. This is the stability condition for the differential equation.

Checking the stability of the discretization, we start by investigating the explicit Euler method. For the linear test equation (8.12) we obtain
\[ x_{n+1} = (1 + \Delta t \lambda)\, x_n, \]
and it follows that $|x_{n+1}| = |1 + \Delta t \lambda| \cdot |x_n|$. Thus the numerical solution is nonincreasing provided that
\[ |1 + \Delta t \lambda| \le 1. \tag{8.14} \]
This is the condition for numerical stability. Unlike mathematical stability, it does not only depend on $\lambda$, but on the step size $\Delta t$ as well. More precisely, numerical stability depends on the product $\Delta t \lambda$, and does not automatically follow from mathematical stability. Instead, numerical stability must be examined method by method, establishing the unique step size limitations associated with each method.

Fig. 8.5 Stability regions of explicit and implicit Euler methods. Panels: Explicit Euler (left), Implicit Euler (right). Left panel shows the explicit Euler stability region $S_{EE}$ in the complex plane. The method is stable for $\Delta t \lambda$ inside the green disk. Right panel shows the stability region $S_{IE}$ of the implicit Euler method. The method is stable in $\mathbb{C}$, except inside the red disk, which corresponds to the region where the method is unstable. Thus the implicit Euler method is stable in the entire left half-plane, but also in most of the right half-plane, where the differential equation is unstable.

In (8.14), $\lambda$ is a complex number. We put $z = \Delta t \lambda \in \mathbb{C}$, and rewrite (8.14) as
\[ |1 + z| \le 1. \]
This is the interior of a circle in $\mathbb{C}$, with center at $z = -1$ and radius 1. It is referred to as the stability region of the explicit Euler method, formally defined as the disk
\[ S_{EE} = \{z \in \mathbb{C} : |1 + z| \le 1\}. \]
The discrete problem is numerically stable for $\Delta t \lambda \in S_{EE}$.

A similar analysis for the implicit Euler method yields
\[ x_{n+1} = x_n + \Delta t \lambda\, x_{n+1} \quad \Rightarrow \quad x_{n+1} = (1 - \Delta t \lambda)^{-1} x_n. \]
Hence the numerical solution remains bounded if $|1 - z|^{-1} \le 1$, and the stability region of the implicit Euler method is defined by

\[ S_{IE} = \{z \in \mathbb{C} : |1 - z| \ge 1\}. \]
Numerical stability requires that $\Delta t \lambda \in S_{IE}$. The boundary of the stability region of the implicit Euler method is also a circle, but now with center at $z = 1$ and radius 1. The important difference is that while $S_{EE}$ is the interior of a circle, $S_{IE}$ is the exterior of a circle. In particular, we note that $\mathbb{C}^- \subset S_{IE}$. Thus, if $\Delta t \lambda \in \mathbb{C}^-$, the implicit Euler method is stable.

The implicit Euler method has a large stability region, covering the entire negative half plane, while the explicit method has a small stability region, putting strong restrictions on the step size $\Delta t$. Thus the explicit Euler method can only use short steps. By contrast, the implicit Euler method is stable whenever $\Delta t \lambda \in \mathbb{C}^-$. Since $\Delta t > 0$ is real, it follows that there is no restriction on the step size if $\lambda \in \mathbb{C}^-$. For this reason, the method is sometimes referred to as unconditionally stable. The stability regions of the explicit and implicit Euler methods are found in Figure 8.5. We can now analyze the results we obtained when testing the two methods.

Example The previous test problem, used to assess the properties of the explicit and implicit Euler methods, was
\[ \dot y = \lambda(y - g) + \dot g. \]
This has a particular solution $y(t) = g(t)$ and exponential homogeneous solutions. Putting $x = y - g$, the test problem is transformed into the linear test equation $\dot x = \lambda x$. Thus, stability is only a function of $\Delta t \lambda$, and only depends on the homogeneous solutions and the method's ability to handle exponential solutions.

In the previous tests, we used two step sizes, $\Delta t = 0.2$ and $\Delta t = 0.1$. We also used three different values of $\lambda$, viz., $\lambda = -0.2$, $\lambda = -2$, and $\lambda = -20$. Since in all cases $\lambda \in \mathbb{C}^-$, the implicit Euler method is stable, no matter how the step size is chosen. This explains why no stability issues were observed.
For the explicit Euler method, we have the following table of parameter combinations:

Parameters          $\lambda = -0.2$              $\lambda = -2$               $\lambda = -20$
$\Delta t = 0.1$    $\Delta t\lambda = -0.02$     $\Delta t\lambda = -0.2$     $\Delta t\lambda = -2$
$\Delta t = 0.2$    $\Delta t\lambda = -0.04$     $\Delta t\lambda = -0.4$     $\Delta t\lambda = -4$

From this table we see that only one parameter combination, $\Delta t \lambda = -4$, is such that $\Delta t \lambda \notin S_{EE}$, causing numerical instability. Another combination, $\Delta t \lambda = -2$, is marginally stable. This is in full agreement with the tests, and for $\Delta t \lambda = -4$ the numerical solution became oscillatory and diverged from the exact solution, see Figure 8.3. This is the typical behavior when numerical instability is encountered.
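The table can be checked by evaluating the per-step growth factors of the two methods on the linear test equation, $|1 + z|$ for explicit Euler and $1/|1 - z|$ for implicit Euler, with $z = \Delta t \lambda$. The following sketch uses illustrative function names and works for complex $z$ as well.

```python
def growth_explicit(z):
    """Per-step growth factor |1 + z| of explicit Euler on x' = lam*x, z = dt*lam."""
    return abs(1 + z)

def growth_implicit(z):
    """Per-step growth factor 1/|1 - z| of implicit Euler on x' = lam*x."""
    return 1 / abs(1 - z)

rows = []
for dt in (0.1, 0.2):
    for lam in (-0.2, -2.0, -20.0):
        z = dt * lam
        rows.append((dt, lam, z, growth_explicit(z), growth_implicit(z)))
        print(f"dt = {dt:.1f}  lam = {lam:6.1f}  z = {z:6.2f}  "
              f"explicit |1+z| = {growth_explicit(z):.3f}  "
              f"implicit 1/|1-z| = {growth_implicit(z):.3f}")
```

Only $z = -4$ gives an explicit growth factor above 1 (namely 3), $z = -2$ gives a factor of 1 up to rounding, and the implicit factor is below 1 in all six cases.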

Fig. 8.6 Vector field and flow of a stiff problem. The test problem (8.11) is solved using the implicit Euler method with $N = 20$ steps. The exact solution $y(t) = \sin \pi t$ (blue) and the discrete solution (red markers) are plotted vs $t \in [0,1]$. Neighboring solutions (green) to the differential equation, for other initial conditions, illustrate the vector field away from the particular solution. At $\lambda = -30$, these solutions quickly converge on the particular solution. This is typical of stiff problems.

8.3 Stiff problems

Stiff initial value problems require time stepping methods with stability regions covering (nearly) all of $\mathbb{C}^-$. The simplest example of such a method is the implicit Euler method. The simplest illustration of a stiff problem is the Prothero Robinson test problem (8.11), i.e.,
\[ \dot y = \lambda(y - g) + \dot g, \]
whose particular solution is $y(t) = g(t)$, and where the homogeneous solutions are exponentials $e^{t\lambda}$.

Stiffness is a question of how the numerical method interacts with the problem. If $\lambda \ll -1$, the homogeneous solutions decay very fast, after which the solution is near the particular solution, $y(t) \approx g(t)$, no matter what the initial condition was. This is demonstrated in Figure 8.6 for $\lambda = -30$. This is not a very stiff problem, but the parameter values are chosen to make the effect readily visible to the naked eye. By taking $\lambda \ll -30$ the neighboring flow becomes nearly vertical.

Example (Stiffness) In the Prothero Robinson test problem, assume that $\lambda = -1000$, and that $g(t) = \sin t$. Analyzing the stability of a time stepping method for this problem is equivalent to considering the linear test equation (8.12) with the given $\lambda$.

Let us consider the numerical stability of the explicit Euler method. Since $\lambda$ is real, we then require $|1 + \Delta t \lambda| \le 1$, or, given that $\lambda < 0$,
\[ \Delta t \le \Delta t_S = \frac{2}{|\lambda|} = 0.002, \]
where $\Delta t_S$ is the maximum stable step size.

Meanwhile, approximating the particular solution $y(t) \approx \sin t$, using the explicit Euler method, produces a local residual (equal to its local error)
\[ \|r_n\| \approx \frac{\Delta t^2\, \|\ddot y\|}{2} \le \frac{\Delta t^2}{2}. \]
Let us assume that we need an accuracy specified by $\|l_n\| \le \mathrm{TOL} = 10^{-4}$, where TOL is a prescribed local error tolerance. Then, obviously, we can accept a step size
\[ \Delta t_{TOL} = \sqrt{2\,\mathrm{TOL}} \approx 0.014. \]
However, it will not be possible to use such a step size, because the method would go unstable. There is a conflict between the stability requirement and the accuracy requirement, since $\Delta t_S < \Delta t_{TOL}$. This is the problem of stiffness; being restricted by having to maintain stability, an explicit method cannot reach its potential accuracy.

The problem is overcome by using an appropriate implicit method. For example, as the implicit Euler method is unconditionally stable, it has no stability restriction on $\Delta t$. Its local residual has the same magnitude as that of the explicit method, so it will be possible to use $\Delta t = 0.014$. In fact, one can use an even larger step size, as the local error is smaller. Thus
\[ \|l_{n+1}\| = \frac{\|r_n\|}{|1 - \Delta t \lambda|} \approx \frac{\Delta t^2}{2\,\Delta t\,|\lambda|} = \frac{\Delta t}{2|\lambda|} \le \mathrm{TOL}, \]
which requires $\Delta t_{TOL} = 0.2$. This step size will produce the requested accuracy, and, since it is 100 times larger than the maximum stable step size $\Delta t_S$ available to the explicit Euler method, the implicit Euler method is likely going to be far more efficient, even though it is more expensive per step due to the necessary equation solving.

In real applications, one often encounters stiff problems where the ratio $\Delta t_S / \Delta t_{TOL} \ll 1$ for any explicit method. This ratio can be arbitrarily small, making it impossible to solve such stiff problems unless dedicated methods are used. On the other hand, when an appropriate implicit method is used, these problems can often be solved very quickly.
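The step sizes in this example follow from three short formulas, which the sketch below evaluates for $\lambda = -1000$ and $\mathrm{TOL} = 10^{-4}$. It assumes $|\ddot y| \le 1$ for $g(t) = \sin t$, as in the text, and for the implicit method it solves $\frac{\Delta t^2/2}{1 + \Delta t|\lambda|} = \mathrm{TOL}$ exactly instead of using the large-step approximation; the variable names are illustrative.

```python
import math

lam, TOL = -1000.0, 1e-4

# Largest stable explicit Euler step: |1 + dt*lam| <= 1 for real lam < 0.
dt_S = 2.0 / abs(lam)

# Accuracy-limited step when the local error equals the residual dt^2/2.
dt_TOL_explicit = math.sqrt(2.0 * TOL)

# Implicit Euler: local error (dt^2/2)/(1 + dt*|lam|) = TOL, i.e. the positive
# root of dt^2 - 2*TOL*|lam|*dt - 2*TOL = 0.
a = TOL * abs(lam)
dt_TOL_implicit = a + math.sqrt(a * a + 2.0 * TOL)

print(f"dt_S              = {dt_S:.4f}")
print(f"dt_TOL (explicit) = {dt_TOL_explicit:.4f}")
print(f"dt_TOL (implicit) = {dt_TOL_implicit:.4f}")
print(f"implicit step / stable explicit step = {dt_TOL_implicit / dt_S:.1f}")
```

The exact root is about 0.201, in agreement with the approximate value $\Delta t_{TOL} = 0.2$, and the implicit method can take steps roughly 100 times longer than the stability limit of the explicit method.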
To develop a comprehensive theory of stiffness, we may consider a system of nonlinear equations having a structure similar to the Prothero Robinson example. Thus we will consider
\[ \dot y = f(y - g) + \dot g, \tag{8.15} \]
where $f : \mathbb{R}^d \to \mathbb{R}^d$ is a nonlinear map having $f(0) = 0$. Then $g$ can be viewed as a particular solution, as $y(t) = g(t)$ satisfies (8.15). But since the initial condition $y(0)$ may not be chosen to equal $g(0)$, there is also a nonlinear transient, corresponding to the homogeneous solution. Putting $u = y - g$, (8.15) is turned into the simpler nonlinear problem
\[ \dot u = f(u), \quad u(0) = u_0. \tag{8.16} \]

The central issue is now whether an explicit method would suffer severe step size restrictions when solving this problem. This depends on the mathematical stability properties of the system, and in particular on whether there are strongly damped solutions near the zero solution $u = 0$, corresponding to the linear Prothero–Robinson problem with $\lambda \ll -1$.

If the initial value $u(0)$ is small enough, the system (8.16) can be linearized around the equilibrium solution $u = 0$, to obtain

$$\dot u \approx f'(0)\,u,$$

where $f'(0)$ is the Jacobian matrix of $f$ evaluated at $u = 0$, as long as $\|u\|_2 \ll 1$. Since the matrix is constant, the linearized problem is a simple linear system,

$$\dot u = Au,$$

where $A \in \mathbb{R}^{d \times d}$. Instead of a single scalar $\lambda$ as before, we now have to deal with a matrix (and its spectrum), asking whether it will lead to stability restrictions on the step size $\Delta t$. This is done using the techniques of Chapter 6.

Investigating its mathematical stability, we take the inner product with $u$ to find the differential inequalities

$$m_2[A]\,\|u\|_2 \le \frac{d}{dt}\|u\|_2 \le M_2[A]\,\|u\|_2.$$

Here $m_2[A]$ and $M_2[A]$ are the lower and upper logarithmic norms, respectively, and the differential inequalities imply that

$$e^{t\,m_2[A]} \le \|e^{tA}\|_2 \le e^{t\,M_2[A]}.$$

Thus the matrix exponential is bounded above and below. The lower bound gives the maximum decay rate of homogeneous solutions, while the upper bound gives the maximum growth rate.

In the Prothero–Robinson example above, we saw that the stability restriction in an explicit method is caused by a fast decay rate. This happened when $\lambda \ll -1$. In a linear system, we have such a fast decay rate when $m_2[A] \ll -1$. This is characteristic of stiff problems: $m_2[A] \ll -1$ is a necessary condition for stiffness. To give an alternative interpretation, if one were to integrate the problem in reverse time, one would solve $\dot u = -Au$, whose maximum growth rate is $M_2[-A] = -m_2[A] \gg 1$. Thus the reverse time problem has very unstable solutions.
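As a small illustration of the two logarithmic norms, the sketch below evaluates them for a real $2 \times 2$ matrix. It relies on the standard fact that, in the Euclidean norm, $m_2[A]$ and $M_2[A]$ are the extreme eigenvalues of the symmetric part $(A + A^{T})/2$. The matrix and the name `log_norms_2x2` are hypothetical and chosen only for this example.

```python
import math

def log_norms_2x2(A):
    """Lower and upper Euclidean logarithmic norms (m2, M2) of a real 2x2 matrix.

    They are the smallest and largest eigenvalues of the symmetric part
    (A + A^T) / 2, which has a closed form in the 2x2 case.
    """
    (a, b), (c, d) = A
    off = 0.5 * (b + c)                      # off-diagonal entry of the symmetric part
    mean = 0.5 * (a + d)                     # midpoint of the two eigenvalues
    radius = math.hypot(0.5 * (a - d), off)  # half the eigenvalue gap
    return mean - radius, mean + radius

A = [[-1000.0, 1.0], [0.0, -1.0]]            # hypothetical test matrix
m2, M2 = log_norms_2x2(A)

# Bounds exp(t*m2) <= ||exp(tA)||_2 <= exp(t*M2) at a sample time t.
t = 0.01
lower, upper = math.exp(t * m2), math.exp(t * M2)
print(m2, M2, lower, upper)
```

Here $m_2[A] \ll -1$ signals a very fast decay rate, while $M_2[A]$ is close to $-1$, so the slowest solutions decay on a time scale of order one.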
Since mathematical and numerical stability are not equivalent, we again consider the explicit Euler method. We then see that a sufficient condition for numerical stability is the circle condition $\|I + \Delta t\,A\|_2 \le 1$.

By the triangle inequality,

$$-1 + \Delta t\,\|A\|_2 \le \|I + \Delta t\,A\|_2 \le 1,$$

so the circle condition implies $\Delta t\,\|A\|_2 \le 2$. However, $\Delta t\,\|A\|_2 \ge -m_2[\Delta t\,A]$. Therefore, if $m_2[\Delta t\,A] \ll -1$ the circle condition cannot possibly be satisfied, and instability is bound to happen. It follows that $m_2[\Delta t\,A] \ll -1$ is a necessary condition for stiffness.

However, a system of equations is more complicated than a scalar equation. The investigation of the scalar problem excluded growing solutions. In a system, it is possible that we have growing as well as decaying solutions. We therefore introduce the average decay rate,

$$s[A] = \frac{m_2[A] + M_2[A]}{2}.$$

The reason why the condition $m_2[A] \ll -1$ alone might not cause stiffness is that if $M_2[A] \gg 1$ at the same time, the system has both rapidly decaying and rapidly growing solutions. While decaying solutions put a stability restriction on the step size, the growing solutions put an accuracy restriction on it, since they have to be resolved.

Definition 8.1. Let $A \in \mathbb{R}^{d \times d}$. The stiffness indicator of $A$ is defined by

$$s[A] = \frac{m_2[A] + M_2[A]}{2}. \tag{8.17}$$

Here we note that if $A = \lambda \in \mathbb{R}$, then $s[\lambda] = \lambda$. The stiffness indicator is compatible with the previous discussion of scalar problems, and we can proceed to generalize the concept to systems. For scalar problems, $\lambda$ puts a restriction on the step size. More generally, we now need to relate $s[A]$ to a time scale $\tau$, which is not necessarily the same as the step size $\Delta t$.

Definition 8.2. Assume that $s[A] < 0$. Then the local reference time scale $\tau$ is defined by

$$\tau = -\frac{1}{s[A]}. \tag{8.18}$$

The reference time scale approximates the largest step size by which an explicit method can proceed without losing numerical stability. Any desired time interval, be it the length of the integration interval or the preferred step size, can be related to the reference time scale.

Definition 8.3. Let $\tau$ be the local reference time scale. For a given step size $\Delta t$ the stiffness factor is defined by $r(\Delta t) = \Delta t/\tau$.
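Definitions 8.1 to 8.3 can be turned into a short computation. The following sketch handles a real $2 \times 2$ matrix only, again using the assumption that the Euclidean logarithmic norms are the extreme eigenvalues of the symmetric part. The function name and the two test matrices are hypothetical.

```python
import math

def stiffness_factor_2x2(A, dt):
    """Return (s[A], tau, r(dt)) for a real 2x2 matrix A and a step size dt.

    tau and r(dt) are None when s[A] >= 0, since Definition 8.2
    requires s[A] < 0.
    """
    (a, b), (c, d) = A
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), 0.5 * (b + c))
    m2, M2 = mean - radius, mean + radius    # lower / upper logarithmic norms
    s = 0.5 * (m2 + M2)                      # stiffness indicator (8.17)
    if s >= 0.0:
        return s, None, None
    tau = -1.0 / s                           # reference time scale (8.18)
    return s, tau, dt / tau                  # stiffness factor r(dt) = dt / tau

# Strong damping only: a large stiffness factor for dt = 0.2.
s1, tau1, r1 = stiffness_factor_2x2([[-1000.0, 1.0], [0.0, -1.0]], dt=0.2)

# Equally strong damping and growth: the indicator vanishes.
s2, tau2, r2 = stiffness_factor_2x2([[-1000.0, 0.0], [0.0, 1000.0]], dt=0.2)
print(s1, tau1, r1, s2)
```

The second matrix illustrates why $m_2[A] \ll -1$ alone is not sufficient: with $M_2[A] \gg 1$ at the same time, the average decay rate is zero and no reference time scale is defined.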
Irrespective of whether an explicit or implicit method is used, stiffness is determined by whether a step size $\Delta t > \tau$ is desired or not. This depends on many factors, not least the accuracy requirement and error tolerance TOL. As a simple observation, if the problem is to be solved on $[0,T]$, stiffness cannot occur if $r(T) < 1$, since

the step sizes will obviously be shorter than $\tau$ in such a case. However, it may very well occur that $r(T) \gg 1$, in which case the problem may turn out to be stiff.

We are now in a position to discuss stiffness in more general problems. Following the techniques outlined in Section 6.5, we consider two neighboring solutions to a nonlinear problem,

$$\dot u = f(u), \qquad u(0) = u_0,$$
$$\dot v = f(v), \qquad v(0) = u_0 + \Delta u_0,$$

where we no longer require that $f(0) = 0$. Thus we can consider stiffness in terms of perturbations along a non-constant solution. The difference $\Delta u = v - u$ satisfies

$$\frac{d}{dt}\Delta u = f(v) - f(u).$$

As before, we will only consider small ("infinitesimal") perturbations $\Delta u$, allowing us to linearize the perturbed problem around the non-constant solution $u$. Taking the inner product with $\Delta u$, we obtain the differential inequalities

$$m_2[J(u)]\,\|\Delta u\|_2 \le \frac{d}{dt}\|\Delta u\|_2 \le M_2[J(u)]\,\|\Delta u\|_2,$$

where $J(u) = f'(u)$ is the Jacobian matrix of $f$, evaluated along the nominal solution $u(t)$. Thus the matrix is no longer constant but varies along the solution trajectory. Nevertheless, the same theory applies, and if $s[J(u)] < 0$, we obtain a reference time scale $\tau(u)$. Thus the stiffness factor too can vary along the solution. Stiffness occurs whenever we need to use a step size $\Delta t$ such that

$$r(\Delta t) = \frac{\Delta t}{\tau(u)} \gg 1.$$

By evaluating $s[J(u)]$ along a trajectory, stiffness can be assessed locally.

Remarks on stiffness

For any nonlinear system $\dot u = f(u)$, with $f \in C^1$, stiffness is defined locally at any point $u$ of the vector field, in terms of $s[f'(u)]$. The stiffness indicator determines a local time scale $\tau(u)$. In case $f = A$ is a linear constant coefficient system, $s[A]$ and $\tau$ are constant on $[0,T]$.

1. Since $|s[J(u)]| \le L[f]$, a necessary condition for stiffness is that $T\,L[f] \gg 1$. However, the latter cannot be used as a characterization of stiffness. As $L[f] = L[-f]$, this quantity is independent of whether the problem is integrated in forward time or reverse time.
One of the most typical characteristics of stiffness is that the problem has strong damping in forward time, and is strongly unstable in reverse time. This is reflected by $s[J(u)]$.

2. Depending on the requested error tolerance TOL, as well as on the choice of method, the step size $\Delta t$ may have to be chosen shorter than $\tau$; in such a case stiffness will not occur either. Likewise, should $s[J(u)]$ become positive during some subinterval, stiffness is no longer an issue. Unless $T\,s[J(u)] \ll -1$ stiffness will not occur; this means that any time stepping method can be used, without loss of efficiency.

3. It is not uncommon to encounter problems where $T\,s[J(u)] \approx -10^{6}$ or of much greater magnitude. In some problems of practical interest the magnitude of $T\,s[J(u)]$ is larger still; in such cases, an explicit

method will never finish the integration, as trillions of steps will be necessary. By contrast, there are stiff problems where a well designed implicit method solves the problem in $N$ steps, where $N$ is practically independent of $T\,s[J(u)]$, or of $T\,L[f]$.

4. Given the choice between an explicit and an implicit method, using the same error tolerance, the explicit method may be restricted to using step sizes $\Delta t \lesssim \tau$, while an unconditionally stable implicit method might be able to employ a step size $\Delta t \gg \tau$.

8.4 Simple methods of order 2

The explicit and implicit Euler methods are the simplest time stepping methods, but they illustrate the essential aspects of such methods. They are simple to analyze and can be understood both intuitively and theoretically. However, the methods are only first order convergent and therefore of little practical interest. In real computations we need higher order methods. Before we proceed to advanced methods, we shall consider a few simple methods of convergence order $p = 2$.

The construction of the Euler methods started from approximating the derivative by a finite difference quotient. If $y \in C^1[t_n, t_{n+1}]$, then, by the mean value theorem, there is a $\theta \in [0,1]$ such that

$$\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} = \dot y\big((1-\theta)t_n + \theta t_{n+1}\big) = \dot y\big(t_n + \theta(t_{n+1} - t_n)\big).$$

But this is only an existence theorem, not telling us the value of $\theta$. In the explicit Euler method, we used $\theta = 0$ to create a first order approximation. Likewise, in the implicit Euler method we used $\theta = 1$. However, there is a better choice. Assuming that $y$ is sufficiently differentiable, expanding in Taylor series around $t_n$ yields, for the left hand side,

$$\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} = \dot y(t_n) + \frac{t_{n+1} - t_n}{2}\,\ddot y(t_n) + O\big((t_{n+1} - t_n)^2\big),$$

and for the right hand side,

$$\dot y\big(t_n + \theta(t_{n+1} - t_n)\big) = \dot y(t_n) + \theta(t_{n+1} - t_n)\,\ddot y(t_n) + O\big((t_{n+1} - t_n)^2\big).$$

Matching terms, we see that by taking $\theta = 1/2$, we have a second order approximation.
This is the best that can be achieved in general, and corresponds to the approximation

$$\frac{y(t_{n+1}) - y(t_n)}{t_{n+1} - t_n} \approx \dot y\!\left(\frac{t_n + t_{n+1}}{2}\right).$$

We can transform this into a computational method for first order IVPs in two ways. For obvious reasons, both are referred to as the midpoint method; one is explicit and the other implicit.
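The effect of the choice of $\theta$ can be observed numerically. The sketch below uses the hypothetical test function $y(t) = \sin t$ and estimates the order of the approximation by halving the step size; the function names and the evaluation point $t_n = 0.3$ are arbitrary choices made for this illustration.

```python
import math

def quotient_error(theta, dt, t_n=0.3):
    """Error of replacing (y(t_n + dt) - y(t_n)) / dt by y'(t_n + theta*dt), y = sin."""
    quotient = (math.sin(t_n + dt) - math.sin(t_n)) / dt
    return abs(quotient - math.cos(t_n + theta * dt))

def observed_order(theta, dt=1e-2):
    """Estimate the order p from the error ratio when dt is halved."""
    return math.log2(quotient_error(theta, dt) / quotient_error(theta, dt / 2))

for theta in (0.0, 0.5, 1.0):
    print(theta, round(observed_order(theta), 2))
```

For $\theta = 0$ and $\theta = 1$ the estimated order is close to one, while $\theta = 1/2$ gives an order close to two, in agreement with the Taylor expansions.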

Beginning with the Implicit Midpoint method, we define the scheme

$$y_{n+1} - y_n = \Delta t\, f\!\left(\frac{t_n + t_{n+1}}{2}, \frac{y_n + y_{n+1}}{2}\right). \tag{8.19}$$

The method is implicit since $y_{n+1}$ appears both in the left and right hand sides.

The explicit construction is just a matter of re-indexation. We use three consecutive equidistant points $t_{n-1}$, $t_n$ and $t_{n+1}$, all separated by a step size $\Delta t$. Then

$$\frac{y(t_{n+1}) - y(t_{n-1})}{2\Delta t} = \dot y(t_n) + O(\Delta t^2).$$

This leads to the Explicit Midpoint method, defined by

$$y_{n+1} - y_{n-1} = 2\Delta t\, f(t_n, y_n). \tag{8.20}$$

This is a two-step method, since it needs both $y_{n-1}$ and $y_n$ to compute $y_{n+1}$. On the other hand, the method is explicit and needs no equation solving.

There is a third, less obvious, construction. Consider the approximation

$$\dot y\!\left(\frac{t_n + t_{n+1}}{2}\right) \approx \frac{\dot y(t_n) + \dot y(t_{n+1})}{2}.$$

This means that we approximate the derivative at the midpoint by the average of the derivatives at the endpoints. Expanding $\dot y(t_n)$ and $\dot y(t_{n+1})$ around the midpoint, we obtain

$$\dot y(t_n) = \dot y(\bar t) - \frac{\Delta t}{2}\,\ddot y(\bar t) + O(\Delta t^2),$$
$$\dot y(t_{n+1}) = \dot y(\bar t) + \frac{\Delta t}{2}\,\ddot y(\bar t) + O(\Delta t^2),$$

where $\bar t = (t_n + t_{n+1})/2$ represents the midpoint. It follows that

$$\frac{\dot y(t_n) + \dot y(t_{n+1})}{2} = \dot y(\bar t) + O(\Delta t^2)$$

is a second order approximation. This leads to the Trapezoidal Rule,

$$y_{n+1} - y_n = \Delta t\,\frac{f(t_n, y_n) + f(t_{n+1}, y_{n+1})}{2}. \tag{8.21}$$

It is an implicit method, and it is obviously related to the implicit midpoint method. Thus, if we consider a linear constant coefficient problem $\dot y = Ay$, both methods produce the discretization

$$y_{n+1} - y_n = \Delta t\,\frac{Ay_n + Ay_{n+1}}{2}.$$

Solving for $y_{n+1}$, we obtain

$$y_{n+1} = \left(I - \frac{\Delta t}{2}A\right)^{-1}\left(I + \frac{\Delta t}{2}A\right) y_n.$$

However, the two methods are no longer identical for nonlinear systems, or for linear systems with time dependent coefficients.

Among advanced methods, there are two dominating classes, Runge–Kutta (RK) methods and linear multistep (LM) methods. While the latter may use an arbitrary number of steps to advance the solution, RK methods are one-step methods. We shall study both method classes below. Of the three second order methods above, the explicit midpoint method is in the LM class, but not in RK. The implicit midpoint method is in the RK class, but not in LM. Finally, the trapezoidal rule, as well as the explicit and implicit Euler methods studied before, can be seen as (some of the simplest) members of both the LM and the RK class.

Let us now turn to comparing these methods. For simplicity, we take the scalar Prothero–Robinson test problem (8.11), which means that we can solve the equations in the implicit methods exactly. In Figure 8.7 the trapezoidal rule is compared to the explicit Euler method.

The test demonstrates the need for higher order methods. Going from a first order to a second order method has a strong impact on accuracy. Although the step size is the same for both methods, the second order method can achieve several orders of magnitude higher accuracy. The same relative effect takes place each time we select a higher method order. Since it is possible to construct methods of very high convergence orders, it is in practice possible to solve many problems to full numerical accuracy at a reasonable computational cost. Modern codes typically implement methods of up to order $p = 6$, although there are standard solvers of even higher orders.

The test problem in Figure 8.7 is not stiff, but slightly unstable. It is not a particularly difficult problem. It is solved using a rather coarse step size, again to emphasize the differences in accuracy.
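A comparison of this kind can be sketched in a few lines. The code below assumes that the test problem has the linear Prothero–Robinson form $\dot y = \lambda(y - g(t)) + \dot g(t)$ with $g(t) = \sin \pi t$ and $y(0) = g(0)$, so that the exact solution is $y = g$; this form, and the function name `max_error`, are assumptions made for illustration. Because the problem is linear, the trapezoidal rule equation is solved exactly in each step.

```python
import math

def g(t):
    return math.sin(math.pi * t)

def dg(t):
    return math.pi * math.cos(math.pi * t)

def max_error(method, lam, T, N):
    """Largest error on the grid for y' = lam*(y - g) + g', y(0) = g(0).

    method is "EE" (explicit Euler) or "TR" (trapezoidal rule).
    """
    dt = T / N
    y, err = g(0.0), 0.0
    for n in range(N):
        t0, t1 = n * dt, (n + 1) * dt
        f0 = lam * (y - g(t0)) + dg(t0)          # vector field at (t_n, y_n)
        if method == "EE":
            y = y + dt * f0
        else:
            # TR is linear in y_{n+1} here, so solve for it in closed form.
            rhs = y + 0.5 * dt * (f0 - lam * g(t1) + dg(t1))
            y = rhs / (1.0 - 0.5 * dt * lam)
        err = max(err, abs(y - g(t1)))
    return err

lam, T = 0.1, 4.0                                # slightly unstable, two periods
err_EE, err_TR = max_error("EE", lam, T, 50), max_error("TR", lam, T, 50)
order_EE = math.log2(max_error("EE", lam, T, 400) / max_error("EE", lam, T, 800))
order_TR = math.log2(max_error("TR", lam, T, 400) / max_error("TR", lam, T, 800))
print(err_EE, err_TR, order_EE, order_TR)
```

With the same number of steps the trapezoidal rule error is markedly smaller, and the observed orders are close to one and two, respectively.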
In real computations the step size would be smaller, making the difference even more pronounced.

To see the effect as a function of the step size $\Delta t$, we compare the explicit and implicit Euler methods to the trapezoidal rule in Figure 8.8. Here, the setting is moderately stiff, at $\lambda = -50$, so as to also demonstrate when the explicit method goes unstable. More importantly, we see that for smaller (but still relevant) step sizes, the second order method achieves several orders of magnitude better accuracy.

This is the central idea in discretization methods: in a convergent method, accuracy increases as $\Delta t \to 0$, but the smaller the step size, the more computational effort is needed. So how do we obtain high accuracy without too large a computational cost? The answer is by using high order methods. Then the error can be made extremely small, even without taking $\Delta t$ exceedingly small.

The only concern is that we have to make sure that the method remains stable, since stability is required in order to have convergence. Note that this is not a matter

of the differential equation being unstable; such problems can be solved too, as demonstrated in Figure 8.7. Instead, it is a matter of whether the method as such is a stable discretization. To illustrate this point, we return to the test problem (8.11), and solve it using the explicit midpoint method. The results are found in Figure 8.9.

[Figure: two panels, "Explicit Euler" (top) and "Trapezoidal Rule" (bottom), solution vs. t]

Fig. 8.7 The effect of second order convergence. The test problem (8.11) is solved using the explicit Euler method (top) and with the trapezoidal rule (bottom), both using $N = 50$ steps. The exact solution $y(t) = \sin \pi t$ (blue) and the discrete solution (red markers) are plotted vs $t \in [0, 4]$, covering two full periods. The choice $\lambda = 0.1$ makes the problem slightly unstable, posing a greater challenge to the methods. The first order explicit Euler method has a readily visible and growing error. By contrast, the second order trapezoidal rule offers much higher accuracy.

Comparing the explicit midpoint method to the trapezoidal rule, both of second order, we find that there are substantial differences. Here we have returned to a nonstiff problem, with $\lambda = -1$, but even so, the explicit method soon develops unacceptable oscillations. These are due to instability, although the issue is less serious than before. Even so, the trapezoidal rule is far better, as the error plots demonstrate.

This test suggests that stability, convergence and accuracy are delicate matters that require a deep understanding. Later, we shall return to these methods in connection with Hamiltonian problems, and find that the explicit midpoint method has a unique niche, in Hamiltonian problems (e.g. in celestial mechanics) and in hyperbolic conservation laws. This is due to the mathematical properties of such problems.

[Figure: three panels, EE, IE and TR, global error vs. dt in log-log scale]

Fig. 8.8 Global error vs. step size. The test problem (8.11) is solved using explicit Euler (left), implicit Euler (center) and the trapezoidal rule (right), over $[0, 1]$ for $\lambda = -50$. Red graphs show global error magnitude at $t = 1$ as a function of $\Delta t$. The Euler methods are of order $p = 1$, as indicated by a dashed reference line of slope 1. The trapezoidal rule is of order $p = 2$, indicated by a reference line of slope 2. The error of the trapezoidal rule is 1000 times smaller at $\Delta t = 10^{-4}$, showing the impact of higher order methods. The explicit Euler error graph goes through the roof due to numerical instability once $\Delta t\,\lambda$ is outside the method's stability region.

To analyze the instability of the explicit midpoint method, we apply it to the linear test equation $\dot y = \lambda y$. We then obtain the recursion

$$y_{n+1} - y_{n-1} = 2\Delta t\,\lambda\,y_n.$$

Putting $q = \Delta t\,\lambda$, this is a linear difference equation

$$y_{n+1} - 2q\,y_n - y_{n-1} = 0.$$

This has stable solutions provided that the roots of the characteristic equation are inside the unit circle. The characteristic equation is

$$\kappa^2 - 2q\kappa - 1 = 0.$$

Since this can be factorized into $(\kappa - \kappa_1)(\kappa - \kappa_2) = 0$, where $\kappa_1, \kappa_2$ are the roots of the characteristic equation, we see that $\kappa_1\kappa_2 = -1$. Thus, if one root is less than one in magnitude, the other is greater than one. Stability therefore requires both roots to have unit modulus. Therefore we can write

$$\kappa_1 = e^{i\varphi}; \qquad \kappa_2 = -e^{-i\varphi}.$$

Now, since we must also have $\kappa_1 + \kappa_2 = 2q$, we obviously have

$$2q = e^{i\varphi} - e^{-i\varphi} = 2i \sin\varphi.$$

Consequently, $q = i\sin\varphi$, and it follows that $\Delta t\,\lambda = i\sin\varphi$. Since $\Delta t > 0$, the surprising result is that the method is only stable when $\lambda$ is on the imaginary axis. Writing $\lambda = i\omega$, we must therefore have

$$\Delta t\,\omega \in (-1, 1).$$

Note that we cannot allow $\Delta t\,\omega = \pm 1$ since we would then have a double root, leading to unbounded solutions.

[Figure: three panels, "Explicit midpoint method" (top), "Trapezoidal rule" (center) and "Error magnitudes" (bottom), vs. t]

Fig. 8.9 Comparison of second order methods. The test problem (8.11) is solved using the explicit midpoint method (top) and the trapezoidal rule (center), over $[0, 6]$ for $\lambda = -1$. Both methods have order $p = 2$ and use $N = 75$ steps. The explicit method develops growing oscillations over time, indicative of instability. No such issues are observed in the implicit method. Graphs of how the error evolves over time (bottom) show that while the trapezoidal rule (green) maintains a small, bounded error, the explicit midpoint method has an exponentially growing error (red), as indicated by the straight trendline in the lin-log diagram.
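The root condition can also be checked directly. The sketch below computes the two roots of $\kappa^2 - 2q\kappa - 1 = 0$ for a given $q = \Delta t\,\lambda$ and tests whether both lie in the closed unit disk without coinciding; the function names and the tolerance are choices made for this illustration.

```python
import cmath

def midpoint_roots(q):
    """Roots of kappa**2 - 2*q*kappa - 1 = 0, where q = dt * lam."""
    root = cmath.sqrt(q * q + 1.0)
    return q + root, q - root

def is_stable(q, tol=1e-12):
    """True if both roots have modulus <= 1 and they do not coincide."""
    k1, k2 = midpoint_roots(q)
    simple = abs(k1 - k2) > tol          # a double root of unit modulus is unstable
    return simple and max(abs(k1), abs(k2)) <= 1.0 + tol

k1, k2 = midpoint_roots(-0.1)
print(k1 * k2)                           # product of the roots
for q in (0.5j, -0.1, 1j, 1.5j):
    print(q, is_stable(q))
```

Only the purely imaginary value strictly inside the interval passes the test; a real negative $q$, the endpoint $q = i$ and a point beyond it all fail.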

[Figure: two panels, "Trapezoidal rule" (left) and "Explicit midpoint method" (right), regions in the complex plane]

Fig. 8.10 Stability regions. The stability region of the trapezoidal rule is the entire negative half plane $\mathbb{C}^-$, as indicated by the green area in the left panel. The stability region of the explicit midpoint method, however, is very small (right panel) as it only consists of the open interval $i\,(-1,1)$. Thus the endpoints $\pm i$ are not included.

The conclusion is that the method is only stable for $\Delta t\,\lambda$ on the open interval $i\,(-1,1)$ in the complex plane. This is simply a short strip of the imaginary axis. Since we used $\lambda = -1$ in our test, we chose $\lambda$ in the negative half plane. The method is therefore obviously unstable, no matter how we choose $\Delta t$. For this reason, the method is unsuitable for the test problem.

By contrast, the trapezoidal rule is stable, which explains why its performance is superior. No matter how $\lambda \in \mathbb{C}^-$ is chosen, the trapezoidal rule will not fail. To see this, we again consider the linear test equation, $\dot y = \lambda y$. We obtain the recursion

$$y_{n+1} - y_n = \frac{\Delta t\,\lambda}{2}\,(y_{n+1} + y_n),$$

resulting in

$$y_{n+1} = \frac{1 + z/2}{1 - z/2}\,y_n = R(z)\,y_n,$$

where $z = \Delta t\,\lambda$. Thus we need $|R(z)| \le 1$ for stability. This implies that

$$|1 + z/2| \le |1 - z/2|.$$

This requires that the distance from an arbitrary point $z/2 \in \mathbb{C}$ to $+1$ is greater than its distance to $-1$ in the complex plane. This obviously implies that $z \in \mathbb{C}^-$. The method's stability region is therefore $S_{TR} \equiv \mathbb{C}^-$. The stability regions of the explicit midpoint method and the trapezoidal rule are shown in Figure 8.10. This explains why two second order methods can produce such different results.

In fact, the stability properties of the explicit midpoint method need to be qualified. We have already determined its stability region. Let us again assume that we solve $\dot y = \lambda y$ with $\lambda \in \mathbb{R}$ to obtain the linear difference equation

$$y_{n+1} - 2q\,y_n - y_{n-1} = 0,$$

where $q = \Delta t\,\lambda$, and characteristic equation

$$\kappa^2 - 2q\kappa - 1 = 0.$$

As noted above, the product of the roots is $-1$, and the sum is $2q$. If $\lambda$ is real, then $q$ is real, so for stability the only possibility is that $\kappa_1 = 1$ and $\kappa_2 = -1$. The sum is zero, implying $\lambda = 0$. Thus, if $\lambda \ne 0$ is real, the method is necessarily unstable, with roots $\kappa_{1,2} = q \pm \sqrt{1 + q^2}$, that is,

$$\kappa_1 \approx 1 + \Delta t\,\lambda, \qquad \kappa_2 \approx -(1 - \Delta t\,\lambda).$$

It follows that the solution has the behavior

$$y_n \approx C_1 (1 + \Delta t\,\lambda)^n + C_2 (-1)^n (1 - \Delta t\,\lambda)^n \approx C_1 e^{t_n\lambda} + C_2 (-1)^n e^{-t_n\lambda},$$

where $t_n = n\,\Delta t$. Thus, there is a discrete oscillatory behavior, indicated by the factor $(-1)^n$. In case $\lambda \in \mathbb{R}^-$, the amplitude of this oscillation grows exponentially, as indicated by the factor $e^{-t_n\lambda}$. Although this is confirmed in Figure 8.9, in full agreement with theory, this undesirable behavior only evolves over time, since the coefficient $C_2$ is very small. The method does remain convergent, but it is not stable in the same sense as the other methods considered so far. The explicit midpoint method is weakly stable, and is only of practical value for problems where $\lambda = i\omega$ is located on the imaginary axis.

The qualification of stability that is needed is the following. A method is called zero stable if it produces stable solutions when applied to the (trivial) problem $\dot y = 0$.
The method is called strongly zero stable if the characteristic equation associated with this problem has a single root of unit modulus, $\kappa = 1$. In case there are other roots of unit modulus, but still no roots of multiplicity greater than one, the method is weakly zero stable. The explicit and implicit Euler methods, as well as the trapezoidal rule, only have the single root $\kappa = 1$, and are strongly zero stable. (All one-step methods are strongly zero stable, since they only have a single root.) The explicit midpoint method, however, is a two-step method having two unimodular roots, $\kappa = 1$ and $\kappa = -1$. Therefore it is weakly zero stable. It is the root $\kappa = -1$ that limits method performance.
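The weak stability of the explicit midpoint method is easy to reproduce. The sketch below applies the two-step recursion and the trapezoidal rule to the hypothetical test problem $\dot y = -y$, $y(0) = 1$, with $\Delta t = 0.1$. Taking the second starting value $y_1$ from the exact solution is an assumption made here; a two-step method needs some such starting procedure.

```python
import math

def explicit_midpoint(lam, dt, N, y0=1.0):
    """Two-step recursion y[n+1] = y[n-1] + 2*dt*lam*y[n] for y' = lam*y.

    The starting value y[1] is taken from the exact solution (an assumption).
    """
    y = [y0, y0 * math.exp(lam * dt)]
    for n in range(1, N):
        y.append(y[n - 1] + 2.0 * dt * lam * y[n])
    return y

def trapezoidal(lam, dt, N, y0=1.0):
    """One-step recursion y[n+1] = R(z) * y[n] with z = dt*lam."""
    R = (1.0 + 0.5 * dt * lam) / (1.0 - 0.5 * dt * lam)   # stability function R(z)
    return [y0 * R ** n for n in range(N + 1)]

y_mid = explicit_midpoint(-1.0, 0.1, 200)    # integrate to t = 20
y_tr = trapezoidal(-1.0, 0.1, 200)
print(y_mid[10], math.exp(-1.0))             # early on, the method is accurate
print(y_mid[-2], y_mid[-1], y_tr[-1])        # later, the parasitic mode dominates
```

The midpoint solution is accurate at first, but the component associated with the root near $-1$ eventually takes over, producing a growing amplitude with alternating sign, while the trapezoidal rule solution decays as it should.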

8.5 Conclusions

In a stable method, the global error has a more or less generic structure of the form

$$\|e\| \lesssim C\,\Delta t^{p}\,\|y^{(p)}\|\,\frac{e^{tM[f]} - 1}{M[f]}. \tag{8.22}$$

For convergence, stability is necessary. The necessary stability condition is rather modest: a method is only required to solve the linear test equation $\dot y = \lambda y$ in a stable manner for $\lambda = 0$, and for this reason, the condition is referred to as zero stability. It requires that the point $0 \in \mathbb{C}$ is inside the method's stability region. While this is the case also for the explicit midpoint method studied above, further qualifications are needed, and in most cases we require strong zero stability, which excludes the explicit midpoint method.

Apart from stability, which is crucial, the global error bound depends on

• A constant $C$, characteristic of the method, known as the error constant
• The step size $\Delta t$
• The method order $p$
• The regularity of the solution $y$, as indicated by $\|y^{(p)}\|$
• The range of integration $[0,T]$, or the interval on which the problem is solved
• The logarithmic norm $M[f]$, determining the damping rate of perturbations.

Thus a strongly stable (convergent) method will produce any desired accuracy by choosing $\Delta t$ small enough. Choosing a method with a higher order of convergence $p$ may often be preferable. The regularity of the solution, the damping rate, and the range of integration are parameters given by the problem, and little can be done about them. Nevertheless, it is of importance to understand that the final computational accuracy also depends on those parameters.

Returning to stability, there are three different notions involved:

• Stability of the problem, essentially governed by $M[f]$
• Stability of the discretization for a fixed $\Delta t > 0$
• Stability of the method as $\Delta t \to 0$ for a given $T$.

Although this may at first seem like an overuse of the same term, they all refer to a continuous data dependence, as some variable goes to infinity.
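To get a feeling for how the factors in (8.22) interact, the bound can be evaluated for sample values. The sketch below evaluates it at the end point $t = T$; the error constant, the solution norm and all numerical values are hypothetical inputs, and the function names are illustrative only.

```python
import math

def error_growth_factor(M, T):
    """(exp(T*M) - 1) / M, with its limit T as M -> 0."""
    if M == 0.0:
        return T
    return math.expm1(T * M) / M             # expm1 avoids cancellation for small T*M

def global_error_bound(C, dt, p, ynorm, M, T):
    """Right hand side of the generic bound (8.22), evaluated at t = T."""
    return C * dt ** p * ynorm * error_growth_factor(M, T)

# Damped (M < 0), neutral (M = 0) and unstable (M > 0) problems on [0, 10].
for M in (-1.0, 0.0, 1.0):
    print(M, error_growth_factor(M, 10.0))

# Halving the step size in a second order method reduces the bound fourfold.
ratio = (global_error_bound(1.0, 0.1, 2, 1.0, -1.0, 1.0)
         / global_error_bound(1.0, 0.05, 2, 1.0, -1.0, 1.0))
print(ratio)
```

For $M[f] < 0$ the growth factor stays below $1/|M[f]|$ however long the interval is, for $M[f] = 0$ it grows linearly with $T$, and for $M[f] > 0$ it grows exponentially.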
In a stable problem, also referred to as mathematical stability, we are interested in whether a small perturbation of the differential equation only results in a small perturbation of the solution $y(t)$ as $t \to \infty$. The usual setting is that we consider a single solution, often an equilibrium, and whether small perturbations of the initial condition will increase or stay bounded. If they stay bounded, the solution is stable. One sufficient condition for mathematical stability is $M[f] < 0$, and in (8.22), we see that the stability of the problem affects the computational accuracy, in particular how fast the global error is accumulated.

In a stable discretization, also referred to as numerical stability, we are interested in whether the discrete system has a similar property. The standard setting is

that we take the given problem, fix a finite time step $\Delta t$, and let the recursion index $n \to \infty$. The question is whether the numerical solution $y_n$ remains bounded under perturbations, again usually in the initial value. In the best of worlds, numerical stability would follow from mathematical stability. However, this requires more.

Finally, the method must be stable in order to have convergence. The setting is different; here we fix an interval $[0,T]$, and consider solving the problem on that interval using $N$ steps $\Delta t = T/N$. The question is whether the numerical solution remains bounded as per (8.22) when $N \to \infty$. In effect, we ask that the accumulated error at time $T$ remains bounded, independent of how many steps $N$ we use to reach $T$. This is key to convergence, since the bound (8.22) contains the factor $\Delta t^p$, allowing us to make the accumulated error arbitrarily small by choosing $N$ large enough.

In all three cases above, we ask that perturbations remain bounded as some parameter tends to infinity. This is stability, and it keeps recurring in various guises throughout all of numerical analysis, explaining the importance of stability theory.

This concludes our analysis of elementary methods for first order initial value problems. Advanced methods work with similar notions, but require a different approach to the construction of the methods. The two main contenders are Runge–Kutta and linear multistep methods.


Chapter 9
Runge–Kutta methods

We have seen that higher order methods offer significantly higher accuracy at a moderate extra cost. Here we shall explore a systematic approach to the construction of high order one-step methods. In Runge–Kutta methods the key idea is to sample the vector field $f$ at several points in a single step, combining the results to obtain a high order end result in a single step. This entails using the samples to match as many terms in a Taylor series expansion of the solution as possible.

9.1 An elementary explicit RK method

Let us consider one of the simplest explicit Runge–Kutta methods, achieving second order convergence by sampling the vector field at two points per step. For clarity, we shall make the simplifying assumption that the differential equation is autonomous, i.e., the vector field does not depend on time $t$, but has the form

$$\dot y = f(y); \qquad y(0) = y_0.$$

The following computational procedure is a simple RK method. We first compute the stage derivatives,

$$Y_1' = f(y_n),$$
$$Y_2' = f(y_n + \Delta t\,Y_1').$$

These are samples of the vector field $f$ at two points $Y_1$ and $Y_2$ near the solution trajectory. They are not derivatives in the true sense, since they are not functions of time, but only evaluations of $f$. For this reason we use a prime to denote the stage derivatives, while a dot represents the derivative $\dot y$ of the solution $y$, which is a differentiable function. The points $Y_1$ and $Y_2$ are referred to as the stage values, and are defined by

$$Y_1 = y_n,$$
$$Y_2 = y_n + \Delta t\,Y_1'.$$

Thus it holds that $Y_i' = f(Y_i)$. We finally update the solution according to

$$y_{n+1} = y_n + \frac{\Delta t}{2}\,(Y_1' + Y_2'). \tag{9.1}$$

This is an explicit method, since there is no need to solve nonlinear equations in the process. We shall see that it is a second order convergent method, which can be viewed as an explicit method that emulates the trapezoidal rule.

We shall compare this method to the second order convergent trapezoidal rule, starting from the same point $y_n$. It computes an update $\bar y_{n+1}$, defined by

$$\bar y_{n+1} = y_n + \frac{\Delta t}{2}\,\big(f(y_n) + f(\bar y_{n+1})\big).$$

This method is implicit, which makes it relatively expensive. However, the explicit second order RK method (9.1) above has been constructed along similar lines, by replacing the vector field sample $f(\bar y_{n+1})$ by $Y_2'$. To justify this operation, note that

$$Y_2 = y_n + \Delta t\,f(y_n).$$

Thus the stage value $Y_2$ is simply an explicit Euler step starting from $y_n$, and it follows that $Y_2 - \bar y_{n+1} = O(\Delta t^2)$, corresponding to the local error of the explicit Euler method in a single step. It follows that

$$Y_2' = f(Y_2) = f\big(\bar y_{n+1} + O(\Delta t^2)\big) = f(\bar y_{n+1}) + O(\Delta t^2),$$

provided that $f$ is Lipschitz (the standard assumption). Therefore (9.1) computes

$$y_{n+1} = y_n + \frac{\Delta t}{2}\,(Y_1' + Y_2') = \bar y_{n+1} + O(\Delta t^3).$$

This implies that the RK method (9.1) is a second order explicit workaround producing nearly the same result as the trapezoidal rule. It is cheaper, but the benefit comes at a price: the explicit RK method does not have the excellent stability properties of the implicit trapezoidal rule.

Runge–Kutta methods are divided into explicit (ERK) and implicit (IRK) methods. The classical ERK methods date back to 1895, when Carl Runge and Wilhelm Kutta developed some of the ERK methods that are still in use today.
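The two-stage method (9.1) translates almost line by line into code. The sketch below is for a scalar autonomous problem; the test problem $\dot y = -y$, $y(0) = 1$ and the function names are hypothetical choices made for illustration.

```python
import math

def rk2_step(f, y, dt):
    """One step of the two-stage explicit RK method (9.1) for y' = f(y)."""
    dY1 = f(y)                 # stage derivative Y1' = f(Y1), with Y1 = y_n
    Y2 = y + dt * dY1          # stage value Y2: an explicit Euler step
    dY2 = f(Y2)                # stage derivative Y2' = f(Y2)
    return y + 0.5 * dt * (dY1 + dY2)

def solve(f, y0, T, N):
    """Take N equal steps of size T/N and return the value at t = T."""
    y, dt = y0, T / N
    for _ in range(N):
        y = rk2_step(f, y, dt)
    return y

def f(y):
    return -y                  # hypothetical test problem y' = -y, y(0) = 1

one_step = rk2_step(f, 1.0, 0.1)
errors = [abs(solve(f, 1.0, 1.0, N) - math.exp(-1.0)) for N in (50, 100)]
order = math.log2(errors[0] / errors[1])
print(one_step, errors, order)
```

Doubling the number of steps reduces the error by a factor close to four, which is the behavior expected of a second order method.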
Runge and Kutta were involved in mathematics and its applications in physics and fluid mechanics, and realized that the construction of accurate and stable computational methods for initial value problems required serious mathematical thought. The modern theory of RK methods, however, was largely initiated and developed by John C. Butcher of the University of Auckland, New Zealand. Because of the complexity of RK theory, this area is still a lively field of research.

9.2 Explicit Runge–Kutta methods

So far we have based the construction of our methods on replacing the derivative by a finite difference. By contrast, the idea behind Runge–Kutta methods is closely related to interpolation theory. Integrating the differential equation $\dot y = f(t, y)$ over the interval $[t_n, t_{n+1}]$, we obtain

$$y(t_{n+1}) - y(t_n) = \int_{t_n}^{t_{n+1}} f(\tau, y(\tau))\,d\tau. \tag{9.2}$$

This transforms the differential equation into an integral equation. Here we have returned to the nonautonomous formulation $\dot y = f(t,y)$, although we will shortly go back to the autonomous formulation.

As the integral cannot be evaluated exactly, it needs to be approximated numerically. The standard approach to numerical integration is to sample the integrand $f(\tau, y(\tau))$ at a number of points, and approximate the integral by a weighted sum,

$$\int_{t_n}^{t_{n+1}} f(\tau, y(\tau))\,d\tau \approx \Delta t \sum_{i=1}^{s} b_i Y_i',$$

where $Y_i' = f(\tau_i, Y_i)$. The accuracy of this approximation depends on how we construct the stage values $Y_i$ and the corresponding stage derivatives, $Y_i' = f(\tau_i, Y_i)$. This means that (9.2) generates a numerical integration formula

$$y_{n+1} - y_n = \Delta t \sum_{i=1}^{s} b_i Y_i',$$

and the difficulty lies in the construction of the stage values $Y_i$, which generate the stage derivatives.

There are many ways to choose the stage values, corresponding to the many different ways in which integrals can be approximated by discrete sums. However, because the stage values and derivatives have to be generated sequentially in an explicit computation, we must have

$$Y_1' = f(t_n, y_n).$$

Subsequent stage values are obtained by advancing a local solution based on linear combinations of previously computed stage derivatives. Thus

$$Y_2' = f(t_n + c_2\Delta t,\; y_n + a_{2,1}\Delta t\,Y_1'),$$
$$Y_3' = f(t_n + c_3\Delta t,\; y_n + a_{3,1}\Delta t\,Y_1' + a_{3,2}\Delta t\,Y_2'),$$
$$\vdots$$

This means that an explicit Runge–Kutta method for the problem $\dot y = f(t,y)$ is given by the computational process

$$Y_i' = f\Bigl(t_n + c_i\Delta t,\; y_n + \sum_{j=1}^{i-1} a_{i,j}\,\Delta t\,Y_j'\Bigr), \tag{9.3}$$

together with the updating formula

$$y_{n+1} = y_n + \Delta t \sum_{i=1}^{s} b_i Y_i'. \tag{9.4}$$

The method is determined by three sets of parameters: the nodes $c_i$, forming a vector $c$; the weights $b_i$, forming a vector $b$; and the matrix $A$ with coefficients $a_{i,j}$. These are usually arranged in the Butcher tableau,

$$\begin{array}{c|cccc}
0 & 0 & & & \\
c_2 & a_{2,1} & 0 & & \\
\vdots & \vdots & \ddots & \ddots & \\
c_s & a_{s,1} & a_{s,2} & \cdots & 0 \\
\hline
 & b_1 & b_2 & \cdots & b_s
\end{array}
\qquad\text{or}\qquad
\begin{array}{c|c}
c & A \\
\hline
 & b^{\mathsf T}
\end{array}$$

For explicit RK methods the coefficient matrix $A$ is strictly lower triangular. In implicit RK methods, this is no longer the case. In the sequel, we shall use the simplifying assumption

$$c_i = \sum_{j=1}^{s} a_{i,j}. \tag{9.5}$$

This means that the nodes are determined by the row sums of the coefficient matrix $A$. We then only need to consider the autonomous initial value problem $\dot y = f(y)$ to derive order and stability conditions. While it is possible to construct RK methods that do not satisfy the simplifying assumption, such methods are never used in practice, since important invariance properties are lost. Thus all state-of-the-art Runge–Kutta methods, including the original methods of 1895, satisfy the simplifying assumption. With the simplifying assumption, we can describe the RK process as

1. Compute the $i$th stage value $Y_i = y_n + \Delta t \sum_{j=1}^{i-1} a_{i,j} Y_j'$
2. Sample the vector field to compute the $i$th stage derivative $Y_i' = f(Y_i)$
3. After computing all stage derivatives, update $y_{n+1} = y_n + \Delta t \sum_{i=1}^{s} b_i Y_i'$

In this process, we note that stage derivatives are always multiplied by the time step $\Delta t$. Thus the process should be viewed as computing stage values $Y_i$ and scaled stage derivatives, $\Delta t\,Y_i'$. The latter quantity is then computed from the vector field by $\Delta t\,Y_i' = \Delta t\,f(Y_i)$.
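The computational process (9.3)–(9.4) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `erk_step` and `integrate`, the use of NumPy, and the two-stage example tableau are assumptions made here, not part of the text.

```python
import numpy as np

def erk_step(f, t_n, y_n, dt, A, b, c):
    """One step of an explicit Runge-Kutta method, following (9.3)-(9.4).

    A, b, c are the Butcher coefficients; A must be strictly lower triangular.
    """
    s = len(b)
    y_n = np.asarray(y_n, dtype=float)
    dY = np.zeros((s,) + y_n.shape)          # stage derivatives Y'_i
    for i in range(s):
        # Stage value Y_i uses only previously computed stage derivatives.
        Y_i = y_n + dt * sum(A[i][j] * dY[j] for j in range(i))
        # Sample the vector field at the stage value.
        dY[i] = f(t_n + c[i] * dt, Y_i)
    # Updating formula (9.4).
    return y_n + dt * sum(b[i] * dY[i] for i in range(s))

def integrate(f, t0, y0, t_end, n_steps, A, b, c):
    """Take n_steps equal steps from t0 to t_end (hypothetical helper)."""
    dt = (t_end - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = erk_step(f, t, y, dt, A, b, c)
        t += dt
    return y

# Example tableau (an assumption for illustration): a two-stage method
# whose nodes satisfy the simplifying assumption (9.5).
A_ex = [[0.0, 0.0],
        [0.5, 0.0]]
b_ex = [0.0, 1.0]
c_ex = [0.0, 0.5]

if __name__ == "__main__":
    # Nodes equal the row sums of A, as required by (9.5).
    print(np.allclose(c_ex, np.sum(A_ex, axis=1)))
    # Test problem y' = -y, y(0) = 1, compared with exp(-1).
    y_end = integrate(lambda t, y: -y, 0.0, 1.0, 1.0, 100, A_ex, b_ex, c_ex)
    print(float(y_end), np.exp(-1.0))
```

Storing the stage derivatives rather than the stage values mirrors the remark above that the method only ever needs the scaled quantities $\Delta t\,Y_i'$.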

9.3 Taylor series expansions and elementary differentials

In the derivation of RK methods, we need to match terms in the Taylor series expansions of the method's updating formula and the expansion of the exact solution. Because RK methods by construction employ nested evaluations of the vector field $f$ when the stage derivatives $Y_i'$ are computed, the Taylor series expansions are more complicated than otherwise. The standard approach is to express the Taylor series in terms of the function $f$ and its derivatives, rather than in terms of the solution $y$ and its derivatives.

Below, we shall use a short-hand notation for function values and their derivatives. Since all expansions are around $t_n$ in time and $y_n$ in space, we let $y, \dot y, \ddot y, \dots$ denote the values $y(t_n), \dot y(t_n), \ddot y(t_n)$, etc. Likewise, we let $f$ denote $f(y)$, while $f_y = \partial f/\partial y$ denotes the Jacobian matrix with elements $\{\partial f_i/\partial y_j\}$. Note that, due to the simplifying assumption, we only have to consider the differential equation $\dot y = f(y)$; without that assumption, we would have had to consider $\dot y = f(t,y)$, requiring two partial derivatives, $f_y$ and $f_t$. As will become clear, the simplifying assumption saves a lot of work, without losing generality.

We also need higher order derivatives, and $f_{yy}$ denotes the 3-tensor with elements $\{\partial^2 f_i/\partial y_j \partial y_k\}$. Having three indices, it is a multilinear operator producing a vector if applied to two vectors. Thus $f_{yy} f f$ is a vector, which can be computed successively, from $(f_{yy} f) f$, where the product of the 3-tensor $f_{yy}$ and the vector (1-tensor) $f$ produces the matrix (2-tensor) $f_{yy} f$. This, in turn, can then multiply the vector $f$ in the usual way; thus $(f_{yy} f) f$ is simply a matrix-vector multiply. If this sounds complicated, the worst is yet to come. Now, since

$$y(t + \Delta t) = y + \Delta t\,\dot y + \frac{\Delta t^2}{2}\,\ddot y + \frac{\Delta t^3}{6}\,\dddot y + \dots$$

we will have to convert derivatives of $y$ into derivatives of $f$ using the differential equation $\dot y = f$.
By the chain rule it follows that

$$\ddot y = f_y \dot y = f_y f.$$

Then, again using the chain rule,

$$\dddot y = \frac{\mathrm{d}}{\mathrm{d}t}(f_y f) = (f_{yy}\dot y) f + f_y f_y f = (f_{yy} f) f + f_y f_y f.$$

Before computing higher order derivatives, we introduce some simplifications and short-hand notation. First, in an expression of the type $f_{yy} g h$, the order of the two arguments $g$ and $h$ does not matter; thus $(f_{yy} g) h = (f_{yy} h) g$. Second, in an expression like $f_{yy} f f$, where the 3-tensor has two identical arguments $f$, we will allow the (slightly abusive) short-hand notation $f_{yy} f^2$, although $f^2$ does not represent a power (which has no meaning for a vector) but only that the argument occurs twice. Finally, in an expression of the type $f_y f_y f$ the Jacobian multiplies $f$ twice, justifying the notation $f_y^2 f$; this is indeed a power of $f_y$. We then have

$$\dddot y = f_{yy} f^2 + f_y^2 f.$$

From here it is all uphill. Thus, applying the chain rule to each term of the third derivative, observing the rules of the simplified notation, we have

$$\ddddot y = f_{yyy} f^3 + f_{yy}(f_y f) f + (f_{yy} f) f_y f + (f_{yy} f) f_y f + f_y (f_{yy} f^2) + f_y^3 f.$$

Noting that three terms are identical, and omitting superfluous parentheses, we collect terms to get

$$\ddddot y = f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f.$$

The terms appearing in the total derivatives are called elementary differentials, and every total derivative is composed of several elementary differentials. Unfortunately the number of elementary differentials grows exponentially with the order of the derivative, soon making the expressions very complicated. Nevertheless, collecting the expressions obtained so far, we have

$$\begin{aligned}
\dot y &= f \\
\ddot y &= f_y f \\
\dddot y &= f_{yy} f^2 + f_y^2 f \\
\ddddot y &= f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f,
\end{aligned}$$

so that the Taylor series is

$$y(t + \Delta t) = y + \Delta t\,f + \frac{\Delta t^2}{2!}\,f_y f + \frac{\Delta t^3}{3!}\bigl(f_{yy} f^2 + f_y^2 f\bigr) + \frac{\Delta t^4}{4!}\bigl(f_{yyy} f^3 + 3 f_{yy} f f_y f + f_y f_{yy} f^2 + f_y^3 f\bigr) + \dots$$

The next step is to compare this Taylor series to that of the RK method's updating formula. To exemplify, let us consider a two-stage ERK. Its Butcher tableau is

$$\begin{array}{c|cc}
0 & 0 & 0 \\
c_2 & a_{21} & 0 \\
\hline
 & b_1 & b_2
\end{array}$$

where $c_2 = a_{21}$. Thus a two-stage ERK has three free parameters, $a_{21}$, $b_1$ and $b_2$, which can be chosen to maximize the order of the method. The method advances the solution a single step by

$$y_{n+1} = y_n + \Delta t\,\bigl(b_1 f(y_n) + b_2 f(y_n + a_{21}\Delta t\,f(y_n))\bigr). \tag{9.6}$$

We now need to select $a_{21}$, $b_1$ and $b_2$ so as to match as many terms as possible to the previous Taylor series expansion. To this end, we need to expand (9.6) in a Taylor series as well. Fortunately, since by assumption $f(y_n) = \dot y(t_n)$, there is only one term to expand. Noting that $a_{21}\Delta t\,f(y_n)$ is small, we have

$$f(y_n + a_{21}\Delta t\,f(y_n)) = f + a_{21}\Delta t\,f_y f + O(\Delta t^2).$$

Thus we can assemble the Taylor series from (9.6) to obtain

$$y_{n+1} = y + \Delta t\,(b_1 f + b_2 f) + b_2 a_{21}\Delta t^2 f_y f + O(\Delta t^3).$$

Matching terms, we achieve second order if

$$\begin{aligned}
b_1 + b_2 &= 1 \\
b_2 c_2 &= \tfrac{1}{2},
\end{aligned}$$

where we have preferred to let the parameter $c_2$ replace $a_{21}$. Now, since we have three parameters but only two equations, the solution is not unique; there is a one-parameter family of two-stage RK methods of order $p = 2$. Choosing $\beta = b_2$ as the free parameter, the family can be written

$$\begin{array}{c|cc}
0 & 0 & 0 \\
\frac{1}{2\beta} & \frac{1}{2\beta} & 0 \\
\hline
 & 1-\beta & \beta
\end{array}$$

where we typically choose $\beta \in [\tfrac{1}{2}, 1]$. The methods at the endpoints are perhaps the best known. Thus, at $\beta = 1/2$ we obtain Heun's method,

$$\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1 & 0 \\
\hline
 & 1/2 & 1/2
\end{array}$$

corresponding to the simple ERK we introduced to emulate the trapezoidal rule by an explicit method. On the other hand, taking $\beta = 1$, we get the modified Euler method,

$$\begin{array}{c|cc}
0 & 0 & 0 \\
1/2 & 1/2 & 0 \\
\hline
 & 0 & 1
\end{array}$$

This procedure looks complicated, and it is. For higher order methods we only need to allow more stage values and parameters, and expand the Taylor series to include more terms. This quickly gets out of hand, and the construction of RK methods needs special techniques. Thus the elementary differentials are usually represented by graphs ("trees"), and order conditions are derived using group theory, combinatorics and symmetry properties. That is the easy part. The hard part is that, as we saw above, the equations for determining the method coefficients are nonlinear algebraic equations, which means that it is often difficult to solve for the method parameters, and that computer software (numerical and symbolic) is usually needed in this step. In the light of this, it is quite remarkable that Runge managed to derive methods of order $p = 4$.
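The second order of the one-parameter family can be checked numerically. The sketch below is only illustrative: the function names, the test problem $\dot y = -y^2$, $y(0) = 1$ (exact solution $1/(1+t)$) and the step counts are assumptions chosen here. It verifies the two order conditions for a few values of $\beta$ and estimates the observed order by halving the step size.

```python
import math

def two_stage_coefficients(beta):
    """Return (a21, b1, b2) for the family with free parameter beta = b2."""
    return 1.0 / (2.0 * beta), 1.0 - beta, beta

def two_stage_step(f, y, dt, beta):
    """One step of the two-stage ERK (9.6) for the autonomous problem y' = f(y)."""
    a21, b1, b2 = two_stage_coefficients(beta)
    k1 = f(y)
    k2 = f(y + a21 * dt * k1)
    return y + dt * (b1 * k1 + b2 * k2)

def global_error(beta, n_steps):
    """Error at t = 1 for the assumed test problem y' = -y^2, y(0) = 1."""
    f = lambda y: -y * y
    y, dt = 1.0, 1.0 / n_steps
    for _ in range(n_steps):
        y = two_stage_step(f, y, dt, beta)
    return abs(y - 0.5)          # exact solution is 1/(1+t), so y(1) = 1/2

def observed_order(beta, n_steps=50):
    """Estimate the order from the error ratio when the step size is halved."""
    return math.log2(global_error(beta, n_steps) / global_error(beta, 2 * n_steps))

if __name__ == "__main__":
    for beta in (0.5, 0.75, 1.0):   # Heun, an intermediate member, modified Euler
        a21, b1, b2 = two_stage_coefficients(beta)
        print(beta, b1 + b2, b2 * a21, observed_order(beta))
```

An observed order close to 2 for each $\beta$ is what the matching of the $f$ and $f_y f$ terms predicts; the remaining $O(\Delta t^3)$ local error differs between members of the family.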

9.4 Order conditions and convergence

To specify order conditions, we take the derivations one step further and consider a three-stage ERK with Butcher tableau

$$\begin{array}{c|ccc}
0 & 0 & & \\
c_2 & a_{21} & 0 & \\
c_3 & a_{31} & a_{32} & 0 \\
\hline
 & b_1 & b_2 & b_3
\end{array}$$

Here we proceed in the same fashion as in the two-stage case, expanding the updating formula and comparing to a third order Taylor series. We now have six parameters, and as there are four elementary differentials for total derivatives not exceeding order three, we will again have a non-unique solution, now with two degrees of freedom. Two different one-parameter families (therefore not exhaustive) are

$$\begin{array}{c|ccc}
0 & 0 & & \\
2/3 & 2/3 & 0 & \\
2/3 & \frac{2}{3} - \frac{1}{4\beta} & \frac{1}{4\beta} & 0 \\
\hline
 & 1/4 & \frac{3}{4} - \beta & \beta
\end{array}
\qquad\text{and}\qquad
\begin{array}{c|ccc}
0 & 0 & & \\
2/3 & 2/3 & 0 & \\
0 & -\frac{1}{4\beta} & \frac{1}{4\beta} & 0 \\
\hline
 & \frac{1}{4} - \beta & 3/4 & \beta
\end{array}$$

Of these two, the first is typically preferred since the $b_i$ coefficients are positive if $0 < \beta < 3/4$. The best known method from this family is the Nyström method,

$$\begin{array}{c|ccc}
0 & 0 & & \\
2/3 & 2/3 & 0 & \\
2/3 & 0 & 2/3 & 0 \\
\hline
 & 1/4 & 3/8 & 3/8
\end{array}$$

A method of order $p = 3$, but not coming from either of these two families, is the classical RK3 method

$$\begin{array}{c|ccc}
0 & 0 & & \\
1/2 & 1/2 & 0 & \\
1 & -1 & 2 & 0 \\
\hline
 & 1/6 & 2/3 & 1/6
\end{array}$$

The procedure can be continued to find higher order methods, but it soon gets out of hand. More importantly, the degrees of freedom are lost, since the number of elementary differentials (order conditions) grows faster than the number of free parameters. In fact, already for order four, we have eight order conditions and ten parameters; hence still some degrees of freedom. But for order five, there are already 17 order conditions (elementary differentials), but only 15 parameters to choose in a five-stage explicit RK method. Unsurprisingly, there is no explicit Runge–Kutta method of order five with five stages; six stages are necessary.

To give an impression of the increasing complexity, we consider a four-stage ERK with Butcher tableau

$$\begin{array}{c|cccc}
0 & 0 & & & \\
c_2 & a_{21} & 0 & & \\
c_3 & a_{31} & a_{32} & 0 & \\
c_4 & a_{41} & a_{42} & a_{43} & 0 \\
\hline
 & b_1 & b_2 & b_3 & b_4
\end{array}$$

After expanding in the relevant Taylor series, matching elementary differentials, we obtain the order conditions. For order $p = 1$,

$$f: \quad \sum_i b_i = 1.$$

In addition, for order $p = 2$,

$$f_y f: \quad \sum_i b_i c_i = \frac{1}{2}.$$

For order $p = 3$, it must also hold that

$$\begin{aligned}
f_{yy} f^2: &\quad \sum_i b_i c_i^2 = \frac{1}{3} \\
f_y^2 f: &\quad \sum_{i,j} b_i a_{ij} c_j = \frac{1}{6}.
\end{aligned}$$

For order $p = 4$, we further require

$$\begin{aligned}
f_{yyy} f^3: &\quad \sum_i b_i c_i^3 = \frac{1}{4} \\
f_{yy} f f_y f: &\quad \sum_{i,j} b_i c_i a_{ij} c_j = \frac{1}{8} \\
f_y f_{yy} f^2: &\quad \sum_{i,j} b_i a_{ij} c_j^2 = \frac{1}{12} \\
f_y^3 f: &\quad \sum_{i,j,k} b_i a_{ij} a_{jk} c_k = \frac{1}{24}.
\end{aligned}$$

By now, it is clear that the order conditions cannot be solved for the free coefficients without hard work. Even so, in 1895 Runge found the classical RK4 method,

$$\begin{array}{c|cccc}
0 & 0 & & & \\
1/2 & 1/2 & 0 & & \\
1/2 & 0 & 1/2 & 0 & \\
1 & 0 & 0 & 1 & 0 \\
\hline
 & 1/6 & 1/3 & 1/3 & 1/6
\end{array}$$

corresponding to the computational scheme

$$\begin{aligned}
Y_1' &= f(t_n, y_n) \\
Y_2' &= f(t_n + \Delta t/2,\; y_n + \Delta t\,Y_1'/2) \\
Y_3' &= f(t_n + \Delta t/2,\; y_n + \Delta t\,Y_2'/2) \\
Y_4' &= f(t_n + \Delta t,\; y_n + \Delta t\,Y_3') \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y_1' + 2Y_2' + 2Y_3' + Y_4'\bigr).
\end{aligned}$$

This method is still in wide use today, demonstrating how powerful it is. It is linked to Simpson's rule for the computation of integrals. This is a fourth order method for computing integrals of a function of $t$. At considerable extra work, Runge was able to extend this idea to ordinary differential equations.

There is a full theory for how to construct explicit Runge–Kutta methods, and today, there are ERK methods in use of orders up to $p = 8$. Such methods are difficult to construct, but due to their extremely high accuracy, they are useful for high precision computations. To illustrate how difficult it is to construct such methods, we note that for an $s$-stage ERK method, there are $s(s+1)/2$ coefficients, as seen in the following table.

Stages s      1  2  3  4   5   6   7   8   9   10  11
Coefficients  1  3  6  10  15  21  28  36  45  55  66

Table 9.1 Stages and coefficients in ERK methods. The number of free parameters in an $s$-stage ERK method is $s(s+1)/2$

But there is an overwhelming number of order conditions to achieve order $p$. Comparing a given order to the number of order conditions, and the minimum number of stages $s$ required to achieve the requested order, we have the following table.

Order p       1  2  3  4  5   6   7   8    9    10
Conditions    1  2  4  8  17  37  85  200  486  1205
Min stages    1  2  3  4  6   7   9   11   ?    ?

Table 9.2 Necessary number of stages in ERK methods. The number of order conditions in ERK methods grows prohibitively fast

Thus it is currently not known how many stages are minimally needed to construct methods of orders 9 and 10. Naturally, this might not seem to be important. However, what is more surprising is that it has been possible to construct (say) 7-stage ERK methods of order $p = 6$; such methods are subject to no less than 37 order conditions due to the large number of elementary differentials, yet they only have 28 parameters. Thus 28 parameters must satisfy 37 equations; while this seems unlikely, it is nevertheless possible. It is even more remarkable that an order $p = 8$ method (with 200 order conditions) can be constructed using only 11 stages and 66 parameters (Butcher, 1985).

The one thing that is simple about Runge–Kutta methods is that every RK method satisfying the first order condition $\sum_i b_i = 1$ (consistency) is convergent. This follows from the methods being one-step methods. All consistent one-step methods are convergent, and have global error bounds similar to those derived for the Euler methods. The possible break-down of convergence only happens in multistep methods, or in connection with time-dependent partial differential equations.

The construction of explicit Runge–Kutta methods is far from trivial and requires considerable expertise. Luckily, there are several high performing methods to choose from, also with built-in error estimators. We will return to how these methods are made adaptive, using automatic step size control to meet a prescribed accuracy requirement.

9.5 Implicit Runge–Kutta methods

Implicit RK methods have a Butcher tableau

$$\begin{array}{c|c}
c & A \\
\hline
 & b^{\mathsf T}
\end{array}$$

where the matrix $A$ is no longer required to be strictly lower triangular. The order conditions are exactly the same as in the ERK case, as the parameters are again determined by matching coefficients in the Taylor series expansions of the solution $y(t + \Delta t)$ and the updating formula for $y_{n+1}$. Let us consider a general two-stage IRK method. It has a Butcher tableau

$$\begin{array}{c|cc}
c_1 & a_{11} & a_{12} \\
c_2 & a_{21} & a_{22} \\
\hline
 & b_1 & b_2
\end{array}$$

This corresponds to the equations

$$\begin{aligned}
Y_1' &= f\bigl(y_n + \Delta t\,(a_{11} Y_1' + a_{12} Y_2')\bigr) \\
Y_2' &= f\bigl(y_n + \Delta t\,(a_{21} Y_1' + a_{22} Y_2')\bigr) \\
y_{n+1} &= y_n + \Delta t\,(b_1 Y_1' + b_2 Y_2'),
\end{aligned}$$

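and it becomes evident that the first two stage equations form a single nonlinear equation system which must be solved in order to advance the solution. The aim of using IRK methods is to increase the stability region so as to obtain methods useful for solving stiff differential equations, for which $\Delta t\,L[f] \gg 1$. However, because it is expensive to use a general IRK with a full $A$ matrix, only a few such methods are ever used. Among them we find the well-known 3-stage order $p = 5$ Radau IIA method, also known as RADAU5. Its Butcher tableau is

$$\begin{array}{c|ccc}
\frac{4-\sqrt6}{10} & \frac{88-7\sqrt6}{360} & \frac{296-169\sqrt6}{1800} & \frac{-2+3\sqrt6}{225} \\
\frac{4+\sqrt6}{10} & \frac{296+169\sqrt6}{1800} & \frac{88+7\sqrt6}{360} & \frac{-2-3\sqrt6}{225} \\
1 & \frac{16-\sqrt6}{36} & \frac{16+\sqrt6}{36} & \frac{1}{9} \\
\hline
 & \frac{16-\sqrt6}{36} & \frac{16+\sqrt6}{36} & \frac{1}{9}
\end{array}$$

Due to its structure, the computations can be arranged in a quite efficient way, and RADAU5 is probably the most powerful IRK method available today for solving stiff problems.

For IRK methods, the maximum order when using $s$ stages is $p = 2s$. This is achieved by the Gauss–Legendre methods. They are also useful for stiff problems. Due to symmetry their stability regions coincide with $\mathbb{C}^-$; thus the methods are A-stable. However, the computations associated with these methods are more complicated than for RADAU5. The latter method also has better damping when $\Delta t\,\lambda \to -\infty$; this is often an advantage in practice. The Gauss–Legendre method of order six has the Butcher tableau

$$\begin{array}{c|ccc}
\frac12 - \frac{\sqrt{15}}{10} & \frac{5}{36} & \frac29 - \frac{\sqrt{15}}{15} & \frac{5}{36} - \frac{\sqrt{15}}{30} \\
\frac12 & \frac{5}{36} + \frac{\sqrt{15}}{24} & \frac29 & \frac{5}{36} - \frac{\sqrt{15}}{24} \\
\frac12 + \frac{\sqrt{15}}{10} & \frac{5}{36} + \frac{\sqrt{15}}{30} & \frac29 + \frac{\sqrt{15}}{15} & \frac{5}{36} \\
\hline
 & \frac{5}{18} & \frac49 & \frac{5}{18}
\end{array}$$

The methods discussed so far have a full $A$ matrix. However, one can achieve sufficiently improved stability properties already when the matrix $A$ is lower triangular, with a nonzero diagonal. Such methods are referred to as DIRK methods (diagonally implicit Runge–Kutta methods). A further restriction is to demand that the diagonal elements are all equal. Such methods are known as SDIRK methods (singly diagonally implicit RK), and have the Butcher tableau (in the 2-stage case)

$$\begin{array}{c|cc}
\gamma & \gamma & 0 \\
c_2 & a_{21} & \gamma \\
\hline
 & b_1 & b_2
\end{array}$$

This corresponds to the equations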

$$\begin{aligned}
Y_1' &= f(y_n + \gamma\Delta t\,Y_1') \\
Y_2' &= f(y_n + a_{21}\Delta t\,Y_1' + \gamma\Delta t\,Y_2') \\
y_{n+1} &= y_n + \Delta t\,(b_1 Y_1' + b_2 Y_2').
\end{aligned}$$

The first two equations form a system, but the equations are decoupled. Thus, they can be rewritten

$$\begin{aligned}
(I - \gamma\Delta t\,f)(y_n + \gamma\Delta t\,Y_1') &= y_n \\
(I - \gamma\Delta t\,f)(y_n + a_{21}\Delta t\,Y_1' + \gamma\Delta t\,Y_2') &= y_n + a_{21}\Delta t\,Y_1',
\end{aligned}$$

and we see that they will share the same Jacobian matrix, $I - \gamma\Delta t\,f_y$. After the first equation has been solved, the right-hand side of the second equation can be computed, and the second equation solved. Thus we now have two separate, sequential systems to solve, and this requires less work than simultaneously solving two coupled equations. This substantially reduces the computational effort and complexity compared to the more advanced methods.

Let us now turn to some of the most elementary IRK methods. The implicit Euler method is a one-stage method with Butcher tableau

$$\begin{array}{c|c}
1 & 1 \\
\hline
 & 1
\end{array}$$

and equations

$$\begin{aligned}
Y_1' &= f(y_n + \Delta t\,Y_1') \\
y_{n+1} &= y_n + \Delta t\,Y_1'.
\end{aligned}$$

Here we make the important observation that $y_{n+1} = Y_1$, which means that we can rewrite the first equation above as $Y_1' = f(y_{n+1})$ with updating formula

$$y_{n+1} = y_n + \Delta t\,f(y_{n+1}).$$

This is recognized as the implicit Euler method. The implicit midpoint method is also a one-stage method, with the Butcher tableau

$$\begin{array}{c|c}
1/2 & 1/2 \\
\hline
 & 1
\end{array}$$

together with the equations

$$\begin{aligned}
Y_1' &= f(y_n + \Delta t\,Y_1'/2) \\
y_{n+1} &= y_n + \Delta t\,Y_1'.
\end{aligned}$$

The first equation now needs a minor transformation,

$$y_n + \Delta t\,Y_1'/2 = y_n + \frac{\Delta t}{2}\,f(y_n + \Delta t\,Y_1'/2),$$

implying that $y_n + \Delta t\,Y_1'/2$ is the solution to the equation

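$$\Bigl(I - \frac{\Delta t}{2} f\Bigr)(y_n + \Delta t\,Y_1'/2) = y_n.$$

Hence $y_n + \Delta t\,Y_1'/2 = \bigl(I - \frac{\Delta t}{2} f\bigr)^{-1}(y_n)$. Therefore the updating formula becomes

$$\begin{aligned}
y_{n+1} &= 2\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) - y_n \\
&= 2\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) - \Bigl(I - \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) \\
&= \Bigl(2I - \Bigl(I - \frac{\Delta t}{2} f\Bigr)\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n) \\
&= \Bigl(I + \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n).
\end{aligned}$$

Hence the implicit midpoint method can be expressed as

$$y_{n+1} = \Bigl(I + \frac{\Delta t}{2} f\Bigr)\Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}(y_n). \tag{9.7}$$

Let us now compare to the trapezoidal rule. This is a two-stage, one-step IRK method of order $p = 2$, whose Butcher tableau is

$$\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1/2 & 1/2 \\
\hline
 & 1/2 & 1/2
\end{array}$$

This corresponds to the equations

$$\begin{aligned}
Y_1' &= f(y_n) \\
Y_2' &= f(y_n + \Delta t\,Y_1'/2 + \Delta t\,Y_2'/2) \\
y_{n+1} &= y_n + \frac{\Delta t}{2} Y_1' + \frac{\Delta t}{2} Y_2'.
\end{aligned}$$

Here we note that $Y_2' = f(y_{n+1})$, so it immediately follows that

$$y_{n+1} = y_n + \frac{\Delta t}{2}\bigl(f(y_n) + f(y_{n+1})\bigr),$$

which is recognized as the trapezoidal rule in the form it was previously discussed. Moreover, the latter formula can be rearranged to obtain

$$\Bigl(I - \frac{\Delta t}{2} f\Bigr)(y_{n+1}) = \Bigl(I + \frac{\Delta t}{2} f\Bigr)(y_n),$$

resulting in the formula

$$y_{n+1} = \Bigl(I - \frac{\Delta t}{2} f\Bigr)^{-1}\Bigl(I + \frac{\Delta t}{2} f\Bigr)(y_n). \tag{9.8}$$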
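For a linear system $f(y) = Ay$ the operators in (9.7) and (9.8) become matrices, and both formulas can be evaluated with one linear solve per step. The following sketch assumes NumPy; the function names, the harmonic-oscillator matrix and the step size are illustrative choices, not taken from the text. In this linear case the two matrix factors are both functions of $A$ and therefore commute, so the two methods produce the same step.

```python
import numpy as np

def implicit_midpoint_linear(A, y, dt):
    """(9.7) with f(y) = A y: first solve with (I - dt/2 A), then apply (I + dt/2 A)."""
    I = np.eye(A.shape[0])
    return (I + 0.5 * dt * A) @ np.linalg.solve(I - 0.5 * dt * A, y)

def trapezoidal_linear(A, y, dt):
    """(9.8) with f(y) = A y: first apply (I + dt/2 A), then solve with (I - dt/2 A)."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - 0.5 * dt * A, (I + 0.5 * dt * A) @ y)

# Assumed example: harmonic oscillator written as a first-order system.
A_osc = np.array([[0.0, 1.0],
                  [-1.0, 0.0]])
y0 = np.array([1.0, 0.0])

if __name__ == "__main__":
    y_mid = implicit_midpoint_linear(A_osc, y0, 0.1)
    y_trap = trapezoidal_linear(A_osc, y0, 0.1)
    print(y_mid, y_trap)
    # For this skew-symmetric A the step preserves the Euclidean norm.
    print(np.linalg.norm(y_mid))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the usual numerical practice; for a nonlinear $f$ the inverse operator would instead require an iterative solver such as Newton's method.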

Thus, comparing (9.8) to (9.7) we see that the difference between the trapezoidal rule and the implicit midpoint method is that the two operators, corresponding to one half-step explicit Euler method and one half-step implicit Euler method, are commuted. In the case of a linear constant coefficient system $\dot y = Ay$ this does not matter since the factors commute, but in the nonlinear case there is a difference. This also makes a difference in terms of stability.

All of the three elementary IRK methods above are SDIRK methods. In addition, the trapezoidal rule has a first explicit stage, and is therefore sometimes referred to as an ESDIRK method. It has one more property of significance; the second stage $Y_2$ is identical to the output $y_{n+1}$, which, of course, becomes the first stage on the next step. Methods with this property are called first same as last (FSAL). The FSAL property can be used to economize the computations.

9.6 Stability of Runge–Kutta methods

To assess stability, we use the linear test equation, $\dot y = \lambda y$, with $y(0) = 1$. Let us investigate the stability of the classical RK4 method, with equations

$$\begin{aligned}
Y_1' &= f(t_n, y_n) \\
Y_2' &= f(t_n + \Delta t/2,\; y_n + \Delta t\,Y_1'/2) \\
Y_3' &= f(t_n + \Delta t/2,\; y_n + \Delta t\,Y_2'/2) \\
Y_4' &= f(t_n + \Delta t,\; y_n + \Delta t\,Y_3') \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y_1' + 2Y_2' + 2Y_3' + Y_4'\bigr).
\end{aligned}$$

Since $f(t,y) = \lambda y$, we obtain

$$\begin{aligned}
\Delta t\,Y_1' &= \Delta t\lambda\,y_n \\
\Delta t\,Y_2' &= \Delta t\lambda\,(y_n + \Delta t\,Y_1'/2) = \Delta t\lambda\,(y_n + \Delta t\lambda\,y_n/2) = \bigl(\Delta t\lambda + (\Delta t\lambda)^2/2\bigr)\,y_n \\
\Delta t\,Y_3' &= \Delta t\lambda\,(y_n + \Delta t\,Y_2'/2) = \bigl(\Delta t\lambda + (\Delta t\lambda)^2/2 + (\Delta t\lambda)^3/4\bigr)\,y_n \\
\Delta t\,Y_4' &= \Delta t\lambda\,(y_n + \Delta t\,Y_3') = \bigl(\Delta t\lambda + (\Delta t\lambda)^2 + (\Delta t\lambda)^3/2 + (\Delta t\lambda)^4/4\bigr)\,y_n.
\end{aligned}$$

We now assemble these expressions in the updating formula, to get

$$\begin{aligned}
y_{n+1} &= y_n + \frac{\Delta t}{6}\bigl(Y_1' + 2Y_2' + 2Y_3' + Y_4'\bigr) \\
&= \Bigl(1 + \Delta t\lambda + \frac{(\Delta t\lambda)^2}{2} + \frac{(\Delta t\lambda)^3}{6} + \frac{(\Delta t\lambda)^4}{24}\Bigr)\,y_n.
\end{aligned}$$

Thus, when the classical RK4 method is applied to the linear test equation, the updating formula is a polynomial of degree four, $y_{n+1} = P(\Delta t\lambda)\,y_n$, where

$$P(z) = 1 + z + \frac{z^2}{2} + \frac{z^3}{6} + \frac{z^4}{24} \approx e^z.$$

This is no coincidence; since $\dot y = \lambda y$ implies $y(t + \Delta t) = e^{\Delta t\lambda}\,y(t)$, the polynomial $P(z)$ must necessarily approximate $e^z$ accurately. The same thing will happen for any explicit RK method. Thus for every ERK method the updating formula is of the form $y_{n+1} = P(\Delta t\lambda)\,y_n$, where the polynomial $P$ is characteristic for each method. Since $|y_{n+1}| \le |P(\Delta t\lambda)|\,|y_n|$, it follows that the method is numerically stable if $|P(z)| \le 1$. For this reason, $P(z)$ is referred to as the stability function of the method. The stability region of the RK4 method is plotted in Figure 9.1.

Fig. 9.1 Stability region of the classical RK4 method. Colors indicate values $z \in \mathbb{C}$ where $|P(z)| \in [0,1]$. Dark blue areas reveal the location of the four zeros of the polynomial $P(z)$

A similar procedure can be followed for implicit RK methods, with the only difference being that the stability function is then a rational function, $R(z) = P(z)/Q(z)$, with $P$ and $Q$ polynomials such that $R(z) \approx e^z$ as $z \to 0$. An implicit RK method is A-stable if

$$\operatorname{Re} z \le 0 \;\Rightarrow\; |R(z)| \le 1.$$

Since $R(z)$ must be bounded in all of the left half plane, it follows that $\deg Q \ge \deg P$. Obviously, no explicit method can be A-stable, since for every nonconstant polynomial, $|P(z)| \to \infty$ as $z \to \infty$.
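The relation between an RK4 step on the test equation and the polynomial $P(z)$ is easy to check numerically. The sketch below is illustrative only: the function names, the sample value of $\lambda$ and the sample points are assumptions. It also evaluates the rational function $R(z) = (1 + z/2)/(1 - z/2)$, which is what (9.7) and (9.8) reduce to when $f(y) = \lambda y$.

```python
def rk4_step(f, t, y, dt):
    """One step of the classical RK4 method."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, y + dt * k1 / 2)
    k3 = f(t + dt / 2, y + dt * k2 / 2)
    k4 = f(t + dt, y + dt * k3)
    return y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def stability_polynomial_rk4(z):
    """P(z) = 1 + z + z^2/2 + z^3/6 + z^4/24."""
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24

def stability_function_trapezoidal(z):
    """R(z) for the trapezoidal rule and the implicit midpoint method."""
    return (1 + z / 2) / (1 - z / 2)

if __name__ == "__main__":
    lam, dt = -2.0 + 1.0j, 0.5           # assumed sample values
    y1 = rk4_step(lambda t, y: lam * y, 0.0, 1.0, dt)
    print(y1, stability_polynomial_rk4(dt * lam))   # the two numbers agree

    # Along the negative real axis |P(z)| exceeds 1 between these two points.
    for z in (-2.78, -2.79):
        print(z, abs(stability_polynomial_rk4(z)))

    # The rational function stays bounded by 1 far into the left half plane.
    print(abs(stability_function_trapezoidal(-1000.0 + 5.0j)))
```

Python's built-in complex numbers are enough here, so no external library is needed; a plot such as Figure 9.1 would evaluate `abs(stability_polynomial_rk4(z))` on a grid of complex `z`.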
