Numerical Methods Orals


Travis Askham
April 4, 2012

Contents

1 Floating Point Arithmetic, Conditioning, and Stability
  1.1 Previously Asked Questions
  1.2 Numerical Methods Class Problems
  1.3 Questions that Seem Reasonable
  1.4 The IEEE Standard (Double and Single)
2 Conditioning, Stability, and Accuracy
  2.1 Definitions
  2.2 A Theorem
  2.3 An Example
3 Solution of Linear Systems (Direct)
  3.1 LU Decomposition
  3.2 Cholesky's Method
4 QR and SVD Factorizations and Least Squares
  4.1 QR Decomposition by Modified Gram-Schmidt
  4.2 QR Decomposition by Householder Reflections
  4.3 SVD Overview
  4.4 Least Squares
  4.5 Best Low Rank Approximation
5 Eigenvalue Algorithms
  5.1 Why Iterative?
  5.2 Power Method
  5.3 The QR Algorithm
  5.4 Jacobi's Method by Givens Rotations
  5.5 Bisection for Tridiagonal Systems
  5.6 Methods Described in other Parts of the Notes
6 Iterative Methods for Linear Systems
  6.1 Classical Iterative Methods
  6.2 Steepest Descent
  6.3 Krylov Subspaces
  6.4 Arnoldi Iteration
  6.5 GMRES
  6.6 Lanczos Iteration
  6.7 Conjugate Gradient
  6.8 Preconditioning
7 Interpolation by Polynomials
  7.1 Existence
  7.2 Divided Differences
  7.3 Piecewise Approximations
8 Numerical Integration
  8.1 Using Equidistant Points
  8.2 Quadrature Rules
  8.3 Adaptive Methods
  8.4 Issues and Other Considerations
9 Nonlinear Equations and Newton's Method
  9.1 The Bisection Idea
  9.2 Newton's Method
  9.3 The Secant Method
10 Numerical ODE
  10.1 The Set-Up
  10.2 It's Almost Always this Idea
  10.3 A Note on Conditioning
  10.4 Richardson Extrapolation
  10.5 Modified Equation
  10.6 Splitting Schemes
  10.7 Linear Multi-Step Methods (LMM)
  10.8 Runge-Kutta
  10.9 Stiff Problems
  10.10 Stability Analysis (Especially for Stiff Problems)
11 Numerical PDE
  11.1 Finite Differences for Elliptic PDE
  11.2 Finite Differences for Equations of Evolution (Parabolic and Hyperbolic PDE)
  11.3 Stability, etc
  11.4 Finite Element Methods for Elliptic PDE
12 Fast Algorithms in Potential Theory and Linear Algebra
  12.1 Linear Elliptic PDE and the FMM: A Paradigm
  12.2 Fast Multipole Methods
  12.3 Hierarchical Matrix Compression
  12.4 The Fast Fourier Transform
References

1 Floating Point Arithmetic, Conditioning, and Stability

1.1 Previously Asked Questions

- Why are normal equations bad for solving least squares problems? (Ans: condition number squared.) What would you use instead?
- What does the condition number of a matrix tell you? (Ans: I talked about error bounds for solving systems of equations and convergence rates of iterative methods.)
- What does the error bound tell you in the case that the solution is zero? (It turns out that this is something of a trick question.)

1.2 Numerical Methods Class Problems

- Find three double precision IEEE floating point numbers a, b, and c for which the relative error of a + b + c is very large. Try to make it as bad as possible and explain your reasoning.
- What is the smallest positive integer which is not exactly represented as a single precision IEEE floating point number?
- What is the largest finite integer which is part of the double precision IEEE floating point system?
- Find the IEEE single and double precision floating point representations of the numbers 4, 100, 1/100, $2^{100}$, $2^{200}$, and [...].

1.3 Questions that Seem Reasonable

- What is the approximate value of machine epsilon in the IEEE double-format floating-point standard?

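1.4 The IEEE Standard (Double and Single)

A double-format number is stored as a sign bit, 11 exponent bits, and 52 fraction bits:

$$\pm \;\big|\; a_1 a_2 \ldots a_{11} \;\big|\; b_1 b_2 \ldots b_{52}$$

The value represented can be read off the following table.

$$
\begin{array}{ll}
\text{If exponent string is:} & \text{Then the value is:} \\
\hline
(00000000000)_2 = (0)_{10} & \pm(0.b_1 b_2 \ldots b_{52})_2 \times 2^{-1022} \\
(00000000001)_2 = (1)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{-1022} \\
(00000000010)_2 = (2)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{-1021} \\
\vdots & \vdots \\
(01111111111)_2 = (1023)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{0} \\
(10000000000)_2 = (1024)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{1} \\
\vdots & \vdots \\
(11111111101)_2 = (2045)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{1022} \\
(11111111110)_2 = (2046)_{10} & \pm(1.b_1 b_2 \ldots b_{52})_2 \times 2^{1023} \\
(11111111111)_2 = (2047)_{10} & \pm\infty \text{ if } b_i = 0 \ \forall i, \text{ NaN otherwise}
\end{array}
$$

A single-format number is stored as a sign bit, 8 exponent bits, and 23 fraction bits:

$$\pm \;\big|\; a_1 a_2 \ldots a_{8} \;\big|\; b_1 b_2 \ldots b_{23}$$

The value represented can be read off the following table.

$$
\begin{array}{ll}
\text{If exponent string is:} & \text{Then the value is:} \\
\hline
(00000000)_2 = (0)_{10} & \pm(0.b_1 b_2 \ldots b_{23})_2 \times 2^{-126} \\
(00000001)_2 = (1)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{-126} \\
(00000010)_2 = (2)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{-125} \\
\vdots & \vdots \\
(01111111)_2 = (127)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{0} \\
(10000000)_2 = (128)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{1} \\
\vdots & \vdots \\
(11111101)_2 = (253)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{126} \\
(11111110)_2 = (254)_{10} & \pm(1.b_1 b_2 \ldots b_{23})_2 \times 2^{127} \\
(11111111)_2 = (255)_{10} & \pm\infty \text{ if } b_i = 0 \ \forall i, \text{ NaN otherwise}
\end{array}
$$

Note that for either format, the numbers with an all-zero exponent string are interpreted differently. The leading zero allows these subnormal numbers to take very small values, but at the cost of significant digits.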
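The double-format table can be checked against an actual machine number. The following is a minimal sketch, assuming Python floats are IEEE doubles (true on essentially every current platform); the helper names `decode_double` and `value_from_fields` are made up for this illustration.

```python
import struct

def decode_double(x):
    """Split a Python float (IEEE double) into sign, exponent, fraction fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 sign bit
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits a_1 ... a_11
    fraction = bits & ((1 << 52) - 1)     # 52 fraction bits b_1 ... b_52
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction):
    """Rebuild the value from the rows of the table (finite cases only)."""
    if exponent == 2047:
        raise ValueError("infinity or NaN")
    if exponent == 0:                     # subnormal row: leading digit 0
        magnitude = (fraction / 2**52) * 2.0**-1022
    else:                                 # normal rows: implicit leading 1
        magnitude = (1 + fraction / 2**52) * 2.0**(exponent - 1023)
    return -magnitude if sign else magnitude

print(decode_double(1.0))    # exponent field 1023 corresponds to 2**0
print(decode_double(0.1))    # 0.1 is not exactly representable
```

Round-tripping a value through both helpers is a quick way to convince yourself that the exponent bias is 1023 and that the all-zero exponent row really drops the implicit leading 1.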

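The value of machine epsilon is determined by the smallest significant digit that can be stored. The values are

- Double-format: $\epsilon = 2^{-52} \approx 2.2 \times 10^{-16}$
- Single-format: $\epsilon = 2^{-23} \approx 1.2 \times 10^{-7}$

With IEEE correctly rounded arithmetic, the results of arithmetic operations have nice error bounds and behavior. Let round be the function that maps a number to its nearest floating point number, and let a circled operation denote the floating point version of that operation. Assume x, y are floating point numbers. Then

$$
\begin{aligned}
x \oplus y &= \mathrm{round}(x + y) = (x + y)(1 + \delta) \\
x \ominus y &= \mathrm{round}(x - y) = (x - y)(1 + \delta) \\
x \otimes y &= \mathrm{round}(x \times y) = (x \times y)(1 + \delta) \\
x \oslash y &= \mathrm{round}(x / y) = (x / y)(1 + \delta)
\end{aligned}
$$

where each $\delta$ satisfies $|\delta| \le \epsilon$ (barring overflow and underflow). Also,

$$x \otimes 1 = x, \qquad x \ominus y = 0 \iff x = y.$$

However, it's not all good. Indeed, it is possible that

$$(x \oplus y) \ominus z \ne \mathrm{round}(x + y - z)$$

for x, y, and z floating-point numbers. Consider $x = z = 1$ and $y = 2^{-25}$ in single-format: $x \oplus y$ rounds to 1, so the left side is 0, while the right side is $2^{-25}$.

2 Conditioning, Stability, and Accuracy

2.1 Definitions

2.1.1 Absolute Condition Number

For a problem $f : X \to Y$,

$$\hat\kappa = \lim_{\delta \to 0} \; \sup_{\|\delta x\| \le \delta} \frac{\|\delta f\|}{\|\delta x\|}$$

where $\delta f = f(x + \delta x) - f(x)$. If f is differentiable,

$$\hat\kappa = \|J(x)\|$$

where the norm on the Jacobian is the induced norm from the norms on X and Y.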
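Returning to the rounding example of Section 1.4, the same failure can be reproduced in double format. This is a small sketch, assuming Python floats are IEEE doubles; the choice $y = 2^{-60}$ is an adaptation of the single-format example (which uses $2^{-25}$), and `Fraction` is used only to get the exactly rounded reference value.

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon      # 2**-52 for IEEE doubles

# 1 + eps is the next double after 1; 1 + eps/2 is a tie and rounds back to 1.
next_after_one = 1.0 + eps
rounds_back = 1.0 + eps / 2

# Double-format analogue of x = z = 1, y = 2**-25 in single format.
x = z = 1.0
y = 2.0**-60
left = (x + y) - z                # y is lost when x + y is rounded

# round(x + y - z): do the arithmetic exactly, then round once.
exact = float(Fraction(x) + Fraction(y) - Fraction(z))

print(left, exact)
```

Each individual operation is correctly rounded, yet the two-step result has relative error 1 with respect to the exactly rounded answer.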

2.1.2 Relative Condition Number

For a problem $f : X \to Y$,

$$\kappa = \lim_{\delta \to 0} \; \sup_{\|\delta x\| \le \delta} \frac{\|\delta f\| / \|f(x)\|}{\|\delta x\| / \|x\|}$$

For f differentiable, this gives

$$\kappa = \frac{\|x\| \, \|J(x)\|}{\|f(x)\|}$$

2.1.3 Condition Number of a Matrix

For an invertible matrix A,

$$\kappa(A) = \|A\| \, \|A^{-1}\|$$

If we are using the 2-norm, we get

$$\kappa(A) = \frac{\sigma_1}{\sigma_m}$$

Note: the condition numbers of the following problems are bounded by $\kappa(A)$: matrix-vector multiplication and solving a system $Ax = b$.

2.1.4 Algorithms

Problems can be viewed as maps $f : X \to Y$ and algorithms can be viewed as another map $\tilde f : X \to Y$. Let $c(x)$ mean the computer representation of x. Then, in particular, we have

$$x \;\to\; c(x) \;\to\; \text{program} \;\to\; \tilde f(x)$$

2.1.5 Accuracy

If we have that

$$\frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|} = O(\epsilon_{\text{mach}})$$

then the algorithm $\tilde f$ is accurate. We note that for an ill-conditioned problem, an accurate algorithm in the above sense is unlikely. For instance, rounding the input alone could lead to poor accuracy in the case of an ill-conditioned problem.
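To make the matrix condition number concrete, here is a small sketch for a nearly singular $2 \times 2$ system. The infinity-norm is used instead of the 2-norm purely because it is easy to compute by hand (largest absolute row sum); the matrix, the right-hand sides, and all helper names are assumptions chosen for illustration.

```python
def inf_norm(M):
    """Induced infinity-norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(t) for t in row) for row in M)

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def solve_2x2(M, rhs):
    inv = inverse_2x2(M)
    return [inv[0][0] * rhs[0] + inv[0][1] * rhs[1],
            inv[1][0] * rhs[0] + inv[1][1] * rhs[1]]

A = [[1.0, 1.0], [1.0, 1.0001]]
kappa = inf_norm(A) * inf_norm(inverse_2x2(A))    # roughly 4e4

b1 = [2.0, 2.0001]            # solution is close to (1, 1)
b2 = [2.0, 2.0002]            # solution is close to (0, 2)
x1, x2 = solve_2x2(A, b1), solve_2x2(A, b2)

rel_b = max(abs(p - q) for p, q in zip(b1, b2)) / max(abs(t) for t in b1)
rel_x = max(abs(p - q) for p, q in zip(x1, x2)) / max(abs(t) for t in x1)
amplification = rel_x / rel_b  # bounded above by kappa

print(kappa, amplification)
```

A relative change of about $5 \times 10^{-5}$ in b moves the solution by 100%, an amplification of about $2 \times 10^4$, which sits below the bound $\kappa(A) \approx 4 \times 10^4$ as the note on solving $Ax = b$ predicts.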

2.1.6 Stability

An algorithm is stable if for each $x \in X$

$$\frac{\|\tilde f(x) - f(\tilde x)\|}{\|f(\tilde x)\|} = O(\epsilon_{\text{mach}})$$

for some $\tilde x$ with

$$\frac{\|\tilde x - x\|}{\|x\|} = O(\epsilon_{\text{mach}})$$

A stable algorithm gives nearly the right answer to nearly the right question.

2.1.7 Backward Stable

Often, numerical linear algebra algorithms satisfy a stronger condition:

$$\tilde f(x) = f(\tilde x)$$

for some $\tilde x$ with

$$\frac{\|\tilde x - x\|}{\|x\|} = O(\epsilon_{\text{mach}})$$

A backward stable algorithm gives exactly the right answer to nearly the right question.

2.2 A Theorem

The accuracy of a backward stable algorithm is given by

$$\frac{\|\tilde f(x) - f(x)\|}{\|f(x)\|} = O(\kappa(x)\,\epsilon_{\text{mach}})$$

Proving this theorem is straightforward.

2.3 An Example

3 Solution of Linear Systems (Direct)

3.1 LU Decomposition

3.1.1 Explanation

This is the decomposition you obtain when you perform Gaussian elimination on a matrix. You successively introduce zeros below the diagonal of your matrix, one column at a time (from left to right). The resulting decomposition is $A = LU$ for L a lower triangular (often unit diagonal) matrix and U an upper triangular matrix. The proof of the existence is easy.
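The elimination just described can be sketched in a few lines. This is a bare-bones illustration with no pivoting, so it assumes every pivot it meets is nonzero; the function names and the test matrix are choices made here, not part of the notes.

```python
def lu_no_pivot(A):
    """Return (L, U) with A = L U and L unit lower triangular (no pivoting)."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):                 # work through columns left to right
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]    # multiplier; assumes nonzero pivot
            U[i][k] = 0.0                  # the entry being eliminated
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu_no_pivot(A)
print(L)
print(U)
```

The multipliers used to zero out column k are exactly the entries stored in column k of L, which is why the factorization comes for free from the elimination.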

3.1.2 Stability

Without pivoting, the method can be quite unstable. The following is a standard example. Perform Gaussian elimination on

$$A = \begin{pmatrix} \eta & 1 \\ 1 & 1 \end{pmatrix}$$

where $0 < \eta \ll \epsilon_{\text{mach}}$ (for instance $\eta = 10^{-20}$). You get

$$A = \begin{pmatrix} \eta & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \eta^{-1} & 1 \end{pmatrix} \begin{pmatrix} \eta & 1 \\ 0 & 1 - \eta^{-1} \end{pmatrix} \approx \begin{pmatrix} 1 & 0 \\ \eta^{-1} & 1 \end{pmatrix} \begin{pmatrix} \eta & 1 \\ 0 & -\eta^{-1} \end{pmatrix} = \tilde L \tilde U$$

where the tilde indicates the floating point representation. We lose information when the 1 in $1 - \eta^{-1}$ is rounded away. Indeed

$$\tilde L \tilde U = \begin{pmatrix} 1 & 0 \\ \eta^{-1} & 1 \end{pmatrix} \begin{pmatrix} \eta & 1 \\ 0 & -\eta^{-1} \end{pmatrix} = \begin{pmatrix} \eta & 1 \\ 1 & 0 \end{pmatrix}$$

So the backward error is of order 1. When you perform the elimination with pivoting you do not have this problem. Further, we note that the matrix A is well-conditioned here.

While Gaussian elimination with partial pivoting is backward stable in practice, the worst case can be quite bad. The standard example is

$$A = \begin{pmatrix} 1 & & & & 1 \\ -1 & 1 & & & 1 \\ -1 & -1 & 1 & & 1 \\ -1 & -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & -1 & 1 \end{pmatrix}$$

(shown for $m = 5$). If you row-reduce, even with pivoting, the growth factor is $2^{m-1}$, which corresponds to a loss on the order of m bits (a friggin disaster). However, in practice, this peculiarity is rarely (read: never) seen.

3.1.3 Work

On the k-th step, you have about $(n-k)^2$ multiplications and additions. This gives $O(n^3)$ work. More precisely, it's about $\frac{2}{3} n^3$ flops. For banded matrices, the work can be significantly decreased. For instance, a tridiagonal matrix can be row-reduced in linear time.

3.1.4 Heuristics

Using the LU decomposition is fairly straightforward. If $A = LU$ then solving $Ax = b$ is equivalent to performing two triangular solves (stable and $O(n^2)$): forward substitution for $Ly = b$ and back substitution for $Ux = y$. If you have multiple right-hand sides, you only have to compute the LU decomposition once.

There is an interesting existence/uniqueness result for nonsingular matrices. A nonsingular matrix A has an LU factorization if and only if each leading principal submatrix of A is nonsingular. Further, this factorization is unique. In practice, Gaussian elimination with partial pivoting is quite stable.

3.2 Cholesky's Method

3.2.1 Explanation

A Cholesky factorization is a decomposition of the form $A = R^* R$ for R upper-triangular with $R_{jj} > 0$. Every Hermitian positive definite matrix has a unique Cholesky factorization. The existence and uniqueness can be proven by construction. The basic step of the induction is to write

$$A = \begin{pmatrix} a_{11} & w^* \\ w & K \end{pmatrix} = \begin{pmatrix} \alpha & 0 \\ w/\alpha & I \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & K - ww^*/a_{11} \end{pmatrix} \begin{pmatrix} \alpha & w^*/\alpha \\ 0 & I \end{pmatrix}$$

where $\alpha = \sqrt{a_{11}}$. Repeating this step incrementally transforms the middle matrix into the identity.

3.2.2 Stability

This algorithm is always stable. Intuitively, we have that $\|R\| = \|R^*\| = \|A\|^{1/2}$ by the SVD, so R can never grow.

3.2.3 Work

It is easy to see by the construction that there are $O((n-k)^2)$ operations at each step, giving $O(n^3)$ total work. More precisely, it is about $\frac{1}{3} n^3$ total flops. This is an improvement by a factor of 2 over Gaussian elimination.

3.2.4 Heuristics

In comparison to Gaussian elimination, the Cholesky factorization requires less work and is more stable. It is, however, limited to Hermitian positive definite matrices. Once obtained, the decomposition is used in the same manner as the LU decomposition from Gaussian elimination.
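A Cholesky factorization is short enough to write out in full. The sketch below is restricted to real symmetric positive definite matrices and computes R one row at a time, which is a rearrangement of the recursive block step above rather than a literal transcription of it; `cholesky_upper` and the test matrix are names and data chosen for this illustration.

```python
import math

def cholesky_upper(A):
    """Return upper-triangular R with A = R^T R for a real SPD matrix A."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # What remains of A[j][j] once the earlier rows of R are accounted for.
        d = A[j][j] - sum(R[k][j] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        R[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = A[j][i] - sum(R[k][j] * R[k][i] for k in range(j))
            R[j][i] = s / R[j][j]
    return R

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
R = cholesky_upper(A)
print(R)
```

Only the upper triangle of A is ever read, which is where the factor-of-2 saving over Gaussian elimination comes from. The failed square root doubles as a cheap test for positive definiteness.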

4 QR and SVD Factorizations and Least Squares

4.1 QR Decomposition by Modified Gram-Schmidt

4.1.1 Explanation

A QR decomposition allows you to write a matrix A as $A = QR$ where Q is an orthogonal matrix and R is upper-triangular. To be brief, the regular Gram-Schmidt method is exactly what you think it is. You take the columns of your matrix and then Gram-Schmidt them (storing the inner products and norms as you go). At step j

$$
\begin{aligned}
v_j &\leftarrow v_j - \langle q_1, v_j \rangle q_1 - \cdots - \langle q_{j-1}, v_j \rangle q_{j-1} \\
q_j &= v_j / \|v_j\|
\end{aligned}
$$

That is, you update the current vector using the previous vectors. The modified Gram-Schmidt method, instead, normalizes the current vector and then uses it to modify the future vectors. This process is mathematically equivalent; however, it has numerical implications. At step j

$$
\begin{aligned}
q_j &= v_j / \|v_j\| \\
v_{j+1} &\leftarrow v_{j+1} - \langle q_j, v_{j+1} \rangle q_j \\
&\;\;\vdots \\
v_n &\leftarrow v_n - \langle q_j, v_n \rangle q_j
\end{aligned}
$$

Thus, at each step the imperfections of the previous steps are again orthogonalized with respect to $q_j$. To see this, a fixed vector $v_j$ undergoes:

$$v_j \;\to\; v_j - \langle q_1, v_j \rangle q_1 \;\to\; \big(v_j - \langle q_1, v_j \rangle q_1\big) - \big\langle q_2,\, v_j - \langle q_1, v_j \rangle q_1 \big\rangle q_2 \;\to\; \cdots$$

This is the key difference.

4.1.2 Work

For an $m \times n$ matrix, at step j you perform $n - j$ inner products, each with $O(m)$ flops, and $n - j$ vector subtractions, again about $O(m)$ each. Thus, summing over j you get $O(mn^2)$ total work. More precisely, you have $2mn^2$ flops.

4.1.3 Heuristics

Gram-Schmidt is a nice method because you obtain the orthonormal vectors sequentially. If you perform the algorithm as described, you actually get a reduced QR factorization, i.e.

$$A = \hat Q \hat R$$

where $\hat Q$ is an $m \times n$ matrix with orthonormal columns and $\hat R$ is an $n \times n$ upper-triangular matrix. You can obtain a full QR decomposition by extending the columns of $\hat Q$ to a full basis and padding $\hat R$ with zeros. However, I'm not sure how (or if) this is done in practice. Perhaps by taking random vectors and orthonormalizing them? A way to summarize the process of Gram-Schmidt is triangular orthogonalization.

4.2 QR Decomposition by Householder Reflections

4.2.1 Explanation

The idea of using Householder reflections to obtain the QR factorization is to successively introduce zeros below the diagonal of each column by means of an orthogonal transformation. It is easy to see how the first step of the process can be repeated to get the full decomposition. Let $x = (x_1, \ldots, x_m)^T$ be the first column of your matrix A. We would like to reflect this vector so that it is zero in every entry below the first. Let the matrix of this reflection be $Q_1$. Because reflections are unitary, we know that $Q_1 x = \pm \|x\| e_1$. We see that $v = \pm \|x\| e_1 + x$ is a suitable vector to reflect across (easy to see if you draw it). Because subtracting numbers of similar sizes is unstable, it is best to choose $v = \mathrm{sign}(x_1) \|x\| e_1 + x$. Finally, $Q_1$ is then given by

$$F = I - 2 \frac{v v^*}{v^* v}$$

Repeating this process on the remaining columns, we get

$$Q_n \cdots Q_1 A = R.$$

So setting

$$Q = Q_1 \cdots Q_n$$

we have a QR factorization.

4.2.2 Stability

QR decomposition by Householder reflections is stable because multiplication by orthogonal matrices is stable.

4.2.3 Work

Let A be $m \times n$. At step k in the algorithm, you compute the reflection vector v, which is about $O(m - k)$ flops. Then you must apply the matrix F to the submatrix $A_{k:m,k:n}$. This is about $O((m-k)(n-k)) = O(mn - (m+n)k + k^2)$ work. Summing in k we get about $O(mn^2)$. A more careful analysis gives that it is in fact $2mn^2 - 2n^3/3$ flops (a slight improvement over Gram-Schmidt).
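The Householder steps above can be sketched as follows, for real matrices with $m \ge n$. The reflection vectors are normalized to unit length so that $F = I - 2vv^T$; the function name, the return convention, and the test matrix are assumptions made for this sketch.

```python
import math

def householder_qr(A):
    """Return (V, R): unit reflection vectors and the triangular factor R.

    A is an m-by-n list of rows with m >= n and real entries.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    V = []
    for k in range(n):
        x = [R[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        v = x[:]
        # v = sign(x_1) ||x|| e_1 + x; this sign choice avoids cancellation.
        v[0] += math.copysign(norm_x, x[0])
        norm_v = math.sqrt(sum(t * t for t in v))
        if norm_v > 0.0:
            v = [t / norm_v for t in v]
        V.append(v)
        # Apply F = I - 2 v v^T to the trailing submatrix R[k:m, k:n].
        for j in range(k, n):
            dot = sum(v[i] * R[k + i][j] for i in range(m - k))
            for i in range(m - k):
                R[k + i][j] -= 2.0 * dot * v[i]
    return V, R

A = [[3.0, 1.0],
     [4.0, 2.0],
     [0.0, 2.0]]
V, R = householder_qr(A)
print(R)
```

For this matrix the first column $(3, 4, 0)^T$ is reflected to $-5 e_1$: the sign rule sends x to the multiple of $e_1$ on the far side, so the diagonal of R can be negative. Q is never formed; only the vectors in `V` are kept.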

4.2.4 Heuristics

We note that this method always gives a full QR factorization, in contrast with Gram-Schmidt. Further, if you stop the algorithm before the final step, you do not have any of the vectors in your basis, so you don't gain them sequentially like you do in Gram-Schmidt. The algorithm should really only store the reflection vectors and R. Forming the explicit Q from these vectors involves a decent amount of work. Further, evaluating $Q^* b$ or $Qx$ using the explicit Q costs $O(mn^2)$ work to form Q in the first place, whereas you can multiply by the $Q_i$ or $Q_i^*$ in the correct order to get these results in $O(mn)$ work. We also note that $Q_i^* = Q_i$. As an example, $Q^* b$ can be calculated (with each $v_k$ normalized to unit length) by

$$\text{for } k = 1 \text{ to } n: \qquad b_{k:m} \leftarrow b_{k:m} - 2 v_k (v_k^* b_{k:m})$$

4.3 SVD Overview

4.3.1 Explanation

Every matrix $A \in \mathbb{C}^{m \times n}$ has an SVD (singular value decomposition), $A = U \Sigma V^*$. It consists of two unitary matrices $U \in \mathbb{C}^{m \times m}$ and $V \in \mathbb{C}^{n \times n}$ as well as a diagonal matrix $\Sigma \in \mathbb{C}^{m \times n}$. The singular values are those found on the diagonal of $\Sigma$. They are non-negative and decrease from top-left to bottom-right. The singular values are uniquely determined and, if they are distinct, then the right and left singular vectors corresponding to them are unique up to complex signs. We note that if $A = U \Sigma V^*$, we have

$$AV = U\Sigma$$

So $A v_i = \sigma_i u_i$. This leads to the geometric interpretation of the SVD. The image of the unit sphere under any matrix is a hyper-ellipse. The right singular vectors $v_i$ are the pre-images of the ellipse's axes found on the unit sphere, the $\sigma_i$ the lengths of those axes, and the left singular vectors $u_i$ the directions of those axes. The SVD has many applications, in particular to least squares problems as outlined in the next subsection.

4.3.2 Stability

We omit many of the details, but the process described in the sketch of the computation found below is stable.

4.3.3 Work

The reduction to bidiagonal form requires $O(mn^2)$ work for an $m \times n$ matrix. The iterative step takes about $O(n)$ iterations and each requires about $O(n)$ work, resulting in $O(n^2)$ total work. Thus, the reduction to bidiagonal form is often the more expensive step.

4.3.4 Computation Sketch

The matrix is reduced to bidiagonal form by applying Householder reflections on both right and left. This process can be sped up by first computing a QR factorization for part of the matrix (choosing when to perform this step to get the least amount of work is subtle). A variant of the QR algorithm (described in the eigenvalues section) is then used to find the SVD of the bidiagonal matrix. Really, you should never code up an SVD algorithm yourself, as the industry standard will likely destroy anything you come up with (the same philosophy applies for many numerical linear algebra algorithms).

4.4 Least Squares

Least squares refers to minimizing some quantity with respect to the 2-norm.

4.4.1 Overdetermined System

A common least squares problem is to show that there is an optimal approximation to an overdetermined system of linear equations. As a numerical problem, you would compute the minimizer. The problem can be stated as follows. Let $A \in \mathbb{C}^{m \times n}$ for $m > n$ be full rank. Find the vector $x \in \mathbb{C}^n$ which minimizes

$$\|Ax - b\|_2$$

The existence of a QR factorization and the fact that orthogonal matrices preserve the 2-norm greatly simplify this problem.

$$\min_x \|Ax - b\| = \min_x \|QRx - b\| = \min_x \|Rx - Q^* b\|$$

As R is upper-triangular with zero rows below the n-th, we see that the entries of $Q^* b$ beyond the n-th entry will never be cancelled out. Further, as A was full rank, the triangular system corresponding to the first n entries can be solved exactly (and stably).

4.4.2 Underdetermined Systems

Another common least squares problem is to minimize the 2-norm of a vector that solves an underdetermined system. The problem can be stated as follows. Let $A \in \mathbb{C}^{m \times n}$ for $m < n$ be full rank. Find the vector $x \in \mathbb{C}^n$ which solves $Ax = b$ and minimizes the 2-norm of x. This problem is greatly simplified by considering the SVD of A. Write $A = U \Sigma V^*$. Then

$$b = Ax \iff b = U \Sigma V^* x \iff U^* b = \Sigma y$$

where $y = V^* x \in \mathbb{C}^n$ and $\|y\| = \|x\|$. We note that the first m components of y are determined by the diagonal part of $\Sigma$ and that, because A is full rank, these are sufficient to satisfy $U^* b = \Sigma y$. Setting the rest of the entries of y to zero, we obtain the vector of minimum 2-norm. Finally, $x = V y$.

4.4.3 Example of a Least Squares Problem That's Not Quite the Same

Best plane approximation. Consider $n > 3$ points $(x_i, y_i, z_i)$, $1 \le i \le n$, in three-dimensional, real Euclidean space. The points lie close to a common plane. Determine a plane which goes through $(\bar x, \bar y, \bar z)$, where $\bar x = \sum_{i=1}^n x_i / n$, etc., which provides the best least squares fit to the data. Also, show how the normal to the plane can be obtained readily by using the singular value decomposition.

Solution: We note that the equation of a plane going through $(\bar x, \bar y, \bar z)^t$ is given by

$$a(x - \bar x) + b(y - \bar y) + c(z - \bar z) = 0$$

for $\mathbf{n} = (a, b, c)^t \ne 0$. We note that the scale of $\mathbf{n}$ is irrelevant, as $\alpha \mathbf{n}$ determines the same plane for any non-zero $\alpha$. Thus, we normalize so $\|\mathbf{n}\| = 1$. We let $v_i = (x_i, y_i, z_i)^t - (\bar x, \bar y, \bar z)^t$, for $i = 1, \ldots, n$. Let A be the $n \times 3$ matrix whose i-th row is $v_i^t$. Then, the least squares fit is given by the $\mathbf{n}$ that attains the minimum

$$\min_{\|\mathbf{n}\| = 1} \|A\mathbf{n}\|$$

Let $A = U \Sigma V^*$ be the SVD of A. This gives

$$\min_{\|\mathbf{n}\| = 1} \|A\mathbf{n}\| = \min_{\|\mathbf{n}\| = 1} \|U \Sigma V^* \mathbf{n}\| = \min_{\|\mathbf{n}\| = 1} \|\Sigma V^* \mathbf{n}\|$$

Let $w = V^* \mathbf{n}$. Then $\|w\| = 1$. We note that the rows of $\Sigma$ are zero beyond the third row; let $\hat\Sigma$ denote its top $3 \times 3$ block. This gives

$$\min_{\|\mathbf{n}\| = 1} \|A\mathbf{n}\| = \min_{\|w\| = 1} \|\Sigma w\| = \min_{\|w\| = 1} \|\hat\Sigma w\|$$

Because the singular values are in decreasing magnitude down the diagonal, we see that the minimizing w is $w = (0, 0, 1)^t$. Thus, the minimizing $\mathbf{n}$ is given by $\mathbf{n} = V (0, 0, 1)^t$, i.e., the third column of V.

4.5 Best Low Rank Approximation

The SVD allows you to construct the best low-rank approximation (in the 2-norm) to a matrix A of any rank (less than that of the matrix itself). The best approximation to $A \in \mathbb{C}^{m \times n}$ of rank $k < \dim \operatorname{range} A$ is given by

$$B = \begin{pmatrix} u_1 & \cdots & u_k \end{pmatrix} \Sigma_{1:k,1:k} \begin{pmatrix} v_1^* \\ \vdots \\ v_k^* \end{pmatrix}$$

Also, you can use the SVD to see that this is indeed the best approximation.

5 Eigenvalue Algorithms

5.1 Why Iterative?

Any root finding problem can be recast as an eigenvalue problem, and vice versa. Thus, the following theorem from Galois theory provides some insight into eigenvalue problems. For any $m \ge 5$ there is a polynomial $p(x)$ of degree m with rational coefficients and a root z such that z cannot be expressed in terms of the coefficients using addition, subtraction, multiplication, division, and radicals. Thus, there cannot be a finite time algorithm (using the above operations) to find the eigenvalues of an arbitrary matrix.

Further, we note that the preferred method for calculating eigenvalues is by an eigenvalue revealing method instead of working with the characteristic polynomial. This is because certain polynomial root finding problems are extremely ill-conditioned with respect to the coefficients. The following example due to James Wilkinson,

$$w(x) = \prod_{i=1}^{20} (x - i),$$

illustrates this issue with conditioning. Expanded in terms of the monomials, we have

$$w(x) = x^{20} - 210 x^{19} + \cdots$$

If we perturb the coefficient of $x^{19}$ by $\epsilon$ then the value of $w(20)$ is changed by $20^{19} \epsilon \approx 5.2 \times 10^{24}\, \epsilon$. This is especially troubling, because the eigenvalue problem is a well-conditioned one for symmetric matrices. (This can be seen by considering Gershgorin's circle theorem; see Dahlquist, Section 5.8.) With these issues in mind, the preferred approach is to transform the given matrix into an upper-triangular (Schur decomposition, for any matrix) or diagonal (for normal and Hermitian matrices) form via orthogonal transformations.

5.2 Power Method

5.2.1 Idea

The power method is based on the idea that if you repeatedly apply a matrix to a vector, the direction of the eigenvector corresponding to the largest eigenvalue will dominate. We assume that we have a full basis $\{v_i\}$ of eigenvectors for our matrix A and that our starting vector $v = \sum_i \alpha_i v_i$ has non-zero components in the direction of each eigenvector. Then

$$A^k v = \lambda_1^k \alpha_1 v_1 + \cdots + \lambda_n^k \alpha_n v_n$$

Suppose that $\lambda_1$ is the strictly largest magnitude eigenvalue. Then, up to sign and normalization,

$$\frac{A^k v}{\|A^k v\|} \to v_1$$

5.2.2 Pitfalls

The convergence can be quite slow. The error is dominated by $(|\lambda_2| / |\lambda_1|)^k$ where $\lambda_2$ is the second greatest eigenvalue in magnitude. The method is also limited to finding the eigenvector corresponding to the largest magnitude eigenvalue.

5.2.3 Other Uses

The method is used to approximate the spectral norm of a matrix (which coincides with the induced 2-norm). In this case, you alternately apply $A$ and $A^*$ to the vector.

5.3 The QR Algorithm

5.3.1 Description

The QR algorithm is an extremely simple one to describe, though it can appear a bit mysterious. We start with $A^{(0)} = A$. Then the later steps are determined by the following

$$
\begin{aligned}
Q^{(k)} R^{(k)} &= A^{(k-1)} && \text{(this is obtained from a QR decomposition algorithm)} \\
A^{(k)} &= R^{(k)} Q^{(k)} && \text{(simply multiply these together)}
\end{aligned}
$$
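The two-line iteration above is easy to try on a small example. The sketch below runs the unshifted QR algorithm on a real symmetric $2 \times 2$ matrix, using modified Gram-Schmidt for the QR step; that choice of QR routine, the step count, and the helper names are all assumptions made to keep the example self-contained.

```python
import math

def qr_mgs(A):
    """QR factorization of a nonsingular square matrix by modified Gram-Schmidt."""
    n = len(A)
    V = [[A[i][j] for i in range(n)] for j in range(n)]   # columns of A
    Q = [None] * n
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(t * t for t in V[j]))
        Q[j] = [t / R[j][j] for t in V[j]]
        for k in range(j + 1, n):          # update the future vectors
            R[j][k] = sum(q * v for q, v in zip(Q[j], V[k]))
            V[k] = [v - R[j][k] * q for q, v in zip(Q[j], V[k])]
    Q_rows = [[Q[j][i] for j in range(n)] for i in range(n)]
    return Q_rows, R

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def qr_algorithm(A, steps=100):
    """Unshifted QR iteration: factor A^(k-1) = QR, then set A^(k) = RQ."""
    Ak = [row[:] for row in A]
    for _ in range(steps):
        Q, R = qr_mgs(Ak)
        Ak = matmul(R, Q)
    return Ak

A = [[2.0, 1.0], [1.0, 3.0]]
Ak = qr_algorithm(A)
eigs = sorted(Ak[i][i] for i in range(2))
print(Ak)
```

Each step is a similarity transform, $A^{(k)} = (Q^{(k)})^T A^{(k-1)} Q^{(k)}$, so the eigenvalues never change while the off-diagonal entries decay. Here the diagonal approaches $(5 \pm \sqrt 5)/2$.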

5.3.2 Understanding QR in Terms of Simultaneous Iteration

Simultaneous iteration can be understood as applying the power iteration method to multiple vectors and orthonormalizing the result each time. Starting from $\underline{Q}^{(0)} = I$,

$$
\begin{aligned}
Z &= A\, \underline{Q}^{(k-1)} && \text{(simply multiply these together)} \\
\underline{Q}^{(k)} R^{(k)} &= Z && \text{(again from a QR decomposition algorithm)}
\end{aligned}
$$

If we let $\underline{R}^{(k)} = R^{(k)} \cdots R^{(1)}$, then $A^k = \underline{Q}^{(k)} \underline{R}^{(k)}$ and $A^{(k)} = (\underline{Q}^{(k)})^T A\, \underline{Q}^{(k)}$, where $A^{(k)}$ is the k-th iterate of the QR Algorithm. Further, $\underline{Q}^{(k)} = Q^{(1)} \cdots Q^{(k)}$ where the $Q^{(i)}$ are those from the QR Algorithm. In this way it is clear why the QR Algorithm produces eigenvalues.

5.3.3 Convergence

By comparison with the power method, we see that we get linear convergence to the eigenvectors that depends on the ratios of the eigenvalues. The convergence of the eigenvalues can be deduced by considering the Rayleigh quotients of the eigenvector approximations. In particular, we note that for an eigenvector v of a symmetric matrix, the Rayleigh quotient $r(v) = v^T A v / v^T v$ satisfies $\nabla r(v) = 0$. This gives us that the first order terms are 0, so the error in the eigenvalue estimate is quadratic in the error of the eigenvector approximation.

5.3.4 Shifted QR and the Inverse Iteration Idea

If we have a good guess $\mu$ as to an eigenvalue, we can perform inverse iteration, in which we utilize the fact that $(A - \mu I)^{-1}$ has the same eigenvectors as A and in particular an eigenvector with eigenvalue $1/(\lambda - \mu)$. So if our guess is close, applying $(A - \mu I)^{-1}$ to a vector will strongly favor the eigenvector corresponding to the eigenvalue $\lambda$ nearest $\mu$. To utilize this idea, we can choose a shift at each step of the QR Algorithm. In particular, we can perform

$$
\begin{aligned}
Q^{(k)} R^{(k)} &= A^{(k-1)} - \mu^{(k)} I && \text{(obtained from a QR decomposition algorithm)} \\
A^{(k)} &= R^{(k)} Q^{(k)} + \mu^{(k)} I
\end{aligned}
$$

We still have that $A^{(k)} = (\underline{Q}^{(k)})^T A\, \underline{Q}^{(k)}$. Thus, to get a good eigenvalue estimate, we can consider a Rayleigh quotient for an eigenvector approximation of A. We note that

$$A^{(k)}_{mm} = q_m^{(k)T} A\, q_m^{(k)}$$

So $\mu^{(k)} = A^{(k)}_{mm}$ is a decent choice of shift. However, there can be issues with symmetry of eigenvalues.

Consider

$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

The A above has eigenvalues $\pm 1$. The shift suggested above is 0 and with that shift, A is left unchanged by the QR algorithm. Wilkinson's shift instead uses the eigenvalue of $A^{(k)}_{m-1:m,\,m-1:m}$ closest to $A^{(k)}_{mm}$ (or either one in the case of a tie) for the shift. This breaks the symmetry problem. In either case, we get cubic convergence.

5.3.5 Actual Implementation

The actual implementations of the QR algorithm normally tridiagonalize the matrix first and then perform the algorithm with some sort of shift. Further, the algorithms often deflate the matrix when eigenvalues are found, a process I will not describe here.

5.4 Jacobi's Method by Givens Rotations

5.4.1 Description

The idea of Jacobi's method is quite simple. It is to diagonalize a matrix by repeatedly annihilating off-diagonal entries. It applies only to symmetric matrices. We note that a symmetric $2 \times 2$ matrix can be diagonalized easily by a rotation. In particular, if we let

$$A = \begin{pmatrix} a & b \\ b & d \end{pmatrix}$$

then we can diagonalize using the rotation matrix

$$J = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}$$

In particular, $D = J^T A J$ is diagonal if we choose $\theta$ with $\tan(2\theta) = 2b/(d - a)$, i.e. $\theta = \tfrac{1}{2}\arctan[2b/(d-a)]$, and set $c = \cos\theta$ and $s = \sin\theta$. (The off-diagonal entry of $J^T A J$ is $cs(a - d) + b(c^2 - s^2) = \tfrac{1}{2}(a-d)\sin 2\theta + b\cos 2\theta$, which vanishes for this choice.) This process can then be used to zero out a specific off-diagonal entry of a matrix by using rotations of the form

$$
J = \begin{pmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & c & \cdots & s & & \\
& & \vdots & \ddots & \vdots & & \\
& & -s & \cdots & c & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{pmatrix}
$$

(the dots between the c's and s's are only there to show how they line up). The matrix above only affects the rows and columns that contain the c's and s's; call these rows i and j (the affected columns have the same indices). Then, we have that $(J^T A J)_{ij} = 0$. While this introduces a zero in the (i, j)-th entry, it can remove other zeros (note that this is clear; otherwise, we'd have a finite step algorithm). However, the idea with the Jacobi algorithm is that the total size of the off-diagonal elements goes down, and that if you continually sweep through the matrix and zero out elements, you will eventually converge.

5.4.2 Convergence

The convergence of the algorithm can be easily proved for the case where you annihilate the largest off-diagonal element each time. In this case the sum of the squares of the off-diagonal entries decreases by at least a factor of $1 - 2/(m^2 - m)$ at each step. The proof of this fact is below. First we note that the Frobenius norm is invariant under unitary multiplication. Thus, for J a Jacobi rotation, we have

$$\|J^T A J\|_F = \|A\|_F$$

Now, suppose $A_{j,k}$ is the largest off-diagonal element (for $k > j$). Since the largest of the $m(m-1)$ off-diagonal squares is at least their average,

$$|A_{j,k}|^2 \ge \frac{2 \sum_{i > h} |A_{h,i}|^2}{m(m-1)} = \frac{S(A)}{m(m-1)}$$

where $S(A)$ is the sum of the squares of the off-diagonal elements. The effect of the product $J^T A J$ on the entries of A is easily understood for the submatrix

$$\begin{bmatrix} A_{j,j} & A_{j,k} \\ A_{k,j} & A_{k,k} \end{bmatrix}$$

Indeed, this matrix gets replaced with

$$
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} A_{j,j} & A_{j,k} \\ A_{k,j} & A_{k,k} \end{bmatrix}
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
=
\begin{bmatrix} * & 0 \\ 0 & * \end{bmatrix}
$$

for appropriately chosen $\theta$. As the $2 \times 2$ rotation matrix in the above is unitary, it preserves the Frobenius norm of this submatrix. In particular, the sum of the squares of its diagonal entries increases by $2 |A_{j,k}|^2$. As the other diagonal entries are unaffected by the Jacobi rotation, we have

$$\sum_i (J^T A J)_{i,i}^2 \ge \sum_i A_{i,i}^2 + \frac{2 S(A)}{m(m-1)}$$

Because the Frobenius norms of $J^T A J$ and A are the same, we have then that

$$S(J^T A J) = S(A) + \sum_i A_{i,i}^2 - \sum_i (J^T A J)_{i,i}^2 \le S(A) \left( 1 - \frac{2}{m(m-1)} \right)$$

5.4.3 Some Heuristics

Rather than searching for the largest off-diagonal entry at each step (which is costly), most algorithms simply sweep through the off-diagonal entries, eliminating them sequentially. In contrast with the QR algorithm, there is no benefit to first tridiagonalizing the system. That is because the Jacobi rotations would introduce non-zeros outside of the sub-, main, and super-diagonals, destroying that structure. Finally, the process can be parallelized, as it only affects and depends on the rows and columns corresponding to the element to be zeroed out.

5.5 Bisection for Tridiagonal Systems

5.5.1 Description

Bisection is quite distinct from the other methods, as it allows you to find the eigenvalues in a specific range. It applies to symmetric tridiagonal matrices. The key fact used is that the eigenvalues of the leading principal submatrices of such matrices interlace. That is, if $\lambda^{(k)}_j$ are the strictly increasing eigenvalues of the k-th leading principal submatrix of A, then $\lambda^{(k+1)}_j < \lambda^{(k)}_j < \lambda^{(k+1)}_{j+1}$. If we have

$$
A = \begin{pmatrix}
a_1 & b_1 & & \\
b_1 & a_2 & b_2 & \\
& b_2 & a_3 & \ddots \\
& & \ddots & \ddots
\end{pmatrix}
$$

then there is a simple recursion formula for the characteristic polynomials of the leading principal submatrices. We have $p^{(-1)}(x) = 0$, $p^{(0)}(x) = 1$, and

$$p^{(k)}(x) = (a_k - x)\, p^{(k-1)}(x) - b_{k-1}^2\, p^{(k-2)}(x)$$

We say $p^{(k)}(x) \to p^{(k+1)}(x)$ has a sign change if it changes sign or goes from zero to negative or positive (but not if it goes from positive or negative to zero). If we evaluate the above sequence for a fixed value of x and count the sign changes, the total gives the number of eigenvalues in $(-\infty, x)$.

5.5.2 Heuristics

While Wilkinson's polynomial suggests that bisection is an unstable approach, it does not actually experience such difficulties. This is because the method simply evaluates the characteristic polynomial without considering its coefficients and therefore avoids the issue of ill-conditioning. Bisection is especially good for situations in which you need only a few eigenvalues. This is because it requires only $O(m)$ flops for each evaluation of the sequence $p^{(k)}(x)$.

5.6 Methods Described in other Parts of the Notes

The Arnoldi and Lanczos algorithms are related to eigenvalues and the details of these are given in the Iterative Methods for Linear Systems section.

6 Iterative Methods for Linear Systems

6.1 Classical Iterative Methods

6.1.1 Linear One-Step Stationary Schemes

We consider the case of solving $Ax = b$ via a one-step stationary scheme, that is, a scheme in which you obtain iterates by

$$x_{k+1} = H x_k + v$$

We note that this scheme converges to a bounded limit $\hat x$ for every starting vector if and only if $\rho(H) < 1$ (a straightforward calculation). Further, if $\rho(H) < 1$ then $\hat x$ is the correct solution if and only if $v = (I - H) A^{-1} b$. We note that as of now the scheme seems impractical. To get v, it seems we have to calculate $A^{-1} b$. However, there are ways to get around this.

6.1.2 Regular Splitting

One can split $A = P - N$ where P is non-singular and consider the scheme

$$P x_{k+1} = N x_k + b$$
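As a concrete instance of the splitting scheme $P x_{k+1} = N x_k + b$, the sketch below takes P to be the diagonal of A, so that each "solve" with P is a division. That choice of P, the test system, and the iteration count are assumptions made for illustration only.

```python
def splitting_iteration(A, b, x0, steps):
    """Iterate P x_{k+1} = N x_k + b with P = diag(A) and N = P - A."""
    n = len(A)
    x = x0[:]
    for _ in range(steps):
        x_new = [0.0] * n
        for i in range(n):
            # (N x)_i is minus the off-diagonal part of row i applied to x.
            off = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - off) / A[i][i]    # solve with the diagonal P
        x = x_new
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]              # chosen so that the solution is (1, 1, 1)
x = splitting_iteration(A, b, [0.0, 0.0, 0.0], 60)
print(x)
```

In the notation of the one-step scheme, $H = P^{-1} N$ and $v = P^{-1} b$, so v is available without ever computing $A^{-1} b$. For this matrix $\rho(H) = \sqrt{2}/4 \approx 0.35$, so the error shrinks by roughly that factor per step.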

We generally choose a P which is simpler to invert (e.g. has an easily computed LU factorization). A sufficient condition for convergence of this method is that both the matrices A and $P^T + P - A$ are symmetric positive definite.

6.1.3 Jacobi, Gauss-Seidel, SOR

Jacobi iteration is a regular splitting scheme in which $P = D$, the diagonal of A, and $N = -(A - D)$. Gauss-Seidel iteration is a regular splitting scheme in which $P = D + L$, the diagonal and strictly lower triangular part of A, and $N = -(A - D - L) = -U$, the negative of the strictly upper triangular part of A. Successive over-relaxation (SOR) is a regular splitting scheme in which $P = \frac{1}{\omega} D + L$ and $N = -(A - \frac{1}{\omega} D - L) = (\frac{1}{\omega} - 1) D - U$ for some $\omega \in [1, 2)$. We note that Gauss-Seidel is SOR with $\omega = 1$.

6.1.4 As Linear Equations

Each of the above methods can be written as a system of linear equations. For $l = 1, \ldots, n$ we have for Jacobi iteration:

$$\sum_{j=1}^{l-1} a_{lj} x_j^{k} + a_{ll} x_l^{k+1} + \sum_{j=l+1}^{n} a_{lj} x_j^{k} = b_l$$

For Gauss-Seidel:

$$\sum_{j=1}^{l-1} a_{lj} x_j^{k+1} + a_{ll} x_l^{k+1} + \sum_{j=l+1}^{n} a_{lj} x_j^{k} = b_l$$

For SOR:

$$\omega \sum_{j=1}^{l-1} a_{lj} x_j^{k+1} + a_{ll} x_l^{k+1} + (\omega - 1) a_{ll} x_l^{k} + \omega \sum_{j=l+1}^{n} a_{lj} x_j^{k} = \omega b_l$$

These equations highlight the differences between the methods. Gauss-Seidel uses all of the new entries of $x^{k+1}$ that have been computed up to that point. This leads to obvious storage savings, as the new entries can be simply overwritten into the old vector. Further, if we rearrange, we have

$$x_l^{k+1} = \frac{1}{a_{ll}} \left( b_l - \sum_{j=1}^{l-1} a_{lj} x_j^{k+1} - \sum_{j=l+1}^{n} a_{lj} x_j^{k} \right)$$

for Gauss-Seidel and

$$x_l^{k+1} = (1 - \omega) x_l^{k} + \frac{\omega}{a_{ll}} \left( b_l - \sum_{j=1}^{l-1} a_{lj} x_j^{k+1} - \sum_{j=l+1}^{n} a_{lj} x_j^{k} \right)$$

for SOR. If we compare these we note that Gauss-Seidel exactly solves the l-th linear equation for $x_l^{k+1}$, while SOR actually moves further in the direction of the solution (past what is optimal if we are only changing $x_l^{k+1}$). The convergence of SOR is often faster than Gauss-Seidel and depends on the value of $\omega$. The optimal value of $\omega$ can be chosen, but the methods for doing so are left out of these notes.

6.1.5 A Different Viewpoint

If we have a minimization problem

$$\min_{x \in \mathbb{R}^n} f(x_1, \ldots, x_n)$$

the Gauss-Seidel minimization process would be to cycle through the coordinates of x and minimize f with respect to each coordinate in turn, holding the other coordinates constant. After each iteration (and indeed after each step of each iteration), the value of f improves or stays the same. In this way, we see why the Gauss-Seidel iteration is driven toward a local minimum (for the convex quadratic associated with a symmetric positive definite system, the unique minimizer).

6.1.6 Some Heuristics

Gauss-Seidel converges when the matrix is diagonally dominant. We note that Gauss-Seidel tends to smooth out high frequency errors (large eigenvalue errors) quickly; this is the basis for the multigrid method (see Iserles for a reference).

6.2 Steepest Descent

For the method of steepest descent, we assume A is symmetric positive definite and recast the problem $Ax = b$ as a minimization problem. Consider the quadratic form

$$f(x) = \tfrac{1}{2} x^T A x - b^T x$$

We note that if $x = A^{-1} b$, then for $y \ne x$, we have

$$f(y) = \tfrac{1}{2} y^T A y - b^T y = \tfrac{1}{2} y^T A y - y^T A x = \tfrac{1}{2} (y - x)^T A (y - x) - \tfrac{1}{2} x^T A x$$

$$= \frac{1}{2} (y - x)^T A (y - x) + f(x) > f(x)$$

Thus, the solution of $Ax = b$ is the minimizer of $f(x)$. The method of steepest descent attempts to solve this minimization problem iteratively, moving from one approximation to the next in the direction opposite the gradient of $f$ (which points in the direction of greatest increase). The gradient of $f$ is given by

$$f'(x) = \frac{1}{2} A^T x + \frac{1}{2} A x - b = Ax - b$$

Thus, if we move opposite the gradient (in the direction of "steepest descent") we move in the direction of the residual $b - Ax$. The method then calls for us to minimize $f$ in this direction. We have

$$x_{i+1} = x_i + \alpha (b - A x_i) = x_i + \alpha r_i$$

Thus, we have

$$f(x_{i+1}) = \frac{1}{2} (x_i + \alpha r_i)^T A (x_i + \alpha r_i) - b^T (x_i + \alpha r_i) = f(x_i) + \alpha x_i^T A r_i + \frac{1}{2} \alpha^2 r_i^T A r_i - \alpha b^T r_i$$

$$\frac{d}{d\alpha} f(x_{i+1}) = x_i^T A r_i + \alpha r_i^T A r_i - b^T r_i$$

$$0 = -r_i^T r_i + \alpha r_i^T A r_i$$

$$\alpha = \frac{r_i^T r_i}{r_i^T A r_i}$$

The above choice for $\alpha$ gives the minimizer. We note that we can use the chain rule to get

$$0 = \frac{d}{d\alpha} f(x_{i+1}) = f'(x_{i+1})^T \frac{d}{d\alpha} x_{i+1} = f'(x_{i+1})^T r_i$$

so we see that the new residual $r_{i+1} = -f'(x_{i+1})$ is orthogonal to the previous residual $r_i$. This fact limits the method of steepest descent. In particular, in 2 dimensions, we will alternate between two search directions. So if $f$ has elongated elliptical cross sections, the method of steepest descent will generally zig-zag towards the solution, which could take many iterations to converge. The algorithm goes as follows

$$r_i = b - A x_i$$

$$\alpha_i = \frac{r_i^T r_i}{r_i^T A r_i}$$

$$x_{i+1} = x_i + \alpha_i r_i$$

We note that after the initial iteration, the matrix-vector multiplication in the first step can be avoided because we have

$$x_{i+1} = x_i + \alpha_i r_i$$

$$r_{i+1} = r_i - \alpha_i A r_i$$

and we have already computed $A r_i$. We point out that updating in this fashion can cause a build up of round-off error over time. One possible solution is to occasionally recompute $r_i$ from $x_i$. At any rate, the work of the steepest descent method is dominated by the matrix-vector multiplication $A r_i$ in the second step.

The error analysis of this method can be found in Shewchuk's CG paper (see references). The basic idea is to write the initial error in terms of the eigenvectors of $A$ and then see what the method does to the error over time in terms of the norm $\|e\|_A^2 = e^T A e$. The convergence goes like

$$\|e_i\|_A \le \Big( \frac{\kappa - 1}{\kappa + 1} \Big)^i \|e_0\|_A$$

where $\kappa$ is the condition number of $A$. An interesting feature of the method of steepest descent and CG is that from one step to the next, the error in a certain eigenvector direction may actually increase. This is in contrast with methods such as Gauss-Seidel iteration, which decrease each component (and are called smoothers). This is why steepest descent and CG are sometimes referred to as roughers.

6.3 Krylov Subspaces

If we are working with a matrix $A$, then the Krylov subspaces generated by a vector $b$ are given by $K_n = \operatorname{span}\{b, Ab, \ldots, A^{n-1} b\}$. The Arnoldi and Lanczos iteration methods come up with increasing orthonormal bases $Q_n$ for $K_n$ which satisfy

$$A Q_n = Q_{n+1} H_n$$

$$A Q_n = Q_{n+1} T_n$$

where $H_n$ is upper-Hessenberg for Arnoldi and $T_n$ is tridiagonal for Lanczos. Two of their applications are to finding eigenvalues and iteratively solving the system $Ax = b$. The following statements are much more heuristic than they are precise.

Finding eigenvalues. Let $P^n$ be the space of monic polynomials of degree $n$. Then $P^n(A) b \subset K_{n+1}$. Let $p^n \in P^n$ be the polynomial minimizing $\|p^n(A) b\|$. We note that if $p^n$ has zeros near the eigenvalues of $A$, then this value will likely be small. To see this, consider the case of a diagonal matrix $A \in \mathbb{C}^{m \times m}$ which has only $n \ll m$ distinct eigenvalues. Then the minimal polynomial $p^*$ of $A$ gives $p^*(A) \equiv 0$, so $p^n = p^*$ would have zeros given by the eigenvalues of $A$. Thus, the minimizing polynomial $p^n \in P^n$ has zeros which approximate the eigenvalues of $A$. Interestingly, this minimizing polynomial is given by the characteristic polynomial of (the square part of) $H_n$ or $T_n$. Then, to approximate the eigenvalues of $A$ one can run the Arnoldi or Lanczos algorithm for $n$ steps and then send $H_n$ or $T_n$ to an eigenvalue algorithm. These approximations (sometimes called Ritz values) tend to find eigenvalues located at the extremes of the spectrum of $A$, and when the method finds them it converges linearly (geometrically).

Solving $Ax = b$. Krylov subspaces are at the heart of the GMRES and Conjugate Gradient algorithms described below.

6.4 Arnoldi Iteration

Let $A$ be a general square matrix. As mentioned above, the Arnoldi iteration finds increasing orthonormal bases for $K_n$ which satisfy

$$A Q_n = Q_{n+1} H_n$$

for $H_n \in \mathbb{C}^{(n+1) \times n}$ upper-Hessenberg. This can be viewed in a different light. Suppose we wished to find a similarity transformation giving $A = Q H Q^*$ for $H$ upper-Hessenberg. If we used a process like Gram-Schmidt to find the vectors of $Q$ successively, we would end up with the formula for $Q_n$ and $H_n$ above, where $Q_n$ is the first $n$ columns of $Q$ and $H_n$ is the upper-left $(n + 1) \times n$ block of $H$. This Gram-Schmidt-like process is the Arnoldi iteration. Using the relation for $Q_n$ and $H_n$ above, we see that

$$A q_n = h_{1,n} q_1 + \cdots + h_{n,n} q_n + h_{n+1,n} q_{n+1}$$

Many of the coefficients $h_{j,n}$ can be found by taking the inner-product of both sides with $q_j$.
For $j \le n$, we have

$$h_{j,n} = \langle q_j, A q_n \rangle$$

Then, $q_{n+1}$ is determined and given by the normalization of

$$w_{n+1} = A q_n - (h_{1,n} q_1 + \cdots + h_{n,n} q_n)$$

and $h_{n+1,n} = \|w_{n+1}\|$. In pseudocode, we have

    b = arbitrary, q_1 = b/||b||
    for n = 1,2,...
        v = A q_n
        for j = 1 to n
            h_{jn} = <q_j, v>
            v = v - h_{jn} q_j
        end
        h_{n+1,n} = ||v||
        q_{n+1} = v/||v||
    end

We note that the above is a variant of modified Gram-Schmidt. Another popular way to make the process stable is by taking the inner-products again to ensure orthogonality (called double Gram-Schmidt). In pseudocode

    b = arbitrary, q_1 = b/||b||
    for n = 1,2,...
        w = A q_n
        for j = 1 to n
            f_{jn} = <q_j, A q_n>
            w = w - f_{jn} q_j
        end
        v = w
        for j = 1 to n
            g_{jn} = <q_j, w>
            v = v - g_{jn} q_j
            h_{jn} = f_{jn} + g_{jn}
        end
        h_{n+1,n} = ||v||
        q_{n+1} = v/||v||
    end

We note that there are subtle differences between the above and the original algorithm. The original is akin to a single round of modified Gram-Schmidt whereas this algorithm is like two rounds of classical Gram-Schmidt. I'm not entirely sure of the benefits of choosing one over the other, though the second algorithm appears to be potentially parallelizable.

The work at each iteration of either algorithm is dominated by the matrix-vector multiplication $A q_n$. This step can be sped up using a black-box matrix multiplication process such as a sparse matrix class (obviously for the case of sparse matrices), a Fast Multipole Method, a compressed form of the matrix, etc.
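As a rough illustration of the first (modified Gram-Schmidt) pseudocode above, here is a minimal NumPy sketch. The function name `arnoldi`, the breakdown tolerance `tol`, and the random test matrix are assumptions made for this example; they are not part of the notes.

```python
import numpy as np


def arnoldi(A, b, n, tol=1e-12):
    """Run n steps of Arnoldi (modified Gram-Schmidt variant).

    Returns Q with n+1 columns and H of shape (n+1, n) such that
    A @ Q[:, :n] is approximately Q @ H. On breakdown at step k the
    arrays are truncated and the last column of Q is left as zeros.
    """
    dtype = np.result_type(A, b, np.float64)
    m = b.shape[0]
    Q = np.zeros((m, n + 1), dtype=dtype)
    H = np.zeros((n + 1, n), dtype=dtype)
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):
        v = A @ Q[:, k]                      # the dominant cost: one mat-vec
        for j in range(k + 1):
            H[j, k] = np.vdot(Q[:, j], v)    # <q_j, v>, conjugating q_j
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if abs(H[k + 1, k]) < tol:           # Krylov subspace is invariant
            return Q[:, :k + 2], H[:k + 2, :k + 1]
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H


# Small demonstration on a random (hypothetical) test problem.
rng = np.random.default_rng(0)
A_demo = rng.standard_normal((8, 8))
b_demo = rng.standard_normal(8)
Q_demo, H_demo = arnoldi(A_demo, b_demo, 5)
print("||A Q_n - Q_{n+1} H_n|| =",
      np.linalg.norm(A_demo @ Q_demo[:, :5] - Q_demo @ H_demo))
```

The eigenvalues of the square part `H_demo[:5, :5]` would be the Ritz values mentioned in Section 6.3; replacing `A @ Q[:, k]` with any black-box product routine leaves the rest unchanged.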

6.5 GMRES

Description

The idea of the algorithm is to choose an $x \in K_n$ which minimizes $\|b - Ax\|$. If $Q_n$ are the basis matrices gained from the Arnoldi iteration then $x = Q_n y \in K_n$ for some $y \in \mathbb{C}^n$. This gives that

$$b - Ax = \|b\| q_1 - A Q_n y = Q_{n+1} (e - H_n y)$$

where $e = (\|b\|, 0, \ldots, 0)^T \in \mathbb{C}^{n+1}$. Because the columns of $Q_{n+1}$ are orthonormal, we see that minimizing $\|b - Ax\|$ in $x$ is equivalent to minimizing $\|e - H_n y\|$ in $y$. The advantage is that this minimization problem is in a much lower dimension for $n \ll m$. We are then left with a least squares problem to find $y$. This can be accomplished by a QR decomposition. Naïvely, if we recompute the QR decomposition of $H_n$ anew at each step, this is about $O(n^3)$ work per step. However, the $H_n$ are closely related and the new QR decomposition can be obtained from the previous one in a more efficient manner (via Givens rotations, see for reference the Wikipedia page on GMRES). Further, even the work of the back-substitution step for the least squares problem (applying $R^{-1}$) can be shortened by virtue of the relationship between the $H_n$.

Convergence

Why is it a good idea to choose $x_n$ from the Krylov subspace $K_n$? As seen above, this reduces the dimensionality of the problem. We will consider a different characterization of the problem to see how good of an approximation we get for $x$. We note that the residual after $n$ iterations is

$$r_n = b - A x_n = (I - A p_{n-1}(A)) b = f_n(A) b$$

where $f_n$ is the degree $n$ polynomial $f_n(x) = 1 - x p_{n-1}(x)$ and $p_{n-1}$ is the polynomial which minimizes $\|f_n(A) b\|$ (this is what GMRES does). Thus, $f_n$ is the polynomial minimizing $\|f_n(A) b\|$ subject to $f_n(0) = 1$. Thus, we have

$$\frac{\|r_n\|}{\|b\|} \le \inf_{f_n \in F_n} \|f_n(A)\|$$

where $F_n = \{f_n : f_n(0) = 1, \deg(f_n) \le n\}$ is a set of polynomials. Suppose $A$ is diagonalizable with $A = V \Lambda V^{-1}$. Then

$$\frac{\|r_n\|}{\|b\|} \le \inf_{f_n \in F_n} \|f_n(A)\| \le \kappa(V) \inf_{f_n \in F_n} \Big[ \sup_{x \in \sigma(A)} |f_n(x)| \Big]$$

This gives an idea of when we can expect fast convergence. If $\kappa(V)$ is not too large (i.e.
$A$ is not too far from being normal) and if polynomials which are 1 at the origin can be well bounded on the spectrum of $A$, then we could expect good convergence. For instance, the latter condition is satisfied for eigenvalues which are clustered away from the origin. We note that this is a bound

on the value of $r_n$ rather than $e_n$, where $e_n = x - x_n$. Thus, it represents a kind of backward accuracy (as opposed to forward).

We note that if the method is able to continue and produces a full basis for $\mathbb{C}^m$, it is clear that the method converges after $m$ steps (for $A \in \mathbb{C}^{m \times m}$). If we are unable to produce a new vector for the basis at step $n$ (but were able at step $n - 1$), i.e., if

$$A q_n - (\langle q_1, A q_n \rangle q_1 + \cdots + \langle q_n, A q_n \rangle q_n) = 0$$

then we see that $A q_n \in \operatorname{span}\{q_1, \ldots, q_n\} = \operatorname{span}\{b, \ldots, A^{n-1} b\}$. Thus, there is a non-trivial combination

$$c_0 b + c_1 A b + \cdots + c_n A^n b = 0$$

We see that $c_0 \neq 0$, otherwise we have $c_1 b + \cdots + c_n A^{n-1} b = 0$ (if $A$ is invertible) and the iteration would have stopped a step earlier. Thus, if we choose

$$x_n = -\frac{c_1}{c_0} b - \cdots - \frac{c_n}{c_0} A^{n-1} b \in K_n$$

then $A x_n = b$ and we must have found a solution at step $n$. Thus, we see that we always get a solution by the $m$-th step (of course we hope to get a solution earlier than that).

6.6 Lanczos Iteration

The Lanczos iteration is a specialization of the Arnoldi iteration to the case where $A$ is Hermitian (we will assume, further, that $A$ is real symmetric). We would like to find increasing orthonormal bases for $K_n$ which satisfy

$$A Q_n = Q_{n+1} T_n$$

for $T_n \in \mathbb{C}^{(n+1) \times n}$ tridiagonal. This can be viewed in a different light. Suppose we wished to find a similarity transformation giving $A = Q T Q^T$ for $T$ tridiagonal. To see why this matrix should be tridiagonal, we note that $A q_n \in \operatorname{span}\{q_1, \ldots, q_{n+1}\}$. Further, we have $T_{ij} = q_i^T A q_j$, so $T_{ij} = 0$ for $i > j + 1$. This gives that $T$ is necessarily upper-Hessenberg. Because $A$ is symmetric, we have

$$T^T = (Q^T A Q)^T = Q^T A^T (Q^T)^T = Q^T A Q = T$$

so $T$ is symmetric and thus tridiagonal. If we used a process like Gram-Schmidt to find the vectors of $Q$ successively, we would end up with the formula for $Q_n$ and $T_n$ above, where $Q_n$ is the first $n$ columns of $Q$ and $T_n$ is the upper-left $(n + 1) \times n$ block of $T$. This Gram-Schmidt-like process is the Lanczos iteration. Write $T_n$ as

$$T_n = \begin{pmatrix} \alpha_1 & \beta_1 & & & \\ \beta_1 & \alpha_2 & \beta_2 & & \\ & \beta_2 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \beta_{n-1} \\ & & & \beta_{n-1} & \alpha_n \\ & & & & \beta_n \end{pmatrix}$$

Using the relation for $Q_n$ and $T_n$ above, we see that

$$A q_n = \beta_{n-1} q_{n-1} + \alpha_n q_n + \beta_n q_{n+1}$$

As before, $\alpha_n$ can be found by taking the inner-product of both sides with $q_n$, and $\beta_n$ is simply found to normalize $q_{n+1}$. Thus, the algorithm is greatly simplified in this case. We have

    beta_0 = 0; q_0 = 0; b given; q_1 = b/||b||;
    for n = 1,2,...
        v = A q_n;
        alpha_n = (q_n, v);
        v = v - beta_{n-1} q_{n-1} - alpha_n q_n;
        beta_n = ||v||;
        q_{n+1} = v/beta_n;
    endfor

It can be seen from the algorithm above that we only need the last two iterates to compute the next. Thus, the phrase "three-term recurrence" is often used for the Lanczos iteration. When the Lanczos iteration is used in a GMRES-like algorithm, we have MINRES.

6.7 Conjugate Gradient

Conjugate Gradient (CG) is an iterative method for solving the system $Ax = b$, when $A$ is symmetric positive definite.

As a Modification of the Steepest Descent Idea

Conjugate Directions. Again, we let

$$f(x) = \frac{1}{2} x^T A x - x^T b$$

We note that in the method of steepest descent, we often take many steps in the same direction. Wouldn't it be nice to get the correct step length for each direction the first time we move in that direction? (Yes, it would.) If we have orthogonal search directions $d_0, \ldots, d_{n-1}$, then to ensure we've gone the correct distance, we need that the error $e_{i+1} = x - x_{i+1}$ for $x_{i+1} = x_i + \alpha_i d_i$ should be perpendicular to the search direction $d_i$.

$$d_i^T e_{i+1} = 0$$

$$d_i^T (e_i - \alpha_i d_i) = 0$$

$$\alpha_i = \frac{d_i^T e_i}{d_i^T d_i}$$

However, we cannot compute the values $\alpha_i$ because we do not know the error $e_i$, as we don't know $x$. Instead, we require that the search directions be $A$-orthogonal and that $e_{i+1}$ is $A$-orthogonal to $d_i$. This is equivalent to finding the minimum of $f$ with respect to $\alpha$. We have

$$\frac{d}{d\alpha} f(x_{i+1}) = 0$$

$$f'(x_{i+1})^T \frac{d}{d\alpha} x_{i+1} = 0$$

$$-r_{i+1}^T d_i = 0$$

$$d_i^T A e_{i+1} = 0$$

Further, with this choice we have the following equation for $\alpha_i$

$$\alpha_i = \frac{d_i^T r_i}{d_i^T A d_i}$$

which can be evaluated. If the search direction were the residual, this would be exactly the method of steepest descent.

To see that this procedure computes $x$ in $n$ steps, we note that the vectors $d_i$ form a basis for the whole space and thus we can write

$$e_0 = \sum_{j=0}^{n-1} \delta_j d_j$$

if we take the $A$-inner-product of each side of the equation with $d_k$ we get

$$d_k^T A e_0 = \sum_{j=0}^{n-1} \delta_j d_k^T A d_j$$

$$\delta_k = \frac{d_k^T A e_0}{d_k^T A d_k} = \frac{d_k^T A (e_0 - \sum_{i=0}^{k-1} \alpha_i d_i)}{d_k^T A d_k} = \frac{d_k^T A e_k}{d_k^T A d_k} = \alpha_k$$

where the second equality uses the $A$-orthogonality of the $d_i$ and the third uses $e_{i+1} = e_i - \alpha_i d_i$. Thus, we see that we cut down the error at each step. In particular

$$e_k = \sum_{j=k}^{n-1} \delta_j d_j$$

If we take the $A$-inner-product of either side of this equation with $d_i$ for $i < k$ we get

$$d_i^T A e_k = d_i^T r_k = 0$$

We note that, as with the method of steepest descent, the number of matrix-vector multiplications can be reduced to one by noting that

$$r_{i+1} = A e_{i+1} = r_i - \alpha_i A d_i$$

and that $A d_i$ has already been computed.

Optimality. How good is the error term? If $D_i = \operatorname{span}\{d_0, \ldots, d_{i-1}\}$, suppose we'd like to minimize the value of $e_i$ taken from $D_i + e_0$ with respect to the $A$-norm. We have

$$\|e_i\|_A^2 = \sum_{j=i}^{n-1} \sum_{k=i}^{n-1} \delta_j \delta_k d_j^T A d_k = \sum_{j=i}^{n-1} \delta_j^2 d_j^T A d_j$$

Thus, with our choice of $\alpha_i$, the error we obtained involves only search directions which are not yet available (not in $D_i$). This implies that the error for $x$ (forward error) is minimal in this norm. We note that this is in contrast to GMRES and MINRES, in which we minimize the Euclidean norm of the residual over the spaces $K_i$.

Finding the Search Directions. The search directions could be found by starting with an arbitrary basis $\{u_j\}$ of the space and using a conjugated form of Gram-Schmidt on that basis. However, this process involves quite a bit of work. If we instead choose a particular basis made of the residuals, $u_i = r_i$, there are a number of advantages. As shown above, the residual is orthogonal to the previous search directions. Thus, unless the residual is zero, we can get a new search direction (if the residual is zero, we don't need a new search direction). Choosing the $r_i$ as the basis vectors has a further advantage. First, we note that if $D_i = \operatorname{span}\{d_0, \ldots, d_{i-1}\}$ and $R_i = \operatorname{span}\{r_0, \ldots, r_{i-1}\}$ then we necessarily have that $D_i = R_i$. Further, by the equation

$$r_{i+1} = r_i - \alpha_i A d_i$$

we have that

$$D_i = \operatorname{span}\{d_0, A d_0, \ldots, A^{i-1} d_0\} = \operatorname{span}\{r_0, A r_0, \ldots, A^{i-1} r_0\}$$

Thus, $D_i$ is a Krylov subspace generated by $r_0$. Because $A D_i \subset D_{i+1}$ and $r_{i+1} \perp D_{i+1}$, we have that $r_{i+1}$ is $A$-orthogonal to $D_i$, i.e. $r_{i+1}^T A d_j = 0$ for $j \le i - 1$. Thus, the Gram-Schmidt conjugation process is greatly simplified because $r_{i+1}$ is already $A$-orthogonal to the previous directions other than $d_i$. To highlight this simplification, we write out what happens for the Gram-Schmidt conjugation process, which gives

$$d_i = r_i + \sum_{k=0}^{i-1} \beta_{i,k} d_k$$

$$d_j^T A d_i = d_j^T A r_i + \sum_{k=0}^{i-1} \beta_{i,k} d_j^T A d_k$$

$$0 = d_j^T A r_i + \beta_{i,j} d_j^T A d_j$$

$$\beta_{i,j} = -\frac{d_j^T A r_i}{d_j^T A d_j}$$

$$\beta_{i,j} = \begin{cases} -\dfrac{d_{i-1}^T A r_i}{d_{i-1}^T A d_{i-1}} & \text{for } j = i - 1 \\ 0 & \text{for } j < i - 1 \end{cases}$$

The matrix-vector multiplications from the top and bottom of this fraction can be removed by observing that $A d_{i-1} = -(r_i - r_{i-1})/\alpha_{i-1}$ and that $r_i^T r_j = 0$ for $i \neq j$. This gives

$$\beta_{i,j} = \begin{cases} \dfrac{1}{\alpha_{i-1}} \dfrac{r_i^T r_i}{d_{i-1}^T A d_{i-1}} & \text{for } j = i - 1 \\ 0 & \text{for } j < i - 1 \end{cases}$$

Finally, we use the definition of $\alpha_{i-1}$ (together with $d_{i-1}^T r_{i-1} = r_{i-1}^T r_{i-1}$) to get

$$\beta_{i,j} = \begin{cases} \dfrac{r_i^T r_i}{r_{i-1}^T r_{i-1}} & \text{for } j = i - 1 \\ 0 & \text{for } j < i - 1 \end{cases}$$

The CG Algorithm. Now, we write $\beta_i = \beta_{i,i-1}$. We then have the following algorithm (in pseudocode)

    // initial direction
    d_0 = r_0 = b - A x_0;
    for i = 0,1,2,...
        // find the minimizing step in the direction d_i
        alpha_i = r_i^T r_i / (d_i^T A d_i);
        // take that step
        x_{i+1} = x_i + alpha_i d_i;
        // update the residual recursively (note that this can be unstable)
        r_{i+1} = r_i - alpha_i A d_i;
        // find the new direction
        beta_{i+1} = r_{i+1}^T r_{i+1} / (r_i^T r_i);
        d_{i+1} = r_{i+1} + beta_{i+1} d_i;
    endfor

Krylov Subspace Characterization (Minimizing Polynomial)

To analyze the convergence of the conjugate gradient method, we will outline the relation between Krylov subspaces and minimizing polynomials. We have

$$D_i = \operatorname{span}\{r_0, A r_0, \ldots, A^{i-1} r_0\} = \operatorname{span}\{A e_0, A^2 e_0, \ldots, A^i e_0\}$$

As noted in the section on the optimality of CG, the method chooses $x_i$ which minimizes $e_i$ in the $A$ norm over the set $e_0 + D_i$, i.e., the method chooses the polynomial $p(z)$ with $p(0) = 1$ which

minimizes $\|p(A) e_0\|$ in the $A$ norm. Call the $i$-th degree polynomial obtained at the $i$-th step $P_i$. We write $e_i = P_i(A) e_0$. Write

$$e_0 = \sum_{j=1}^{n} \xi_j v_j$$

where the $v_j$ are a set of orthonormal eigenvectors of $A$ with eigenvalues $\lambda_j > 0$. We have

$$e_i = \sum_j \xi_j P_i(\lambda_j) v_j$$

$$A e_i = \sum_j \xi_j P_i(\lambda_j) \lambda_j v_j$$

$$e_i^T A e_i = \sum_j \xi_j^2 (P_i(\lambda_j))^2 \lambda_j$$

$$\|e_i\|_A^2 \le \min_{P_i} \max_{\lambda \in \sigma(A)} (P_i(\lambda))^2 \sum_j \xi_j^2 \lambda_j$$

$$\|e_i\|_A^2 \le \min_{P_i} \max_{\lambda \in \sigma(A)} (P_i(\lambda))^2 \|e_0\|_A^2$$

We note that the $P_i$ must satisfy $P_i(0) = 1$. Polynomials of degree $i$ can be chosen to fit $i + 1$ points, thus we can satisfy $P_i(0) = 1$ and $P_i(\lambda) = 0$ for $i$ eigenvalues. This is another way to see that CG converges in $n$ steps. This also suggests that CG is quicker with duplicated eigenvalues.

Convergence

Following the ideas above, we can gain a bound for the error $e_i$ in the $A$ norm if we come up with a specific example of a degree $i$ polynomial which satisfies $P_i(0) = 1$. Instead of trying to fit $P_i$ to be exactly zero at the eigenvalues, we choose $P_i$ to be small on the range of the eigenvalues of $A$, i.e., on $[\lambda_{\min}, \lambda_{\max}]$. To accomplish this, we consider the Chebyshev polynomials $T_i$. They satisfy two nice properties. On $[-1, 1]$ we have that $|T_i(x)| \le 1$. Further, outside of $[-1, 1]$ the value of $|T_i(x)|$ is maximal among all polynomials of degree $i$ which satisfy the first property. We then set

$$P_i(x) = \frac{T_i\Big( \dfrac{\lambda_{\max} + \lambda_{\min} - 2x}{\lambda_{\max} - \lambda_{\min}} \Big)}{T_i\Big( \dfrac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} \Big)}$$

The scaling in the numerator ensures that the numerator takes values in $[-1, 1]$ for $x \in [\lambda_{\min}, \lambda_{\max}]$. The value chosen for the denominator ensures that $P_i(0) = 1$. We then have that

$$\|e_i\|_A \le \max_{\lambda \in \sigma(A)} |P_i(\lambda)| \, \|e_0\|_A \le T_i\Big( \frac{\lambda_{\max} + \lambda_{\min}}{\lambda_{\max} - \lambda_{\min}} \Big)^{-1} \|e_0\|_A$$

$$= T_i\Big( \frac{\kappa + 1}{\kappa - 1} \Big)^{-1} \|e_0\|_A$$

We then employ the following explicit formula for $T_i$

$$T_i(x) = \frac{1}{2} \Big[ \big(x + \sqrt{x^2 - 1}\big)^i + \big(x - \sqrt{x^2 - 1}\big)^i \Big]$$

This gives

$$\|e_i\|_A \le 2 \Big[ \Big( \frac{\sqrt{\kappa} + 1}{\sqrt{\kappa} - 1} \Big)^i + \Big( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \Big)^i \Big]^{-1} \|e_0\|_A \le 2 \Big( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \Big)^i \|e_0\|_A$$

where the last bound is the most common error estimate used for CG. This gives the following estimate on the number of iterations required to reduce the norm of the error by a factor of $\epsilon$. We need about

$$i \approx \frac{\sqrt{\kappa}}{2} \ln\Big( \frac{2}{\epsilon} \Big)$$

6.8 Preconditioning

Preconditioning of linear systems is a rather large subject, so we only summarize it briefly here. The idea of preconditioning is to approximate $A^{-1}$ using a matrix that is easier to invert. For instance, the Jacobi preconditioner takes the diagonal part of $A$, call it $D$, and inverts this. We get the resulting equivalent system (as long as $D$ is invertible)

$$D^{-1} A x = D^{-1} b$$

The idea is that $D^{-1} A$ has a lower condition number and thus will behave better in the iterative methods described above. The Jacobi preconditioner is not particularly sophisticated but is very easy to compute. The main goal in choosing any preconditioner is to balance its effectiveness against the difficulty of computing it. For instance, taking $A$ itself gives a perfect preconditioner, but we haven't accomplished anything, in that we must be able to compute (or apply) the inverse of $A$ (which was the goal to begin with).

7 Interpolation by Polynomials

7.1 Existence

If we have a polynomial $p_n(x) = a_0 + a_1 x + \cdots + a_n x^n$, then for distinct points $x_i$ and values $y_i$, $i = 0, \ldots, n$, we can find $a_i$ such that $p_n(x_i) = y_i$. One can write this as a Vandermonde matrix

problem but we can also observe that

$$p_n(x) = \sum_{i=0}^{n} y_i l_i(x)$$

is a solution to the problem where

$$l_i(x) = \frac{\prod_{k \neq i} (x - x_k)}{\prod_{k \neq i} (x_i - x_k)}$$

These are called the Lagrange polynomials. They directly show that the Vandermonde matrix is invertible for distinct $x_i$.

7.2 Divided Differences

To find the coefficients one could try to invert the Vandermonde matrix, which generally takes $O(n^3)$ time and seems too slow. Calculating the coefficients at each step is okay but requires that you start over if you add a new point. Polynomials can always be rewritten for a different center

$$p_n(x) = a_0' + a_1'(x - c) + \cdots + a_n'(x - c)^n$$

A useful observation is that one can write the polynomial expanded about different centers, called the Newton form

$$p_n(x) = b_0 + b_1 (x - c_1) + b_2 (x - c_1)(x - c_2) + \cdots + b_n (x - c_1) \cdots (x - c_n)$$

$$= b_0 + (x - c_1)(b_1 + b_2 (x - c_2) + \cdots)$$

$$= b_0 + (x - c_1)(b_1 + (x - c_2)(b_2 + b_3 (x - c_3) + \cdots))$$

where the last two lines suggest an efficient way to evaluate the polynomial at a particular $x$. There is an efficient way to find the coefficients of the Newton form where $c_i = x_{i-1}$. We have

$$p_n(x) = A_0 + A_1 (x - x_0) + \cdots + A_n (x - x_0) \cdots (x - x_{n-1})$$

and $p_n(x_i) = f_i$. Plugging in $x_0$ and $x_1$, we see

$$A_0 = f_0, \qquad A_1 = \frac{f_1 - f_0}{x_1 - x_0}$$

These coefficients can be found recursively. We let

$$A_k = f[x_0, \ldots, x_k] = \frac{f[x_1, \ldots, x_k] - f[x_0, \ldots, x_{k-1}]}{x_k - x_0}$$

We note that if we are interpolating over four points, this gives

$$\begin{array}{ccccc}
x_0 & f[x_0] & & & \\
 & & f[x_0, x_1] & & \\
x_1 & f[x_1] & & f[x_0, x_1, x_2] & \\
 & & f[x_1, x_2] & & f[x_0, x_1, x_2, x_3] \\
x_2 & f[x_2] & & f[x_1, x_2, x_3] & \\
 & & f[x_2, x_3] & & \\
x_3 & f[x_3] & & &
\end{array}$$

From this, we see that adding a new point, say the $n$-th point, only requires about $n$ new difference calculations. Thus, we can find the interpolating polynomials one point at a time for a total amount of work about $n^2/2$ up to the $n$-th step. If we want to interpolate using function values and derivatives, we can modify the scheme by

$$\begin{array}{ccccc}
x_0 & f(x_0) & & & \\
 & & f'(x_0) & & \\
x_0 + \epsilon & f(x_0) & & f[x_0, x_0, x_1] & \\
 & & f[x_0, x_1] & & f[x_0, x_0, x_1, x_1] \\
x_1 & f(x_1) & & f[x_0, x_1, x_1] & \\
 & & f'(x_1) & & \\
x_1 + \epsilon & f(x_1) & & &
\end{array}$$

To figure out the error in the interpolation, we consider adding a new point $\bar{x}$. This gives

$$p_{n+1}(x) = p_n(x) + f[x_0, \ldots, x_n, \bar{x}] \prod_{j=0}^{n} (x - x_j)$$

We note

$$f(\bar{x}) = p_{n+1}(\bar{x}) = p_n(\bar{x}) + f[x_0, \ldots, x_n, \bar{x}] \prod_{j=0}^{n} (\bar{x} - x_j)$$

So the error is

$$e_n(\bar{x}) = f(\bar{x}) - p_n(\bar{x}) = f[x_0, \ldots, x_n, \bar{x}] \prod_{j=0}^{n} (\bar{x} - x_j)$$

We claim that $f[x_0, \ldots, x_n, \bar{x}] = f^{(n+1)}(\zeta)/(n + 1)!$ for some $\zeta \in [\min x_i, \max x_i]$. To see this, we note that $f - p_{n+1}$ vanishes at at least $n + 2$ points. Thus, by Rolle's theorem, $f' - p_{n+1}'$ vanishes at at least $n + 1$ points. Repeating the argument, $f^{(n+1)}(\zeta) = p_{n+1}^{(n+1)}(\zeta)$ for some $\zeta$. Then we note that because

$$p_{n+1} = \text{lower order terms} + f[x_0, \ldots, x_n, \bar{x}] \prod_{j=0}^{n} (x - x_j)$$

we have

$$f^{(n+1)}(\zeta) = p_{n+1}^{(n+1)}(\zeta) = (n + 1)! \, f[x_0, \ldots, x_n, \bar{x}]$$

7.3 Piecewise Approximations

There are serious limitations to piecewise linear approximations. In particular, the derivatives cannot be made continuous in general. The preferred method is piecewise cubic splines. If you know the derivative values, then you can use the modified divided differences method of the previous section. This gives

$$P_3(x) = f_0 + f_0'(x - x_0) + \frac{f[x_0, x_1] - f_0'}{x_1 - x_0} (x - x_0)^2 + \frac{f_1' - 2 f[x_0, x_1] + f_0'}{(x_1 - x_0)^2} (x - x_0)^2 (x - x_1)$$

for the interval $[x_0, x_1]$. If you don't have the derivative values, you can leave them as unknowns (let's call them $S_i$) and then require that the second derivatives match at the interpolation points on the interior. If you are interpolating on $x_0 < x_1 < \cdots < x_n$, this gives a tridiagonal system of $n - 1$ equations for the $n + 1$ values $S_i$. The formula is then

$$P_{3,i}(x) = f_i + S_i (x - x_i) + \frac{f[x_i, x_{i+1}] - S_i}{x_{i+1} - x_i} (x - x_i)^2 + \frac{S_{i+1} - 2 f[x_i, x_{i+1}] + S_i}{(x_{i+1} - x_i)^2} (x - x_i)^2 (x - x_{i+1})$$

on each interval $[x_i, x_{i+1}]$. The system of tridiagonal equations then results by setting

$$\frac{d^2}{dx^2} P_{3,i-1}(x_i) = \frac{d^2}{dx^2} P_{3,i}(x_i)$$

for $i = 1, \ldots, n - 1$. To get two more equations there are options to handle the endpoints.

You can set the derivatives at the end points to some prescribed values or an approximation of the derivative.

You could require that the third derivatives match up at $x_1$ and $x_{n-1}$.

You could require that $p''(x_0) = p''(x_n) = 0$.

8 Numerical Integration

8.1 Using Equidistant Points

The typical approach is to find an interpolating polynomial for the given function and exactly integrate that polynomial. We consider the problem

$$I(f) \approx \int_a^b f(x) \, dx$$

In general, suppose we interpolate with a $k$-th degree polynomial on $k + 1$ points, $p_k(x_i) = f(x_i)$ for $i = 0, \ldots, k$. From the previous section, we had

$$f(x) = p_k(x) + f[x_0, \ldots, x_k, x] \psi(x)$$

where

$$\psi(x) = \prod_{j=0}^{k} (x - x_j)$$

If $E(f)$ is the error in the integral approximation, we have

$$E(f) = \int_a^b f[x_0, \ldots, x_k, x] \psi(x) \, dx$$

Trapezoidal Rule

For interpolation with two points, we have

$$\psi_1(x) = (x - a)(x - b)$$

This function always has the same sign in $[a, b]$. From the previous section, we also had

$$f[a, b, x] = \frac{f''(\zeta_x)}{2!}$$

We note that

$$\int_a^b (x - a)(x - b) \, dx = \Big[ (x - a)(x - b)^2/2 \Big]_a^b - \Big[ (x - b)^3/6 \Big]_a^b = -\frac{(b - a)^3}{6}$$

Thus, if we let

$$I(f) = \int_a^b \Big[ f(a) + \frac{f(b) - f(a)}{b - a} (x - a) \Big] dx = \frac{f(a) + f(b)}{2} (b - a)$$

then we have

$$\int_a^b f \, dx - I(f) = E(f) = -f''(\zeta) \frac{(b - a)^3}{12}$$

Thus, the trapezoidal method overestimates for concave up functions and underestimates for concave down functions. If we are integrating over a uniform grid with $x_0 = a$, $x_N = b$, and $h = (b - a)/N$, we have

$$\int_{x_0}^{x_N} f(x) \, dx \approx \frac{h}{2} \sum_{k=0}^{N-1} (f(x_{k+1}) + f(x_k)) = \frac{b - a}{2N} (f(x_0) + 2 f(x_1) + \cdots + 2 f(x_{N-1}) + f(x_N))$$
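The composite formula above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `composite_trapezoid` and the test integrand $e^x$ on $[0, 1]$ are choices made here, not taken from the notes. The printed errors are defined as (approximation minus exact), so a positive value means the rule overestimates, as expected for a concave up integrand, and halving $h$ should cut the error by roughly a factor of 4.

```python
import math


def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))      # endpoints carry weight 1/2
    for k in range(1, n):
        total += f(a + k * h)        # interior points carry weight 1
    return h * total


# Illustration on a concave up integrand: exp on [0, 1].
exact = math.e - 1.0
for n in (8, 16, 32):
    err = composite_trapezoid(math.exp, 0.0, 1.0, n) - exact
    print(n, err)
```

Because the error of each panel scales like $h^3$ and there are $N = (b - a)/h$ panels, the composite rule is second order overall.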

The convergence of the method is often very fast for smooth periodic functions. One can expect this because a periodic function should spend roughly the same time concave up as concave down, so the errors cancel. For a more rigorous explanation of this fact, one turns to the Euler-Maclaurin formulae.

Midpoint Rule

This is interpolation by a constant at the midpoint. You have

$$I(f) = f\Big( \frac{a + b}{2} \Big) (b - a)$$

Because it is centered, you again achieve third order accuracy, with an even slightly better bound than trapezoidal

$$E(f) = f''(\zeta) \frac{(b - a)^3}{24}$$

Simpson's Rule

Simpson's is a 3 point rule that interpolates at the endpoints and midpoint. If you choose the constants $A, B, C$ such that $I(f) = A f(a) + B f((a + b)/2) + C f(b)$ is exact for polynomials of degree less than or equal to 2, you get Simpson's rule. It turns out that the rule is exact up to degree 3 polynomials. The rule is

$$I(f) = (b - a) \Big( \frac{1}{6} f(a) + \frac{2}{3} f((a + b)/2) + \frac{1}{6} f(b) \Big)$$

The error is

$$\int_a^b f \, dx - I(f) = E(f) = -\frac{f^{(4)}(\eta) ((b - a)/2)^5}{90}$$

for $\eta \in (a, b)$.

Rules Using Hermite Cubics

Hermite cubics are the interpolating polynomials of degree 3 which match the function value and first derivative at the endpoints. How to find these polynomials is outlined in the previous section. Using these, we get a rule

$$I(f) = \frac{x_{i+1} - x_i}{2} (f(x_i) + f(x_{i+1})) + \frac{(x_{i+1} - x_i)^2}{12} (f'(x_i) - f'(x_{i+1}))$$

which is exact for cubics. Further, if you concatenate this rule over adjacent (equal length) subintervals, the derivative terms cancel out in the middle. Thus, you need only the function evaluations at all the points and then the additional derivative evaluations only at the endpoints.

8.2 Quadrature Rules

Given $m$ quadrature points, it is possible to come up with a scheme that is exact for polynomials of degree $2m - 1$ or less. We consider integrating against a weight function $w$, that is we wish to approximate

$$\int_a^b f(x) w(x) \, dx$$

using $m$ points. We have the following theorem: if the points $x_0, \ldots, x_{m-1}$ are chosen as the zeros of the polynomial $\varphi_m(x)$ of degree $m$ in the family of orthogonal polynomials associated with $w(x)$, then

$$\int_a^b f(x) w(x) \, dx \approx A_0 f_0 + \cdots + A_{m-1} f_{m-1}$$

is exact for polynomials up to degree $2m - 1$. The coefficients are given by

$$A_i = \int_a^b L_i(x) w(x) \, dx$$

where $L_i$ is the Lagrange polynomial taking the value 1 at $x_i$ and 0 at all other $x_j$ (this choice should be clear). The proof of the theorem follows by writing a degree $2m - 1$ polynomial $f$ as $f = q \varphi_m + r$ with $q$ and $r$ of degree less than or equal to $m - 1$. Then you use orthogonality ($\int q \varphi_m w \, dx = 0$, while $q \varphi_m$ vanishes at the nodes) and the fact that the choice of coefficients guarantees exact integration for degree $m - 1$ polynomials.

Optimality of Gaussian Quadrature

From the above, we see that Gaussian quadrature gives you a method which is exact for polynomials up to degree $2m - 1$. This is optimal, as there is no quadrature scheme with $m$ points that is exact for polynomials of degree $2m$. Let $x_i$ be the quadrature nodes. Consider the polynomial

$$p(x) = \prod_{i=0}^{m-1} (x - x_i)^2$$

The integral of this polynomial is strictly positive over any interval. However, the quadrature rule gives

$$\sum_{i=0}^{m-1} A_i p(x_i) = 0$$

for any choice of coefficients $A_i$.
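Both statements (exactness up to degree $2m - 1$ and failure at degree $2m$) can be checked numerically. The sketch below assumes the simplest setting, weight $w \equiv 1$ on $[-1, 1]$ (Gauss-Legendre), and uses NumPy's `numpy.polynomial.legendre.leggauss` to obtain nodes and weights; the helper name `gauss_legendre` and the test polynomials are choices made for this illustration.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss


def gauss_legendre(f, m):
    """Approximate the integral of f over [-1, 1] with an m-point Gauss rule."""
    x, w = leggauss(m)               # nodes = zeros of the degree-m Legendre polynomial
    return float(np.dot(w, f(x)))


m = 3
nodes, _ = leggauss(m)


def node_poly(x):
    """p(x) = prod_i (x - x_i)^2, a degree 2m polynomial vanishing at the nodes."""
    return np.prod([(x - xi) ** 2 for xi in nodes], axis=0)


# Degree 2m - 1 = 5: the 3-point rule should be exact (true value 2/5).
print(gauss_legendre(lambda x: x**5 + x**4, m))

# Degree 2m = 6: the 3-point rule returns (numerically) zero, although the
# true integral is positive; a 4-point rule recovers it.
print(gauss_legendre(node_poly, m), gauss_legendre(node_poly, m + 1))
```

For a different weight function or interval one would need the corresponding orthogonal family (or an affine change of variables), which this sketch does not cover.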

8.3 Adaptive Methods

The error bounds for the schemes above depend heavily on the length of the interval and the value of the derivatives of the function we're trying to integrate. Thus, for a fixed length interval, we could obtain a more accurate solution by dividing the interval uniformly into lots of smaller subintervals. However, in doing so, we may end up doing more work than is necessary. For instance, we can use a much coarser grid for sections of the interval over which $f$ is relatively flat. To save the method from unnecessary evaluations, we can use adaptive methods. We'll use a concrete example of an adaptive method for Simpson's rule to elaborate. Suppose we want to approximate the integral

$$\int_a^b f(x) \, dx$$

up to a given tolerance $\epsilon$ using Simpson's rule. We have the following formula for the error

$$E(f) = -\frac{f^{(4)}(\eta) ((b - a)/2)^5}{90}$$

Let $S$ be the approximation over the whole interval and $\bar{S}$ be the approximation obtained by splitting the interval in half, using Simpson's rule on each subinterval, and adding the results together. We note that $\bar{S} - S = E - \bar{E}$ where $E$ and $\bar{E}$ are the errors for $S$ and $\bar{S}$ respectively. From the formula for the error, we see that $E \approx 16 \bar{E}$ (each half contributes $1/32$ of the original error, assuming $f^{(4)}$ varies little). Thus, if $|E - \bar{E}| = |\bar{S} - S| \le 15 \epsilon$, then the approximation $\bar{S}$ is acceptable. We then define the integral over an interval $(c, d)$ recursively as follows. Let $\epsilon$ be the desired precision (tolerance) on $(c, d)$. Split $(c, d)$ in half and compute the Simpson's rule approximation on each subinterval. Use the above to determine if this approximation is good enough. If not, do the same for each subinterval with the tolerance $\epsilon/2$, and so on (note: you should have some sort of maximum depth set). If you program an adaptive method like this using a literal recursion, you can avoid doing extra function evaluations (when intervals have the same endpoints, etc.) by passing function values from level to level.
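A minimal recursive sketch of this scheme in Python follows. The names `adaptive_simpson`, `_simpson`, `_refine` and the default `max_depth=50` are assumptions made for this example; the acceptance test $|\bar{S} - S| \le 15\epsilon$, the halving of the tolerance, and the passing of already computed function values are the ingredients described above.

```python
import math


def _simpson(fa, fm, fb, a, b):
    """Simpson's rule on [a, b] from precomputed endpoint and midpoint values."""
    return (b - a) / 6.0 * (fa + 4.0 * fm + fb)


def _refine(f, a, m, b, fa, fm, fb, whole, eps, depth):
    """Recursive step: only the two new midpoints are evaluated here."""
    lm, rm = 0.5 * (a + m), 0.5 * (m + b)
    flm, frm = f(lm), f(rm)
    left = _simpson(fa, flm, fm, a, m)
    right = _simpson(fm, frm, fb, m, b)
    split = left + right
    # Accept when |S_bar - S| <= 15 * eps, or when the depth limit is hit.
    if depth <= 0 or abs(split - whole) <= 15.0 * eps:
        return split
    return (_refine(f, a, lm, m, fa, flm, fm, left, 0.5 * eps, depth - 1)
            + _refine(f, m, rm, b, fm, frm, fb, right, 0.5 * eps, depth - 1))


def adaptive_simpson(f, a, b, eps, max_depth=50):
    """Adaptive Simpson quadrature of f on [a, b] with target tolerance eps."""
    m = 0.5 * (a + b)
    fa, fm, fb = f(a), f(m), f(b)
    whole = _simpson(fa, fm, fb, a, b)
    return _refine(f, a, m, b, fa, fm, fb, whole, eps, max_depth)


# Example: the integral of sin over [0, pi] is 2.
print(adaptive_simpson(math.sin, 0.0, math.pi, 1e-8))
```

Since the error estimate relies on $f^{(4)}$ being nearly constant on each subinterval, the tolerance is a heuristic target rather than a guaranteed bound.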
If you unroll the recursion in your code, you can still save on function evaluations by storing the evaluations you've already performed in a stack.

8.4 Issues and Other Considerations

Singularities: the following examples can be found in Dahlquist's book.

By substitution: If we wish to calculate

$$A = \int_0^1 x^{-1/2} e^x \, dx$$

we can use the substitution $x = t^2$ giving

$$A = \int_0^1 2 e^{t^2} \, dt$$

which can be approximated using the methods described above.

Using integration by parts: for the same integral as above, we can integrate by parts, giving

$$\int_0^1 x^{-1/2} e^x \, dx = \big[ 2 x^{1/2} e^x \big]_0^1 - 2 \int_0^1 x^{1/2} e^x \, dx$$

$$= 2e - 2 \Big[ \frac{2}{3} x^{3/2} e^x \Big]_0^1 + \frac{4}{3} \int_0^1 x^{3/2} e^x \, dx$$

$$= \frac{2}{3} e + \frac{4}{3} \int_0^1 x^{3/2} e^x \, dx$$

and so on. We note that after the first integration by parts, we don't have a singularity. However, the derivative of the integrand is singular and therefore the methods will not converge rapidly. Generally, the more continuous derivatives we can take the better.

Simple comparison problem: We consider the problem of calculating the integral

$$I = \int_{0.1}^{1} x^{-3} e^x \, dx$$

whose integrand is singular at the origin, just outside the left endpoint. We could instead write the integral as

$$I = \int_{0.1}^{1} x^{-3} \Big( 1 + x + \frac{x^2}{2} \Big) dx + \int_{0.1}^{1} x^{-3} \Big( e^x - 1 - x - \frac{x^2}{2} \Big) dx$$

The first integral above can be evaluated analytically and the second integral has an integrand which is well behaved (bounded with bounded derivatives) and can therefore be evaluated numerically to high precision.

Special integration formula: in many cases, if a function has a certain kind of singularity (e.g. $f$ has a log singularity at zero) then near the singularity, the function is actually equal to the type of singularity times a smooth function plus a smooth function (e.g. $f(x) = h(x) \log(x) + g(x)$ for $h, g$ smooth). Thus, coming up with a quadrature that works to integrate

$$\int_0^h h(x) \log(x) \, dx$$

can have a wide range of applications. These quadratures can be found by using undetermined coefficients to integrate polynomials up to a certain degree exactly.
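To see what the substitution buys for $A = \int_0^1 x^{-1/2} e^x \, dx$, the following sketch compares a composite midpoint rule applied to the singular integrand with composite Simpson applied to the transformed integrand $2 e^{t^2}$. The helper names, the choice of 200 subintervals, and the use of the term-by-term series $\sum_k 2/(k!\,(2k+1))$ as a reference value are assumptions made for this illustration only.

```python
import math


def composite_midpoint(f, a, b, n):
    """Composite midpoint rule; never evaluates f at the endpoints."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def composite_simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of subintervals."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(a + k * h)
    return h * total / 3.0


# Reference value from integrating the power series of 2*exp(t^2) term by term.
reference = sum(2.0 / (math.factorial(k) * (2 * k + 1)) for k in range(30))

# Direct attack on the singular integrand x^(-1/2) e^x (midpoint avoids x = 0).
A_direct = composite_midpoint(lambda x: math.exp(x) / math.sqrt(x), 0.0, 1.0, 200)

# After the substitution x = t^2 the integrand 2 e^(t^2) is smooth.
A_substituted = composite_simpson(lambda t: 2.0 * math.exp(t * t), 0.0, 1.0, 200)

print("direct error:     ", abs(A_direct - reference))
print("substituted error:", abs(A_substituted - reference))
```

With the same number of function evaluations, the transformed problem should be accurate to many more digits, since the direct rule's error decays only like $\sqrt{h}$ near the singularity.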

Infinite interval of integration: analogous versions of the methods above for singular integrals can be effective for infinite intervals of integration. Here are a few other thoughts.

Slowly decaying integrands: when the integrand decays slowly, one can use numerical integration on an interval of modest length (say $[0, R]$) and use some analysis for the tail. To evaluate the integral on $[R, \infty)$, one can expand the function in powers of $x^{-1}$, integrate the terms analytically, and sum the results numerically.

Oscillating integrals: if the integrand decays slowly and oscillates as $x \to \infty$, then the approach is quite similar to that for evaluating alternating series efficiently. Namely, one can split the integral into an alternating series of integrals over the intervals where the integrand is alternately positive and negative. Then, one can apply the technique of repeated averaging to speed the convergence of the sum. We briefly describe repeated averaging below.

Smooth, rapidly decreasing integrands: if the integrand is smooth and has rapidly decreasing values and derivatives, then the trapezoidal rule gives a very good approximation to the integral over large intervals (e.g. consider integrating $e^{-x^4}$ by the trapezoidal rule over $[-R, R]$). The quality of this scheme can be explained by two facts. One is that

\[ \int_R^\infty e^{-x^4}\,dx \le \int_R^\infty \frac{x^3}{R^3} e^{-x^4}\,dx = \frac{1}{4R^3} e^{-R^4}, \]

which decays rapidly. Let $f(x) = e^{-x^4}$. The other fact is that the Euler-Maclaurin formula gives us that if $I$ is the trapezoidal rule approximation of the integral over $[-R, R]$ with step size $h$, we have

\[ I - \int_{-R}^{R} f(x)\,dx = \frac{h^2}{12}\left[f'(R) - f'(-R)\right] - \frac{h^4}{720}\left[f'''(R) - f'''(-R)\right] + \cdots \]

As $f$ and its derivatives vanish quickly as $R \to \infty$, we see that the trapezoidal rule gives a good approximation. Further, we could try to correct the error using the formula above to get a higher order scheme.

Repeated Averaging: for slowly converging, alternating series, one can employ repeated averaging to speed up the convergence. We include the following example from Dahlquist and Björck. We have the following formula

\[ \frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots \]

which is a very slowly converging sum (after 500 terms the value can still change in the third digit). Consider the following scheme

[Table with columns $n$, $S_n$, $M_1$, $M_2$, $M_3$, $M_4$, $M_5$, $M_6$; the numerical entries are not recoverable.]

In the above, we start with a column of the partial sums $S_n$, and the following columns are obtained by averaging each entry's northwest and southwest neighbors. Note that the values oscillate in each column. Generally, it can be shown that if the absolute value of the $j$-th term (as a function of $j$) has a $k$-th derivative which approaches zero monotonically, then the values in column $M_k$ will alternately be above and below the limit of the series. Here, the value in $M_6$ is actually correct to 6 digits.

Euler-Maclaurin formulae: let $\hat T(h)$ be the trapezoidal sum for a function $f$ on the interval $[a, b]$. If $f$ is sufficiently smooth, then

\[ \hat T(h) = \int_a^b f(x)\,dx + \frac{h^2}{12}\left[f'(b) - f'(a)\right] - \frac{h^4}{720}\left[f'''(b) - f'''(a)\right] + \cdots + c_{2r} h^{2r}\left[f^{(2r-1)}(b) - f^{(2r-1)}(a)\right] + O(h^{2r+2}). \]

The constants $c_{2r}$ have the generating function

\[ 1 + c_2 h^2 + c_4 h^4 + \cdots = \frac{h}{2}\,\frac{e^h + 1}{e^h - 1}. \]

Proving the formula is a bit tedious, but we'll mention that it involves repeated integration by parts. Finding the generating function for the constants is simpler: one simply applies the formula to $f(x) = e^x$.

The formula has many uses.

It gives an asymptotic expansion for the error in a trapezoidal integration rule. In particular, this means that Richardson extrapolation (see the ODE section) can be used with the trapezoidal rule to get higher order schemes (called Romberg's method).

It has the application mentioned above for the integral of $e^{-x^4}$, and generally whenever you know the derivatives at the endpoints.

It shows that the convergence as $h \to 0$ of the trapezoidal rule applied to smooth periodic functions (note $f^{(k)}(b) = f^{(k)}(a)$ in this case) is super-algebraic (faster than $h^p$ for any $p$).

It can also be used for the inverse problem, i.e., evaluating sums when the integral is simpler to compute.

9 Nonlinear Equations and Newton's Method

9.1 The Bisection Idea

Given a continuous function $f$, one can use the intermediate value theorem to come up with the following scheme. Given $a_0 < b_0$ such that $f(a_0) < 0$ and $f(b_0) > 0$, we define $a_k, b_k$ iteratively by

    m_{k+1} = (a_k + b_k)/2
    if f(m_{k+1}) > 0:  b_{k+1} = m_{k+1},  a_{k+1} = a_k
    if f(m_{k+1}) < 0:  a_{k+1} = m_{k+1},  b_{k+1} = b_k

We see that the intervals $(a_k, b_k)$ always contain a root and shrink in size by a factor of 2 at each step. This is slow: it takes one iteration to gain a binary digit, so to gain a decimal digit it takes a little over 3 iterations. There are better methods that take better advantage of the values of $f$ (note that bisection only considers the sign) and its derivatives. These methods often need a decent starting position to be effective, and bisection can provide this starting position.

9.2 Newton's Method

The idea behind Newton's method is to obtain a sequence $x_n$ iteratively using a difference formula for the derivative of the function $f$. If the next step is close to a root, we assume $f(x_{n+1}) \approx 0$. Then

\begin{align*}
f'(x_n) &\approx \frac{f(x_n) - f(x_{n+1})}{x_n - x_{n+1}} \\
f'(x_n) &\approx \frac{f(x_n)}{x_n - x_{n+1}} \\
x_{n+1} &\approx x_n - \frac{f(x_n)}{f'(x_n)}
\end{align*}
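As a small illustration of the two ideas above, the following sketch runs a few bisection steps to obtain a starting point and then hands it to Newton's method. The function names `bisect_start` and `newton`, the stopping rule, and the default tolerances are assumptions made for this example only.

```python
import math


def bisect_start(f, a, b, n_steps):
    """Run n_steps of bisection, assuming f(a) < 0 < f(b); return the midpoint."""
    for _ in range(n_steps):
        m = 0.5 * (a + b)
        fm = f(m)
        if fm == 0.0:
            return m          # landed exactly on a root
        if fm > 0.0:
            b = m             # root lies in (a, m)
        else:
            a = m             # root lies in (m, b)
    return 0.5 * (a + b)


def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{n+1} = x_n - f(x_n) / f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        dfx = fprime(x)
        if dfx == 0.0:
            raise ZeroDivisionError("derivative vanished at x = %r" % x)
        x_new = x - f(x) / dfx
        # Stop when the update is small relative to the size of the iterate.
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")


if __name__ == "__main__":
    f = lambda x: x * x - 2.0
    x0 = bisect_start(f, 0.0, 2.0, 5)          # coarse bracket refinement
    root = newton(f, lambda x: 2.0 * x, x0)    # fast local convergence
    print(root, math.sqrt(2.0))
```

Bisection is used here only to get within the region where Newton's method behaves well; the final accuracy comes from the Newton steps.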

Let $\alpha$ be a simple root of a function $f$ (i.e. $f(\alpha) = 0$ and $f'(\alpha) \ne 0$). If your starting $x_0$ is near enough to the root and we have suitable bounds on $f'$ and $f''$, then Newton's method is quadratically convergent. This can be seen by using a Taylor expansion. Let $e_n = \alpha - x_n$ and let $\zeta$ lie between $x_n$ and $\alpha$. Then

\begin{align*}
0 = f(\alpha) &= f(x_n) + e_n f'(x_n) + \frac{e_n^2}{2} f''(\zeta) \\
0 &= \frac{f(x_n)}{f'(x_n)} + e_n + \frac{e_n^2 f''(\zeta)}{2 f'(x_n)} \\
0 &= e_{n+1} + \frac{e_n^2 f''(\zeta)}{2 f'(x_n)} \\
|e_{n+1}| &= |e_n|^2\,\frac{|f''(\zeta)|}{2\,|f'(x_n)|}
\end{align*}

Let $M = \max |f''(x)|$ on the interval $(\alpha - |e_0|, \alpha + |e_0|)$ and assume that $m = \max_x |M/(2f'(x))| < 1/|e_0|$ on that interval. Then we see that

\begin{align*}
|e_n| &\le |e_{n-1}|^2\, m \\
m|e_n| &\le (m|e_{n-1}|)^2 \\
m|e_n| &\le (m|e_{n-2}|)^{2^2} \\
&\;\;\vdots \\
m|e_n| &\le (m|e_0|)^{2^n}
\end{align*}

giving quadratic convergence.

We note that quadratic convergence is significantly better than linear convergence. For each iteration, we double the number of significant digits in the computation. Thus, to get about the same number of digits as a method with linear convergence that took $n$ steps, we only need to do about $\log_2(n)$ steps.

There are many ways in which the method can go wrong. For the function $f(x) = x^{1/3}$ we note that the derivative at zero is infinite. Away from zero we have $f'(x) = \frac{1}{3} x^{-2/3}$. Say our starting point is $x_0$. We then have

\begin{align*}
x_1 &= x_0 - \frac{3x_0^{1/3}}{x_0^{-2/3}} = -2x_0 \\
x_2 &= -2x_1 = 4x_0 \\
x_3 &= -2x_2 = -8x_0
\end{align*}

So we see that for any nonzero starting guess, the method diverges away from the zero. For the function $f(x) = x^2$ we have a double root at zero. We note that $f'(x) = 2x$ and

\begin{align*}
x_{n+1} &= x_n - \frac{f(x_n)}{f'(x_n)} \\
x_{n+1} &= x_n - \frac{x_n^2}{2x_n} = \frac{x_n}{2}
\end{align*}

so we see that the method still converges to zero, but at a rate as slow as the bisection method.

9.3 The Secant Method

If the process of evaluating the derivative of $f$ is impossible or prohibitively expensive, one can instead use the secant method. It requires two starting values but only involves one new function evaluation per step. The main idea is to use Newton's method but to approximate the value of $f'(x_n)$ by a difference formula. You get

\[ x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}. \]

You can again get an idea of the rate of convergence of the method by considering a Taylor expansion. If $e_n$ is the error at each step and you make similar assumptions on $f$ as for Newton's method, you can find $|e_{n+1}| \approx K |e_n|\,|e_{n-1}|$. Interestingly, one can find that the order of convergence is given by the golden ratio $(1 + \sqrt{5})/2$.

10 Numerical ODE

10.1 The Set-Up

For the following, we assume that we are dealing with a system of first-order ODEs. This is general enough to cover higher order ODEs because, given a higher-order problem, we can convert it into a first-order system as follows. Let

\[ y^{(n)} = f(t, y, y', \ldots, y^{(n-1)}). \]

Then we introduce new variables $v_1, \ldots, v_n$ with $v_i = y^{(i-1)}$. We have

\[ \frac{d}{dt}\begin{pmatrix} v_1 \\ \vdots \\ v_{n-1} \\ v_n \end{pmatrix} = \begin{pmatrix} v_2 \\ \vdots \\ v_n \\ f(t, v_1, \ldots, v_n) \end{pmatrix}. \]
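The reduction to a first-order system is mechanical, as the following sketch suggests. The helper name `as_first_order_system` is hypothetical; it takes the scalar right hand side $f(t, y, y', \ldots, y^{(n-1)})$ and returns the vector right hand side acting on $v = (v_1, \ldots, v_n)$.

```python
def as_first_order_system(f, n):
    """Return F(t, v) for v = (y, y', ..., y^(n-1)), given y^(n) = f(t, y, ..., y^(n-1))."""
    def F(t, v):
        if len(v) != n:
            raise ValueError("expected a state vector of length %d" % n)
        # d/dt v_i = v_{i+1} for i < n, and d/dt v_n = f(t, v_1, ..., v_n).
        return list(v[1:]) + [f(t, *v)]
    return F


if __name__ == "__main__":
    # Harmonic oscillator y'' = -y written as a system in (y, y').
    F = as_first_order_system(lambda t, y, yp: -y, 2)
    print(F(0.0, [1.0, 0.0]))
```

Any solver written for first-order systems can then be applied to `F` directly.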

Further, we often assume that the equation is autonomous (no explicit $t$ dependence), as $t$ can be easily absorbed into the system as follows. Let

\[ y' = f(t, y). \]

Then if we set $v_1 = y$ and $v_2 = t$ we have

\[ \frac{d}{dt}\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} f(v_2, v_1) \\ 1 \end{pmatrix}. \]

10.2 It's Almost Always this Idea

To show the order (and hence convergence) of many ODE methods, one can generally follow these steps:

Estimate the local truncation error, i.e., the error that is accrued in one step if you start at the correct location. The standard way to find this error is through a Taylor expansion.

Derive a relation between the error at the $i$-th step and the $(i-1)$-st step. This is often accomplished by assuming some sort of Lipschitz continuity on the data.

Solve the relation for the error at the $i$-th step in terms of the starting error.

As an example, we perform the above steps for Euler's method. Euler's method for the ODE $\dot x = f(x)$ makes the approximations as follows:

\[ x_{i+1} = x_i + \Delta t\, f(x_i). \]

Let $y(t)$ be a solution to the ODE. If we assume that we start at $y_i = y(t_i)$, then the local truncation error can be found by Taylor expanding $y(t_{i+1})$. We have

\begin{align*}
y(t_{i+1}) &= y(t_i) + \Delta t\, y'(t_i) + \frac{\Delta t^2}{2} y''(\xi) \\
&= y_i + \Delta t\, f(y_i) + \frac{\Delta t^2}{2} y''(\xi) \\
&= y_{i+1} + \frac{\Delta t^2}{2} y''(\xi),
\end{align*}

giving the local truncation error

\[ \tau_i = y(t_{i+1}) - y_{i+1} = \frac{\Delta t^2}{2} y''(\xi). \]

Next, we derive a difference relation for the error. Let $x_i$ be the computed solution. We have

\begin{align*}
y(t_i) &= y(t_{i-1}) + \Delta t\, f(t_{i-1}, y(t_{i-1})) + \tau_{i-1} \\
x_i &= x_{i-1} + \Delta t\, f(t_{i-1}, x_{i-1})
\end{align*}

Subtracting, and writing $e_i = y(t_i) - x_i$,

\begin{align*}
y(t_i) - x_i &= y(t_{i-1}) - x_{i-1} + \Delta t\left[f(t_{i-1}, y(t_{i-1})) - f(t_{i-1}, x_{i-1})\right] + \tau_{i-1} \\
|e_i| &\le |e_{i-1}| + \Delta t\left|f(t_{i-1}, y(t_{i-1})) - f(t_{i-1}, x_{i-1})\right| + |\tau_{i-1}| \\
&\le (1 + \Delta t L)\,|e_{i-1}| + |\tau_{i-1}|
\end{align*}

where the last line assumes a Lipschitz condition on $f$ with constant $L$ (note this is stronger than necessary but makes the calculation easier). Finally, we solve this difference relation in terms of the starting error:

\begin{align*}
|e_i| &\le (1 + \Delta t L)\,|e_{i-1}| + |\tau_{i-1}| \\
&\le (1 + \Delta t L)^2\,|e_{i-2}| + (1 + \Delta t L)\,|\tau_{i-2}| + |\tau_{i-1}| \\
&\;\;\vdots \\
&\le (1 + \Delta t L)^i\,|e_0| + \sum_{p=0}^{i-1} (1 + \Delta t L)^p\,|\tau_{i-p-1}|
\end{align*}

Now, we assume that we have a bounded second derivative on the solution; say it's bounded by $2M$, so that $|\tau_i| \le M\Delta t^2$. Let $N = (T - t_0)/\Delta t$. Then

\begin{align*}
|e_N| &\le (1 + \Delta t L)^N\,|e_0| + \sum_{p=0}^{N-1} (1 + \Delta t L)^p\,|\tau_{N-p-1}| \\
&\le e^{\Delta t L N}\,|e_0| + M\Delta t^2 \sum_{p=0}^{N-1} (1 + \Delta t L)^p \\
&\le e^{L(T - t_0)}\,|e_0| + M\Delta t^2\,\frac{(1 + \Delta t L)^N - 1}{\Delta t L} \\
&\le e^{L(T - t_0)}\,|e_0| + C\Delta t
\end{align*}

Thus, if the starting error is $O(\Delta t)$, we have a first order approximation.

10.3 A Note on Conditioning

ODEs often have a family of solution curves depending on the initial conditions. From this vantage point, we can describe the conditioning of an ODE as follows: if the curves in the family of solutions depart from each other rapidly, then the initial-value problem is ill-conditioned; otherwise, the problem is well-conditioned. We note that this definition depends only on the ODE and does not consider conditioning issues specific to the algorithm used to solve the ODE. Also, it should be clear that this definition of conditioning is distinct from that given for numerical linear algebra problems. To get a more quantitative idea of this definition, we consider the ODE

\[ \dot x = f(x) \]

and we let $F(t, x_0)$ be the solution operator for initial condition $x(0) = x_0$. If $\frac{d}{dx} f \le L$ for some $L$ (which can be negative), we have

\begin{align*}
F(t, x_0) - F(t, x_0 + h) &= x_0 + \int_0^t f(F(s, x_0))\,ds - x_0 - h - \int_0^t f(F(s, x_0 + h))\,ds \\
|F(t, x_0) - F(t, x_0 + h)| &\le |h| + \int_0^t \left|f(F(s, x_0)) - f(F(s, x_0 + h))\right| ds \\
&\le |h| + L \int_0^t \left|F(s, x_0) - F(s, x_0 + h)\right| ds \\
|F(t, x_0) - F(t, x_0 + h)| &\le |h|\,e^{Lt}
\end{align*}

where the last line follows by Grönwall's inequality. Equivalently, we have

\[ \frac{|F(t, x_0 + h) - F(t, x_0)|}{|h|} \le e^{Lt}. \]

Thus, for well-conditioned ODEs (small $L$ or $L < 0$) we have that the solutions for nearby starting conditions stay close together (in the above sense).

10.4 Richardson Extrapolation

Explanation

The set-up for Richardson extrapolation is that we have an algorithm that computes $A(h)$ depending on the step size $h$. Further, this algorithm is designed to compute the value $a$ and we assume it has the following asymptotic expansion

\[ A(h) = a + a_1 h^{p_1} + a_2 h^{p_2} + a_3 h^{p_3} + \cdots \]

for increasing $p_i$. We then build better approximations using $A(h)$ as follows. Let $A_1(h) = A(h)$ and

\[ A_{k+1}(h) = A_k(h) + \frac{A_k(h) - A_k(2h)}{2^{p_k} - 1}. \]

Then $A_n(h)$ has the form

\[ A_n(h) = a + c_n h^{p_n} + c_{n+1} h^{p_{n+1}} + \cdots \]

We include the following example problem to give a fuller understanding of the process.

Explain how to construct high-order initial values for an Adams method using the first order forward Euler method and Richardson extrapolation. Does this significantly increase the cost of integrating an ODE up to some time $T$?

Solution: We note that the forward Euler method has an asymptotic error expansion. Let $A$ denote the solution to the ODE at time $t = h$ and let $A(h)$ be the numerical approximation to $A$ using a time step of length $h$. We can write

\[ A(h) = A + A_1 h + A_2 h^2 + A_3 h^3 + \cdots \]

because forward Euler is a first order method. We can combine the results of two of these first order approximations to obtain a second-order approximation by

\[ A^{(2)}(h) = 2A(h/2) - A(h) = A - \frac{1}{2} A_2 h^2 + \text{higher order terms}. \]

It is clear that this process can be continued to get higher order approximations defined iteratively:

\[ A^{(p)}(h) = \frac{2^{p-1} A^{(p-1)}(h/2) - A^{(p-1)}(h)}{2^{p-1} - 1} = A + C_p h^p + \text{higher order terms}. \]

If this recursion is unrolled, we see that it is simply a linear combination of values given by the original Euler method for $p$ different step sizes: $A(h), A(h/2), A(h/2^2), \ldots, A(h/2^{p-1})$. Evaluating $A(h/2^n)$ at time $t = h$ involves $2^n$ steps of forward Euler, so you have $2^n$ function evaluations (and additions). Thus, the total work to get these values is

\[ \sum_{n=0}^{p-1} 2^n = \frac{2^p - 1}{2 - 1} = 2^p - 1. \]

A $p$-th order Adams-Bashforth method needs $p$ initial values, so given the initial value and using the above steps for the remaining values we have $(p-1)(2^p - 1)$ extra function evaluations. To compare, the Adams-Bashforth method will require $p$ function evaluations per step; however, only one of these will be new. Thus, to get an approximation at time $T$ this will require about

$T/h$ function evaluations. Thus, if $(p-1)(2^p - 1)$ is significantly smaller than $T/h$, the extra work for the initial values will not have much effect on the total amount of work. We note that as $h \to 0$ this will be true.

Uses

As outlined above, the Richardson extrapolation process can be used to gain higher-order approximations from a lower-order method. Further, a programmer can approximate the order of convergence that their code is achieving by using an idea similar to Richardson extrapolation. If we knew the actual answer, we could look at

\[ \frac{A(2h) - a}{A(h) - a} = \frac{2^{p_1} h^{p_1} a_1 + 2^{p_2} h^{p_2} a_2 + \cdots}{h^{p_1} a_1 + h^{p_2} a_2 + \cdots} \approx 2^{p_1} \quad \text{for small } h. \]

If we don't know the answer, we can consider instead

\[ \frac{A(4h) - A(2h)}{A(2h) - A(h)} \approx 2^{p_1} \quad \text{for small } h. \]

This fact is useful for debugging.

A Word of Warning

The above depends not on having a method of a certain order; rather, it depends strongly on the fact that the method has an asymptotic expansion.

10.5 Modified Equation

One can gain insight into the qualitative nature of the solution obtained by a particular numerical method by computing what is called the modified equation. The modified equation is an ODE that your computed solution satisfies to higher order than the original ODE. We give an example. If we use Forward Euler to approximate the solution of $\dot x = f$, i.e.

\[ x_{n+1} = x_n + \Delta t\, f(x_n), \]

which ODE does $x_n$ satisfy to second order (i.e. third order local truncation error)? We assume that we can write

\[ \dot x = f(x) + \Delta t\, f_1(x) + \cdots \]

Then we integrate

\begin{align*}
x(t_{n+1}) &= x(t_n) + \int_{t_n}^{t_{n+1}} \left[f(x(t)) + \Delta t\, f_1(x(t)) + O(\Delta t^2)\right] dt \\
&= x(t_n) + \int_{t_n}^{t_{n+1}} \left[f(x(t_n)) + (t - t_n) f'(x(t_n)) f(x(t_n)) + \Delta t\, f_1(x(t_n)) + O(\Delta t^2)\right] dt \\
&= x_n + \Delta t\, f(x_n) + \frac{\Delta t^2}{2} f'(x_n) f(x_n) + \Delta t^2 f_1(x_n) + O(\Delta t^3)
\end{align*}

Thus, if $f_1(x(t)) = -\frac{1}{2} f'(x(t)) f(x(t))$, then we have

\[ x(t_{n+1}) = x_n + \Delta t\, f(x_n) + O(\Delta t^3) = x_{n+1} + O(\Delta t^3) \]

and we see that Forward Euler for $\dot x = f$ is a second order method for

\[ \dot x = f(x) - \frac{\Delta t}{2} f'(x) f(x). \]

We then apply this to a specific problem to see what we can expect in terms of the error. Consider the harmonic oscillator

\[ \frac{d}{dt}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_2 \\ -x_1 \end{pmatrix}. \]

We then have that

\[ f' f = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}\begin{pmatrix} x_2 \\ -x_1 \end{pmatrix} = \begin{pmatrix} -x_1 \\ -x_2 \end{pmatrix}. \]

So Forward Euler solves

\[ \frac{d}{dt}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_2 \\ -x_1 \end{pmatrix} + \frac{\Delta t}{2}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \]

to second order. Normally the harmonic oscillator starting at $(x_1, x_2) = (0, 1)$ has the solution $(\sin t, \cos t)$. However, the modified equation has the solution $e^{t\Delta t/2}(\sin t, \cos t)$. Thus, we can qualitatively expect the solution computed by Forward Euler to spiral outward from the correct solution (while staying relatively in phase).

10.6 Splitting Schemes

Sometimes, it is desirable to split the right hand side of a differential equation into two parts, giving two ODEs, each of which is easier to solve by a numerical method. However, it is not immediately

obvious that we could combine the solutions to the two new equations in a way that gives a solution to the original equation. Strang splitting provides a second order way of doing so. Suppose the ODE

\[ \dot x = f(x) \]

has solution operator $x(t) = F(t, x(0))$. Suppose that

\[ \dot x = g(x) \]

has solution operator $x(t) = G(t, x(0))$. Consider the combined ODE

\[ \dot x = f(x) + g(x). \]

Consider the Strang splitting scheme

\begin{align*}
y_{n+1/2} &= F(\Delta t/2, x_n) \\
z_{n+1/2} &= G(\Delta t, y_{n+1/2}) \\
x_{n+1} &= F(\Delta t/2, z_{n+1/2})
\end{align*}

This produces a second order accurate approximation, which can be seen by taking Taylor expansions. We note that if we have approximate solution operators $\tilde F$ and $\tilde G$ instead of exact solution operators (e.g. numerical methods), the Strang splitting solution is second order if $\tilde F$ and $\tilde G$ are second order.

At first glance, the method seems a little inefficient in that $F$ or $\tilde F$ is evaluated twice per step. However, we note the following. If the ODE has unique solutions (e.g. $f$ is Lipschitz), then we have $F(\Delta t/2, F(\Delta t/2, x)) = F(\Delta t, x)$. To see this, let $x(t)$ be the solution starting at $x$ at time $0$. Then

\begin{align*}
x(\Delta t) &= F(\Delta t/2, x(\Delta t/2)) = F(\Delta t/2, F(\Delta t/2, x)) \\
x(\Delta t) &= F(\Delta t, x)
\end{align*}

So we have equality. Then we note that as we run the algorithm, we really only need to keep track of the $y_{n+1/2}$ and $z_{n+1/2}$ variables (until we calculate $x_N$ at the last step). We see that

\[ y_{n+3/2} = F(\Delta t/2, x_{n+1}) = F(\Delta t/2, F(\Delta t/2, z_{n+1/2})) = F(\Delta t, z_{n+1/2}). \]

Thus, aside from the beginning and end iterations, the algorithm can go as

\begin{align*}
y_{n+1/2} &= F(\Delta t, z_{n-1/2}) \\
z_{n+1/2} &= G(\Delta t, y_{n+1/2})
\end{align*}

which eliminates one of the evaluations of $F$ per time step.

An example of a situation in which it is desirable to use splitting arises in numerical PDE. Because of the spectral properties of the discrete Laplacian, it is a good idea to use an implicit method for the time-stepping. However, the matrix for the full Laplacian has a rather large bandwidth. In

contrast, if we split the Laplacian into its $xx$ and $yy$ parts, we have (perhaps after reordering) two tridiagonal systems. These are much easier to invert, which we would have to do in the case of an implicit method (e.g. the trapezoidal rule).

10.7 Linear Multi-Step Methods (LMM)

Linear multi-step methods store previously computed values of $x_n$ and $f(x_n)$ and use these to form higher order approximations to the next step than would be possible by storing only the previous value. They can be either implicit or explicit.

General Explicit LMM

An explicit linear multistep method with $p$ lags can be written as

\[ x_{n+1} = \alpha_0 x_n + \cdots + \alpha_p x_{n-p} + \Delta t\left(\beta_0 f(x_n) + \cdots + \beta_p f(x_{n-p})\right). \]

They are called linear because they are linear in $f$. By contrast, Runge-Kutta methods are not linear in $f$.

Consistency, Zero-Stability, and the Root Condition

There is a quick, simple check for consistency. In the case where $f = 0$, we should hope that $x = c$, a constant, satisfies the recurrence of the method. We have

\[ c = \alpha_0 c + \cdots + \alpha_p c, \]

so $\sum_{i=0}^{p} \alpha_i = 1$ is a quick test of consistency.

The standard way to check the stability of an LMM is to consider its zero-stability. That is, you check what happens when you apply the method to $\dot x = 0$. The first observation in this case is that the $x_i$ satisfy a linear recurrence relation

\[ x_{n+1} = \alpha_0 x_n + \cdots + \alpha_p x_{n-p}. \]

We then seek solutions to the recurrence of the form $x_n = z^n$. This gives

\begin{align*}
z^{n+1} &= \alpha_0 z^n + \cdots + \alpha_p z^{n-p} \\
z^{p+1} &= \alpha_0 z^p + \cdots + \alpha_p
\end{align*}

We then define

\[ f(z) = z^{p+1} - \alpha_0 z^p - \cdots - \alpha_p \]

to be the characteristic function. The solutions of the recurrence correspond to roots of the characteristic polynomial. It is immediate from the consistency condition that, for a consistent method, $z = 1$ is a root of $f$. We see that if $|z| > 1$ for some root, then $x_n = z^n$ is a solution which grows without bound (a sign of instability). If $|z| \le 1$, there are two cases. Simple roots are fine and lead to bounded solutions $x_n$. If you have a double root $z_0$, then

\[ f(z_0) = 0, \qquad f'(z_0) = 0, \]

so we have that $x_n = n z_0^n$ is a solution. To see this:

\begin{align*}
x_{n+1} - \alpha_0 x_n - \cdots - \alpha_p x_{n-p} &= (n+1) z_0^{n+1} - n\alpha_0 z_0^n - \cdots - (n-p)\alpha_p z_0^{n-p} \\
&= (n-p) z_0^{n-p} f(z_0) + \left[(p+1) z_0^{n+1} - p\alpha_0 z_0^n - \cdots - \alpha_{p-1} z_0^{n-p+1}\right] \\
&= 0 + z_0^{n-p+1}\left((p+1) z_0^p - p\alpha_0 z_0^{p-1} - \cdots - \alpha_{p-1}\right) \\
&= z_0^{n-p+1} f'(z_0) = 0
\end{align*}

So we see that multiple roots are a problem when $|z_0| = 1$. Further, higher multiplicity roots give rise to similar solutions, so we see that $|z_0| < 1$ is fine for multiple roots. These notions give rise to the following definition.

The Root Condition: A linear recurrence satisfies the root condition if each root $z$ of the characteristic polynomial satisfies either $|z| < 1$ or is simple and $|z| = 1$.

Convergence

The ideas above can be used to establish the convergence of an LMM. The result we will use is that a matrix $A$ is power bounded ($\|A^n\| \le C$ independent of $n$) if and only if its minimal polynomial satisfies the root condition (a statement about the Jordan form of $A$: in particular, no Jordan blocks of size bigger than one for $|\lambda| = 1$ and no eigenvalues of magnitude larger than 1). The converse direction is proved by constructing a norm under which $A$ is a contraction and then using the equivalence of norms to get power boundedness. The following is taken directly from a homework problem for Numerical Methods II at Courant:

Use the result to prove convergence of linear multistep methods. More precisely, suppose the characteristic polynomial of the multistep method satisfies the root condition. Use a norm in which the companion matrix is a contraction. Let $X_n = (x_n, \ldots, x_{n-p})$. Then $X_{n+1} = \tilde A X_n + \Delta t\, F(X_n)$, where $\tilde A$ is $d$ copies of $A$, one for each component of $x$. You will be able to show that if $Y_{n+1} = \tilde A Y_n + \Delta t\, F(Y_n) + \Delta t\, R_n$, then

\[ \|Y_n - X_n\| \le C(t_n)\left[\|X_0 - Y_0\| + \Delta t \sum_{k < n} C(t_n - t_k)\,\|R_k\|\right] \]

where $\|\cdot\|$ is the contraction norm for $\tilde A$. Use this to conclude that if the method has formal order of accuracy $q$ and is stable by the root condition criterion, then it actually is accurate with order $q$, provided the initial steps are done accurately enough.

Solution: We calculate the characteristic polynomial of the matrix $A$ (here $a_k$ denotes the coefficient $\alpha_k$ of the method). We have

\[ A - zI = \begin{pmatrix} a_0 - z & a_1 & a_2 & \cdots & a_{p-1} & a_p \\ 1 & -z & 0 & \cdots & 0 & 0 \\ 0 & 1 & -z & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & -z \end{pmatrix}. \]

Expanding along the first row, this gives that

\begin{align*}
\det(A - zI) &= (a_0 - z)(-z)^p + \sum_{k=1}^{p} (-1)^k a_k (-z)^{p-k} \\
&= (-1)^{p+1} z^{p+1} + (-1)^p \sum_{k=0}^{p} a_k z^{p-k}
\end{align*}

Thus, we have that the characteristic polynomial of $A$ is (up to a sign) equal to the characteristic polynomial of the linear recurrence. Thus, if the characteristic polynomial of the linear recurrence satisfies the root condition (roots are either simple or have magnitude less than 1), then the eigenvalues of $A$ have magnitude less than or equal to 1. Further, as the minimal polynomial of a matrix must divide the characteristic polynomial, we see that the roots with magnitude one are simple for the minimal polynomial. In particular, the size of the largest Jordan block for such a root must be 1. Thus, the matrix $A$ satisfies the root condition for matrices. From problem 1, we see that there is a norm $\|\cdot\|_1$ such that $A$ is a contraction.

Let $X^{(j)}$ be the vector consisting of the $j$-th element in each of the vectors $x_n, \ldots, x_{n-p}$. For each $j \in \{1, \ldots, d\}$, $\tilde A$ acts on this block as $X^{(j)} \mapsto A X^{(j)}$. Let $\|X\| = \left(\|X^{(1)}\|_1^2 + \cdots + \|X^{(d)}\|_1^2\right)^{1/2}$. We note that with this norm, we have

\[ \|\tilde A X\| = \left(\|A X^{(1)}\|_1^2 + \cdots + \|A X^{(d)}\|_1^2\right)^{1/2} \le \left(\|X^{(1)}\|_1^2 + \cdots + \|X^{(d)}\|_1^2\right)^{1/2} = \|X\| \]

so $\tilde A$ is a contraction under $\|\cdot\|$.

Next, we assume that $F$ satisfies some suitable Lipschitz condition, say $\|F(Y) - F(X)\| \le M\|X - Y\|$ for $M < \infty$. Then we have

\begin{align*}
\|Y_{n+1} - X_{n+1}\| &= \|\tilde A(Y_n - X_n) + \Delta t\,(F(Y_n) - F(X_n)) + \Delta t\, R_n\| \\
&\le \|\tilde A(Y_n - X_n)\| + \Delta t\,\|F(X_n) - F(Y_n)\| + \Delta t\,\|R_n\| \\
&\le (1 + M\Delta t)\,\|Y_n - X_n\| + \Delta t\,\|R_n\|
\end{align*}

We use the semi-group approach. Define $S_{n,k}$ for $n \ge k$ by

\begin{align*}
S_{k,k} &= 1 \\
S_{n+1,k} &= (1 + M\Delta t)\, S_{n,k}
\end{align*}

It is clear that $S_{n,k} \le e^{M(t_n - t_k)}$. Assume the initial data is $(x_0, \ldots, x_p)$. Let $R = \max_{k \le n-1} \|R_k\|$. We then have

\begin{align*}
\|Y_n - X_n\| &\le S_{n,0}\,\|X_0 - Y_0\| + \sum_{k \le n-1} S_{n,k+1}\,\Delta t\,\|R_k\| \\
&\le e^{M t_n}\,\|X_0 - Y_0\| + \sum_{k \le n-1} \Delta t\, e^{M(t_n - t_{k+1})}\,\|R_k\| \\
&\le e^{M t_n}\,\|X_0 - Y_0\| + \int_0^{t_n} e^{M(t_n - t)} R\,dt \\
&= e^{M t_n}\,\|X_0 - Y_0\| + \frac{1}{M}\left(e^{M t_n} - 1\right) R
\end{align*}

(The sum is a right-endpoint Riemann sum of a decreasing function, so it is bounded by the integral.) Say $R = O(\Delta t^q)$. We see that if $\|X_0 - Y_0\|$ is $O(\Delta t^q)$, then the method is $O(\Delta t^q)$.

Maximum Order but Unstable Example and The Dahlquist Barriers

We begin with an example. The LMM described by

\[ x_{n+1} = -4x_n + 5x_{n-1} + \Delta t\left(4f(x_n) + 2f(x_{n-1})\right) \]

has a local truncation error of order $\Delta t^4$ (so it has formal 3rd order accuracy). However, its recurrence polynomial is

\[ z^2 + 4z - 5, \]

which has $z = -5$ as a root. This method is quite unstable.

Thus, it seems that even though we have the degrees of freedom to come up with a higher order scheme, it won't necessarily be stable. There is a theorem that summarizes this idea.

The First Dahlquist Barrier: an explicit LMM with $p$ lags has order less than or equal to $p + 1$ or is unstable.

We will mention the Second Dahlquist Barrier in the section on stability analysis for stiff equations.
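The instability of the example above is easy to observe numerically. The sketch below applies a generic two-step explicit LMM to the test problem $\dot x = -x$, $x(0) = 1$, with an exact second starting value. The helper name `two_step_lmm`, the test problem, and the step size are assumptions chosen for illustration. The same routine is also run with the coefficients of the stable two-point Adams-Bashforth method, $x_{n+1} = x_n + \frac{\Delta t}{2}(3f(x_n) - f(x_{n-1}))$, for comparison.

```python
import math


def two_step_lmm(alpha, beta, f, x0, x1, dt, n_steps):
    """Advance x_{n+1} = alpha[0]*x_n + alpha[1]*x_{n-1}
                         + dt*(beta[0]*f(x_n) + beta[1]*f(x_{n-1}))
    from the starting values x0, x1 and return the value after n_steps steps."""
    x_prev, x_curr = x0, x1
    for _ in range(n_steps - 1):
        x_next = (alpha[0] * x_curr + alpha[1] * x_prev
                  + dt * (beta[0] * f(x_curr) + beta[1] * f(x_prev)))
        x_prev, x_curr = x_curr, x_next
    return x_curr


if __name__ == "__main__":
    f = lambda x: -x
    dt, n = 0.01, 100                       # integrate to T = 1
    x0, x1 = 1.0, math.exp(-dt)             # exact starting values
    exact = math.exp(-1.0)

    # Formally third order, but the root z = -5 amplifies every perturbation.
    unstable = two_step_lmm((-4.0, 5.0), (4.0, 2.0), f, x0, x1, dt, n)
    # Two-point Adams-Bashforth: second order and zero-stable.
    ab2 = two_step_lmm((1.0, 0.0), (1.5, -0.5), f, x0, x1, dt, n)

    print("unstable method error:", abs(unstable - exact))
    print("Adams-Bashforth error:", abs(ab2 - exact))
```

Even with exact starting values, rounding and truncation errors excite the parasitic root, and the error of the formally higher order method grows roughly like $5^n$.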

Adaptive LMM

Adaptive time stepping can be very beneficial when it comes to speeding up an algorithm. If the right hand side of an ODE is smooth over a long region, it is often possible to get an accurate approximation with long time steps. Similarly, in highly oscillatory or steep regions, a shorter time step is often required. Thus, we would like the ability to adjust the time step of an algorithm as necessary. It is a simple matter to make an LMM adaptive; however, you do have to allow for the flexibility in the time steps when you derive the methods. The factors by which you allow yourself to adjust the step up (say by $\gamma$) and down (say by $\delta$) affect the constant in the error bound you ultimately achieve. In class, it was mentioned that a good rule of thumb was to not adjust up by more than a factor of 2 or down by a factor smaller than 1/…

Named Methods

Modified Midpoint (a Nystrom Method). This method is similar to the midpoint method that arises from Runge-Kutta. Its formula is

\[ x_{n+1} = x_{n-1} + 2\Delta t\, f(x_n). \]

It is second order (a result of symmetry). As it is a Nystrom method, the time step must be uniform.

Adams-Bashforth. These methods use the value of $x_n$ (and no other previous $x_j$ values) and the values of $f(x_n), \ldots, f(x_{n-p})$. They are derived by considering

\[ x(t_{n+1}) = x(t_n) + \int_{t_n}^{t_{n+1}} f(x(t))\,dt. \]

For the approximation, $f(x(t))$ is replaced by the interpolating polynomial $p(t)$ that goes through the values $f(x_n), \ldots, f(x_{n-p})$ at times $t_n, \ldots, t_{n-p}$. The A-B formula with two points is

\[ x_{n+1} = x_n + \frac{\Delta t}{2}\left(3f(x_n) - f(x_{n-1})\right) \]

and is second order. (In general you get order $p$ with $p$ points, or $p - 1$ lags.)

Adams-Moulton. Adams-Moulton methods take a similar form to Adams-Bashforth but also use the value of $f(x_{n+1})$. The coefficients are then chosen to get the highest order possible (which, in general, is one higher than Adams-Bashforth).

Nystrom. Nystrom methods are of the form $x_{n+1} = x_{n-1} + \Delta t \sum_j \beta_j f(x_{n-j})$. They are derived by integrating over the interval $[t_{n-1}, t_{n+1}]$ (twice as long as A-B). Nystrom methods are often of higher order than Adams methods, but they lack the flexibility. The higher order of Nystrom depends on a fixed time step, and thus they are not commonly used.

Backward Differentiation Formula (BDF). These methods are quite different from Adams-Bashforth in that they use many of the previous $x_j$ values to approximate $f$ at the next step. Therefore they are implicit. They look like

\[ x_{n+1} = \alpha_0 x_n + \cdots + \alpha_p x_{n-p} + \beta\,\Delta t\, f(x_{n+1}). \]

Backward Euler is an example.

Heuristics

Implicit schemes are often of a higher order of accuracy than explicit schemes (in addition to being more stable). A way to gain the accuracy of an implicit scheme without having to solve the implicit equation for $f$ at the current time is to use a predictor-corrector scheme. The idea is to use the result of the $p$-th order explicit scheme as the value of $f$ at the current time in the implicit scheme. The result is often of the same order of accuracy as the original implicit scheme.

In contrast with Runge-Kutta, an advantage of LMM is that it requires only one new function evaluation per step.

Another difference between the schemes is that Runge-Kutta doesn't really have the same zero-stability considerations. That is, there isn't an equivalent of the root condition for Runge-Kutta.

10.8 Runge-Kutta

Runge-Kutta schemes are another popular family of ODE solvers. They are based on the idea of sampling multiple tangents; that is, you move along the curve to a new spot and sample the tangent there for another approximation to the derivative over the next time step. We make this idea clearer in the next section.

Definition

The stages of a Runge-Kutta method refer to how many of these tangents we sample. A general $s$ stage Runge-Kutta method is defined by

\[ \begin{array}{c|cccc} c_1 & a_{11} & a_{12} & \cdots & a_{1s} \\ c_2 & a_{21} & a_{22} & \cdots & a_{2s} \\ \vdots & \vdots & \vdots & & \vdots \\ c_s & a_{s1} & a_{s2} & \cdots & a_{ss} \\ \hline & b_1 & b_2 & \cdots & b_s \end{array} \;=\; \begin{array}{c|c} c & A \\ \hline & b^T \end{array} \]

The above mnemonic is known as a Butcher tableau. It specifies the scheme as follows:

\begin{align*}
K_i &= f\Big(t_n + c_i \Delta t,\; y_n + \Delta t \sum_{j=1}^{s} a_{ij} K_j\Big) \quad \text{for each } i = 1, \ldots, s \\
y_{n+1} &= y_n + \Delta t \sum_{i=1}^{s} b_i K_i
\end{align*}

We see immediately that if $a_{ij} \ne 0$ for some $j \ge i$, then the scheme is implicit. In this case, one might have to solve a system of nonlinear equations in the $K_i$. We'll get back to implicit schemes later.

If $A$ is strictly lower triangular (i.e. if $a_{ij} = 0$ for $j \ge i$), then the scheme is explicit. It is easier to see in this case what the method does. We have

\[ K_i = f\Big(t_n + c_i \Delta t,\; y_n + \Delta t \sum_{j=1}^{i-1} a_{ij} K_j\Big). \]

If we think of $K_i$ as the $i$-th tangent we sample, then we see that the method uses the previous tangents to step a certain distance along the curve and evaluates $f$ at that point along the curve at the time $t_n + c_i \Delta t$. With this characterization, the consistency condition is clear: $c_i = \sum_j a_{ij}$.

Implicit Methods

Dirk, a kind of dagger used in the highlands of Scotland

Implicit methods allow for any form of $A$. We note that for arbitrary $A$ this gives a system of implicit equations in the $K_i$, which is rather undesirable. One simplification is to limit your choices to implicit methods whose matrix $A$ is lower triangular, i.e., $a_{ij} = 0$ for $j > i$. These are called diagonally implicit Runge-Kutta (DIRK) methods. In this case, we have an implicit equation for $K_1$ alone. After we solve for $K_1$, there is an implicit equation for $K_2$ alone, and so on. In this situation, finding the $K_i$ is much simpler. To highlight this, we consider the rare case in which we must solve a linear system for the $K_i$. If the spatial dimension is $n$, then we must solve a system of linear equations in $sn$ unknowns for general implicit methods. If we use Gaussian elimination, this requires $s^3 n^3$ work. For DIRK, we must solve $s$ linear systems in $n$ unknowns. If we use Gaussian elimination for this as well, we have $s n^3$ work, which is quite a bit better.

These restrictions are in contrast to implicit LMM methods, where you only have one system of $n$ unknowns for spatial dimension $n$.

There also exist implicit schemes of very high order. These are the Gauss-Legendre methods, and they use Gaussian quadrature ideas to achieve order $2s$ with an $s$ stage scheme. They are also A-stable, which will be elaborated in the section on ODE stability.
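To make the order claim concrete, here is a small numerical check for the simplest case, $s = 1$. The one-stage Gauss-Legendre method is the implicit midpoint rule, with tableau $c_1 = 1/2$, $a_{11} = 1/2$, $b_1 = 1$. The sketch applies it to the linear test problem $\dot y = \lambda y$, where the implicit stage equation $K = \lambda(y_n + \frac{\Delta t}{2} K)$ can be solved by hand. The function names and the choice of test problem are assumptions made for this illustration.

```python
import math


def implicit_midpoint_linear(lam, y0, dt, n_steps):
    """One-stage Gauss-Legendre (implicit midpoint) applied to y' = lam * y."""
    y = y0
    for _ in range(n_steps):
        # Stage equation K = lam * (y + dt/2 * K), solved exactly for K.
        k = lam * y / (1.0 - 0.5 * dt * lam)
        y = y + dt * k
    return y


def observed_order(lam=-1.0, t_final=1.0, n=50):
    """Estimate the order from the error ratio when the step size is halved."""
    exact = math.exp(lam * t_final)
    e_coarse = abs(implicit_midpoint_linear(lam, 1.0, t_final / n, n) - exact)
    e_fine = abs(implicit_midpoint_linear(lam, 1.0, t_final / (2 * n), 2 * n) - exact)
    return math.log2(e_coarse / e_fine)


if __name__ == "__main__":
    print("observed order:", observed_order())
```

The observed order should come out close to $2 = 2s$. For a nonlinear $f$, the stage equation would instead have to be solved iteratively (for example with Newton's method).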

Adaptive Methods

We see from the equations that Runge-Kutta is especially easy to set up as an adaptive method because the previous time steps are not involved in the formulation. The time step is often adjusted adaptively by approximating the error accrued at each step and adjusting the step size up or down based on some threshold for the error. To approximate the error, one can use an embedded Runge-Kutta method. The idea behind this approach is to include two Runge-Kutta schemes in the same tableau (so there aren't extra function evaluations). Often, one of the schemes is of a higher order than the other and, for the purposes of approximation, is taken to give an exact answer. This allows you to approximate the error and adjust the step size on the fly.

A famous example of such a method is the Runge-Kutta-Fehlberg method, or RK45. It uses a fourth and a fifth order method given in the following tableau

\[ \begin{array}{c|cccccc} 0 & & & & & & \\ 1/4 & 1/4 & & & & & \\ 3/8 & 3/32 & 9/32 & & & & \\ 12/13 & 1932/2197 & -7200/2197 & 7296/2197 & & & \\ 1 & 439/216 & -8 & 3680/513 & -845/4104 & & \\ 1/2 & -8/27 & 2 & -3544/2565 & 1859/4104 & -11/40 & \\ \hline & 25/216 & 0 & 1408/2565 & 2197/4104 & -1/5 & 0 \\ & 16/135 & 0 & 6656/12825 & 28561/56430 & -9/50 & 2/55 \end{array} \]

where the first row of $b_j$ gives the coefficients of the fourth order method and the second row gives the fifth order method.

A simpler example involves forward Euler and Heun's method, which are first and second order respectively. Its extended tableau is

\[ \begin{array}{c|cc} 0 & & \\ 1 & 1 & \\ \hline & 1 & 0 \\ & 1/2 & 1/2 \end{array} \]

Runge-Kutta Examples

The following are some common examples of Runge-Kutta methods.

Explicit Schemes

The midpoint method uses a tangent that is approximately given at the middle of the interval. It is given by

\[ x_{n+1} = x_n + \Delta t\, f\Big(x_n + \frac{\Delta t}{2} f(x_n)\Big). \]

A very popular scheme is Runge-Kutta 4, or RK4. It is a fourth order scheme with only 4 stages (the significance of this is explained in the heuristics section). Its tableau is given by:

$$\begin{array}{c|cccc}
0 & & & & \\
1/2 & 1/2 & & & \\
1/2 & 0 & 1/2 & & \\
1 & 0 & 0 & 1 & \\
\hline
 & 1/6 & 1/3 & 1/3 & 1/6
\end{array}$$

Interestingly, if $f$ has no $x$ dependence, then RK4 is simply Simpson's rule for numerical integration.

Implicit Schemes

Backward Euler is the simplest example of an implicit Runge-Kutta scheme (note that it is also a BDF and an Adams-Moulton method). It is given by

$$x_{n+1} = x_n + \Delta t\, f(x_{n+1}).$$

The trapezoidal rule uses tangents sampled at both endpoints. It is quite similar to the trapezoidal rule from numerical integration (and in fact, if $f$ has no $x$ dependence it is the same). Its Butcher tableau is given as

$$\begin{array}{c|cc}
0 & 0 & 0 \\
1 & 1/2 & 1/2 \\
\hline
 & 1/2 & 1/2
\end{array}$$

It is not immediately obvious that this is equivalent to the trapezoidal rule you get from Adams-Moulton. We have

$$K_1 = f(t_n, y_n), \qquad K_2 = f\Big(t_{n+1},\; y_n + \frac{\Delta t}{2} f(y_n) + \frac{\Delta t}{2} K_2\Big).$$

However, we note that this expression can be simplified by considering the equation for the next step. We have

$$y_{n+1} = y_n + \frac{\Delta t}{2} f(y_n) + \frac{\Delta t}{2} K_2.$$

Comparing with the equation for $K_2$, we see that

$$K_2 = f(t_{n+1}, y_{n+1})$$

and thus we have the familiar formula

$$y_{n+1} = y_n + \frac{\Delta t}{2}\big(f(y_n) + f(y_{n+1})\big).$$

Heuristics

We note that it is generally possible to get an implicit $s$ stage scheme of much higher order than the highest order explicit $s$ stage scheme (for consistent schemes). There are implicit methods based on Gaussian quadrature which are order $2s$ with $s$ stages. On the other hand, the following maximum orders are known for explicit schemes:

$$\begin{array}{l|cccccccc}
\text{stages} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\
\hline
\text{order of method} & 1 & 2 & 3 & 4 & 4 & 5 & 6 & 6
\end{array}$$

10.9 Stiff Problems

The Model Problem

Suppose we are trying to solve $\dot{y} = \lambda y$ for $\lambda \in \mathbb{C}$ with initial condition $y(0) = y_0$. Then we have the following solution

$$y = y_0 e^{\lambda t}.$$

If $\Re(\lambda) < 0$ then $\lim_{t \to \infty} y(t) = 0$, and if $\Re(\lambda) > 0$ then $\lim_{t \to \infty} |y(t)| = \infty$. Further, for purely imaginary $\lambda$, we have an oscillatory solution for which $y(t)$ is bounded for all $t$. Thus, to get qualitatively correct solutions, we hope that our numerical methods exhibit similar behavior when applied to the model problem. In particular, we want bounded solutions (stability) for $\lambda$ such that $\Re(\lambda) \leq 0$. This is of particular importance for stiff problems, which we define below.

Definition/Intuition

If your problem is given by $\dot{x} = Ax$ for $A$ a matrix, or by $\dot{x} = F(x)$, then we define the notion of a stiff problem as follows. Consider the eigenvalues of $A$ or of $[JF]$ (the Jacobian) with negative real part. If the ratio of the largest magnitude eigenvalue to the smallest is large, then you have a stiff problem.

Why is a problem with a large magnitude eigenvalue with negative real part not necessarily stiff? Consider the problem $\dot{y} = -500y$, which (for $y(0) = 1$) has solution $y = e^{-500t}$. This function decays extremely rapidly, and thus we are likely to only be concerned with the value of the function for short time intervals. Thus, taking small time steps is not quite as big of an issue for this problem.

If, however, there are relatively large and small eigenvalues that have negative real part, then we could be concerned with the long time behavior of the components of the solution (generally speaking, in the direction of the eigenvector) for the smaller eigenvalue while still needing to correctly resolve the components for the larger eigenvalue. Thus, we would end up using short time steps over a long interval. This gives the stiffness.
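To make the intuition concrete, here is a small sketch (not from the notes) that applies forward Euler to a decoupled system with eigenvalues $-500$ and $-1$. The eigenvalues, step sizes, and the helper name `forward_euler_diag` are illustrative assumptions.

```python
def forward_euler_diag(lams, h, n):
    """Apply n forward Euler steps of size h to y' = diag(lams) y with y(0) = (1, ..., 1)."""
    y = [1.0] * len(lams)
    for _ in range(n):
        y = [(1 + lam * h) * y_i for lam, y_i in zip(lams, y)]
    return y


lams = (-500.0, -1.0)

# h = 0.01 resolves the slow component well, but |1 + (-500)(0.01)| = 4,
# so the fast component is amplified at every step.
fast_coarse, slow_coarse = forward_euler_diag(lams, 0.01, 100)

# h = 0.001 gives |1 + (-500)(0.001)| = 0.5, so both components decay,
# at the cost of ten times as many steps to reach t = 1.
fast_fine, slow_fine = forward_euler_diag(lams, 0.001, 1000)
```

The slow component is essentially the same in both runs; the step size is dictated entirely by the fast component, even though that component is negligible in the true solution after a very short time.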

A related characterization of stiffness is then: a problem is stiff on the interval $[t_0, T]$ if there exists $\lambda_k$ such that

$$[T - t_0]\, \Re(\lambda_k) \ll -1,$$

where $\lambda_k$ is an eigenvalue of $[JF]$.

A final characterization of stiffness is that the requirements on the time step to have a small truncation error are weaker than the requirements for the local numerical trajectories to have the same behavior as the local trajectories of the exact solution. In particular, if $\Re(\lambda) = 0$ the desired qualitative behavior is a bounded solution, and if $\Re(\lambda) < 0$ the desired qualitative behavior is a decaying solution.

Some Examples of Stiff Problems

An artificial stiff problem is given by a $3 \times 3$ matrix $A$ which has eigenvalues $-100$, $-1 + i$, $-1 - i$. This has a solution which decays extremely quickly in the direction $(1, 0, 0)$ and which oscillates and decays slowly in the other directions.

Stiff problems arise in PDEs. In particular, we note that the matrix associated with the discrete Laplacian is stiff. We consider the interval $[0, 1]$ with homogeneous Dirichlet boundary conditions. In 1-D, the centered differencing discrete Laplacian is given by

$$\Delta_h u_j = \frac{u_{j+1} - 2u_j + u_{j-1}}{\Delta x^2},$$

so the matrix we get for the semi-discretization scheme (see the numerical PDE section) is given by

$$A = \frac{1}{\Delta x^2} \begin{pmatrix} -2 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ & & & 1 & -2 \end{pmatrix} = \frac{1}{\Delta x^2} B.$$

We will consider the eigenvalues of $B$, as the ratios of the eigenvalues will be the same as the ratios for $A$. It is immediate by the Gershgorin circle theorem that $B$ is negative semi-definite. Thus, it has real, non-positive eigenvalues. Because finite difference operators are translation invariant,

we consider discrete Fourier transform basis functions for diagonalizing them. Let these vectors be given by $v_j$ and suppose that there are $m + 2$ points in our discretization (counting the boundary). Because of the boundary condition, we would like $(v_j)_0 = (v_j)_{m+1} = 0$. This leads us to choose functions of the form

$$(v_j)_k = \sin\big(\pi j k/(m + 1)\big).$$

Now, we check what the discrete operator does to these functions. We have

$$\begin{aligned}
\Delta_h (v_j)_k &= \frac{(v_j)_{k+1} - 2(v_j)_k + (v_j)_{k-1}}{\Delta x^2} \\
&= \frac{1}{\Delta x^2}\left[\frac{e^{\pi i (k+1) j/(m+1)} - e^{-\pi i (k+1) j/(m+1)}}{2i} - 2(v_j)_k + \frac{e^{\pi i (k-1) j/(m+1)} - e^{-\pi i (k-1) j/(m+1)}}{2i}\right] \\
&= \frac{e^{\pi i j/(m+1)} (v_j)_k + e^{-\pi i j/(m+1)} (v_j)_k - 2(v_j)_k}{\Delta x^2} \\
&= \frac{1}{\Delta x^2}\big[2\cos(\pi j/(m + 1)) - 2\big](v_j)_k \\
&= -\frac{1}{\Delta x^2}\, 4 \sin^2\big(\pi j/(2(m + 1))\big)\,(v_j)_k.
\end{aligned}$$

So we see that $v_j$ is an eigenvector of $B$ with eigenvalue $-4 \sin^2(\pi j/(2(m + 1)))$ for $j = 1, \ldots, m$. (Note that for $j = 0$ and $j = m + 1$ the vector $v_j$ is zero and hence not an eigenvector.) Further, our matrix is $m \times m$, so we have a complete basis of eigenvectors and we have a formula for each eigenvalue. Now, we note that the largest magnitude eigenvalue is given by $j = m$ and has magnitude approximately 4 for large enough $m$. The smallest magnitude eigenvalue is given by $j = 1$ and has magnitude approximately $\pi^2/m^2$, i.e. $O(1/m^2)$, for large enough $m$. Thus, their ratio is of order $m^2$, making this matrix quite stiff for finer discretizations.

10.10 Stability Analysis (Especially for Stiff Problems)

The Stability Region

We define the stability region of a numerical method based on the model problem described above for stiff equations. Let the numerical method be applied to the problem $\dot{y} = \lambda y$ with $y(0) = y_0$ and step size $h$. The set $R \subset \mathbb{C}$ given by

$$R = \{z = \lambda h : y_k \text{ is bounded}\},$$

where $y_k$ are the iterates of the numerical method, is the region of absolute stability. The concept is clearer with examples (see below).
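The eigenvalue formula for $B$ derived above is easy to check numerically. The sketch below (helper names `apply_B` and `eigenpair` are hypothetical, and $m = 50$ is an arbitrary choice) applies $B$ to each sine vector and compares with the claimed eigenvalue, then forms the ratio of the extreme eigenvalues.

```python
import math


def apply_B(v):
    """Apply B = tridiag(1, -2, 1) to v, using zero values at the two boundary points."""
    m = len(v)
    padded = [0.0] + list(v) + [0.0]
    return [padded[k - 1] - 2 * padded[k] + padded[k + 1] for k in range(1, m + 1)]


def eigenpair(j, m):
    """Claimed eigenvalue and eigenvector of B for index j = 1, ..., m."""
    v = [math.sin(math.pi * j * k / (m + 1)) for k in range(1, m + 1)]
    lam = -4 * math.sin(math.pi * j / (2 * (m + 1))) ** 2
    return lam, v


m = 50
residual = 0.0
for j in range(1, m + 1):
    lam, v = eigenpair(j, m)
    Bv = apply_B(v)
    # Largest entry of B v - lam v over all j; should be at rounding level.
    residual = max(residual, max(abs(Bv_k - lam * v_k) for Bv_k, v_k in zip(Bv, v)))

# Ratio of largest to smallest magnitude eigenvalue, roughly 4 (m + 1)^2 / pi^2.
stiffness_ratio = eigenpair(m, m)[0] / eigenpair(1, m)[0]
```

Doubling `m` roughly quadruples `stiffness_ratio`, which is the $O(m^2)$ growth claimed in the text.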

A-Stability and A-α Stability

A numerical method is called A-stable if $\{z : \Re(z) < 0\} \subset R$ for its stability region $R$. These methods are desirable because there is no time-step restriction needed for $\lambda \Delta t$ to be in the stability region if $\Re(\lambda) \leq 0$. Thus, these methods are useful for stiff equations.

A related notion is A-α stability. A numerical method is A-α stable if $\{z = re^{i\theta} : r > 0,\ |\theta - \pi| < \alpha\} \subset R$. That is, if the wedge that makes an angle of $\alpha$ above and below the $x$-axis in the left half-plane is a subset of your stability region. These methods are especially useful for symmetric stiff problems (e.g. solving the heat equation), as the value of $\lambda \Delta t$ will be in the region for any time step.

Here are some facts without proof:

- There are no explicit methods, either LMM or Runge-Kutta, which are A-stable. This property will be clear for Runge-Kutta after the discussion below. The stability regions of these methods are determined by the sets where a polynomial is bounded by 1, which is necessarily a bounded set for non-constant polynomials.
- The $s$ stage Gauss-Legendre methods (which are implicit Runge-Kutta) are A-stable and order $2s$. Thus, there are A-stable implicit Runge-Kutta methods of arbitrarily high order.
- By contrast, the Second Dahlquist Barrier: there are no A-stable explicit linear multistep methods. The implicit linear multistep methods which are A-stable have order of convergence at most 2. The trapezoidal rule has the smallest error constant amongst the A-stable linear multistep methods of order 2.

Runge-Kutta Stability Regions

Because of the form of explicit Runge-Kutta methods, their stability regions are defined by a set where a polynomial is bounded. For instance, if we apply forward Euler to the model problem, we have

$$y_k = y_{k-1} + \lambda h\, y_{k-1} = (1 + \lambda h)\, y_{k-1}.$$

Thus, we see that the stability region is given by the $z$ such that $|1 + z| \leq 1$, which is the disc of radius 1 centered at $-1$.

For the midpoint method, we have

$$y_k = y_{k-1} + \lambda h\Big(y_{k-1} + \frac{\lambda h}{2}\, y_{k-1}\Big) = \Big(1 + \lambda h + \frac{(\lambda h)^2}{2}\Big) y_{k-1}.$$

Thus, the stability region is given by $z$ such that $|1 + z + z^2/2| \leq 1$. The higher order Runge-Kutta methods have ever larger stability regions, some of which contain portions of the imaginary axis (which is useful for skew-symmetric problems).
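These stability polynomials can be probed directly. The following sketch evaluates them at a few points; the RK4 polynomial $1 + z + z^2/2 + z^3/6 + z^4/24$ is the standard one for the classical four stage scheme, and the helper names and the scanning step size are assumptions made for illustration.

```python
def R_forward_euler(z):
    return 1 + z


def R_midpoint(z):
    return 1 + z + z**2 / 2


def R_rk4(z):
    return 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24


def in_stability_region(R, z):
    """z = lambda * h is in the stability region when |R(z)| <= 1."""
    return abs(R(z)) <= 1


def real_axis_limit(R, step=1e-3):
    """Walk left along the negative real axis until |R(x)| first exceeds 1."""
    x = 0.0
    while in_stability_region(R, x - step):
        x -= step
    return x


euler_limit = real_axis_limit(R_forward_euler)   # close to -2
midpoint_limit = real_axis_limit(R_midpoint)     # close to -2
rk4_limit = real_axis_limit(R_rk4)               # close to -2.785

# Only RK4 contains a piece of the imaginary axis away from the origin.
euler_on_axis = in_stability_region(R_forward_euler, 0.1j)
midpoint_on_axis = in_stability_region(R_midpoint, 1j)
rk4_on_axis = in_stability_region(R_rk4, 2j)
```

For a real negative eigenvalue $\lambda$, the value returned by `real_axis_limit` gives the step-size restriction $h \leq |\text{limit}| / |\lambda|$.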

As mentioned above, there are implicit Runge-Kutta methods which are A-stable. An example is the trapezoidal rule. When applied to the model problem, we have

$$\begin{aligned}
y_{n+1} &= y_n + \frac{\Delta t}{2}(\lambda y_n + \lambda y_{n+1}) \\
(1 - \Delta t \lambda/2)\, y_{n+1} &= (1 + \Delta t \lambda/2)\, y_n \\
y_{n+1} &= \frac{1 + \Delta t \lambda/2}{1 - \Delta t \lambda/2}\, y_n.
\end{aligned}$$

Thus, the stability region is given by those $z$ such that

$$\frac{|1 + z/2|}{|1 - z/2|} \leq 1,$$

which is precisely the (closed) left half-plane.

LMM Stability Regions

Finding the stability region of an LMM is related to satisfying the root condition. We see that for the model problem

$$x_{n+1} = \alpha_0 x_n + \cdots + \alpha_p x_{n-p} + \Delta t\,(\beta_{-1} \lambda x_{n+1} + \beta_0 \lambda x_n + \cdots + \beta_p \lambda x_{n-p}).$$

Let $z = \lambda \Delta t$. We again have a recursion polynomial, but this time it depends on $z$. Let

$$\rho(\zeta) = \zeta^{p+1} - \alpha_0 \zeta^p - \cdots - \alpha_p, \qquad \sigma(\zeta) = \beta_{-1} \zeta^{p+1} + \beta_0 \zeta^p + \cdots + \beta_p.$$

Then the recursion polynomial is

$$p(\zeta) = \rho(\zeta) - z \sigma(\zeta).$$

We see then that the stability region is given by

$$\{z : p(\zeta) \text{ satisfies the root condition}\}.$$

This concept is much clearer when considering an example method. We will calculate the stability region of the modified midpoint method

$$x_{n+1} = x_{n-1} + 2\Delta t\, f(x_n).$$

In this case, we have $p = 1$ and

$$\rho(\zeta) = \zeta^2 - 1, \qquad \sigma(\zeta) = 2\zeta.$$

This gives that the recurrence polynomial is

$$p(\zeta) = \zeta^2 - 2z\zeta - 1.$$

This polynomial has roots given by

$$\zeta = \frac{2z \pm \sqrt{4z^2 + 4}}{2} = z \pm \sqrt{z^2 + 1}.$$

For the root condition to be satisfied, we need that $\Re(z) = 0$. Otherwise, since the product of the two roots is $-1$, one of the zeros will be larger than one in magnitude. For similar reasons, it is clear that any purely imaginary $z$ in the stability region must have magnitude less than or equal to 1. If $z = \pm i$ we have a double root $\zeta = \pm i$ of magnitude 1, so the root condition is not satisfied. Thus, the stability region is the purely imaginary interval $i(-1, 1)$. From this, we see that the modified midpoint rule might be suited to the case where we have a skew-symmetric matrix (as it will have purely imaginary eigenvalues) for the right hand side of our ODE. On the other hand, it is not well-suited for stiff problems at all.

Backward Differentiation Formulas (BDF) often have nice stability regions (though, as noted above, they are necessarily not A-stable for order greater than 2 by the second Dahlquist barrier). A simple example is backward Euler. It's given by

$$y_{n+1} = y_n + \Delta t\, f(y_{n+1}).$$

We have the following

$$\rho(\zeta) = \zeta - 1, \qquad \sigma(\zeta) = \zeta.$$

So the recurrence polynomial is given by

$$p(\zeta) = (1 - z)\zeta - 1,$$

which has a root given by $\zeta = 1/(1 - z)$. We would like this root to be bounded by 1 in magnitude, thus the stability region is given by

$$\{z : |1 - z| \geq 1\}.$$

That is, it's the region outside of the disc of radius 1 centered at 1 (the stability region includes the boundary of this disc). This includes the left half-plane, and thus backward Euler is A-stable. The shape of this

region is typical of the stability regions for BDF. They are defined outside of some bounded shape (which is mostly in the right half-plane). However, for BDF of order 3 and higher, these shapes creep into the left half-plane, and thus these BDF schemes are not A-stable. They are, however, A-α stable. This property makes BDF an attractive alternative to implicit Runge-Kutta schemes for symmetric problems, as you can have a high order scheme with just one implicit equation to solve that still has a suitable stability region.

11 Numerical PDE

11.1 Finite Differences for Elliptic PDE

The prototypical elliptic PDE is Poisson's equation $\Delta u = f$. The standard approach for solving this problem with finite differences is to discretize the problem on a uniform grid and use a stencil on the interior points. For the 1-D case, we have

$$\Delta_h u_j = \frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}$$

and for the 2-D case we have the 5 point stencil

$$\Delta_h u_{i,j} = \frac{u_{i,j+1} + u_{i+1,j} - 4u_{i,j} + u_{i,j-1} + u_{i-1,j}}{h^2}.$$

Ignoring boundary conditions, this process gives us a matrix equation to solve for $u$. We have

$$Au = f,$$

where $A$ is the matrix corresponding to the discrete Laplacian $\Delta_h$. The tools for solving this problem have been discussed above, but we'll summarize the options. First, we note that, as calculated above, the matrix is very poorly conditioned. In the 1-D case we calculated that the condition number was $O(m^2)$, where $m$ was the number of points in the discretization. Thus, the convergence of methods like Gauss-Seidel can be quite slow (especially the asymptotic rate of convergence; the initial rate of convergence can be quite good, and this fact is exploited in multigrid solvers). Because we have an explicit expression for the eigenvalues, it is possible to determine an optimal choice of $\omega$ for the SOR algorithm, which can be quite efficient here. However, if we're being serious, a full multigrid scheme or a preconditioned conjugate gradient scheme is likely the best choice.

Dealing with boundary conditions is quite tedious and technical and is thus ignored here.

11.2 Finite Differences for Equations of Evolution (Parabolic and Hyperbolic PDE)

We will discuss the approach for equations of evolution in terms of semi-discretization schemes. The idea is that we come up with a spatial differencing scheme for the spatial derivatives to get a
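The slow convergence of Gauss-Seidel on finer grids can be seen in a small experiment. The sketch below solves the 1-D problem $u'' = 1$ with zero boundary values; the right-hand side, the stopping rule (largest update below `tol`), and the function name `gauss_seidel_poisson` are illustrative assumptions rather than anything prescribed by the notes.

```python
def gauss_seidel_poisson(m, tol=1e-8, max_sweeps=100000):
    """Gauss-Seidel for (u[j-1] - 2 u[j] + u[j+1]) / h^2 = 1 with u[0] = u[m+1] = 0.

    Returns the grid values (boundary included) and the number of sweeps used.
    """
    h = 1.0 / (m + 1)
    u = [0.0] * (m + 2)
    for sweep in range(1, max_sweeps + 1):
        change = 0.0
        for j in range(1, m + 1):
            # Solve the j-th equation for u[j], using the newest neighbor values.
            new = (u[j - 1] + u[j + 1] - h * h) / 2
            change = max(change, abs(new - u[j]))
            u[j] = new
        if change < tol:
            return u, sweep
    return u, max_sweeps


u10, sweeps10 = gauss_seidel_poisson(10)
u20, sweeps20 = gauss_seidel_poisson(20)
u40, sweeps40 = gauss_seidel_poisson(40)
```

Each doubling of `m` should increase the sweep count by roughly a factor of four, in line with the $O(m^2)$ condition number.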

system of ODEs. For example, consider the case of the heat equation

$$u_t = \Delta u.$$

If we discretize the spatial derivatives using a stencil (say we're in 2-D) we have

$$u_t = Au,$$

where $A$ is the matrix corresponding to the discrete Laplacian $\Delta_h$ seen in the elliptic section above. This approach of using a spatial discretization to turn the PDE into a system of ODEs is ubiquitous in numerical PDE.

As another example, we can use similar ideas to solve the transport equation $u_t + s u_x = 0$. We consider a few spatial discretizations (we assume $s > 0$):

$$\begin{aligned}
(u_j)_x &\approx \frac{u_{j+1} - u_{j-1}}{2\Delta x} && \text{centered difference} \\
(u_j)_x &\approx \frac{u_j - u_{j-1}}{\Delta x} && \text{upwinding scheme} \\
(u_j)_x &\approx \frac{u_{j+1} - u_j}{\Delta x} && \text{downwinding scheme}
\end{aligned}$$

As we will see in the following sections, the choice of spatial discretization can have a serious impact on the choice of our ODE solver and the success of the method. Often, stability concerns dictate the choice of ODE solver used for a problem.

Our final example is the wave equation $u_{tt} - u_{xx} = 0$. We note that we could change this into a coupled system of transport equations for $u_t$ and $u_x$, but it seems much more desirable to compute $u$ itself. The following approach can be found in Iserles. We consider, in addition to $u$, a dummy function $v$. We set up a coupled system of advection equations

$$u_t + v_x = 0, \qquad v_t + u_x = 0,$$

and we note that $u_{tt} = (u_t)_t = (-v_x)_t = (-v_t)_x = u_{xx}$. This coupled equation can then be solved by diagonalizing and using a solver for each advection problem. Another approach (whose effectiveness I'm unsure of) is to view this as a second order semi-discretization problem and solve it in the obvious way, i.e., setting $v = u_t$ and getting the coupled system $\dot{v} = u_{xx}$, $\dot{u} = v$. I suspect that the smoothing properties of the heat equation make this a bad approach.

Nonlinear problems present difficulties beyond the scope of these notes.
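As a small illustration of how much the spatial discretization matters, the sketch below pairs forward Euler in time with the upwind and centered differences for $u_t + s u_x = 0$. The periodic grid, the square-pulse initial data, the value $\nu = s\Delta t/\Delta x = 0.5$, and the function names are assumptions chosen for the experiment.

```python
def upwind_step(u, nu):
    """Forward Euler with the upwind difference on a periodic grid; nu = s * dt / dx."""
    # u[j - 1] wraps around to the last entry when j = 0.
    return [u[j] - nu * (u[j] - u[j - 1]) for j in range(len(u))]


def centered_step(u, nu):
    """Forward Euler with the centered difference on a periodic grid."""
    n = len(u)
    return [u[j] - 0.5 * nu * (u[(j + 1) % n] - u[j - 1]) for j in range(n)]


def run(step, u0, nu, n_steps):
    u = list(u0)
    for _ in range(n_steps):
        u = step(u, nu)
    return u


n = 100
u0 = [1.0 if 40 <= j < 60 else 0.0 for j in range(n)]  # square pulse

u_upwind = run(upwind_step, u0, 0.5, 200)
u_centered = run(centered_step, u0, 0.5, 200)

max_upwind = max(abs(v) for v in u_upwind)      # stays bounded by 1
max_centered = max(abs(v) for v in u_centered)  # grows by many orders of magnitude
```

With $0 \leq \nu \leq 1$ each upwind update is a convex combination of neighboring values, so the solution cannot grow, whereas the centered scheme amplifies every nonconstant Fourier mode when combined with forward Euler.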

11.3 Stability, etc.

For Semi-Discretization

If we have a semi-discretization for our PDE and the matrix given by it is well-understood, we can use the considerations for the model problem and stiff ODEs to determine which ODE solver to choose. We use some examples:

- Advection equation with centered differencing in space: we note that the matrix gained from centered differencing looks like

$$\frac{1}{2\Delta x} \begin{pmatrix} 0 & 1 & & & \\ -1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 0 & 1 \\ & & & -1 & 0 \end{pmatrix}$$

and is thus skew-symmetric. These matrices have purely imaginary eigenvalues. Thus, we see that using forward Euler as the ODE solver for the time stepping is a bad choice (the step size can never be small enough for stability). Interestingly, the leap-frog (or modified midpoint) method is suitable for sufficiently short time steps (as its stability region includes part of the imaginary axis). Finally, most implicit methods are a good choice here.
- Heat equation in 1-D with the typical stencil: as noted in the other sections, this matrix is symmetric and has negative eigenvalues whose magnitudes range from $O(1)$ to $O(N^2)$. Thus, the problem is quite stiff. Because the eigenvalues of the matrix are all real, an A-α stable method is sufficient for stability. We note that the BDF methods are then a good choice if we want a higher order method (higher order implicit RK methods have good stability regions as well, but their implicit equations are more difficult to solve than those given by BDF).

Von Neumann Analysis

CFL Conditions

Lax Equivalence

11.4 Finite Element Methods for Elliptic PDE

A finite element method is generally formulated according to the following outline:

1. Come up with the variational formulation of the problem.
2. Discretization of the problem using finite elements: construction of the finite dimensional space $V_h$.
3. Solution of the discrete problem (often, writing a linear system which can be solved).

4. Implementation of the method on a computer.

There are a number of advantages to the FEM when compared to traditional finite difference methods. In particular, it is easier to handle complicated geometries, general boundary conditions, and variable or non-linear material properties.

The Model Problem

We consider the problem

$$(D) \quad \begin{cases} -u'' = f \\ u(0) = u(1) = 0 \end{cases}$$

and we will follow the basic steps of an FEM for solving this problem.

Minimization/Variational Formulation. We note that $(D)$ has solutions. This can be seen by integrating the equation twice. Let

$$V = \{v : v \in C[0, 1],\ v' \text{ is piecewise continuous and bounded on } [0, 1],\ v(0) = v(1) = 0\}$$

and let

$$F(v) = \tfrac{1}{2}(v', v') - (f, v),$$

where $(\cdot, \cdot)$ is the inner product on $L^2[0, 1]$. Then we define the following two problems

$$(M) \quad \text{find } u \in V \text{ s.t. } F(u) \leq F(v) \quad \forall v \in V,$$

$$(V) \quad \text{find } u \in V \text{ s.t. } (u', v') = (f, v) \quad \forall v \in V.$$

We see that a solution $u$ of $(D)$ is also a solution of $(V)$ by integration by parts:

$$(-u'', v) = (f, v) \implies (u', v') = (f, v),$$

where the boundary terms vanish because $v(0) = v(1) = 0$. Further, we note that $(V)$ and $(M)$ have the same solutions. This is simple to verify. It is also easy to verify that solutions of $(V)$ are unique. If we have a function $u$ which solves $(V)$ and has a continuous second derivative, then we can integrate by parts to see that $u$ solves $(D)$. Finally, if $u$ solves $(V)$, then $u''$ exists and is continuous. Hence, $(V)$, $(M)$, and $(D)$ are equivalent and have unique solutions.

Discretization of the Problem. If we were to discretize the minimization problem $(M)$, we would have what's called a Ritz method. We will give an example for discretizing the variational problem, called a Galerkin method. The idea is to come up with a finite dimensional space $V_h$ which approximates the original space $V$. In particular, we take $V_h \subset V$. The first set $V_h$ which we will consider will consist of piecewise linear functions. We let $0 = x_0 < x_1 < \cdots < x_M < x_{M+1} = 1$ be a partition of the interval into subintervals $I_j = (x_{j-1}, x_j)$. These discrete sub-domains are the elements of the finite element method. We now let $V_h$ be the set of functions which are linear on each interval, continuous, and satisfy the Dirichlet boundary conditions. These functions are completely determined by their values at the points $x_1, \ldots, x_M$. This guides us in finding a set of basis functions. Let $\phi_j(x)$ be the tent function which satisfies $\phi_j(x_i) = \delta_{i,j}$.

[Figure: the tent function $\phi_3$, where the $x$-axis takes the value $x_i$ at the point $i$.]

Any function $v \in V_h$ can then be written as

$$v(x) = \sum_{i=1}^{M} \eta_i \phi_i(x).$$

In particular, $V_h$ has dimension $M$.

Solve the discrete problem. Because the $\phi_j$ form a basis of $V_h$, we can solve the discrete variational problem

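$$(V_h) \quad \text{find } u \in V_h \text{ s.t. } (u', v') = (f, v) \quad \forall v \in V_h$$

by solving the equivalent problem

$$(V_h') \quad \text{find } u \in V_h \text{ s.t. } (u', \phi_j') = (f, \phi_j) \quad \text{for each } \phi_j.$$

This problem is equivalent by linearity. Further, we can write $u = \sum_j \xi_j \phi_j$, and we see that $u$ is completely determined by its coefficients. We then note that the above problem gives us a linear system of $M$ equations for the $\xi_j$. Let $\xi = (\xi_1, \ldots, \xi_M)^T$; then we have

$$A\xi = b,$$

where

$$A = \begin{pmatrix} (\phi_1', \phi_1') & \cdots & (\phi_1', \phi_M') \\ \vdots & & \vdots \\ (\phi_M', \phi_1') & \cdots & (\phi_M', \phi_M') \end{pmatrix}, \qquad b = \begin{pmatrix} (f, \phi_1) \\ \vdots \\ (f, \phi_M) \end{pmatrix}.$$

If we assume a uniform grid spacing $h$, the matrix $A$ has a simple form. We note that $\phi_j' = 1/h$ on $[x_{j-1}, x_j]$ and $\phi_j' = -1/h$ on $[x_j, x_{j+1}]$. The derivative is zero otherwise. This gives that

$$\begin{aligned}
(\phi_j', \phi_j') &= \int_{x_{j-1}}^{x_{j+1}} \frac{1}{h^2}\, dx = \frac{2}{h}, \\
(\phi_j', \phi_{j-1}') &= \int_{x_{j-1}}^{x_j} -\frac{1}{h^2}\, dx = -\frac{1}{h}, \\
(\phi_j', \phi_{j+1}') &= \int_{x_j}^{x_{j+1}} -\frac{1}{h^2}\, dx = -\frac{1}{h}, \\
(\phi_j', \phi_i') &= 0 \quad \text{for } |j - i| > 1.
\end{aligned}$$

So $A$ has the form

$$A = \frac{1}{h} \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}.$$

By the Gershgorin circle theorem, it is clear that $A$ is positive semi-definite. For $u = \sum_i \xi_i \phi_i$, we have that

$$0 \leq (u', u') = \xi^T A \xi,$$

with equality only when $u \equiv 0$. Thus, $A$ is positive definite and $A\xi = b$ can be solved.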
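The system $A\xi = b$ above can be assembled and solved in a few lines. The sketch below assumes a uniform grid and a constant load $f$, so that $(f, \phi_j) = f h$ exactly; the function names and the choice $M = 9$ are hypothetical. The tridiagonal solve is the usual LU elimination without pivoting, which is safe here because $A$ is positive definite.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by LU elimination without pivoting, in O(M) work.

    lower[i - 1] is the entry in row i, column i - 1; upper[i] is row i, column i + 1.
    """
    n = len(diag)
    d, r = list(diag), list(rhs)
    for i in range(1, n):              # forward elimination
        w = lower[i - 1] / d[i - 1]
        d[i] -= w * upper[i - 1]
        r[i] -= w * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x


def fem_1d_constant_load(M, f_value=1.0):
    """Piecewise-linear Galerkin coefficients for -u'' = f_value, u(0) = u(1) = 0."""
    h = 1.0 / (M + 1)
    diag = [2.0 / h] * M
    off = [-1.0 / h] * (M - 1)
    # For constant f, (f, phi_j) = f_value * (area under a tent function) = f_value * h.
    b = [f_value * h] * M
    return solve_tridiagonal(off, diag, off, b)


M = 9
xi = fem_1d_constant_load(M)
nodes = [(j + 1) / (M + 1) for j in range(M)]
exact = [x * (1 - x) / 2 for x in nodes]   # solution of -u'' = 1 with zero boundary values
max_nodal_error = max(abs(a - e) for a, e in zip(xi, exact))
```

For this particular problem the coefficients $\xi_j$ agree with the exact solution at the nodes up to rounding, so `max_nodal_error` only measures floating point error.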

Computer implementation. We will not get too far into the details of implementing this specific problem, as it is quite simple. The computer implementation consists of populating the matrix $A$ and vector $b$ and using a numerical linear algebra algorithm to solve for $\xi$. Because the matrix is tridiagonal, an LU decomposition can be done in $O(M)$ work and is quite easy to implement.

Error Estimates for FEM

For the model problem. We derive an error estimate for the model problem using the discretization above. Let $u$ be the solution of $(D)$ and $u_h$ be the solution of $(V_h)$. Because $V_h \subset V$, we have that

$$((u - u_h)', v') = (f, v) - (f, v) = 0$$

for all $v \in V_h$. We now claim that for any $v \in V_h$ we have

$$\|(u - u_h)'\| \leq \|(u - v)'\|.$$

To establish this fact, we use that $((u - u_h)', v') = 0$ for all $v \in V_h$. Let $w = u_h - v \in V_h$. Then

$$\begin{aligned}
\|(u - u_h)'\|^2 &= ((u - u_h)', (u - u_h)') + ((u - u_h)', w') \\
&= ((u - u_h)', (u - v)') \\
&\leq \|(u - u_h)'\|\, \|(u - v)'\| \quad \text{by Cauchy-Schwarz,}
\end{aligned}$$

and dividing by $\|(u - u_h)'\|$ gives the claim. Thus, we can come up with an upper bound on the error by considering the error for a particular choice of function $v_h \in V_h$. We choose $v_h$ to be the linear interpolant of $u$. By a standard Taylor expansion argument, one can see that we have the following bounds

$$|u'(x) - v_h'(x)| \leq h \max_{y \in [0,1]} |u''(y)|, \qquad |u(x) - v_h(x)| \leq \frac{h^2}{8} \max_{y \in [0,1]} |u''(y)|$$

pointwise where the derivative of $v_h$ is defined. Thus, the bound on $\|(u - u_h)'\|$ gives us

$$\|(u - u_h)'\| \leq \|(u - v_h)'\| \leq h \max_{y \in [0,1]} |u''(y)|$$

and thus by integration we have

$$|u(x) - u_h(x)| \leq h \max_{y \in [0,1]} |u''(y)|.$$

We note that this bound is not as good as the one given for $v_h$. It can be shown that in this case, the error of the FEM solution is actually $O(h^2)$. At any rate, we have convergence if $u''$ is bounded.

What about Higher Dimensions?

We discuss the case for the 2D Poisson equation here. We would like to solve

$$-\Delta u = f \text{ in } \Omega, \qquad u = 0 \text{ on } \Gamma = \partial\Omega.$$

Following much the same route as the 1D case, we arrive at the following variational formulation. Let $V = \{v : v \text{ is continuous and piecewise differentiable with } v|_\Gamma = 0\}$. For any $v \in V$, we have

$$\begin{aligned}
-\int_\Omega \Delta u\, v\, dx &= \int_\Omega f v\, dx \\
\int_\Omega \nabla u \cdot \nabla v\, dx - \int_\Gamma v\, \frac{\partial u}{\partial n}\, ds &= \int_\Omega f v\, dx \\
\int_\Omega \nabla u \cdot \nabla v\, dx &= \int_\Omega f v\, dx \\
a(u, v) &= (f, v),
\end{aligned}$$

where the boundary integral vanishes because $v = 0$ on $\Gamma$, and where $a(u, v)$ and $(f, v)$ are given by the left and right-hand sides of the line above, respectively. Thus, the variational formulation of our problem is to find $u \in V$ such that $a(u, v) = (f, v)$ for all $v \in V$.

For the discretization, we need to define what it is to be a triangulation. Let $T_h = \{K_1, \ldots, K_M\}$ be a set of non-overlapping triangles. We say that $T_h$ is a triangulation of $\Omega$ if $\bar{\Omega} = \bigcup_n K_n$ and no vertex of one triangle lies in the interior of an edge of another. Often, one denotes the mesh parameter

$$h = \max_{K \in T_h} (\text{length of the longest side of } K).$$

We define our finite dimensional subspace $V_h$ as

$$V_h = \{v \in V : v|_K \text{ is linear for each } K \in T_h\}.$$


Because $v = 0$ on the boundary for $v \in V_h$, a function $v \in V_h$ is then determined by its values on the interior nodes; call them $N_j$. We again define tent functions $\varphi_j$ for each node $N_j$.

[Figure: a tent function $\varphi_j$ on a triangulation.]

The support of $\varphi_j$ is then restricted to triangles with common node $N_j$. As before, we must set up a matrix with entries $A_{ij} = a(\varphi_i, \varphi_j)$. We note that

$$a(\varphi_i, \varphi_j) = \sum_{K \in T_h} a_K(\varphi_i, \varphi_j),$$

where

$$a_K(\varphi_i, \varphi_j) = \int_K \nabla\varphi_i \cdot \nabla\varphi_j\, dx.$$

In 2D, the values of $a(\varphi_i, \varphi_j)$ are normally found this way. We note that $a_K(\varphi_i, \varphi_j)$ can be nonzero only if $N_i$ and $N_j$ are both vertices of $K$. Thus, we can compute a stiffness matrix for each triangle that is $3 \times 3$, because a triangle has 3 vertices. Then, adding the contributions from each of these gives the full system.

We note that there is a new difficulty in higher dimensions when it comes to getting good error bounds. We will not get into it here, but in order to show that the linear interpolant of $u$, which is in $V_h$, has the desired accuracy, we require that the triangles do not get too thin.
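The per-triangle $3 \times 3$ stiffness matrix mentioned above has a closed form for piecewise linear basis functions, since their gradients are constant on each triangle. The following is a minimal sketch under that assumption; the function name `local_stiffness` and the sample triangle are hypothetical.

```python
def local_stiffness(p0, p1, p2):
    """3 x 3 matrix of a_K(phi_i, phi_j) for the linear basis functions on one triangle.

    Each vertex is an (x, y) pair. On the triangle, phi_i is the linear function
    equal to 1 at vertex i and 0 at the other two, so its gradient is constant.
    """
    pts = [p0, p1, p2]
    # Twice the signed area of the triangle.
    two_area = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])
    grads = []
    for i in range(3):
        xj, yj = pts[(i + 1) % 3]
        xk, yk = pts[(i + 2) % 3]
        grads.append(((yj - yk) / two_area, (xk - xj) / two_area))
    area = abs(two_area) / 2
    # The integrand is constant, so the integral is area * (grad_i . grad_j).
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads] for gi in grads]


# Right triangle with vertices (0, 0), (1, 0), (0, 1).
K_ref = local_stiffness((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

To build the global matrix, one would loop over the triangles and add each entry of the local matrix to the global entry indexed by the corresponding pair of node numbers. Each row of a local matrix sums to zero because the three basis functions sum to one on the triangle.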
