Lecture Notes on Numerical Solutions of Differential Equations. Instructor: Prof. Dongwook Lee


Instructor: Prof. Dongwook Lee
MWF, 11:00 am - 12:10 pm, Jack Baskin Engineering classroom 169
Spring 2015
Course Website:

Contents

1 Introduction to Scientific Computing
2 Systems of Linear Equations
3 Linear Least Squares
4 Eigenvalue Problems
5 Initial Value Problems for Ordinary Differential Equations
6 Two-point Boundary Value Problems for Ordinary Differential Equations
7 Reviews on Partial Differential Equations and Difference Equations
8 Parabolic PDEs
9 Hyperbolic PDEs
10 Short Note on Hyperbolic PDEs - Linear Scalar Case
11 Computing Discontinuous Solutions of Linear Conservation Laws

Chapter 1: Introduction to Scientific Computing

1.1 Course Description

This course focuses on the basic numerical methods that are most fundamental in the various fields of scientific computing. As some of you may have already noticed, the subject of this course is traditionally called numerical analysis, which is concerned with the design and analysis of numerical (discrete) algorithms for solving mathematical problems on computers. We achieve our goal by studying several key topics in two parts. The first part covers numerical linear algebra, where we learn (i) how to find solutions of well-posed linear systems of equations Ax = b using methods such as Gaussian elimination, LU factorization, and Cholesky factorization; (ii) how to find approximate solutions when the linear system is not well-posed but rather overdetermined, using the approach of linear least squares, Ax ~ b. Primary topics include normal equations, orthogonal transformations, QR factorizations, and the popular orthogonalization methods of Gram-Schmidt and Householder transformations; (iii) lastly, eigenvalue problems Ax = lambda*x, which provide great insight into any problem characterized by applying a linear transformation repeatedly. Eigenvalue problems arise in a wide range of science and engineering fields and play a key role in the theory. Topics include numerical approaches for computing eigenvalues and eigenvectors (e.g., power iteration, Rayleigh quotient iteration, QR iteration) and the singular value decomposition, in the contexts of both computing eigenvalues and solving least squares and related problems. In the second part we turn our interest to obtaining numerical solutions of ODEs and PDEs. We will study numerical methods for initial and boundary value problems for ODEs, including single-step and multi-step solution update algorithms and the related questions of accuracy and stability.
For PDEs we look at numerical algorithms based primarily on finite difference methods for solving advection equations (hyperbolic PDEs), diffusion

equations (parabolic PDEs), as well as Poisson equations (elliptic PDEs). In conjunction with the numerical linear algebra of the first part, we also develop a further set of iterative approaches for solving linear systems of equations (e.g., the Jacobi method, the Gauss-Seidel method, successive over-relaxation, and the conjugate gradient method). If time permits, some topics from numerical integration will be briefly introduced, while a set of fundamental ideas in numerical differentiation (an important topic!) will be discussed as the course proceeds.

1.2 Course Syllabus

A tentative week-by-week schedule of the course is as follows:

Week 1: Review of basic ideas in scientific computing, Unix/Linux, introduction to scientific languages, Fortran 90, version control using Subversion and Git, review of basic linear algebra
Week 2: Direct methods for solving linear systems (Gaussian elimination (or LU factorization), Cholesky factorization), least squares problems (QR factorization)
Week 3: Continuing least squares problems (orthogonalization methods: Gram-Schmidt, Householder transformations)
Week 4: Eigenvalue problems (power iteration, Rayleigh quotient iteration, QR iteration), singular value decomposition and its applications
Week 5: Initial value problems for ODEs (single-step and multi-step methods, accuracy and stability, explicit vs. implicit methods)
Week 6: (Two-point) boundary value problems for ODEs (in 1D) (shooting method, finite difference method, Galerkin method)
Week 7: Numerical methods for parabolic PDEs (explicit vs. implicit, stability analysis)
Week 8: Numerical methods for hyperbolic PDEs (linear advection equations vs. the nonlinear Burgers equation, stability analysis, the Courant condition, Lax equivalence theorem)
Week 9: Continuing numerical methods for hyperbolic PDEs
Week 10: Iterative methods for elliptic PDEs (Jacobi, Gauss-Seidel, SOR, CG)

1.3 Course Materials

Main resources: Class notes and handouts (see also Prof. Pascale Garaud's online lecture notes from Spring 2014)

Other references:
An Introduction to Numerical Analysis, Kendall E. Atkinson (Wiley)
Scientific Computing: An Introductory Survey, Michael T. Heath (McGraw-Hill)
Numerical Linear Algebra, Lloyd N. Trefethen and David Bau, III (SIAM, available online)
Numerical Recipes, Press, Teukolsky, Vetterling and Flannery (Cambridge Univ. Press, available online)
Finite Difference Methods for Ordinary and Partial Differential Equations, Randall J. LeVeque (SIAM)
A First Course in the Numerical Analysis of Differential Equations, Arieh Iserles (Cambridge Univ. Press)
High Performance Scientific Computing, online lecture notes, Randall J. LeVeque

1.4 Grading Policy

Homework sets: 30% of total grade
Take-home mid-term exam: 30% of total grade
Final coding project: 40% of total grade

1.4.1 Homework

There are a total of 6 homework problem sets, on both mathematical theory and computer programming, assigned every two weeks. They count for 30% of your total grade. The purpose of the assignments is to give you opportunities to explore mathematical concepts and use them to conduct numerical calculations. In this course you will learn extensively how to discretize mathematical equations and how to visualize numerical solutions to them. The policy on late homework submission is that you receive a maximum of 80% if late by less than a day, and 50% if late by more than a day. Students are strongly encouraged to submit their homework electronically as pdf (no Word documents) to your git repository (see below).

1.4.2 Take-home mid-term exam

There is one take-home exam, which counts for 30% of your total grade.

1.4.3 Final coding project

In the final project you will be asked to implement numerical schemes in Fortran 90 to solve a physics problem. You are also required to write a scientific report in a professional style, using either LaTeX or a word processor, and to submit it as a pdf file. Please keep in mind that the quality of the project goes beyond the homework set materials. Project submission is to be made to your git repository by the due date, June 2015 (tentative). The project counts for 40% of your total grade.

1.5 Required Scientific Tools for Successful Course Work

1.5.1 Scientific language and computing platform

One of the crucial components of the entire course work is writing computer programs for the homework sets, the take-home midterm exam, and the final coding project. Fortran 90 is the programming language of choice: it is widely used in the high performance computing community. You should be able to submit your course assignments by successfully implementing the required numerical algorithms in Fortran 90. This means that you need access to a Linux/Unix computing platform where you can carry out such programming studies. There are several options for bringing a Linux/Unix computing system into your daily scientific adventures:

If your machine is a Linux or Mac machine, use your own machine to run your code locally.

If your machine is a Windows PC, you can remotely access a Linux computer over the network using an X-forwarding terminal such as PuTTY. PuTTY is one of the best SSH clients on Windows, allowing you to work on a remote Linux computer.

If you prefer to run programs locally rather than remotely (e.g., because of limited internet access at your place), you can install Cygwin, which brings functionality similar to a Linux distribution to Windows. Note that Cygwin is not a way to run native Linux apps on Windows.
On the other hand, if you wish to have a native Linux setup on your Windows PC, you can run a pure Linux environment using a free virtualization package called VirtualBox, which is quite excellent. It also allows file sharing between your host operating system (e.g., Windows) and the virtual operating system (e.g., Linux).

Remark: These days you can also learn useful tips not only from Google searches but also from YouTube, so please use those visual resources as well as reading

resources.

Remark: And don't forget one thing: if you need help, please don't hesitate to ask good people around you. And I am one of them, hopefully.

Even though you might have your own computing resources for the course (your own Linux or Mac), we choose the default computing platform to be the Linux Grape cluster from the AMS department. You can see a description of Grape in the attachment at the end of this chapter. If you are not an SOE student, please come see the instructor to get an account on Grape. A Fortran compiler (e.g., GNU gfortran) and all the other necessary libraries and software (gnuplot, matlab, idl, etc.) are available on Grape. Grape runs on a Linux operating system (a distribution called Rocks), and it is a remote cluster, so in order to access Grape remotely you will need to learn how to use Linux command lines (or simply, commands), and make sure you have remote access to it and to its text editors (emacs, vi, vim, etc.). As mentioned, this should be trivial if you are using either a Linux or a Mac machine. If you are a Windows PC user, you can install PuTTY (enough for accessing Grape remotely) or Cygwin (providing remote access capability as well as Linux functionality). If you haven't had any chance to work on Linux/Unix type operating systems, please make sure you first familiarize yourself with basic Linux/Unix commands (this is very crucial!). It is your prime responsibility to learn to work in a Linux/Unix environment as quickly as possible; you should trust that you can do it! The instructor is happy to help you, but is bound to be limited in providing detailed technical support at all levels.

Note: If you prefer to use your own laptop/desktop to program, it is your responsibility to install Fortran and all the libraries we will be using. You should be able to find a free Fortran 90 compiler (e.g., GNU gfortran) for most platforms.
The most basic one will be sufficient for this class, but please be aware that the quality of a compiler has a lot to do with the execution speed of your program.

Fortran 90 and Code Debugging

Fortran 90

You are encouraged to master basic programming skills in Fortran 90. Please take a quick look at the three lecture notes on the Fortran 90 tutorial by Prof. Pascale Garaud attached at the end of the chapter. Please make sure you cover the tutorial at least up to part 2.

Further reading is available online.

Debugging Tools

One very old-school way of debugging is to insert print statements where needed and examine the values produced by your code. This is the easiest and cheapest debugging trick, and it works well in general. However, there are more complicated situations where you can no longer rely on print statements to debug your code; a segmentation fault is one such case. If this occurs you should use more advanced debugging tools. GDB, the GNU Project debugger, is one of the best debuggers; it allows you to look inside a program while it executes, and it is freely available. Other options include commercial tools such as TotalView.

Note: On Mac OS X, it is better to use LLDB instead of GDB. LLDB is the default debugger in Xcode on Mac OS X. To get it, you can simply install Xcode on your Mac together with its command line tools.

Basic Commands in Linux/Unix Operating Systems

In this section we give a quick overview of some basic Linux commands, assuming you are working on Grape.

Remote login via SSH

First of all, you will need an SSH (secure shell) client in order to access one of the cluster machines (i.e., computing resources such as Grape) remotely.

* If you're a PC user, you can download PuTTY ( sgtatham/putty/download.html).
* If you're a Mac or Linux user, you can simply use the terminal that is already available to you. For example, on a Mac, go to Applications, then Utilities, and open the Terminal application. On Linux, a terminal can generally be found under System.

The next step is to log yourself in to the AMS machine, the Grape Linux cluster. In order to log in to the cluster, you use the ssh command from the terminal we just mentioned. On the command line, type

ssh -X your_name@grape.soe.ucsc.edu

with your SOE login password. At this step, you are logging into a master node, or login node. As you log in to the master node for the first time, you are asked to enter a passphrase. You can enter a very secure passphrase if you wish, or you can simply press enter. This process generates so-called SSH keys, which are a way to identify trusted computers (i.e., the rest of the compute nodes) without having to enter your password every time you run parallel jobs on those compute nodes. Note that running a parallel job means running multiple jobs on multiple processors, which would otherwise require you to log in to the requested compute nodes with a password. The SSH key generation saves you from this. Your login is successful if you see something like the following on your terminal:

Basic Linux Commands

There are a few rules for using command lines in Linux. Several important ones are:

* Commands are case-sensitive.
* Make sure you always log yourself out by typing exit when you're done.
* The Linux command line enables you to create complex functions by combining built-in commands. This capability gives you countless ways to make your commands work in various different ways.

Exercise 1: Please run matlab by typing the command matlab at a command prompt.

Exercise 2: Please exit your current session and try to log in again without the -X option. Please run matlab again. Is there any difference from the case with -X? You can use -Y instead of -X.

Answer to Exercises 1 & 2: The -X or -Y option enables X forwarding in SSH, which provides you not only a command line interface on the console, but also a variety of graphical user interfaces (GUIs). With the X forwarding option at login, you can enjoy a full set of display functions remotely.

Here you are introduced to very basic Linux command lines. For more comprehensive study, you can display a manual page using the man command, or you can come ask the instructor for more help and resources.

Managing Files

$ ls                       list your files
$ ls -l                    ls in long format
$ ls -a                    ls all files
$ mv filename filename2    rename filename to filename2
$ mv filename dirname      move filename to a directory called dirname
$ cp filename filename2    copy filename to filename2
$ rm filename              remove a file
$ more filename            display the contents of a file, as much as will fit on your screen
$ less filename            similar to more, with extended navigation allowing both forward and backward movement
$ wc filename              tell you the number of lines, words and characters in a file
$ touch filename           create an empty file (multiple filenames after the touch command will create multiple empty files)

Exercise 3: See if you can find ls -l and ls -a when you execute man ls.
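As a quick illustration, the file-management commands above can be combined into a short session. This is only a sketch; the directory and file names are made up for the example, and everything happens in a scratch directory so nothing important is touched:

```shell
# work in a scratch directory so nothing important is touched
mkdir -p scratch_demo && cd scratch_demo

# create a file and put two lines in it
touch notes.txt
printf 'first line\nsecond line\n' > notes.txt

# copy it, rename the copy, and list what we have
cp notes.txt backup.txt
mv backup.txt notes_backup.txt
ls -l

# count lines, words, and characters
wc notes.txt

cd ..
```

Running `wc notes.txt` here reports 2 lines; try `man wc` to see what the other two numbers mean.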

Managing Directories

$ mkdir dirname    create a new directory called dirname
$ cd dirname       change directory, meaning you go to a directory called dirname
$ pwd              tell you where you currently are in the directory tree
$ rmdir dirname    delete an empty directory

Exercise 4: Create a directory called dira, then, under dira, create an empty file named filenamenull.

Exercise 5: Delete the file filenamenull. Also delete the directory dira using the rm command. Hint: Please look up the man page of rm and find a useful option for directory deletion.

Editors

The following text editors are available on Grape. You can choose whichever you want to use.

(1) vim (or vi):
To start: $ vi filename
To edit: enter i and start inserting text until <Esc> is hit
To delete a single character: x
To delete the entire current line: dd
To exit: pressing the <Esc> key followed by :x <Return> or :wq <Return> will quit vi, saving the content to filename, whereas pressing the <Esc> key followed by :q! <Return> will quit vi without saving the latest changes to filename
See more basic commands for vi online.

(2) emacs:
To start: $ emacs -nw filename
To edit: unlike vi, you can type any characters directly into the editor
To save the current buffer: hold down the Control key and type x and s
To save the current buffer under a different file name: hold down the Control key and type x and w, then enter the new file name
To exit the buffer: hold down the Control key and type x and c
See more basic commands for emacs online.

Running a short program in Fortran

Again, please use your preferred text editor and implement the following short Fortran program:

program hello
  real :: n, m
  integer :: i, j
  i = 0

  j = 204
  n = real(i)
  m = 59.e0
  print *, "i+j=", i+j
  print *, "n-m=", n-m
  print *, "Hello World"
end program hello

Save it as hello.f90. Now you are going to compile it in order to generate an executable binary. On Grape, you can do this using the gfortran compiler:

$ gfortran hello.f90

After compiling your program, you should see an executable binary file with the default name a.out. Run it by entering the command line:

$ ./a.out

Exercise 6: What does your result look like when you run hello.f90?

Exercise 7: Can you give the executable a different name instead of the default a.out?

Note: Other useful online references on Linux/Unix commands are available on the web.

1.5.2 Visualization tools

For most of the course work it is sufficient to visualize your outputs using gnuplot and/or matlab, which are both installed and available on Grape. Both tools are good for reading and plotting the general ASCII-format data of your program outputs (also see Prof. Garaud's notes in the Fortran 90 tutorial). You can explore other options for visualization tools, such as idl, python, matplotlib (matplotlib.org), yt (yt-project.org), Tecplot, VisIt, etc.

1.6 Version Control Systems: Subversion vs. Git, and Bitbucket

Parts of this section of the lecture notes have been extracted and modified from other sources. In this class we will use git for

homework submission, take-home exam submission, final coding project submission, lecture note updates, and all the other electronic file transfers needed for the course work between you and the instructor. See below for more information on using git and the repositories required for this class. There are many other version control systems that are currently popular, such as CVS, Subversion, Mercurial, and Bazaar.

Version control systems were originally developed to aid in the development of large software projects with many authors working on inter-related pieces. The basic idea is that when you want to work on a file (one piece of the code), you check it out of a repository, make changes, and then check it back in when you're satisfied. The repository keeps track of all changes (and who made them) and can restore any previous version of a single file or of the state of the whole project. It does not keep a full copy of every file ever checked in; it keeps track of the differences (diffs) between versions, so if you check in a version that has only one line changed from the previous version, only the characters that actually changed are recorded. It sounds like a hassle to be checking files in and out, but there are a number of advantages that make version control an extremely useful tool even for your own projects, when you are the only one working on something. Once you get comfortable with it you may wonder how you ever lived without it.

1.6.1 Advantages

You can revert to a previous version of a file if you decide the changes you made are incorrect. You can also easily compare different versions to see what changes you made, e.g. where a bug was introduced.

If you use a computer program and some set of data to produce results for a publication, you can check in exactly the code and data used.
If you later want to modify the code or data to produce new results, as generally happens with computer programs, you still have access to the first version without having to archive a full copy of all files for every experiment you do. Working in this manner is crucial if you want to be able to reproduce earlier results later, as is often necessary if you need to tweak the plots to some journal's specifications, or if a reader of your paper wants to know exactly what parameter choices you made to get a certain set of results. This is an important aspect of doing reproducible research, as should be required in science. If nothing else, you can save yourself hours of headaches down the road trying to figure out how you got your own results.

If you work on more than one machine, e.g. a desktop and a laptop, version control systems are one way to keep your projects synced up between machines.

1.6.2 Two Types of Version Control Systems

Client-server systems

The original version control systems all used a client-server model, in which there is one computer that contains the repository and everyone else checks code into and out of that repository. Systems such as CVS and Subversion (svn) have this form. An important feature of these systems is that only the repository has the full history of all changes made. There is a Software Carpentry webpage on version control that gives a brief overview of client-server systems.

Distributed systems

Git, and other systems such as Mercurial and Bazaar, use a distributed model in which there is not necessarily a master repository. Any working copy contains the full history of changes made to that copy. The best way to get a feel for how git works is to use it, for example by following the instructions in the following section.

1.7 Git and Bitbucket

Instructions for cloning the class repository

All of the materials for this class, including homework assignments, sample programs, and lecture notes, are in a Git repository hosted at Bitbucket (user dongwook59, repository ams213_spring2015). In addition to viewing the files there, you can also view changesets, issues, etc. To obtain a copy, simply move to the directory where you want your copy to reside (assumed to be your home directory below) and then clone the repository:

$ cd yourdir
$ git clone dongwook59/ams213_spring2015.git ./

Note: There is no (white) space in the above git command line.

Note: At this point, it is assumed you have git installed; otherwise, install it first.

Note: The clone command will download the entire repository as a new subdirectory called ams213_spring2015, residing in your home directory. If you want ams213_spring2015 to reside elsewhere, you should first cd to that directory.

Updating your clone

The files in the class repository will change as the quarter progresses: new notes, sample programs, and homework sets will be added. In order to bring these changes over to your cloned copy, all you need to do is:

$ cd ams213_spring2015
$ git fetch origin
$ git merge origin/master

The git fetch command instructs git to fetch any changes from origin, which points to the remote Bitbucket repository that you originally cloned from. In the merge command, origin/master refers to the master branch in this repository (which is the only branch that exists for this particular repository). This merges any changes retrieved into the files in your current working directory. The last two commands can be combined as:

$ git pull origin master

or simply:

$ git pull

because origin and master are the defaults.

Creating your own Bitbucket repository

In addition to using the class repository, students in AMS 213 are also required to create their own repository on Bitbucket. It is possible to use git for your own work without creating a repository on a hosted site such as Bitbucket, but there are several reasons for this requirement:

You should learn how to use Bitbucket for more than just pulling changes.

You will use this repository to submit your solutions to homework sets, the exam, and the final term project.

You will need to give the instructor permission to clone your repository so that I can grade your work (others will not be able to clone or view it unless you also give them permission).
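Before moving on: the clone/fetch/merge cycle described above can be rehearsed entirely offline by using an ordinary local directory as the "remote". This is only a sketch, not the class setup: the directory names are made up, git must be installed (version 2.28 or later for the -b flag), and a plain directory path stands in for the Bitbucket URL:

```shell
# a throwaway "upstream" repository standing in for the remote
git init -q -b master upstream_repo
git -C upstream_repo config user.name "Demo User"        # identity just for this example
git -C upstream_repo config user.email "demo@example.com"
echo "first note" > upstream_repo/notes.txt
git -C upstream_repo add notes.txt
git -C upstream_repo commit -q -m "initial commit"

# clone it, just as you would clone the class repository
git clone -q upstream_repo my_clone

# a new commit appears upstream...
echo "second note" >> upstream_repo/notes.txt
git -C upstream_repo commit -q -am "add a second note"

# ...and the clone picks it up with fetch + merge (equivalently: git pull)
git -C my_clone fetch -q origin
git -C my_clone merge -q origin/master
```

After the final merge, my_clone/notes.txt contains both lines and the clone's log shows both commits, exactly as it will when the instructor pushes new material to the real class repository.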

It is recommended that after the class ends you continue to use your repository as a way to back up your important work on another computer (with all the benefits of version control too!). At that point, of course, you can change the permissions so the instructor no longer has access.

Below are the instructions for creating your own repository. Note that this should be a private repository so that nobody can view or clone it unless you grant permission. Anyone can create a free private repository on Bitbucket. Note that you can also create an unlimited number of public repositories for free at Bitbucket, which you might want to do for open source software projects, or for classes like this one. (To make free open-access repositories that can be viewed by anyone, GitHub is recommended; it allows an unlimited number of open repositories and is widely used for open source projects.) An article comparing Bitbucket and GitHub can be found online.

To get started, follow the directions below exactly. Doing this will be part of your first homework assignment. We will clone your repository, check that a README has been created, and see whether your information is correct.

On the local machine you're working on:

$ git config --global user.name "Your Name"
$ git config --global user.email you@example.com

These will be used when you commit changes. If you don't do this, you might get a warning message the first time you try to commit.

Go to the Bitbucket site and click on Sign up now if you don't already have an account. Fill in the form, and make sure you remember your username and password. You should then be taken to your account. Click on Create next to Repositories.

You should now see a form where you can specify the name of a repository and a description. The repository name need not be the same as your user name (a single user might have several repositories).
For example, the class repository is named ams213_spring2015, owned by user dongwook59. To avoid confusion, you should probably not name your repository ams213_spring2015. You should stick to lower case letters and numbers in your repository name, with no upper case letters or special symbols.

Don't name your repository homework1, because you will be using the same repository for other homework assignments later in the quarter.

Make sure you click on Private at the bottom. Also turn Issue tracking and Wiki on if you wish to use these features. Click on Create repository.

You should now see a page with instructions on how to clone your (currently empty) repository. In a Linux terminal window, cd to the directory where you want your cloned copy to reside, and perform the clone by typing in the clone command shown. This will create a new directory with the same name as the repository. You should now be able to cd into the directory this created. The directory you are now in will appear empty if you simply do:

$ ls

But try:

$ ls -a
./  ../  .git/

The -a option causes ls to list files starting with a dot, which are normally suppressed. The directory .git is the directory that stores all the information about the contents of this directory and a complete history of every file and every change ever committed. You shouldn't touch or modify the files in this directory; they are used by git.

Add a new file to your directory:

$ vi testfile.txt

and enter the two lines

This is a new file
with only two lines so far

then save the change and exit vi. Here we assume you know how to use the vi editor to add new lines and save the change.

Type:

$ git status -s

The response should be:

?? testfile.txt

The ?? (which appears in red) means that this file is not under revision control. The -s flag results in this short status list; leave it off for more information. To put the file under revision control, type:

$ git add testfile.txt
$ git status -s
A  testfile.txt

The A (which appears in green in the first column, the staging area) means the file has been added. However, at this point we have not yet taken a snapshot of this version of the file. To do so, type:

$ git commit -m "My first commit of a test file."

The string following the -m is a comment about this commit that may help you remember why you committed new or changed files. You should get a response like:

[master 8672a65] My first commit of a test file.
 1 file changed, 2 insertions(+)
 create mode 100644 testfile.txt

We can now see the status of our directory via:

$ git status
# On branch master
nothing to commit (working directory clean)

Alternatively, you can check the status of a single file with:

$ git status testfile.txt

You can get a list of all the commits you have made (only one so far) using:

$ git log
commit 874a3a59794e9889f5f4bfe4a29bf8596
Author: dongwook59 <dlee79@ucsc.edu>
Date: Fri Mar 27 5:3:

    my first git commit of a readme file
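The long identifier that git log prints as the commit name is a SHA-1 digest. You can get a feel for SHA-1 digests with the standard sha1sum utility; this is only an illustration on a raw string (git hashes the snapshot content plus a small header, so it will not reproduce a commit id):

```shell
# hash a fixed string; a SHA-1 digest is 160 bits, i.e. 40 hex characters
printf 'hello' | sha1sum

# isolate the digest field and count its characters (40, plus a trailing newline)
printf 'hello' | sha1sum | cut -d' ' -f1 | wc -c
```

However long or short the input, the digest always has exactly 40 hexadecimal digits, which is why git can use it as a fixed-width, effectively unique name for every commit.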

The number 874a3a59794e9889f5f4bfe4a29bf8596 above is the name of this commit, and you can always get back to the state of your files as of this commit by using this number. You don't have to remember it; you can use commands like git log to find it later. Yes, this is a number... it is a 40-digit hexadecimal number, meaning it is in base 16, so in addition to 0, 1, 2, ..., 9 there are 6 more digits a, b, c, d, e, f representing 10 through 15. This number is almost certainly guaranteed to be unique among all commits you will ever do (or anyone has ever done, for that matter). It is computed from the state of all the files in this snapshot as a SHA-1 hash.

Modifying a file

Now let's modify this file:

$ vi testfile.txt

Add a third line, save the change, and exit vi. Now try the following:

$ git status -s
 M testfile.txt

The M in the second column (the working directory column) indicates that this file has been modified relative to the most recent version that was committed. To see what changes have been made, try:

$ git diff testfile.txt

This will produce something like:

diff --git a/testfile.txt b/testfile.txt
index d80ef00..fe
--- a/testfile.txt
+++ b/testfile.txt
@@ -1,2 +1,3 @@
 This is a new file
 with only two lines so far
+Adding a third line

The + in front of the last line shows that it was added. The two lines before it are printed to show the context. If the file were longer, git diff would only print a few lines around any change to indicate the context.
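The unified-diff format that git diff prints can also be produced for any pair of files with the standard diff utility. A minimal sketch (the file names and contents are made up for the example):

```shell
# two versions of a small file
printf 'This is a new file\nwith only two lines so far\n' > old.txt
printf 'This is a new file\nwith only two lines so far\nAdding a third line\n' > new.txt

# -u selects the unified format used by git diff;
# diff exits with status 1 when the files differ, so '|| true' keeps a script going
diff -u old.txt new.txt > changes.patch || true

cat changes.patch
```

The output has the same shape as the git diff above: a --- / +++ header, an @@ hunk marker giving the line ranges, unchanged context lines, and a leading + on each added line.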

Now let's try to commit this changed file:

$ git commit -m "added a third line to the test file"

This will fail! You should get a response like this:

# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working
#   directory)
#
#   modified: testfile.txt
#
no changes added to commit (use "git add" and/or "git commit -a")

git is saying that the file testfile.txt is modified but that no files have been staged for this commit. If you are used to Mercurial, git has an extra level of complexity (but also flexibility): you can choose which modified files will be included in the next commit. Since we only have one file, there will not be a commit unless we add it to the index of files staged for the next commit:

$ git add testfile.txt

Note that the status is now:

$ git status -s
M  testfile.txt

This is different in a subtle way from what we saw before: the M is in the first column (the staging area) rather than the second (the working directory), meaning the modification to testfile.txt has been both made and staged. Furthermore, if you make more modifications to testfile.txt after staging it (i.e., after $ git add testfile.txt), you will have two M's, appearing in both the first and second columns:

$ git status -s
MM testfile.txt

You can remove the second M by doing one more staging:

$ git add testfile.txt

We can get more information if we leave off the -s flag:

$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   testfile.txt
#

Now testfile.txt is on the index of files staged for the next commit. Now we can do the commit:

$ git commit -m "added a third line to the test file"
[master 598d7] added a third line to the test file
 1 file changed, 1 insertion(+)

Try doing git log now and you should see something like:

commit 598d7ea4a63da6ab42b3c03f66cbca56085
Author: dongwook59 <dlee79@ucsc.edu>
Date: Fri Mar 27 6:3:

    added a third line to the test file

commit 28a4da5a0deb04b32a0f2fd08f78e43d6bd9e9dd
Author: dongwook59 <dlee79@ucsc.edu>
Date: Fri Mar 27 6:3:

    My first commit of a test file.

If you want to revert your working directory back to the first snapshot you could do:

$ git checkout 28a4da5a0de
Note: checking out '28a4da5a0de'.

You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.

HEAD is now at 28a4da5... My first commit of a test file.

Take a look at the file; it should be back to the state with only two lines. Note that you don't need the full SHA-1 hash code: the first few digits are enough to uniquely identify it. You can go back to the most recent version with:

$ git checkout master
Switched to branch 'master'

We won't discuss branches, but unless you create a new branch, the default name for your main branch is master, and this checkout command just goes back to the most recent commit.

So far you have been using git to keep track of changes in your own directory, on your computer. None of these changes have been seen by Bitbucket, so if someone else cloned your repository from there, they would not see testfile.txt. Now let's push these changes back to the Bitbucket repository. First do:

$ git status

to make sure there are no changes that have not been committed. This should print nothing. Now do:

$ git push -u origin master

This will prompt for your Bitbucket password and should then print something indicating that it has uploaded these two commits to your Bitbucket repository. Not only has it copied the file over, it has added both changesets, so the entire history of your commits is now stored in the repository. If someone else clones the repository, they get the entire commit history and could revert to any previous version, for example. To push future commits to Bitbucket, you should only need to do:

$ git push

and by default it will push your master branch (probably the only branch you have) to origin, which is the shorthand name for the place you originally cloned the repository from. To see where this actually points to:

$ git remote -v

This lists all remotes. By default there is only one, the place you cloned the repository from. (Or none if you created a new repository using git init rather than cloning an existing one.)

Check that the file is in your Bitbucket repository: go back to the web page for your repository and click on the Source tab at the top. It should display the files in your repository and show testfile.txt. Now click on the Commits tab at the top. It should show that you made two commits and display the comments you added with the -m flag with each commit. If you click on the hex string for a commit, it will show the change set for this commit. What you should see is the file in its final state, with three lines. The third line should be highlighted in green, indicating that this line was added in this changeset. A line highlighted in red would indicate a line deleted in this changeset. This is enough for now! Feel free to experiment further with your repository at this point.

Using git to stay synced up on multiple computers

If you want to use your git repository on two or more computers, staying in sync via Bitbucket should work well. To avoid having merge conflicts or missing something on one computer because you didn't push it from the other, here are some tips:

When you finish working on one machine, make sure your directory is clean (using git status) and, if not, add and commit any changes. Use git push to make sure all commits are pushed to Bitbucket.

When you start working on a different machine, make sure you are up to date by doing:

$ git fetch origin        # fetch changes from bitbucket
$ git merge origin/master # merge into your working directory

These can probably be replaced by simply doing:

$ git pull

but for more complicated merges it's recommended that you do the steps separately, to have more control over what's being done, and perhaps to inspect what was fetched before merging. If you do this in a clean directory that was pushed since you made any changes, then this merge should go fine without any conflicts.

1.8 Motivations and Needs for Scientific Computing in the Real World: Computational Fluid Dynamics (CFD)

Let's begin our first class with a couple of interesting scenarios.

Scenario 1: See Fig. 1. Consider you're a chief scientist in a big aerospace research lab. You're given a mission to develop a new aerospace plane that can reach hypersonic speed (> Mach 5) within minutes after taking off. Its powerful supersonic combustion ramjets continue to propel the aircraft even faster, to a velocity near 26,000 ft/s (or 7.92 km/s, or Mach 25.4 in air at high altitudes, or a speed of NY to LA in 10 min), which is simply a low Earth orbital speed. This is the concept of the transatmospheric vehicle, the subject of study in several countries during the 1980s and 1990s. When designing such extreme hypersonic vehicles, it is very important to understand the full three-dimensional flow field over the vehicle with great accuracy and reliability. Unfortunately, ground test facilities (wind tunnels) do not exist for all the flight regimes of such hypersonic flight. Nor do we have wind tunnels that can simultaneously simulate the high Mach numbers and high flow-field temperatures to be encountered by transatmospheric vehicles.

Scenario 2: See Fig. 2. Consider you're a theoretical astrophysicist who tries to understand core-collapse supernova explosions. The theory tells us that very massive stars can undergo core collapse when the core fails to sustain itself against its own gravity due to unstable behavior of nuclear fusion.
We simply cannot find any ground facilities that allow us to conduct laboratory experiments in such highly extreme, energetic astrophysical conditions. It is also true that in many astrophysical circumstances, both the temporal and spatial scales are too huge to be reproduced in laboratory environments.

Scenario 3: See Fig. 3. Consider you're a golf ball manufacturer. Your goal is to understand flow behaviors over a flying golf ball in order to make a better golf ball design (and become a millionaire!). Although you've already collected a wide range of laboratory experimental data on a set of golf ball shapes (i.e., surface dimple designs), you realize that it is very hard to analyze and understand the data, because the effects are all nonlinearly coupled and can't be isolated easily. To keep your study better organized, you wish to perform a set of parameter studies, controlling flow properties one by one, so that you can also make reliable flow predictions for a new golf ball design.

25 25 Figure. DARPA s Falcon HTV-2 unmanned aircraft can max out at a speed of about 6,700 miles per hour Mach 22, NY to LA in 2 minutes. As briefly hinted above, in practice there are various levels of difficulties encountered in real experimental setups. When performing the above mentioned research work, CFD therefore can be the major player that leads you to success because you obtain mathematical controls in numerical simulations. Let us take an example how numerical experiment via CFD can elucidate physical aspects of a real flow field. Consider the subsonic compressible flow over an airfoil. We are interested in answering the differences between laminar and turbulent flow over the airfoil for Re = 0 5. For the computer program (assuming the computer algorithm is already well established, validated and verified!), this is a straightforward matter it is just a problem of making one run with the turbulence model switched off (for the laminar setup), another run with the turbulence model switched on (for the turbulent flow), followed by a comparison study of the two simulation results. In this way one can mimic Mother Nature with simple knobs in the computer program something you cannot achieve quite readily (if at all) in the wind tunnel. Without doubt, however, in order to achieve such success using CFD, you d better to know what you do exactly when it comes to numerical modeling the main goal of this course. We are now ready to define what CFD is. CFD is a scientific tool, similar to experimental tools, used to gain greater physical insights into problems of interest. It is a study of the numerical solving of PDEs on a discretized system that, given the available computer resources, best approximates the real geometry and fluid flow phenomena of interests. CFD constitutes a new third approach in studying and developing the whole discipline of fluid dynamics. 
A brief history of fluid dynamics says that the foundations of experimental fluid dynamics were laid in the 17th century in England and France. In the 18th and 19th

Figure 2. FLASH simulations of neutrino-driven core-collapse supernova explosions. Sean Couch (ApJ, 775, 35 (2013)).

centuries in Europe, there was the gradual development of theoretical fluid dynamics. These two branches of fluid dynamics, experiment and theory, have been the mainstream throughout most of the twentieth century. However, with the advent of the high-speed computer, together with the development of solid numerical methods, solving physical models using computer simulations has revolutionized the way we study and practice fluid dynamics today: the approach of CFD. As sketched in Fig. 4, CFD plays a truly important role in modern physics as an equal partner with theory and experiment, in that it helps bring deeper physical insight to theory as well as helps better design experimental setups. The real-world applications of CFD are to those problems that do not have known analytical solutions; CFD is a scientific vehicle for solving flow problems that cannot be solved in any other way. For this reason, given that we use CFD to tackle such unknown systems, we are strongly encouraged to learn thorough aspects of all three essential areas of study: (i) numerical theories, (ii) fluid dynamics, and (iii) computer programming skills.

1.9 Properties of Machine Arithmetic

Numerical analysis is concerned with the creation and study of algorithms dedicated to solving particular mathematical problems. The recent advent of the rapidly

Figure 3. Contours of azimuthal velocity over a golf ball: (a) Re = ; (b) Re = . C. E. Smith et al. (Int. J. Heat and Fluid Flow, 31 (2010)).

evolving state of computing hardware allows the field of numerical analysis to evolve rapidly too. For instance, these days we find that constraints on memory storage are less severe than they used to be, albeit one still has to put a lot of effort into designing numerical algorithms that use memory as efficiently as possible. Another good example is parallel computing, where one can explore different levels of parallelism, including both coarse-grained and fine-grained parallelism, as well as embarrassing parallelism. In pursuing numerical analysis we face many types of challenging problems that can be investigated. Among many interesting topics, we will focus on only two parts in this course: numerical methods for linear algebra, and numerical methods for solving ODEs and PDEs. As we discuss algorithms, we will always bear in mind that there are four standard concerns of numerical analysis: numerical stability, solution accuracy, the algorithm's performance (speed), and memory usage.

Figure 4. The healthy cyclic relationship among the three branches of fluid dynamics.

1.9.1 Computer representation of numbers

The first two concerns of numerical analysis (i.e., stability and accuracy) are indirectly (sometimes directly) related to the way numbers are encoded by computers, in units of bytes (recall 1 byte = 8 bits). Most computers have different modes for representing integers and real numbers: integer mode and floating-point mode, respectively. Let us now take a look at these modes.

1.9.2 Integer mode

The representation of an integer number is (usually) exact. Recall that we can represent an integer as a sequence of digits from 0 to 9, for instance, as an expansion in base 10:

Example: Base 10

a_n a_{n-1} ... a_0 = a_n 10^n + a_{n-1} 10^{n-1} + ... + a_0,   (1.1)

59 = 5 x 10^1 + 9 x 10^0.   (1.2)

However, the number base used in computers is seldom decimal (base 10), but instead binary (base 2), where one bit is either 0 or 1. In binary

form, any positive integer can be written as

Example: Base 2

a_n a_{n-1} ... a_0 = a_n 2^n + a_{n-1} 2^{n-1} + ... + a_0,   (1.3)

59 = 32 + 16 + 8 + 2 + 1 = 111011_2.   (1.4)

When stored this way, a signed integer is typically kept in a finite number of bytes, usually using 1 bit for the sign (though other conventions also exist). In Fortran there are two common ways to represent integers, normal and long types. The normal integers are stored in 4 bytes, or equivalently 32 bits, where one bit is reserved for the sign and the remaining 31 bits for the value itself. On the other hand, the long integers are stored in 8 bytes, equivalent to 64 bits, where one bit is for the sign and the remaining 63 bits for the value. As a consequence, for a given integer type there are only finitely many integers which can be used in programming:

for normal 4-byte integers: between -2^31 and 2^31 - 1,
for long 8-byte integers: between -2^63 and 2^63 - 1.

This means that any attempt to reach numbers beyond these values will cause problems. Note that 2^31 is only about two billion, which is not so big a number.

1.9.3 Floating-point mode

The base 10 (decimal) notation for real numbers can be written as

a_n a_{n-1} ... a_0 . b_1 b_2 ... b_m = a_n 10^n + a_{n-1} 10^{n-1} + ... + a_0 + b_1 10^{-1} + b_2 10^{-2} + ... + b_m 10^{-m},   (1.5)

and by analogy we write a real number in base 2 (binary) as

a_n a_{n-1} ... a_0 . b_1 b_2 ... b_m = a_n 2^n + a_{n-1} 2^{n-1} + ... + a_0 + b_1 2^{-1} + b_2 2^{-2} + ... + b_m 2^{-m}.   (1.6)

Definition: We note that we can only store finitely many of the a_i and b_j, as every computer has a finite storage limit. This implies that there are cases when real numbers can only be approximated with finitely many combinations of a_i and b_j. The error associated with this approximation is called roundoff error.
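The base-2 expansion and the 4-byte integer bounds are easy to verify directly. Here is a quick check in Python (not part of the original notes; note that Python's own integers are arbitrary precision, so the 32-bit and 64-bit bounds below are simply computed, not enforced by the language):

```python
# Base-2 expansion of 59: 59 = 32 + 16 + 8 + 2 + 1 = 111011_2.
assert bin(59) == "0b111011"
assert 59 == 2**5 + 2**4 + 2**3 + 2**1 + 2**0

# Bounds of a signed 4-byte integer: 1 sign bit + 31 value bits.
print(-2**31, 2**31 - 1)   # -2147483648 2147483647

# Bounds of a signed 8-byte (long) integer: 1 sign bit + 63 value bits.
print(-2**63, 2**63 - 1)
```

Exceeding these bounds in a fixed-width integer type (e.g., Fortran's normal 4-byte INTEGER) wraps around or traps, which is exactly the "reaching numbers beyond these values" problem mentioned above.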

Standard notations

In fact, the numbers are not stored as written above. Rather, they are stored as

2^n (a_n + a_{n-1} 2^{-1} + ... + a_0 2^{-n} + b_1 2^{-n-1} + ... + b_m 2^{-n-m}),   (1.7)

or

2^{n+1} (a_n 2^{-1} + a_{n-1} 2^{-2} + ... + a_0 2^{-n-1} + b_1 2^{-n-2} + ... + b_m 2^{-n-m-1}).   (1.8)

In the first case a_n can be chosen to be nonzero by assumption, which necessarily gives a_n = 1. The first is referred to as the IEEE standard and the second as the DEC standard.

Example: The representation of 22.5 in base 2 becomes

22.5 = 16 + 4 + 2 + 0.5 = 10110.1_2,   (1.9)

which can be written in the two ways just shown above:

22.5 = 2^4 x (1.01101_2)  IEEE standard,
     = 2^5 x (0.101101_2)  DEC standard.   (1.10)

Definition: In general, this takes the form of

x = 2^k f,   (1.11)

where k and f are called the exponent and mantissa, respectively.

Standard Fortran storage types: double vs. single precision

There are two standard storage types available in Fortran. In addition to these, one can define any new type as needed in Fortran 90 and above. The two standard storage types are:

single precision: type REAL(SP), storage in 4 bytes (i.e., 32 bits = 1 bit for the sign + 8 bits for the exponent + 23 bits for the mantissa),
double precision: type REAL(DP), storage in 8 bytes (i.e., 64 bits = 1 bit for the sign + 11 bits for the exponent + 52 bits for the mantissa).

Note: The bits in the exponent store integers from L to U, where, usually, L is the lowest exponent, a negative integer, and U is the highest exponent, a positive integer, with

U - L ~ 2^8 for single precision, 2^11 for double precision,   (1.12)

where

L = -126 for single precision, -1022 for double precision,   (1.13)

and

U = 127 for single precision, 1023 for double precision.   (1.14)

Floating-point arithmetic and roundoff errors

Arithmetic using floating-point numbers is quite different from real arithmetic. In order to understand this, let's work in base 10 for simplicity. If we represent pi in, say, the DEC standard, a representation with 2 significant digits is 0.31 x 10^1, and a representation with 6 significant digits is 0.314159 x 10^1. For a given number of significant digits (i.e., a given length of mantissa), the distance between two consecutive representable numbers depends on the value of the exponent. Let us consider the following example.

Example: Suppose we have 3 possible values of the exponent, k = -2, -1, and 0, and 2 digits for the mantissa. The positive numbers we can possibly create under these conditions are

0.0010, 0.0011, ..., 0.0099,  all separated by 10^-4,
0.010, 0.011, ..., 0.099,  all separated by 10^-3,
0.10, 0.11, ..., 0.99,  all separated by 10^-2.
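The same two facts, the exponent/mantissa decomposition x = 2^k f and the exponent-dependent spacing, can be observed directly in IEEE double precision. Here is a small Python check (not part of the original notes; math.ulp requires Python 3.9+, and the value 22.5 is just a convenient example):

```python
import math

# DEC-style normalization x = 2**k * f with 0.5 <= f < 1:
# math.frexp returns exactly this (mantissa, exponent) pair.
f, k = math.frexp(22.5)            # 22.5 = 10110.1 in base 2
assert (f, k) == (0.703125, 5)     # 0.101101_2 = 0.703125, so 22.5 = 2**5 * 0.703125

# The spacing between consecutive doubles grows with the exponent:
for x in (0.001, 1.0, 1000.0):
    print(x, math.ulp(x))          # gap from x to the next representable number
assert math.ulp(1.0) == 2.0**-52
assert math.ulp(2.0) == 2 * math.ulp(1.0)   # spacing doubles when the exponent steps up
```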

Here we note that the continuous real line has been discretized, which inevitably introduces roundoff errors. Also, the discretization does not produce equidistant (uniform) spacing; instead, the non-uniform spacing depends on the absolute value of the numbers considered.

Note: Floating-point arithmetic is not associative, as a result of roundoff errors. To see this, let's consider a case with a 6-digit mantissa, and take three real numbers

a = 0.123456 x 10^6,   (1.18)
b = -0.123455 x 10^6,   (1.19)
c = 0.123456 x 10^1.   (1.20)

We see that the floating-point operations fail to preserve the associative rule,

(a + b) + c /= a + (b + c).   (1.21)

In the first case, we have

(a + b) + c = (0.123456 x 10^6 - 0.123455 x 10^6) + 0.123456 x 10^1
            = 0.1 x 10^1 + 0.123456 x 10^1 = 0.223456 x 10^1,   (1.22)

whereas the second case gives

a + (b + c) = 0.123456 x 10^6 + (-0.123455 x 10^6 + 0.123456 x 10^1)
            = 0.123456 x 10^6 - 0.123454 x 10^6   (the intermediate sum has more than 6 digits, hence must be rounded off)
            = 0.2 x 10^1.   (1.23)

As can be seen, the error on the calculation is huge! It is of the order of the discretization spacing for the largest of the numbers considered (i.e., a and b in this case).

Machine accuracy epsilon

A similar concept arises with the question: what is the largest number epsilon that can be added to 1 such that, in floating-point arithmetic, one still gets

1 + epsilon = 1?   (1.24)

Let's consider the following example.

Example: Consider a 6-digit mantissa. Then we have

1 = 0.100000 x 10^1,   (1.25)

and then

1 + 10^-7 = 0.10000001 x 10^1.   (1.26)

However, the last representation exceeds the 6-digit limit and hence needs to be rounded down to 0.100000 x 10^1, resulting in

1 + 10^-7 = 1.   (1.27)

This implies that the machine accuracy is epsilon ~ 10^-7.

Note: For floating-point arithmetic in base 2, with a mantissa of size m, we have epsilon ~ 2^-m, that is,

epsilon ~ 2^-23 ~ 1.2 x 10^-7 in real single precision,
epsilon ~ 2^-52 ~ 2.2 x 10^-16 in real double precision.   (1.28)

Overflow and underflow problems

There exist a smallest and a largest number (in absolute value) that can be represented in floating-point notation. For instance, let us suppose that the exponent k ranges from -4 to 4, and the mantissa has 8 significant digits. This gives us that the smallest possible positive number in base 10 is

x_min = 1.0000000 x 10^-4 = 10^-4,   (1.29)

and the largest possible number is

x_max = 9.9999999 x 10^4.   (1.30)

Therefore in general, in base 2, we have

x_min = 2^L = 2^-126 ~ 1.2 x 10^-38 in real single precision, 2^-1022 ~ 2.2 x 10^-308 in real double precision,   (1.31)

x_max ~ 2^U = 2^127 ~ 1.7 x 10^38 in real single precision, 2^1023 ~ 9.0 x 10^307 in real double precision.   (1.32)

If the outcome of a floating-point operation yields |x| < x_min, then an underflow error occurs. In this case x will usually be set to zero and the computation will continue. In contrast, if |x| > x_max, then an overflow error occurs, typically causing a fatal termination of the program.
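All three effects discussed in this section, the failure of associativity, the machine accuracy epsilon, and the overflow/underflow limits, can be reproduced in IEEE double precision. The following quick check is in Python rather than the course's Fortran (not part of the original notes; the numbers 1e16 and 1.0 are just convenient choices that trigger the rounding):

```python
import sys

# 1) Associativity failure: adding the small term to the huge one first loses it.
a, b, c = 1e16, -1e16, 1.0
assert (a + b) + c == 1.0      # a + b = 0 exactly, then + 1.0
assert a + (b + c) == 0.0      # b + c rounds back to -1e16, so the 1.0 is lost

# 2) Machine accuracy: halve eps until 1 + eps no longer exceeds 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
assert eps == sys.float_info.epsilon == 2.0**-52   # ~2.2e-16 for doubles

# 3) Overflow goes to inf; underflow eventually reaches zero.
assert sys.float_info.max * 2 == float("inf")      # overflow, ~1.8e308 doubled
assert sys.float_info.min == 2.0**-1022            # smallest normalized double
assert sys.float_info.min / 2**53 == 0.0           # underflows past the subnormals
print(eps, sys.float_info.min, sys.float_info.max)
```

Note that overflow in Python quietly produces inf rather than terminating the program; a Fortran run with floating-point exception trapping enabled would abort instead.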

Nic Brummell, Page 1, 10/6/2014

AMS machines

AMS has some decent sized machines!

NEW! 3 interactive analysis servers, 32 cores each (Dell PE R820: 4 x Intel Xeon Sandy Bridge E processor, each of which has 8 cores per cpu, 2.7 GHz, 16GB RAM, 1TB SATA hard drive). Machines are called jerez, muscat and mencia. Use for regular (non-parallel) analysis.

NEW! New section of the cluster described below: 8 new nodes, where each node has 16 cores (Dell PowerEdge R420: 2 x Intel Xeon Sandy Bridge E processor, each of which has 8 cores per cpu, 2.3 GHz, 8GB RAM, 500GB SATA hard drive) => total of 128 new cores. (All connected by an Infiniband switch now.) So the machine now has a total of 104 + 128 = 232 cores.

AMS cluster: GRAPE
o Use for parallel code
o AMS owns a cluster of machines called grape
o The cluster consists of a master node and 12 compute nodes. Each node actually has 8 operational CPUs (known as cores), so there are a total of 104 cores (not too shabby). The oldest nodes are 6 Dell PowerEdge 1950 (Intel Xeon 2.33 GHz, 16GB RAM), and the newer nodes are 6 Dell PowerEdge R610 (Intel Xeon 2.40 GHz, 16GB RAM)
o To log in to the cluster, use ssh your_username@grape.soe.ucsc.edu using your SOE login. When you log in as above, you log in to the master/login node.
o Operating system is Rocks (a cluster management system related to RedHat Linux)
o Fortran, C compilers, MPI, and all the usual licensed software of SOE (Matlab, IDL, LAPACK etc.) are available
o Your home directory is your usual home directory.
o All usual directories (e.g. /projects, /cse/faculty) are cross-mounted

o There are two sources of disk space: 2 Tbytes attached to the master node as /scratch, accessed from the compute nodes as /share/arbeit, and 1.4 Tbytes attached to the master node as /data, accessed from the compute nodes as /share/work. None of the disk space is backed up! Use at your own risk.
o To use the machine for any serious computing, you should log in to the master node and use the batch system (described below) to submit jobs to the compute nodes. Logging in to the individual nodes disrupts the load balance of the machine and is NOT recommended.
o The machine is basically split into three now. The first third of the nodes (compute-0-0 to compute-0-4) is made up of the 6 oldest nodes, with batch queue called orig. The second third (compute-0-5 to compute-0-11) is made up of the 6 medium-age nodes and is accessed through batch queue new. The last third (compute-0-12 through compute-0-19) is the newest nodes and is associated with the batch queue newest. Note that the oldest nodes do not allow hyperthreading (i.e., virtual threading: you cannot run more threads than there are cpus). There is also a queue called default. This uses all nodes across the Infiniband switch.

Compiling parallel programs with MPI: If you want to run Fortran/MPI on grape and multitask, you need to do the following steps:

1. Decide which compiler you are going to use and set the environment using modules. To see what compilers are available, do "module avail". To see what compiler is loaded by default, do "module list". To switch to a particular compiler, do "module switch <compiler>".
2. Compile your program using the related MPI compiler: mpif90 (mpif77, mpicc). This produces an <executable>.

Running parallel programs: For the convenience of all users, you should use the Portable Batch System (PBS) job scheduler (open source version: Torque/Maui) to run parallel programs.

Running batch jobs using PBS:

Make a run script text file called "jobfile" (for example) which contains the following (for example):

<start of file>
#PBS -S /bin/bash          <-- or whatever Linux shell you wish to use
#PBS -u username           <-- your username
#PBS -N name               <-- some name to identify the job
#PBS -l nodes=2:ppn=8      <-- number of nodes and processors per node
#PBS -l walltime=02:00:00  <-- amount of time requested (2 hours)
#PBS -V                    <-- passes all environment vars to processors (sometimes!)
#PBS -q debug              <-- queue: orig, new, newest, default
cd $PBS_O_WORKDIR          <-- changes directory to current submission dir
### insert here any Linux shell commands you need to set up and run
### e.g.
cp /home/brummell/code/nics_exec .
cp /home/brummell/test/inputfile .
### Then run the job
mpirun -hostfile $PBS_NODEFILE -np no_of_total_processors nics_exec
<end of file>

To submit the job, type "qsub jobfile". To examine the status of the job, type "qstat". To kill the job, do a "qstat -u <your_username>" to get the job ID number, and then type "qdel <job_id_no>".

Running interactive jobs using PBS: You can run interactive jobs too, e.g.

qsub -l nodes=2:ppn=8 -I

This starts an interactive session on 2 nodes using 8 processors per node, and you can then type interactive commands in the parallel environment. If you are not running a parallel program, you should use

qsub -l nodes=1:ppn=1 -I

PBS allows you to use the least busy nodes automatically, and to demand exclusive use of nodes (not the default). For more information on using PBS, look at the man pages, search Google, or here is a useful short summary of PBS concepts:

SOE machines

For interactive servers (e.g. Matlab use etc.):
For another general cluster:

Beginner Fortran 90 tutorial

1 Basic program structure in Fortran

A very basic program in Fortran contains:

The program statement (which tells the compiler where the program begins)
Variable declarations (which tell the compiler what variables will be used, and what type they are)
Instructions as to what to do
An end statement (which tells the compiler where the program ends)

This looks something like the following example:

program nameofprogram
implicit none
integer :: i,j,k
real :: x,y,z
x = 3.6
y = cos(x)
z = x + y
i = 3
j = i**2
k = i - j
end program nameofprogram

Exercise 1: Write this little program up in a text editor, save the file as myprogram.f90, then compile the code using the command:

gfortran myprogram.f90 -o myprog

If it returns any compiling error, try to read what the compiler says and correct the error. If it returns no error, this means that the compiler successfully turned the code into an executable called myprog. To run the code, type ./myprog at the prompt. What happens when you do that?

2 Outputting data

In the previous example, your code probably ran but has nothing to show for it: it did not print out any results that you could look at. To let the code actually print out the result, you have to tell it to do so. There are several options for outputting the results:

To print the results to the screen
To print the results to a file

In both cases, there are several ways of outputting the data: in normal text form, in compressed form, etc. Here, we will just focus on small problems where the data can be printed in text form.

To print something out, you have to use a write statement. This statement usually looks like write(X,Y) Z, where Z is a list of things to print, X tells the code where to print Z, and Y tells the code in what format to print Z. The most basic write statement is write(*,*) Z, which tells the code to write Z in its default format to the screen. For instance:

write(*,*) 'The value of x is ', x, ' and the value of y is ', y

will write the sentence "The value of x is", followed by the actual value of x, and then " and the value of y is" followed by the value of y, to the screen. Note the quotes around the sentences, and the commas separating each of the elements of the list of objects to print.

Exercise 2: In the program above, write out (whichever way you want) the values of i, j, k, x, y and z. Recompile the code, and execute it. Is the output what you expected?

Alternatively, you may want to write this information into a file. To do so, you first need to open the file, then write to the file, then close the file. At the most basic level, this is done by the following commands:

open(n,file=filename)
write(n,*) Z
close(n)

where n is any integer of your choice (larger than, say, 10) that will refer specifically to the file from the open statement to the close statement, and filename is the name of your file (defined as a string of characters, which should therefore be written in quotes).
As an example, we can write

open(10,file='mydata.dat')
write(10,*) 'The value of x is ', x, ' and the value of y is ', y

close(10)

Exercise 3: In the program above, write all the data to a file instead of the screen. Recompile the code, and execute it. Is the output what you expected?

3 Reading data

In many cases, you want to write a program that can be applied to different input data without having to recompile it each time. For instance, suppose that we wish, as in the code above, to take a value of x, compute its cosine, add the two together, and print out the result. But instead of writing the value of x in the code, we want to read it at run time, from a prompt, or from a file. To read information, the command is very similar to the write statement: it usually takes the form read(X,Y) Z. For instance, to read the value of x from a screen prompt, and then write it back to the screen, you could add the following commands to the code:

write(*,*) 'What is the value of x?'
read(*,*) x
write(*,*) 'x is equal to ', x

Exercise 4: Modify the code above to prompt the user to input x and i. Compile and run the code, and then run it on a few examples. Is the result what you expect?

Instead, one may want to read the values of x and i from a file. Suppose you create a data file called input.dat that contains x on the first line, and i on the second line. To open the file and read the two values, simply add the following section to the code:

open(11,file='input.dat')
read(11,*) x
read(11,*) i
close(11)

Note that the information in the file must match what the code expects (i.e. it must contain a real number on the first line, and an integer on the second).

Exercise 5: Modify the code above to read x and i from a file instead of the prompt. Compile and run the code. Is the result what you expect? Then switch the two lines in the input file, and re-run the code. What happens then?

4 Do loops

Suppose you now want to create a code that repeats very similar (but not necessarily identical) instructions many times. Examples of this would be, for instance, calculating successive numbers in the Fibonacci sequence, or evaluating the same function f(x) for many different values of x. A good way of doing this is through the use of do loops. A do loop repeats a set of instructions for a set number of iterations, where the only thing that differs in each repeated set is the value of the iteration number. The do loop structure is

do iter = startiter,enditer
   instruction 1
   instruction 2
   ...
enddo

Here iter is an integer that will be varied in increments of 1 from startiter to enditer. The following program evaluates the first 10 numbers of a geometric sequence:

program geometric
implicit none
integer :: iter
real :: a0, r
write(*,*) 'What is the value of a0?'
read(*,*) a0
write(*,*) 'What is the value of r?'
read(*,*) r
do iter=1,10
   write(*,*) iter, a0
   a0 = a0 * r
enddo
end program geometric

Exercise 6: Write, compile and run this code. Are the results what you expect? How would you modify it to calculate the values of an arithmetic sequence? How would you modify it to prompt the user to tell the code how many iterations to run? How would you modify it to print the results to a file instead of the screen? Do all of these modifications, compile the code and run it. Are the results what you expect?

5 Functions

Functions in Fortran have the same purpose and act in very much the same way as normal mathematical functions: they take in a number of arguments, and return the quantity that is the result of applying the function to its arguments. Note that the quantity returned can be of any data type, and can be a number, a character, a vector, a matrix, etc. There are three ways of writing a function: it can be embedded in the original program (in which case, it can only be called from that original program), it can be added after the original program, or it can be put in a separate file (in which case different programs can call the same function). To understand the difference between the various cases, imagine that we want to write a code that produces a file, suitable for plotting, which contains in the first column values of x in a given interval, and in the second column the corresponding values of cos(x).

Example 1: This first example contains the function as part of the original program:

program plotfunction
implicit none
integer :: i
real :: x
real, parameter :: xmin = 0., xmax = 10., a = -2.

open(10,file='myplot.dat')
do i = 1,100
   x = xmin + xmax*(i-1.0)/(100.-1.0)
   write(10,*) x, f(x)
enddo
close(10)

contains

function f(x)
implicit none
real :: f,x
f = cos(x+a)
end function f

end program plotfunction

A few things to note here:

xmin, xmax and a are defined as parameters of the original program. These are variables that are not meant to ever be changed by any operation in the program. Their values are fixed once and for all by the declaration statement.

Note how we did not declare the type of f in the body of the calling program. This is actually done within the function.

Exercise 7: Write, compile and run this code. Plot the results using gnuplot, or any visualization routine of your choice. Is the result what you expect? Now change the value of xmax, recompile and re-run the code. Look at the results: are they what you expect? Do the same but this time change the value of a. Are the results what you expect?

Example 2: We now consider the alternative program in which f(x) is appended in the same file after the end of the program. Your code should now look like this:

program plotfunction
implicit none
integer :: i
real :: x
real, parameter :: xmin = 0., xmax = 10., a = -2.
open(10,file='myplot.dat')
do i = 1,100
  x = xmin + xmax*(i-1.0)/(100.-1.0)
  write(10,*) x, f(x)
enddo
close(10)
end program plotfunction

function f(x)
implicit none
real :: f,x
f = cos(x+a)
end function f

Exercise 8: Modify your code from Exercise 7 to append f(x) at the end of the original program, as shown above. Compile it. What happens? How would you correct the problem?

This example illustrates that when the function is written outside of the original program:

The program needs to be notified of the data type of the quantity returned by the function (here, f).

The function needs to be notified of the type of all the variables it contains (here, f, x and a).

Exercise 9: Correct your code from Exercise 8 accordingly, until it compiles correctly. Plot the results using gnuplot, or any visualization routine of your choice. Is the result what you expect? Now change the value of xmax, recompile and re-run the code. Look at the results: are they what you expect? Do the same but this time change the value of a. Are the results what you expect?

In this example, the function f does not know what the value of a is. Because of this, it usually (but not always, depending on the compiler) just sets this unknown value to 0. To correct the problem, a must also be passed as an argument of the function.

Exercise 10: Correct your code from Exercise 9 accordingly. Run it a few times with different values of a. Does it now behave as it should?

Example 3: Finally, you can also take the last example and put the function in an entirely different file instead of appending it to the end of the program. To do so, simply copy and paste the function into a file called, say, fcosx.f90. To compile the code with the program and the function in different files, simply type: gfortran myplot.f90 fcosx.f90 -o myplot. The compiler will then take any program and function that it finds in the two listed files and attempt to link them to one another. If successful, it will generate the executable called myplot.

Exercise 11: Move the function to a separate file, recompile as suggested, and re-run the program. Does it behave as expected?

The advantage of the last form is that you can now call the same function from an entirely different program, simply by adding the file fcosx.f90 to the list of files that the compiler of the new code must link.
6 Arrays

In Fortran one can easily construct and manipulate vectors, matrices, and higher-dimensional arrays. The use of vectors/arrays is, at least superficially, very intuitive. By default, the range of indices of a vector is between 1 and the vector dimension (and similarly for matrices). So, in order to access the third component of a vector v we write v(3), and to access the coefficient in the second line,

first column of a two-by-two matrix A, we simply write A(2,1). Suppose for instance that we want to create two square matrices (one super-diagonal and one sub-diagonal, for simplicity) and add them together. The following program illustrates how one would declare, create, and add the matrices.

program addmats
implicit none
integer, parameter :: dimmat = 3
real, dimension(dimmat,dimmat) :: a,b,c
integer :: i,j
! This creates the matrices.
a(1,2) = 2.0
do i=2,dimmat-1
  a(i,i+1) = 2.0
  b(i,i-1) = 1.0
enddo
b(dimmat,dimmat-1) = 1.0
! This adds the matrices a and b
do i=1,dimmat
  do j=1,dimmat
    c(i,j) = a(i,j)+b(i,j)
  enddo
enddo
! This prints c
write(*,*) c
end program addmats

Note: The arrays were defined to be arrays through the declaration statement dimension followed by the dimensions of the array. This is one way of doing it, called static allocation, in which the size of the array is predetermined right at the beginning of the program. This used to be the old Fortran way of doing it, and is still a very useful method for simple problems. Later on, we will learn about dynamic array allocation, in which one can create arrays on the fly.

Note how comments have been added to the program to make it more readable. This is done with the exclamation mark: everything on the line after the exclamation mark is ignored by the compiler.

Exercise 12: Write, compile and run this program. What do you notice?

The program as written has two issues: the first is that all the coefficients of c end up written on a single line, which is confusing, and second, some of them have values that are clearly gibberish.

To understand and correct the first problem, note that Fortran actually stores a matrix in column-major order, meaning that it stores, one after the other, first all the elements of the first column of the matrix, then all the elements of the next column, and so forth. So the matrix c, which should be equal to

c = a + b = ( 0 2 0 )   ( 0 0 0 )   ( 0 2 0 )
            ( 0 0 2 ) + ( 1 0 0 ) = ( 1 0 2 )
            ( 0 0 0 )   ( 0 1 0 )   ( 0 1 0 )

in the program above, is stored column by column, and therefore is (or rather, should be) returned to the screen as

0.0 1.0 0.0 2.0 0.0 1.0 0.0 2.0 0.0

To print it out in a more readable form, one can use what is called an implied do list in the write statement (a nice feature of modern Fortran):

do i=1,dimmat
  write(*,*) ( c(i,j), j=1,dimmat )
enddo

Exercise 13: Replace the relevant lines of code in the previous program by these ones, re-compile and re-run it. Has this corrected the first problem?

Let's now deal with the second issue. Clearly, some of the elements are returned correctly, and some are not. The reason for this is that numbers in Fortran are not necessarily zero upon starting the program! One should never assume that they are. To correct the problem, we then either have to zero the matrices by hand just after entering the program, or we can explicitly enter all the coefficients of a and b, including all of the ones that should be zero.

Exercise 14: Correct the program above using either of the two methods described. Recompile and re-run it. Does it now give the correct answer?

At this point, it is worth mentioning that there are actually much more elegant ways of doing the same things, using intrinsic array manipulation routines in Fortran. For instance, it is possible to multiply a matrix a by a scalar s simply with the command s*a. It is also possible to add the matrices a and b simply with the command a+b.
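As a small illustration of these whole-array commands, here is a sketch of my own (the program name and values are illustrative, not from the notes); note how reshape fills the matrices column by column, consistent with the column-major storage described above:

```fortran
program arrayops
implicit none
real, dimension(2,2) :: a, b
real :: s
s = 2.0
! reshape fills column by column: a = [1 2; 3 4], b = [5 6; 7 8]
a = reshape( (/1., 3., 2., 4./), (/2,2/) )
b = reshape( (/5., 7., 6., 8./), (/2,2/) )
write(*,*) s*a   ! scalar multiplication: each coefficient is s*a(i,j)
write(*,*) a+b   ! matrix addition: each coefficient is a(i,j)+b(i,j)
end program arrayops
```

Both operations act coefficient by coefficient, which is exactly what the hand-written double do loop above does.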
This is often a much faster way of manipulating arrays, since good compilers will optimize the array operations in a way that is difficult/cumbersome to do by hand.

Exercise 15: Write a separate program that makes use of the scalar multiplication and matrix addition commands to do the same thing as before. Compile and run it to check that it gives the same result as your original code. Now crank up the dimension of the array dimmat to a very large value (10,000 or more), comment out the parts where you write the result out to the screen, recompile both codes, and run each of them by adding the command time in front of your execute command. How long does it take for each of the two codes to do the same thing?

As you can see, the intrinsic functions are significantly more efficient than writing out the commands by hand, so use them as much as possible. To understand the origin of the difference, see the course on High-Performance Computing.

Warning: Note that the compiler interprets these intrinsic commands as applying to the coefficients, not to the matrices. So the command a+b creates the matrix whose coefficients are a(i,j) + b(i,j). This is fine, since this is how matrix addition/subtraction works. However, the similar command a*b will create the matrix whose coefficients are a(i,j)*b(i,j). This is not the result of a normal matrix multiplication. Similarly, the command a/b creates the matrix whose coefficients are a(i,j)/b(i,j), which is not the result of a matrix inversion of b followed by a multiplication with a. To multiply two matrices, there is an intrinsic Fortran 90 function called matmul, which works simply by writing c = matmul(a,b) (this creates the matrix c as the matrix product of a and b). No similar intrinsic function exists for matrix inversions, although many libraries exist that can be of help. See later for this.

7 Subroutines

Subroutines are essentially sub-programs. While functions are usually used to perform rather simple operations (though this doesn't always need to be true), subroutines commonly have hundreds or thousands of lines. They do not have to return an argument, as in the case of functions, but on the other hand they can return many arguments if needed.
The standard structure of a subroutine is

subroutine nameofroutine(arg1,arg2,arg3,...,argn)
implicit none
declarations
instructions
end subroutine nameofroutine

Among the arguments of the routine, there could be input arguments, or output arguments, or mixed ones (that is, arguments that provide the routine with information on input, and are changed on output). As in the case of functions, the routine can either be contained in the calling program, or written separately (either in the same file, after the body of the program, or in a separate file). If the routine is contained within the program, then it knows about any variable that is being used by the calling program. If on the other hand the routine is not contained in the program, then any variable it needs must be passed as an argument.

Remarkably, routines can also take other routines, or other functions, as arguments. This is very useful if we want to create, for instance, a routine that uses the bisection method to find out if a user-supplied function has a zero in a given interval. Here is the code for the routine. It takes as arguments the interval over which we want to search for roots, the name of the function we want to test, the number of iterations to perform, and the value of the solution returned, if it exists.

subroutine bisect(xmin,xmax,func,sol,iter,err)
implicit none
! On entry this routine must be provided with the interval [xmin,xmax]
! over which to search for a solution,
! the name of the function func to search,
! and the number of iterations iter to apply.
! On exit this routine returns the solution in sol, and its error in err.
real :: xmin,xmax,sol,func
integer :: iter
real :: x1,x2,res,err
integer :: i
external func
x1=xmin
x2=xmax
! Check if there could be a unique solution in the interval.
res=func(x1)*func(x2)
if(res.gt.0.0) then
  write(*,*) 'There is either no solution or more than one solution in this interval.'
  write(*,*) 'Try again with a different interval.'
endif
do i=1,iter
  sol=(x1+x2)/2.  ! Calculate the mid-point of the interval
  res=func(sol)*func(x2)
  if(res.gt.0.0) then
    x2 = sol  ! The solution is between x1 and sol, shrink x2
  else
    x1 = sol  ! The solution is between sol and x2, increase x1
  endif
enddo
sol = (x1+x2)/2.
err = (x2-x1)/2.
end subroutine bisect

Note the rather self-explanatory use of the if, then, else statements. Also note the declaration of func as an external function. This is needed both in the routine and in the calling program (see below) if func is passed as an argument of the routine. Suppose we now want to use this routine with the function cosine created and written out in the separate file fcosx.f90; we could use the driver program

program solbybisection
implicit none
real, parameter :: xmin=0.0, xmax=2.0
integer, parameter :: iter = 10
real :: sol,err,fcosx
external fcosx
call bisect(xmin,xmax,fcosx,sol,iter,err)
write(*,*) sol,err
end program solbybisection

Exercise 16: Create the three files containing the calling program, the function fcosx (if this wasn't done earlier) and the subroutine. Compile them all together, as shown earlier, and run the program. In this particular example, the solution should be π/2. Is it? Try other functions and other intervals. Correct the code as appropriate if things don't go as expected.

Beginner Fortran 90 tutorial (part 2)

1 Modules

Modules were introduced in Fortran 90, and greatly help with the organization of a single program, and with linking routines across many different programs. They are very versatile, and have many different uses. In this section, we will learn a few of them. First, note that a module should be viewed as a library. That library can contain different things, such as a list of important universal constants (if you are writing a program for physics computations, for instance), or a list of functions or subroutines that you often use, or a list of variables that are common to many routines in a single program and need to be shared by all these routines. The structure of a module is usually of the form:

module nameofmodule
declarations of variables
contains
list of functions and routines
end module nameofmodule

Any program, function or subroutine that needs to access a particular module usually has, just after its first line (e.g. program thisprog, or subroutine myroutine), the statement use nameofmodule. Here is an example of a module that contains various important constants, and then of a program that uses them.

module mathconsts
implicit none
real, parameter :: pi = acos(-1.0)
real, parameter :: e = exp(1.0)
end module mathconsts

program sillyprog
use mathconsts
implicit none
real :: x
x = cos(pi)
write(*,*) 'The cosine of pi is', x
x = log(e)
write(*,*) 'The natural log of e is', x
end program sillyprog

The advantage of doing this, rather than declaring π and e in the program sillyprog, is that we can from now on call this module from any program, routine or function we ever write.

Exercise 1: Save the module in a file called mathconsts.f90 and the program in a file called sillyprog.f90. To compile the module with the program, simply type the command gfortran mathconsts.f90 sillyprog.f90 -o test.exe. Run the program. Does it behave as you expected?

Modules, as described earlier, can also be used to create a library of the routines and functions you may commonly use. In fact, one of the goals of the first part of this course will be to create a linear algebra module that contains all of your linear algebra routines. Next, we will start creating that module, and study the advantages of putting functions and routines in a module rather than keeping them separate.

2 Allocatable arrays

In the section on arrays last week we learned about arrays, and defined them in a static way at the start of the program. This means that some memory space has to be reserved at the beginning of the program for the array, and that memory space cannot be modified later. This is usually not a problem if the task at hand is very predictable, e.g. if the program always has to deal with matrices of a predictable size. However, consider an example in which you may want to do some data processing (e.g. multiple images, or time-signals), but the data in question varies a lot in size. With static allocation, the only way to deal with varying size is to reserve upfront the largest possible memory space (to accommodate any possible dataset size) and later to only use a part of it if the problem is smaller than this maximum size.
This method is quite wasteful, and often requires a lot of inelegant commands to deal with the difference between the actual array size and the reserved array size. This used to be the standard in Fortran 77. From Fortran 90 onward, it has

become possible to allocate arrays on the fly, that is, even after the program has started. To do so, the arrays have to be declared at the beginning of the program as allocatable. Once the desired array size is known, we can then create the array using the command allocate. Once we're done using it, we release the memory space using deallocate. Here is an example, in which the program first reads the array size, then creates the array, then reads it in, then calculates its trace, then deallocates the memory space. The read-and-allocate routine and the trace function are both stored in a module.

program firstlinalprog
use LinAl
real, dimension(:,:), allocatable :: mat
real :: x
character*100 filename
if(iargc().ne.1) then
  write(*,*) 'Wrong number of arguments (need file name)'
  stop
endif
call getarg(1,filename)
call readandallocatemat(mat,filename)
x = trace(mat)
write(*,*) 'The trace of this matrix is', x
deallocate(mat)
end program firstlinalprog

module LinAl
implicit none
integer :: nsize,msize
integer :: i,j
contains

subroutine readandallocatemat(mat,filename)
character*100 filename
real, dimension(:,:), allocatable :: mat
open(10,file=filename)
read(10,*) nsize,msize

allocate(mat(nsize,msize))
do i=1,nsize
  read(10,*) ( mat(i,j), j=1,msize )
  write(*,*) ( mat(i,j), j=1,msize )
enddo
close(10)
end subroutine readandallocatemat

real function trace(mat)
real, dimension(nsize,msize) :: mat
trace = 0.
if(nsize.ne.msize) then
  write(*,*) 'This is not a square matrix, cannot calculate trace'
else
  do i=1,nsize
    trace = trace + mat(i,i)
  enddo
end if
end function trace

end module LinAl

Exercise 2: Create a file that contains, in the first line, 2 integers (separated by a tab) that will be the number of lines and the number of columns in the matrix, and then write the matrix line by line. Separate each element by a tab. Call this, for instance, mymatrix.dat. Then compile and run this program as usual. What happens? Note how in this case the program actually expects an argument right after calling the executable. To provide it (supposing your executable is called myprog), type ./myprog mymatrix.dat. What happens then?

To understand this program step by step, note that:

This time the executable expects an argument. This is handled by the iargc() command in the main program. That command counts the number of arguments given to the main program. The lines from if(iargc().ne.1) to endif simply say that if the number of arguments given to the executable is not 1 (which is the expected number), then the program has to stop. Otherwise, the next command, getarg, reads in the argument, which is the name of the file to use. We shall use more of this later on, and learn the subtleties of the getarg command.

The program itself is very short! Most of the action happens in the module.

The module itself is written in such a way that there are many variables global to all the functions and routines of the module. Note how nsize and msize are declared before the contains statement, and are therefore implicitly known by all the functions and routines of the module. The same holds for i and j. This avoids having to pass them back and forth between the program and the subprograms, and having to redeclare them every time.

The array mat doesn't really exist until it has been allocated. In the main program, it is merely a pointer to a position in memory space. It is only in the subroutine that this pointer is actually assigned an address, together with all the bits afterwards that are needed to contain the whole array.

Exercise 3: Create a new routine in the module LinAl that finds and returns the largest element (in absolute value) of the matrix, as well as its position. Call that routine from the main program, and write a statement to the screen about that element.

Exercise 4: Now modify your input matrix to have a different dimension. Check that the program still works fine.

Beginner Fortran 90 tutorial (part 3)

1 Canned routines

Very often, you will find yourself in a situation where you want to use a routine or a function created by another person. To do so, you will have to find this routine on the web, or in a book, and understand it well enough to embed it in your own program. A good source for such routines is the Numerical Recipes library. You can either copy the routines from the book itself, or from an online website. The key to using such canned routines is to understand:

Exactly what they do.

How to call them: what parameters do they need, and what items do they return?

What the dependencies of the routines are (i.e. are they standalone, or do they need other routines or modules to function).

How stable/reliable they are (i.e. under which circumstances are they expected to work, and when are they expected not to work).

In a good routine source, you should find all of this information in the documentation provided. If not, you'll have to go digging into the program to understand what it does. Finally, once you understand how the routine works, you should test it on a case for which you know the answer before applying it to problems for which you do not know the answer. In this section, we will learn to use and test the gaussj.f90 routine of Numerical Recipes to find the inverse of a matrix A.

Exercise 1: Either copy the routine from the book, or find it online. Read the instructions very carefully and try to answer the following questions:

1. What does this routine do?

2. What are the input and output parameters required in the calling sequence? How are these related to the linear algebra problem you are interested in solving?

3. What subroutines/functions does it call, and what modules does it use?

The answers to all of these questions are not trivial. Carefully read the book and related material on gaussj.f90.

Exercise 2: Download or copy all of the additional modules and subroutines you need, and put them in the same directory as gaussj.f90. Modify the program from the last homework to call gaussj.f90 instead of your own Gauss-Jordan elimination routine. Make sure you give the program a different name (you will need to keep your old program as a point of comparison). Create a Makefile to compile your new program, gaussj.f90, as well as all of the required dependencies. Try to compile it. Correct whatever errors come up, and repeat as necessary until you have a program that does compile.

Exercise 3: Run your old program and the new program using the same input matrix files for A (note that you will have to give gaussj.f90 a B matrix; just give it any vector of your choice). In each case, print the inverse of A to the screen (note that you need to make sure you understand where and how the solution is stored). Do you obtain the same answer? If so, hurray (you're done). If not, could the error be due to partial vs. full pivoting (i.e. does it look like a small truncation error or is it more fundamental)? If the latter, debug and repeat as necessary!

This illustrates that it can sometimes take longer to understand how someone else's code works than to write one yourself. However, well-documented and well-tested publicly available routines can be very useful!

2 Routines available in a pre-compiled library

In the example given earlier, the entire routine gaussj.f90 was available for you to read, modify if desired, etc., and the way in which it is included in the program is very similar to the way in which you would include one of your own routines in the program. Routines coming from commercial libraries, on the other hand, rarely work like that.
Usually, these libraries are pre-compiled for your particular computer. You just have to call the routine from your own program, and link the library to the program using a compiler command. The main differences from what we did earlier are that (1) you sometimes don't even get to see the code for the routine you're using, and (2) you don't link your program to the actual routine, and the routine is not in the same directory as the main code. Instead, you link your program to a library, and that library is usually held centrally somewhere on the computer you are using. This way, you do not have to copy the entire library into your working directory every time you want to create a new code that uses it.

In what follows, we will learn to search for, use, and link routines from the LAPACK (and BLAS) libraries. Both are very commonly used, highly optimized collections of routines for linear algebra problems.

Exercise 4: Search the LAPACK library for a routine that will return the inverse of a matrix. Be careful to find one that works for any real matrix (for your own education, look at what other options are available for more specific kinds of matrices). What do you notice?

Exercise 5: It is quite clear from the documentation that this routine is not a standalone routine: instead, it takes the output from another routine to calculate the inverse of A. Algorithmically, this means that we will have to have our program call sgetrf first, and then sgetri. Read the documentation, and try to answer questions 1-3 of the previous section for both of the routines.

Exercise 6: Modify the program used earlier to use the LAPACK routines instead. This time, you can ignore the matrix B. As before, have the program print the resulting inverse matrix to the screen.

Once the program is written comes the tricky part of linking it to the libraries. Take a deep breath, because this can unfortunately be a very frustrating process the first time around. Typically, the compiler command will look something like:

gfortran (the usual list of all the code files you need) (a command that tells the compiler where the library is) -llapack -lrefblas

Suppose that your code only uses your normal LinAl.f90 module, as well as a program called testlapackinverse.f90. On the grape cluster, the libraries are located in the directory /scratch/lapack-3.5.0/.
The correct compile command would then be (all on one line):

gfortran LinAl.f90 testlapackinverse.f90 -L/scratch/lapack-3.5.0/ -llapack -lrefblas -o name.exe

To put it in a Makefile instead, you could add:

testlapack:
	gfortran LinAl.f90 testlapackinverse.f90 -L/scratch/lapack-3.5.0/ -llapack -lrefblas -o mytest.exe

making sure the whitespace in front of the command is a tab, not spaces. You can then compile the code with the simple command make testlapack. Now this will work fine if you are on grape. If you are on another machine, you will first have to find out where the LAPACK and BLAS libraries are (note that they are usually stored together, because the BLAS library is needed to

compile the LAPACK one). You may even have to install them yourself. To do so, don't panic just yet: go to the LAPACK website and try to follow the instructions. I'm afraid I have no idea how to install these libraries on a Windows system (so if that's your only option you are now allowed to panic), but it is pretty straightforward to do on a Unix or Mac system. Just make sure you write down where you install them (anywhere is fine, I think), because the lapack directory the installation creates is the one you will have to refer to in the compiler commands above. Note that you have to install the BLAS libraries first, and then the LAPACK ones. The installation, if successful, should create two files in the directory lapack-3.5.0: the file librefblas.a and the file liblapack.a. These two files are the two compiled libraries you need!

Exercise 7: Modify your Makefile as described above to compile the new program. If it compiles, hurray; otherwise, come and see me. Once it does eventually compile, compare the output of your new code with the one from the two other codes you created. Is the answer what you hoped for?

You are now an expert in using canned routines and commercial libraries! You are ready to start the homework!
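For reference, the sgetrf/sgetri calling sequence discussed in Exercise 5 might look roughly like the sketch below. This is my own illustration, not the assignment solution: the program name, the hard-coded dimension, and the diagonal test matrix are all choices made here for simplicity (sgetrf performs the LU factorization with partial pivoting; sgetri then uses that factorization to overwrite the matrix with its inverse).

```fortran
program lapackinverse
implicit none
integer, parameter :: n = 3
real :: a(n,n), work(n)
integer :: ipiv(n), info
! A diagonal test matrix whose inverse we know: diag(2, 4, 8).
a = reshape( (/2.,0.,0., 0.,4.,0., 0.,0.,8./), (/n,n/) )
call sgetrf(n, n, a, n, ipiv, info)        ! LU factorization, pivots in ipiv
if (info /= 0) stop 'sgetrf failed'
call sgetri(n, a, n, ipiv, work, n, info)  ! a is overwritten with its inverse
if (info /= 0) stop 'sgetri failed'
write(*,*) a   ! should be diag(0.5, 0.25, 0.125)
end program lapackinverse
```

Testing on a matrix with a known inverse, as here, is exactly the "test it on a case for which you know the answer" advice from the previous section.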

Chapter 2

Systems of Linear Equations

There are many relationships in nature that are linear, meaning that effects can be described in matrix-vector notation, as a linear system of equations

Ax = b,  (2.1)

where A is an m × n matrix, x is an n-vector, and b is an m-vector. This form of linear equation tells us that if we know the causes x then we can predict the resulting effects b. More interestingly, what we often want to know is the reverse of the process: if we know the effect b, then we would like to be able to determine the corresponding cause vector x. Numerical methods for accomplishing this task are our primary goal in numerical linear algebra. What about when relationships are nonlinear? In this case, we often seek an approximate solution that is locally linear, whereby we make use of the linear theory.

One can identify three different types of linear problems:

Solutions of well-posed linear systems Ax = b, with A an n × n square matrix, and x and b n-vectors. This is the topic of this chapter.

Approximate solutions of overdetermined linear systems Ax ≈ b, with A an m × n matrix (m > n), x an n-vector, and b an m-vector. This is the topic of Chapter 3.

Eigenvalue and eigenvector problems Ax = λx. This is the topic of Chapter 4.

Studying systems of linear equations is not only very important in its own right mathematically, but also crucial in computing various types of discrete solutions of ODEs and PDEs. Learning stable, accurate, fast and efficient numerical algorithms for linear algebra will therefore be a fundamental resource in both pure and applied mathematics.

The goal will be for you to develop a set of useful routines for solving a wide range of linear algebra problems. Let's begin by reviewing basic concepts of vectors, matrices, and their relations.

2.1 Review of Basic Linear Algebra

2.1.1 Existence and uniqueness

An n × n matrix A is said to be nonsingular if it satisfies any one of the following equivalent conditions:

1. A has an inverse A^{-1} such that AA^{-1} = I, where I is an identity matrix.

2. det(A) ≠ 0.

3. rank(A) = n (the rank of a matrix is the maximum number of linearly independent rows or columns it contains).

4. For any nonzero vector x ≠ 0, Ax ≠ 0 (i.e., A does not annihilate any nonzero vector).

Otherwise, the matrix is said to be singular. For a given square matrix A and vector b, the possibilities for the solution x are summarized as follows:

A unique solution x = A^{-1}b if A is nonsingular and b is arbitrary.

Infinitely many solutions if A is singular and b ∈ span(A) = {Ax : x ∈ R^n} (why? Hint: if Az = 0 for some nonzero z, then A(x + γz) = b for any scalar γ).

No solution if A is singular and b ∉ span(A).

Definition: The p-norm (or l_p-norm) of an n-vector x is defined by

\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.  (2.2)

Important special cases are:

1-norm: \|x\|_1 = \sum_{i=1}^{n} |x_i|  (2.3)

2-norm: \|x\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2}  (2.4)

∞-norm (or max-norm): \|x\|_\infty = \max_{1 \le i \le n} |x_i|  (2.5)

61 6 Figure. Illustrations of unit circle, x =, in three different norms: -norm, 2-norm and -norm. Example: For the vector x = (.6,.2) T, we get x = 2.8, x 2 = 2.0, x =.6. (2.6) Definition: The matrix p-norm of m n matrix A can be defined by A p = max x 0 Ax p x p. (2.7) Some matrix norms are easier to compute than others, for example, -norm: -norm: A = max j A = max i m a ij (2.8) i= n a ij (2.9) j= Definition: The condition number of a nonsingular square matrix A with respect to a given matrix norm is defined to be By convention, cond(a) = if A is singular. cond(a) = A A (2.0) The following important properties of the condition number are easily derived from the definition and hold for any norm:. For any matrix A, cond(a). 2. For the identity matrix, cond(i) =.

Figure 2. The norm equivalence theorem indicates that any given norm on a finite-dimensional vector space can be scaled to be bounded by a different choice of norm.

3. For any matrix A and nonzero scalar γ, cond(γA) = cond(A).

4. For any diagonal matrix D = diag(d_{ii}), cond(D) = \max_i |d_{ii}| / \min_i |d_{ii}|.

Remark: The condition number is a measure of how close a matrix is to being singular: a matrix with a large condition number is nearly singular, whereas a matrix with a condition number close to 1 is far from being singular.

Remark: Notice that the determinant of a matrix is not a good indicator of near-singularity. In other words, the magnitude of det(A) carries no information on how close to singular the matrix A may be. For example, det(αI_n) = α^n. If |α| < 1 the determinant can be very small, yet the matrix αI_n is perfectly well-conditioned for any nonzero α.

Remark: The usefulness of the condition number lies in assessing the accuracy of solutions to a linear system. However, the calculation of the condition number is not trivial, as it involves the inverse of the matrix. Therefore, in practice, one often seeks a good way to estimate the condition number instead.
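To make the definition concrete, here is a small worked example (the matrix is chosen here for illustration; it is not from the notes), using the ∞-norm formula (2.9):

```latex
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad
A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} 4 & -2 \\ -3 & 1 \end{pmatrix}
       = \begin{pmatrix} -2 & 1 \\ 3/2 & -1/2 \end{pmatrix},
```

so that \|A\|_\infty = \max\{1+2,\; 3+4\} = 7 and \|A^{-1}\|_\infty = \max\{2+1,\; 3/2+1/2\} = 3, giving cond_\infty(A) = 7 · 3 = 21. By property 1 this is well above the minimum value of 1, but the matrix is still far from singular.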

63 63 2. Direct Methods for Solving Linear Systems Recall that in this chapter we are interested in solving a well-defined linear system given as Ax = b, (2.) where A is a n n square matrix and x and b are n-vectors. 2.. Invariant Transformations 2... Permutation To solve a linear system, we wish to transform the given linear system into an easier linear system where the solution x = A b remains unchanged. The answer is that we can introduce any nonsingular matrix M and multiply from the left both sides of the given linear system: MAx = Mb. (2.2) We can easily check that the solution remains the same. To see this, let z be the solution of the linear system in Eqn Then z = (MA) Mb = A M Mb = A b = x. (2.3) Example: A permutation matrix P, a square matrix having exactly one in each row and column and zeros elsewhere which is also always a nonsingular can always be multiplied without affecting the original solution to the system. For instance, P = (2.4) 0 0 permutes v as P v v 2 v 3 = v v 2 v 3 = v 3 v v 2. (2.5) Row scaling Another invariant transformation exists which is called row scaling, an outcome of a multiplication by a diagonal matrix D with nonzero diagonal entries d ii, i =,... n. In this case, we have DAx = Db, (2.6) by which each row of the transformed matrix DA gets to be scaled by d ii from the original matrix A. Note that the scaling factors are cancelled by the same scaling factors introduced on the right hand side vector, leaving the solution to the original system unchanged. Note: The column scaling does not preserve the solution in general.

64 LU factorization by Gaussian elimination Consider the following system of linear equations: x + 2x 2 + 2x 3 = 3, (2.7) 4x 2 6x 3 = 6, (2.8) x 3 =. (2.9) We know this is easily solvable since we already know x 3 =, which gives x 2 = 3, therefore recursively arriving a complete set of solution with x =. When putting these equations into a matrix-vector form, we have x x 2 x 3 = where the matrix has a form of (upper) triangular. 3 6, (2.20) Therefore, our strategy then is to devise a nonsingular linear transformation that transforms a given general linear system into a triangular linear system. This is a key idea of LU factorization (or LU decomposition) or also known as Gaussian elimination. The main idea is to find a matrix M such that the first column of M A becomes zero below the first row. The right hand side b is also multiplied by M as well. Again, we repeat this process in the next step so that we find M 2 such that the second column of M 2 M A becomes zero below the second row, along with applying the equivalent multiplication on the right hand side, M 2 M b. This process is continued for each successive column until all of the subdiagonal entries of the resulting matrix have been annihilated. If we define the final matrix M = M n M, the transformed linear system becomes M n M Ax = MAx = Mb = M n M b. (2.2) Note: As seen in the previous section, we recall that any nonsingular matrix multiplication is an invariant transformation that does not affect the solution to the given linear system. The resulting transformed linear system MAx = Mb is upper triangular which is what we want, and can be solved by back-substitution to obtain the solution to the original linear system Ax = b. Example: We illustrate Gaussian elimination by considering: 2x +x 2 +x 3 = 3, 4x +3x 2 +3x 3 +x 4 = 6, 8x +7x 2 +9x 3 +5x 4 = 0, 6x +7x 2 +9x 3 +8x 4 =. (2.22)

65 or in a matrix notation Ax = x x 2 x 3 x 4 = = b. (2.23) The first question is to find a matrix M that annihilates the subdiagonal entries of the first column of A. This can be done if we consider a matrix M that can subtract twice the first row from the second row, four times the first row from the third row, and three times the first row from the fourth row. The matrix M is then identical to the identity matrix I 4, except for those multiplication factors in the first column: M A = = , (2.24) where we treat the blank entries to be zero entries. At the same time, we proceed the corresponding multiplication on the right hand side to get: M b = (2.25) The next step would be to annihilate the third and fourth entries from the second column (3 and 4), which will give a next matrix M 2 that has the form: M 2 M A = 3 4 now with the right hand side: M 2 M b = = , (2.26). (2.27) The last matrix M 3 will complete the process, resulting an upper triangular matrix U: M 3 M 2 M A = 2 2 = 2 2 = U, (2.28)

66 66 together with the right hand side: M 3 M 2 M b = = y. (2.29) We see that the final transformed linear system MAx = Ux = y is upper triangular which is what we wanted and it can be solved easily by backsubstitution, starting from obtaining x 4 = 3, followed by x 3, x 2, and x in reverse order to find a complete solution x = (2.30) The full LU factorization A = LU can be established if we compute L = (M 3 M 2 M ) = M M 2 M 3. (2.3) At first sight this looks like an expensive process as it involves inverting a series of matrices. Surprisingly, however, this turns out to be a trivial task. The inverse of M i, i =, 2, 3 is just itself but with each entry below the diagonal negated. Therefore, we have L = M = = = M 2 M (2.32) Notice also that the matrix multiplication M M 2 M 3 is also trivial and is just the unit lower triangle matrix with the nonzero subdiagonal entries of M, M 2, and M 3 inserted in the appropriate places. All together, we finally have our decomposition A = LU: = (2.33)
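The back-substitution step used above to recover x from the upper triangular system Ux = y can be sketched in a few lines of Python. The 3×3 system below is an assumed example of our own, chosen so the arithmetic is exact:

```python
def back_substitute(U, y):
    # Solve Ux = y for upper triangular U, working from the last row upward.
    n = len(U)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

# hypothetical upper triangular system for illustration
U = [[2.0, 1.0, 1.0],
     [0.0, 1.0, 2.0],
     [0.0, 0.0, 3.0]]
y = [8.0, 5.0, 3.0]
print(back_substitute(U, y))  # [2.0, 3.0, 1.0]
```

Each unknown requires only the already-computed unknowns below it, which is why a triangular system is the natural target of the elimination process.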

67 67 Quick summary: Gaussian elimination proceeds in steps until a upper triangular matrix is obtained for back-substitution: M M M (2.34) Algorithm: LU factorization by Gaussian elimination: for k = to n #[loop over column] if a kk = 0 then stop #[stop if pivot (or divisor) is zero] endif for i = k + to n m ik = a ik /a kk #[compute multipliers for each column] endfor for j = k + to n for i = k + to n a ij = a ij m ik a kj #[transformation to remaining submatrix] endfor endfor endfor 2.3. Pivoting Need for pivoting We obviously run into trouble when the choice of a divisor called a pivot is zero, whereby the Gaussian elimination algorithm breaks down. As illustrated in Algorithm above, this situation can be easily checked and avoided so that the algorithm stops when one of the diagonal entries become singular. The solution to this singular pivot issue is almost equally straightforward: if the pivot entry is zero at state k, i.e., a kk = 0, then one interchange row k of both the matrix and the right hand side vector with some subsequent row whose entry in column k is nonzero and resume the process as usual. Recall that permutation does not alter the solution to the system.
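The LU pseudocode above can be transcribed almost line for line into Python. This sketch (our own transcription) stores the multipliers in place, below the diagonal, and is checked against the 4×4 matrix from the worked example:

```python
def lu_factor(A):
    # In-place LU factorization by Gaussian elimination (no pivoting):
    # multipliers overwrite the subdiagonal of A, U overwrites the rest.
    n = len(A)
    for k in range(n):                       # loop over columns
        if A[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot encountered")
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]               # multiplier m_ik stored in place
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j] # update remaining submatrix
    return A

A = [[2.0, 1.0, 1.0, 0.0],
     [4.0, 3.0, 3.0, 1.0],
     [8.0, 7.0, 9.0, 5.0],
     [6.0, 7.0, 9.0, 8.0]]
F = lu_factor([row[:] for row in A])         # factor a copy of A
n = len(A)
L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
U = [[F[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
```

Multiplying the unpacked factors back together recovers A exactly, and the subdiagonal of L reproduces the multipliers 2, 4, 3, then 3, 4, then 1 from the worked example.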

68 68 This row interchanging process is called pivoting, which is illustrated in the following example. Example: Pivoting with permutation matrix can be easily explained as below: P (2.35) where we interchange the second row with the fourth row using a permutation matrix P given as P =. (2.36) Note: The potential need for pivoting has nothing to do with the matrix being singular. For example, the matrix A = [ 0 0 ] (2.37) is nonsingular, yet we can t process LU factorization unless we interchange rows. On the other hand, the matrix can easily allow LU factorization A = while being singular. [ A = ] = [ [ 0 ] ] [ 0 (2.38) ] = LU, (2.39) Partial pivoting There is not only zero pivots, but also another situation we must avoid in Gaussian elimination a case with small pivots. The problem is closely related to computer s finite-precision arithmetic which fails to recover any numbers smaller than the machine precision ɛ. Recall that we have ɛ 0 7 for single precision, and ɛ 0 6 for double precision. Example: Let us now consider a matrix A defined as A = [ ɛ ], (2.40)

69

where ε < ε_mach ≈ 10^{-16}, say, ε = 10^{-20}. If we proceed without any pivoting (i.e., no row interchange) and take ε as the first pivot element, then we obtain the elimination matrix

M = \begin{bmatrix} 1 & 0 \\ -1/ε & 1 \end{bmatrix}, (2.41)

and hence the lower triangular matrix

L = \begin{bmatrix} 1 & 0 \\ 1/ε & 1 \end{bmatrix}, (2.42)

which is correct. For the upper triangular matrix, however, we see an incorrect floating-point arithmetic operation

U = \begin{bmatrix} ε & 1 \\ 0 & 1 - 1/ε \end{bmatrix} = \begin{bmatrix} ε & 1 \\ 0 & -1/ε \end{bmatrix}, (2.43)

since 1/ε >> 1, so that 1 - 1/ε rounds to -1/ε. But then we simply fail to recover the original matrix A from the factorization:

LU = \begin{bmatrix} 1 & 0 \\ 1/ε & 1 \end{bmatrix} \begin{bmatrix} ε & 1 \\ 0 & -1/ε \end{bmatrix} = \begin{bmatrix} ε & 1 \\ 1 & 0 \end{bmatrix} ≠ A. (2.44)

Using a small pivot, and a correspondingly large multiplier, has caused an unrecoverable loss of information in the transformation. We can cure the situation by interchanging the two rows first, which gives the first pivot element 1 and the resulting multiplier ε:

M = \begin{bmatrix} 1 & 0 \\ -ε & 1 \end{bmatrix}, (2.45)

and hence

L = \begin{bmatrix} 1 & 0 \\ ε & 1 \end{bmatrix} \quad and \quad U = \begin{bmatrix} 1 & 1 \\ 0 & 1 - ε \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} (2.46)

in floating-point arithmetic. We therefore recover the original relation:

LU = \begin{bmatrix} 1 & 0 \\ ε & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ ε & 1 + ε \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ ε & 1 \end{bmatrix} = A, (2.47)

which is the original matrix with its rows interchanged, i.e., the correct result after permutation.

The foregoing example is rather extreme; however, the principle holds in general: find the largest available pivot when producing each elimination matrix, which yields smaller multipliers and hence smaller errors in floating-point arithmetic. We see that this process involves repeated use of a permutation matrix P_k that interchanges rows to bring the entry of largest magnitude on or below the diagonal in column k into the diagonal pivot position.
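The small-pivot failure above can be reproduced directly in double-precision arithmetic. This is a minimal demonstration of our own, carrying out the 2×2 elimination by hand with ε = 10⁻²⁰:

```python
# Demonstrate the loss of information from a tiny pivot, using
# eps = 1e-20, far below double-precision machine epsilon (~1e-16).
eps = 1e-20

# No pivoting: factor A = [[eps, 1], [1, 1]] with eps as the pivot.
l21 = 1.0 / eps                 # huge multiplier
u22 = 1.0 - l21 * 1.0           # 1 - 1e20 rounds to -1e20: the "1" is lost
A_rebuilt = [[eps, 1.0],
             [l21 * eps, l21 * 1.0 + u22]]
print(A_rebuilt[1][1])          # 0.0, not the original 1.0

# With the rows interchanged first, the multiplier is eps and nothing is lost.
l21p = eps
u22p = 1.0 - l21p * 1.0         # rounds to exactly 1.0
A_rebuilt_p = [[1.0, 1.0],
               [l21p * 1.0, l21p * 1.0 + u22p]]
print(A_rebuilt_p[1][1])        # 1.0, as desired
```

The rebuilt (2,2) entry is exactly zero without pivoting: the rounding in u22 has irretrievably absorbed the 1, exactly as argued above.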

70 70 Quick summary: Gaussian elimination with partial pivoting proceeds as below. Assume x ik is chosen to be the maximum in magnitude among the entries in k-th column, thereby selected as a k-th pivot: x ik P x ik M x ik 0 0 In general, A becomes an upper triangular matrix U after n steps, (2.48) M n P n M P A = U. (2.49) Note: The expression in Eq can be rewritten in a way that separates the elimination and the permutation processes into two different groups so that we write the final transformed matrix as P = P n P 2 P, (2.50) L = (M n M 2M ), (2.5) PA = LU. (2.52) To do this we first need to find what M i should be. Consider reordering the operations in Eq in the form, for instance with n = 3, Rearranging operations, M 3 P 3 M 2 P 2 M P = M 3M 2M P 2 P 2 P (= L P). (2.53) M 3 P 3 M 2 P 2 M P (2.54) = (M 3 )(P 3 M 2 P 3 )(P 3P 2 M P 2 P 3 )(P 3P 2 P ) (2.55) (M 3)(M 2)(M )P 2 P 2 P, (2.56) whereby we can define M i, i =, 2, 3 equals to M i but with the subdiagonal entries permuted: M 3 = M 3 (2.57) M 2 = P 3 M 2 P 3 (2.58) M = P 3 P 2 M P 2 P 3 (2.59) We can see that the matrix M n M 2 M is unit lower triangular and hence easily invertible by negating the subdiagonal entries to obtain L.

71 Example: To see what is going on, consider A = (2.60) With partial pivoting, let s interchange the first and third rows with P : = 2 0. (2.6) The first elimination step now looks like this with left-multiplication by M : / /2 3/2 3/2 /4 2 0 = 3/4 5/4 5/4. (2.62) 3/ /4 9/4 7/4 Now the second and fourth rows are interchanged with P 2 : /2 3/2 3/2 7/4 9/4 7/4 3/4 5/4 5/4 = 3/4 5/4 5/4. 7/4 9/4 7/4 /2 3/2 3/2 (2.63) With multiplication by M 2 the second elimination step looks like: /4 9/4 7/4 7/4 9/4 7/4 3/7 3/4 5/4 5/4 = 2/7 4/7 2/7 /2 3/2 3/2 6/7 2/7 (2.64) Interchanging the third and fourth rows now with P 3 : /4 9/4 7/4 7/4 9/4 7/4 2/7 4/7 6/7 2/7. (2.65) 6/7 2/7 2/7 4/7 The final elimination step is obtained with M 3 : /4 9/4 7/4 6/7 2/7 /3 2/7 4/7 = /4 9/4 7/4 6/7 2/7 2/3 7. (2.66)

72 72 Remark: The name partial pivoting comes from the fact that only the current column is searched for a suitable pivot. A more exhausting pivoting strategy is complete pivoting, in which the entire remaining unreduced sub matrix is searched for the largest entry, which is then permuted into the diagonal pivot position. Algorithm: LU factorization by Gaussian elimination with Partial Pivoting: for k = to n #[loop over column] Find index p such that a pk a ik for k i n #[search for pivot in current column] if p k then interchange rows k and p #[interchange rows if needed] endif if a kk = 0 then continue with next k #[skip current column if zero] endif for i = k + to n m ik = a ik /a kk #[compute multipliers for each column] endfor for j = k + to n for i = k + to n a ij = a ij m ik a kj #[transformation to remaining submatrix] endfor endfor endfor 2.4. Gauss-Jordan elimination for inverse matrix We have seen in Gaussian elimination that the LU factorization transforms a general matrix into a triangular form which becomes easier to solve than the original linear system. Can we extend this transformation technique bit further so that we can possibly obtain a system that is even easier than the triangular form? The answer is yes. We notice that a diagonal linear system appears to be the next desirable target. This method is called Gauss-Jordan elimination and is a variation of standard Gaussian elimination in which the matrix is reduced to diagonal form rather than merely a triangular form.
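The partial-pivoting pseudocode above can be transcribed into Python as follows. This sketch (our own transcription) records the row interchanges in a permutation list p so that the relation PA = LU can be verified; it is checked against the 4×4 matrix from the earlier Gaussian elimination example:

```python
def lu_partial_pivot(A):
    # LU factorization with partial pivoting, following the pseudocode above.
    # Returns the packed L\U factors and the permutation p: row p[i] of the
    # original A corresponds to row i of the product LU.
    n = len(A)
    A = [row[:] for row in A]          # work on a copy
    p = list(range(n))
    for k in range(n):                 # loop over columns
        # search rows k..n-1 of the current column for the largest pivot
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        if piv != k:                   # interchange rows if needed
            A[k], A[piv] = A[piv], A[k]
            p[k], p[piv] = p[piv], p[k]
        if A[k][k] == 0.0:
            continue                   # skip current column if zero
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]         # multiplier, at most 1 in magnitude
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, p

A0 = [[2.0, 1.0, 1.0, 0.0],
      [4.0, 3.0, 3.0, 1.0],
      [8.0, 7.0, 9.0, 5.0],
      [6.0, 7.0, 9.0, 8.0]]
F, p = lu_partial_pivot(A0)
n = len(A0)
L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
U = [[F[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
```

Because the pivot is chosen as the largest-magnitude entry in its column, every multiplier in L is bounded by 1 in magnitude, which is precisely the stability benefit discussed above.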

73 Quick summary: Gauss-Jordan elimination can be illustrated as in the following pictorial steps: (2.67) (2.68) Definition: Let us introduce a form called augmented matrix of the system Ax = b which writes the n n matrix A and the n-vector b together in a new n (n + ) matrix form: [ ] A b. (2.69) The use of augmented matrix allows us to write each transformation step of the linear system (i.e., both A and b) in a compact way. Example: Consider the following system using Gauss-Jordan elimination without pivoting: x +x 2 +x 3 = 4 2x +2x 2 +5x 3 = 4x +6x 2 +8x 3 = 24, (2.70) which can be put in as an augmented matrix form: (2.7) First step is to annihilate the first column: 4 4 M , where M = Next we permute: P (2.72) 4 8, where P =. (2.73) 3

74 74 Next row scaling by multiplying a diagonal matrix D : 4 4 D , where D = /2. / (2.74) Next annihilate the remaining upper diagonal entries in the third column: M , where M 2 = Finally, annihilate the upper diagonal entry in the second column: M , where M 3 = (2.75). (2.76) In this example the right hand side is a single n-vector. What happens if we perform the same procedure using multiple n-vectors? In other words, consider now a new choice of the right hand side in the form of n n augmented matrix: [b b2 bn ]. (2.77) We see that the same operation can easily be performed simultaneously on individual b i, i n. Especially, if we choose b i = [0, ],,, 0] T with unity at ith entry, then b2 bn the collection of vectors [b actually becomes the identity matrix I. In this case we see that Gauss-Jordan elimination yields the inverse of A: [ ]] [ ] [ b2 bn A [b = A I I A ]. (2.78) Remark: Although the resulting diagonal system in GJ provides an easier way to obtain the final solution than the back-substitution in triangular form, the elimination process of GJ is much more expensive requiring about n 3 /2 multiplications and a similar number of additions, which is 50 percent more expensive than standard Gaussian elimination. Therefore, in practice, GJ is not a preferred way to compute linear systems.
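The inverse-by-Gauss-Jordan idea above can be sketched compactly in Python: augment A with the identity, reduce the left half to I, and read off A⁻¹ on the right. This is our own transcription, with partial pivoting added for stability:

```python
def gauss_jordan_inverse(A):
    # Invert A by reducing the augmented matrix [A | I] to [I | A^{-1}].
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        # partial pivoting: search rows k..n-1 only, to preserve earlier columns
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        d = M[k][k]
        M[k] = [v / d for v in M[k]]          # row scaling: pivot becomes 1
        for i in range(n):
            if i != k and M[i][k] != 0.0:
                f = M[i][k]                   # annihilate above AND below
                M[i] = [a - f * b for a, b in zip(M[i], M[k])]
    return [row[n:] for row in M]             # right half is the inverse
```

Note the one difference from plain Gaussian elimination: entries are eliminated above the pivot as well as below, which is exactly what makes the left half diagonal (here, the identity) rather than merely triangular.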

75 Remark: Then why do we learn GJ at all? Because it is straightforward, understandable, solid as a rock, and an exceptionally good psychological backup for those times that something is going wrong and you think it might be your linear-equation solver Cholesky factorization for symmetric positive definite systems Thus far we have assumed that the linear system has a general matrix and is dense, meaning that majority of the matrix entries are nonzero. On the other hand, there are some special cases we can seek for improved efficiency in both working and storing data when operating on some special matrices. Some examples of special properties of real matrix A that can be exploited include the following: Symmetric: A = A T, i.e., a ij = a ji for all i, j. Positive definite: x T Ax > 0 for all x 0. Banded: a ij = 0 for all i j > β, where β is the bandwidth of A. An important special case is a tridiagonal matrix, for which β =. Sparse: most entries of A are zero. Remark: The properties defined above for real matrices have analogues for complex matrices. In the complex case the usual matrix transpose (denoted by T ) is replaced by the conjugate transpose (denoted by H). For instance, the conjugate transpose of a complex matrix A is denoted as A H = ā ji, (2.79) where ā ji represents complex conjugate for each matrix entry. Remark: For a real matrix A, A H = A T. Definition: An analog to the real symmetric matrix in complex matrix is called Hermitian matrix if A = A H. (2.80) Definition: Similarly, a complex matrix A is called positive definite if x H Ax > 0, for all complex vector x 0. (2.8) Definition: If the matrix A symmetric and positive definite (SPD), then an LU decomposition of A indicates that U T L T = (LU) T = A T = A = LU, (2.82)

76 76 so that U = L T, that is, A = LL T, where L is lower triangular and has positive diagonal entries (but not in general, a unit diagonal). This is known as the Cholesky factorization of A. Remark: Since U = L T, Cholesky factorization is twice faster than the standard Gaussian elimination. Example: Let us begin with considering a simple 2 2 case of a symmetric positive definite matrix decomposition: This implies we have [ ] [ ] [ ] a a 2 l 0 l l = 2. (2.83) a 2 a 22 l 2 l 22 0 l 22 The algorithm of CF can be generalized as follow: Algorithm: Cholesky factorization (decomposition): l = a, (2.84) l 2 = a 2 /l, (2.85) l 22 = a 22 l2 2. (2.86) for k = to n #[loop over column] a kk = a kk for i = k + to n a ik = a ik /a kk #[scale current column] endfor for j = k + to n for i = k + to n a ij = a ij a ik a kj #[from each remaining column, # subtract multiple of current column] endfor endfor endfor Note: There is a number of facts about the CF algorithm that make it very attractive and popular for symmetric positive definite matrices: In the above Algorithm we see that the Cholesky factor L overwrites the original matrix A, without requiring a separate storage space for L.

77 The n square roots required are all of positive numbers, therefore CF is well-defined. No pivoting is required for numerical stability. Only the lower triangle of A (e.g., a, a 2, a 22 ) is accessed and hence the strict upper triangular potion (e.g., a 2 ) need not be stored. Only about n 3 /6 multiplications and a similar number of additions are required. In all, CF requires only about half as much work and half as much storage as are required for LU factorization of a general matrix by Gaussian elimination. 77 Example: To illustrate the algorithm, we compute the CF of the symmetric positive definite (SPD) matrix 3 A = 3 (2.87) 3 Step: Dividing the first column by from the first for-loop: (2.88) Step2: Second column update from the second for-loop: ( ) ( )( ) 3 ( )( ) = (2.89) We are now done with the first iteration of the outer most for-loop (i.e., k = ), and move on to the next one, k = 2. Step4: Second column scaling from the first for-loop: / / = (2.90)

78 78 Step5: Third column update from the second for-loop: ( 0.865) 2 = (2.9) We are now done with the second iteration of the outer most for-loop (i.e., k = 2), and move on to the next one, k = 3. In this case, there is nothing to be done in the second for-loop as j = k + = 4 which is beyond the size of the matrix. Step6: Third column scaling from the first for-loop: L = / 2.0 = (2.92) Short Discussion on Operation Counts of Gaussian elimination and Gauss-Jordan In general, the overall operation count of seeking for the solution X to the linear system AX = B, where A is an n m matrix, X is an m m matrix, and ] b2 bm B = [b, an n m matrix, (2.93) can be found out to be (see more details in one of our references Numerical Recipes): Gaussian elimination: O( n3 3 + n2 m 2 + n2 m 2 ) Gauss-Jordan: O(n 3 + n 2 m) Note: The above quick estimation tells us that GE is about a factor 3 advantage over GJ for small number of m << n, e.g., m =. Note: One also can see that (again, see Numerical Recipes for more logical discussion) for matrix inversion, the two methods turn out to be identical in performance.
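The Cholesky algorithm from the previous section can be transcribed into Python as below. This is our own sketch of the in-place column-oriented version; the SPD test matrix is an assumed example chosen so that the factor comes out in exact integers:

```python
import math

def cholesky(A):
    # Cholesky factorization A = L L^T for a symmetric positive definite A.
    # L overwrites the lower triangle of a working copy of A, as in the notes.
    n = len(A)
    A = [row[:] for row in A]
    for k in range(n):
        A[k][k] = math.sqrt(A[k][k])
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # scale current column
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]      # subtract multiple of column k
    # return the lower triangle as L (upper part zeroed for clarity)
    return [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]

# hypothetical SPD matrix for illustration
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
print(L)  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

Only the lower triangle of A is ever read or written, and only n square roots are taken, matching the storage and work advantages listed above.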

79 2.7. Crout s Method for LU decomposition As yet another variant of Gaussian elimination method we consider a more compact transformation method, called Crout s method, that allows to convert A to L and U directly, including an efficient storage algorithm of entries of them. Recall that the previous LU decomposition method using Gaussian elimination, with or without pivoting, does not provide both L and U simultaneously during the calculation steps. Rather we first compute the upper triangular matrix U, and we separately compute the lower triangular matrix by multiplying elimination matrices M i. The solution x is then evaluated by the back-substitution. One can say that the Cholesky factorization algorithm does provide both L and U simultaneously but it CF is only limited to a special case for symmetric positive definite systems. The Crout s algorithm can be applied for any general (dense) n n matrix A and directly decomposes A into L and U, A = LU, and hence 79 b = Ax = (LU)x = L(Ux). (2.94) This can be further broken into two successive linear systems: Ly = b, (2.95) Ux = y. (2.96) Notice that Eq can be solved by forward-substitution, which is analogous to the back-substitution in Eq The idea in Crout s method is to consider an efficient method to decompose A = LU. Putting this in a 4 4 component form for instance, l l 2 l 22 l 3 l 32 l 33 l 4 l 42 l 43 l 44 u u 2 u 3 u 4 u 22 u 23 u 24 u 33 u 34 u 44 = a a 2 a 3 a 4 a 2 a 22 a 23 a 24 a 3 a 32 a 33 a 34 a 4 a 42 a 43 a 44 (2.97) We first note that there are 6 equations and = 0 unknowns for each l ij and u ij (20 total unknowns), hence the system can t be solved. However, we can overcome this difficulty by imposing. l ii =, (2.98) as can be observed experimentally in Eq This then removes 4 unknowns from L, whereby we can easily solve for L and U, given A: l 2 l 3 l 32 l 4 l 42 l 43 u u 2 u 3 u 4 u 22 u 23 u 24 u 33 u 34 u 44 = a a 2 a 3 a 4 a 2 a 22 a 23 a 24 a 3 a 32 a 33 a 34 a 4 a 42 a 43 a 44. (2.99)
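The forward-substitution solve of Ly = b mentioned above is the mirror image of back-substitution, sweeping from the first row down. A minimal sketch, with an assumed unit lower triangular example of our own:

```python
def forward_substitute(L, b):
    # Solve Ly = b for lower triangular L, working from the first row downward.
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

# hypothetical unit lower triangular system for illustration
L = [[1.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0]]
b = [1.0, 5.0, 13.0]
print(forward_substitute(L, b))  # [1.0, 3.0, 3.0]
```

With the unit diagonal l_ii = 1 imposed by Crout's method, the division is by 1 and can even be omitted.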

80 First try without pivoting We see that we can write the relation in Eq as (n = 4 in our example) a ij = n l ik u kj = k= min(i,j) k= l ik u kj, (2.00) since { lik = 0 if k > i u kj = 0 if k > j (2.0) Evaluating first few steps, we get: Step : For i =, a j = l k u kj = u j, j. (2.02) k= This completes evaluations denoted by u j which can be simply stored in the position of corresponding a j. u u 2 u 3 u 4 a 2 a 22 a 23 a 24 a 3 a 32 a 33 a 34 a 4 a 42 a 43 a 44. (2.03) Now for i 2, j =, hence, a i = l ik u k = l i u, i 2, j, (2.04) k= l i = a i u, i 2. (2.05) This completes evaluations denoted by l i that can stored in the place of corresponding a i : u u 2 u 3 u 4 l 2 a 22 a 23 a 24 l 3 a 32 a 33 a. (2.06) 34 l 4 a 42 a 43 a 44 Step 2: Now for i = 2, a 2j = 2 l 2k u kj = l 2 u j + u 2j, j, (2.07) k=

81 giving 8 u 2j = a 2j l 2 u j, j. (2.08) This completes evaluations denoted by u 2j which can be simply stored in the position of corresponding a 2j : u u 2 u 3 u 4 l 2 u 22 u 23 u 24 l 3 a 32 a 33 a 34 l 4 a 42 a 43 a 44. (2.09) Similarly, for i 3, j = 2, we get: giving a i2 = 2 l ik u k2 = l i u 2 + l i2 u 22, i 3, (2.0) k= l i2 = a i2 l i u 2 u 22, i 3. (2.) This completes evaluations denoted by l i2 which can be simply stored in the position of corresponding a i2 : u u 2 u 3 u 4 l 2 u 22 u 23 u 24 l 3 l 32 a 33 a. (2.2) 34 l 4 l 42 a 43 a 44 Step 3: We repeat the same process until the end. We can generalize this and write Crout s algorithm without pivot as Algorithm: Crout s algorithm without pivot: for k = to n #[sweep though Step, Step2, etc.] for j = k to n u kj = a kj k m= l kmu mj #[fill out each row] #[this can be stored at a kj ] endfor for i = k + ( to n l ik = u kk a ik ) k m= l imu mk #[fill out subdiagonal entries of each column] #[this can be stored at a ik ] endfor endfor

82 Second try with pivoting The previous attempt of designing the Crout s algorithm does not facilitate to provide any pivoting, which is essential to stability and accuracy, as the order of processes (i.e., Steps) alternates rows and columns, which is not suitable for pivoting. Also, such an implementation will significantly slows down array handling in both Fortran and C because the use of indices i, j are adjacent at all. We can rectify the situation by re-ordering the operations to column-wise operation only by postponing the row-wise evaluation (i.e., no alternating column fill and row fill, but just column fill) so that the algorithm allows a partial pivoting. The modified algorithm can be written as: Algorithm: Crout s algorithm with column-wise only: for j = to n #[loop over column] for i = to j u ij = a ij i m= l imu mj #[fill out each row] #[this can be stored at a ij ] endfor for i = j + ( to n l ij = u jj a ij ) j m= l imu mj #[fill out subdiagonal entries of each column] #[this can be stored at a ij ] endfor endfor Figure 3. Column-wise operation of Crout s algorithm, delaying the row fill operation until later.
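The column-wise Crout scheme above (still without pivoting) can be sketched in Python as follows. For clarity this transcription of ours fills separate L and U arrays rather than overwriting A in place as in Fig. 3; it is checked against the 4×4 matrix from the earlier Gaussian elimination example:

```python
def crout_columnwise(A):
    # Compact LU decomposition, one column at a time, without pivoting.
    # L carries the unit diagonal; U is upper triangular; A = L U directly.
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):                   # loop over columns
        for i in range(j + 1):           # entries of U in column j
            U[i][j] = A[i][j] - sum(L[i][m] * U[m][j] for m in range(i))
        for i in range(j + 1, n):        # subdiagonal entries of L in column j
            L[i][j] = (A[i][j] - sum(L[i][m] * U[m][j] for m in range(j))) / U[j][j]
    return L, U

A = [[2.0, 1.0, 1.0, 0.0],
     [4.0, 3.0, 3.0, 1.0],
     [8.0, 7.0, 9.0, 5.0],
     [6.0, 7.0, 9.0, 8.0]]
L, U = crout_columnwise(A)
```

Each entry of L and U is produced exactly once, directly from A, which is the "compact" property that distinguishes Crout's method from elimination by successive M_k matrices.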

83 83 The Crout s algorithm with partial pivoting can be written as: Algorithm: Crout s algorithm with implicit pivoting: for i = to n scale(i) = max j n a ij #[loop over rows to get max] endfor for j = to n #[loop over column] for i = to j u kj = a kj k m= l kmu mj #[compute all u ij except for u jj # which must be selected by pivoting] endfor for i = j to n l ij = a ij j m= l imu mk #[partial evaluation of l ij omitting division by u jj. # The largest of these will be the pivot u jj. # Note that this formula is correct for u jj ] l pivot = max ij j i n scale(i) set i pivot = i for row index i that contains pivot endfor if j i pivot then interchange row j with row i pivot u jj = pivot #[switch rows if max pivot is found] record the switch for RHS endif for i = j + to n l ij = l ij u jj #[divide by u jj to complete the l ij calculation] endif Remark: This algorithm uses implicit pivoting where each equation is first scaled by its largest entry, then the Crout s algorithm performed.

84 84 The Crout s algorithm with partial pivoting can be written as: Algorithm: Crout s algorithm with partial pivoting: for j = to n #[loop over column] for i = to j a ij = a ij i m= l imu mj i pivot = i pivot = a j,j #[compute all u ij except for u jj # which must be selected by pivoting] #[The summation is zero if i < ] endfor for i = j + to n a ij = a ij j m= l imu mj #[partial evaluation of l ij omitting division by u jj. # The largest of these will be the pivot u jj. # Note that this formula is correct for u jj ] #[The summation is zero if j < ] if pivot < a ij then pivot = a i,j i pivot = i #[Record a new max] endif endfor if j i pivot then interchange row j with row i pivot #[switch rows if max pivot is found] record the switch for RHS endif for i = j + to n a ij = a ij a jj #[divide by u jj to complete the l ij calculation] endif Remark: This algorithm overwrites entries of L and U to A as shown in Fig. 3. Remark: Notice that this type of pivoting is what we used in the Gaussian elimination. The resulting matrix A holds the elements of both the lower and the upper triangular matrices, arranged as in Fig. 3. Note also that the product of the resulting lower and upper matrices LU looks slightly different from the original matrix A because of the row swappings from the partial pivoting.

However, LU is equivalent to the original A except for the corresponding swaps, or permutations, and the successive forward and backward substitutions should give the correct solution x.

86 Chapter 3 Linear Least Squares. Linear Least Square Problems In the previous chapter, we focused on solving well-defined linear system defined by n number of linear equations for n unknowns. This can be put into a compact matrix-vector form of Ax = b with A an n n square matrix of coefficients, n unknown quantities x and n known quantities b. As already studied, the primary idea we adopted was to take the direct methods to seek for exact solutions to such well-defined linear systems, because a square linear system is exactly determined, provided the matrix is nonsingular. In this chapter, we further extend our interests to more generalized problems defined by so called an overdetermined system a system with larger numbers of m equations than n unknowns with m > n. As opposed to the previous goal to find exact solutions in Chapter 2, we find there is no particular virtue in seeking for exact solutions in the overdetermined system. Rather we now want to find approximate solutions to overdetermined linear systems which is defined by Ax = b with A a m n matrix (m > n), x is an n-vector, and b is an m-vector. One of the most popular and computationally convenient approaches to obtain approximate solutions is the method of least squares, which we will study this chapter. We will restrict our attention to linear least square problems only in this course, leaving nonlinear least square problems as a topic for further study. It is important to realize a big difference in this overdetermined system. In the current setting, in general, with only n parameters in the vector x, one would not expect to be able to reproduce the m-vector b as a linear combination of the n columns of A. In other words, for an overdetermined system there is usually no solution in the usual sense. Instead, what we wish to establish is to minimize the distance r = b Ax (3.) using some norm. The vector r = b Ax is called the residual vector and is a function of x. 86

87 Any choice of norms would work, although, in practice, we prefer to use l 2 -norm which provides more numerical conveniency as a result of its relationship with the inner product and orthogonality, as well as its smoothness and convexity. Remark: The use of the l 2 -norm gives the method of least squares its name: the solution is the vector x that minimizes the sum of squares of differences between b and Ax: r 2 2 = b Ax 2 2. (3.2) Remark: We often write a linear least squares problem as 87 Ax = b (3.3) in order to explicitly reflect the fact that x is not the exact solution to the overdetermined system, but rather is an approximate solution in the l 2 -norm (or least square) sense... Overdetermined System The question that naturally arises is then what makes us to consider an overdetermined system? Let us consider some possible situations. Example: Suppose we want to know monthly temperature distribution in Santa Cruz. We probably would not make one single measurement for each month and consider it done. Instead, we would need to take temperature readings over many years and average them. From this procedure, what we are going to make is a table of typical temperatures from January to December, based on vast observational data, which often times even include unusual temperature readings that deviate from the usual temperature distributions. This example illustrates a typical overdetermined system: we need to determine one representative meaningful temperature for each month (i.e., one unknown temperature for each month) based on many numbers (thus overdetermined) of collected sample data including sample noises (due to measurement failures/errors, or unusually cold or hot days data deviations, etc.). Example: Early development of the method of least squares was due largely to Gauss, who used it for solving problems in astronomy, particularly determining the orbits of celestial bodies such as asteroids and comets. 
The least squares method was used to smooth out any observational errors and to obtain more accurate values for the orbital parameters.

Example: A land surveyor is to determine the heights of three hills above some reference point. Sighting from the reference point, the surveyor measures their respective heights to be

h_1 = 1237 ft., h_2 = 1942 ft., and h_3 = 2417 ft. (3.4)

88

And the surveyor makes another set of relative measurements of heights:

h_2 - h_1 = 711 ft., h_3 - h_1 = 1177 ft., and h_3 - h_2 = 475 ft. (3.5)

The surveyor's observations can be written as

Ax = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix} = \begin{bmatrix} 1237 \\ 1942 \\ 2417 \\ 711 \\ 1177 \\ 475 \end{bmatrix} = b. (3.6)

It turns out that the approximate solution becomes (we will learn how to solve this soon)

x^T = [h_1, h_2, h_3] = [1236, 1943, 2416], (3.7)

which differs slightly from the three initial height measurements, representing a compromise that best reconciles the inconsistencies resulting from measurement errors.

Figure 1. A math-genius land surveyor obtains three hills' heights by computing the least squares solution to the overdetermined linear system.

1.2. Data Fitting

One of the most common standard problems for the least squares method is that of data fitting, or curve fitting, especially when the data have some random error associated with them. The problem can be described as:

Given: data points (t_i, y_i), i = 1, ..., m

Goal: to find the n-vector x that gives the best fit to the data by the model function f(t, x), f : R^{n+1} → R, in the least squares sense:

\min_x \sum_{i=1}^{m} \left( y_i - f(t_i, x) \right)^2. (3.8)

Figure 2. The result of fitting a set of data points (t_i, y_i) with a quadratic function f(t, x) = x_1 + x_2 t + x_3 t^2. Image source: Wikipedia

Definition: A data fitting problem is linear if f is linear in x = [x_1, ..., x_n]^T, although f could be nonlinear in t.

Example: A polynomial fitting,

f(t, x) = x_1 + x_2 t + x_3 t^2 + ... + x_n t^{n-1},   (3.9)

is a linear data fitting problem.

Example: An exponential fitting,

f(t, x) = x_1 e^{x_2 t} + x_2 e^{x_3 t} + ... + x_{n-1} e^{x_n t},   (3.10)

is a nonlinear data fitting problem.

Example: Consider five given data points (t_i, y_i), 1 <= i <= 5, and a data fitting using a quadratic polynomial. This overdetermined system can be written, using a Vandermonde matrix, as

     [ 1  t_1  t_1^2 ]            [ y_1 ]
     [ 1  t_2  t_2^2 ]  [ x_1 ]   [ y_2 ]
Ax = [ 1  t_3  t_3^2 ]  [ x_2 ] = [ y_3 ] = b.   (3.11)
     [ 1  t_4  t_4^2 ]  [ x_3 ]   [ y_4 ]
     [ 1  t_5  t_5^2 ]            [ y_5 ]

The problem is to find the best possible values of x = [x_1, x_2, x_3]^T, minimizing the residual r in the l_2 sense:

||r||_2^2 = ||b - Ax||_2^2 = sum_{i=1}^5 ( y_i - (x_1 + x_2 t_i + x_3 t_i^2) )^2.   (3.12)
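A minimal sketch of this quadratic fit (assuming NumPy; the sample points are synthetic, not from the notes, and are chosen to lie exactly on y = 1 + 2t + 3t^2 so the fit recovers the coefficients):

```python
import numpy as np

# Hypothetical sample data lying exactly on a quadratic, for illustration.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 1.0 + 2.0 * t + 3.0 * t**2

# 5x3 Vandermonde matrix with columns [1, t_i, t_i^2], as in Eq. 3.11.
A = np.vander(t, N=3, increasing=True)
x, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients [x_1, x_2, x_3]
```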

Such an approximating quadratic polynomial is plotted as the smooth blue curve in Fig. 2, together with the m given data points denoted as red dots.

Remark: In statistics, the method of least squares is also known as regression analysis.

2. Geometric Aspects of Least Squares Problems

2.1. Normal Equations

Eq. 3.3 suggests that we can consider a least squares problem as a minimization problem in multivariate calculus, where we set the first derivative equal to zero. To do that, let us write the l_2-norm of the residual vector r using an objective function phi : R^n -> R:

phi(x) = ||r||_2^2 = r^T r = (b - Ax)^T (b - Ax) = b^T b - 2 x^T A^T b + x^T A^T A x.   (3.13)

Minimizing the residual implies finding the solution x that satisfies

d(phi)/d(x_i) = 0,  i = 1, ..., n,   (3.14)

or, in compact form,

0 = grad phi(x) = 2 A^T A x - 2 A^T b,   (3.15)

which yields an n x n symmetric linear system, referred to as the system of normal equations:

A^T A x = A^T b.   (3.16)

Note: We note that

A^T A is positive definite  <=>  rank(A) = n.   (3.17)

Throughout this chapter we always assume rank(A) = n in order to avoid rank deficiency, a situation in which the solution to the corresponding least squares problem is not unique. Therefore A^T A is always guaranteed to be symmetric positive definite (SPD).

Note: There is good news and bad news regarding Eq. 3.16. The good news is that, in theory, we can use the Cholesky decomposition to solve this SPD system of normal equations. In practice, however, this is not always advisable and sometimes produces disappointingly inaccurate results. Here is one reason for this behavior. Consider, for example,

    [ 1  1 ]
A = [ e  0 ],   (3.18)
    [ 0  e ]

where 0 < e < sqrt(e_mach), with e_mach the machine accuracy. In floating-point arithmetic, we get

A^T A = [ 1 + e^2    1      ]  =  [ 1  1 ]
        [ 1          1 + e^2 ]    [ 1  1 ],   (3.19)

which is a singular matrix as a result.

Remark: This shortcoming of the normal equations, as well as the adverse sensitivity due to the condition number squaring effect,

cond(A^T A) = cond(A)^2,   (3.20)

leads us to seek more numerically robust and accurate methods for linear least squares problems. We will continue this discussion in Section 3.

2.2. Orthogonality and Orthogonal Projectors

For a geometric view of least squares problems, we first look at the notion of orthogonality.

Definition: For any two vectors x_1, x_2 in R^m, we define the angle between the two vectors to be any angle theta satisfying

<x_1, x_2> = x_1^T x_2 = ||x_1||_2 ||x_2||_2 cos(theta).   (3.21)

Figure 3. Geometric interpretation of the linear least squares problem.

Definition: Two vectors x_1, x_2 are said to be orthogonal (or perpendicular, or normal) to each other if cos(theta) = 0, or equivalently, x_1^T x_2 = 0.

It is important to notice that for a least squares problem Ax =~ b, the m-vector b does not in general lie in span(A), which is a subspace of dimension at most n, since m > n. We can then choose y = Ax in span(A) such that the distance between b and y is minimized. It is easy to see that this takes place when the residual vector r = b - Ax becomes orthogonal to span(A), as shown in Fig. 3.
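The two normal-equations pitfalls just discussed, the computed A^T A becoming singular and the condition-number squaring, can both be demonstrated numerically (a sketch assuming NumPy; the epsilon values are mine, chosen around sqrt(e_mach) for float64):

```python
import numpy as np

# Matrix from the text, with eps below sqrt(machine epsilon) for float64.
eps = 1.0e-9
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

AtA = A.T @ A   # exact value is [[1 + eps^2, 1], [1, 1 + eps^2]]
# In float64, 1 + eps**2 rounds to exactly 1, so the computed A^T A is singular.

# Condition-number squaring cond(A^T A) = cond(A)^2, shown with a larger eps
# so that the computed A^T A is still numerically nonsingular.
B = np.array([[1.0, 1.0],
              [1e-4, 0.0],
              [0.0, 1e-4]])
ratio = np.linalg.cond(B.T @ B) / np.linalg.cond(B) ** 2   # close to 1
```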

Because span(A) is the subspace spanned by the columns of A, we see that r must be orthogonal to each of those columns, and we simply recover the normal equations:

0 = A^T r = A^T (b - Ax),   (3.22)

A^T A x = A^T b.   (3.23)

Note: Fig. 3 also shows that the choice of the closest vector y = Ax in span(A) is nothing but a projection of b onto span(A). This motivates us to define a projector matrix P as follows:

Definition: A square matrix P is said to be a projector if it is idempotent, that is,

P^2 = P.   (3.24)

Such a matrix projects any given vector onto a subspace, namely span(P), but leaves unchanged any vector x already in span(P).

Definition: If a projector P is also symmetric, P^T = P, then it is called an orthogonal projector.

Example:

    [ 1 0 0 ]
P = [ 0 1 0 ]   (3.25)
    [ 0 0 0 ]

is an orthogonal projector that maps all vectors in R^3 onto the x-y plane, while keeping vectors already in the x-y plane unchanged:

P [x, y, z]^T = [x, y, 0]^T,   (3.26)

and

P [x, y, 0]^T = [x, y, 0]^T.   (3.27)

It is easy to check that P^2 = P:

P^2 [x, y, z]^T = P [x, y, 0]^T = [x, y, 0]^T.   (3.28)

Note: With an orthogonal projector P, we find that

P_perp = I - P   (3.29)

is an orthogonal projector onto span(P)_perp, the orthogonal complement of span(P). Then we can express any vector x in R^m as the sum

x = ( P + (I - P) ) x = Px + P_perp x.   (3.30)

Let us now consider how the concept of the orthogonal projector P onto span(A) helps in understanding the solution of the overdetermined system Ax =~ b. First, by definition, we get

PA = A,   P_perp A = 0.   (3.31)

Then we have

||b - Ax||_2^2 = ||P(b - Ax) + P_perp(b - Ax)||_2^2
             = ||P(b - Ax)||_2^2 + ||P_perp(b - Ax)||_2^2   (by the Pythagorean theorem)
             = ||Pb - Ax||_2^2 + ||P_perp b||_2^2.   (3.32)

Therefore, the least squares solution is the x that annihilates the first term in the last relation, which is the solution to the overdetermined linear system

Ax = Pb.   (3.33)

This is an intuitively clear result, shown in Fig. 3: the orthogonal projection Pb of b gives y in span(A), the vector closest to b. Remember that we wish to transform our given overdetermined m x n system into an n x n square system, so that we can use the techniques we learned in Chapter 2. Assuming rank(A) = n, there are two ways to construct orthogonal projectors (i.e., symmetric and idempotent matrices) explicitly that allow this transformation into a square system:

- P = A (A^T A)^{-1} A^T;
- P = Q Q^T, where Q is an m x n matrix whose columns form an orthonormal basis for span(A) (such a Q satisfies Q^T Q = I). Obviously, this gives span(Q) = span(A).

First of all, both choices of P easily satisfy P^T = P and P^2 = P. Also, substituting each P into Eq. 3.33, one respectively obtains the following square systems:

- A^T A x = A^T b (here we used A^T P = A^T P^T = (PA)^T = A^T);
- Q^T A x = Q^T b (here we used Q^T P = Q^T Q Q^T = Q^T).

Notice that the first transformation (the easier of the two to construct) results in the system of normal equations, which we have already shown to suffer from numerical issues.
The second, orthogonal transformation, however, will provide us with a very useful idea for accomplishing the so-called QR factorization, as will be shown in Section 3.4.
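The two projector constructions above can be verified numerically (a sketch assuming NumPy; the 6x3 test matrix is random and hypothetical, taken full rank for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((6, 3))   # a generic full-rank 6x3 matrix (assumption)
b = rng.standard_normal(6)

# First construction: P = A (A^T A)^{-1} A^T.
P1 = A @ np.linalg.solve(A.T @ A, A.T)
# Second construction: P = Q Q^T, with orthonormal columns spanning span(A).
Q = np.linalg.qr(A)[0]
P2 = Q @ Q.T

sym_ok  = np.allclose(P1, P1.T) and np.allclose(P2, P2.T)        # symmetric
idem_ok = np.allclose(P1 @ P1, P1) and np.allclose(P2 @ P2, P2)  # idempotent
same_P  = np.allclose(P1, P2)    # both project onto span(A)

# Solving A x = P b reproduces the least squares solution of A x =~ b:
x_proj = np.linalg.lstsq(A, P1 @ b, rcond=None)[0]
x_ls   = np.linalg.lstsq(A, b, rcond=None)[0]
```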

3. Invariant Transformations

We will now examine several methods for transforming an overdetermined m x n linear least squares problem Ax =~ b into an n x n square linear system A'x = b' that leaves x unchanged, and which we already know how to solve using the methods of Chapter 2. In seeking invariant transformations we keep in mind the sequence of problem transformations we would like to establish:

rectangular -> square -> triangular

Note: The second transformation (square to triangular) is what we already learned in Chapter 2; in this chapter we learn the first transformation (rectangular to square).

3.1. Normal Equations

With rank(A) = n, we have already seen several times that the n x n symmetric positive definite system of normal equations

A^T A x = A^T b   (3.34)

has the invariant property of preserving the same solution x as the m x n least squares problem Ax =~ b. As discussed, we could in theory use the Cholesky factorization,

A^T A = L L^T,   (3.35)

followed by the forward substitution L y = A^T b, and then the back substitution L^T x = y. But we have seen that there are numerical accuracy and stability issues related to floating-point arithmetic, as well as the condition number squaring effect. For this reason, we do not use the normal equations in practice.

3.2. Orthogonal Transformations

In view of the potential numerical difficulties with the normal equations approach, we need an alternative that does not require forming A^T A and A^T b. From this alternative we expect a more numerically robust type of transformation. Recall that in Chapter 2 we used a similar trick: transforming to a simpler, triangular system. Can we use the same triangular form for the current purpose? The answer is no, simply because such a triangular (elimination-based) transformation does not preserve what we now want to preserve in least squares problems: the l_2-norm.
What kind of transformation, then, preserves the norm while leaving the solution x unchanged? Hinted by the previous section, we see that an orthogonal transformation given by Q, where Q is an orthogonal real square matrix, i.e., Q^T Q = I (in other words, the columns of Q form an orthonormal basis),

would be a good candidate. The norm-preserving property of Q can be easily shown:

||Qx||_2^2 = (Qx)^T Qx = x^T Q^T Q x = x^T x = ||x||_2^2.   (3.36)

Similarly, we also have

||Q^T x||_2^2 = (Q^T x)^T Q^T x = x^T Q Q^T x = x^T x = ||x||_2^2.   (3.37)

Remark: Orthogonal matrices are of great importance in many areas of numerical computation because of their norm-preserving property: the magnitude of errors remains the same, without any amplification. Thus, for example, one can use orthogonal transformations to solve square linear systems without the need for pivoting for numerical stability. Attractive as this looks, the orthogonalization process is significantly more expensive computationally than standard Gaussian elimination, so the superior numerical properties come at a price.

3.3. Triangular Least Squares

Now that we have a couple of transformations that preserve the least squares solution, we seek a simpler form in which a least squares problem can be solved easily. As with the square linear systems of Chapter 2, we consider least squares problems with an upper triangular matrix of the form

[ R ]        [ c_1 ]
[ 0 ] x  =~  [ c_2 ]  = c,   (3.38)

where R is an n x n upper triangular matrix and x is an n-vector. The right-hand-side vector c is partitioned accordingly into an n-vector c_1 and an (m - n)-vector c_2. We see that the least squares residual is given by

||r||_2^2 = ||c_1 - Rx||_2^2 + ||c_2||_2^2,   (3.39)

which tells us that the least squares solution x satisfies

R x = c_1,   (3.40)

which is solvable by back substitution. The minimum residual then becomes

||r||_2^2 = ||c_2||_2^2.   (3.41)

3.4. QR Factorization

Let us now combine the two nice techniques, the orthogonal transformation and the triangular least squares, into one method: the QR factorization. The QR factorization writes an m x n (m > n) matrix A as

A = Q [ R ]
      [ 0 ],   (3.42)

where Q is an m x m orthogonal matrix and R is an n x n upper triangular matrix. This QR factorization transforms the least squares problem Ax =~ b into a triangular least squares problem. Do they both have the same solution? To see this, we check:

||r||_2^2 = ||b - Ax||_2^2                          (3.43)
         = || b - Q [R; 0] x ||_2^2                 (3.44)
         = || Q^T ( b - Q [R; 0] x ) ||_2^2         (3.45)
         = || Q^T b - [R; 0] x ||_2^2               (3.46)
         = ||c_1 - Rx||_2^2 + ||c_2||_2^2,          (3.47)

where the transformed right-hand side is

Q^T b = [ c_1 ]
        [ c_2 ],   (3.48)

with an n-vector c_1 and an (m - n)-vector c_2. As in the previous section, we reach the same conclusion: x satisfies

R x = c_1,   (3.49)

which is solvable by back substitution, and the minimum residual is

||r||_2^2 = ||c_2||_2^2.   (3.50)

We will study how to compute this QR factorization in the next section.

4. Orthogonalization Methods

Just as the LU factorization proceeds by successive zeroing via elimination, we will again adopt successive zeroing processes, but now using orthogonal transformations instead of elimination, so that the norm is preserved. There are a number of commonly used orthogonalization methods, and we are going to learn the following two:

- Householder transformations (based on elementary reflectors),
- Gram-Schmidt orthogonalization.

Remark: There is another popular method, the Givens transformation, based on plane rotations. It can be found in many numerical linear algebra textbooks, and we do not cover it in this course.
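Before studying how Q and R are actually computed, the QR solution process of Eqs. 3.42-3.50 can be sketched with a library QR routine (assuming NumPy; data are the surveyor's measurements, with np.linalg.solve standing in for a dedicated triangular back substitution):

```python
import numpy as np

A = np.array([[ 1, 0, 0],
              [ 0, 1, 0],
              [ 0, 0, 1],
              [-1, 1, 0],
              [-1, 0, 1],
              [ 0,-1, 1]], dtype=float)
b = np.array([1237.0, 1941.0, 2417.0, 711.0, 1177.0, 475.0])

Q, R = np.linalg.qr(A)      # reduced QR: Q is 6x3 with orthonormal columns, R is 3x3
c1 = Q.T @ b                # transformed right-hand side c_1
x = np.linalg.solve(R, c1)  # back substitution for R x = c_1
min_res2 = np.linalg.norm(b - A @ x) ** 2   # equals ||c_2||_2^2
```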

4.1. Householder Transformations

The idea is very similar to the one we used in the LU decomposition: annihilating desired components of a given vector, while now also preserving its norm. One very popular approach to accomplish this is the Householder transformation. The method reflects a given vector (like mirroring, at the same norm distance) across a hyperplane to a new location in which the desired annihilation is obtained. Before developing the mathematical tools around the method, it is best to first take a look at what it means geometrically.

Figure 4. Geometric interpretation of the Householder transformation H as a reflection. The transformation operator H is represented as two successive steps, denoted by the green arrow followed by the purple arrow, across the hyperplane drawn as a dashed pale-blue line.

The left panel in Fig. 4 shows the principle of the method: the given vector a is reflected onto the first coordinate axis, resulting in the transformed vector ||a||_2 e_1. The reflection is established by bisecting the angle between a and the first coordinate axis, and applying the reflection across the corresponding hyperplane (the pale-blue dashed line). Obviously, the norm is well preserved by this reflection. Another transformation is also available, as indicated in the right panel of Fig. 4: there, the vector a is reflected across the other choice of hyperplane, orthogonal to the previous one, yielding the transformed vector -||a||_2 e_1, farther away from the original a but still lying on the first coordinate axis.

Now let us take a look at what kind of mathematical operator performs such reflections. Define a matrix H of the form

H = I - 2 (v v^T) / (v^T v),   (3.51)

where v is a nonzero vector. We see from the definition that H is both symmetric, i.e., H^T = H, and orthogonal, i.e., H^T H = I; hence H^{-1} = H^T = H. With Eq. 3.51, what we want is to find v such that the constructed H satisfies:

- the resulting Ha has annihilations (i.e., zero entries) in all components except the first,
- the norm is preserved, i.e., ||Ha||_2 = ||a||_2.

In other words, we want

Ha = [alpha, 0, ..., 0]^T = alpha e_1,   (3.52)

where alpha = +/- ||a||_2. With this in mind, carrying out the actual calculation,

alpha e_1 = Ha = ( I - 2 (v v^T)/(v^T v) ) a = a - 2 v (v^T a)/(v^T v),   (3.53)

we finally arrive at v of the form

v = (a - alpha e_1) (v^T v) / (2 v^T a).   (3.54)

At first this looks a bit complicated, but we can simplify it by taking out the scalar factor (v^T v)/(2 v^T a), since it cancels in the fraction in Eq. 3.51 anyway. Therefore, we can conveniently omit it and define the Householder vector v by

v = a - alpha e_1.   (3.55)

Let us now finalize our construction of H. First recall from Sec. 2.2 that the orthogonal projector onto span(v) is given by

P = v (v^T v)^{-1} v^T = (v v^T)/(v^T v).   (3.56)

Also, the orthogonal complement projector, onto span(v)_perp, is given by

P_perp = I - P = I - (v v^T)/(v^T v).   (3.57)

This projector P_perp gives the projection of a onto the hyperplane,

P_perp a = (I - P) a = a - v (v^T a)/(v^T v),   (3.58)

which is only halfway to the desired location, the first coordinate axis in the current example. In order to reach the first coordinate axis we therefore need to go twice as far, which gives us our final form of H:

H = I - 2 (v v^T)/(v^T v).   (3.59)

The full construction is illustrated in Fig. 5, where the hyperplane is given by span(v)_perp = {x : v^T x = 0}, for some v.

Figure 5. Geometric interpretation of the Householder transformation as a reflection.

Note: We should make a quick comment on the choice of sign in alpha = +/- ||a||_2. Depending on the sign, we get the transformation closer to a, namely ||a||_2 e_1 (shown in the left panel of Fig. 4), or the farther one, -||a||_2 e_1 (shown in the right panel of Fig. 4). Both choices work just fine in principle; however, we know from experience that a subtraction v = a - alpha e_1 whose result is small in magnitude is prone to cancellation errors in finite-precision arithmetic. For this reason, we prefer the sign of alpha that places the point on the first coordinate axis as far away from a as possible.

Example: The above construction illustrates a single transformation step, possibly one among many successive transformations. As a single-step example, let us determine a Householder transformation that annihilates all but the first component of the vector

a = [2, 1, 2]^T.   (3.60)

Following Eq. 3.55 we have

v = a - alpha e_1 = [2, 1, 2]^T - alpha [1, 0, 0]^T,   (3.61)

where alpha = +/- ||a||_2 = +/- 3. In order to place the reflected point far away from a, we choose the negative sign for alpha, giving

v = [2, 1, 2]^T + [3, 0, 0]^T = [5, 1, 2]^T.   (3.62)

To confirm that the Householder transformation performs as expected, we compute

Ha = a - 2 v (v^T a)/(v^T v) = [2, 1, 2]^T - 2 (15/30) [5, 1, 2]^T = [-3, 0, 0]^T,   (3.63)

which shows that the zero pattern of the result is correct and that the l_2-norm is preserved.

Note: In actual coding, there is no need to form the matrix H explicitly. All we need is the vector v in order to apply H to any given vector a.

Now we proceed to generalize this single-step transformation into a series of successive transformations

H_n H_{n-1} ... H_2 H_1 A = [ R ]
                            [ 0 ],   (3.64)

which yields the final form of the wanted Q and R factors:

Q^T = H_n H_{n-1} ... H_2 H_1, or equivalently, Q = H_1 H_2 ... H_{n-1} H_n.   (3.65)

This gives us the final decomposition:

A = Q [ R ]
      [ 0 ].   (3.66)

Quick summary: The overall process to seek solutions to the least squares problem Ax =~ b is the following:

- decompose A as in Eq. 3.66, obtaining Q^T = H_n H_{n-1} ... H_2 H_1,
- transform the right-hand side simultaneously at each Householder transformation, i.e.,

Q^T b = H_n H_{n-1} ... H_2 H_1 b = [ c_1 ]
                                    [ c_2 ],   (3.67)

101 0 compute the solution x by solving the n n triangular linear system Rx = c if needed, the minimum residual can also be evaluated by r 2 = c 2 2 Example: Let us now solve the land surveyor s least squares problem given by the system in Eq. 3.6 using Householder QR factorization: Ax = 0 h 94 h = 0 h 7 = b. (3.68) Recall that the solution we assumed to know was given by x T = [h, h 2, h 3 ] = [236, 943, 246], (3.69) and let s see if we get this solution indeed. The first Householder step is to construct the Householder vector v that annihilates the subdiagonal entries of the first column a of A, which is given by Eq. 3.55, with in this case, α = a 2 = 3 =.732, v = a αe = = Applying the resulting H to the first column gives H a = a 2v v T a v T v = (3.70). (3.7) Applying H to the second and third columns and also the right hand side vector b in a similar way gives, respectively: v T H a 2 = a 2 2v a 2 0 v T v = , (3.72) 0.23

102 02 and H a 3 = a 3 2v v T a 3 v T v H b = b 2v v T b v T v = = , (3.73). (3.74) Putting all things together, we get H A = , H b = (3.75) Next, we compute the second Householder vector v 2 from first considering a 2 by after annihilating the subdiagonal entries of the second column of H A (i.e., except for the first entry but keeping the rest): a 2 = Then, with α = a 2 2 =.6330, we get v 2 = a 2 αe 2 = (3.76) = (3.77) v2 Now we apply H 2 = I 2v T 2 v2 T v onto the second and the third columns of H A 2 as well as H b, where the actual operation begins from the second entry in both columns, keeping the first entries unchanged. Writing these out explicitly, we get:

103 03 H 2 H = = ( I 2v 2 v T 2 v T 2 v 2 ( I 2v 2 v T 2 v T 2 v 2 ) ) = = so that we have: H 2 H A = , H 2 H b = , (3.78) , (3.79). (3.80) The last step now uses the third Householder vector v 3 for annihilating the subdiagonal entries of the third column of H 2 H A by considering 0 0 a 3 = (3.8) This gives α = a 3 2 =.442 and hence we get v 3 = a 3 αe 3 = = (3.82) v3 Applying the final Householder transformation with H 3 = I 2v T 3 v3 T v to the 3 third column of H 2 H A gives H 3 H 2 H A = [ R = 0 ], (3.83)

and

H_3 H_2 H_1 b = Q^T b = [ c_1 ]
                        [ c_2 ].   (3.84)

We can now solve the upper triangular system R x = c_1 by back substitution to obtain

x^T = [h_1, h_2, h_3] = [1236, 1943, 2416].   (3.85)

Finally, the minimum residual is given by ||r||_2^2 = ||c_2||_2^2 = 35.

The algorithm of the Householder transformation can be written as follows:

Algorithm: Householder QR factorization of an m x n matrix A:
for k = 1 to n                                   # loop over columns
    alpha_k = -sign(a_kk) sqrt( sum_{i=k}^m a_ik^2 )   # sign of alpha chosen to avoid cancellation
    v_k = [0, ..., 0, a_kk, ..., a_mk]^T - alpha_k e_k # Householder vector for current column
    beta_k = v_k^T v_k
    if beta_k = 0 then                           # skip current column if it is already zero
        continue with next k
    endif
    for j = k to n
        gamma_j = v_k^T a_j
        a_j = a_j - 2 (gamma_j / beta_k) v_k     # apply transformation to remaining submatrix
    endfor
endfor

Example: The above example can also be solved by using the system of normal equations. That is, we now solve the symmetric positive definite system

          [  3 -1 -1 ] [ h_1 ]   [ -651 ]
A^T A x = [ -1  3 -1 ] [ h_2 ] = [ 2177 ] = A^T b,   (3.86)
          [ -1 -1  3 ] [ h_3 ]   [ 4069 ]

which can be solved by using the Cholesky factorization, as shown in an example (see Eq. 2.87) in Chapter 2. We can easily check that the same solution

x^T = [1236, 1943, 2416]   (3.87)

is obtained using the Cholesky factorization. From this, we can directly compute the minimum residual vector

r = b - Ax = [1, -2, 1, 4, -3, 2]^T,   (3.88)

which confirms the previous result ||r||_2^2 = 35.

4.2. Gram-Schmidt Orthogonalization

Another method for computing the QR factorization is Gram-Schmidt orthogonalization. Each step of the method can be interpreted as an m x 1 least squares problem, as will be shown in this section. Gram-Schmidt orthogonalization determines two orthonormal m-vectors q_1 and q_2 from any two given linearly independent m-vectors a_1 and a_2, as described in the following steps:

Step 1: The first step is trivial: simply normalize a_1 to obtain q_1,

q_1 = a_1 / ||a_1||_2.   (3.89)

Step 2: As seen in the Householder construction (see Eq. 3.56), we note that the m x m projector P that projects a vector onto span(q_1) is

P = (q_1 q_1^T)/(q_1^T q_1) = q_1 q_1^T,   (3.90)

because q_1^T q_1 = 1 for the unit vector q_1. If we project a_2 onto span(q_1) using this P, we get

P a_2 = q_1 (q_1^T a_2).   (3.91)

Step 3: In the next step we compute the residual vector by subtracting P a_2 = q_1 (q_1^T a_2) from a_2:

r = a_2 - q_1 (q_1^T a_2),   (3.92)

which is orthogonal to q_1.

Step 4: All we need now is to rescale the residual vector r into an orthonormal vector to get q_2:

q_2 = r / ||r||_2.   (3.93)
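Before developing Gram-Schmidt further, the Householder machinery of the previous subsection can be collected into a short sketch (assuming NumPy; the function name is mine, not the notes'). It overwrites a working copy of A with [R; 0], applies each reflection to b on the fly, and reproduces the surveyor solution:

```python
import numpy as np

def householder_qr(A, b):
    """Householder QR sketch: returns (R, c1, c2) with R x = c1 and ||c2||^2
    the minimum residual, applying each H_k to b as it is formed."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    m, n = A.shape
    for k in range(n):
        a = A[k:, k]
        # Sign of alpha chosen opposite to a[0] to avoid cancellation.
        alpha = -np.linalg.norm(a) if a[0] == 0 else -np.sign(a[0]) * np.linalg.norm(a)
        v = a.copy()
        v[0] -= alpha                      # Householder vector v = a - alpha e_1
        beta = v @ v
        if beta == 0.0:                    # column already annihilated
            continue
        for j in range(k, n):              # apply H_k to the remaining submatrix
            gamma = v @ A[k:, j]
            A[k:, j] -= (2.0 * gamma / beta) * v
        b[k:] -= (2.0 * (v @ b[k:]) / beta) * v   # transform right-hand side too
    return A[:n, :n], b[:n], b[n:]

A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([1237.0, 1941.0, 2417.0, 711.0, 1177.0, 475.0])
R, c1, c2 = householder_qr(A, b)
x = np.linalg.solve(R, c1)   # back substitution for R x = c1
```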

Figure 6. Geometric interpretation of Gram-Schmidt orthogonalization: Steps 1 through 4 are illustrated from top to bottom.
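The four steps above extend column by column to a full QR factorization. A sketch in NumPy (the function name is mine; following the notes, later columns are updated in place as each q_k is produced, which is the modified variant of Gram-Schmidt):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Gram-Schmidt QR sketch: returns Q (m x n, orthonormal columns) and
    R (n x n, upper triangular) with A = Q R."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        if R[k, k] == 0.0:                 # columns are linearly dependent
            raise ValueError("rank-deficient matrix")
        Q[:, k] = A[:, k] / R[k, k]        # normalize current column
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ A[:, j]
            A[:, j] -= R[k, j] * Q[:, k]   # subtract component along q_k
    return Q, R

# Surveyor's least squares problem solved via Gram-Schmidt QR:
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
              [-1, 1, 0], [-1, 0, 1], [0, -1, 1]], dtype=float)
b = np.array([1237.0, 1941.0, 2417.0, 711.0, 1177.0, 475.0])
Q, R = gram_schmidt_qr(A)
x = np.linalg.solve(R, Q.T @ b)   # back substitution for R x = Q^T b
```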

107 07 Note: Notice that the projection process in Step 2 is equivalent to the m least squares problem q γ = a 2, (3.94) where γ is the solution to the least squares problem. With the projector P this least squares problem is transformed into another well-defined system in the subspace span(q ): q γ = Pa 2. (3.95) We can write the algorithm of Gram-Schmidt orthogonalization as follow: Algorithm: Gram-Schmidt orthogonalization: for k = to n #[loop over column] r kk = a k 2 if r kk = 0 then stop # [stop if linearly independent] endif q k = a k /r kk # [normalize current column] for j = k + to n r kj = q T k a j a j = a j r kj q k # [subtract from succeeding columns # their components in current column] endfor endfor Note: In the above Algorithm: we treated a k and q k separately for clear exposition purpose, but they can be shared in in the same storage. Example: Let us consider to solve the same least squares problem of the land surveyors example given in Eq. 3.68, using Gram-Schmidt orthogonalization. The first step is to normalize the first column of A: r = a 2 =.732, q = a r = (3.96) Calculating orthogonalization processes and subtractions in Step 2 and Step 3, we first obtain r 2 = q T a 2 = , r 3 = q T a 3 = , (3.97)

108 08 where a 2 and a 3 are the second and the third column of A. Continuing the remaining procedures in Step 2 and Step 3 we get: r 2 = a 2 r 2 q = , r 3 = a 3 r 3 q = (3.98) The resulting transformed matrix now has q for its first column, together with r 2 and r 3 for the unnormalized second and the third columns: (3.99) Let us abuse our naming strategy and let a 2 and a 3 be the second and the third columns of the new matrix now. Normalizing the second column gives r 22 = a 2 2 =.6330, q 2 = a 2 0 = r (3.00) If we evaluate the orthogonalization of the second column against the third column, we get: r 23 = q T 2 a 3 = 0.865, (3.0) and hence we further obtain yet another residual vector r 3 = a 3 r 23 q 2 = 0. (3.02) We now form another transformed matrix which has q 2 for its second orthonormal column and r 3 for its unnormalized third column: (3.03)

109 09 Finally we normalize the last column, again abusing our naming strategy and let the third column be a 3 : r 33 = a 3 2 =.442, q 3 = a = r 33 0, (3.04) and by replacing the last column of the last transformed matrix with q 3 results in the final transformed matrix (3.05) Now collecting entries r ij of R, we form R = , (3.06).442 which altogether provides the QR factorization: A = QR = (3.07) Computing the right hand side transformation, Q T b, we obtain Q T b = = c. (3.08) This allows us to solve the upper triangular system Rx = c by back-substitution, resulting in the same solution as before: x T = [236, 943, 246]. (3.09)

Chapter 4

Eigenvalue Problems

The principal aim of this chapter is to study numerical methods for finding eigenvalues and eigenvectors of a given n x n matrix A. Let us first review some basic mathematical concepts before we study the relevant numerical methods in Section 3.

1. Reviews, Definitions

One key question we may want to ask is: why do we want to study eigenvalue problems, and what kind of mathematical insight can be obtained by understanding them? As briefly mentioned at the beginning of Chapter 2, we try to describe various relationships in nature using linear theories, written as a system of linear equations,

Ax = b.   (4.1)

In studying such linear systems, we encounter various types of linear transformations on a vector space, including: expanding or shrinking a given vector by a scalar multiple, rotating or reflecting a vector, permuting the components of a vector, etc. Often the effect of a linear transformation is a combination of these basic transformations. We therefore prefer to break such combined linear transformation effects down into their simplest constituent actions, whereby mathematical insight can be readily achieved by analyzing the basic transformations. The key issue is what happens when a given linear transformation is applied repeatedly: do the results converge to some steady-state solution, oscillate, or diverge to an unstable solution? This question can be answered by resolving the transformation into a set of simple actions, specifically expansion or contraction along certain directions. A given direction in a vector space is determined by any nonzero vector x pointing in that direction. Thus, we are interested in finding such a vector, which is defined now.

Definition: Given an n x n matrix A representing a linear transformation on an n-dimensional vector space, we wish to find a nonzero vector x and a scalar lambda such that

Ax = lambda x.   (4.2)

We call such a scalar lambda an eigenvalue, and a corresponding vector x a (right) eigenvector.

Note: We can also define a nonzero left eigenvector y such that

y^T A = lambda y^T.   (4.3)

For expositional purposes, however, we only develop our numerical studies using right eigenvectors in this course.

We notice that eigenvectors are not unique: they can be scaled in infinitely many different ways. If Ax = lambda x, then for any nonzero scalar gamma, gamma x is also an eigenvector corresponding to lambda, because A(gamma x) = lambda (gamma x). For this reason, we usually consider eigenvectors to be normalized. This tells us that the fundamental object of interest is not really any particular choice of eigenvector, but rather the set of all such eigenvectors, even including x = 0. This allows us to define the following:

Definition: Let A be an n x n real or complex matrix. The subspace

S_lambda = {x : Ax = lambda x} in R^n (or C^n)   (4.4)

is called the eigenspace corresponding to the eigenvalue lambda.

Similarly, we define the set of all eigenvalues as well:

Definition: Let lambda(A) be the set consisting of all the eigenvalues of A:

lambda(A) = {lambda : Ax = lambda x for some x != 0};   (4.5)

the set lambda(A) is called the spectrum of A.

Definition: The maximum modulus of the eigenvalues,

rho(A) = max{ |lambda| : lambda in lambda(A) },   (4.6)

is called the spectral radius of A.

Example: An eigenvector of a matrix determines a direction in which the effect of the matrix is particularly simple. For instance, consider the transformation matrix

A = [ 2 1 ]
    [ 1 2 ].   (4.7)

The action of the matrix gives

Ax = [ 2 1 ] [ x_1 ]   [ 2 x_1 + x_2 ]
     [ 1 2 ] [ x_2 ] = [ x_1 + 2 x_2 ].   (4.8)

If we consider its transformation at a few discrete points, we can easily draw the corresponding mappings in the x-y plane as shown in Fig. 1. It turns out that the two vectors v_1 = (1, -1)^T (in purple) and v_2 = (1, 1)^T (in blue), which remain unchanged in their directions under the transformation by A, are eigenvectors of A, corresponding to lambda_1 = 1 and lambda_2 = 3, respectively. They simply get rescaled by the corresponding eigenvalues, i.e., by multiples of 1 and 3.

Figure 1. Geometric interpretation of eigenvectors. The matrix A preserves the direction of vectors parallel to v_1 = (1, -1)^T (in purple) and v_2 = (1, 1)^T (in blue). The vectors in red are not parallel to either eigenvector, so their directions are changed by the transformation. Image source: Wikipedia

Note: We see in the previous example that along the direction of an eigenvector, the matrix transformation simply expands or shrinks the vector by a scalar multiple, whose factor is determined by the eigenvalue lambda. This illustrates how eigenvalues and eigenvectors provide a means of understanding the complicated behavior of a general linear transformation by decomposing it into simpler actions.

Note: Eigenvalue problems occur in many areas of science and engineering. Examples include:

- the natural modes (eigenvectors) and frequencies (eigenvalues) of vibration of a structure,
- the stability of a given structure (eigenvalues),
- convergence analysis of iterative methods (eigenvalues),
- numerical stability analysis (or von Neumann stability analysis) of discretized difference equations (eigenvalues).
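The eigenpairs of the example matrix can be confirmed numerically (a sketch assuming NumPy; `eigh` is the solver for symmetric matrices and returns eigenvalues in ascending order):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

w, V = np.linalg.eigh(A)   # w = eigenvalues [1, 3]; columns of V = unit eigenvectors
# Each column satisfies A v = lambda v, so A V equals V scaled columnwise by w:
check = A @ V - V * w
```

The columns of V are the normalized directions (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign.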

1.1. Characteristic Polynomials

The equation Ax = lambda x is equivalent to the homogeneous system of linear equations

( A - lambda I ) x = 0.   (4.9)

Recall that this system has a nonzero solution x if and only if A - lambda I is singular, which is further equivalent to

det( A - lambda I ) = 0.   (4.10)

Definition: The relation in Eq. 4.10 is a polynomial of degree n in lambda, called the characteristic polynomial p_A(lambda) of A, which can be written as

p_A(lambda) = c_n lambda^n + ... + c_1 lambda + c_0   (4.11)
            = c_n (lambda - lambda_1) ... (lambda - lambda_n)   (4.12)

for c_n != 0, with coefficients c_k in R (or C). The roots lambda_1, ..., lambda_n are the eigenvalues of A.

Example: We can check that the characteristic polynomial of the matrix in Eq. 4.7 is

(lambda - 2)^2 - 1 = lambda^2 - 4 lambda + 3 = (lambda - 3)(lambda - 1) = 0,   (4.13)

which therefore has two distinct roots, lambda_1 = 1 and lambda_2 = 3.

Note: When seeking eigenvalues and eigenvectors analytically, one usually first solves the characteristic polynomial to find the eigenvalues lambda_k for each k. One then solves

A x_k = lambda_k x_k,   (4.14)

again for each k, to obtain the corresponding eigenvector x_k associated with lambda_k. That is, we take the following usual procedure:

Matrix -> characteristic polynomial -> eigenvalues -> eigenvectors

Note: The characteristic polynomial is extremely useful theoretically; however, it turns out not to be very useful as a means of actually computing eigenvalues for matrices of nontrivial size. Several issues can arise in numerical computing:

- computing the coefficients of p_A(lambda) for a given matrix A is, in general, a substantial task,
- the coefficients of p_A(lambda) can be highly sensitive to perturbations in the matrix A, and hence their computation is unstable,

rounding error incurred in forming p_A(λ) can destroy the accuracy of the roots subsequently computed,

computing the roots of p_A(λ) of high degree is yet another substantial task.

Example: To illustrate one of the potential difficulties in evaluating p_A(λ), consider the matrix

A = [ 1  ɛ
      ɛ  1 ], (4.15)

where we assume |ɛ| < sqrt(ɛ_mach), with ɛ_mach the machine accuracy, as before. The exact eigenvalues of A are 1 ± ɛ; however, computing p_A(λ) in floating-point arithmetic gives

det(A - λI) = λ^2 - 2λ + (1 - ɛ^2) = λ^2 - 2λ + 1, (4.16)

which has λ = 1 as a double root. Therefore, we cannot resolve the two eigenvalues computationally in the working precision.

1.2. Multiplicity, Defectiveness, and Diagonalizability

Definition: The algebraic multiplicity of a particular eigenvalue λ_k is its multiplicity h as a root of p_A(λ), i.e., the largest h such that (λ - λ_k)^h is a factor of p_A(λ).

Definition: The geometric multiplicity of λ_k is the dimension of the eigenspace S_{λ_k} associated with λ_k. In other words, it is the maximal number of linearly independent eigenvectors corresponding to λ_k.

Example: The matrix

A = [  1  1
      -1  3 ] (4.17)

has p_A(λ) = (λ - 2)^2 = 0, hence has the eigenvalue λ = 2 as a double root. Therefore the algebraic multiplicity of λ = 2 is 2. If evaluating eigenvector(s), we see that there is only one eigenvector associated with λ = 2, namely

v = [1  1]^T, (4.18)

which gives

dim S_{λ=2} = 1. (4.19)

Therefore we see that the geometric multiplicity of λ = 2 is 1. One also notes that rank(A - 2I) = 1. In general, we have

rank(A) + null(A) = n, (4.20)

where the nullity null(A) of A is the dimension of the null space of A, i.e., the eigenspace of the eigenvalue zero.
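The loss of the two eigenvalues in working precision is easy to reproduce. Below is a small NumPy sketch (not from the notes): it compares the roots of the floating-point characteristic polynomial with the eigenvalues computed directly from the matrix.

```python
import numpy as np

eps = 1e-10                      # eps < sqrt(machine epsilon) ~ 1.5e-8
A = np.array([[1.0, eps],
              [eps, 1.0]])

# Characteristic polynomial lambda^2 - 2*lambda + (1 - eps^2):
# the constant term 1 - eps^2 rounds to exactly 1.0 in double precision,
# so the +/- eps splitting is already gone before any root-finding starts.
coeffs = [1.0, -2.0, 1.0 - eps**2]
poly_roots = np.roots(coeffs)    # both roots are 1 up to root-finder error

# Working directly with the matrix resolves the two eigenvalues 1 +/- eps.
matrix_eigs = np.linalg.eigvalsh(A)

print(sorted(poly_roots))
print(sorted(matrix_eigs))
```

Note that `np.roots` itself only locates a double root to about sqrt(ɛ_mach) accuracy, which further illustrates the point made above about computing roots of p_A(λ).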

Example: λ = 1 is the only eigenvalue, with algebraic multiplicity two, for both matrices

[ 1  1        [ 1  0
  0  1 ]  and   0  1 ]. (4.21)

Its geometric multiplicity, however, is one for the first and two for the latter.

Definition: In general, the algebraic multiplicity is always greater than or equal to the geometric multiplicity. If the algebraic multiplicity is strictly greater than the geometric multiplicity, then the eigenvalue λ_k is said to be defective. Similarly, an n × n matrix that has fewer than n linearly independent eigenvectors is said to be defective.

Definition: If an n × n matrix A is nondefective, then it has a full set of linearly independent eigenvectors x_1, ..., x_n corresponding to the eigenvalues λ_1, ..., λ_n. If we let D = diag(λ_1, ..., λ_n) and X = [x_1, ..., x_n], then X is nonsingular and we have

AX = XD, (4.22)

so that

X^{-1} A X = D. (4.23)

If this is true, A is said to be diagonalizable. This is an example of a similarity transformation, which will be considered later in this chapter.

1.3. Properties of Matrices and Eigenvalue Problems

Here we summarize various properties of an n × n real or complex matrix that are relevant for eigenvalue problems.

Table 1. Properties of n × n matrices.

Property             Real matrix              Complex matrix
Diagonal             a_ij = 0 for i ≠ j
Upper triangular     a_ij = 0 for i > j
Lower triangular     a_ij = 0 for i < j
Tridiagonal          a_ij = 0 for |i - j| > 1
Upper Hessenberg     a_ij = 0 for i > j + 1
Lower Hessenberg     a_ij = 0 for i < j - 1
Symmetric/Hermitian  A^T = A                  A^H = A
Orthogonal/Unitary   A^T A = A A^T = I        A^H A = A A^H = I
Normal               A^T A = A A^T            A^H A = A A^H

There are a couple of useful definitions and theorems in preparation for further discussion of eigenvalue problems, stated below without proofs.

Definition: Let A and B be n × n square matrices. Then A is similar to B if there is a nonsingular matrix P for which

B = P^{-1} A P. (4.24)

Note that this is a symmetric relation (i.e., B is similar to A), since

A = Q^{-1} B Q, with Q = P^{-1}. (4.25)

Remark: If A and B are similar, then the following are true. Let us assume B = P^{-1} A P.

1. p_A(λ) = p_B(λ). Proof:

p_B(λ) = det(B - λI) = det(P^{-1}(A - λI)P)
       = det(P^{-1}) det(A - λI) det(P)
       = det(P^{-1}) det(P) det(A - λI)
       = det(P^{-1}P) det(A - λI)
       = det(I) det(A - λI)
       = det(A - λI) = p_A(λ). (4.26)

2. The eigenvalues of A and B are exactly the same, λ(A) = λ(B), and there is a one-to-one correspondence of the eigenvectors. Proof: Let Ax = λx. Then

P^{-1} A P (P^{-1} x) = λ P^{-1} x, (4.27)

or equivalently,

By = λy, with y = P^{-1} x. (4.28)

Also, the one-to-one correspondence between x and y is trivial from the relationship y = P^{-1} x, or x = P y.

3. The trace and determinant are unchanged:

trace(A) = trace(B), (4.29)
det(A) = det(B). (4.30)

Theorem: (Schur Normal Form) Let A be an n × n real or complex matrix. Then there always exists a unitary matrix U such that

T ≡ U^H A U (4.31)

is upper triangular. Also, since T is triangular and since U^H = U^{-1}, A and T are similar, and hence

p_A(λ) = p_T(λ) = (λ - t_11) ... (λ - t_nn), (4.32)

and thus the eigenvalues of A are the diagonal elements of T. Also, we see that

trace(A) = Σ_{i=1}^n λ_i = Σ_{i=1}^n t_ii, (4.33)

det(A) = Π_{i=1}^n λ_i = Π_{i=1}^n t_ii. (4.34)

Theorem: Let A be an n × n Hermitian (or symmetric) matrix. Then A has n real eigenvalues λ_1, ..., λ_n, not necessarily distinct, and n corresponding eigenvectors x_1, ..., x_n that form an orthonormal basis for C^n (R^n). Finally, there is a unitary matrix U for which

U^H A U = D = diag(λ_1, ..., λ_n). (4.35)

If A is real, then U can be taken to be orthogonal.

Quick summary: We summarize the above as follows:

Eigenvalues of diagonal and triangular matrices are their diagonal entries,

Tridiagonal and Hessenberg matrices are useful intermediate forms in computing eigenvalues,

Symmetric and Hermitian matrices have only real eigenvalues,

Orthogonal and unitary matrices are useful in transforming general matrices into simpler forms,

Normal matrices always have a full set of orthonormal eigenvectors.

1.4. Localizing Eigenvalues: Gershgorin's Theorem

For some purposes it suffices to know crude information about the eigenvalues, instead of determining their values exactly. For example, we might merely wish to know rough estimates of their locations, such as bounding circles or disks. The simplest such bound is

ρ(A) ≤ ||A||. (4.36)

This is easily shown if we take λ to be an eigenvalue with |λ| = ρ(A), and let x be an associated eigenvector with ||x|| = 1 (recall we can always normalize eigenvectors!). Then

ρ(A) = |λ| = ||λx|| = ||Ax|| ≤ ||A|| ||x|| = ||A||. (4.37)

A more accurate way of locating eigenvalues is given by Gershgorin's theorem, stated as follows:

Theorem: (Gershgorin's Theorem) Let A = {a_ij} be an n × n matrix and let λ be an eigenvalue of A. Then λ belongs to one of the disks Z_k given by

Z_k = {z ∈ R or C : |z - a_kk| ≤ r_k}, (4.38)

where

r_k = Σ_{j=1, j≠k}^n |a_kj|, k = 1, ..., n. (4.39)

Moreover, if m of the disks form a connected set S, disjoint from the remaining n - m disks, then S contains exactly m of the eigenvalues of A, counted according to their algebraic multiplicity.

Proof: Let Ax = λx, and let k be the subscript of a component of x such that |x_k| = max_i |x_i| = ||x||_∞. Then the k-th component satisfies

λ x_k = Σ_{j=1}^n a_kj x_j, (4.40)

so that

(λ - a_kk) x_k = Σ_{j=1, j≠k}^n a_kj x_j. (4.41)

Therefore

|λ - a_kk| |x_k| ≤ Σ_{j=1, j≠k}^n |a_kj| |x_j| ≤ Σ_{j=1, j≠k}^n |a_kj| ||x||_∞, (4.42)

and dividing through by |x_k| = ||x||_∞ gives |λ - a_kk| ≤ r_k.

Example: Consider the 3 × 3 matrix A given in Eq. 4.43. (The entries of this matrix, the centers and radii of its three disks Z_1, Z_2, Z_3 in Eqs. 4.44-4.46, and the computed eigenvalues λ(A) in Eq. 4.47 were lost in the transcription of these notes.) The disks can be read off row by row from Eqs. 4.38-4.39, and in this example Z_1 is disjoint from Z_2 ∪ Z_3; therefore there exists exactly one eigenvalue in Z_1, which is confirmed by computing the true eigenvalues λ(A).
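The theorem is easy to check numerically. The following sketch uses a made-up 3 × 3 matrix (not the one from the notes, whose entries were lost in transcription): it computes the disk centers and radii row by row and verifies that every eigenvalue lies in at least one disk.

```python
import numpy as np

def gershgorin_disks(A):
    """Return (center, radius) pairs: center a_kk, radius r_k = off-diagonal row sum."""
    A = np.asarray(A, dtype=complex)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)
    return list(zip(centers, radii))

# Hypothetical 3x3 example matrix (illustrative only).
A = np.array([[4.0, 1.0,  0.0],
              [0.5, 0.0,  0.5],
              [0.0, 1.0, -4.0]])

disks = gershgorin_disks(A)      # three disjoint disks: centers 4, 0, -4, radius 1
eigs = np.linalg.eigvals(A)
print(disks)
print(eigs)
```

Since the three disks here are disjoint, each must contain exactly one eigenvalue.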

2. Invariant Transformations

As before, we seek a simpler form whose eigenvalues and eigenvectors are determined more easily. To do this we need to identify what types of transformations leave eigenvalues (or eigenvectors) unchanged or easily recoverable, and for what types of matrices the eigenvalues (or eigenvectors) are easily determined.

Shift: A shift subtracts a constant scalar σ from each diagonal entry of a matrix, effectively shifting the origin:

Ax = λx  ⟹  (A - σI)x = (λ - σ)x. (4.48)

Thus the eigenvalues of the matrix A - σI are translated, or shifted, from those of A by σ, but the eigenvectors are unchanged.

Inversion: If A is nonsingular and Ax = λx with x ≠ 0, then λ ≠ 0 and

A^{-1} x = (1/λ) x. (4.49)

Thus the eigenvalues of A^{-1} are the reciprocals of the eigenvalues of A, and the eigenvectors are unchanged.

Powers: Raising a matrix to a power raises its eigenvalues to the same power, but keeps the eigenvectors unchanged:

Ax = λx  ⟹  A^2 x = λ^2 x  ⟹  ...  ⟹  A^k x = λ^k x. (4.50)

Polynomials: More generally, if

p(t) = c_0 + c_1 t + c_2 t^2 + ... + c_k t^k (4.51)

is a polynomial of degree k, then we define

p(A) = c_0 I + c_1 A + c_2 A^2 + ... + c_k A^k. (4.52)

Now if Ax = λx, then p(A)x = p(λ)x.

Similarity: We have already seen this in Eqs. 4.24-4.30.

3. Numerical Methods for Computing Eigenvalues and Eigenvectors

3.1. Power Iteration

A simple method for computing a single eigenvalue and corresponding eigenvector of an n × n matrix A is known as power iteration, which multiplies an arbitrary nonzero vector repeatedly by the matrix, in effect multiplying the initial starting vector by successively higher powers of the matrix.

Algorithm: Power iteration to compute an approximate eigenpair (λ, x) of A with ||x|| = 1 and λ the dominant eigenvalue:

x_0 = arbitrary nonzero vector with ||x_0|| = 1
for k = 1, 2, ...
    y = A x_{k-1}
    x_k = y / ||y||
    λ^(k) = x_k^T A x_k   # [λ^(k) is the eigenvalue estimate at the k-th iteration,
                          #  which is different from λ_k]
    # [keep generating the next vector until λ^(k) converges]
endfor

To see how this works, let us assume that A has eigenvalues λ_1, λ_2, ..., λ_n and corresponding eigenvectors v_1, v_2, ..., v_n whose lengths are normalized. Let λ_1 be the maximum eigenvalue in modulus and let v_1 be the corresponding eigenvector. Let us also assume that the initial nonzero vector x_0 can be written as a linear combination of the eigenvectors v_j of A,

x^(0) = Σ_{j=1}^n α_j v_j,  x_0 = x^(0) / ||x^(0)||. (4.53)

Then we have, using Eq. 4.50,

x^(k) = A x^(k-1) = A^2 x^(k-2) = ... = A^k x^(0)
      = A^k Σ_{j=1}^n α_j v_j = Σ_{j=1}^n α_j A^k v_j = Σ_{j=1}^n α_j λ_j^k v_j
      = λ_1^k [ α_1 v_1 + Σ_{j=2}^n α_j (λ_j / λ_1)^k v_j ]  →  λ_1^k α_1 v_1, (4.54)

as k → ∞, since |λ_j / λ_1| < 1 for j ≥ 2. Assuming α_1 ≠ 0, which is likely if x^(0) is chosen randomly, we see that

x_k = x^(k) / ||x^(k)||  →  λ_1^k α_1 v_1 / ||λ_1^k α_1 v_1|| = ±v_1. (4.55)

Therefore we get

λ^(k) = x_k^T A x_k  →  (±v_1)^T A (±v_1) = (±v_1)^T λ_1 (±v_1) = λ_1 ||v_1||^2 = λ_1. (4.56)
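A direct transcription of the algorithm into NumPy might look as follows (a sketch; the stopping test on λ^(k) and the iteration cap are implementation choices, not part of the notes):

```python
import numpy as np

def power_iteration(A, x0, maxiter=200, tol=1e-12):
    """Power iteration: returns (lam, x), lam approximating the dominant eigenvalue."""
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x                      # Rayleigh quotient lambda^(k)
    for _ in range(maxiter):
        y = A @ x
        x = y / np.linalg.norm(y)        # normalize to avoid overflow/underflow
        lam_new = x @ A @ x
        if abs(lam_new - lam) < tol:     # stop when lambda^(k) converges
            return lam_new, x
        lam = lam_new
    return lam, x

# The 2x2 example used below in the notes: eigenvalues 2 and 4.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
lam, x = power_iteration(A, np.array([0.0, 1.0]))
print(lam)   # close to the dominant eigenvalue 4
```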

Example: Consider the matrix

A = [ 3  1
      1  3 ] (4.57)

and let us compute the dominant (i.e., largest in magnitude) eigenvalue by power iteration. We start with an initial vector, say,

x_0 = [0  1]^T. (4.58)

We can evaluate the two eigenvalues of A analytically, which are λ_1 = 2 and λ_2 = 4.

Table 2. Power iteration for the dominant eigenvalue, starting from x_0 = (0, 1), with normalization in the l_2 norm. (The numerical entries of this table were lost in transcription.)

We see from Table 2 that the k-th eigenvalue estimate λ^(k) converges to the dominant eigenvalue λ_2 = 4.

Remark: Power iteration works well in practice, but it can fail for a number of reasons:

If the initial guess x_0 has no nonzero component in the dominant eigenvector v_1, i.e., α_1 = 0, then the iteration does not work.

If there are multiple eigenvalues whose moduli are the same and maximal, the iteration may converge to a vector that is a linear combination of the corresponding eigenvectors.

For a real matrix and real initial vector, the iteration cannot converge to a complex vector.

3.2. Inverse Iteration

In some cases the smallest eigenvalue of a matrix in modulus is required instead of the largest. This can easily be achieved using the inverse property given by Eq. 4.49: we apply power iteration to A^{-1}, whose action is computed numerically using the methods we learned in Chapter 2. This method is known as inverse iteration.

One can also incorporate the shift property in Eq. 4.48 with inverse iteration in order to achieve faster convergence by choosing σ close to (but not equal to) an eigenvalue λ. In this case, the eigenvalue of A - σI of smallest magnitude is simply λ - σ, where λ is the eigenvalue of A closest to σ. Thus, in general, with an appropriate choice of shift, inverse iteration can be used to compute any eigenvalue of A, not just the extreme ones. For this reason, inverse iteration is particularly useful for computing an eigenpair (λ, x) of A.

Algorithm: Inverse iteration to compute an approximate eigenpair (λ, x) of A with ||x|| = 1 and λ the smallest eigenvalue:

x_0 = arbitrary nonzero vector with ||x_0|| = 1
Choose a close approximation σ to λ
for k = 1, 2, ...
    solve (A - σI) y = x_{k-1} for y   # [solve the linear system using the methods in Chapter 2]
    x_k = y / ||y||
    λ^(k) = x_k^T A x_k   # [λ^(k) is the eigenvalue estimate at the k-th iteration,
                          #  which is different from λ_k]
    # [keep generating the next vector until λ^(k) converges]
endfor

Remark: Notice that when solving (A - σI) y = x_{k-1} for y, one needs to use numerical methods from Chapter 2 (e.g., LU decomposition, Cholesky, or Gauss-Jordan). If using, for instance, LU factorization, the factorization should be computed only once, before the iteration begins, since the matrix does not change during the iterations, and neither does its factorization.

Example: Consider again the matrix

A = [ 3  1
      1  3 ] (4.59)
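As a sketch (using `numpy.linalg.solve` in place of a pre-computed LU factorization; per the remark above, a real implementation would factor A - σI once and reuse it):

```python
import numpy as np

def inverse_iteration(A, sigma, x0, maxiter=200, tol=1e-12):
    """Shifted inverse iteration: converges to the eigenvalue of A closest to sigma."""
    n = A.shape[0]
    M = A - sigma * np.eye(n)            # in practice, LU-factor M once and reuse it
    x = x0 / np.linalg.norm(x0)
    lam = x @ A @ x
    for _ in range(maxiter):
        y = np.linalg.solve(M, x)        # one power-iteration step on (A - sigma I)^-1
        x = y / np.linalg.norm(y)
        lam_new = x @ A @ x              # Rayleigh quotient of the original A
        if abs(lam_new - lam) < tol:
            return lam_new, x
        lam = lam_new
    return lam, x

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
lam, x = inverse_iteration(A, sigma=1.9, x0=np.array([0.0, 1.0]))
print(lam)   # close to 2, the eigenvalue nearest the shift
```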

and let us find λ_1 = 2 this time using inverse iteration. We take the same initial vector

x_0 = [0  1]^T. (4.60)

Table 3. Inverse iteration for the smallest eigenvalue with two different values of the shift, σ = 1 and σ = 1.9, starting from x_0 = (0, 1), with normalization in the l_2 norm. (The numerical entries of this table were lost in transcription.)

As shown in Table 3, the k-th iterated eigenvalue λ^(k) converges to the eigenvalue λ_1 = 2 of the matrix A. It is also shown that the iteration with σ = 1.9, which is closer to λ_1 = 2, converges faster than the case with σ = 1, converging in 6 steps. What does this tell us? It suggests that we can use the Gershgorin theorem to estimate the locations of target eigenvalues and choose σ accordingly, in order to accelerate the search for all eigenvalues.

3.3. Rayleigh Quotient Iteration

If x is an approximate eigenvector for a real matrix A, then determining the best estimate for the corresponding eigenvalue λ can be viewed as an n × 1 linear least squares problem for λ:

xλ ≅ Ax. (4.61)

As we have already seen several times, we can rely on the normal equation

x^T x λ = x^T A x, (4.62)

hence the least squares solution is given by

λ = (x^T A x) / (x^T x). (4.63)

The quantity in Eq. 4.63 is called the Rayleigh quotient. We now realize that, without explicitly mentioning it, we have already used this quantity in both the power iteration and inverse iteration methods, since the eigenvalue estimate used there is nothing but the Rayleigh quotient,

λ^(k) = x_k^T A x_k = (x^(k))^T A x^(k) / ||x^(k)||^2, (4.64)

for any nonzero vector x^(k). Therefore, the Rayleigh quotient takes the exact same form as we introduced already in the power iteration and inverse iteration methods.

Remark: Our observation here suggests that the power and inverse iteration methods could have been designed without the normalization step in each iteration, i.e., working with the non-normalized vectors x^(k) rather than x_k = x^(k) / ||x^(k)||. Without the Rayleigh quotient, however, the eigenvalue estimates converge more slowly, because the Rayleigh quotient gives a better approximation to an eigenvalue by solving the least squares problem.

3.4. Deflation

Suppose that an eigenvalue λ_1 and corresponding eigenvector x_1 for an n × n matrix A have been found. We now consider how to find the next eigenvalue λ_2 of A by a process called deflation. One way to deflate is to use a Householder transformation H such that

H x_1 = α e_1. (4.65)

Since (λ_1, x_1) is an eigenpair we have already found, we have A x_1 = λ_1 x_1. Then

H A x_1 = λ_1 H x_1  ⟹  H A (H^{-1} H) x_1 = λ_1 H x_1
⟹  (H A H^{-1})(α e_1) = λ_1 α e_1  ⟹  (H A H^{-1}) e_1 = λ_1 e_1. (4.66)

This implies that H A H^{-1} has the form

H A H^{-1} = [ λ_1  b^T
               0    B   ], (4.67)

where b^T is an (n-1)-row vector, and B is an (n-1) × (n-1) matrix whose eigenvalues are λ_2, ..., λ_n. This can easily be shown by considering

0 = det(H A H^{-1} - λ I_n) = (λ_1 - λ) det(B - λ I_{n-1}). (4.68)

We then repeat the process to compute the next eigenpair (λ_2, x_2) from B, and so on.

3.5. QR Iteration

All the methods we have studied so far evaluate a single eigenpair (λ, x) of a given matrix A at a time. With these methods, the only way to compute all eigenpairs is to find them one after another. Is there a better way to obtain multiple eigenpairs at once? We now consider methods based on the QR factorization algorithm that provide a group of eigenpairs at once.

The basic QR algorithm for eigenvalue problems is very simple to describe. Given an n × n matrix A, we take the initial matrix to be A itself,

A^(0) = A. (4.69)

The sequence of matrices A^(k), k = 1, 2, ..., is determined by a two-step procedure: the first step is the QR factorization of A^(k-1), followed by the second step, the reverse product of the QR factors:

Q^(k) R^(k) = A^(k-1), (4.70)
A^(k) = R^(k) Q^(k). (4.71)

To see what this does, consider the first few steps. Computing the QR factorization of the initial matrix A^(0), we get

Q^(1) R^(1) = A^(0). (4.72)

Since the matrix Q^(1) is unitary, (Q^(1))^H Q^(1) = I = Q^(1) (Q^(1))^H. It follows that A^(1) is unitarily similar to A^(0):

A^(1) = (Q^(1))^H A^(0) Q^(1) = (Q^(1))^H Q^(1) R^(1) Q^(1) = R^(1) Q^(1), (4.73)

which is the relation in Eq. 4.71. In general, one sees that the k-th iterate A^(k) in Eq. 4.71 is unitarily similar to the previous iterate A^(k-1),

A^(k) = (Q^(k))^H A^(k-1) Q^(k), (4.74)

and hence, by induction, similar to A^(0) = A. Since unitarily similar matrices have the same eigenvalues, the QR iteration gives a nice way of computing eigenvalues if the eigenvalues of the k-th iterate A^(k) are easier to compute than those of the original matrix A.
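A small sketch of this deflation step, reusing the 2 × 2 matrix from the power iteration example with its known eigenpair λ_1 = 4, x_1 = (1, 1)^T / sqrt(2):

```python
import numpy as np

def deflate(A, x1):
    """Given a unit eigenvector x1 of A, build a Householder H with H x1 = e1,
    so H A H^-1 is block triangular; return the trailing (n-1)x(n-1) block B."""
    n = A.shape[0]
    v = x1 / np.linalg.norm(x1)
    e1 = np.zeros(n); e1[0] = 1.0
    w = v - e1                                # Householder vector mapping v to e1
    if np.linalg.norm(w) > 1e-14:
        w /= np.linalg.norm(w)
    H = np.eye(n) - 2.0 * np.outer(w, w)      # H is symmetric orthogonal: H^-1 = H
    T = H @ A @ H                             # T = H A H^-1, first column (lam_1, 0, ...)
    return T[1:, 1:]                          # B carries the remaining eigenvalues

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
x1 = np.array([1.0, 1.0]) / np.sqrt(2.0)      # eigenvector for lambda_1 = 4
B = deflate(A, x1)
print(B)   # 1x1 block containing the remaining eigenvalue 2
```

For an n × n matrix one would now repeat the process on B, as the text describes.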

Is this really true, so that we can expect the iteration to provide an easier way to compute multiple or all eigenvalues of A? It follows from Eq. 4.74 that the resulting A^(k) converges to the Schur normal form of A. Therefore, the eigenvalues of A are given by the diagonal entries (or diagonal blocks), and orthonormal eigenvectors can be obtained from the product of the unitary matrices Q^(k) generated by the algorithm. Note that if A is real symmetric (or complex Hermitian), then the symmetry is preserved by QR iteration. Thus, in this case QR iteration converges to a matrix that is both symmetric and triangular, i.e., diagonal.

Algorithm: QR iteration to compute all eigenvalues:

A^(0) = A
for k = 1, 2, ...
    Q^(k) R^(k) = A^(k-1)   # [Compute the QR factorization]
    A^(k) = R^(k) Q^(k)     # [Reverse multiplication of the QR factors]
endfor

Example: To illustrate QR iteration we apply it to a real symmetric 4 × 4 matrix (an easy case!) given in Eq. 4.75, which has eigenvalues λ_1 = 4, λ_2 = 3, λ_3 = 2, λ_4 = 1. Computing its QR factorization (e.g., using Householder or Gram-Schmidt) and then forming the reverse product, we obtain A^(1) (4.76). Most of the off-diagonal entries are now smaller in magnitude, and the diagonal entries are somewhat closer to the eigenvalues. Continuing for a couple more iterations, we obtain A^(2) (4.77) and A^(3) (4.78). (The numerical entries of the matrices in this example were lost in transcription.)
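The two-step procedure above is only a few lines of NumPy. The sketch below uses a small hypothetical symmetric matrix, since the entries of the 4 × 4 example matrix were lost in transcription:

```python
import numpy as np

def qr_iteration(A, maxiter=500, tol=1e-10):
    """Unshifted QR iteration; for a symmetric A the iterates tend to diagonal form."""
    Ak = A.copy()
    for _ in range(maxiter):
        Q, R = np.linalg.qr(Ak)              # step 1: QR-factor A^(k-1)
        Ak = R @ Q                           # step 2: reverse product, similar to A^(k-1)
        off = Ak - np.diag(np.diag(Ak))
        if np.linalg.norm(off) < tol:        # off-diagonal part has (nearly) vanished
            break
    return Ak

# Hypothetical symmetric test matrix (illustrative only).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Ak = qr_iteration(A)
print(np.sort(np.diag(Ak)))                  # approximate eigenvalues of A
```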

The off-diagonal entries are now fairly small, and the diagonal entries are quite close to the eigenvalues. Only a few more iterations would be required to compute the eigenvalues to full accuracy.

In general, the basic QR iteration exhibits slow convergence and a high cost per iteration. We now consider a modification that can boost its performance. As with any variant of power iteration, the convergence rate of QR iteration depends on the ratio of magnitudes of successive eigenvalues, and we have already seen that this ratio can be made more favorable by using a shift. In each QR iteration, a shift is subtracted off before the QR factorization and then added back to the reverse product, so that the resulting matrix is still similar to the initial matrix. The algorithm can be given as:

Algorithm: QR iteration with shifts to compute all eigenvalues:

A^(0) = A
for k = 1, 2, ...
    Choose shift σ_k
    Q^(k) R^(k) = A^(k-1) - σ_k I   # [Compute the QR factorization]
    A^(k) = R^(k) Q^(k) + σ_k I     # [Reverse multiplication of the QR factors]
endfor

We now need to know how to choose the shift σ_k at each iteration so as to approximate an eigenvalue, which provides increased computational efficiency. The idea is to observe that the entry a_nn^(k-1) of A^(k-1) is a very good candidate for such a shift, because it converges to λ_n, the smallest eigenvalue in modulus (see the previous example). Also note that if σ_k = a_nn^(k-1) were actually equal to λ_n, then A^(k-1) - σ_k I would be singular, and the entire last row of the resulting R^(k), as well as the last row of the reverse product R^(k) Q^(k), would be all zeros. Thus A^(k) = R^(k) Q^(k) + σ_k I would be block upper triangular, with its last row all zeros except for the eigenvalue in the last column.
This suggests that we can declare convergence of the iterations to an eigenvalue when the magnitudes of the off-diagonal entries of the last row of A^(k) are sufficiently small. At this point, due to the block triangular form, we can restrict attention to the leading submatrix of dimension n - 1. Continuing in this manner, eigenvalues of successively smaller matrices are deflated out until all the eigenvalues have been obtained.

Example: To illustrate the QR algorithm with shifts, we repeat the previous example with the shift σ_k = a_nn^(k-1) at each iteration.

Thus, with A^(0) as in Eq. 4.79, we take σ_1 = 1.732 as the shift for the first iteration. Computing the QR factorization of the resulting shifted matrix, Q^(1) R^(1) = A^(0) - σ_1 I, forming the reverse product R^(1) Q^(1), and then adding back the shift, we get A^(1) (4.80), which is noticeably closer to diagonal form and to the correct eigenvalues than after one iteration of the unshifted algorithm. Our next shift is then σ_2 = 1.253, which gives A^(2) (4.81). The next shift, σ_3 = 1.0009, is very close to an eigenvalue and gives A^(3) (4.82). (The numerical entries of the iterates in this example were lost in transcription.)

Notice that the final iterate A^(3) is very close to diagonal form. As expected for inverse iteration with a shift close to an eigenvalue, the smallest eigenvalue has been determined to full accuracy. The last row of A^(3) is (numerically) all zeros, so we can reduce the problem to the leading 3 × 3 submatrix for further iterations. Because the diagonal entries are already very close to the eigenvalues, only one or two additional iterations will be required to obtain full accuracy for the remaining eigenvalues.

Remark: The QR method for a general matrix (rather than the simpler cases of symmetric or Hessenberg matrices) as discussed here is impractical for two reasons. First, each step of the method requires a QR factorization that costs O(n^3) arithmetic operations, which is expensive. Second, the subdiagonal entries of A^(k) converge to zero only linearly. Thus, the method described here requires too many steps, and each step is too costly. To overcome these difficulties, one usually applies the following ingredients in making the QR algorithm more practical:
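A compact sketch combining the shifted iteration with the deflation just described (again with a small hypothetical symmetric matrix, since the entries of the 4 × 4 example were lost in transcription):

```python
import numpy as np

def qr_shifted(A, tol=1e-12, maxiter=1000):
    """QR iteration with corner shift sigma_k = a_nn^(k-1), plus deflation."""
    A = A.copy().astype(float)
    n = A.shape[0]
    eigs = []
    while n > 1:
        for _ in range(maxiter):
            sigma = A[n - 1, n - 1]                  # shift from the trailing entry
            Q, R = np.linalg.qr(A - sigma * np.eye(n))
            A = R @ Q + sigma * np.eye(n)            # add the shift back
            if np.linalg.norm(A[n - 1, :n - 1]) < tol:
                break                                # last row has (numerically) vanished
        eigs.append(A[n - 1, n - 1])                 # converged eigenvalue
        A = A[:n - 1, :n - 1]                        # deflate to leading submatrix
        n -= 1
    eigs.append(A[0, 0])
    return np.sort(np.array(eigs))

# Hypothetical symmetric test matrix (illustrative only).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
print(qr_shifted(A))
```

Note this naive corner shift works well for symmetric matrices like this one; production codes use more robust shift strategies (e.g., the Wilkinson shift) together with the reductions listed below.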

the use of a preliminary reduction to tridiagonal form (if the original matrix is symmetric) or to Hessenberg form (if the original matrix is general), in order to reduce the cost per iteration, together with the techniques we have already used:

the use of a deflation procedure whenever a subdiagonal entry effectively vanishes, again in order to reduce the cost per iteration, and

the use of a shift strategy in order to accelerate convergence.

4. Singular Value Decomposition

The object of this section is to develop yet another factorization of a matrix that provides valuable information about the matrix. This factorization is called the singular value decomposition, or SVD for short.

Recall that the QR factorization multiplies on one side by an orthogonal matrix to produce an upper triangular matrix. Motivated by this, our question is: what happens if we multiply on both sides by (possibly different) orthogonal matrices? The answer is a matrix that is both upper and lower triangular, that is, diagonal. We now state the theorem:

Theorem: (Singular Value Decomposition) Let A be an m × n matrix. Then there are an m × m matrix U and an n × n matrix V, both unitary, such that

U^H A V = Σ, (4.83)

where Σ is an m × n diagonal matrix,

Σ = diag(σ_1, σ_2, ..., σ_r, 0, ..., 0). (4.84)

The values σ_k, k = 1, ..., r, are called the singular values of A. They are all positive and can be arranged so that

σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0. (4.85)

We observe the following useful properties of the SVD without providing rigorous proofs:

rank(A) = r.

Let σ(A) be the set of all singular values of A. Then the singular values are the square roots of the nonzero eigenvalues of A^H A. Proof: This is easily shown by considering

A^H A = V Σ^H U^H U Σ V^H = V (Σ^H Σ) V^H, (4.86)

whereby the eigenvalues of A^H A are the eigenvalues of Σ^H Σ, which are σ_k^2, k = 1, ..., r (together with zeros).

The columns of U are the left singular vectors of A, and the columns of V are the right singular vectors of A.

The l_2-norm of A is given by its largest singular value,

||A||_2 = max_{x ≠ 0} ||Ax||_2 / ||x||_2 = σ_max = σ_1. (4.87)

The condition number of A is given by the ratio

cond(A) = σ_max / σ_min = σ_1 / σ_r. (4.88)
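These properties are easy to verify numerically with a random test matrix (a sketch using NumPy's built-in SVD; the matrix itself is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # random 5x3 matrix, full rank a.s.

U, s, Vh = np.linalg.svd(A)              # A = U @ Sigma @ Vh, s sorted descending

# Singular values are the square roots of the eigenvalues of A^H A.
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(s**2, evals)

# ||A||_2 = sigma_1 and cond(A) = sigma_1 / sigma_r.
print(np.linalg.norm(A, 2), s[0])
print(np.linalg.cond(A), s[0] / s[-1])
```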

Chapter 5

Initial Value Problems for Ordinary Differential Equations

We selectively follow Chapters 5, 6, 7 and 8 of the book Finite Difference Methods for Ordinary and Partial Differential Equations by Prof. Randy LeVeque, University of Washington.

1. Basic Concepts

The conventional temporal discretization t^n is given by

t^n = n Δt, n = 0, ..., M. (5.1)

Definition: Let u^n = u(t^n) be the pointwise values of the exact solution of the IVP

u'(t) = f(t, u(t)), u(0) = η. (5.2)

This is the analytical solution of the ODE and satisfies it without any form of numerical error.

Definition: Let U^n be the numerical approximation to the exact solution of the ODE at t = t^n. For instance, U^n in

(U^{n+1} - U^n) / Δt = f(U^n) (5.3)

represents the forward Euler approximation.

Definition: Let D^n be the exact solution of the associated difference equation (DE) of the IVP; e.g., D^n satisfies the forward Euler scheme

(D^{n+1} - D^n) / Δt = f(D^n) (5.4)

exactly, without producing any error in computer arithmetic. Since D^n is the exact solution of the DE, there is no round-off error involved in it.

When we study numerical solutions of ODEs and PDEs, the solutions are affected by numerical errors. They come mainly from two sources, and we are now ready to define them.

Definition: The discretization error E_d^n at t^n is defined by

E_d^n = u^n - D^n. (5.5)

Definition: The round-off error E_r^n at t^n is defined by

E_r^n = D^n - U^n. (5.6)

Definition: The global error E_g^n at t^n is defined by

E_g^n = u^n - U^n. (5.7)

Note that, by definition, E_g^n = E_d^n + E_r^n.

Definition: We say that the numerical method is convergent at t^n in a given norm if

lim_{Δt → 0} ||E_g^n|| = 0. (5.8)

Remark: The round-off error E_r^n consists of the numerical errors introduced after a repetitive number of computer arithmetic operations, in which the computer constantly rounds off numbers to some number of significant digits.

Definition: Let N be the (linear) numerical operator mapping the approximate solution at one time step to the approximate solution at the next time step. Then a general explicit numerical method can be written as

U^{n+1} = N(U^n). (5.9)

We define the one-step error E_step^n by

E_step^n = u^{n+1} - N(u^n), (5.10)

and the local truncation error E_LT^n by

E_LT^n = (1/Δt) E_step^n. (5.11)

We have already discussed the order of a method previously, and we can now define it again using the local truncation error.

Definition: We say that the numerical method is of order p (or p-th order accurate) if, for all sufficiently smooth data with compact support, the local truncation error satisfies

E_LT^n = O(Δt^p). (5.12)

Definition: We say the numerical method is consistent with a proposed DE if

lim_{Δt → 0} E_LT^n = 0 (5.13)

for all smooth functions f(t, u(t)) that satisfy the given ODE.

Definition: We say the linear numerical method defined by the linear operator N is stable if there is a constant C such that

||N^n|| ≤ C, for n Δt ≤ T, (5.14)

for each time T.

Definition: [Big-Oh] If f and g are two functions of Δt, then we say f is of the order of g as Δt → 0, denoted by

f(Δt) = O(g(Δt)) as Δt → 0, (5.15)
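The definitions above can be exercised on a simple model problem. The sketch below (the model problem and step counts are my own choices, not from the notes) applies forward Euler to u' = -u and checks that the global error at time T behaves like O(Δt), i.e., the method is first order:

```python
import numpy as np

def forward_euler(f, eta, T, M):
    """Forward Euler: U^{n+1} = U^n + dt * f(t^n, U^n) on [0, T] with M steps."""
    dt = T / M
    U = eta
    for n in range(M):
        U = U + dt * f(n * dt, U)
    return U

# Model problem u' = -u, u(0) = 1, exact solution u(T) = exp(-T).
f = lambda t, u: -u
T = 1.0
exact = np.exp(-T)

errors = []
for M in [20, 40, 80, 160]:
    errors.append(abs(forward_euler(f, 1.0, T, M) - exact))

# Halving dt should roughly halve the global error for a first-order method.
ratios = [errors[i] / errors[i + 1] for i in range(3)]
print(ratios)   # each ratio is close to 2
```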

if there is some constant C such that we can bound

|f(Δt)| ≤ C |g(Δt)| for all Δt sufficiently small. (5.16)

This implies that f(Δt) decays to zero at least as fast as the function g(Δt) does.

Definition: [little-oh] We also write

f(Δt) = o(g(Δt)) as Δt → 0, (5.17)

if

f(Δt) / g(Δt) → 0 as Δt → 0. (5.18)

This is slightly stronger than Big-Oh and means that f(Δt) decays to zero faster than g(Δt).

Chapter 5

The Initial Value Problem for Ordinary Differential Equations

In this chapter we begin a study of time-dependent differential equations, beginning with the initial value problem (IVP) for a time-dependent ordinary differential equation (ODE). Standard introductory texts are Ascher and Petzold [5], Lambert [59], [60], and Gear [33]. Henrici [45] gives the details on some theoretical issues, although stiff equations are not discussed. Butcher [2] and Hairer, Nørsett, and Wanner [43, 44] provide more recent surveys of the field.

The IVP takes the form

u'(t) = f(u(t), t) for t > t_0 (5.1)

with some initial data

u(t_0) = η. (5.2)

We will often assume t_0 = 0 for simplicity. In general, (5.1) may represent a system of ODEs, i.e., u may be a vector with s components u_1, ..., u_s, and then f(u, t) also represents a vector with components f_1(u, t), ..., f_s(u, t), each of which can be a nonlinear function of all the components of u.

We will consider only the first order equation (5.1), but in fact this is more general than it appears, since we can reduce higher order equations to a system of first order equations.

Example 5.1. Consider the IVP for the third order ODE

v'''(t) = v'(t) v(t) - 2t (v''(t))^2 for t > 0.

This third order equation requires three initial conditions, typically specified as

v(0) = η_1, v'(0) = η_2, v''(0) = η_3. (5.3)

We can rewrite this as a system of the form (5.1), (5.2) by introducing the variables

u_1(t) = v(t), u_2(t) = v'(t), u_3(t) = v''(t).

Then the equations take the form

u_1'(t) = u_2(t),
u_2'(t) = u_3(t),
u_3'(t) = u_1(t) u_2(t) - 2t u_3^2(t),

which defines the vector function f(u, t). The initial condition is simply (5.2), where the three components of η come from (5.3).

More generally, any single equation of order m can be reduced to m first order equations by defining u_j(t) = v^{(j-1)}(t), and an m-th order system of s equations can be reduced to a system of ms first order equations. See Section D.3 for an example of how this procedure can be used to determine the general solution of an r-th order linear differential equation.

It is also sometimes useful to note that any explicit dependence of f on t can be eliminated by introducing a new variable that is simply equal to t. In the above example we could define u_4(t) = t so that

u_4'(t) = 1 and u_4(t_0) = t_0.

The system then takes the form

u'(t) = f(u(t)) (5.4)

with

f(u) = [ u_2
         u_3
         u_1 u_2 - 2 u_4 u_3^2
         1 ]

and

u(t_0) = [ η_1
           η_2
           η_3
           t_0 ].

The equation (5.4) is said to be autonomous since it does not depend explicitly on time. It is often convenient to assume f is of this form since it simplifies notation.

5.1 Linear ordinary differential equations

The system of ODEs (5.1) is linear if

f(u, t) = A(t) u + g(t), (5.5)

where A(t) ∈ R^{s×s} and g(t) ∈ R^s. An important special case is the constant coefficient linear system

u'(t) = A u(t) + g(t), (5.6)
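The reduction in Example 5.1 can be written down directly in code. The sketch below (the initial data η = (1, 2, 3) and the forward-Euler time stepping are illustrative choices, not from the book) advances the first order system a short time:

```python
import numpy as np

def f(t, u):
    """First order system for v''' = v' v - 2 t (v'')^2, with u = (v, v', v'')."""
    u1, u2, u3 = u
    return np.array([u2, u3, u1 * u2 - 2.0 * t * u3**2])

# Hypothetical initial data eta = (eta_1, eta_2, eta_3) = (1, 2, 3).
eta = np.array([1.0, 2.0, 3.0])
t, dt, u = 0.0, 1e-3, eta.copy()
for _ in range(100):                 # advance to t = 0.1 with forward Euler
    u = u + dt * f(t, u)
    t += dt
print(u)   # approximation to (v, v', v'') at t = 0.1
```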

where A ∈ R^{s×s} is a constant matrix. If g(t) ≡ 0, then the equation is homogeneous. The solution to the homogeneous system u' = Au with data (5.2) is

u(t) = e^{A(t - t_0)} η, (5.7)

where the matrix exponential is defined as in Appendix D. In the scalar case we often use λ in place of A.

5.1.1 Duhamel's principle

If g(t) is not identically zero, then the solution to the constant coefficient system (5.6) can be written as

u(t) = e^{A(t - t_0)} η + ∫_{t_0}^t e^{A(t - τ)} g(τ) dτ. (5.8)

This is known as Duhamel's principle. The matrix e^{A(t - τ)} is the solution operator for the homogeneous problem; it maps data at time τ to the solution at time t when solving the homogeneous equation. Duhamel's principle states that the inhomogeneous term g(τ) at any instant τ has an effect on the solution at time t given by e^{A(t - τ)} g(τ). Note that this is very similar to the idea of a Green's function for the boundary value problem (BVP).

As a special case, if A = 0, then the ODE is simply

u'(t) = g(t) (5.9)

and of course the solution (5.8) reduces to the integral of g:

u(t) = η + ∫_{t_0}^t g(τ) dτ. (5.10)

As another special case, suppose A is constant and so is g(t) ≡ g ∈ R^s. Then (5.8) reduces to

u(t) = e^{A(t - t_0)} η + ( ∫_{t_0}^t e^{A(t - τ)} dτ ) g. (5.11)

This integral can be computed, e.g., by expressing e^{A(t - τ)} as a Taylor series as in (D.3) and then integrating term by term. This gives

∫_{t_0}^t e^{A(t - τ)} dτ = A^{-1} ( e^{A(t - t_0)} - I ) (5.12)

and so

u(t) = e^{A(t - t_0)} η + A^{-1} ( e^{A(t - t_0)} - I ) g. (5.13)

This may be familiar in the scalar case and holds also for constant coefficient systems (provided A is nonsingular). This form of the solution is used explicitly in exponential time differencing methods; see Section 11.6.
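In the scalar case, (5.13) reads u(t) = e^{λ(t - t_0)} η + λ^{-1}(e^{λ(t - t_0)} - 1) g, which can be sanity-checked against a brute-force numerical integration (a sketch; the particular λ, g, η are arbitrary choices, not from the book):

```python
import numpy as np

# Scalar constant-coefficient case of (5.13): u' = lam*u + g, u(0) = eta.
lam, g, eta, T = -2.0, 1.5, 1.0, 1.0
exact = np.exp(lam * T) * eta + (np.exp(lam * T) - 1.0) / lam * g

# Compare with a fine forward-Euler integration of the same ODE.
M = 50000
dt = T / M
u = eta
for _ in range(M):
    u = u + dt * (lam * u + g)
print(u, exact)   # the two values agree to roughly O(dt)
```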

5.2 Lipschitz continuity

In the last section we considered linear ODEs, for which there is always a unique solution. In most applications, however, we are concerned with nonlinear problems for which there is usually no explicit formula for the solution. The standard theory for the existence of a solution to the initial value problem

u'(t) = f(u,t), u(t_0) = η (5.14)

is discussed in many texts, e.g., [5]. To guarantee that there is a unique solution it is necessary to require a certain amount of smoothness in the function f(u,t) of (5.14). We say that the function f(u,t) is Lipschitz continuous in u over some domain

D = {(u,t) : |u - η| ≤ a, t_0 ≤ t ≤ t_1}

if there exists some constant L ≥ 0 so that

|f(u,t) - f(u*,t)| ≤ L |u - u*| (5.15)

for all (u,t) and (u*,t) in D. This is slightly stronger than mere continuity, which only requires that |f(u,t) - f(u*,t)| → 0 as u → u*. Lipschitz continuity requires that |f(u,t) - f(u*,t)| = O(|u - u*|) as u → u*.

If f(u,t) is differentiable with respect to u in D and this derivative f_u = ∂f/∂u is bounded, then we can take

L = max_{(u,t)∈D} |f_u(u,t)|,

since

f(u,t) = f(u*,t) + f_u(v,t)(u - u*)

for some value v between u and u*.

Example 5.2. For the linear problem u'(t) = λu(t) + g(t), f'(u) ≡ λ and we can take L = |λ|. This problem of course has a unique solution for any initial data, given by (5.8) with A = λ.

In particular, if λ = 0 then L = 0. In this case f(u,t) = g(t) is independent of u.
The solution is then obtained by simply integrating the function g(t), as in (5.10).

5.2.1 Existence and uniqueness of solutions

The basic existence and uniqueness theorem states that if f is Lipschitz continuous over some region D, then there is a unique solution to the initial value problem (5.14) at least up to time T = min(t_1, t_0 + a/S), where

S = max_{(u,t)∈D} |f(u,t)|.

Note that S is the maximum modulus of the slope that the solution u(t) can attain in this time interval, so that up to time t_0 + a/S we know that u(t) remains in the domain D where (5.15) holds.

Example 5.3. Consider the initial value problem

u'(t) = (u(t))^2, u(0) = η > 0.

The function f(u) = u^2 is independent of t and is Lipschitz continuous in u over any finite interval |u - η| ≤ a with L = 2(η + a), and the maximum slope over this interval is S = (η + a)^2. The theorem guarantees that a unique solution exists at least up to time a/(η + a)^2. Since a is arbitrary, we can choose a to maximize this expression, which yields a = η, and so there is a solution at least up to time 1/(4η). In fact this problem can be solved analytically, and the unique solution is

u(t) = η / (1 - ηt).

Note that u(t) → ∞ as t → 1/η. There is no solution beyond time 1/η.

If the function f is not Lipschitz continuous in any neighborhood of some point, then the initial value problem may fail to have a unique solution over any time interval if this initial value is imposed.

Example 5.4. Consider the initial value problem

u'(t) = √(u(t))

with initial condition

u(0) = 0.

The function f(u) = √u is not Lipschitz continuous near u = 0 since f'(u) = 1/(2√u) → ∞ as u → 0. We cannot find a constant L so that the bound (5.15) holds for all u and u* near 0. As a result, this initial value problem does not have a unique solution. In fact it has two distinct solutions:

u(t) ≡ 0 and u(t) = (1/4) t^2.

5.2.2 Systems of equations

For systems of s > 1 ordinary differential equations, u(t) ∈ R^s and f(u,t) is a function mapping R^s × R → R^s. We say the function f is Lipschitz continuous in u in some norm ‖·‖ if there is a constant L such that

‖f(u,t) - f(u*,t)‖ ≤ L ‖u - u*‖ (5.16)

for all (u,t) and (u*,t) in some domain D = {(u,t) : ‖u - η‖ ≤ a, t_0 ≤ t ≤ t_1}. By the equivalence of finite-dimensional norms (Appendix A), if f is Lipschitz continuous in one norm then it is Lipschitz continuous in any other norm, although the Lipschitz constant may depend on the norm chosen. The theorems on existence and uniqueness carry over to systems of equations.
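Example 5.3 can be checked numerically: well before the blow-up time 1/η, even the simple forward Euler method (introduced in Section 5.3) tracks the exact solution η/(1 - ηt). The code below is an added sketch (names mine):

```python
# Sketch: forward Euler on u' = u^2, u(0) = eta, whose exact solution
# u(t) = eta / (1 - eta*t) blows up at t = 1/eta.
def euler_u_squared(eta, k, nsteps):
    u = eta
    for _ in range(nsteps):
        u = u + k * u * u
    return u

eta = 1.0
k = 1e-4
t = 0.5                        # well before the blow-up time 1/eta = 1
approx = euler_u_squared(eta, k, int(round(t / k)))
exact = eta / (1.0 - eta * t)  # = 2 for eta = 1, t = 0.5
```

Trying to march past t = 1/η, by contrast, produces rapidly growing values with no exact solution to converge to, consistent with the theory above.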

Example 5.5. Consider the pendulum problem from Section 2.16,

θ''(t) = -sin(θ(t)),

which can be rewritten as a first order system of two equations by introducing v(t) = θ'(t):

u = (θ, v)^T,  d/dt (θ, v)^T = (v, -sin(θ))^T.

Consider the max-norm. We have

‖u - u*‖_∞ = max(|θ - θ*|, |v - v*|)

and

‖f(u) - f(u*)‖_∞ = max(|v - v*|, |sin(θ) - sin(θ*)|).

To bound ‖f(u) - f(u*)‖_∞, first note that |v - v*| ≤ ‖u - u*‖_∞. We also have

|sin(θ) - sin(θ*)| ≤ |θ - θ*| ≤ ‖u - u*‖_∞,

since the derivative of sin(θ) is bounded by 1. So we have Lipschitz continuity with L = 1:

‖f(u) - f(u*)‖_∞ ≤ ‖u - u*‖_∞.

5.2.3 Significance of the Lipschitz constant

The Lipschitz constant measures how much f(u,t) changes if we perturb u (at some fixed time t). Since f(u,t) = u'(t), the slope of the line tangent to the solution curve through the value u, this indicates how the slope of the solution curve will vary if we perturb u. The significance of this is best seen through some examples.

Example 5.6. Consider the trivial equation u'(t) = g(t), which has Lipschitz constant L = 0 and solutions given by (5.10). Several solution curves are sketched in Figure 5.1. Note that all these curves are "parallel"; they are simply shifted depending on the initial data. Tangent lines to the curves at any particular time are all parallel since f(u,t) = g(t) is independent of u.

Example 5.7. Consider u'(t) = λu(t) with λ constant and L = |λ|. Then u(t) = u(0) exp(λt). Two situations are shown in Figure 5.2 for negative and positive values of λ. Here the slope of the solution curve does vary depending on u. The variation in the slope with u (at fixed t) gives an indication of how rapidly the solution curves are converging toward one another (in the case λ < 0) or diverging away from one another (in the case λ > 0). If the magnitude of λ were increased, the convergence or divergence would clearly be more rapid.
The size of the Lipschitz constant is significant if we intend to solve the problem numerically, since our numerical approximation will almost certainly produce a value U^n at time t_n that is not exactly equal to the true value u(t_n). Hence we are on a different solution curve than the true solution. The best we can hope for in the future is that we stay close to the solution curve that we are now on. The size of the Lipschitz constant gives an indication of whether solution curves that start close together can be expected to stay close together or might diverge rapidly.

Figure 5.1. Solution curves for Example 5.6, where L = 0.

Figure 5.2. Solution curves for Example 5.7 with (a) λ = -3 and (b) λ = 3.

5.2.4 Limitations

Actually, the Lipschitz constant is not the perfect tool for this purpose, since it does not distinguish between rapid divergence and rapid convergence of solution curves. In both Figure 5.2(a) and Figure 5.2(b) the Lipschitz constant has the same value L = |λ| = 3. But we would expect that rapidly convergent solution curves as in Figure 5.2(a) should be easier to handle numerically than rapidly divergent ones. If we make an error at some stage, then the effect of this error should decay at later times rather than growing. To some extent this is true, and as a result error bounds based on the Lipschitz constant may be orders of magnitude too large in this situation.

However, rapidly converging solution curves can also give serious numerical difficulties, which one might not expect at first glance. This is discussed in detail in Chapter 8, which covers stiff equations.

One should also keep in mind that a small value of the Lipschitz constant does not necessarily mean that two solution curves starting close together will stay close together forever.

Example 5.8. Consider two solutions to the pendulum problem from Example 5.5, one with initial data

θ_1(0) = π - ε, v_1(0) = 0,

and the other with

θ_2(0) = π + ε, v_2(0) = 0.

The Lipschitz constant is 1 and the data differ by 2ε, which can be arbitrarily small, and yet the solutions eventually diverge dramatically: Solution 1 falls toward θ = 0, while in Solution 2 the pendulum falls the other way, toward θ = 2π.

In this case the IVP is very ill conditioned: small changes in the data can lead to order 1 changes in the solution. As always in numerical analysis, the solution of ill-conditioned problems can be very hard to compute accurately.

5.3 Some basic numerical methods

We begin by listing a few standard approaches to discretizing (5.1). Note that the IVP differs from the BVP considered before in that we are given all the data at the initial time t_0 = 0, and from this we should be able to march forward in time, computing approximations at successive times t_1, t_2, .... We will use k to denote the time step, so t_n = nk for n ≥ 0. It is convenient to use the symbol k, which is different from the spatial grid size h, since we will soon study PDEs which involve both spatial and temporal discretizations. Often the symbols Δt and Δx are used.

We are given initial data

U^0 = η (5.17)

and want to compute approximations U^1, U^2, ... satisfying

U^n ≈ u(t_n).

We will use superscripts to denote the time step index, again anticipating the notation of PDEs, where we will use subscripts for spatial indices.

The simplest method is Euler's method (also called forward Euler), based on replacing u'(t_n) with D_+ U^n = (U^{n+1} - U^n)/k from (1.1). This gives the method

(U^{n+1} - U^n)/k = f(U^n), n = 0, 1, .... (5.18)

Rather than viewing this as a system of simultaneous equations as we did for the BVP, it is possible to solve this explicitly for U^{n+1} in terms of U^n:

U^{n+1} = U^n + k f(U^n). (5.19)

From the initial data U^0 we can compute U^1, then U^2, and so on. This is called a time-marching method.
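The time-marching idea above can be sketched in a few lines. This added example (names mine) implements forward Euler for a general, possibly vector-valued f(u, t):

```python
import numpy as np

# Minimal sketch of forward Euler, U^{n+1} = U^n + k f(U^n, t_n),
# marching from t0 to T in nsteps equal steps of size k.
def forward_euler(f, eta, t0, T, nsteps):
    k = (T - t0) / nsteps
    u = np.atleast_1d(np.asarray(eta, dtype=float))
    t = t0
    for _ in range(nsteps):
        u = u + k * f(u, t)
        t += k
    return u

# usage: u' = -u, u(0) = 1; the exact value at T = 1 is e^{-1}
uT = forward_euler(lambda u, t: -u, 1.0, 0.0, 1.0, 1000)
```

Halving k roughly halves the error at a fixed time T, consistent with the first order accuracy discussed later in this chapter.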
The backward Euler method is similar but is based on replacing u'(t_{n+1}) with D_- U^{n+1}:

(U^{n+1} - U^n)/k = f(U^{n+1}) (5.20)

or

U^{n+1} = U^n + k f(U^{n+1}). (5.21)

Again we can march forward in time, since computing U^{n+1} requires only that we know the previous value U^n. In the backward Euler method, however, (5.21) is an equation that

must be solved for U^{n+1}, and in general f(u) is a nonlinear function. We can view this as looking for a zero of the function

g(u) = u - k f(u) - U^n,

which can be approximated using some iterative method such as Newton's method.

Because the backward Euler method gives an equation that must be solved for U^{n+1}, it is called an implicit method, whereas the forward Euler method (5.19) is an explicit method.

Another implicit method is the trapezoidal method, obtained by averaging the two Euler methods:

(U^{n+1} - U^n)/k = (1/2)(f(U^n) + f(U^{n+1})). (5.22)

As one might expect, this symmetric approximation is second order accurate, whereas the Euler methods are only first order accurate.

The above methods are all one-step methods, meaning that U^{n+1} is determined from U^n alone and previous values of U are not needed. One way to get higher order accuracy is to use a multistep method that involves other previous values. For example, using the approximation

(u(t + k) - u(t - k))/(2k) = u'(t) + (1/6) k^2 u'''(t) + O(k^4)

yields the midpoint method (also called the leapfrog method),

(U^{n+1} - U^{n-1})/(2k) = f(U^n) (5.23)

or

U^{n+1} = U^{n-1} + 2k f(U^n), (5.24)

which is a second order accurate explicit 2-step method. The approximation D_2 u from (1.11), rewritten in the form

(3u(t + k) - 4u(t) + u(t - k))/(2k) = u'(t + k) - (1/3) k^2 u'''(t + k) + ⋯,

yields a second order implicit 2-step method

(3U^{n+1} - 4U^n + U^{n-1})/(2k) = f(U^{n+1}). (5.25)

This is one of the backward differentiation formula (BDF) methods that will be discussed further in Chapter 8.

5.4 Truncation errors

The truncation error for these methods is defined in the same way as in Chapter 2. We write the difference equation in the form that directly models the derivatives (e.g., in the form
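The Newton iteration on g(u) = u - k f(u) - U^n mentioned above can be sketched for a scalar problem as follows (an added illustration, names mine; `fprime` denotes the user-supplied derivative df/du):

```python
# Sketch of one backward Euler step: solve g(u) = u - k f(u) - U^n = 0
# by Newton's method, with g'(u) = 1 - k f'(u).
def backward_euler_step(f, fprime, un, k, tol=1e-12, maxit=50):
    u = un + k * f(un)                  # forward Euler predictor as initial guess
    for _ in range(maxit):
        g = u - k * f(u) - un
        gp = 1.0 - k * fprime(u)
        du = g / gp
        u -= du                         # Newton update
        if abs(du) < tol:
            break
    return u

# usage: one step of u' = -10 u from U^0 = 1 with k = 0.1;
# for this linear f the implicit equation gives U^1 = 1/(1 + 10k) = 0.5
u1 = backward_euler_step(lambda u: -10.0 * u, lambda u: -10.0, 1.0, 0.1)
```

For a linear f the iteration converges in a single Newton step; for genuinely nonlinear f a few iterations per time step are typical.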

(5.23) rather than (5.24)) and then insert the true solution to the ODE into the difference equation. We then use Taylor series expansion and cancel out common terms.

Example 5.9. The local truncation error (LTE) of the midpoint method (5.23) is defined by

τ^n = (u(t_{n+1}) - u(t_{n-1}))/(2k) - f(u(t_n))
    = [u'(t_n) + (1/6) k^2 u'''(t_n) + O(k^4)] - u'(t_n)
    = (1/6) k^2 u'''(t_n) + O(k^4).

Note that since u(t) is the true solution of the ODE, u'(t_n) = f(u(t_n)). The O(k^3) term drops out by symmetry. The truncation error is O(k^2), and so we say the method is second order accurate, although it is not yet clear that the global error will have this behavior. As always, we need some form of stability to guarantee that the global error will exhibit the same rate of convergence as the local truncation error. This will be discussed below.

5.5 One-step errors

In much of the literature concerning numerical methods for ODEs, a slightly different definition of the local truncation error is used that is based on the form (5.24), for example, rather than (5.23). Denoting this value by L^n, we have

L^n = u(t_{n+1}) - u(t_{n-1}) - 2k f(u(t_n)) (5.26)
    = (1/3) k^3 u'''(t_n) + O(k^5).

Since L^n = 2k τ^n, this local error is O(k^3) rather than O(k^2), but of course the global error remains the same and will be O(k^2). Using this alternative definition, many standard results in ODE theory say that a pth order accurate method should have an LTE that is O(k^{p+1}). With the notation we are using, a pth order accurate method has an LTE that is O(k^p). The notation used here is consistent with the standard practice for PDEs and leads to a more coherent theory, but one should be aware of this possible source of confusion.
In this book L^n will be called the one-step error, since this can be viewed as the error that would be introduced in one time step if the past values U^n, U^{n-1}, ... were all taken to be the exact values from u(t). For example, in the midpoint method (5.24) we suppose that U^n = u(t_n) and U^{n-1} = u(t_{n-1}), and we now use these values to compute U^{n+1}, an approximation to u(t_{n+1}):

U^{n+1} = u(t_{n-1}) + 2k f(u(t_n)) = u(t_{n-1}) + 2k u'(t_n).

Then the error is

u(t_{n+1}) - U^{n+1} = u(t_{n+1}) - u(t_{n-1}) - 2k u'(t_n) = L^n.

From (5.26) we see that in one step the error introduced is O(k^3). This is consistent with second order accuracy in the global error if we think of trying to compute an approximation to the true solution u(T) at some fixed time T > 0. To compute from time t = 0 up to time T, we need to take T/k time steps of length k. A rough estimate of the error at time T might be obtained by assuming that a new error of size L^n is introduced in the nth time step and is then simply carried along in later time steps without affecting the size of future local errors and without growing or diminishing itself. Then we would expect the resulting global error at time T to be simply the sum of all these local errors. Since each local error is O(k^3) and we are adding up T/k of them, we end up with a global error that is O(k^2).

This viewpoint is in fact exactly right for the simplest ODE (5.9), in which f(u,t) = g(t) is independent of u and the solution is simply the integral of g, but it is a bit too simplistic for more interesting equations, since the error at each time feeds back into the computation at the next step in the case where f(u,t) depends on u. Nonetheless, it is essentially right in terms of the expected order of accuracy, provided the method is stable. In fact, it is useful to think of stability as exactly what is needed to make this naive analysis correct, by ensuring that the old errors from previous time steps do not grow too rapidly in future time steps. This will be investigated in detail in the following chapters.

5.6 Taylor series methods

The forward Euler method (5.19) can be derived using a Taylor series expansion of u(t_{n+1}) about u(t_n):

u(t_{n+1}) = u(t_n) + k u'(t_n) + (1/2) k^2 u''(t_n) + ⋯. (5.27)

If we drop all terms of order k^2 and higher and use the differential equation to replace u'(t_n) with f(u(t_n), t_n), we obtain

u(t_{n+1}) ≈ u(t_n) + k f(u(t_n), t_n).

This suggests the method (5.19).
The one-step error is O(k^2) since we dropped terms of this order. A Taylor series method of higher accuracy can be derived by keeping more terms in the Taylor series. If we keep the first p + 1 terms of the Taylor series expansion,

u(t_{n+1}) ≈ u(t_n) + k u'(t_n) + (1/2) k^2 u''(t_n) + ⋯ + (1/p!) k^p u^{(p)}(t_n),

we obtain a pth order accurate method. The problem is that we are given only

u'(t) = f(u(t), t),

and we must compute the higher derivatives by repeated differentiation of this function. For example, we can compute

u''(t) = f_u(u(t), t) u'(t) + f_t(u(t), t)
       = f_u(u(t), t) f(u(t), t) + f_t(u(t), t). (5.28)

This can result in very messy expressions that must be worked out for each equation, and as a result this approach is not often used in practice. However, it is such an obvious approach that it is worth mentioning, and in some cases it may be useful. An example should suffice to illustrate the technique and its limitations.

Example 5.10. Suppose we want to solve the equation

u'(t) = t^2 sin(u(t)). (5.29)

Then we can compute

u''(t) = 2t sin(u(t)) + t^2 cos(u(t)) u'(t)
       = 2t sin(u(t)) + t^4 cos(u(t)) sin(u(t)).

A second order method is given by

U^{n+1} = U^n + k t_n^2 sin(U^n) + (1/2) k^2 [2 t_n sin(U^n) + t_n^4 cos(U^n) sin(U^n)].

Clearly higher order derivatives can be computed and used, but this is cumbersome even for this simple example. For systems of equations the method becomes still more complicated. This Taylor series approach does get used in some situations, however, for example in the derivation of the Lax-Wendroff method for hyperbolic PDEs; see Section 10.3.

5.7 Runge-Kutta methods

Most methods used in practice do not require that the user explicitly calculate higher order derivatives. Instead a higher order finite difference approximation is designed that typically models these terms automatically. A multistep method of the sort we will study in Section 5.9 can achieve high accuracy by using high order polynomial interpolation through several previous values of the solution and/or its derivative. To achieve the same effect with a one-step method it is typically necessary to use a multistage method, where intermediate values of the solution and its derivative are generated and used within a single time step.

Example 5.11. A two-stage explicit Runge-Kutta method is given by

U* = U^n + (1/2) k f(U^n),
U^{n+1} = U^n + k f(U*). (5.30)

In the first stage an intermediate value is generated that approximates u(t_{n+1/2}) via Euler's method.
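The second order Taylor method of Example 5.10 is easy to code once u'' has been worked out by hand; the sketch below (an added illustration, names mine) marches it forward in time:

```python
import math

# Sketch of the second order Taylor method for u'(t) = t^2 sin(u(t)),
# using the hand-computed u'' from (5.28):
#   u'' = 2 t sin(u) + t^4 cos(u) sin(u)
def taylor2_step(u, t, k):
    up = t**2 * math.sin(u)                                         # u'
    upp = 2.0 * t * math.sin(u) + t**4 * math.cos(u) * math.sin(u)  # u''
    return u + k * up + 0.5 * k**2 * upp

# usage: march from u(0) = 1 to t = 1
u, t, k = 1.0, 0.0, 0.001
for _ in range(1000):
    u = taylor2_step(u, t, k)
    t += k
```

Note that the derivative expressions are specific to this one equation; changing the ODE means redoing the differentiation, which is exactly the limitation the text describes.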
In the second stage the function f is evaluated at this midpoint to estimate the slope over the full time step. Since this now looks like a centered approximation to the derivative, we might hope for second order accuracy, as we'll now verify by computing the LTE. Combining the two stages above, we can rewrite the method as

U^{n+1} = U^n + k f(U^n + (1/2) k f(U^n)).

Viewed this way, this is clearly a one-step explicit method. The truncation error is

τ^n = (1/k)(u(t_{n+1}) - u(t_n)) - f(u(t_n) + (1/2) k f(u(t_n))). (5.31)

Note that

f(u(t_n) + (1/2) k f(u(t_n))) = f(u(t_n) + (1/2) k u'(t_n))
  = f(u(t_n)) + (1/2) k u'(t_n) f'(u(t_n)) + (1/8) k^2 (u'(t_n))^2 f''(u(t_n)) + ⋯.

Since f(u(t_n)) = u'(t_n) and differentiating gives f'(u) u' = u'', we obtain

f(u(t_n) + (1/2) k f(u(t_n))) = u'(t_n) + (1/2) k u''(t_n) + O(k^2).

Using this in (5.31) gives

τ^n = (1/k)[k u'(t_n) + (1/2) k^2 u''(t_n) + O(k^3)] - [u'(t_n) + (1/2) k u''(t_n) + O(k^2)]
    = O(k^2),

and the method is second order accurate. (Check the O(k^2) term to see that it does not vanish.)

Remark: Another way to determine the order of accuracy of this simple method is to apply it to the special test equation u' = λu, which has solution u(t_{n+1}) = e^{λk} u(t_n), and determine the error on this problem. Here we obtain

U^{n+1} = U^n + λk (U^n + (1/2) λk U^n)
        = U^n + (λk) U^n + (1/2)(λk)^2 U^n
        = e^{λk} U^n + O(k^3).

The one-step error is O(k^3), and hence the LTE is O(k^2). Of course we have checked only that the LTE is O(k^2) on one particular function u(t) = e^{λt}, not on all smooth solutions, and for general Runge-Kutta methods for nonautonomous problems this approach gives only an upper bound on the method's order of accuracy. Applying a method to this special equation is also a fundamental tool in stability analysis; see Chapter 7.

Example 5.12. The Runge-Kutta method (5.30) can be extended to nonautonomous equations of the form u'(t) = f(u(t), t):

U* = U^n + (1/2) k f(U^n, t_n),
U^{n+1} = U^n + k f(U*, t_n + k/2). (5.32)
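The O(k^3) one-step error on the test equation u' = λu can be observed directly: halving k should reduce the one-step error by a factor of about 2^3 = 8. This added sketch (names mine) checks that for the two-stage method:

```python
import math

# Sketch of the two-stage method: U* = U^n + (k/2) f(U^n), U^{n+1} = U^n + k f(U*).
def rk2_step(f, u, k):
    ustar = u + 0.5 * k * f(u)
    return u + k * f(ustar)

# one-step error on u' = lam*u, starting from u = 1, should shrink like k^3
lam = -1.0
f = lambda u: lam * u
errs = [abs(rk2_step(f, 1.0, k) - math.exp(lam * k)) for k in (0.1, 0.05)]
ratio = errs[0] / errs[1]   # expect roughly 8 for a third order one-step error
```

The observed ratio is close to 8 but not exact, since the O(k^4) and higher terms have not yet become negligible at these step sizes.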

This is again second order accurate, as can be verified by expanding as above, but it is slightly more complicated since Taylor series in two variables must be used.

Example 5.13. One simple higher order Runge-Kutta method is the fourth order, four-stage method given by

Y_1 = U^n,
Y_2 = U^n + (1/2) k f(Y_1, t_n),
Y_3 = U^n + (1/2) k f(Y_2, t_n + k/2),
Y_4 = U^n + k f(Y_3, t_n + k/2),
U^{n+1} = U^n + (k/6) [f(Y_1, t_n) + 2 f(Y_2, t_n + k/2) + 2 f(Y_3, t_n + k/2) + f(Y_4, t_n + k)]. (5.33)

Note that if f(u,t) = f(t) does not depend on u, then this reduces to Simpson's rule for the integral. This method was particularly popular in the precomputer era, when computations were done by hand, because the coefficients are so simple. Today there is no need to keep the coefficients simple, and other Runge-Kutta methods have advantages.

A general r-stage Runge-Kutta method has the form

Y_1 = U^n + k Σ_{j=1}^r a_{1j} f(Y_j, t_n + c_j k),
Y_2 = U^n + k Σ_{j=1}^r a_{2j} f(Y_j, t_n + c_j k),
  ⋮
Y_r = U^n + k Σ_{j=1}^r a_{rj} f(Y_j, t_n + c_j k),
U^{n+1} = U^n + k Σ_{j=1}^r b_j f(Y_j, t_n + c_j k). (5.34)

Consistency requires

Σ_{j=1}^r a_{ij} = c_i, i = 1, 2, ..., r,
Σ_{j=1}^r b_j = 1. (5.35)
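The four-stage method (5.33) and its consistency conditions (5.35) can both be checked in code. This is an added sketch (names mine): a step function for the classical method, plus a verification that each row sum of the a_ij coefficients equals c_i and the b_j sum to 1:

```python
import numpy as np

# Sketch of the classical fourth order, four-stage method (5.33).
def rk4_step(f, u, t, k):
    f1 = f(u, t)
    f2 = f(u + 0.5 * k * f1, t + 0.5 * k)
    f3 = f(u + 0.5 * k * f2, t + 0.5 * k)
    f4 = f(u + k * f3, t + k)
    return u + (k / 6.0) * (f1 + 2.0 * f2 + 2.0 * f3 + f4)

# Butcher coefficients of (5.33), checked against the consistency conditions (5.35)
A = np.array([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1.0, 0]])
b = np.array([1 / 6, 1 / 3, 1 / 3, 1 / 6])
c = np.array([0.0, 0.5, 0.5, 1.0])
consistent = np.allclose(A.sum(axis=1), c) and np.isclose(b.sum(), 1.0)

# usage: 10 steps on u' = u from u(0) = 1 should reproduce e very accurately
u, t, k = 1.0, 0.0, 0.1
for _ in range(10):
    u = rk4_step(lambda v, s: v, u, t, k)
    t += k
```

With k = 0.1 the computed value of e is already accurate to a few parts in 10^6, reflecting the O(k^4) global accuracy.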

If these conditions are satisfied, then the method will be at least first order accurate.

The coefficients for a Runge-Kutta method are often displayed in a so-called Butcher tableau:

c_1 | a_11 ... a_1r
 ⋮  |  ⋮        ⋮
c_r | a_r1 ... a_rr
----+--------------
    | b_1  ... b_r        (5.36)

For example, the fourth order Runge-Kutta method given in (5.33) has the following tableau (entries not shown are all 0):

 0  |
1/2 | 1/2
1/2 |  0   1/2
 1  |  0    0    1
----+-------------------
    | 1/6  1/3  1/3  1/6

An important class of Runge-Kutta methods consists of the explicit methods, for which a_ij = 0 for j ≥ i. For an explicit method, the elements on and above the diagonal in the a_ij portion of the Butcher tableau are all equal to zero, as, for example, with the fourth order method displayed above. With an explicit method, each of the Y_i values is computed using only the previously computed Y_j.

Fully implicit Runge-Kutta methods, in which each Y_i depends on all the Y_j, can be expensive to implement on systems of ODEs. For a system of s equations (where each Y_i is in R^s), a system of sr equations must be solved to compute the r vectors Y_i simultaneously.

One subclass of implicit methods that are simpler to implement are the diagonally implicit Runge-Kutta methods (DIRK methods), in which Y_i depends on Y_j for j ≤ i, i.e., a_ij = 0 for j > i. For a system of s equations, DIRK methods require solving a sequence of r implicit systems, each of size s, rather than a coupled set of sr equations as would be required in a fully implicit Runge-Kutta method. DIRK methods are so named because their tableau has zero values above the diagonal but possibly nonzero diagonal elements.

Example 5.14. A second order accurate DIRK method is given by

Y_1 = U^n,
Y_2 = U^n + (k/4) [f(Y_1, t_n) + f(Y_2, t_n + k/2)],
Y_3 = U^n + (k/3) [f(Y_1, t_n) + f(Y_2, t_n + k/2) + f(Y_3, t_n + k)],
U^{n+1} = Y_3 = U^n + (k/3) [f(Y_1, t_n) + f(Y_2, t_n + k/2) + f(Y_3, t_n + k)]. (5.37)

This method is known as the TR-BDF2 method and is derived in a different form in Section 8.5.
Its tableau is

 0  |
1/2 | 1/4  1/4
 1  | 1/3  1/3  1/3
----+---------------
    | 1/3  1/3  1/3

In addition to the conditions (5.35), a Runge-Kutta method is second order accurate if

Σ_{j=1}^r b_j c_j = 1/2, (5.38)

as is satisfied for the method (5.37). Third order accuracy requires two additional conditions:

Σ_{j=1}^r b_j c_j^2 = 1/3,
Σ_{i=1}^r Σ_{j=1}^r b_i a_{ij} c_j = 1/6. (5.39)

Fourth order accuracy requires an additional four conditions on the coefficients, and higher order methods require an exponentially growing number of conditions.

An r-stage explicit Runge-Kutta method can have order at most r, although for r ≥ 5 the order is strictly less than the number of stages. Among implicit Runge-Kutta methods, r-stage methods of order 2r exist. There typically are many ways that the coefficients a_ij and b_j can be chosen to achieve a given accuracy, provided the number of stages is sufficiently large. Many different classes of Runge-Kutta methods have been developed over the years with various advantages. The order conditions are quite complicated for higher order methods, and an extensive theory has been developed by Butcher for analyzing these methods and their stability properties. For more discussion and details see, for example, [3], [43], [44].

Using more stages to increase the order of a method is an obvious strategy. For some problems, however, we will also see that it can be advantageous to use a large number of stages to increase the stability of the method while keeping the order of accuracy relatively low. This is the idea behind the Runge-Kutta-Chebyshev methods, for example, discussed in Section 11.5.

5.7.1 Embedded methods and error estimation

Most practical software for solving ODEs does not use a fixed time step but rather adjusts the time step during the integration process to try to achieve some specified error bound. One common way to estimate the error in the computation is to compute using two different methods and compare the results.
Knowing something about the error behavior of each method often allows the possibility of estimating the error in at least one of the two results. A simple way to do this for ODEs is to take a time step with two different methods, one of order p and one of a different order, say, p + 1. Assuming that the time step is small enough that the higher order method is really generating a better approximation, then

the difference between the two results will be an estimate of the one-step error in the lower order method. This can be used as the basis for choosing an appropriate time step for the lower order approximation. Often the time step is chosen in this manner, but then the higher order solution is used as the actual approximation at this time and as the starting point for the next time step. This is sometimes called local extrapolation. Once this is done there is no estimate of the error, but presumably it is even smaller than the error in the lower order method, and so the approximation generated will be even better than the required tolerance. For more about strategies for time step selection, see, for example, [5], [43], [78].

Note, however, that the procedure of using two different methods in every time step could easily double the cost of the computation unless we choose the methods carefully. Since the main cost in a Runge-Kutta method is often in evaluating the function f(u,t), it makes sense to reuse function values as much as possible and look for methods that provide two approximations to U^{n+1} of different order based on the same set of function evaluations, by simply taking different linear combinations of the f(Y_j, t_n + c_j k) values in the final stage of the Runge-Kutta method (5.34). So in addition to the value U^{n+1} given there, we would also like to compute a value

Û^{n+1} = U^n + k Σ_{j=1}^r b̂_j f(Y_j, t_n + c_j k) (5.40)

that gives an approximation of a different order that can be used for error estimation.
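The idea can be sketched with the simplest possible pairing: a second order two-stage method whose first function evaluation also yields, for free, a first order (Euler) result. The difference between the two results estimates the one-step error of the lower order method. This is an added illustration (names mine), not a production step-size controller:

```python
import math

# Sketch of an embedded pair: one shared f evaluation produces both a second
# order result (used to advance) and a first order Euler result (used only
# to estimate the one-step error of the lower order method).
def embedded_step(f, u, t, k):
    f1 = f(u, t)                            # shared evaluation
    y2 = u + 0.5 * k * f1
    u_high = u + k * f(y2, t + 0.5 * k)     # second order result
    u_low = u + k * f1                      # first order result, reuses f1
    err_est = abs(u_high - u_low)           # ~ one-step error of Euler
    return u_high, err_est

# usage on u' = u: the estimate should track Euler's actual one-step error
u_high, err = embedded_step(lambda u, t: u, 1.0, 0.0, 0.1)
```

In real adaptive codes, `err_est` would be compared against a tolerance to accept or reject the step and to choose the next k; here we only form the estimate itself. For u' = u with k = 0.1 the estimate 0.005 is within a few percent of Euler's true one-step error e^{0.1} - 1.1.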
These are called embedded Runge-Kutta methods and are often displayed in a tableau of the form

c_1 | a_11 ... a_1r
 ⋮  |  ⋮        ⋮
c_r | a_r1 ... a_rr
----+--------------
    | b_1  ... b_r
    | b̂_1  ... b̂_r       (5.41)

As a very simple example, the second order Runge-Kutta method (5.32) could be combined with the first order Euler method:

Y_1 = U^n,
Y_2 = U^n + (1/2) k f(Y_1, t_n),
U^{n+1} = U^n + k f(Y_2, t_n + k/2), (5.42)
Û^{n+1} = U^n + k f(Y_1, t_n).

Note that the computation of Û^{n+1} reuses the value f(Y_1, t_n) obtained in computing Y_2 and is essentially free. Also note that

Û^{n+1} - U^{n+1} = k [f(Y_1, t_n) - f(Y_2, t_n + k/2)]
                  ≈ k [u'(t_n) - u'(t_{n+1/2})]
                  ≈ -(1/2) k^2 u''(t_n), (5.43)

which is approximately the one-step error for Euler's method.

Most software based on Runge-Kutta methods uses embedded methods of higher order. For example, the ode45 routine in MATLAB uses a pair of embedded Runge-Kutta methods of order 4 and 5 due to Dormand and Prince [25]. See Shampine and Reichelt [78] for implementation details (or type ode45 in MATLAB).

5.8 One-step versus multistep methods

Taylor series and Runge-Kutta methods are one-step methods; the approximation U^{n+1} depends on U^n but not on previous values U^{n-1}, U^{n-2}, .... In the next section we will consider a class of multistep methods where previous values are also used (one example is the midpoint method (5.24)).

One-step methods have several advantages over multistep methods:

- The methods are self-starting: from the initial data U^0 the desired method can be applied immediately. Multistep methods require that some other method be used initially, as discussed in Section 5.9.3.

- The time step k can be changed at any point, based on an error estimate, for example. The time step can also be changed with a multistep method, but more care is required since the previous values are assumed to be equally spaced in the standard form of these methods given below.

- If the solution u(t) is not smooth at some isolated point t* (for example, because f(u,t) is discontinuous at t*), then with a one-step method it is often possible to get full accuracy simply by ensuring that t* is a grid point. With a multistep method that uses data from both sides of t* in approximating derivatives, a loss of accuracy may occur.

On the other hand, one-step methods have some disadvantages. The disadvantage of Taylor series methods is that they require differentiating the given equation and are cumbersome and often expensive to implement.
Runge-Kutta methods use only evaluations of the function f, but a higher order multistage method requires evaluating f several times each time step. For simple equations this may not be a problem, but if function values are expensive to compute, then high order Runge-Kutta methods may be quite expensive as well. This is particularly true for implicit methods, where an implicit nonlinear system must be solved in each stage.

An alternative is to use a multistep method in which values of f already computed in previous time steps are reused to obtain higher order accuracy. Typically only one new f evaluation is required in each time step. The popular class of linear multistep methods is discussed in the next section.

5.9 Linear multistep methods

All the methods introduced in Section 5.3 are members of a class of methods called linear multistep methods (LMMs). In general, an r-step LMM has the form

Σ_{j=0}^r α_j U^{n+j} = k Σ_{j=0}^r β_j f(U^{n+j}, t_{n+j}). (5.44)

The value U^{n+r} is computed from this equation in terms of the previous values U^{n+r-1}, U^{n+r-2}, ..., U^n and f values at these points (which can be stored and reused if f is expensive to evaluate).

If β_r = 0, then the method (5.44) is explicit; otherwise it is implicit. Note that we can multiply both sides by any constant and have essentially the same method, although the coefficients α_j and β_j would change. The normalization α_r = 1 is often assumed to fix this scale factor.

There are special classes of methods of this form that are particularly useful and have distinctive names. These will be written out for the autonomous case where f(u,t) = f(u) to simplify the formulas, but each can be used more generally by replacing f(U^{n+j}) with f(U^{n+j}, t_{n+j}) in any of the formulas.

Example 5.15. The Adams methods have the form

U^{n+r} = U^{n+r-1} + k Σ_{j=0}^r β_j f(U^{n+j}). (5.45)

These methods all have α_r = 1, α_{r-1} = -1, and α_j = 0 for j < r - 1. The β_j coefficients are chosen to maximize the order of accuracy. If we require β_r = 0, so the method is explicit, then the r coefficients β_0, β_1, ..., β_{r-1} can be chosen so that the method has order r. This can be done by using Taylor series expansion of the local truncation error and then choosing the β_j to eliminate as many terms as possible. This gives the explicit Adams-Bashforth methods.
Another way to derive the Adams–Bashforth methods is by writing

    u(t_{n+r}) = u(t_{n+r-1}) + \int_{t_{n+r-1}}^{t_{n+r}} u'(t) \, dt
               = u(t_{n+r-1}) + \int_{t_{n+r-1}}^{t_{n+r}} f(u(t)) \, dt    (5.46)

and then applying a quadrature rule to approximate this integral,

    \int_{t_{n+r-1}}^{t_{n+r}} f(u(t)) \, dt \approx k \sum_{j=0}^{r-1} \beta_j f(u(t_{n+j})).    (5.47)

This quadrature rule can be derived by interpolating f(u(t)) by a polynomial p(t) of degree r-1 at the points t_n, t_{n+1}, ..., t_{n+r-1} and then integrating the interpolating polynomial. Either approach gives the same r-step explicit method. The first few are given below.
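The quadrature derivation above is easy to carry out symbolically. The following sketch (my own code, not from the text; the helper names are mine) computes the Adams–Bashforth weights \beta_j by forming the Lagrange basis polynomials on the nodes s = 0, 1, ..., r-1 (time measured in units of k) and integrating each one over [r-1, r], using exact rational arithmetic:

```python
from fractions import Fraction as F

def ab_weights(r):
    """beta_j for the r-step Adams-Bashforth method: integrate the Lagrange
    basis polynomial through the nodes s = 0, 1, ..., r-1 over [r-1, r]."""
    def lagrange_coeffs(j):
        # coefficients (lowest degree first) of ell_j(s) = prod (s - m)/(j - m)
        poly = [F(1)]
        for m in range(r):
            if m == j:
                continue
            denom = F(j - m)
            new = [F(0)] * (len(poly) + 1)
            for i, c in enumerate(poly):
                new[i + 1] += c / denom          # multiply by s / (j - m)
                new[i] += -F(m) * c / denom      # multiply by -m / (j - m)
            poly = new
        return poly

    def integrate(poly, a, b):
        return sum(c * (F(b) ** (i + 1) - F(a) ** (i + 1)) / (i + 1)
                   for i, c in enumerate(poly))

    return [integrate(lagrange_coeffs(j), r - 1, r) for j in range(r)]

print(ab_weights(2))   # beta = -1/2, 3/2
print(ab_weights(3))   # beta = 5/12, -4/3, 23/12
```

For r = 1 this recovers forward Euler (\beta_0 = 1), and the r = 2, 3 outputs match the Adams–Bashforth formulas listed in the text. The Adams–Moulton weights follow from the same idea with the extra node s = r included in the interpolation.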

Explicit Adams–Bashforth methods

1-step: U^{n+1} = U^n + k f(U^n)    (forward Euler)
2-step: U^{n+2} = U^{n+1} + (k/2)(-f(U^n) + 3 f(U^{n+1}))
3-step: U^{n+3} = U^{n+2} + (k/12)(5 f(U^n) - 16 f(U^{n+1}) + 23 f(U^{n+2}))
4-step: U^{n+4} = U^{n+3} + (k/24)(-9 f(U^n) + 37 f(U^{n+1}) - 59 f(U^{n+2}) + 55 f(U^{n+3}))

If we allow \beta_r to be nonzero, then we have one more free parameter and so we can eliminate an additional term in the LTE. This gives an implicit method of order r+1 called the r-step Adams–Moulton method. These methods can again be derived by polynomial interpolation, now using a polynomial p(t) of degree r that interpolates f(u(t)) at the points t_n, t_{n+1}, ..., t_{n+r} and then integrating the interpolating polynomial.

Implicit Adams–Moulton methods

1-step: U^{n+1} = U^n + (k/2)(f(U^n) + f(U^{n+1}))    (trapezoidal method)
2-step: U^{n+2} = U^{n+1} + (k/12)(-f(U^n) + 8 f(U^{n+1}) + 5 f(U^{n+2}))
3-step: U^{n+3} = U^{n+2} + (k/24)(f(U^n) - 5 f(U^{n+1}) + 19 f(U^{n+2}) + 9 f(U^{n+3}))
4-step: U^{n+4} = U^{n+3} + (k/720)(-19 f(U^n) + 106 f(U^{n+1}) - 264 f(U^{n+2}) + 646 f(U^{n+3}) + 251 f(U^{n+4}))

Example 5.6. The explicit Nyström methods have the form

    U^{n+r} = U^{n+r-2} + k \sum_{j=0}^{r-1} \beta_j f(U^{n+j})

with the \beta_j chosen to give order r. The midpoint method (5.23) is a two-step explicit Nyström method. A two-step implicit Nyström method is Simpson's rule,

    U^{n+2} = U^n + \frac{k}{3} (f(U^n) + 4 f(U^{n+1}) + f(U^{n+2})),

which reduces to Simpson's rule for quadrature if applied to the ODE u'(t) = f(t).

5.9.1 Local truncation error

For LMMs it is easy to derive a general formula for the LTE. We have

    \tau(t_{n+r}) = \frac{1}{k} \left( \sum_{j=0}^{r} \alpha_j u(t_{n+j}) - k \sum_{j=0}^{r} \beta_j u'(t_{n+j}) \right).
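As a concrete check on these formulas, here is a minimal sketch (my own code, not from the text) applying the 2-step Adams–Bashforth method to u' = -u, u(0) = 1, whose solution is e^{-t}. The extra starting value U^1 is generated with forward Euler; halving k should cut the error by about a factor of 4 for this second order method:

```python
import math

def ab2(f, u0, k, T):
    """Advance u' = f(u) to time T with the 2-step Adams-Bashforth method
    U^{n+2} = U^{n+1} + (k/2)(-f(U^n) + 3 f(U^{n+1})).
    The single extra starting value U^1 is generated with forward Euler."""
    n = round(T / k)
    u_prev, u = u0, u0 + k * f(u0)      # U^0 and the Euler-generated U^1
    for _ in range(n - 1):
        u_prev, u = u, u + 0.5 * k * (-f(u_prev) + 3.0 * f(u))
    return u

f = lambda u: -u
errs = [abs(ab2(f, 1.0, k, 1.0) - math.exp(-1.0)) for k in (0.1, 0.05)]
print(errs[0] / errs[1])   # roughly 4, as expected for a second order method
```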

We have used f(u(t_{n+j})) = u'(t_{n+j}) since u(t) is the exact solution of the ODE. Assuming u is smooth and expanding in Taylor series gives

    u(t_{n+j}) = u(t_n) + jk u'(t_n) + \frac{1}{2}(jk)^2 u''(t_n) + \cdots,
    u'(t_{n+j}) = u'(t_n) + jk u''(t_n) + \frac{1}{2}(jk)^2 u'''(t_n) + \cdots,

and so

    \tau(t_{n+r}) = \frac{1}{k} \left( \sum_{j=0}^{r} \alpha_j \right) u(t_n)
                  + \left( \sum_{j=0}^{r} (j \alpha_j - \beta_j) \right) u'(t_n)
                  + k \left( \sum_{j=0}^{r} \left( \frac{1}{2} j^2 \alpha_j - j \beta_j \right) \right) u''(t_n)
                  + \cdots + k^{q-1} \left( \sum_{j=0}^{r} \left( \frac{1}{q!} j^q \alpha_j - \frac{1}{(q-1)!} j^{q-1} \beta_j \right) \right) u^{(q)}(t_n) + \cdots.

The method is consistent if \tau \to 0 as k \to 0, which requires that at least the first two terms in this expansion vanish:

    \sum_{j=0}^{r} \alpha_j = 0   and   \sum_{j=0}^{r} j \alpha_j = \sum_{j=0}^{r} \beta_j.    (5.48)

If the first p+1 terms vanish, then the method will be pth order accurate. Note that these conditions depend only on the coefficients \alpha_j and \beta_j of the method and not on the particular differential equation being solved.

5.9.2 Characteristic polynomials

It is convenient at this point to introduce the so-called characteristic polynomials \rho(\zeta) and \sigma(\zeta) for the LMM:

    \rho(\zeta) = \sum_{j=0}^{r} \alpha_j \zeta^j   and   \sigma(\zeta) = \sum_{j=0}^{r} \beta_j \zeta^j.    (5.49)

The first of these is a polynomial of degree r. So is \sigma(\zeta) if the method is implicit; otherwise its degree is less than r. Note that \rho(1) = \sum_j \alpha_j and \rho'(1) = \sum_j j \alpha_j, so the consistency conditions (5.48) can be written quite concisely as conditions on these two polynomials:

    \rho(1) = 0   and   \rho'(1) = \sigma(1).    (5.50)

This, however, is not the main reason for introducing these polynomials. The location of the roots of certain polynomials related to \rho and \sigma plays a fundamental role in stability theory, as we will see in the next two chapters.
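These conditions are easy to check mechanically. The sketch below (my own helper, not from the text) evaluates \rho(1), \rho'(1), and \sigma(1) directly from the coefficient lists (\alpha_j, \beta_j) and confirms (5.48) for the 2-step Adams–Bashforth and Adams–Moulton methods:

```python
def rho_sigma_at_one(alpha, beta):
    """Return rho(1), rho'(1), sigma(1) for coefficient lists indexed by j."""
    rho1 = sum(alpha)                                   # rho(1)   = sum_j alpha_j
    drho1 = sum(j * a for j, a in enumerate(alpha))     # rho'(1)  = sum_j j alpha_j
    sig1 = sum(beta)                                    # sigma(1) = sum_j beta_j
    return rho1, drho1, sig1

methods = {
    # 2-step Adams-Bashforth: U^{n+2} = U^{n+1} + (k/2)(-f^n + 3 f^{n+1})
    "AB2": ([0.0, -1.0, 1.0], [-0.5, 1.5, 0.0]),
    # 2-step Adams-Moulton: U^{n+2} = U^{n+1} + (k/12)(-f^n + 8 f^{n+1} + 5 f^{n+2})
    "AM2": ([0.0, -1.0, 1.0], [-1 / 12, 8 / 12, 5 / 12]),
}
for name, (alpha, beta) in methods.items():
    rho1, drho1, sig1 = rho_sigma_at_one(alpha, beta)
    assert abs(rho1) < 1e-12 and abs(drho1 - sig1) < 1e-12
    print(name, "satisfies the consistency conditions (5.48)")
```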

Example 5.7. The two-step Adams–Moulton method

    U^{n+2} = U^{n+1} + \frac{k}{12} (-f(U^n) + 8 f(U^{n+1}) + 5 f(U^{n+2}))    (5.51)

has characteristic polynomials

    \rho(\zeta) = \zeta^2 - \zeta,   \sigma(\zeta) = \frac{1}{12} (-1 + 8\zeta + 5\zeta^2).    (5.52)

5.9.3 Starting values

One difficulty with using LMMs if r > 1 is that we need the values U^0, U^1, ..., U^{r-1} before we can begin to apply the multistep method. The value U^0 = \eta is known from the initial data for the problem, but the other values are not and typically must be generated by some other numerical method or methods.

Example 5.8. If we want to use the midpoint method (5.23), then we need to generate U^1 by some other method before we can begin to apply (5.23) with n = 1. We can obtain U^1 from U^0 using any one-step method, such as Euler's method or the trapezoidal method, or a higher order Taylor series or Runge–Kutta method. Since the midpoint method is second order accurate, we need to make sure that the value U^1 we generate is sufficiently accurate that this second order accuracy is not lost. Our first impulse might be to conclude that we need to use a second order accurate method such as the trapezoidal method rather than the first order accurate Euler method, but this is wrong. The overall method is second order in either case.

The reason we achieve second order accuracy even if Euler is used in the first step is exactly analogous to what was observed earlier for boundary value problems, where we found that we can often get away with one order of accuracy lower in the local error at a single point than what we have elsewhere. In the present context this is easiest to explain in terms of the one-step error. The midpoint method has a one-step error that is O(k^3), and because this method is applied in O(T/k) time steps, the global error is expected to be O(k^2). Euler's method has a one-step error that is O(k^2), but we are applying this method only once.
If U^0 = \eta = u(0), then the error in U^1 obtained with Euler will be O(k^2). If the midpoint method is stable, then this error will not be magnified unduly in later steps and its contribution to the global error will be only O(k^2). The overall second order accuracy will not be affected.

More generally, with an r-step method of order p, we need r starting values U^0, U^1, ..., U^{r-1}, and we need to generate these values using a method that has a one-step error that is O(k^p) (corresponding to an LTE that is O(k^{p-1})). Since the number of times we apply this method (r - 1) is independent of k as k \to 0, this is sufficient to give an O(k^p) global error. Of course somewhat better accuracy (a smaller error constant) may be achieved by using a pth order accurate method for the starting values, which takes little additional work.

In software for the IVP, multistep methods are generally implemented in a form that allows changing the time step during the integration process, as is often required to solve the problem efficiently. Typically the order of the method is also allowed to vary,
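The claim about starting values can be tested numerically. Here is a sketch (my own code, not from the text) running the two-step midpoint method U^{n+1} = U^{n-1} + 2k f(U^n) on u' = -u, once with the exact starting value U^1 = u(k) and once with a forward Euler start; both choices give second order convergence:

```python
import math

def midpoint(f, u0, u1, k, T):
    """Midpoint method U^{n+1} = U^{n-1} + 2k f(U^n), given U^0 and U^1."""
    n = round(T / k)
    um, u = u0, u1
    for _ in range(n - 1):
        um, u = u, um + 2.0 * k * f(u)
    return u

f = lambda u: -u
ratios = {}
for start in ("exact", "euler"):
    errs = []
    for k in (0.02, 0.01):
        u1 = math.exp(-k) if start == "exact" else 1.0 + k * f(1.0)
        errs.append(abs(midpoint(f, 1.0, u1, k, 1.0) - math.exp(-1.0)))
    ratios[start] = errs[0] / errs[1]
print(ratios)   # both ratios near 4: second order accuracy either way
```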

depending on how the solution is behaving. In such software it is then natural to solve the starting-value problem by initially taking a small time step with a one-step method and then ramping up to higher order methods and longer time steps as the integration proceeds and more past data become available.

5.9.4 Predictor-corrector methods

The idea of comparing results obtained with methods of different order as a way to choose the time step, discussed in Section 5.7.1 for Runge–Kutta methods, is also used with LMMs. One approach is to use a predictor-corrector method, in which an explicit Adams–Bashforth method of some order is used to predict a value \hat{U}^{n+1} and then the Adams–Moulton method of the same order is used to correct this value. This is done by using \hat{U}^{n+1} on the right-hand side of the Adams–Moulton method inside the f evaluation, so that the Adams–Moulton formula is no longer implicit. For example, the one-step Adams–Bashforth method (Euler's method) and the one-step Adams–Moulton method (the trapezoidal method) can be combined into

    \hat{U}^{n+1} = U^n + k f(U^n),
    U^{n+1} = U^n + \frac{k}{2} (f(U^n) + f(\hat{U}^{n+1})).    (5.53)

It can be shown that this method is second order accurate, like the trapezoidal method, but it also generates a lower order approximation, and the difference between the two can be used to estimate the error. The MATLAB routine ode113 uses this approach, with Adams–Bashforth–Moulton methods of orders 1 to 13; see [78].
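The pair (5.53) can be sketched in a few lines (my own code, not from the text; the "error indicator" returned here is simply the predictor-corrector difference mentioned above):

```python
import math

def pc_step(f, u, k):
    """One Euler-predictor / trapezoidal-corrector step, equation (5.53)."""
    u_hat = u + k * f(u)                          # predict (first order)
    u_new = u + 0.5 * k * (f(u) + f(u_hat))       # correct (second order)
    return u_new, abs(u_new - u_hat)              # difference: crude error estimate

f = lambda u: -u
errs = []
for k in (0.05, 0.025):
    u = 1.0
    for _ in range(round(1.0 / k)):
        u, est = pc_step(f, u, k)
    errs.append(abs(u - math.exp(-1.0)))
print(errs[0] / errs[1])   # about 4: the corrected value is second order
```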


Chapter 6
Zero-Stability and Convergence for Initial Value Problems

6.1 Convergence

To discuss the convergence of a numerical method for the initial value problem, we focus on a fixed (but arbitrary) time T > 0 and consider the error in our approximation to u(T) computed with the method using time step k. The method converges on this problem if this error goes to zero as k \to 0. Note that the number of time steps we need to take to reach time T increases as k \to 0. If we use N to denote this value (N = T/k), then convergence means that

    \lim_{k \to 0, \, Nk = T} U^N = u(T).    (6.1)

In principle a method might converge on one problem but not on another, or converge with one set of starting values but not with another set. To speak of a method being convergent in general, we require that it converge on all problems in a reasonably large class with all reasonable starting values.

For an r-step method we need r starting values. These values will typically depend on k, and to make this clear we will write them as U^0(k), U^1(k), ..., U^{r-1}(k). While these will generally approximate u(t) at the times t_0 = 0, t_1 = k, ..., t_{r-1} = (r-1)k, respectively, as k \to 0 each of these times approaches t_0 = 0. So the weakest condition we might put on our starting values is that they converge to the correct initial value \eta as k \to 0:

    \lim_{k \to 0} U^\nu(k) = \eta   for \nu = 0, 1, ..., r-1.    (6.2)

We can now state the definition of convergence.

Definition 6.1. An r-step method is said to be convergent if applying the method to any ODE (5.1) with f(u, t) Lipschitz continuous in u, and with any set of starting values satisfying (6.2), we obtain convergence in the sense of (6.1) for every fixed time T > 0 at which the ODE has a unique solution.

To be convergent, a method must be consistent, meaning as before that the local truncation error (LTE) is o(1) as k \to 0, and also zero-stable, as described later in this

chapter. We will begin to investigate these issues by first proving the convergence of one-step methods, which turn out to be zero-stable automatically. We start with Euler's method on linear problems, then consider Euler's method on general nonlinear problems, and finally extend this to a wide class of one-step methods.

6.2 The test problem

Much of the theory presented below is based on examining what happens when a method is applied to a simple scalar linear equation of the form

    u'(t) = \lambda u(t) + g(t)    (6.3)

with initial data u(t_0) = \eta. The solution is then given by Duhamel's principle (5.8),

    u(t) = e^{\lambda (t - t_0)} \eta + \int_{t_0}^{t} e^{\lambda (t - \tau)} g(\tau) \, d\tau.    (6.4)

6.3 One-step methods

6.3.1 Euler's method on linear problems

If we apply Euler's method to (6.3), we obtain

    U^{n+1} = U^n + k (\lambda U^n + g(t_n)) = (1 + k\lambda) U^n + k g(t_n).    (6.5)

The LTE for Euler's method is given by

    \tau^n = \frac{u(t_{n+1}) - u(t_n)}{k} - (\lambda u(t_n) + g(t_n))
           = \left( u'(t_n) + \frac{1}{2} k u''(t_n) + O(k^2) \right) - u'(t_n)    (6.6)
           = \frac{1}{2} k u''(t_n) + O(k^2).

Rewriting this equation as

    u(t_{n+1}) = (1 + k\lambda) u(t_n) + k g(t_n) + k \tau^n

and subtracting this from (6.5) gives a difference equation for the global error E^n = U^n - u(t_n):

    E^{n+1} = (1 + k\lambda) E^n - k \tau^n.    (6.7)

Note that this has exactly the same form as (6.5) but with a different nonhomogeneous term: -\tau^n in place of g(t_n). This is analogous to equation (2.15) in the boundary value theory

and again gives the relation we need between the local truncation error \tau^n (which is easy to compute) and the global error E^n (which we wish to bound). Note again that linearity plays a critical role in making this connection. We will consider nonlinear problems below.

Because the equation and method we are now considering are both so simple, we obtain an equation (6.7) that we can explicitly solve for the global error E^n. Applying the recursion (6.7) repeatedly, we see what form the solution should take:

    E^n = (1 + k\lambda) E^{n-1} - k \tau^{n-1}
        = (1 + k\lambda) [(1 + k\lambda) E^{n-2} - k \tau^{n-2}] - k \tau^{n-1}
        = \cdots.

By induction we can easily confirm that in general

    E^n = (1 + k\lambda)^n E^0 - k \sum_{m=1}^{n} (1 + k\lambda)^{n-m} \tau^{m-1}.    (6.8)

(Note that some of the superscripts are powers while others are indices!) This has a form that is very analogous to the solution (6.4) of the corresponding ordinary differential equation (ODE), where now (1 + k\lambda)^{n-m} plays the role of the solution operator of the homogeneous problem: it transforms data at time t_m to the solution at time t_n. The expression (6.8) is sometimes called the discrete form of Duhamel's principle.

We are now ready to prove that Euler's method converges on (6.3). We need only observe that

    |1 + k\lambda| \le e^{k |\lambda|}    (6.9)

and so

    |1 + k\lambda|^{n-m} \le e^{(n-m) k |\lambda|} \le e^{n k |\lambda|} \le e^{|\lambda| T},    (6.10)

provided that we restrict our attention to the finite time interval 0 \le t \le T, so that t_n = nk \le T. It then follows from (6.8) that

    |E^n| \le e^{|\lambda| T} \left( |E^0| + k \sum_{m=1}^{n} |\tau^{m-1}| \right)
          \le e^{|\lambda| T} \left( |E^0| + nk \max_{1 \le m \le n} |\tau^{m-1}| \right).    (6.11)

Let N = T/k be the number of time steps needed to reach time T and set

    \|\tau\|_\infty = \max_{0 \le n \le N-1} |\tau^n|.

From (6.6) we expect

    \|\tau\|_\infty \approx \frac{1}{2} k \|u''\|_\infty = O(k),

where \|u''\|_\infty is the maximum value of the function u'' over the interval [0, T]. Then for t_n = nk \le T, we have from (6.11) that

    |E^n| \le e^{|\lambda| T} (|E^0| + T \|\tau\|_\infty).
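The bound above predicts first order convergence for Euler's method on the test problem. A quick sketch (my own code, not from the text) confirms this for (6.3) with \lambda = -2 and g manufactured so that u(t) = \sin t is the exact solution:

```python
import math

lam = -2.0
g = lambda t: math.cos(t) - lam * math.sin(t)   # chosen so that u(t) = sin t solves (6.3)

def euler(k, T):
    n = round(T / k)
    u, t = 0.0, 0.0                             # u(0) = sin 0 = 0
    for _ in range(n):
        u = (1.0 + k * lam) * u + k * g(t)      # one Euler step, equation (6.5)
        t += k
    return u

errs = [abs(euler(k, 2.0) - math.sin(2.0)) for k in (0.02, 0.01)]
print(errs[0] / errs[1])   # roughly 2: halving k halves the error (first order)
```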

If (6.2) is satisfied, then E^0 \to 0 as k \to 0. In fact for this one-step method we would generally take U^0 = u(0) = \eta, in which case E^0 drops out and we are left with

    |E^n| \le e^{|\lambda| T} T \|\tau\|_\infty = O(k)   as k \to 0,    (6.12)

and hence the method converges and is in fact first order accurate.

Note where stability comes into the picture. The one-step error L^m = k \tau^m introduced in the mth step contributes the term (1 + k\lambda)^{n-m} L^m to the global error. The fact that |(1 + k\lambda)^{n-m}| \le e^{|\lambda| T} is uniformly bounded as k \to 0 allows us to conclude that each contribution to the final error can be bounded in terms of its original size as a one-step error. Hence the naive analysis of Section 5.5 is valid, and the global error has the same order of magnitude as the local truncation error.

6.3.2 Relation to stability for boundary value problems

To see how this ties in with the definition of stability used in Chapter 2 for the BVP, it may be useful to view Euler's method as giving a linear system in matrix form, although this is not the way it is used computationally. If we view the equations (6.5) for n = 0, 1, ..., N-1 as a linear system AU = F for U = [U^1, U^2, ..., U^N]^T, then

    A = \frac{1}{k} \begin{bmatrix} 1 & & & \\ -(1+k\lambda) & 1 & & \\ & \ddots & \ddots & \\ & & -(1+k\lambda) & 1 \end{bmatrix},
    U = \begin{bmatrix} U^1 \\ U^2 \\ \vdots \\ U^N \end{bmatrix},
    F = \begin{bmatrix} (1/k + \lambda) U^0 + g(t_0) \\ g(t_1) \\ \vdots \\ g(t_{N-1}) \end{bmatrix}.

We have divided both sides of (6.5) by k to conform to the notation of Chapter 2. Since the matrix A is lower triangular, this system is easily solved by forward substitution, which results in the iterative equation (6.5). If we now let \hat{U} be the vector obtained from the true solution as in Chapter 2, then subtracting A\hat{U} = F + \tau from AU = F, we obtain (2.15) (the matrix form of (6.7)) with solution (6.8). We are then in exactly the same framework as in Chapter 2.
So we have convergence, and a global error with the same magnitude as the local error, provided that the method is stable in the sense of Definition 2.1, i.e., that the inverse of the matrix A is bounded independently of k for all k sufficiently small. The inverse of this matrix is easy to compute. In fact we can see from the solution (6.8) that

    A^{-1} = k \begin{bmatrix} 1 & & & \\ (1+k\lambda) & 1 & & \\ (1+k\lambda)^2 & (1+k\lambda) & 1 & \\ \vdots & & \ddots & \\ (1+k\lambda)^{N-1} & (1+k\lambda)^{N-2} & \cdots & 1 \end{bmatrix}.

We easily compute using (A.10a) that

    \|A^{-1}\|_\infty = k \sum_{m=0}^{N-1} |1 + k\lambda|^m \le k N e^{|\lambda| T} = T e^{|\lambda| T}.

This is uniformly bounded as k \to 0 for fixed T. Hence the method is stable and \|E\|_\infty \le \|A^{-1}\|_\infty \|\tau\|_\infty \le T e^{|\lambda| T} \|\tau\|_\infty, which agrees with the bound (6.12).

6.3.3 Euler's method on nonlinear problems

So far we have focused entirely on linear equations. Practical problems are almost always nonlinear, but for the initial value problem it turns out that it is not significantly more difficult to handle this case if we assume that f(u) is Lipschitz continuous, which is reasonable in light of the discussion in Section 5.2. Euler's method on u' = f(u) takes the form

    U^{n+1} = U^n + k f(U^n),    (6.13)

and the truncation error is defined by

    \tau^n = \frac{1}{k} (u(t_{n+1}) - u(t_n)) - f(u(t_n)) = \frac{1}{2} k u''(t_n) + O(k^2),

just as in the linear case. So the true solution satisfies

    u(t_{n+1}) = u(t_n) + k f(u(t_n)) + k \tau^n,

and subtracting this from (6.13) gives

    E^{n+1} = E^n + k (f(U^n) - f(u(t_n))) - k \tau^n.    (6.14)

In the linear case f(U^n) - f(u(t_n)) = \lambda E^n and we get the relation (6.7) for E^n. In the nonlinear case we cannot express f(U^n) - f(u(t_n)) directly in terms of the error E^n in general. However, using the Lipschitz continuity of f we can get a bound in terms of E^n:

    |f(U^n) - f(u(t_n))| \le L |U^n - u(t_n)| = L |E^n|.

Using this in (6.14) gives

    |E^{n+1}| \le |E^n| + kL |E^n| + k |\tau^n| = (1 + kL) |E^n| + k |\tau^n|.    (6.15)

From this inequality we can show by induction that

    |E^n| \le (1 + kL)^n |E^0| + k \sum_{m=1}^{n} (1 + kL)^{n-m} |\tau^{m-1}|,

and so, using the same steps as in obtaining (6.12) (and again assuming E^0 = 0), we obtain

    |E^n| \le e^{LT} T \|\tau\|_\infty = O(k)   as k \to 0    (6.16)

for all n with nk \le T, proving that the method converges. In the linear case L = |\lambda| and this reduces to exactly (6.12).

6.3.4 General one-step methods

A general explicit one-step method takes the form

    U^{n+1} = U^n + k \Psi(U^n, t_n, k)    (6.17)

for some function \Psi, which depends on f of course. We will assume that \Psi(u, t, k) is continuous in t and k and Lipschitz continuous in u, with Lipschitz constant L' that is generally related to the Lipschitz constant of f.

Example 6.1. For the two-stage Runge–Kutta method of Example 5.11, we have

    \Psi(u, t, k) = f\left( u + \frac{1}{2} k f(u) \right).    (6.18)

If f is Lipschitz continuous with Lipschitz constant L, then \Psi has Lipschitz constant L' = L + \frac{1}{2} k L^2.

The one-step method (6.17) is consistent if \Psi(u, t, 0) = f(u, t) for all u, t, and \Psi is continuous in k. The local truncation error is

    \tau^n = \frac{u(t_{n+1}) - u(t_n)}{k} - \Psi(u(t_n), t_n, k).

We can show that any one-step method satisfying these conditions is convergent. We have

    u(t_{n+1}) = u(t_n) + k \Psi(u(t_n), t_n, k) + k \tau^n,

and subtracting this from (6.17) gives

    E^{n+1} = E^n + k (\Psi(U^n, t_n, k) - \Psi(u(t_n), t_n, k)) - k \tau^n.

Using the Lipschitz condition we obtain

    |E^{n+1}| \le (1 + k L') |E^n| + k |\tau^n|.

This has exactly the same form as (6.15), and the proof of convergence proceeds exactly as from there.

6.4 Zero-stability of linear multistep methods

The convergence proof of the previous section shows that for one-step methods, each one-step error k\tau^m has an effect on the global error that is bounded by e^{L'T} |k\tau^m|. Although the error is possibly amplified by a factor e^{L'T}, this factor is bounded independently of k as k \to 0. Consequently the method is stable: the global error can be bounded in terms of the sum of all the one-step errors and hence has the same asymptotic behavior as the LTE as k \to 0.

This form of stability is often called zero-stability in ODE theory, to distinguish it from other forms of stability that are of equal importance in practice. The fact that a method is zero-stable (and converges as k \to 0) is no guarantee that it will give reasonable results on the particular grid with k > 0 that we want to use in practice. Other stability issues of a different nature will be taken up in the next chapter. But first we will investigate the issue of zero-stability for general LMMs, where the theory of the previous section does not apply directly. We begin with an example showing a consistent LMM that is not convergent. Examining what goes wrong will motivate our definition of zero-stability for LMMs.

Example 6.2. The LMM

    U^{n+2} - 3U^{n+1} + 2U^n = -k f(U^n)    (6.19)

has an LTE given by

    \tau^n = \frac{1}{k} \left[ u(t_{n+2}) - 3u(t_{n+1}) + 2u(t_n) \right] + u'(t_n) = \frac{1}{2} k u''(t_n) + O(k^2),

so the method is consistent and first order accurate. But in fact the global error will not exhibit first order accuracy, or even convergence, in general. This can be seen even on the trivial initial value problem

    u'(t) = 0,   u(0) = 0    (6.20)

with solution u(t) \equiv 0. On this problem, equation (6.19) takes the form

    U^{n+2} - 3U^{n+1} + 2U^n = 0.    (6.21)

We need two starting values U^0 and U^1. If we take U^0 = U^1 = 0, then (6.21) generates U^n = 0 for all n, and in this case we certainly converge to the correct solution; in fact we get the exact solution for any k.
But in general we will not have the exact value U^1 available and will have to approximate it, introducing some error into the computation. Table 6.1 shows results obtained by applying this method with starting data U^0 = 0, U^1 = k. Since U^1(k) \to 0 as k \to 0, this is valid starting data in the context of Definition 6.1 of convergence. If the method is convergent, we should see that U^N, the computed solution at time T = 1, converges to zero as k \to 0. Instead it blows up quite dramatically. Similar results would be seen if we applied this method to an arbitrary equation u' = f(u) and used any one-step method to compute U^1 from U^0.

The homogeneous linear difference equation (6.21) can be solved explicitly for U^n in terms of the starting values U^0 and U^1. We obtain

    U^n = 2U^0 - U^1 + 2^n (U^1 - U^0).    (6.22)

Table 6.1. Solution U^N to (6.21) with U^0 = 0, U^1 = k and various values of k = 1/N. (By (6.22), U^N = k(2^N - 1), which grows explosively as N increases.)

It is easy to verify that (6.22) satisfies (6.21) and also the starting values. (We'll see how to solve general linear difference equations in the next section.) Since u(t) \equiv 0, the error is E^n = U^n, and we see that any initial errors in U^1 or U^0 are magnified by a factor 2^n in the global error (except in the special case U^1 = U^0). This exponential growth of the error is the instability that leads to nonconvergence.

To rule out this sort of growth of errors, we need to be able to solve a general linear difference equation.

6.4.1 Solving linear difference equations

We briefly review one solution technique for linear difference equations; see Section D.2.1 for a different approach. Consider the general homogeneous linear difference equation

    \sum_{j=0}^{r} \alpha_j U^{n+j} = 0.    (6.23)

Eventually we will look for a particular solution satisfying given initial conditions U^0, U^1, ..., U^{r-1}, but to begin with we will find the general solution of the difference equation in terms of r free parameters. We hypothesize that this equation has a solution of the form

    U^n = \zeta^n    (6.24)

for some value of \zeta (here \zeta^n is the nth power!). Plugging this into (6.23) gives

    \sum_{j=0}^{r} \alpha_j \zeta^{n+j} = 0,

and dividing by \zeta^n yields

    \sum_{j=0}^{r} \alpha_j \zeta^j = 0.    (6.25)

We see that (6.24) is a solution of the difference equation if \zeta satisfies (6.25), i.e., if \zeta is a root of the polynomial

    \rho(\zeta) = \sum_{j=0}^{r} \alpha_j \zeta^j.
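The blow-up seen in Table 6.1 is easy to reproduce. This sketch (my own code, not from the text) iterates (6.21) with the starting data U^0 = 0, U^1 = k and compares the result with the closed form (6.22), which for this data reduces to U^n = k(2^n - 1):

```python
def lmm_on_zero_ode(k, N):
    """Iterate U^{n+2} = 3 U^{n+1} - 2 U^n (method (6.19) applied to u' = 0)
    with starting data U^0 = 0, U^1 = k; returns U^N."""
    um, u = 0.0, k
    for _ in range(N - 1):
        um, u = u, 3.0 * u - 2.0 * um
    return u

# closed form (6.22): U^n = 2U^0 - U^1 + 2^n (U^1 - U^0) = k (2^n - 1)
N = 30
k = 1.0 / N
print(lmm_on_zero_ode(k, N), k * (2 ** N - 1))   # both huge (about 3.6e7)
```

Even though the starting data U^1 = k tends to zero with k, the computed value at the fixed time T = Nk explodes, illustrating why this consistent method is not convergent.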

Note that this is just the first characteristic polynomial of the LMM introduced in (5.49). In general \rho(\zeta) has r roots \zeta_1, \zeta_2, ..., \zeta_r and can be factored as

    \rho(\zeta) = \alpha_r (\zeta - \zeta_1)(\zeta - \zeta_2) \cdots (\zeta - \zeta_r).

Since the difference equation is linear, any linear combination of solutions is again a solution. If \zeta_1, \zeta_2, ..., \zeta_r are distinct (\zeta_i \ne \zeta_j for i \ne j), then the r distinct solutions \zeta_i^n are linearly independent and the general solution of (6.23) has the form

    U^n = c_1 \zeta_1^n + c_2 \zeta_2^n + \cdots + c_r \zeta_r^n,    (6.26)

where c_1, ..., c_r are arbitrary constants. In this case, every solution of the difference equation (6.23) has this form. If initial conditions U^0, U^1, ..., U^{r-1} are specified, then the constants c_1, ..., c_r can be uniquely determined by solving the r x r linear system

    c_1 + c_2 + \cdots + c_r = U^0,
    c_1 \zeta_1 + c_2 \zeta_2 + \cdots + c_r \zeta_r = U^1,    (6.27)
    \vdots
    c_1 \zeta_1^{r-1} + c_2 \zeta_2^{r-1} + \cdots + c_r \zeta_r^{r-1} = U^{r-1}.

Example 6.3. The characteristic polynomial for the difference equation (6.21) is

    \rho(\zeta) = \zeta^2 - 3\zeta + 2 = (\zeta - 1)(\zeta - 2)    (6.28)

with roots \zeta_1 = 1, \zeta_2 = 2. The general solution has the form U^n = c_1 + c_2 2^n, and solving for c_1 and c_2 from U^0 and U^1 gives the solution (6.22).

This example indicates that if \rho(\zeta) has any roots that are greater than one in modulus, the method will not be convergent. It turns out that the converse is nearly true: if all the roots have modulus no greater than one, then the method is convergent, with one proviso. There must be no repeated roots with modulus equal to one. The next two examples illustrate this.

If the roots are not distinct, say, \zeta_1 = \zeta_2 for simplicity, then \zeta_1^n and \zeta_2^n are not linearly independent and the U^n given by (6.26), while still a solution, is not the most general solution. The system (6.27) would be singular in this case. In addition to \zeta_1^n there is also a solution of the form n \zeta_1^n, and the general solution has the form

    U^n = c_1 \zeta_1^n + c_2 n \zeta_1^n + c_3 \zeta_3^n + \cdots + c_r \zeta_r^n.

If in addition \zeta_3 = \zeta_1, then the third term would be replaced by c_3 n^2 \zeta_1^n.
Similar modifications are made for any other repeated roots. Note how similar this theory is to the standard solution technique for an rth order linear ODE.

Example 6.4. Applying the consistent LMM

    U^{n+2} - 2U^{n+1} + U^n = \frac{1}{2} k (f(U^{n+2}) - f(U^n))    (6.29)

to the differential equation u'(t) = 0 gives the difference equation

    U^{n+2} - 2U^{n+1} + U^n = 0.

The characteristic polynomial is

    \rho(\zeta) = \zeta^2 - 2\zeta + 1 = (\zeta - 1)^2,    (6.30)

so \zeta_1 = \zeta_2 = 1. The general solution is

    U^n = c_1 + c_2 n.

For particular starting values U^0 and U^1 the solution is

    U^n = U^0 + (U^1 - U^0) n.

Again we see that the solution grows with n, although not as dramatically as in Example 6.2 (the growth is linear rather than exponential). But this growth is still enough to destroy convergence. If we take the same starting values as before, U^0 = 0 and U^1 = k, then U^n = kn, and so

    \lim_{k \to 0, \, Nk = T} U^N = \lim_{k \to 0, \, Nk = T} kN = T.

The method converges to the function v(t) = t rather than to u(t) \equiv 0, and hence the LMM (6.29) is not convergent. This example shows that if \rho(\zeta) has a repeated root of modulus 1, then the method cannot be convergent.

Example 6.5. Now consider the consistent LMM

    U^{n+3} - 2U^{n+2} + \frac{5}{4} U^{n+1} - \frac{1}{4} U^n = \frac{1}{4} k f(U^n).    (6.31)

Applying this to (6.20) gives

    U^{n+3} - 2U^{n+2} + \frac{5}{4} U^{n+1} - \frac{1}{4} U^n = 0,

and the characteristic polynomial is

    \rho(\zeta) = \zeta^3 - 2\zeta^2 + \frac{5}{4}\zeta - \frac{1}{4} = (\zeta - 1)(\zeta - 0.5)^2.    (6.32)

So \zeta_1 = 1, \zeta_2 = \zeta_3 = 1/2, and the general solution is

    U^n = c_1 + c_2 \left( \frac{1}{2} \right)^n + c_3 n \left( \frac{1}{2} \right)^n.

Here there is a repeated root, but with modulus less than 1. The linear growth of n is then overwhelmed by the decay of (1/2)^n. For this three-step method we need three starting values U^0, U^1, U^2, and we can find c_1, c_2, c_3 in terms of them by solving a linear system similar to (6.27). Each c_i will

be a linear combination of U^0, U^1, U^2, and so if U^\nu(k) \to 0 as k \to 0, then c_i(k) \to 0 as k \to 0 also. The value U^N computed at time T with step size k (where kN = T) has the form

    U^N = c_1(k) + c_2(k) \left( \frac{1}{2} \right)^N + c_3(k) N \left( \frac{1}{2} \right)^N.    (6.33)

Now we see that

    \lim_{k \to 0, \, Nk = T} U^N = 0,

and so the method (6.31) converges on u' = 0 with arbitrary starting values U^\nu(k) satisfying U^\nu(k) \to 0 as k \to 0. (In fact, this LMM is convergent in general.)

More generally, if \rho(\zeta) has a root \zeta_j that is repeated m times, then U^N will involve terms of the form N^s \zeta_j^N for s = 0, 1, ..., m-1. This converges to zero as N \to \infty provided |\zeta_j| < 1. The algebraic growth of N^s is overwhelmed by the exponential decay of \zeta_j^N. This shows that repeated roots are not a problem as long as they have magnitude strictly less than 1.

With the above examples as motivation, we are ready to state the definition of zero-stability.

Definition 6.2. An r-step LMM is said to be zero-stable if the roots of the characteristic polynomial \rho(\zeta) defined by (5.49) satisfy the following conditions:

    |\zeta_j| \le 1   for j = 1, 2, ..., r;
    if \zeta_j is a repeated root, then |\zeta_j| < 1.    (6.34)

If the conditions (6.34) are satisfied for all roots of \rho, then the polynomial \rho is said to satisfy the root condition.

Example 6.6. The Adams methods have the form

    U^{n+r} = U^{n+r-1} + k \sum_{j=0}^{r} \beta_j f(U^{n+j})

and hence

    \rho(\zeta) = \zeta^r - \zeta^{r-1} = (\zeta - 1) \zeta^{r-1}.

The roots are \zeta_1 = 1 and \zeta_2 = \cdots = \zeta_r = 0. The root condition is clearly satisfied, and so all the Adams–Bashforth and Adams–Moulton methods are zero-stable.

The given examples certainly do not prove that zero-stability as defined above is a sufficient condition for convergence. We looked only at the simplest possible ODE u'(t) = 0 and saw that things could go wrong if the root condition is not satisfied. It turns out, however, that the root condition is all that is needed to prove convergence on the general initial value problem (in the sense of Definition 6.1).

Theorem 6.3 (Dahlquist [22]).
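The root condition is straightforward to check numerically. Below is a sketch (my own helper, not from the text; it assumes numpy is available for the polynomial root finding) that tests (6.34) for the \alpha-coefficients of the examples above:

```python
import numpy as np

def zero_stable(alpha, tol=1e-6):
    """Check the root condition (6.34): alpha = [alpha_0, ..., alpha_r].
    All roots of rho must satisfy |z| <= 1, with strict inequality for
    repeated roots (detected here by clustering within tol)."""
    z = np.roots(alpha[::-1])          # numpy wants highest-degree coefficient first
    for i, zi in enumerate(z):
        repeated = any(abs(zi - zj) < tol for j, zj in enumerate(z) if j != i)
        if abs(zi) > 1 + tol or (repeated and abs(zi) > 1 - tol):
            return False
    return True

print(zero_stable([2.0, -3.0, 1.0]))          # (6.19): roots 1, 2        -> False
print(zero_stable([1.0, -2.0, 1.0]))          # (6.29): double root 1     -> False
print(zero_stable([-0.25, 1.25, -2.0, 1.0]))  # (6.31): roots 1, 1/2, 1/2 -> True
print(zero_stable([0.0, -1.0, 1.0]))          # Adams:  roots 0, 1        -> True
```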
For LMMs applied to the initial value problem for u'(t) = f(u(t), t),

    consistency + zero-stability  <=>  convergence.    (6.35)

This is the analogue for the IVP of the statement (2.21) for the BVP. A proof of this result can be found in [43].

Note: A consistent LMM always has one root equal to 1, say \zeta_1 = 1, called the principal root. This follows from (5.50). Hence a consistent one-step LMM (such as Euler, backward Euler, or trapezoidal) is certainly zero-stable. More generally, we proved in Section 6.3.4 that any consistent one-step method (with \Psi Lipschitz continuous) is convergent. Such methods are automatically zero-stable and behave well as k \to 0. We can think of zero-stability as meaning "stable in the limit as k \to 0." Although a consistent zero-stable method is convergent, it may have other stability problems that show up if the time step k is chosen too large in an actual computation. Additional stability considerations are the subject of the next chapter.

Chapter 7
Absolute Stability for Ordinary Differential Equations

7.1 Unstable computations with a zero-stable method

In the last chapter we investigated zero-stability, the form of stability needed to guarantee convergence of a numerical method as the grid is refined (k \to 0). In practice, however, we are not able to compute this limit. Instead we typically perform a single calculation with some particular nonzero time step k (or some particular sequence of time steps with a variable step size method). Since the expense of the computation increases as k decreases, we generally want to choose the time step as large as possible consistent with our accuracy requirements. How can we estimate the size of k required? Recall that if the method is stable in an appropriate sense, then we expect the global error to be bounded in terms of the local truncation errors at each step, and so we can often use the local truncation error to estimate the time step needed, as illustrated below.

But the form of stability now needed is something stronger than zero-stability. We need to know that the error is well behaved for the particular time step we are now using. It is little help to know that things will converge in the limit for k sufficiently small. The potential difficulties are best illustrated with some examples.

Example 7.1. Consider the initial value problem (IVP)

    u'(t) = -\sin t,   u(0) = 1,

with solution u(t) = \cos t. Suppose we wish to use Euler's method to solve this problem up to time T = 2. The local truncation error (LTE) is

    \tau(t) = \frac{1}{2} k u''(t) + O(k^2)    (7.1)
            = -\frac{1}{2} k \cos(t) + O(k^2).

Since the function f(t) = -\sin t is independent of u, it is Lipschitz continuous in u with Lipschitz constant L = 0, and so the error estimate (6.12) shows that

    |E^n| \le T \|\tau\|_\infty \approx \frac{1}{2} T k \max_{0 \le t \le T} |\cos t| = k.

Suppose we want to compute a solution with |E| \le 10^{-3}. Then we should be able to take k = 10^{-3} and obtain a suitable solution after T/k = 2000 time steps. Indeed, calculating with k = 10^{-3} gives a computed value U^{2000} = -0.415692 with an error E^{2000} = U^{2000} - \cos(2) \approx 4.5 \times 10^{-4}.

Example 7.2. Now suppose we modify the above equation to

    u'(t) = \lambda (u - \cos t) - \sin t,    (7.2)

where \lambda is some constant. If we take the same initial data as before, u(0) = 1, then the solution is also the same as before, u(t) = \cos t. As a concrete example, let's take \lambda = -10. Now how small do we need to take k to get an error that is 10^{-3}? Since the LTE (7.1) depends only on the true solution u(t), which is unchanged from Example 7.1, we might hope that we could use the same k as in that example, k = 10^{-3}. Solving the problem using Euler's method with this step size now gives U^{2000} = -0.416163, with an error of magnitude about 1.6 x 10^{-5}. We are again successful. In fact, the error is considerably smaller in this case than in the previous example, for reasons that will become clear later.

Example 7.3. Now consider the problem (7.2) with \lambda = -2100 and the same data as before. Again the solution is unchanged, and so is the LTE. But now if we compute with the same step size as before, k = 10^{-3}, the result is catastrophic: the computed U^{2000} has enormous magnitude (on the order of 10^{77}). The computation behaves in an unstable manner, with an error that grows exponentially in time. Since the method is zero-stable and f(u, t) is Lipschitz continuous in u (with Lipschitz constant L = 2100), we know that the method is convergent, and indeed with sufficiently small time steps we achieve very good results. Table 7.1 shows the error at time T = 2 when Euler's method is used with various values of k. Clearly something dramatic happens as k crosses a value near 0.00095. For smaller values of k we get very good results, whereas for larger values of k there is no accuracy whatsoever.
The equation (7.2) is a linear equation of the form (6.3) and so the analysis of Section 6.3 applies directly to this problem. From (6.7) we see that the global error E^n satisfies the recursion relation

    E^{n+1} = (1 + kλ) E^n - k τ^n,    (7.3)

where the local error τ^n = τ(t_n) is given by (7.1). The expression (7.3) reveals the source of the exponential growth in the error: in each time step the previous error is multiplied by a factor of (1 + kλ). For the case λ = -2100 and k = 10^{-3}, we have 1 + kλ = -1.1, and so we expect the local error introduced in step m to grow by a factor of (-1.1)^{n-m} by the end of n steps (recall (6.8)). After 2000 steps we expect the truncation error introduced in the first step to have grown by a factor of roughly (1.1)^{2000} ≈ 10^{82}, which is consistent with the error actually seen. Note that in Example 7.2 with λ = -10, we have 1 + kλ = 0.99, causing a decay in the effect of previous errors in each step. This explains why we got a reasonable result in Example 7.2, and in fact a better result than in Example 7.1, where 1 + kλ = 1.
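Examples 7.2 and 7.3 are easy to reproduce. The following Python sketch (our own illustration, not part of the original text; the function name is arbitrary) applies Euler's method to u'(t) = λ(u - cos t) - sin t with u(0) = 1 and compares the mild case λ = -10 with the stiff case λ = -2100 at the same step size k = 10^{-3}:

```python
import math

def euler_error(lam, k, T=2.0):
    """Euler's method for u'(t) = lam*(u - cos t) - sin t, u(0) = 1,
    whose exact solution is u(t) = cos t; returns the error at time T."""
    n = round(T / k)
    u, t = 1.0, 0.0
    for _ in range(n):
        u += k * (lam * (u - math.cos(t)) - math.sin(t))
        t += k
    return u - math.cos(T)

# Same step size, very different behavior:
#   |1 + k*lam| = 0.99 < 1 for lam = -10   (errors decay),
#   |1 + k*lam| = 1.1  > 1 for lam = -2100 (errors grow like 1.1^n).
err_mild = euler_error(-10.0, 1e-3)
err_stiff = euler_error(-2100.0, 1e-3)
```

Running this shows a tiny error in the first case and an astronomically large one in the second, exactly as the recursion (7.3) predicts.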

Table 7.1. Errors in the computed solution using Euler's method for Example 7.3, for different values of the time step k. Note the dramatic change in the behavior of the error for k < 2/2100 ≈ 0.00095: the errors drop from astronomically large values down to the order of 10^{-7} for the smallest step sizes tabulated. (The individual table entries are not recoverable from this transcription.)

Returning to the case λ = -2100, we expect to observe exponential growth in the error for any value of k greater than 2/2100 ≈ 0.00095, since for any k larger than this we have |1 + kλ| > 1. For smaller time steps |1 + kλ| < 1 and the effect of each local error decays exponentially with time rather than growing. This explains the dramatic change in the behavior of the error that we see as we cross the value k = 2/2100 in Table 7.1.

Note that the exponential growth of errors does not contradict zero-stability or convergence of the method in any way. The method does converge as k → 0. In fact the bound (6.2),

    |E^n| <= e^{|λ|T} T ||tau|| = O(k)  as k → 0,

that we used to prove convergence allows the possibility of exponential growth with time. The bound is valid for all k, but since Te^{|λ|T} = 2e^{4200} ≈ 10^{1824} while ||tau|| ≈ k/2, this bound does not guarantee any accuracy whatsoever in the solution until k is smaller than roughly 10^{-1824}. This is a good example of the fact that a mathematical convergence proof may be a far cry from what is needed in practice.

7.2 Absolute stability

To determine whether a numerical method will produce reasonable results with a given value of k > 0, we need a notion of stability that is different from zero-stability. There are a wide variety of other forms of stability that have been studied in various contexts. The one that is most basic and suggests itself from the above examples is absolute stability.
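The threshold 2/2100 can be checked directly. Here is a small Python experiment (ours, not from the text; names are arbitrary) that evaluates the final error just below and just above the critical step size, mirroring the jump seen in Table 7.1:

```python
import math

def final_error(lam, k, T=2.0):
    """Error |U^n - cos(t_n)| at the final step of Euler's method applied
    to u'(t) = lam*(u - cos t) - sin t with u(0) = 1."""
    n = round(T / k)
    u, t = 1.0, 0.0
    for _ in range(n):
        u += k * (lam * (u - math.cos(t)) - math.sin(t))
        t += k
    return abs(u - math.cos(n * k))

lam = -2100.0
k_crit = 2.0 / 2100.0          # |1 + k*lam| = 1 exactly at this step size

err_below = final_error(lam, 0.95 * k_crit)   # |1 + k*lam| = 0.9  (stable)
err_above = final_error(lam, 1.05 * k_crit)   # |1 + k*lam| = 1.1  (unstable)
```

The two step sizes differ by only ten percent, yet the computed errors differ by dozens of orders of magnitude.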
This notion is based on the linear test equation (6.3), although a study of the absolute stability of a method yields information that is typically directly useful in determining an appropriate time step in nonlinear problems as well; see Section 7.4.3. We can look at the simplest case of the test problem, in which g(t) = 0 and we have simply

    u'(t) = λ u(t).

Euler's method applied to this problem gives

    U^{n+1} = (1 + kλ) U^n,

and we say that this method is absolutely stable when |1 + kλ| <= 1; otherwise it is unstable. Note that there are two parameters k and λ, but only their product z = kλ matters. The method is stable whenever -2 <= z <= 0, and we say that the interval of absolute stability for Euler's method is [-2, 0].

It is more common to speak of the region of absolute stability as a region in the complex z-plane, allowing the possibility that λ is complex (of course the time step k should be real and positive). The region of absolute stability (or simply the stability region) for Euler's method is the disk of radius 1 centered at the point -1, since within this disk we have |1 + kλ| <= 1 (see Figure 7.1(a)).

Allowing λ to be complex comes from the fact that in practice we are usually solving a system of ordinary differential equations (ODEs). In the linear case it is the eigenvalues of the coefficient matrix that are important in determining stability. In the nonlinear case we typically linearize (see Section 7.4.3) and consider the eigenvalues of the Jacobian matrix. Hence λ represents a typical eigenvalue, and these may be complex even if the matrix is real. For some problems, looking at the eigenvalues is not sufficient (see Section 10.2.1, for example), but eigenanalysis is generally very revealing.

Figure 7.1. Stability regions for (a) forward Euler, (b) backward Euler, (c) trapezoidal, and (d) midpoint (a segment on the imaginary axis).
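The condition |1 + kλ| <= 1 is trivial to test numerically. A minimal Python sketch (ours, not from the text) encodes Euler's stability region and checks a few points of the disk described above:

```python
def euler_abs_stable(z: complex) -> bool:
    """Euler's method on u' = lam*u gives U^{n+1} = (1 + z) U^n with
    z = k*lam; it is absolutely stable iff |1 + z| <= 1."""
    return abs(1 + z) <= 1

# The stability region is the disk of radius 1 centered at z = -1:
assert euler_abs_stable(-1.0)        # center of the disk
assert euler_abs_stable(-2.0)        # left endpoint of the interval [-2, 0]
assert not euler_abs_stable(0.1)     # just to the right of the origin
assert not euler_abs_stable(1j)      # pure imaginary z lies outside
```

The last assertion foreshadows why Euler's method is a poor choice for problems with pure imaginary eigenvalues, such as the undamped pendulum discussed later in this chapter.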

7.3 Stability regions for linear multistep methods

For a general linear multistep method (LMM) of the form (5.44), the region of absolute stability is found by applying the method to u' = λu, obtaining

    Σ_{j=0}^{r} α_j U^{n+j} = kλ Σ_{j=0}^{r} β_j U^{n+j},

which can be rewritten as

    Σ_{j=0}^{r} (α_j - z β_j) U^{n+j} = 0.    (7.4)

Note again that it is only the product z = kλ that is important, not the values of k or λ separately, and that this is a dimensionless quantity since the decay rate λ has dimensions time^{-1}, while the time step has dimensions of time. This makes sense: if we change the units of time (say, from seconds to milliseconds), then the parameter λ will decrease by a factor of 1000 and we may be able to increase the numerical value of k by a factor of 1000 and still be stable. But then we also have to solve out to time 1000T instead of to time T, so we haven't really changed the numerical problem or the number of time steps required.

The recurrence (7.4) is a homogeneous linear difference equation of the same form considered in Section 6.4.1. The solution has the general form (6.26), where the ζ_j are now the roots of the characteristic polynomial Σ_{j=0}^{r} (α_j - z β_j) ζ^j. This polynomial is often called the stability polynomial and denoted by π(ζ; z). It is a polynomial in ζ, but its coefficients depend on the value of z. The stability polynomial can be expressed in terms of the characteristic polynomials for the LMM as

    π(ζ; z) = ρ(ζ) - z σ(ζ).    (7.5)

The LMM is absolutely stable for a particular value of z if errors introduced in one time step do not grow in future time steps. According to the theory of Section 6.4.1, this requires that the polynomial π(ζ; z) satisfy the root condition (6.34).

Definition 7.1. The region of absolute stability for the LMM (5.44) is the set of points z in the complex plane for which the polynomial π(ζ; z) satisfies the root condition (6.34).

Note that an LMM is zero-stable if and only if the origin z = 0 lies in the stability region.
Example 7.4. For Euler's method, π(ζ; z) = ζ - (1 + z), with the single root ζ1 = 1 + z. We have already seen that the stability region is the disk in Figure 7.1(a).

Example 7.5. For the backward Euler method (5.21),

    π(ζ; z) = (1 - z)ζ - 1

with root ζ1 = (1 - z)^{-1}. We have

    |(1 - z)^{-1}| <= 1   if and only if   |1 - z| >= 1,

so the stability region is the exterior of the disk of radius 1 centered at z = 1, as shown in Figure 7.1(b).

Example 7.6. For the trapezoidal method (5.22),

    π(ζ; z) = (1 - z/2)ζ - (1 + z/2),

with root

    ζ1 = (1 + z/2) / (1 - z/2).

This is a linear fractional transformation, and it can be shown that |ζ1| <= 1 if and only if Re(z) <= 0, where Re(z) is the real part. So the stability region is the left half-plane, as shown in Figure 7.1(c).

Example 7.7. For the midpoint method (5.23),

    π(ζ; z) = ζ² - 2zζ - 1.

The roots are ζ_{1,2} = z ± sqrt(z² + 1). It can be shown that if z is a pure imaginary number of the form z = iα with |α| < 1, then |ζ1| = |ζ2| = 1 and ζ1 ≠ ζ2, and hence the root condition is satisfied. For any other z the root condition is not satisfied. In particular, if z = ±i, then ζ1 = ζ2 is a repeated root of modulus 1. So the stability region consists only of the open interval from -i to i on the imaginary axis, as shown in Figure 7.1(d).

Since k is always real, this means the midpoint method is useful on the test problem u' = λu only if λ is pure imaginary. The method is not very useful for scalar problems, where λ is typically real, but the method is of great interest in some applications with systems of equations. For example, if the matrix is real but skew symmetric (A^T = -A), then the eigenvalues are pure imaginary. This situation arises naturally in the discretization of hyperbolic partial differential equations (PDEs), as discussed in Chapter 10.

Example 7.8. Figures 7.2 and 7.3 show the stability regions for the r-step Adams-Bashforth and Adams-Moulton methods for various values of r. For an r-step method the polynomial π(ζ; z) has degree r and there are r roots. Determining the values of z for which the root condition is satisfied does not appear simple. However, there is a simple technique called the boundary locus method that makes it possible to determine the regions shown in the figures.
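The root condition for π(ζ; z) = ρ(ζ) - zσ(ζ) can be checked numerically for any particular z. The sketch below (our own; the helper name and tolerances are arbitrary, and repeated roots are detected only up to a tolerance) verifies the midpoint-method conclusions of Example 7.7:

```python
import numpy as np

def root_condition(z, rho, sigma, tol=1e-8):
    """Check the root condition for the stability polynomial
    pi(zeta; z) = rho(zeta) - z*sigma(zeta); rho and sigma are
    coefficient lists, highest degree first."""
    zeta = np.roots(np.array(rho, complex) - z * np.array(sigma, complex))
    if np.any(np.abs(zeta) > 1 + tol):
        return False                      # a root lies outside the unit circle
    for i in range(len(zeta)):            # repeated roots of modulus 1 fail
        for j in range(i + 1, len(zeta)):
            if abs(zeta[i] - zeta[j]) < tol and abs(zeta[i]) > 1 - tol:
                return False
    return True

# Midpoint method: rho(zeta) = zeta^2 - 1, sigma(zeta) = 2*zeta
rho, sigma = [1, 0, -1], [0, 2, 0]
assert root_condition(0.5j, rho, sigma)       # inside the interval (-i, i)
assert not root_condition(-0.5, rho, sigma)   # real z < 0: one root leaves
assert not root_condition(0.5, rho, sigma)    # real z > 0 is also unstable
```

The same helper applied on a grid of z values is one crude way to shade stability regions like those in Figures 7.2 and 7.3.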
This is briefly described in Section 7.6.1.

Note that for many methods the shape of the stability region near the origin z = 0 is directly related to the accuracy of the method. Recall that the characteristic polynomial ρ(ζ) for a consistent LMM always has a principal root ζ1 = 1. It can be shown that for z near 0 the polynomial π(ζ; z) has a corresponding principal root with behavior

    ζ1(z) = e^z + O(z^{p+1})  as z → 0    (7.6)

Figure 7.2. Stability regions for the (a) 2-step, (b) 3-step, (c) 4-step, and (d) 5-step Adams-Bashforth methods. The shaded region just to the left of the origin is the region of absolute stability. See Section 7.6.1 for a discussion of the other loops seen in figures (c) and (d).

if the method is pth order accurate. We can see this in the examples above for one-step methods; e.g., for Euler's method ζ1(z) = 1 + z = e^z + O(z²). It is this root that gives the appropriate behavior U^{n+1} ≈ e^z U^n over a time step. Since this root is on the unit circle at the origin z = 0, and since |e^z| < 1 only when Re(z) < 0, we expect the principal root to move inside the unit circle for small z with Re(z) < 0 and outside the unit circle for small z with Re(z) > 0. This suggests that if we draw a small circle around the origin, then the left half of this circle will lie inside the stability region (unless some other root moves outside, as happens for the midpoint method), while the right half of the circle will lie outside the stability region. Looking at the stability regions in Figure 7.1 we see that this is indeed true for all the methods except the midpoint method. Moreover, the higher the order of accuracy, the larger in general the circle around the origin where this will approximately hold, and so the boundary of the stability region tends to align with the imaginary axis farther and farther from the origin as the order of the method increases, as observed in Figures 7.2 and 7.3. (The trapezoidal method is a bit of an anomaly, as the boundary of its stability region exactly agrees with that of e^z for all z.)
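The claim ζ1(z) = e^z + O(z²) for Euler's method is easy to confirm numerically. In this sketch (ours, not from the text) the error |(1 + z) - e^z| divided by |z|² approaches the constant 1/2, the leading Taylor coefficient, as z → 0 along both real and imaginary directions:

```python
import cmath

def principal_root_error(z):
    """Distance between Euler's principal root 1 + z and e^z."""
    return abs((1 + z) - cmath.exp(z))

# |e^z - (1 + z)| = |z|^2/2 + O(|z|^3), so these ratios approach 0.5:
ratios = [principal_root_error(z) / abs(z) ** 2
          for z in (0.1, 0.01, 0.1j, 0.01j)]
```

The same experiment with R(z) = 1 + z + z²/2 and |z|³ in the denominator would exhibit the O(z³) error of a second order method.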

Figure 7.3. Stability regions for the (a) 2-step, (b) 3-step, (c) 4-step, and (d) 5-step Adams-Moulton methods. See Section 7.6 for a discussion of ways in which stability regions can be determined and plotted.

7.4 Systems of ordinary differential equations

So far we have examined stability theory only in the context of a scalar differential equation u'(t) = f(u(t)) for a scalar function u(t). In this section we will look at how this stability theory carries over to systems of m differential equations, where u(t) ∈ R^m. For a linear system u' = Au, where A is an m × m matrix, the solution can be written as u(t) = e^{At} u(0) and the behavior is largely governed by the eigenvalues of A. A necessary condition for stability is that kλ be in the stability region for each eigenvalue λ of A. For general nonlinear systems u' = f(u), the theory is more complicated, but a good rule of thumb is that kλ should be in the stability region for each eigenvalue λ of the Jacobian matrix f'(u). This may not be true if the Jacobian is rapidly changing with time, or even for constant coefficient linear problems in some highly nonnormal cases (see [47] and Section 10.2.1 for an example), but most of the time eigenanalysis is surprisingly effective.

Before discussing this theory further we will review the setting of chemical kinetics, a field where the solution of systems of ODEs is very important, and where the eigenvalues of the Jacobian matrix often have a physical interpretation in terms of reaction rates.

7.4.1 Chemical kinetics

Let A and B represent chemical compounds and consider a reaction of the form

    A --K1--> B.

This represents a reaction in which A is transformed into B with rate K1 > 0. If we let u1 represent the concentration of A and u2 represent the concentration of B (often denoted by u1 = [A], u2 = [B]), then the ODEs for u1 and u2 are

    u1' = -K1 u1,
    u2' = K1 u1.

If there is also a reverse reaction at rate K2, we write

    A <--K1, K2--> B,

and the equations then become

    u1' = -K1 u1 + K2 u2,    (7.7)
    u2' = K1 u1 - K2 u2.

More typically, reactions involve combinations of two or more compounds, e.g.,

    A + B <--K1, K2--> AB.

Since A and B must combine to form AB, the rate of the forward reaction is proportional to the product of the concentrations u1 and u2, while the backward reaction is proportional to u3 = [AB]. The equations become

    u1' = -K1 u1 u2 + K2 u3,
    u2' = -K1 u1 u2 + K2 u3,    (7.8)
    u3' = K1 u1 u2 - K2 u3.

Note that this is a nonlinear system of equations, while (7.7) is linear. Often several reactions take place simultaneously, e.g.,

    A + B <--K1, K2--> AB,
    2A + C <--K3, K4--> A2C.
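The system (7.8) is easy to integrate with forward Euler, and it has linear invariants that make a useful sanity check: [A] + [AB] and [B] + [AB] are conserved by the exact dynamics, and Euler's method preserves them exactly (up to roundoff) because they are linear in u. A Python sketch (ours, not from the text; parameter values are arbitrary):

```python
def kinetics_rhs(u, K1, K2):
    """Right-hand side of the A + B <-> AB system (7.8); u = ([A],[B],[AB])."""
    u1, u2, u3 = u
    r = K1 * u1 * u2 - K2 * u3          # net forward reaction rate
    return [-r, -r, r]

def euler_step(u, k, K1, K2):
    f = kinetics_rhs(u, K1, K2)
    return [ui + k * fi for ui, fi in zip(u, f)]

u = [1.0, 2.0, 0.0]
for _ in range(1000):                    # integrate to t = 1 with k = 1e-3
    u = euler_step(u, 1e-3, K1=3.0, K2=1.0)

# Linear invariants are preserved by forward Euler up to roundoff:
assert abs(u[0] + u[2] - 1.0) < 1e-10
assert abs(u[1] + u[2] - 2.0) < 1e-10
```

Conservation of linear invariants holds for any consistent LMM, so a drifting invariant is usually a sign of a coding error rather than of discretization error.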

180 rjlfdm 2007/6/ page Chapter 7. Absolute Stability for Ordinary Differential Equations If we now let u 4 D ŒC ; u 5 D ŒA 2 C, then the equations are u 0 D K u u 2 C K 2 u 3 2K 3 u 2 u 4 C 2K 4 u 5 ; u 0 2 D K u u 2 C K 2 u 3 ; u 0 3 D K u u 2 K 2 u 3 ; (7.9) u 0 4 D K 3u 2 u 4 C K 4 u 5 ; u 0 5 D K 3u 2 u 4 K 4 u 5 : Interesting kinetics problems can give rise to very large systems of ODEs. Frequently the rate constants K ; K 2 ; ::: are of vastly different orders of magnitude. This leads to stiff systems of equations, as discussed in Chapter 8. Example 7.9. One particularly simple system arises from the decay process A K! B K 2! C: Let u D ŒA ; u 2 D ŒB ; u 3 D ŒC. Then the system is linear and has the form u 0 D Au, where 2 K A D 4 K K : (7.0) 0 K 2 0 Note that the eigenvalues are K ; K 2, and 0. The general solution thus has the form (assuming K K 2 ) u j.t/ D c j e K t C c j2 e K 2t C c j3 : In fact, on physical grounds (since A decays into B which decays into C ), we expect that u simply decays to 0 exponentially, u.t/ D e K t u.0/ (which clearly satisfies the first ODE), and also that u 2 ultimately decays to 0 (although it may first grow if K is larger than K 2 ), while u 3 grows and asymptotically approaches the value u.0/ C u 2.0/ C u 3.0/ as t!. A typical solution for K D 3 and K 2 D with u.0/ D 3; u 2.0/ D 4, and u 3.0/ D 2 is shown in Figure Linear systems Consider a linear system u 0 D Au, where A is a constant m m matrix, and suppose for simplicity that A is diagonalizable, which means that it has a complete set of m linearly independent eigenvectors r p satisfying Ar p D p r p for p D ; 2; :::; m. Let R D Œr ; r 2 ;:::;r m be the matrix of eigenvectors and ƒ D diag. ; 2 ; :::; m / be the diagonal matrix of eigenvectors. Then we have A D RƒR and ƒ D R AR: Now let v.t/ D R u.t/. Multiplying u 0 D Au by R on both sides and introducing I D RR gives the equivalent equations R u 0.t/ D.R AR/.R u.t//;

Figure 7.4. Sample solution (concentrations of A, B, and C against time) for the kinetics problem in Example 7.9.

i.e.,

    v'(t) = Λ v(t).

This is a diagonal system of equations that decouples into m independent scalar equations, one for each component of v. The pth such equation is

    v_p'(t) = λ_p v_p(t).

A linear multistep method applied to the linear ODE can also be decoupled in the same way. For example, if we apply Euler's method, we have

    U^{n+1} = U^n + k A U^n,

which, by the same transformation, can be rewritten as

    V^{n+1} = V^n + k Λ V^n,

where V^n = R^{-1} U^n. This decouples into m independent numerical methods, one for each component of V^n. These take the form

    V_p^{n+1} = (1 + k λ_p) V_p^n.

We can recover U^n from V^n using U^n = R V^n. For the overall method to be stable, each of the scalar problems must be stable, and this clearly requires that k λ_p be in the stability region of Euler's method for all values of p. The same technique can be used more generally to show that an LMM can be absolutely stable only if k λ_p is in the stability region of the method for each eigenvalue λ_p of the matrix A.
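The decoupling argument can be verified directly: one Euler step on the coupled system equals the change of basis applied to m independent scalar Euler steps. A Python sketch (ours, not from the text; the matrix is an arbitrary diagonalizable example):

```python
import numpy as np

# An arbitrary symmetric (hence diagonalizable) test matrix,
# with eigenvalues -2 and -4:
A = np.array([[-3.0, 1.0],
              [ 1.0, -3.0]])
lams, R = np.linalg.eig(A)              # A = R diag(lams) R^{-1}
k = 0.1
U = np.array([1.0, 2.0])

U_coupled = U + k * (A @ U)             # Euler step on the coupled system

V = np.linalg.solve(R, U)               # V = R^{-1} U
V_next = (1 + k * lams) * V             # m independent scalar Euler steps
U_decoupled = R @ V_next                # transform back
```

Since A = R Λ R^{-1}, we have U + kAU = R(I + kΛ)R^{-1}U, so the two update paths agree to roundoff.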

Example 7.10. Consider the linear kinetics problem with A given by (7.10). Since this matrix is lower triangular, the eigenvalues are the diagonal elements λ1 = -K1, λ2 = -K2, and λ3 = 0. The eigenvalues are all real, and we expect Euler's method to be stable provided k max(K1, K2) <= 2. Numerical experiments easily confirm that this is exactly correct: when this condition is satisfied the numerical solution behaves well, and if k is slightly larger there is explosive growth of the error.

Example 7.11. Consider a linearized model for a swinging pendulum, this time with frictional forces added,

    θ''(t) = -a θ(t) - b θ'(t),

which is valid for small values of θ. If we introduce u1 = θ and u2 = θ', then we obtain a first order system u' = Au with

    A = [  0   1
          -a  -b ].    (7.11)

The eigenvalues of this matrix are λ = (-b ± sqrt(b² - 4a))/2. Note in particular that if b = 0 (no damping), then λ = ±i sqrt(a) are pure imaginary. For b > 0 the eigenvalues shift into the left half-plane. In the undamped case the midpoint method would be a reasonable choice, whereas Euler's method might be expected to have difficulties. In the damped case the opposite is true.

7.4.3 Nonlinear systems

Now consider a nonlinear system u' = f(u). The stability analysis we have developed for the linear problem does not apply directly to this system. However, if the solution is slowly varying relative to the time step, then over a small time interval we would expect a linearized approximation to give a good indication of what is happening. Suppose the solution is near some value ū, and let v(t) = u(t) - ū. Then

    v'(t) = u'(t) = f(u(t)) = f(v(t) + ū).

Taylor series expansion about ū (assuming v is small) gives

    v'(t) = f(ū) + f'(ū) v(t) + O(||v||²).

Dropping the O(||v||²) terms gives a linear system

    v'(t) = A v(t) + b,

where A = f'(ū) is the Jacobian matrix evaluated at ū and b = f(ū).
Examining how the numerical method behaves on this linear system (for each relevant value of ū) gives a good indication of how it will behave on the nonlinear system.

Example 7.12. Consider the kinetics problem (7.8). The Jacobian matrix is

    A = [ -K1 u2   -K1 u1    K2
          -K1 u2   -K1 u1    K2
           K1 u2    K1 u1   -K2 ]

with eigenvalues λ1 = -K1(u1 + u2) - K2 and λ2 = λ3 = 0. Since u1 + u2 is simply the total quantity of species A and B present, this can be bounded for all time in terms of the initial data. (For example, we certainly have u1(t) + u2(t) <= u1(0) + u2(0) + 2 u3(0).) So we can determine the possible range of λ1 along the negative real axis, and hence how small k must be chosen so that kλ1 stays within the region of absolute stability.

7.5 Practical choice of step size

As the examples at the beginning of this chapter illustrated, obtaining computed results that are within some error tolerance requires two conditions:

1. The time step k must be small enough that the local truncation error is acceptably small. This gives a constraint of the form k <= k_acc, where k_acc depends on several things: what method is being used, which determines the expansion of the local truncation error; how smooth the solution is, which determines how large the high order derivatives occurring in this expansion are; and what accuracy is required.

2. The time step k must be small enough that the method is absolutely stable on this particular problem. This gives a constraint of the form k <= k_stab that depends on the magnitude and location of the eigenvalues of the Jacobian matrix f'(u).

Typically we would like to choose our time step based on accuracy considerations, so we hope k_stab > k_acc. For a given method and problem, we would like to choose k so that the local error in each step is sufficiently small that the accumulated error will satisfy our error tolerance, assuming some reasonable growth of errors. If the errors grow exponentially with time because the method is not absolutely stable, however, then we would have to use a smaller time step to get useful results.
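The eigenvalue formula of Example 7.12 can be confirmed numerically. In this sketch (ours, not from the text; the state and rate constants are arbitrary), the Jacobian of (7.8) has one negative eigenvalue -K1(u1 + u2) - K2 and a double eigenvalue 0, because its rows are linearly dependent:

```python
import numpy as np

def kinetics_jacobian(u, K1, K2):
    """Jacobian f'(u) of the A + B <-> AB system (7.8) at the state u."""
    u1, u2, _ = u
    return np.array([[-K1 * u2, -K1 * u1,  K2],
                     [-K1 * u2, -K1 * u1,  K2],
                     [ K1 * u2,  K1 * u1, -K2]])

K1, K2 = 2.0, 0.5
u = np.array([1.0, 3.0, 0.25])
lams = np.linalg.eigvals(kinetics_jacobian(u, K1, K2))

lam1 = -K1 * (u[0] + u[1]) - K2     # predicted nonzero eigenvalue
assert np.isclose(min(lams.real), lam1)
assert np.sum(np.abs(lams) < 1e-8) == 2   # zero is a double eigenvalue
```

Repeating this over the reachable range of u1 + u2 bounds λ1 and hence gives the step size restriction k|λ1| <= 2 for Euler's method on this problem.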
If stability considerations force us to use a much smaller time step than the local truncation error indicates should be needed, then this particular method is probably not optimal for this problem. This happens, for example, if we try to use an explicit method on a stiff problem, as discussed in Chapter 8, for which special methods have been developed.

As already noted in Chapter 5, most software for solving initial value problems does a very good job of choosing time steps dynamically as the computation proceeds, based on the observed behavior of the solution and estimates of the local error. If a time step is chosen for which the method is unstable, then the local error estimate will typically indicate a large error and the step size will be automatically reduced. Details of the shape of the stability region and estimates of the eigenvalues are typically not used in the course of a computation to choose time steps. However, the considerations of this chapter play a big role in determining whether a given method or class of methods is suitable for a particular problem. We will also see in Chapters 9 and 10 that a knowledge of the stability regions of ODE methods is necessary in order to develop effective methods for solving time-dependent PDEs.

7.6 Plotting stability regions

7.6.1 The boundary locus method for linear multistep methods

A point z ∈ C is in the stability region S of an LMM if the stability polynomial π(ζ; z) satisfies the root condition for this value of z. It follows that if z is on the boundary of the stability region, then π(ζ; z) must have at least one root ζ_j with magnitude exactly equal to 1. This ζ_j is of the form ζ_j = e^{iθ} for some value of θ in the interval [0, 2π]. (Beware of the two different uses of π here.) Since ζ_j is a root of π(ζ; z), we have π(e^{iθ}; z) = 0 for this particular combination of z and θ. Recalling the definition of π, this gives

    ρ(e^{iθ}) - z σ(e^{iθ}) = 0    (7.12)

and hence

    z = ρ(e^{iθ}) / σ(e^{iθ}).

If we know θ, then we can find z from this. Since every point z on the boundary of S must be of this form for some value of θ in [0, 2π], we can simply plot the parametrized curve

    z̃(θ) = ρ(e^{iθ}) / σ(e^{iθ})    (7.13)

for 0 <= θ <= 2π to find the locus of all points which are potentially on the boundary of S. For simple methods this yields the region S directly.

Example 7.13. For Euler's method we have ρ(ζ) = ζ - 1 and σ(ζ) = 1, and so

    z̃(θ) = e^{iθ} - 1.

This function maps [0, 2π] to the unit circle centered at z = -1, which is exactly the boundary of S, as shown in Figure 7.1(a).

To determine which side of this curve is the interior of S, we need only evaluate the roots of π(ζ; z) at some random point z on one side or the other and see if the polynomial satisfies the root condition. Alternatively, as noted earlier, most methods are stable just to the left of the origin on the negative real axis and unstable just to the right of the origin on the positive real axis. This is often enough information to determine where the stability region lies relative to the boundary locus.

For some methods the boundary locus may cross itself. In this case we typically find that at most one of the regions cut out of the plane corresponds to the stability region.
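The locus (7.13) is straightforward to evaluate numerically. A Python sketch (ours, not from the text; helper names are arbitrary) traces it for Euler's method and confirms that every point lies on the circle |z + 1| = 1:

```python
import cmath
import math

def boundary_locus(rho, sigma, thetas):
    """Boundary locus z(theta) = rho(e^{i theta}) / sigma(e^{i theta})
    for an LMM; rho and sigma are coefficient lists, highest degree first."""
    def horner(c, x):
        acc = 0j
        for a in c:
            acc = acc * x + a
        return acc
    return [horner(rho, cmath.exp(1j * t)) / horner(sigma, cmath.exp(1j * t))
            for t in thetas]

# Euler: rho(zeta) = zeta - 1, sigma(zeta) = 1
thetas = [2 * math.pi * j / 100 for j in range(100)]
locus = boundary_locus([1, -1], [1], thetas)
for z in locus:
    assert abs(abs(z + 1) - 1.0) < 1e-12    # the circle of Figure 7.1(a)
```

The same function applied to the Adams-Bashforth coefficients reproduces the self-crossing loops visible in Figures 7.2(c) and 7.2(d).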
We can determine which region is S by evaluating the roots of π(ζ; z) at some convenient point z within each region.

Example 7.14. The five-step Adams-Bashforth method gives the boundary locus seen in Figure 7.2(d). The stability region is the small semicircular region to the left of the

origin, where all roots are inside the unit circle. As we cross the boundary of this region one root moves outside. As we cross the boundary locus again into one of the loops in the right half-plane another root moves outside, and the method is still unstable in these regions (two roots are outside the unit circle).

7.6.2 Plotting stability regions of one-step methods

If we apply a one-step method to the test problem u' = λu, we typically obtain an expression of the form

    U^{n+1} = R(z) U^n,    (7.14)

where R(z) is some function of z = kλ (typically a polynomial for an explicit method or a rational function for an implicit method). If the method is consistent, then R(z) will be an approximation to e^z near z = 0, and if it is pth order accurate, then

    R(z) - e^z = O(z^{p+1})  as z → 0.    (7.15)

Example 7.15. The pth order Taylor series method, when applied to u' = λu, gives (since the jth derivative of u is u^{(j)} = λ^j u)

    U^{n+1} = U^n + kλ U^n + (1/2) k² λ² U^n + ... + (1/p!) k^p λ^p U^n
            = (1 + z + z²/2 + ... + z^p/p!) U^n.    (7.16)

In this case R(z) is the polynomial obtained from the first p + 1 terms of the Taylor series for e^z.

Example 7.16. If the fourth order Runge-Kutta method (5.33) is applied to u' = λu, we find that

    R(z) = 1 + z + z²/2 + z³/6 + z⁴/24,    (7.17)

which agrees with R(z) for the fourth order Taylor series method.

Example 7.17. For the trapezoidal method (5.22),

    R(z) = (1 + z/2) / (1 - z/2)    (7.18)

is a rational approximation to e^z with error O(z³) (the method is second order accurate). Note that this is also the root of the linear stability polynomial that we found by viewing this as an LMM in Example 7.6.

Example 7.18. The TR-BDF2 method (5.37) has

    R(z) = (1 + (5/12) z) / (1 - (7/12) z + (1/12) z²).    (7.19)

This agrees with e^z to O(z³) near z = 0.
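The stability function (7.17) can be obtained mechanically by running the RK4 stages on u' = λu. The Python sketch below (ours, not from the text) does exactly that and checks the agreement with the fourth order Taylor polynomial, plus two points bracketing the left end of the RK4 real stability interval:

```python
def rk4_R(z):
    """Amplification factor of the classical fourth order Runge-Kutta
    method applied to u' = lam*u, with z = k*lam and U^n = 1."""
    k1 = z
    k2 = z * (1 + k1 / 2)
    k3 = z * (1 + k2 / 2)
    k4 = z * (1 + k3)
    return 1 + (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Agrees with the first five terms of the Taylor series for e^z:
for z in (0.3, -1.0, 0.5j):
    taylor = 1 + z + z**2 / 2 + z**3 / 6 + z**4 / 24
    assert abs(rk4_R(z) - taylor) < 1e-12

# The real stability interval ends a bit beyond z = -2.7:
assert abs(rk4_R(-2.0)) < 1
assert abs(rk4_R(-3.0)) > 1
```

Sampling |rk4_R(z)| on a grid and contouring at the value 1 reproduces the right panel of Figure 7.5.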

From the definition of absolute stability given at the beginning of this chapter, we see that the region of absolute stability for a one-step method is simply

    S = { z ∈ C : |R(z)| <= 1 }.    (7.20)

This follows from the fact that iterating a one-step method on u' = λu gives |U^n| = |R(z)|^n |U^0|, and this will be uniformly bounded in n if z lies in S.

One way to attempt to compute S would be to compute the boundary locus as described in Section 7.6.1, by setting R(z) = e^{iθ} and solving for z as θ varies. This would give the set of z for which |R(z)| = 1, the boundary of S. There's a problem with this, however: when R(z) is a higher order polynomial or rational function, there will be several solutions z for each θ, and it is not clear how to connect these to generate the proper curve.

Another approach can be taken graphically that is more brute force, but effective. If we have a reasonable idea of what region of the complex z-plane contains the boundary of S, we can sample |R(z)| on a fine grid of points in this region, approximate the level set where this function has the value 1, and plot this as the boundary of S. This is easily done with a contour plotter, for example, using the contour command in MATLAB. Or we can simply color each point depending on whether it is inside S or outside. For example, Figure 7.5 shows the stability regions for the Taylor series methods of orders 2 and 4, for which

    R(z) = 1 + z + z²/2  and  R(z) = 1 + z + z²/2 + z³/6 + z⁴/24,    (7.21)

respectively. These are also the stability regions of the second order Runge-Kutta method (5.30) and the fourth order accurate Runge-Kutta method (5.33), which are easily seen to have the same stability functions. Note that for a one-step method of order p, the rational function R(z) must agree with e^z to O(z^{p+1}).
As for LMMs, we thus expect that points very close to the origin will lie in the stability region S for Re(z) < 0 and outside of S for Re(z) > 0.

7.7 Relative stability regions and order stars

Recall that for a one-step method the stability region S (more properly called the region of absolute stability) is the region S = { z ∈ C : |R(z)| <= 1 }, where U^{n+1} = R(z) U^n is the relation between U^n and U^{n+1} when the method is applied to the test problem u' = λu. For z = kλ in the stability region the numerical solution does not grow, and hence the method is absolutely stable in the sense that past errors will not grow in later time steps. On the other hand, the true solution to this problem, u(t) = e^{λt} u(0), is itself exponentially growing or decaying. One might argue that if u(t) is itself decaying, then it isn't good enough to simply have the past errors decaying too; they should be decaying at a faster rate. Or conversely, if the true solution is growing exponentially, then perhaps it is fine for the error also to be growing, as long as it is not growing faster. This suggests defining the region of relative stability as the set of z ∈ C for which |R(z)| <= |e^z|. In fact this idea has not proved to be particularly useful in terms of judging

Figure 7.5. Stability regions for the Taylor series methods of order 2 (left) and 4 (right).

the practical stability of a method for finite-size time steps; absolute stability is the more useful concept in this regard. Relative stability regions also proved hard to plot in the days before good computer graphics, and so they were not studied extensively. However, a pivotal 1978 paper by Wanner, Hairer, and Nørsett [99] showed that these regions are very useful in proving certain types of theorems about the relation between stability and the attainable order of accuracy for broad classes of methods.

Rather than speaking in terms of regions of relative stability, the modern terminology concerns the order star of a rational function R(z), which is the set of three regions (A_-, A_0, A_+):

    A_- = { z ∈ C : |R(z)| < |e^z| } = { z ∈ C : |e^{-z} R(z)| < 1 },
    A_0 = { z ∈ C : |R(z)| = |e^z| } = { z ∈ C : |e^{-z} R(z)| = 1 },    (7.22)
    A_+ = { z ∈ C : |R(z)| > |e^z| } = { z ∈ C : |e^{-z} R(z)| > 1 }.

These sets turn out to be much more strange looking than regions of absolute stability. As their name implies, they have a star-like quality, as seen, for example, in Figure 7.6, which shows the order stars for the same two Taylor polynomials (7.21), and Figure 7.7, which shows the order stars for two implicit methods. In each case the shaded region is A_+, while the white region is A_-, and the boundary between them is A_0. Their behavior near the origin is directly tied to the order of accuracy of the method, i.e., the degree to which R(z) matches e^z at the origin. If R(z) = e^z + C z^{p+1} + higher order terms, then since e^{-z} ≈ 1 near the origin,

    e^{-z} R(z) ≈ 1 + C z^{p+1}.    (7.23)

As z traces out a small circle around the origin (say, z = δ e^{2πiθ} for some small δ), the function z^{p+1} = δ^{p+1} e^{2π(p+1)iθ} goes around a smaller circle about the origin p + 1 times, and hence crosses the imaginary axis 2(p + 1) times. Each of these crossings corresponds to z moving across A_0.
So in a disk very close to the origin the order star must consist of p + 1 wedge-like sectors of A_+ separated by p + 1 sectors of A_-. This is apparent in Figures 7.6 and 7.7.
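The classification (7.22) is easy to evaluate pointwise. A Python sketch (ours, not from the text; names and tolerance are arbitrary) classifies points for the second order Taylor method and shows the sign alternation across the origin that produces the sectors:

```python
import cmath

def order_star_class(R, z, tol=1e-12):
    """Return -1 for A_-, 0 for A_0, +1 for A_+, comparing |R(z)| to |e^z|."""
    d = abs(R(z)) - abs(cmath.exp(z))
    if abs(d) <= tol:
        return 0
    return 1 if d > 0 else -1

R2 = lambda z: 1 + z + z * z / 2       # second order Taylor method

# Near the origin e^z - R2(z) = z^3/6 + ..., so the sign flips between
# the positive and negative real axes:
assert order_star_class(R2, 0.1) == -1     # |R2| < |e^z| on the right
assert order_star_class(R2, -0.1) == 1     # |R2| > |e^z| on the left
assert order_star_class(R2, 0.0) == 0      # the origin lies on A_0
```

Evaluating this classifier on a fine grid and coloring by the result reproduces the p + 1 = 3 shaded wedges of Figure 7.6(a).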

Figure 7.6. Order stars for the Taylor series methods of order (a) 2 and (b) 4.

Figure 7.7. Order stars for two A-stable implicit methods: (a) the TR-BDF2 method (5.37) with R(z) given by (7.19), and (b) the fifth-order accurate Radau5 method [44], for which R(z) is a rational function with degree 2 in the numerator and 3 in the denominator.

It can also be shown that each bounded finger of A_- contains at least one root of the rational function R(z), and each bounded finger of A_+ contains at least one pole. (There are no poles for an explicit method; see Figure 7.6.) Moreover, certain stability properties of the method can be related to the geometry of the order star, facilitating the proof of some barrier theorems on the possible accuracy that might be obtained. This is just a hint of the sort of question that can be tackled with order stars. For a better introduction to their power and beauty, see, for example, [44], [5], [98].

Chapter 8

Stiff Ordinary Differential Equations

The problem of stiffness leads to computational difficulty in many practical problems. The classic example is the case of a stiff ordinary differential equation (ODE), which we will examine in this chapter. In general a problem is called stiff if, roughly speaking, we are attempting to compute a particular solution that is smooth and slowly varying (relative to the time interval of the computation), but in a context where the nearby solution curves are much more rapidly varying. In other words, if we perturb the solution slightly at any time, the resulting solution curve through the perturbed data has rapid variation. Typically this takes the form of a short-lived transient response that moves the solution back toward a smooth solution.

Example 8.1. Consider the ODE (7.2) from the previous chapter,

    u'(t) = λ(u − cos t) − sin t.      (8.1)

One particular solution is the function u(t) = cos t, and this is the solution with the initial data u(0) = 1 considered previously. This smooth function is a solution for any value of λ. If we consider initial data of the form u(t₀) = η that does not lie on this curve, then the solution through this point is a different function, of course. However, if λ < 0 (or Re(λ) < 0 more generally), this function approaches cos t exponentially quickly, with decay rate λ. It is easy to verify that the solution is

    u(t) = e^{λ(t − t₀)} (η − cos t₀) + cos t.      (8.2)

Figure 8.1 shows a number of different solution curves for this equation with different choices of t₀ and η, with the fairly modest value λ = −1. Figure 8.1(b) shows the corresponding solution curves when λ = −10.

In this scalar example, when we perturb the solution at some point it quickly relaxes toward the particular solution u(t) = cos t. In other stiff problems the solution might move quickly toward some different smooth solution, as seen in the next example.

Example 8.2. Consider the kinetics model A → B → C developed in Example 7.9.
The system of equations is given by (7.10). Suppose that K₁ ≫ K₂, so that a typical solution appears as in Figure 8.2(a). (Here K₁ = 20 and K₂ = 1. Compare this to

Chapter 8. Stiff Ordinary Differential Equations

Figure 8.1. Solution curves for the ODE (8.1) for various initial values. (a) With λ = −1. (b) With λ = −10 and the same set of initial values.

Figure 8.2. Solution curves for the kinetics problem in Example 7.9 with K₁ = 20 and K₂ = 1. In (b) a perturbation has been made by adding one unit of species A at time t = 1. Figure 8.3 shows similar solutions for the case K₁ = 10⁶.

Figure 7.4.) Now suppose at time t = 1 we perturb the system by adding more of species A. Then the solution behaves as shown in Figure 8.2(b). The additional A introduced is rapidly converted into B (the fast transient response) and then slowly from B into C. After the rapid transient the solution is again smooth, although it differs from the original solution since the final asymptotic value of C must be higher than before by the same magnitude as the amount of A introduced.

8.1 Numerical difficulties

Stiffness causes numerical difficulties because any finite difference method is constantly introducing errors. The local truncation error acts as a perturbation to the system that moves us away from the smooth solution we are trying to compute. Why does this cause more difficulty in a stiff system than in other systems? At first glance it seems like the stiffness might work to our advantage. If we are trying to compute the solution u(t) = cos t to the ODE (8.1) with initial data u(0) = 1, for example, then the fact that any errors introduced decay exponentially should help us. The true solution is very robust and the solution is almost completely insensitive to errors made in the past. In fact, this stability of the true

Characterizations of stiffness

solution does help us, as long as the numerical method is also stable. (Recall that the results in Example 7.2 were much better than those in Example 7.1.) The difficulty arises from the fact that many numerical methods, including all explicit methods, are unstable (in the sense of absolute stability) unless the time step is small relative to the time scale of the rapid transient, which in a stiff problem is much smaller than the time scale of the smooth solution we are trying to compute. In the terminology of Section 7.5, this means that k_stab ≪ k_acc. Although the true solution is smooth and it seems that a reasonably large time step would be appropriate, the numerical method must always deal with the rapid transients introduced by truncation error in every time step and may need a very small time step to do so stably.

8.2 Characterizations of stiffness

A stiff ODE can be characterized by the property that f'(u) is much larger (in absolute value or norm) than u'(t). The latter quantity measures the smoothness of the solution being computed, while f'(u) measures how rapidly f varies as we move away from this particular solution. Note that stiff problems typically have large Lipschitz constants too.

For systems of ODEs, stiffness is sometimes defined in terms of the stiffness ratio of the system, which is the ratio

    max_p |λ_p| / min_p |λ_p|      (8.3)

over all eigenvalues λ_p of the Jacobian matrix f'(u). If this is large, then a large range of time scales is present in the problem, a necessary component for stiffness to arise. While this is often a useful quantity, one should not rely entirely on this measure to determine whether a problem is stiff. For one thing, it is possible even for a scalar problem to be stiff (as we have seen in Example 8.1), although for a scalar problem the stiffness ratio is always 1 since there is only one eigenvalue. Still, more than one time scale can be present. In (8.1)
the fast time scale is determined by λ, the eigenvalue, and the slow time scale is determined by the inhomogeneous term sin(t). For systems of equations there also may be additional time scales arising from inhomogeneous forcing terms or other time-dependent coefficients that are distinct from the scales imposed by the eigenvalues. We also know that for highly nonnormal matrices the eigenvalues don't always tell the full story (see Section D.4). Often they give adequate guidance, but there are examples where the problem is more stiff than (8.3) would suggest. An example arises when certain spectral approximations to spatial derivatives are used in discretizing hyperbolic equations; see Section 10.3.

On the other hand, it is also important to note that a system of ODEs which has a large stiffness ratio is not necessarily stiff! If the eigenvalue with large amplitude lies close to the imaginary axis, then it leads to highly oscillatory behavior in the solution rather than rapid damping. If the solution is rapidly oscillating, then it will probably be necessary to take small time steps for accuracy reasons, and k_acc may be roughly the same magnitude as k_stab even for explicit methods, at least with the sort of methods discussed here. (Special methods for highly oscillatory problems have been developed that allow one to take larger time steps; see, e.g., [50], [74].)
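To make the stiffness ratio (8.3) concrete, here is a small sketch (not from the text) for the linear kinetics system A → B → C of Example 8.2, whose Jacobian is constant; the ratio is taken over the nonzero eigenvalues, since conservation of total mass contributes a zero eigenvalue.

```python
import numpy as np

K1, K2 = 20.0, 1.0   # rate constants from Example 8.2

# Jacobian of the linear kinetics system A -> B -> C:
#   u1' = -K1 u1,  u2' = K1 u1 - K2 u2,  u3' = K2 u2
J = np.array([[-K1, 0.0, 0.0],
              [K1, -K2, 0.0],
              [0.0,  K2, 0.0]])

lam = np.linalg.eigvals(J)
nonzero = np.abs(lam[np.abs(lam) > 1e-12])

ratio = nonzero.max() / nonzero.min()   # stiffness ratio (8.3)
print(ratio)   # 20.0 here; with K1 = 1e6 it would be 1e6
```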

Chapter 8. Stiff Ordinary Differential Equations

Figure 8.3. Solution curves for the kinetics problem in Example 7.9 with K₁ = 10⁶ and K₂ = 1. In (b) a perturbation has been made by adding one unit of species A at time t = 1.

Finally, note that a particular problem may be stiff over some time intervals and nonstiff elsewhere. In particular, if we are computing a solution that has a rapid transient, such as the kinetics problem shown in Figure 8.3(a), then the problem is not stiff over the initial transient period where the true solution is as rapidly varying as nearby solution curves. Only for times greater than 10⁻⁶ or so does the problem become stiff, once the desired solution curve is much smoother than nearby curves. For the problem shown in Figure 8.3(b), there is another time interval just after t = 1 over which the problem is again not stiff, since the solution again exhibits rapid transient behavior and a small time step would be needed on the basis of accuracy considerations.

8.3 Numerical methods for stiff problems

Over time intervals where a problem is stiff, we would like to use a numerical method that has a large region of absolute stability, extending far into the left half-plane. The problem with a method like Euler's, with a stability region that extends only out to Re(z) = −2, is that the time step k is severely limited by the eigenvalue with largest magnitude, and we need to take k ≤ 2/|λ_max|. Over time intervals where this fastest time scale does not appear in the solution, we would like to be able to take much larger time steps. For example, in the problems shown in Figure 8.3, where K₁ = 10⁶, we would need to take k ≤ 2 × 10⁻⁶ with Euler's method, requiring 4 million time steps to compute over the time interval shown in the figure, although the solution is very smooth over most of this time.
An analysis of stability regions shows that there are basically two different classes of methods: those for which the stability region is bounded and extends a distance O(1) from the origin, such as Euler's method, the midpoint method, or any of the Adams methods (see Figures 7.1 and 8.5), and those for which the stability region is unbounded, such as backward Euler or trapezoidal. Clearly the first class of methods is inappropriate for stiff problems. Unfortunately, all explicit methods have bounded stability regions and hence are generally quite inefficient on stiff problems. An exception is methods such as the Runge-Kutta-Chebyshev methods described in Section 8.6 that may work well for mildly stiff problems. Some implicit methods also have bounded stability regions, such as the Adams-Moulton methods, and are not useful for stiff problems.
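The cost of a bounded stability region is easy to demonstrate numerically. The sketch below (an illustration, not from the text) applies forward and backward Euler to the test problem (8.1) with λ = −10⁶ and a time step k = 0.1 that is far outside the explicit method's stability region but perfectly adequate for the accuracy of the smooth solution u(t) = cos t.

```python
import math

lam = -1.0e6          # stiff decay rate lambda in (8.1)
k, T = 0.1, 3.0       # time step and final time
N = round(T / k)

# Forward Euler: U^{n+1} = U^n + k*(lam*(U^n - cos t_n) - sin t_n)
u_fe = 1.0
for n in range(N):
    t = n * k
    u_fe = u_fe + k * (lam * (u_fe - math.cos(t)) - math.sin(t))

# Backward Euler: solve U^{n+1} = U^n + k*(lam*(U^{n+1} - cos t_{n+1}) - sin t_{n+1})
u_be = 1.0
for n in range(N):
    t1 = (n + 1) * k
    u_be = (u_be - k * lam * math.cos(t1) - k * math.sin(t1)) / (1.0 - k * lam)

print(abs(u_fe - math.cos(T)))   # explosive growth: |1 + k*lam| ~ 1e5 per step
print(abs(u_be - math.cos(T)))   # small error; the stiff mode is damped
```

The explicit solution is amplified by roughly |1 + kλ| ≈ 10⁵ in every step, while the implicit one tracks cos t closely.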

Numerical methods for stiff problems

8.3.1 A-stability and A(α)-stability

It seems like it would be optimal to have a method whose stability region contains the entire left half-plane. Then any time step would be allowed, provided that all the eigenvalues have negative real parts, as is often the case in practice. The backward Euler and trapezoidal methods have this property, for example. We give methods with this property a special name, as follows.

Definition 8.1. An ODE method is said to be A-stable if its region of absolute stability S contains the entire left half-plane {z ∈ C : Re(z) ≤ 0}.

For LMMs it turns out that this is quite restrictive. A theorem of Dahlquist [2] (the paper in which the term A-stability was introduced) states that any A-stable LMM is at most second order accurate, and in fact the trapezoidal method is the A-stable method with smallest truncation error. This is Dahlquist's second barrier theorem, proved, for example, in [44] and by using order stars in [5]. Higher order A-stable implicit Runge-Kutta methods do exist, including diagonally implicit (DIRK) methods; see, for example, [3], [44].

For many stiff problems the eigenvalues are far out in the left half-plane but near (or even exactly on) the real axis. For such problems there is no reason to require that the entire left half-plane lie in the region of absolute stability. If arg(z) represents the argument of z with arg(z) = π on the negative real axis, and if the wedge π − α ≤ arg(z) ≤ π + α is contained in the stability region, then we say the method is A(α)-stable. An A-stable method is A(π/2)-stable. A method is A(0)-stable if the negative real axis itself lies in the stability region.
Note that in general it makes sense to require a wedge to lie in the stability region, since adjusting the time step k causes z = kλ to move in toward the origin on a ray through each eigenvalue λ.

8.3.2 L-stability

Notice a major difference between the stability regions for trapezoidal and backward Euler: the trapezoidal method is stable only in the left half-plane, whereas backward Euler is also stable over much of the right half-plane. The point at infinity (if we view these stability regions on the Riemann sphere) lies on the boundary of the stability region for the trapezoidal method but in the interior of the stability region for backward Euler. These are both one-step methods, and so on the test problem u' = λu we have U^{n+1} = R(z)U^n, where

    R(z) = (1 − z)^{−1}  with  |R(z)| → 0 as |z| → ∞  for backward Euler,

while

    R(z) = (1 + z/2)/(1 − z/2)  with  |R(z)| → 1 as |z| → ∞  for the trapezoidal method.
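A quick numerical check of these two limits (a sketch, not from the text): evaluate each stability function far out on the negative real axis.

```python
# Stability functions on the test problem u' = lam*u, with z = k*lam.
def R_backward_euler(z):
    return 1.0 / (1.0 - z)

def R_trapezoidal(z):
    return (1.0 + z / 2.0) / (1.0 - z / 2.0)

z = -1.0e8   # far out in the left half-plane
damp_be = abs(R_backward_euler(z))
damp_tr = abs(R_trapezoidal(z))
print(damp_be)   # ~1e-8: backward Euler damps the stiff mode completely
print(damp_tr)   # ~1.0: trapezoidal leaves it essentially undamped
```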

Chapter 8. Stiff Ordinary Differential Equations

This difference can have a significant effect on the quality of solutions in some situations, in particular if there are rapid transients in the solution that we are not interested in resolving accurately with very small time steps. For these transients we want more than just stability: we want them to be effectively damped in a single time step, since we are planning to use a time step that is much larger than the true decay time of the transient. For this purpose a method like backward Euler will perform better than the trapezoidal method. The backward Euler method is said to be L-stable.

Definition 8.2. A one-step method is L-stable if it is A-stable and lim_{z→∞} |R(z)| = 0, where R(z) is the stability function introduced above.

The value of L-stability is best illustrated with an example.

Example 8.3. Consider the problem (8.1) with λ = −10⁶. We will see how the trapezoidal and backward Euler methods behave in two different situations.

Case 1: Take data u(0) = 1, so that u(t) = cos(t) and there is no initial transient. Then both trapezoidal and backward Euler behave reasonably, and the trapezoidal method gives smaller errors since it is second order accurate. Table 8.1 shows the errors at T = 3 with various values of k.

Case 2: Now take data u(0) = 1.5, so there is an initial rapid transient toward u = cos(t) on a time scale of about 10⁻⁶. Both methods are still absolutely stable, but the results in Table 8.1 show that backward Euler works much better in this case. To understand what is happening, see Figure 8.4, which shows the true and computed solutions with each method if we use k = 0.1. The trapezoidal method is stable and the results stay bounded, but since kλ = −10⁵ we have (1 + kλ/2)/(1 − kλ/2) ≈ −0.99996, and the initial deviation from the smooth curve cos(t) is essentially negated in each time step. Backward Euler, on the other hand, damps the deviation very effectively in the first time step, since (1 − kλ)⁻¹ ≈ 10⁻⁵.
This is the proper behavior, since the true rapid transient decays in a period much shorter than a single time step. If we are solving a stiff equation with initial data for which the solution is smooth from the beginning (no rapid transients), or if we plan to compute rapid transients accurately by taking suitably small time steps in these regions, then it may be fine to use a method such as the trapezoidal method that is not L-stable.

Table 8.1. Errors at time T = 3 for Example 8.3 with various values of k, for the backward Euler and trapezoidal methods in Cases 1 and 2. [Numerical entries not recoverable from this transcription.]
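Case 2 is easy to reproduce. The following sketch (an illustration with the parameters λ = −10⁶ and k = 0.1 from the discussion above) advances both one-step methods from u(0) = 1.5 and compares the errors at T = 3; the trapezoidal error stays near the initial deviation of 0.5 because R(kλ) ≈ −1, while backward Euler kills the transient in one step.

```python
import math

lam, k, T = -1.0e6, 0.1, 3.0
N = round(T / k)

u_tr = 1.5   # trapezoidal method
u_be = 1.5   # backward Euler

for n in range(N):
    t0, t1 = n * k, (n + 1) * k
    # Trapezoidal: U^{n+1} = U^n + (k/2)[f(U^n,t_n) + f(U^{n+1},t_{n+1})], solved for U^{n+1}
    rhs = (u_tr * (1 + k * lam / 2)
           - (k * lam / 2) * (math.cos(t0) + math.cos(t1))
           - (k / 2) * (math.sin(t0) + math.sin(t1)))
    u_tr = rhs / (1 - k * lam / 2)
    # Backward Euler: U^{n+1} = (U^n - k*lam*cos t_{n+1} - k*sin t_{n+1})/(1 - k*lam)
    u_be = (u_be - k * lam * math.cos(t1) - k * math.sin(t1)) / (1 - k * lam)

err_tr = abs(u_tr - math.cos(T))
err_be = abs(u_be - math.cos(T))
print(err_tr)   # O(0.5): the transient oscillates undamped, as in Figure 8.4(a)
print(err_be)   # small: the transient is damped in the first step, Figure 8.4(b)
```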

BDF methods

Figure 8.4. Comparison of (a) trapezoidal method and (b) backward Euler on a stiff problem with an initial transient (Case 2 of Example 8.3).

8.4 BDF methods

One class of very effective methods for stiff problems consists of the backward differentiation formula (BDF) methods. These were introduced by Curtiss and Hirschfelder [20]. See, e.g., [33], [44], or [59] for more about these methods. These methods result from taking σ(ζ) = β_r ζ^r, which has all its roots at the origin, so that the point at infinity is in the interior of the stability region. The method thus has the form

    α₀ U^n + α₁ U^{n+1} + ··· + α_r U^{n+r} = k β_r f(U^{n+r})      (8.4)

with β₀ = β₁ = ··· = β_{r−1} = 0. Since f(u) = u', this form of method can be derived by approximating u'(t_{n+r}) by a backward difference approximation based on u(t_{n+r}) and r additional points going backward in time. It is possible to derive an r-step method that is rth order accurate. The one-step BDF method is simply the backward Euler method, U^{n+1} = U^n + k f(U^{n+1}), which is first order accurate. The other useful BDF methods are below:

    r = 2:  3U^{n+2} − 4U^{n+1} + U^n = 2k f(U^{n+2})
    r = 3:  11U^{n+3} − 18U^{n+2} + 9U^{n+1} − 2U^n = 6k f(U^{n+3})
    r = 4:  25U^{n+4} − 48U^{n+3} + 36U^{n+2} − 16U^{n+1} + 3U^n = 12k f(U^{n+4})
    r = 5:  137U^{n+5} − 300U^{n+4} + 300U^{n+3} − 200U^{n+2} + 75U^{n+1} − 12U^n = 60k f(U^{n+5})
    r = 6:  147U^{n+6} − 360U^{n+5} + 450U^{n+4} − 400U^{n+3} + 225U^{n+2} − 72U^{n+1} + 10U^n = 60k f(U^{n+6})

These methods have the proper behavior on eigenvalues λ for which Re(λ) is very negative, but of course we also have other eigenvalues for which z = kλ is closer to the origin, corresponding to the active time scales in the problem. So deciding its suitability for a particular problem requires looking at the full stability region of a method. This is shown in Figure 8.5 for each of the BDF methods.
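To illustrate, here is a sketch (not from the text) of the two-step BDF formula applied to the stiff linear test problem u' = λu with λ = −10⁶ and k = 0.1, so z = kλ = −10⁵ lies far outside any explicit method's stability region; the BDF2 iteration nevertheless damps the solution rapidly. The first step is taken with backward Euler to start the two-step method.

```python
lam, k = -1.0e6, 0.1

# For f(u) = lam*u, BDF2 reads 3U^{n+2} - 4U^{n+1} + U^n = 2k*lam*U^{n+2},
# which solves to U^{n+2} = (4U^{n+1} - U^n) / (3 - 2k*lam).
u_prev = 1.0                          # U^0
u_curr = u_prev / (1.0 - k * lam)     # U^1 by backward Euler startup

for n in range(29):                   # advance to U^30, i.e. T = 3
    u_next = (4.0 * u_curr - u_prev) / (3.0 - 2.0 * k * lam)
    u_prev, u_curr = u_curr, u_next

print(abs(u_curr))   # essentially zero, like the true solution e^{lam*t}
```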

Chapter 8. Stiff Ordinary Differential Equations

Figure 8.5. Stability regions for the 1-step through 6-step BDF methods. The stability region is the shaded region exterior to the curves.

Runge-Kutta-Chebyshev explicit methods

In particular, we need to make sure that the method is zero-stable. Otherwise it would not be convergent. This is not guaranteed from our derivation of the methods, since zero-stability depends only on the polynomial ρ(ζ), whose coefficients α_j are determined by considering the local truncation error and not stability considerations. It turns out that the BDF methods are zero-stable only for r ≤ 6. Higher order BDF methods cannot be used in practice. For r > 2 the methods are not A-stable but are A(α)-stable for the following values of α:

    r = 1:  α = 90°,    r = 4:  α = 73°,
    r = 2:  α = 90°,    r = 5:  α = 51°,      (8.5)
    r = 3:  α = 88°,    r = 6:  α = 18°.

8.5 The TR-BDF2 method

There are often situations in which it is useful to have a one-step method that is L-stable. The backward Euler method is one possibility, but it is only first order accurate. It is possible to derive higher order implicit Runge-Kutta methods that are L-stable. As one example, we mention the two-stage second order accurate diagonally implicit method

    U* = U^n + (k/4)(f(U^n) + f(U*)),
    U^{n+1} = (1/3)(4U* − U^n + k f(U^{n+1})).      (8.6)

Each stage is implicit. The first stage is simply the trapezoidal method (or trapezoidal rule, hence TR in the name) applied over time k/2. This generates a value U* at t_{n+1/2}. Then the two-step BDF method is applied to the data U^n and U* with time step k/2 to obtain U^{n+1}. This method is written in a different form in (5.37).

8.6 Runge-Kutta-Chebyshev explicit methods

While conventional wisdom says that implicit methods should be used for stiff problems, there is an important class of mildly stiff problems for which special explicit methods have been developed with some advantages. In this section we take a brief look at the Runge-Kutta-Chebyshev methods that are applicable to problems where the eigenvalues of the Jacobian matrix are on the negative real axis and not too widely distributed.
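Before moving on, the TR-BDF2 method (8.6) is easy to exercise on the linear test problem f(u) = λu, where both implicit stages can be solved in closed form. The sketch below (an illustration, not from the text) verifies second order accuracy on u' = −u, and checks that the stability function R(z) = (12 + 5z)/(12 − 7z + z²), obtained by eliminating the stages from (8.6), tends to 0 as z → −∞.

```python
import math

def trbdf2_step(u, z):
    # One TR-BDF2 step for u' = lam*u with z = k*lam (stages solved exactly):
    #   stage 1 (trapezoidal over k/2):  U* = u*(1 + z/4)/(1 - z/4)
    #   stage 2 (BDF2 over k/2):         U^{n+1} = (4U* - u)/(3 - z)
    ustar = u * (1.0 + z / 4.0) / (1.0 - z / 4.0)
    return (4.0 * ustar - u) / (3.0 - z)

def solve(k):
    # Integrate u' = -u, u(0) = 1, up to T = 1.
    u, n = 1.0, round(1.0 / k)
    for _ in range(n):
        u = trbdf2_step(u, -k)
    return u

err1 = abs(solve(0.1) - math.exp(-1.0))
err2 = abs(solve(0.05) - math.exp(-1.0))
ratio = err1 / err2
print(ratio)              # about 4: halving k quarters the error (second order)

rlim = abs(trbdf2_step(1.0, -1.0e8))
print(rlim)               # tiny: |R(z)| -> 0 as z -> -infinity, so transients are damped
```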
This type of problem arises, for example, when a parabolic heat equation or diffusion equation is discretized in space, giving rise to a large system of ordinary differential equations in time. This problem is considered in detail in Chapter 9. The idea is to develop an explicit method whose stability region stretches out along the negative real axis as far as possible, without much concern with what is happening away from the real axis. We want the region of absolute stability S to contain a stability interval [−β, 0] that is as long as possible. To do this we will consider multistage explicit Runge-Kutta methods as introduced in Section 5.7, but now as we increase the number of stages we will choose the coefficients to make β as large as possible rather than to increase the order of accuracy of the method.

Chapter 8. Stiff Ordinary Differential Equations

We will consider only first order methods, because they are easiest to study. Second order methods are commonly used in practice, and this is often quite sufficient for the applications we have in mind, such as integrating method of lines (MOL) discretizations of the heat equation. The system of ODEs is obtained by discretizing the original partial differential equation (PDE) in space, and often this discretization is only second order accurate spatially. Second order methods similar to those described here can be found in [80], [97], [2], and higher order methods in [69], for example.

Consider an explicit r-stage Runge-Kutta method of the form (5.34) with a_ij = 0 for i ≤ j. If we apply this method to u' = λu we obtain

    U^{n+1} = R(z) U^n,

where z = kλ and R(z) is a polynomial of degree r in z, the stability polynomial. We will write this as

    R(z) = d₀ + d₁ z + d₂ z² + ··· + d_r z^r.      (8.7)

Consistency and first order accuracy require that

    d₀ = 1,  d₁ = 1.      (8.8)

The region of absolute stability is

    S = {z ∈ C : |R(z)| ≤ 1}.      (8.9)

The coefficients d_j depend on the coefficients A and b defining the Runge-Kutta method, but for the moment suppose that for any choice of d_j we can find suitable coefficients A and b that define an explicit r-stage method with stability polynomial (8.7). Then our goal is to choose the d_j coefficients so that S contains an interval [−β, 0] of the negative real axis with β as large as possible. We also require (8.8) for a first order accurate method. Note that the condition (8.8) amounts to requiring

    R(0) = 1  and  R'(0) = 1.      (8.10)

Then we want to choose R(z) as a polynomial of degree r that satisfies |R(z)| ≤ 1 for as long as possible as we go out on the negative real axis. The solution is simply a scaled and shifted Chebyshev polynomial. Figure 8.6 shows the optimal polynomials of degrees 3 and 6. Note that these are plots of R(x) for x real.
We have seen several other examples where Chebyshev polynomials solve problems of this sort, and as usual the optimal solution equioscillates at a set of points, in this case r + 2 points (including x = −β and x = 0), where |R(x)| = 1. If we try to perturb the polynomial so that |R(−β)| < 1 (and hence the method will be stable further out on the negative real axis), one of the other extrema of the polynomial will move above 1, losing stability for some z closer to the origin. The optimal polynomial is

    R(z) = T_r(1 + z/r²),      (8.11)

where T_r(z) is the Chebyshev polynomial of degree r, as in Section B.3.2. For this choice R(−2r²) = T_r(−1) = ±1, while |R(x)| > 1 for x < −2r², and so

    β(r) = 2r².      (8.12)
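This choice is easy to verify numerically. The sketch below (not from the text) evaluates R(z) = T_r(1 + z/r²) on the negative real axis using NumPy's Chebyshev evaluation and confirms that |R| stays at or below 1 on [−2r², 0] and exceeds 1 just beyond.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

r = 6
coef = np.zeros(r + 1)
coef[r] = 1.0                        # coefficient vector selecting T_r

def R(x):
    return C.chebval(1.0 + x / r**2, coef)   # R(x) = T_r(1 + x/r^2)

xs = np.linspace(-2.0 * r**2, 0.0, 5001)
peak = np.max(np.abs(R(xs)))
outside = abs(R(-2.0 * r**2 - 0.5))
print(peak)      # <= 1 on the stability interval [-2r^2, 0] = [-72, 0]
print(outside)   # > 1 just beyond x = -beta(r)
```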

Runge-Kutta-Chebyshev explicit methods

Figure 8.6. Shifted Chebyshev polynomials corresponding to R(x) along the real axis for the first order Runge-Kutta-Chebyshev methods. (a) r = 3. (b) r = 6.

We have optimized along the negative real axis, but it is interesting to see what the stability region S from (8.9) looks like in the complex plane. Figure 8.7 shows the stability region for the two polynomials shown in Figure 8.6 of degrees 3 and 6. They do include the negative real axis out to −β(r), but notice that every point where |R(x)| = 1 on the real axis is a point where the region S contracts to the axis. Perturbing such an x away from the axis into the complex plane gives a point z where |R(z)| > 1. So these methods are suitable only for problems with real eigenvalues, such as MOL discretizations of the heat equation.

Even for these problems it is generally preferable to perturb the polynomial R(z) a bit so that |R(z)| is bounded slightly below 1 on the real axis between −β and the origin, which can be done at the expense of decreasing β slightly. This is done by using the shifted Chebyshev polynomial

    R(z) = T_r(w₀ + w₁ z) / T_r(w₀),  where  w₁ = T_r(w₀) / T_r'(w₀).

Here w₀ > 1 is the damping parameter, and w₁ is chosen so that (8.8) holds and hence the method remains first order accurate. Now R(x) alternates between ±(T_r(w₀))⁻¹ < 1 on the interval [−β, 0], where β is now found by solving w₀ − β w₁ = −1, so

    β = (w₀ + 1) T_r'(w₀) / T_r(w₀).

Verwer [97] suggests taking w₀ = 1 + ε/r² with ε = 0.05, for which

    β ≈ (2 − (4/3) ε) r² ≈ 1.93 r².

This gives about 5% damping with little reduction in β. Figure 8.8 shows the stability regions for the damped first order methods, again for r = 3 and r = 6.

Once we have determined the desired stability polynomial R(z), we must still develop an r-stage Runge-Kutta method with this stability polynomial. In general there are infinitely many such Runge-Kutta methods.
Here we discuss one approach that has several desirable properties, the Runge Kutta Chebyshev method introduced by van der Houwen

Chapter 8. Stiff Ordinary Differential Equations

Figure 8.7. Absolute stability regions for the first order Runge-Kutta-Chebyshev methods. (a) r = 3. (b) r = 6.

Figure 8.8. Absolute stability regions for the first order Runge-Kutta-Chebyshev methods with damping. (a) r = 3. (b) r = 6.

and Sommeijer [94]. See the review paper of Verwer [97] for a derivation of this method and a discussion and comparison of some other approaches, and see [80] for a discussion of software that implements these methods. The Runge-Kutta-Chebyshev methods are based on the recurrence relation of Chebyshev polynomials and have the general form

Runge-Kutta-Chebyshev explicit methods

    Y₀ = U^n,
    Y₁ = Y₀ + μ̃₁ k f(Y₀, t_n),
    Y_j = (1 − μ_j − ν_j) Y₀ + μ_j Y_{j−1} + ν_j Y_{j−2}      (8.13)
          + μ̃_j k f(Y_{j−1}, t_n + c_{j−1} k) + γ̃_j k f(Y₀, t_n),   j = 2, ..., r,
    U^{n+1} = Y_r.

The various parameters in these formulas are given in the references above. This is an r-stage explicit Runge-Kutta method, although written in a different form than (5.34). An advantage of this recursive form is that only three intermediate solution vectors Y₀, Y_{j−1}, and Y_{j−2} are needed to compute the next Y_j, no matter how many stages r are used. This is important since large values of r (e.g., r = 50) are often used in practice, and for PDE applications each Y_j is a grid function over the entire (possibly multidimensional) spatial domain.

Another advantage of the Runge-Kutta-Chebyshev methods is that they have good internal stability properties, as discussed in the references above. We know the stability region of the overall stability polynomial R(z), but some r-stage methods for large r suffer from exponential growth of the solution from one stage to the next in the early stages, before ultimately decaying, even when |R(z)| < 1 (analogous to the transient growth sometimes seen when iterating with a nonnormal matrix even if it is asymptotically stable; see Section D.4). This growth can cause numerical difficulties if methods are not carefully designed.
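For the undamped first order case (w₀ = 1) the recurrence collapses to particularly simple parameters. Assuming the reduction μ̃₁ = 1/r², μ_j = 2, ν_j = −1, μ̃_j = 2/r², γ̃_j = 0 (a standard consequence of the Chebyshev three-term recurrence, not spelled out in the text), the sketch below takes one r-stage step on u' = λu and checks that the result equals the stability polynomial T_r(1 + z/r²), while storing only three stage values at a time.

```python
def rkc1_step(u, k, f, r):
    # Undamped first-order Runge-Kutta-Chebyshev step (assumed parameters):
    #   Y_1 = Y_0 + (k/r^2) f(Y_0)
    #   Y_j = 2 Y_{j-1} - Y_{j-2} + (2k/r^2) f(Y_{j-1}),  j = 2..r
    y_jm2 = u
    y_jm1 = u + (k / r**2) * f(u)
    for j in range(2, r + 1):
        y_j = 2.0 * y_jm1 - y_jm2 + (2.0 * k / r**2) * f(y_jm1)
        y_jm2, y_jm1 = y_jm1, y_j
    return y_jm1

r, lam, k = 10, -1500.0, 0.1      # z = k*lam = -150, inside [-2r^2, 0] = [-200, 0]
u1 = rkc1_step(1.0, k, lambda u: lam * u, r)

# On u' = lam*u the stages reproduce the Chebyshev recurrence, so the step
# equals R(z) = T_r(1 + z/r^2); here 1 + z/r^2 = -0.5 and
# T_10(cos(2*pi/3)) = cos(20*pi/3) = -0.5.
print(u1)   # -0.5 up to rounding
```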

Chapter 6

Two-point Boundary Value Problems for Ordinary Differential Equations

A two-point boundary value problem (BVP) is a second-order differential equation, in scalar or in system form, in which we are not given two initial conditions. (Note that we need two conditions because we are considering a second-order ODE; in general, a kth order ODE requires k side conditions to determine its solution.) Instead, in a BVP the side conditions are given as boundary conditions at the two boundary points of the interval over which we seek the solution. There are several methods available for solving BVPs, including shooting methods, finite difference methods, finite element methods (or Galerkin methods), collocation methods, relaxation methods (Newton-Raphson-Kantorovich), and eigenvalue problems. We are going to study only the first two approaches, shooting methods and finite difference methods, briefly in this chapter.

6.1 Shooting Method

Example: Consider Newton's second law F = ma, which can be written as the ODE

    F(t, x(t), x'(t)) = m d²x/dt²      (6.1)

where t is time chosen in an interval [t_a, t_b], x(t) is a position, dx/dt is a velocity, and d²x/dt² is an acceleration. If we introduce two new variables y₁ and y₂:

    y₁(t) = x(t),  y₂(t) = y₁'(t) = dx/dt,      (6.2)

we can then convert this second-order ODE into a first-order ODE system of two equations,

    [y₁'(t); y₂'(t)] = [y₂(t); F/m].      (6.3)

Note that we would solve Eq. 6.3 as an IVP if we were given the two side conditions as initial conditions:

    y(t_a) = [y₁(t_a); y₂(t_a)] = [x(t_a); x'(t_a)] = [α; β],      (6.4)

by which we know the initial value x(t_a) itself and the slope x'(t_a). This IVP uniquely determines the solution x(t) over the entire interval [t_a, t_b] using the various time-marching methods we learned in Chapter 5, by integrating the numerical schemes from t = t_a to t = t_b.

Instead, if we happen to know the two side conditions as boundary conditions rather than initial conditions, e.g.,

    x(t_a) = α,  x(t_b) = β,      (6.5)

so that

    y(t_a) = [x(t_a); x'(t_a)] = [α; unknown₁]      (6.6)

and

    y(t_b) = [x(t_b); x'(t_b)] = [β; unknown₂],      (6.7)

we suddenly encounter a situation in which we lack the information needed to proceed with time-marching using the techniques from IVPs, because we do not know the slope at t = t_a, which is essential for all time-marching schemes. Therefore, we need to develop different approaches to solve such BVPs. This is the main topic of this chapter.

Remark: There are many important physical problems that have this form, including the bending of an elastic beam under a distributed transverse load, the temperature distribution over a rod whose end points are maintained at fixed temperature,

and the steady-state solution of parabolic PDEs, which is equivalent to solving the corresponding elliptic PDEs.

As just introduced, we can convert the two-point BVP for the second-order scalar ODE

    u''(t) = f(t, u, u'),  t_a < t < t_b,      (6.8)

with boundary conditions

    u(t_a) = α,  u(t_b) = β,      (6.9)

to the equivalent first-order system of ODEs

    [y₁'(t); y₂'(t)] = [y₂(t); f(t, y₁, y₂)],  t_a < t < t_b,      (6.10)

where

    y₁(t) = u(t),  y₂(t) = y₁'(t) = u'(t).      (6.11)

The boundary conditions can be separated into a linear form,

    [1 0; 0 0] [y₁(t_a); y₂(t_a)] + [0 0; 1 0] [y₁(t_b); y₂(t_b)] = [α; β].      (6.12)

In general, we can express Eq. 6.10 as

    y'(t) = f(t, y(t)),  t_a < t < t_b,      (6.13)

where

    y(t) = [y₁(t); y₂(t)],      (6.14)

with boundary conditions

    g(y(t_a), y(t_b)) = [y₁(t_a) − α; y₁(t_b) − β] = 0,  g : R^{2n} → R^n.      (6.15)

As noted in Eq. 6.6 and Eq. 6.7, the full initial and boundary conditions of y are

    y(t_a) = [y₁(t_a); y₂(t_a)] = [u(t_a); u'(t_a)] = [α; unknown₁]      (6.16)

and

    y(t_b) = [y₁(t_b); y₂(t_b)] = [u(t_b); u'(t_b)] = [β; unknown₂].      (6.17)

Now let us consider the following approach to solve the BVP ODE system in Eq. 6.13 along with Eq. 6.16 and Eq. 6.17.

1. Assume we somehow know unknown₁ in Eq. 6.16; for instance, we make an initial guess for it and set

    unknown₁ = y₂⁽⁰⁾(t_a).      (6.18)

2. With this guess, we can now successfully proceed to solve the IVP ODE system in Eq. 6.13 with our guessed initial condition

    y(t_a) = [y₁(t_a); y₂(t_a)] = [u(t_a); u'(t_a)] = [α; y₂⁽⁰⁾(t_a)].      (6.19)

This is nothing but making an initial guess for the slope u'(t_a) at t = t_a, with which we now have full information to embark on a time-marching of y successively over [t_a, t_b].

3. Check how well the time-marching solution y₁⁽⁰⁾(t_b; y₂⁽⁰⁾(t_a)) at t = t_b, which has just been produced using the initial guess y₂⁽⁰⁾(t_a), compares with the true boundary value y₁(t_b) = u(t_b) = β.

4. If the resulting value y₁⁽⁰⁾(t_b) is close to y₁(t_b) = u(t_b) = β, the search is successful and we exit. Otherwise, the process is repeated with a new slope guess

    y₂⁽ᵏ⁾(t_a),  k = 1, 2, ...,      (6.20)

which will produce a new IVP solution

    y₁⁽ᵏ⁾(t_b; y₂⁽ᵏ⁾(t_a)) ≈ β,  k = 1, 2, ...,      (6.21)

until the search is successful.

Remark: If we could solve the IVP backward in time over [t_a, t_b], integrating from t_b to t_a, we would instead make an initial guess for unknown₂ and solve that IVP. But we don't want to do this reverse solve in normal situations and don't consider it here.

The above procedure can be thought of as a root finding problem for a function h given by

    h(y₁⁽ᵏ⁾(t_b); y₂⁽ᵏ⁾(t_a)) ≡ y₁⁽ᵏ⁾(t_b; y₂⁽ᵏ⁾(t_a)) − β = 0.      (6.22)

For this we can use Newton's root finding method, which can be written as the following:

Algorithm: Newton's method for finding the root:

    y₂⁽⁰⁾(t_a) = initial guess
    estimator = largenumber (e.g., 10¹⁰)

    while estimator > ε
        y_2^(k+1)(t_a) = y_2^(k)(t_a) - h^(k) / (h^(k))'
        estimator = |h^(k) / (h^(k))'|
    endwhile
    # [if estimator becomes close to zero,
    #  it implies the iteration has found the root and converged]

Remark: In the above algorithm, we have

    h^(k) = h(y_1^(k)(t_b); y_2^(k)(t_a)),                                  (6.23)

and

    (h^(k))' = h'(y_1^(k)(t_b); y_2^(k)(t_a)) = [y_1^(k)(t_b; y_2^(k)(t_a)) - β]' = y_2^(k)(t_b).   (6.24)

Example: Consider the BVP given by

    u''(t) = 6t,   0 < t < 1,                                               (6.25)

with boundary conditions

    u(0) = 0,   u(1) = 1.                                                   (6.26)

We now convert this into a system of first-order ODEs

    [ y_1'(t) ]   [ y_2(t) ]
    [ y_2'(t) ] = [ 6t     ],                                               (6.27)

where y_1 = u and y_2 = y_1' = u'. Let's make an initial guess for the slope of u at t = t_a, so that we solve the IVP using the first guess y_2^(0)(t_a):

    y(t_a) = [ y_1(t_a) ] = [ 0            ].                               (6.28)
             [ y_2(t_a) ]   [ y_2^(0)(t_a) ]

We also take the forward Euler method to integrate the IVP over the temporal domain [0, 1]. The function h for root finding becomes

    h(y_1^(k)(1); y_2^(k)(0)) ≡ y_1^(k)(1; y_2^(k)(0)) - 1 = 0.             (6.29)

For each guess y_2^(k)(0), we will integrate the ODE using the forward Euler method; for instance, the first integration with the first initial guess y_2^(0)(0) becomes:

k = 0:

For n = 1, or t = t_1 (recall t_n = n Δt):

    y_1^1 = y_1^0 + Δt (y_2^0),                                             (6.30)
    y_2^1 = y_2^0 + Δt (6 t_0) = y_2^(0) + Δt (6 · 0) = y_2^(0).            (6.31)

For n = 2, or t_2 = 2Δt:

    y_1^2 = y_1^1 + Δt (y_2^1),                                             (6.32)
    y_2^2 = y_2^1 + Δt (6 t_1) = y_2^1 + 6 Δt^2.                            (6.33)

Continue the time-marching until n = N such that t_N = t_b = 1 is reached. For n = N, or t_N = N Δt = t_b = 1:

    y_1^N = y_1^{N-1} + Δt (y_2^{N-1}),                                     (6.34)
    y_2^N = y_2^{N-1} + Δt (6 t_{N-1}) = y_2^{N-1} + 6(N-1) Δt^2.           (6.35)

The final values at t = t_b = 1 we have just obtained are

    y_1^(0)(1) = y_1^N,                                                     (6.36)
    y_2^(0)(1) = y_2^N.                                                     (6.37)

Perform the root finding for the next guess y_2^(1)(0):

    y_2^(1)(0) = y_2^(0)(0) - h^(0)/(h^(0))' = y_2^(0)(0) - (y_1^(0)(1) - 1) / y_2^(0)(1).   (6.38)

Repeat this process with a new initial guess for the slope y_2^(k)(0) until the estimator h^(k)/(h^(k))' gets close to zero:

    h^(k)/(h^(k))' = (y_1^(k)(1) - 1) / y_2^(k)(1) → 0.                     (6.39)

Remark: The pros and cons of the shooting method approach include:

(i) the shooting method is conceptually simple and is easy to implement using existing software for IVPs and for root-finding methods;

(ii) the shooting method inherits the stability of the associated IVP and might become unstable even when the BVP is stable, which can result in extreme difficulties in convergence;

(iii) for some initial guesses, the solution of the IVP may not exist over the entire interval (or domain) of integration, in that the solution may become unbounded even before reaching the right-hand endpoint of the BVP.
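The worked example above can be sketched in code. The following is a minimal sketch, not the notes' exact algorithm: it uses the same forward Euler march for the IVP, but replaces the notes' derivative estimate (h^(k))' with a secant update, since the secant method needs no derivative and, for this linear BVP, h is affine in the slope guess so it converges essentially in one step. The function names are illustrative, not from the notes.

```python
import numpy as np

def integrate_euler(slope_guess, N=1000):
    """Forward-Euler march of y1' = y2, y2' = 6t over [0, 1],
    from y1(0) = 0, y2(0) = slope_guess; returns y1(1)."""
    dt = 1.0 / N
    y1, y2 = 0.0, slope_guess
    for n in range(N):
        t = n * dt
        y1, y2 = y1 + dt * y2, y2 + dt * 6.0 * t
    return y1

def shoot(beta=1.0, s0=0.0, s1=2.0, tol=1e-12, kmax=50):
    """Secant iteration on h(s) = y1(1; s) - beta (cf. Eq. 6.22)."""
    h0, h1 = integrate_euler(s0) - beta, integrate_euler(s1) - beta
    for _ in range(kmax):
        if abs(h1 - h0) < 1e-15:
            break
        s0, s1 = s1, s1 - h1 * (s1 - s0) / (h1 - h0)
        h0, h1 = h1, integrate_euler(s1) - beta
        if abs(h1) < tol:
            break
    return s1

s = shoot()
print(s)  # small: the exact slope is u'(0) = 0, recovered with an O(dt) bias
```

Note that the converged slope is not exactly u'(0) = 0: it carries the O(Δt) error of forward Euler, shrinking as N grows.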

Finite Difference Method

In solving a BVP, the shooting method satisfies the ODE approximately from the outset (by using an IVP solver) and iterates until the boundary conditions are met. There are potential issues, as discussed in the shooting method approach. An alternative is to satisfy the boundary conditions from the outset and iterate until the ODE is approximately satisfied. This approach is taken in finite difference methods, which convert a BVP directly into a system of algebraic equations rather than into a sequence of IVPs as in the shooting method.

In a finite difference method, a mesh of points is introduced within the interval of integration, and any derivatives appearing in the ODE or boundary conditions are replaced by finite difference approximations at the mesh points. For a scalar two-point BVP

    u''(t) = f(t, u, u'),   t_a < t < t_b,                                  (6.40)

with boundary conditions

    u(t_a) = α,   u(t_b) = β,                                               (6.41)

we introduce mesh points

    t_n = t_a + n Δt,   n = 0, 1, ..., N+1,                                 (6.42)

where Δt = (t_b - t_a)/(N+1) (note that we obtain t_{N+1} = t_b), and we seek approximate solution values

    U_n ≈ u(t_n),   n = 1, ..., N.                                          (6.43)

Next we use finite difference approximations to replace the first and second derivatives,

    u'(t_n) ≈ (U_{n+1} - U_{n-1}) / (2Δt),                                  (6.44)
    u''(t_n) ≈ (U_{n+1} - 2U_n + U_{n-1}) / Δt^2.                           (6.45)

Note that these finite difference approximations are second-order accurate, having local truncation errors of order O(Δt^2). As a result, the difference relation becomes a system of algebraic equations

    (U_{n+1} - 2U_n + U_{n-1}) / Δt^2 = f(t_n, U_n, (U_{n+1} - U_{n-1})/(2Δt)),   n = 1, ..., N.   (6.46)

Example: We illustrate the finite difference method on the two-point BVP from the previous example,

    u''(t) = 6t,   0 < t < 1,                                               (6.47)

with boundary conditions

    u(0) = 0,   u(1) = 1.                                                   (6.48)

We find that the difference equations over [0, 1] are

    (U_{n+1} - 2U_n + U_{n-1}) / Δt^2 = 6 t_n,   n = 1, ..., N,             (6.49)

with the boundary conditions

    U_0 = U(0) = 0,   U_{N+1} = U(t_{N+1}) = U(1) = 1.                      (6.50)

The system of equations can then be written as a linear system Ax = b, where

          1   [ -2  1             ]
    A = ----  [  1 -2  1          ]
        Δt^2  [     .  .  .       ],                                        (6.51)
              [       1 -2  1     ]
              [          1 -2     ]

    b = [ f(t_1) - α/Δt^2 ]   [ 6t_1 - 0/Δt^2 ]
        [ f(t_2)          ]   [ 6t_2          ]
        [   ...           ] = [   ...         ],                            (6.52)
        [ f(t_{N-1})      ]   [ 6t_{N-1}      ]
        [ f(t_N) - β/Δt^2 ]   [ 6t_N - 1/Δt^2 ]

with the solution vector

    x = [ U_1, U_2, ..., U_{N-1}, U_N ]^T.                                  (6.53)

This tridiagonal linear system is nonsingular and can easily be solved for x using the methods we learned in Chapter 2.
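The system (6.51)-(6.53) can be assembled and solved directly. The following sketch uses a dense solve for clarity (a tridiagonal solver would be used in practice) and compares against the exact solution u(t) = t^3; since u is cubic, the centered second difference of t^3 is exact, so the error here is essentially round-off.

```python
import numpy as np

# Build and solve the tridiagonal system for u'' = 6t, u(0) = 0, u(1) = 1.
N = 50
dt = 1.0 / (N + 1)
t = dt * np.arange(1, N + 1)          # interior mesh points t_1, ..., t_N

A = (np.diag(-2.0 * np.ones(N)) +
     np.diag(np.ones(N - 1), 1) +
     np.diag(np.ones(N - 1), -1)) / dt**2
b = 6.0 * t
b[0] -= 0.0 / dt**2                   # alpha = 0 moved to the right-hand side
b[-1] -= 1.0 / dt**2                  # beta = 1 moved to the right-hand side

U = np.linalg.solve(A, b)
print(np.max(np.abs(U - t**3)))       # tiny: the scheme is exact for cubics
```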

Chapter 7

Reviews on Partial Differential Equations and Difference Equations

1. Properties of PDEs

In this chapter, we study the key defining properties of partial differential equations (PDEs). First of all, there is more than one independent variable t, x, y, z, .... Associated with these is a so-called dependent variable u (of course, there could be more than one dependent variable), which is a function of the independent variables,

    u = u(x, y, z, t, ...).                                                 (7.1)

We now provide a number of basic definitions and examples on PDEs.

Definition: A PDE is a relation between the independent variables and the dependent variable u via the partial derivatives of u.

Definition: The order of a PDE is the order of the highest derivative that appears.

Example: F(x, y, u, u_x, u_y) = 0 is the most general form of a first-order PDE in two independent variables x and y.

Example: F(t, x, y, u, u_t, u_xx, u_xy, u_yy) = 0 is the most general form of a second-order PDE in three independent variables t, x and y.

Example: u_t - u_xx = 0 is a second-order PDE in two independent variables t and x.

Example: u_xxxx + (u_y)^3 = 0 is a fourth-order PDE in two independent variables x and y.

Definition: L is called a linear operator if L(u + v) = Lu + Lv (and L(cu) = cLu for any constant c) for any functions u and v.

Definition: A PDE Lu = 0 is called a linear PDE if L is a linear derivative operator.

Definition: A PDE Lu = g is called an inhomogeneous linear PDE if L is a linear derivative operator and g ≠ 0 is a given function of the independent variables. If g = 0, it is called a homogeneous linear PDE.

Example: The following PDEs are homogeneous linear: u_x + u_y = 0 (transport); u_x + y u_y = 0 (transport); u_xx + u_yy = 0 (Laplace's equation).

Example: The following PDEs are homogeneous nonlinear: u_x + u u_y = 0 (shock wave); u_tt - u_xx + u^3 = 0 (wave with interaction); u_t + u u_x + u_xxx = 0 (dispersive wave).

Example: The following PDE is inhomogeneous linear: cos(xy^2) u_x - y^2 u_y = tan(x^2 + y^2).

2. Well-posedness of PDEs

When solving PDEs, one often encounters a problem that has more than one solution (non-uniqueness) if too few auxiliary conditions are imposed; the problem is then called underdetermined. On the other hand, if too many conditions are given, there may be no solution at all (non-existence), and in this case the problem is overdetermined. The well-posedness property of PDEs is therefore required in order for us to be able to solve the given PDE system successfully. A well-posed PDE with proper initial and boundary conditions has the following fundamental properties:

1. Existence: there exists at least one solution u(x, t) satisfying all these conditions;

2. Uniqueness: there is at most one solution;

3. Stability: the unique solution u(x, t) depends in a stable manner on the data of the problem. This means that if the data are changed a little, the corresponding solution changes only a little as well.

3. Classifications of Second-order PDEs

PDEs arise in a number of physical phenomena to describe their natures. Some of the most popular types of such problems include fluid flows, heat transfer, solid mechanics and biological processes.
These types of equations often fall

into one of three types: (i) hyperbolic PDEs, which are associated with advection; (ii) parabolic PDEs, which are most commonly associated with diffusion; and (iii) elliptic PDEs, which most commonly describe steady states of either parabolic or hyperbolic problems. In reality, not many problems fall simply into one of these three types; rather, most of them involve combined types, e.g., advection-diffusion problems. Mathematically, however, we can rather easily determine the type of a general second-order PDE, which we briefly discuss here.

In general, let us consider the PDE of the form, with nonzero constants a_11, a_12, and a_22,

    a_11 u_xx + 2 a_12 u_xy + a_22 u_yy + a_1 u_x + a_2 u_y + a_0 u = 0,    (7.2)

which is a second-order linear equation in two independent variables x and y with six constant coefficients.

Theorem: By a linear transformation of the independent variables, the equation can be reduced to one of three forms:

1. Elliptic PDE: if a_12^2 < a_11 a_22, it is reducible to

       u_xx + u_yy + L.O.T. = 0,                                            (7.3)

   where L.O.T. denotes all the lower-order terms (first- or zeroth-order terms).

2. Hyperbolic PDE: if a_12^2 > a_11 a_22, it is reducible to

       u_xx - u_yy + L.O.T. = 0.                                            (7.4)

3. Parabolic PDE: if a_12^2 = a_11 a_22 (the condition for parabolic lies in between those for elliptic and hyperbolic), it is reducible to

       u_xx + L.O.T. = 0.                                                   (7.5)

Remark: Notice the similarity between the above classification and the one in analytic geometry. We know from analytic geometry that, given (again assuming nonzero constants a_11, a_12, and a_22)

    a_11 x^2 + 2 a_12 xy + a_22 y^2 + a_1 x + a_2 y + a_0 = 0,              (7.6)

Eq. 7.6 describes

1. an ellipse if a_12^2 < a_11 a_22,

2. a hyperbola if a_12^2 > a_11 a_22,

3. a parabola if a_12^2 = a_11 a_22.
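The classification by the discriminant a_12^2 - a_11 a_22 is mechanical enough to write down directly. A small sketch (note the convention of Eq. 7.2: the u_xy coefficient in the PDE is 2 a_12, so a coefficient c on u_xy enters as a_12 = c/2):

```python
def classify(a11, a12, a22):
    """Classify a11*u_xx + 2*a12*u_xy + a22*u_yy + L.O.T. = 0
    by the sign of the discriminant a12^2 - a11*a22."""
    d = a12**2 - a11 * a22
    if d < 0:
        return "elliptic"
    if d > 0:
        return "hyperbolic"
    return "parabolic"

print(classify(1.0, 0.0, 1.0))    # Laplace's equation: elliptic
print(classify(1.0, 0.0, -1.0))   # wave equation (y playing time): hyperbolic
print(classify(1.0, 0.0, 0.0))    # heat-like: parabolic
```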

Figure 1. Three major types of conic section from analytic geometry. Image source: Wikipedia.

Note again that the parabola lies in between the ellipse and the hyperbola. See Fig. 1 for an illustration.

Example: u_xx - 5 u_xy = 0 is hyperbolic; 4 u_xx - 12 u_xy + 9 u_yy + u_y = 0 is parabolic; 4 u_xx + 6 u_xy + 9 u_yy = 0 is elliptic.

Example: The wave equation is one of the most famous examples of hyperbolic PDEs. We write the wave equation as

    u_tt = c^2 u_xx   for -∞ < x < ∞,   c ≠ 0.                              (7.7)

Factoring the derivative operator, we get

    (∂/∂t - c ∂/∂x)(∂/∂t + c ∂/∂x) u = 0.                                   (7.8)

Considering the characteristic coordinates ξ = x + ct and η = x - ct, we obtain

    0 = (∂/∂t - c ∂/∂x)(∂/∂t + c ∂/∂x) u = (-2c ∂/∂η)(2c ∂/∂ξ) u = -4c^2 u_ξη.   (7.9)

Hence we conclude that the general solution must have the form u(x, t) = f(x + ct) + g(x - ct), the sum of two functions: one (g) is a wave of arbitrary shape traveling to the right at speed c, and the other (f), with another arbitrary shape, travels to the left at speed c. We call the two families of lines x ± ct = constant the characteristic lines of the wave equation.

Example: One very simple and famous example of parabolic PDEs is the so-called diffusion equation,

    u_t = k u_xx,   with k constant and (x, t) ∈ D_T.                       (7.10)
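The d'Alembert form u(x, t) = f(x + ct) + g(x - ct) can be sanity-checked numerically for any smooth choices of f and g (sin and cos are arbitrary picks here, not part of the notes), by comparing centered second differences in t and x:

```python
import numpy as np

# Check that u(x, t) = f(x + ct) + g(x - ct) satisfies u_tt = c^2 u_xx,
# using centered second differences; the residual is O(h^2 + k^2).
c = 2.0
f = np.sin                         # arbitrary smooth left-going profile
g = np.cos                         # arbitrary smooth right-going profile
u = lambda x, t: f(x + c * t) + g(x - c * t)

x, t, h, k = 0.7, 0.3, 1e-3, 1e-3
u_tt = (u(x, t + k) - 2 * u(x, t) + u(x, t - k)) / k**2
u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2
print(abs(u_tt - c**2 * u_xx))     # small: just the truncation error
```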

One of the important properties of diffusion equations is the maximum principle. Recall that the maximum principle says that if u(x, t) is the solution of Eq. 7.10 on D_T = [x_min, x_max] × [T_0, T] in space-time, then the maximum value of u(x, t) is attained only on the initial and domain boundaries of D_T. That is, the maximum value occurs either initially at t = T_0 or on the sides x = x_min or x = x_max.

Remark: The fundamental properties of the two types of PDEs can be briefly compared in the following table. The physical meanings in Table 1 are also illustrated in Fig. 2 and Fig. 3.

Table 1. Comparison of Waves and Diffusions: fundamental properties of the wave and diffusion equations are summarized.

    Property                        Waves                             Diffusions
    (1) speed of propagation        finite (≤ c)                      infinite
    (2) singularities for t > 0?    transported along characteristics lost immediately
                                    (with speed = c)
    (3) well-posed for t > 0?       yes                               yes (at least for bounded solutions)
    (4) well-posed for t < 0?       yes                               no
    (5) maximum principle?          no                                yes
    (6) behavior as t → ∞           energy is constant, so does not   decays to zero
                                    decay (i.e., simple advection
                                    without diffusion)
    (7) information                 transported                       lost gradually

4. Discretization

We consider the cell-centered (rather than cell-interface-centered) notation for discrete cells x_i and the conventional temporal discretization t^n:

    x_i = (i - 1/2) Δx,   i = 1, ..., N,                                    (7.11)
    t^n = n Δt,   n = 0, ..., M.                                            (7.12)

Then the cell-interface-centered grid points are written using half-integer indices:

    x_{i+1/2} = x_i + Δx/2.                                                 (7.13)

Definition: Let u_i^n = u(x_i, t^n) be the pointwise values of the exact solution of the PDE at the discrete points (x_i, t^n). This is the analytical solution of the PDE and satisfies it without any form of numerical error.

Figure 2. Domain and boundaries for the solution of hyperbolic PDEs in 2D. Note that any information or disturbance introduced at p is going to affect only the region called the region of influence, but nowhere else. Such information is propagated with the finite advection speed along the characteristic surface, which forms the conic region of influence. On the other hand, the characteristic surface can be extended backward in time to the place where the initial data is imposed. This forms another conic section on the lower part of the figure, which is called the domain of dependence.

Definition: Let U_i^n be the numerical approximation to the exact solution of the PDE. For instance,

    U_i^n ≈ u_i^n   for FDM.                                                (7.14)

Definition: Let D_i^n be the exact solution of the associated difference equation (DE) of the PDE, e.g., the forward-in-time, backward-in-space (FTBS) scheme:

    (D_i^{n+1} - D_i^n) / Δt = -a (D_i^n - D_{i-1}^n) / Δx.                 (7.15)

Since D_i^n is the exact solution of the DE, there are no round-off errors involved.

When we study numerical solutions of PDEs, the solutions are affected by numerical errors. They mainly come from two sources, and we are now ready to define them.

Definition: The discretization error E_d^n at (x_i, t^n) is defined by

    E_{d,i}^n = u_i^n - D_i^n.                                              (7.16)
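The FTBS scheme of Eq. 7.15 for the advection equation u_t + a u_x = 0 is simple enough to exercise directly. A minimal sketch, assuming a periodic domain (an assumption made here for convenience) and the cell-centered grid of Eq. 7.11; for 0 ≤ ν ≤ 1 each update is a convex combination of neighboring values, so no new extrema are created:

```python
import numpy as np

def ftbs_step(U, nu):
    """One FTBS update U_i <- U_i - nu*(U_i - U_{i-1});
    nu = a*dt/dx is the Courant number."""
    return U - nu * (U - np.roll(U, 1))

N = 200
x = (np.arange(N) + 0.5) / N          # cell centers, dx = 1/N
U = np.exp(-100 * (x - 0.5)**2)       # smooth initial pulse
nu = 0.8
for _ in range(N):                    # the pulse advects 0.8 of the domain
    U = ftbs_step(U, nu)
print(U.max() <= 1.0 + 1e-12)         # True: no new maxima (scheme is diffusive)
```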

Figure 3. Domain and boundaries for the solution of parabolic PDEs in 2D. Note that from a given point p in the mid plane, there is only one physically meaningful direction, which is positive in t. Therefore, any information at p influences the entire region onward from p, called the region of influence. Such information can only march forward in time, under the assumption that all boundary conditions around the surface and the initial condition are known.

Definition: The round-off error E_{r,i}^n at (x_i, t^n) is defined by

    E_{r,i}^n = D_i^n - U_i^n.                                              (7.17)

Definition: The global error E_{g,i}^n at (x_i, t^n) is defined by

    E_{g,i}^n = u_i^n - U_i^n.                                              (7.18)

Note that, by definition, E_{g,i}^n = E_{d,i}^n + E_{r,i}^n.

Definition: We say that the numerical method is convergent at t^n in a given norm ||·|| if

    lim_{Δx, Δt → 0} ||E_g^n|| = 0.                                         (7.19)

Remark: We note that the discretization error E_{d,i}^n is the sum of the truncation error E_{T,i}^n for the DE, Eq. 7.15, and any numerical errors E_{B,i}^n introduced by the numerical handling of boundary conditions.

Remark: We define the round-off error E_{r,i}^n as the numerical error introduced after a repetitive number of arithmetic computer operations, in which the computer constantly rounds off numbers to some number of significant digits.

5. The Fundamental Theorem of Numerical Methods: The Lax Equivalence Theorem for Linear PDEs

The ultimate goal in this chapter is to show (at least partially) one of the theorems that is very powerful in providing great insight into numerical differential equations. Briefly speaking, the theorem says that, for linear PDEs,

    consistency + stability ⟺ convergence.

Let us take a moment to think about the meaning of this theorem. It says that a numerical scheme converges to a (weak) solution provided the scheme is proven to be consistent (we are going to define this shortly) and stable. So, what is good about it? The good news is that in numerically solving many PDE systems, it is often very difficult to directly show convergence of a given numerical method, because not many PDEs have exact analytical solutions available (see the definition of convergence in Eq. 7.19). Without guaranteeing the existence of such analytical solutions, one cannot possibly say that one's numerical scheme converges to a mathematically meaningful and correct solution at all. A nice workaround is instead to look at numerical stability and consistency, which are based on a recurrence property of the numerical method acting on the discrete grid data. The Lax Equivalence theorem then indicates that such a numerical method is indeed a convergent method that produces a well-defined weak solution.

Now let's take a look at this nice theorem in more detail. First, we define a few more things.

Definition: Let N be the (linear) numerical operator mapping the approximate solution at one time step to the approximate solution at the next time step.
Then a general explicit numerical method can be written as

    U_i^{n+1} = N(U_i^n).                                                   (7.20)

We define the one-step error E_{step,i}^n by

    E_{step,i}^n = u_i^{n+1} - N(u_i^n),                                    (7.21)

and the local truncation error E_{LT,i}^n by

    E_{LT,i}^n = (1/Δt) E_{step,i}^n.                                       (7.22)

We have already discussed the order of a method previously, and we can now

define it again using the local truncation error.

Definition: We say that the numerical method is of order p (or pth-order accurate) if, for all sufficiently smooth data with compact support, the local truncation error satisfies

    E_{LT,i}^n = O(Δt^p, Δx^p).                                             (7.23)

Remark: One can obviously have a method with different orders of accuracy in space and time; i.e., a method that is pth-order accurate in time and rth-order accurate in space can be defined by

    E_{LT,i}^n = O(Δt^p, Δx^r).                                             (7.24)

In this case, the numerical solution in a fully resolved state, both temporally and spatially, will exhibit a convergence rate dominated by the lower of the two, i.e.,

    E_{LT,i}^n = O(Δt^s) = min[ O(Δt^p), O(Δx^r) ].                         (7.25)

5.1. Consistency

Let's now formally define consistency of a numerical method.

Definition: We say the numerical method is consistent with a proposed DE if

    lim_{Δt, Δx → 0} ||E_LT|| = 0                                           (7.26)

for all smooth functions u(x, t) that satisfy the given PDE.

Remark: In words, numerical consistency is a measure of whether the numerical operator N is in fact consistent with the DE of interest, in the sense that the method should introduce only a small error in any one step.

Remark: On the other hand, numerical stability is the property that the numerical method does not produce local errors that grow catastrophically, so that a bound on the global error can be obtained in terms of these local errors.

5.2. Stability Theory

The form of the stability bounds in this section provides useful information in analyzing linear methods. It has to be emphasized that for nonlinear methods the same technique we adopt for the linear case becomes hard to apply, and therefore one has to take a different approach to discuss nonlinear stability

(we will study such approach(es) later!). We limit our interest to the linear stability theory in this chapter.

In order to assess stability for linear PDEs, we essentially need to bound the global error E_{g,i}^n = u_i^n - U_i^n using a recurrence relation. Applying the linear numerical operator N to U_i^n, we obtain

    U_i^{n+1} = N(U_i^n) = N(u_i^n - E_{g,i}^n).                            (7.27)

The global error at t^{n+1} is now

    E_{g,i}^{n+1} = u_i^{n+1} - U_i^{n+1}                                   (7.28)
                  = u_i^{n+1} - N(u_i^n - E_{g,i}^n)                        (7.29)
                  = u_i^{n+1} - N(u_i^n) + N(u_i^n) - N(u_i^n - E_{g,i}^n)  (7.30)
                  = Δt E_{LT,i}^{n+1} - (N(u_i^n - E_{g,i}^n) - N(u_i^n)).  (7.31)

Note that the first term in Eq. 7.31 is the new one-step error introduced in this time step; this term is therefore related to the consistency control of the numerical method. On the other hand, the second term in parentheses is the effect of the numerical method on the previous global error E_{g,i}^n, and this is the term that has to do with the stability control.

Definition: We say the linear numerical method defined by the linear operator N is stable in ||·|| if there is a constant C such that

    ||N^n|| ≤ C   for n Δt ≤ T,                                             (7.32)

for each time T.

Note: We note here that the superscript n on N represents powers of the matrix (or linear operator) obtained by repeated applications of the linear operator N. This is, however, not true for nonlinear operators.

Remark: In particular, the numerical method is stable if ||N|| ≤ 1, since in this case we have

    ||N^n|| ≤ ||N||^n ≤ 1.                                                  (7.33)

Theorem: The Lax Equivalence Theorem for linear difference methods states that, for a well-posed, consistent linear method, stability is necessary and sufficient for convergence. A full proof can be found in the book by Richtmyer and Morton, Difference Methods for Initial-Value Problems, Wiley-Interscience, 1967; we only partially prove the sufficiency part of the claim:

    consistency + stability ⟹ convergence.

Proof: We are going to show

    lim_{Δt, Δx → 0} ||E_g^{n+1}|| = 0.                                     (7.34)

Since N is linear, Eq. 7.31 gives, recursively,

    ||E_g^{n+1}|| ≤ Δt ||E_LT^{n+1}|| + ||N(u^n - E_g^n) - N(u^n)||         (7.35)
                  = Δt ||E_LT^{n+1}|| + ||N(E_g^n)||                        (7.36)
                  ≤ Δt ||E_LT^{n+1}|| + ||N|| ||E_g^n||                     (7.37)
                  ≤ Δt ||E_LT^{n+1}|| + C ||E_g^n||                         (7.38)
                  ≤ Δt ||E_LT^{n+1}|| + C ( ||N|| ||E_g^{n-1}|| + Δt ||E_LT^n|| )   (7.39)
                  ≤ ...                                                     (7.40)
                  ≤ Δt Σ_{j=1}^{n+1} C^{n+1-j} ||E_LT^j|| + C^{n+1} ||E_g^0||       (7.41)
                  ≤ D (n+1) Δt ||E_LT|| + C ||E_g^0||                       (7.42)
                  = D t^{n+1} ||E_LT|| + C ||E_g^0||,                       (7.43)

where ||E_LT|| = max_{1 ≤ j ≤ n+1} ||E_LT^j||, for some constants C and D. Now if we let Δt → 0, then ||E_g^0|| → 0, since it is the global error in resolving the discrete initial data: it has to go to zero as the grid gets more and more refined, unless the initial data has some numerical error to start with (i.e., an ill-posed problem). Also, if we let Δt → 0, then ||E_LT|| → 0, since the method is consistent by assumption. Therefore ||E_g^{n+1}|| → 0 as Δx, Δt → 0, and the method is convergent.

Note: It is not hard to show that the sufficiency also holds when N is contractive, i.e.,

    ||N(P) - N(Q)|| ≤ ||P - Q||.                                            (7.44)

Remark: One can also say the method is stable in ||·|| if

    ||U^{n+1}|| ≤ ||U^n||                                                   (7.45)

for all n. To show this, let us assume Eq. 7.45. Recalling U^{n+1} = N(U^n), we have

    ||N(U^n)|| = ||U^{n+1}|| ≤ ||U^n||                                      (7.46)

for U^n ≠ 0. Since Eq. 7.45 is true for all n, we can take the supremum to get

    sup_{U ≠ 0} ||N(U)|| / ||U|| ≤ 1,                                       (7.47)

which gives

    ||N|| ≤ 1.                                                              (7.48)

Hence Eq. 7.45 implies the method is stable.
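The bound ||N|| ≤ 1 ⟹ ||N^n|| ≤ 1 of Eq. 7.33 can be illustrated concretely. A minimal sketch, assuming (as an illustration, not from the notes) the FTBS operator of Eq. 7.15 on a periodic grid: in matrix form N = (1 - ν)I + νS with S a cyclic shift, so in the infinity norm ||N|| = |1 - ν| + |ν| = 1 whenever 0 ≤ ν ≤ 1, and its powers stay bounded:

```python
import numpy as np

# Powers of the periodic FTBS update matrix stay bounded in the inf-norm.
m, nu = 50, 0.8
S = np.roll(np.eye(m), 1, axis=0)          # cyclic shift: (S U)_i = U_{i-1}
N_op = (1 - nu) * np.eye(m) + nu * S

def inf_norm(M):
    return np.max(np.sum(np.abs(M), axis=1))

P = np.eye(m)
norms = []
for n in range(100):
    P = P @ N_op
    norms.append(inf_norm(P))
print(max(norms))                          # stays at 1 (up to round-off)
```

For ν > 1 the row sums exceed 1 and the same powers grow without bound, which is the discrete face of instability.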

Chapter 8

Parabolic PDEs

We selectively follow Chapter 9 of the book Finite Difference Methods for Ordinary and Partial Differential Equations by Prof. Randy LeVeque, University of Washington.

Chapter 9: Diffusion Equations and Parabolic Problems

We now begin to study finite difference methods for time-dependent partial differential equations (PDEs), where variations in space are related to variations in time. We begin with the heat equation (or diffusion equation)

    u_t = κ u_xx.                                                           (9.1)

This is the classical example of a parabolic equation, and many of the general properties seen here carry over to the design of numerical methods for other parabolic equations. We will assume κ = 1 for simplicity, but some comments will be made about how the results scale to other values of κ > 0. (If κ < 0, then (9.1) would be a backward heat equation, which is an ill-posed problem.)

Along with this equation we need initial conditions at some time t_0, which we typically take to be t_0 = 0,

    u(x, 0) = η(x),                                                         (9.2)

and also boundary conditions if we are working on a bounded domain, e.g., the Dirichlet conditions

    u(0, t) = g_0(t)   for t > 0,
    u(1, t) = g_1(t)   for t > 0,                                           (9.3)

if 0 ≤ x ≤ 1. We have already studied the steady-state version of this equation and spatial discretizations of u_xx in Chapter 2. We have also studied discretizations of the time derivatives and some of the stability issues that arise with these discretizations in Chapters 5 through 8. Next we will put these two types of discretizations together.

In practice we generally apply a set of finite difference equations on a discrete grid with grid points (x_i, t_n), where

    x_i = ih,   t_n = nk.

Here h = Δx is the mesh spacing on the x-axis and k = Δt is the time step. Let U_i^n ≈ u(x_i, t_n) represent the numerical approximation at grid point (x_i, t_n).

Figure 9.1. Stencils for the methods (9.5) (panel (a)) and (9.7) (panel (b)), drawn on the grid points x_{j-1}, x_j, x_{j+1} at the time levels t_n and t_{n+1}.

Since the heat equation is an evolution equation that can be solved forward in time, we set up our difference equations in a form where we can march forward in time, determining the values U_i^{n+1} for all i from the values U_i^n at the previous time level, or perhaps also using values at earlier time levels with a multistep formula.

As an example, one natural discretization of (9.1) would be

    (U_i^{n+1} - U_i^n) / k = (1/h^2) (U_{i-1}^n - 2U_i^n + U_{i+1}^n).     (9.4)

This uses our standard centered difference in space and a forward difference in time. This is an explicit method, since we can compute each U_i^{n+1} explicitly in terms of the previous data:

    U_i^{n+1} = U_i^n + (k/h^2) (U_{i-1}^n - 2U_i^n + U_{i+1}^n).           (9.5)

Figure 9.1(a) shows the stencil of this method. This is a one-step method in time, which is also called a two-level method in the context of PDEs since it involves the solution at two different time levels.

Another one-step method, which is much more useful in practice, as we will see below, is the Crank-Nicolson method,

    (U_i^{n+1} - U_i^n) / k = (1/2)(D^2 U_i^n + D^2 U_i^{n+1})
                            = (1/(2h^2))(U_{i-1}^n - 2U_i^n + U_{i+1}^n + U_{i-1}^{n+1} - 2U_i^{n+1} + U_{i+1}^{n+1}),   (9.6)

which can be rewritten as

    U_i^{n+1} = U_i^n + (k/(2h^2))(U_{i-1}^n - 2U_i^n + U_{i+1}^n + U_{i-1}^{n+1} - 2U_i^{n+1} + U_{i+1}^{n+1}),         (9.7)

or

    -r U_{i-1}^{n+1} + (1 + 2r) U_i^{n+1} - r U_{i+1}^{n+1} = r U_{i-1}^n + (1 - 2r) U_i^n + r U_{i+1}^n,               (9.8)

where r = k/(2h^2). This is an implicit method and gives a tridiagonal system of equations to solve for all the values U_i^{n+1} simultaneously. In matrix form this is

    [ (1+2r)   -r                       ] [ U_1^{n+1}     ]
    [  -r    (1+2r)   -r                ] [ U_2^{n+1}     ]
    [          -r   (1+2r)  -r          ] [ U_3^{n+1}     ]
    [            .      .     .         ] [   ...         ]
    [              -r    (1+2r)   -r    ] [ U_{m-1}^{n+1} ]
    [                      -r   (1+2r)  ] [ U_m^{n+1}     ]

        [ r(g_0(t_n) + g_0(t_{n+1})) + (1-2r)U_1^n + r U_2^n     ]
        [ r U_1^n + (1-2r)U_2^n + r U_3^n                        ]
      = [ r U_2^n + (1-2r)U_3^n + r U_4^n                        ]          (9.9)
        [   ...                                                  ]
        [ r U_{m-2}^n + (1-2r)U_{m-1}^n + r U_m^n                ]
        [ r U_{m-1}^n + (1-2r)U_m^n + r(g_1(t_n) + g_1(t_{n+1})) ]

Note how the boundary conditions u(0, t) = g_0(t) and u(1, t) = g_1(t) come into these equations. Since a tridiagonal system of m equations can be solved with O(m) work, this method is essentially as efficient per time step as an explicit method. We will see in Section 9.4 that the heat equation is stiff, and hence this implicit method, which allows much larger time steps to be taken than an explicit method, is a very efficient method for the heat equation.

Solving a parabolic equation with an implicit method requires solving a system of equations with the same structure as the two-point boundary value problem we studied in Chapter 2. Similarly, a multidimensional parabolic equation requires solving a problem with the structure of a multidimensional elliptic equation in each time step; see Section 9.7.

9.1 Local truncation errors and order of accuracy

We can define the local truncation error as usual: we insert the exact solution u(x, t) of the PDE into the finite difference equation and determine by how much it fails to satisfy the discrete equation.

Example 9.1. The local truncation error of the method (9.5) is based on the form (9.4): τ_i^n = τ(x_i, t_n), where

    τ(x, t) = (u(x, t+k) - u(x, t))/k - (1/h^2)(u(x-h, t) - 2u(x, t) + u(x+h, t)).

Again we should be careful to use the form that directly models the differential equation in order to get powers of k and h that agree with what we hope to see in the global error. Although we don't know u(x, t) in general, if we assume it is smooth and use Taylor series expansions about u(x, t), we find that

    τ(x, t) = (u_t + (1/2) k u_tt + (1/6) k^2 u_ttt + ...) - (u_xx + (1/12) h^2 u_xxxx + ...).
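The Crank-Nicolson update (9.8)-(9.9) can be sketched as follows. This is a minimal sketch using a dense solve for clarity (the O(m) tridiagonal solve that the text mentions would be used in practice), with homogeneous Dirichlet data and the initial profile sin(πx), whose exact solution e^{-π² t} sin(πx) makes the error easy to check:

```python
import numpy as np

def cn_step(U, r, g0n, g0np1, g1n, g1np1):
    """One Crank-Nicolson step of (9.8)-(9.9); r = k/(2*h^2)."""
    m = len(U)
    A = ((1 + 2 * r) * np.eye(m)
         - r * np.eye(m, k=1) - r * np.eye(m, k=-1))
    rhs = r * np.roll(U, 1) + (1 - 2 * r) * U + r * np.roll(U, -1)
    rhs[0] = r * (g0n + g0np1) + (1 - 2 * r) * U[0] + r * U[1]
    rhs[-1] = r * U[-2] + (1 - 2 * r) * U[-1] + r * (g1n + g1np1)
    return np.linalg.solve(A, rhs)

m = 99
h = 1.0 / (m + 1)
x = h * np.arange(1, m + 1)
k = 10 * h**2                 # far beyond the explicit limit k <= h^2/2
r = k / (2 * h**2)
U = np.sin(np.pi * x)         # exact solution: exp(-pi^2 t) sin(pi x)
for _ in range(20):
    U = cn_step(U, r, 0.0, 0.0, 0.0, 0.0)
err = np.max(np.abs(U - np.exp(-np.pi**2 * 20 * k) * np.sin(np.pi * x)))
print(err)                    # small despite the large time step
```

The point of the experiment is the time step: k = 10 h² would blow up the explicit method (9.5), while Crank-Nicolson remains stable and accurate.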

Since u_t = u_xx, the O(1) terms drop out. By differentiating u_t = u_xx we find that u_tt = u_txx = u_xxxx, and so

    τ(x, t) = ((1/2) k - (1/12) h^2) u_xxxx + O(k^2 + h^4).

This method is said to be second-order accurate in space and first-order accurate in time, since the truncation error is O(h^2 + k).

The Crank-Nicolson method is centered in both space and time, and an analysis of its local truncation error shows that it is second-order accurate in both space and time,

    τ(x, t) = O(k^2 + h^2).

A method is said to be consistent if τ(x, t) → 0 as k, h → 0. Just as in the other cases we have studied (boundary value problems and initial value problems for ordinary differential equations (ODEs)), we expect that consistency, plus some form of stability, will be enough to prove that the method converges at each fixed point (X, T) as we refine the grid in both space and time. Moreover, we expect that for a stable method the global order of accuracy will agree with the order of the local truncation error; e.g., for Crank-Nicolson we expect that

    U_i^n - u(X, T) = O(k^2 + h^2)

as k, h → 0 when ih ≡ X and nk ≡ T are fixed. For linear PDEs, the fact that consistency plus stability is equivalent to convergence is known as the Lax equivalence theorem and is discussed in Section 9.5 after an introduction of the proper concept of stability. As usual, it is the definition and study of stability that is the hard (and interesting) part of this theory.

9.2 Method of lines discretizations

To understand how stability theory for time-dependent PDEs relates to the stability theory we have already developed for time-dependent ODEs, it is easiest to first consider the so-called method of lines (MOL) discretization of the PDE. In this approach we first discretize in space alone, which gives a large system of ODEs with each component of the system corresponding to the solution at some grid point, as a function of time.
The system of ODEs can then be solved using one of the methods for ODEs that we have previously studied. This system of ODEs is also often called a semidiscrete method, since we have discretized in space but not yet in time.

For example, we might discretize the heat equation (9.1) in space at grid point x_i by

    U_i'(t) = (1/h^2) (U_{i-1}(t) - 2U_i(t) + U_{i+1}(t))   for i = 1, 2, ..., m,   (9.10)

where the prime now means differentiation with respect to time. We can view this as a coupled system of m ODEs for the variables U_i(t), which vary continuously in time along the lines shown in Figure 9.2. This system can be written as

    U'(t) = A U(t) + g(t),                                                  (9.11)

Figure 9.2. Method of lines interpretation. U_i(t) is the solution along the line forward in time at the grid point x_i, for i = 0, 1, 2, ..., m, m+1.

where the tridiagonal matrix A is exactly as in (2.9) and g(t) includes the terms needed for the boundary conditions, U_0(t) ≡ g_0(t) and U_{m+1}(t) ≡ g_1(t):

         1   [ -2  1             ]            1   [ g_0(t) ]
    A = ---  [  1 -2  1          ]           ---  [   0    ]
        h^2  [     .  .  .       ],   g(t) = h^2  [  ...   ]                (9.12)
             [       1 -2  1     ]                [   0    ]
             [          1 -2     ]                [ g_1(t) ]

This MOL approach is sometimes used in practice by first discretizing in space and then applying a software package for systems of ODEs. There are also packages that are specially designed to apply MOL. This approach has the advantage of being relatively easy to apply to a fairly general set of time-dependent PDEs, but the resulting method is often not as efficient as specially designed methods for the PDE. See Section 11.2 for more discussion of this.

As a tool in understanding stability theory, however, the MOL discretization is extremely valuable, and this is the main use we will make of it. We know how to analyze the stability of ODE methods applied to a linear system of the form (9.11) based on the eigenvalues of the matrix A, which now depend on the spatial discretization.

If we apply an ODE method to discretize the system (9.11), we will obtain a fully discrete method which produces approximations U_i^n ≈ U_i(t_n) at discrete points in time, which are exactly the points (x_i, t_n) of the grid that we introduced at the beginning of this chapter. For example, applying Euler's method U^{n+1} = U^n + k f(U^n) to this linear system results in the fully discrete method (9.5). Applying instead the trapezoidal method (5.22) results in the Crank-Nicolson method (9.7). Applying a higher-order linear multistep or Runge-Kutta method would give a different method, although with the spatial discretization (9.10) the overall method would be only second-order accurate in space. Replacing

the right-hand side of (9.10) with a higher order approximation to $u_{xx}(x_i)$ and then using a higher order time discretization would give a more accurate method.

9.3 Stability theory

We can now investigate the stability of schemes like (9.5) or (9.7) since these can be interpreted as standard ODE methods applied to the linear system (9.11). We expect the method to be stable if $k\lambda \in S$, i.e., if the time step $k$ multiplied by any eigenvalue $\lambda$ of $A$ lies in the stability region of the ODE method, as discussed in Chapter 7. (Note that $A$ is symmetric and hence normal, so eigenvalues are the right thing to look at.) We have determined the eigenvalues of $A$ in (2.23),
$$\lambda_p = \frac{2}{h^2}\bigl(\cos(p\pi h) - 1\bigr) \quad \text{for } p = 1, 2, \ldots, m, \tag{9.13}$$
where again $m$ and $h$ are related by $h = 1/(m+1)$. Note that there is a new wrinkle here relative to the ODEs we considered in Chapter 7: the eigenvalues $\lambda_p$ depend on the mesh width $h$. As we refine the grid and $h \to 0$, the dimension of $A$ increases, the number of eigenvalues we must consider increases, and the values of the eigenvalues change. This is something we must bear in mind when we attempt to prove convergence as $k, h \to 0$.

To begin, however, let's consider the simpler question of how the method behaves for some fixed $k$ and $h$, i.e., the question of absolute stability in the ODE sense. Then it is clear that the method is absolutely stable (i.e., the effect of past errors will not grow exponentially in future time steps) provided that $k\lambda_p \in S$ for each $p$, where $S$ is the stability region of the ODE method, as discussed in Chapter 7. For the matrix (9.12) coming from the heat equation, the eigenvalues lie on the negative real axis and the one farthest from the origin is $\lambda_m \approx -4/h^2$. Hence we require that $-4k/h^2 \in S$ (assuming the stability region is connected along the negative real axis up to the origin, as is generally the case).

Example 9.2.
If we use Euler's method to obtain the discretization (9.5), then we must require $|1 + k\lambda| \le 1$ for each eigenvalue (see Chapter 7), and hence we require $-2 \le -4k/h^2 \le 0$. This limits the time step allowed to
$$k \le \frac{h^2}{2}. \tag{9.14}$$
This is a severe restriction: the time step must decrease at the rate of $h^2$ as we refine the grid, which is much smaller than the spatial width $h$ when $h$ is small.

Example 9.3. If we use the trapezoidal method, we obtain the Crank–Nicolson discretization (9.7). The trapezoidal method for the ODE is absolutely stable in the whole left half-plane and the eigenvalues (9.13) are always negative. Hence the Crank–Nicolson method is stable for any time step $k > 0$. Of course it may not be accurate if $k$ is too large. Generally we must take $k = O(h)$ to obtain a reasonable solution, and the unconditional stability allows this.

9.4 Stiffness of the heat equation

Note that the system of ODEs we are solving is quite stiff, particularly for small $h$. The eigenvalues of $A$ lie on the negative real axis with one fairly close to the origin, $\lambda_1 \approx -\pi^2$

for all $h$, while the largest in magnitude is $\lambda_m \approx -4/h^2$. The stiffness ratio of the system is $4/(\pi^2 h^2)$, which grows rapidly as $h \to 0$. As a result the explicit Euler method is stable only for very small time steps $k \le \frac{1}{2}h^2$. This is typically much smaller than what we would like to use over physically meaningful times, and a method designed for stiff problems will be more efficient.

The stiffness is a reflection of the very different time scales present in solutions to the physical problem modeled by the heat equation. High frequency spatial oscillations in the initial data will decay very rapidly due to rapid diffusion over very short distances, while smooth data decay much more slowly, since diffusion over long distances takes much longer. This is apparent from the Fourier analysis of Section E.3.3 or is easily seen by writing down the exact solution to the heat equation on $0 \le x \le 1$ with $g_0(t) = g_1(t) \equiv 0$ as a Fourier sine series:
$$u(x,t) = \sum_{j=1}^{\infty} \hat u_j(t)\sin(j\pi x).$$
Inserting this in the heat equation gives the ODEs
$$\hat u_j'(t) = -j^2\pi^2\,\hat u_j(t) \quad \text{for } j = 1, 2, \ldots \tag{9.15}$$
and so $\hat u_j(t) = e^{-j^2\pi^2 t}\,\hat u_j(0)$, with the $\hat u_j(0)$ determined as the Fourier coefficients of the initial data $\eta(x)$. We can view (9.15) as an infinite system of ODEs, but one that is decoupled, so that the coefficient matrix is diagonal, with eigenvalues $-j^2\pi^2$ for $j = 1, 2, \ldots$. By choosing data with sufficiently rapid oscillation (large $j$), we can obtain arbitrarily rapid decay. For general initial data there may be some transient period when any high wave numbers are rapidly damped, but then the long-time behavior is dominated by the slower decay rates. See Figure 9.3 for some examples of the time evolution with different sets of data.
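The time step restriction $k \le h^2/2$ is easy to observe experimentally. The sketch below is my own illustrative code, not from the text; it assumes homogeneous Dirichlet data and seeds the smooth initial data with a small high-wave-number perturbation (the mode that goes unstable first), then marches the explicit method with time steps on either side of the limit:

```python
import numpy as np

def max_after(nsteps, k, h):
    # march u_t = u_xx with forward Euler in time, centered differences in space
    m = int(round(1.0 / h)) - 1              # interior points, h = 1/(m+1)
    x = np.linspace(0.0, 1.0, m + 2)
    # smooth data plus a small highest-wave-number perturbation
    U = np.sin(np.pi * x) + 0.01 * np.sin(m * np.pi * x)
    for _ in range(nsteps):
        U[1:-1] += (k / h**2) * (U[:-2] - 2.0 * U[1:-1] + U[2:])
    return np.abs(U).max()

h = 0.05
print(max_after(200, 0.4 * h**2, h))   # k < h^2/2: everything decays
print(max_after(200, 0.6 * h**2, h))   # k > h^2/2: the high mode blows up
```

With $k = 0.6h^2$ the factor $|1 + k\lambda_m| \approx 1.39$ per step compounds rapidly, so even the tiny perturbation dominates after a couple hundred steps.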
If we are solving the problem over the long periods needed to track this slow diffusion, then we would ultimately (after any physical transients have decayed) like to use rather large time steps, since typically the variation in time is then on roughly the same scale as the variations in space. We would generally like to have $k \approx h$ so that we have roughly the same resolution in time as we do in space. A method that requires $k \approx h^2$ forces us to take a much finer temporal discretization than we should need to represent smooth solutions. If $h = 0.001$, for example, then if we must take $k = h^2$ rather than $k = h$, we would need to take 1000 time steps to cover each time interval that should be well modeled by a single time step. This is the same difficulty we encountered with stiff ODEs in Chapter 8.

Note: The remark above that we want $k \approx h$ is reasonable assuming the method we are using has the same order of accuracy in both space and time. The method (9.5) does not have this property. Since the error is $O(k + h^2)$, we might want to take $k = O(h^2)$ just to get the same level of accuracy in both space and time. In this sense the stability restriction $k = O(h^2)$ may not seem unreasonable, but this is simply another reason for not wanting to use this particular method in practice.

Note: The general diffusion equation is $u_t = \kappa u_{xx}$, and in practice the diffusion coefficient $\kappa$ may be different from 1 by many orders of magnitude. How does this affect our conclusions above? We would expect by scaling considerations that we should take

[Figure 9.3. Solutions to the heat equation at three different times (columns) shown for three different sets of initial conditions (rows). In the top row $u_1(x,t_0)$ consists of only a low wave number, which decays slowly. The middle row shows data $u_2(x,t_0)$ consisting of a higher wave number, which decays more quickly. The bottom row shows data $u_3(x,t_0)$ that contains a mixture of wave numbers. The high wave numbers are most rapidly damped (an initial rapid transient), while at later times only the lower wave numbers are still visible and decaying slowly.]

$k \approx h/\kappa$ in order to achieve comparable resolution in space and time, i.e., we would like to take $\kappa k/h \approx 1$. (Note that $\hat u_j(t) = \exp(-\kappa j^2\pi^2 t)\,\hat u_j(0)$ in this case.) With the MOL discretization we obtain the system (9.11), but $A$ now has a factor $\kappa$ in front. For stability we thus require $-4\kappa k/h^2 \in S$, which requires $\kappa k/h^2$ to be order 1 for any explicit method. This is smaller than what we wish to use by a factor of $h$, regardless of the magnitude of $\kappa$. So our conclusions on stiffness are unchanged by $\kappa$. In particular, even when the diffusion coefficient is very small it is best to use an implicit method, because we then want to take very long time steps $k \approx h/\kappa$.

These comments apply to the case of pure diffusion. If we are solving an advection-diffusion or reaction-diffusion equation where there are other time scales determined by other phenomena, then if the diffusive term has a very small coefficient we may be able to use an explicit method efficiently because of other restrictions on the time step.

Note: The physical problem of diffusion is infinitely stiff in the sense that (9.15) has eigenvalues $-j^2\pi^2$ with arbitrarily large magnitude, since $j$ can be any integer. Luckily the discrete problem is not this stiff.
It is not stiff because, once we discretize in space, only a finite number of spatial wave numbers can be represented and we obtain the finite

set of eigenvalues (9.13). As we refine the grid we can represent higher and higher wave numbers, leading to the increasing stiffness ratio as $h \to 0$.

9.5 Convergence

So far we have only discussed absolute stability and determined the relation between $k$ and $h$ that must be satisfied to ensure that errors do not grow exponentially as we march forward in time on this fixed grid. We now address the question of convergence at a fixed point $(x, T)$ as the grid is refined. It turns out that in general exactly the same relation between $k$ and $h$ must now be required to hold as we vary $k$ and $h$, letting both go to zero. In other words, we cannot let $k$ and $h$ go to zero at arbitrary independent rates and necessarily expect the resulting approximations to converge to the solution of the PDE. For a particular sequence of grids $(k_1, h_1), (k_2, h_2), \ldots$, with $k_j \to 0$ and $h_j \to 0$, we will expect convergence only if the proper relation ultimately holds for each pair. For the method (9.5), for example, the sequence of approximations will converge only if $k_j/h_j^2 \le 1/2$ for all $j$ sufficiently large. It is sometimes easiest to think of $k$ and $h$ as being related by some fixed rule (e.g., we might choose $k = 0.4h^2$ for the method (9.5)), so that we can speak of convergence as $k \to 0$ with the understanding that this relation holds on each grid.

The methods we have studied so far can be written in the form
$$U^{n+1} = B(k)U^n + b^n(k) \tag{9.16}$$
for some matrix $B(k) \in \mathbb{R}^{m \times m}$ on a grid with $h = 1/(m+1)$ and $b^n(k) \in \mathbb{R}^m$. In general these depend on both $k$ and $h$, but we will assume some fixed rule is specified relating $h$ to $k$ as $k \to 0$. For example, applying forward Euler to the MOL system (9.11) gives
$$B(k) = I + kA, \tag{9.17}$$
where $A$ is the tridiagonal matrix in (9.12). The Crank–Nicolson method results from applying the trapezoidal method to (9.11), which gives
$$B(k) = \Bigl(I - \frac{k}{2}A\Bigr)^{-1}\Bigl(I + \frac{k}{2}A\Bigr). \tag{9.18}$$
To prove convergence we need consistency and a suitable form of stability.
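The fixed-rule refinement just described can be tried numerically. The following sketch is my own illustrative code, not the text's; it assumes the exact solution $u(x,t) = e^{-\pi^2 t}\sin(\pi x)$ as test data, applies forward Euler with the rule $k = 0.4h^2$ on two grids, and compares the errors, which should drop by roughly a factor of 4 per halving of $h$ since the error is $O(k + h^2) = O(h^2)$ under this rule:

```python
import numpy as np

def solve_heat_fe(m, T):
    # forward Euler for u_t = u_xx, initial data sin(pi x), fixed rule k = 0.4 h^2
    h = 1.0 / (m + 1)
    k = 0.4 * h**2
    n = int(np.ceil(T / k))
    k = T / n                       # shrink k slightly so that n*k = T exactly
    x = np.linspace(0.0, 1.0, m + 2)
    U = np.sin(np.pi * x)
    for _ in range(n):
        U[1:-1] += (k / h**2) * (U[:-2] - 2.0 * U[1:-1] + U[2:])
    exact = np.exp(-np.pi**2 * T) * np.sin(np.pi * x)
    return np.abs(U - exact).max()

e_coarse = solve_heat_fe(19, 0.1)   # h = 0.05
e_fine = solve_heat_fe(39, 0.1)     # h = 0.025
print(e_coarse / e_fine)            # roughly 4: second order convergence
```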
As usual, consistency requires that the local truncation error vanishes as $k \to 0$. The form of stability that we need is often called Lax–Richtmyer stability.

Definition 9.1. A linear method of the form (9.16) is Lax–Richtmyer stable if, for each time $T$, there is a constant $C_T > 0$ such that
$$\|B(k)^n\| \le C_T \tag{9.19}$$
for all $k > 0$ and integers $n$ for which $kn \le T$.

Theorem 9.2 (Lax Equivalence Theorem). A consistent linear method of the form (9.16) is convergent if and only if it is Lax–Richtmyer stable.

For more discussion and a proof see [75]. The main idea is the same as our proof in Section 6.3.1 that Euler's method converges on a linear problem. If we apply the numerical method to the exact solution $u(x,t)$, we obtain
$$u^{n+1} = Bu^n + b^n + k\tau^n, \tag{9.20}$$
where we suppress the dependence on $k$ for clarity and where
$$u^n = \begin{bmatrix} u(x_1, t_n) \\ u(x_2, t_n) \\ \vdots \\ u(x_m, t_n) \end{bmatrix}, \qquad \tau^n = \begin{bmatrix} \tau(x_1, t_n) \\ \tau(x_2, t_n) \\ \vdots \\ \tau(x_m, t_n) \end{bmatrix}.$$
Subtracting (9.20) from (9.16) gives the difference equation for the global error $E^n \equiv U^n - u^n$:
$$E^{n+1} = BE^n - k\tau^n,$$
and hence, after $N$ time steps,
$$E^N = B^N E^0 - k\sum_{n=1}^{N} B^{N-n}\tau^{n-1},$$
from which we obtain
$$\|E^N\| \le \|B^N\|\,\|E^0\| + k\sum_{n=1}^{N}\|B^{N-n}\|\,\|\tau^{n-1}\|. \tag{9.21}$$
If the method is Lax–Richtmyer stable, then for $Nk \le T$,
$$\|E^N\| \le C_T\|E^0\| + TC_T \max_{1 \le n \le N}\|\tau^{n-1}\| \to 0 \quad \text{as } k \to 0,$$
provided the method is consistent ($\|\tau\| \to 0$) and we use appropriate initial data ($\|E^0\| \to 0$ as $k \to 0$).

Example 9.4. For the heat equation the matrix $A$ from (9.12) is symmetric, and hence the 2-norm is equal to the spectral radius, and the same is true of the matrix $B$ from (9.17). From (9.13) we see that $\|B(k)\|_2 \le 1$ provided (9.14) is satisfied, and so the method is Lax–Richtmyer stable and hence convergent under this restriction on the time step. Similarly, the matrix $B$ of (9.18) is symmetric and has eigenvalues $(1 + k\lambda_p/2)/(1 - k\lambda_p/2)$, and so the Crank–Nicolson method is stable in the 2-norm for any $k > 0$.

For the methods considered so far we have obtained $\|B\| \le 1$. This is called strong stability. But note that this is not necessary for Lax–Richtmyer stability. If there is a constant $\alpha$ so that a bound of the form
$$\|B(k)\| \le 1 + \alpha k \tag{9.22}$$

holds in some norm (at least for all $k$ sufficiently small), then we will have Lax–Richtmyer stability in this norm, since $\|B(k)^n\| \le (1 + \alpha k)^n \le e^{\alpha T}$ for $nk \le T$. Note that the matrix $B(k)$ depends on $k$, and its dimension $m = O(1/h)$ grows as $k, h \to 0$. The general theory of stability in the sense of uniform power boundedness of such families of matrices is often nontrivial.

9.5.1 PDE versus ODE stability theory

It may bother you that the stability we need for convergence now seems to depend on absolute stability, and on the shape of the stability region for the time-discretization, which determines the required relationship between $k$ and $h$. Recall that in the case of ODEs all we needed for convergence was zero-stability, which does not depend on the shape of the stability region except for the requirement that the point $z = 0$ must lie in this region.

Here is the difference: with ODEs we were studying a fixed system of ODEs of fixed dimension, and the fixed set of eigenvalues $\lambda$ was independent of $k$. For convergence we needed $k\lambda$ in the stability region as $k \to 0$, but since these values all converge to 0, it is only the origin that is important, at least to prove convergence as $k \to 0$. Hence the need for zero-stability. With PDEs, on the other hand, in our MOL discretization the system of ODEs grows as we refine the grid, and the eigenvalues grow in magnitude as $k$ and $h$ go to zero. So it is not clear that $k\lambda$ will go to zero, and zero-stability is not sufficient. For the heat equation with $k/h^2$ fixed, these values do not go to zero as $k \to 0$. For convergence we must now require that these values at least lie in the region of absolute stability as $k \to 0$, and this gives the stability restriction relating $k$ and $h$. If we keep $k/h$ fixed as $k, h \to 0$, then $k\lambda \to -\infty$ for the eigenvalues of the matrix $A$ from (9.12). We must use an implicit method that includes the entire negative real axis in its stability region.
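Since the forward Euler iteration matrix $B = I + kA$ is symmetric, $\|B^n\|_2 = \rho(B)^n$, so Lax–Richtmyer stability under a fixed rule like $k = 0.4h^2$ can be checked by computing a spectral radius. A small numerical sketch (my code, not the text's):

```python
import numpy as np

m = 20
h = 1.0 / (m + 1)
k = 0.4 * h**2                       # fixed rule with k/h^2 = 0.4 < 1/2
A = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
     + np.diag(np.ones(m - 1), -1)) / h**2
B = np.eye(m) + k * A                # forward Euler: B(k) = I + kA, cf. (9.17)
# B is symmetric, so ||B^n||_2 = rho(B)^n; rho(B) <= 1 gives (9.19) with C_T = 1
rho = np.abs(np.linalg.eigvalsh(B)).max()
print(rho)                           # a value below 1 for this k/h^2 ratio
```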
We also notice another difference between stability theory for ODEs and PDEs: for the ODE $u'(t) = f(u(t))$ we could prove convergence of standard methods for any Lipschitz continuous function $f(u)$. For example, the proof of convergence of Euler's method for the linear case, found in Section 6.3.1, was easily extended to nonlinear functions. In the PDE case, the Lax equivalence theorem is much more limited: it applies only to linear methods (9.16), and such methods typically only arise when discretizing linear PDEs such as the heat equation. It is possible to prove stability of many methods for nonlinear PDEs by showing that a suitable form of stability holds, but a variety of different techniques must be used, depending on the character of the differential equation, and there is no general theory of the sort obtained for ODEs.

The essential difficulty is that even a linear PDE such as the heat equation $u_t = \partial_x^2 u$ involves an operator $\partial_x^2$ on the right-hand side that is not Lipschitz continuous in a function space norm of the sort introduced in Section A.4. Discretizing on a grid, we approximate $\partial_x^2 u$ by $f(U) = AU$, which is Lipschitz continuous, but the Lipschitz constant $\|A\|$ grows at the rate of $1/h^2$ as the grid is refined. In the nonlinear case it is often difficult to obtain the sort of bounds needed to prove convergence. See [40], [68], [75], or [84] for further discussions of stability.

9.6 Von Neumann analysis

Although it is useful to go through the MOL formulation to understand how stability theory for PDEs is related to the theory for ODEs, in practice there is another approach that will sometimes give the proper stability restrictions more easily. The von Neumann approach to stability analysis is based on Fourier analysis and hence is generally limited to constant coefficient linear PDEs. For simplicity it is usually applied to the Cauchy problem, which is the PDE on all space with no boundaries, $-\infty < x < \infty$ in the one-dimensional case. Von Neumann analysis can also be used to study the stability of problems with periodic boundary conditions, e.g., in $0 \le x \le 1$ with $u(0,t) = u(1,t)$ imposed. This is generally equivalent to a Cauchy problem with periodic initial data. Stability theory for PDEs with more general boundary conditions can often be quite difficult, as the coupling between the discretization of the boundary conditions and the discretization of the PDE can be very subtle. Von Neumann analysis addresses the issue of stability of the PDE discretization alone. Some discussion of stability theory for initial boundary value problems can be found in [84], [75]. See also Section 10.2.

The Cauchy problem for linear PDEs can be solved using Fourier transforms; see Section E.3 for a review. The basic reason this works is that the functions $e^{i\xi x}$ with wave number $\xi$ = constant are eigenfunctions of the differential operator $\partial_x$:
$$\partial_x e^{i\xi x} = i\xi e^{i\xi x},$$
and hence of any constant coefficient linear differential operator. Von Neumann analysis is based on the fact that the related grid function $W_j = e^{i\xi jh}$ is an eigenfunction of any translation-invariant finite difference operator. For example, if we approximate $v'(x_j)$ by $D_0V_j = \frac{1}{2h}(V_{j+1} - V_{j-1})$, then in general the grid function $D_0V$ is not a scalar multiple of $V$.
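For the special grid function $W_j = e^{i\xi jh}$, however, $D_0W$ does become a scalar multiple of $W$, and the scalar is easy to check numerically. A quick sketch (my code, not from the text):

```python
import numpy as np

h, xi = 0.1, 7.0
j = np.arange(-50, 51)                       # a stretch of grid indices
W = np.exp(1j * xi * j * h)                  # W_j = e^{i xi j h}
# centered difference D0 W_j = (W_{j+1} - W_{j-1}) / (2h)
D0W = (np.exp(1j * xi * (j + 1) * h) - np.exp(1j * xi * (j - 1) * h)) / (2 * h)
lam = (1j / h) * np.sin(xi * h)              # the predicted eigenvalue of D0
print(np.abs(D0W - lam * W).max())           # near machine epsilon
```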
But for the special case of $W$, we obtain
$$D_0W_j = \frac{1}{2h}\bigl(e^{i\xi(j+1)h} - e^{i\xi(j-1)h}\bigr) = \frac{1}{2h}\bigl(e^{i\xi h} - e^{-i\xi h}\bigr)e^{i\xi jh} = \frac{i}{h}\sin(\xi h)\,e^{i\xi jh} = \frac{i}{h}\sin(\xi h)\,W_j. \tag{9.23}$$
So $W$ is an eigengridfunction of the operator $D_0$, with eigenvalue $\frac{i}{h}\sin(\xi h)$. Note the relation between these and the eigenfunctions and eigenvalues of the operator $\partial_x$ found earlier: $W_j$ is simply the eigenfunction $w(x) = e^{i\xi x}$ evaluated at the point $x_j$, and for small $h$ we can approximate the eigenvalue of $D_0$ by

(In this section $i = \sqrt{-1}$ and the index $j$ is used on the grid functions.)

$$\frac{i}{h}\sin(\xi h) = \frac{i}{h}\Bigl(\xi h - \frac{1}{6}\xi^3h^3 + O(\xi^5h^5)\Bigr) = i\xi - \frac{i}{6}\xi^3h^2 + \cdots.$$
This agrees with the eigenvalue $i\xi$ of $\partial_x$ to $O(\xi^3h^2)$.

Suppose we have a grid function $V_j$ defined at grid points $x_j = jh$ for $j = 0, \pm 1, \pm 2, \ldots$, which is an $\ell^2$ function in the sense that the 2-norm
$$\|V\|_2 = \Bigl(h\sum_{j=-\infty}^{\infty}|V_j|^2\Bigr)^{1/2}$$
is finite. Then we can express $V_j$ as a linear combination of the grid functions $e^{i\xi jh}$ for all $\xi$ in the range $-\pi/h \le \xi \le \pi/h$. Functions with larger wave number cannot be resolved on this grid. We can write
$$V_j = \frac{1}{\sqrt{2\pi}}\int_{-\pi/h}^{\pi/h}\hat V(\xi)\,e^{i\xi jh}\,d\xi,$$
where
$$\hat V(\xi) = \frac{h}{\sqrt{2\pi}}\sum_{j=-\infty}^{\infty}V_j\,e^{-i\xi jh}.$$
These are direct analogues in the discrete case of the formulas for a function $v(x)$. Again we have Parseval's relation, $\|\hat V\|_2 = \|V\|_2$, although the 2-norms used for the grid function $V_j$ and the function $\hat V(\xi)$ defined on $[-\pi/h, \pi/h]$ are different:
$$\|V\|_2 = \Bigl(h\sum_{j=-\infty}^{\infty}|V_j|^2\Bigr)^{1/2}, \qquad \|\hat V\|_2 = \Bigl(\int_{-\pi/h}^{\pi/h}|\hat V(\xi)|^2\,d\xi\Bigr)^{1/2}.$$
To show that a finite difference method is stable in the 2-norm by the techniques discussed earlier in this chapter, we would have to show that $\|B\|_2 \le 1 + \alpha k$ in the notation of (9.22). This amounts to showing that there is a constant $\alpha$ such that
$$\|U^{n+1}\|_2 \le (1 + \alpha k)\|U^n\|_2$$
for all $U^n$. This can be difficult to attack directly, because computing $\|U\|_2$ requires summing over all grid points, and each $U_j^{n+1}$ depends on values of $U^n$ at neighboring grid points, so that all grid points are coupled together. In some cases one can work with these infinite sums directly, but it is rare that this can be done. Alternatively one can work with the matrix $B$ itself, as we did above in Section 9.5, but this matrix grows as we refine the grid. Using Parseval's relation, we see that it is sufficient to instead show that
$$\|\hat U^{n+1}\|_2 \le (1 + \alpha k)\|\hat U^n\|_2,$$

where $\hat U^n$ is the Fourier transform of the grid function $U^n$. The utility of Fourier analysis now stems from the fact that after Fourier transforming the finite difference method, we obtain a recurrence relation for each $\hat U^n(\xi)$ that is decoupled from all other wave numbers. For a two-level method this has the form
$$\hat U^{n+1}(\xi) = g(\xi)\,\hat U^n(\xi). \tag{9.24}$$
The factor $g(\xi)$, which depends on the method, is called the amplification factor for the method at wave number $\xi$. If we can show that
$$|g(\xi)| \le 1 + \alpha k,$$
where $\alpha$ is independent of $\xi$, then it follows that the method is stable, since then $|\hat U^{n+1}(\xi)| \le (1 + \alpha k)|\hat U^n(\xi)|$ for all $\xi$, and so
$$\|\hat U^{n+1}\|_2 \le (1 + \alpha k)\|\hat U^n\|_2.$$
Fourier analysis allows us to obtain simple scalar recursions of the form (9.24) for each wave number separately, rather than dealing with a system of equations for $U_j^n$ that couples together all values of $j$.

Note: Here we are assuming that $u(x,t)$ is a scalar, so that $g(\xi)$ is a scalar. For a system of $s$ equations we would find that $g(\xi)$ is an $s \times s$ matrix for each value of $\xi$, so some analysis of matrix eigenvalues is still required to investigate stability. But the dimension of the matrices is $s$, independent of the grid spacing, unlike the MOL analysis, where the matrix dimension increases as $k \to 0$.

Example 9.5. Consider the method (9.5). To apply von Neumann analysis we consider how this method works on a single wave number, i.e., we set
$$U_j^n = e^{i\xi jh}. \tag{9.25}$$
Then we expect that
$$U_j^{n+1} = g(\xi)\,e^{i\xi jh}, \tag{9.26}$$
where $g(\xi)$ is the amplification factor for this wave number. Inserting these expressions into (9.5) gives
$$g(\xi)\,e^{i\xi jh} = e^{i\xi jh} + \frac{k}{h^2}\bigl(e^{i\xi(j-1)h} - 2e^{i\xi jh} + e^{i\xi(j+1)h}\bigr) = \Bigl(1 + \frac{k}{h^2}\bigl(e^{-i\xi h} - 2 + e^{i\xi h}\bigr)\Bigr)e^{i\xi jh},$$
and hence
$$g(\xi) = 1 + \frac{2k}{h^2}\bigl(\cos(\xi h) - 1\bigr).$$
Since $-1 \le \cos(\xi h) \le 1$ for any value of $\xi$, we see that
$$1 - \frac{4k}{h^2} \le g(\xi) \le 1$$
for all $\xi$. We can guarantee that $|g(\xi)| \le 1$ for all $\xi$ if we require

$$\frac{4k}{h^2} \le 2.$$
This is exactly the stability restriction (9.14) we found earlier for this method. If this restriction is violated, then the Fourier components with some wave number $\xi$ will be amplified (and, as expected, it is the largest wave numbers that become unstable first as $k$ is increased).

Example 9.6. The fact that the Crank–Nicolson method is stable for all $k$ and $h$ can also be shown using von Neumann analysis. Substituting (9.25) and (9.26) into the difference equations (9.7) and canceling the common factor of $e^{i\xi jh}$ gives the following relation for $g \equiv g(\xi)$:
$$g = 1 + \frac{k}{2h^2}\bigl(e^{-i\xi h} - 2 + e^{i\xi h}\bigr)(1 + g),$$
and hence
$$g = \frac{1 + \tfrac{1}{2}z}{1 - \tfrac{1}{2}z}, \tag{9.27}$$
where
$$z = \frac{k}{h^2}\bigl(e^{-i\xi h} - 2 + e^{i\xi h}\bigr) = \frac{2k}{h^2}\bigl(\cos(\xi h) - 1\bigr). \tag{9.28}$$
Since $z \le 0$ for all $\xi$, we see that $|g| \le 1$ and the method is stable for any choice of $k$ and $h$. Note that (9.27) agrees with the root found for the trapezoidal method in Example 7.6, while the $z$ determined in (9.28), for certain values of $\xi$, is simply $k$ times an eigenvalue $\lambda_p$ from (9.13), the eigenvalues of the MOL matrix (9.11). In general there is a close connection between the von Neumann approach and the MOL reduction of a periodic problem to a system of ODEs.

9.7 Multidimensional problems

In two space dimensions the heat equation takes the form
$$u_t = u_{xx} + u_{yy} \tag{9.29}$$
with initial conditions $u(x,y,0) = \eta(x,y)$ and boundary conditions all along the boundary of our spatial domain. We can discretize in space using a discrete Laplacian of the form considered in Chapter 3, say, the 5-point Laplacian from Section 3.2:
$$\nabla_h^2 U_{ij} = \frac{1}{h^2}\bigl(U_{i-1,j} + U_{i+1,j} + U_{i,j-1} + U_{i,j+1} - 4U_{ij}\bigr). \tag{9.30}$$
If we then discretize in time using the trapezoidal method, we will obtain the two-dimensional version of the Crank–Nicolson method,
$$U_{ij}^{n+1} = U_{ij}^n + \frac{k}{2}\Bigl[\nabla_h^2 U_{ij}^n + \nabla_h^2 U_{ij}^{n+1}\Bigr]. \tag{9.31}$$

Since this method is implicit, we must solve a system of equations for all the $U_{ij}$, where the matrix has the same nonzero structure as for the elliptic systems considered in Chapters 3 and 4. This matrix is large and sparse, and we generally do not want to solve the system by a direct method such as Gaussian elimination. This is even more true for the systems we are now considering than for the elliptic equation, because the slightly different nature of this system makes other approaches even more efficient relative to direct methods. It is also extremely important now that we use the most efficient method possible, because we must solve a linear system of this form in every time step, and we may need to take thousands of time steps to solve the time-dependent problem.

We can rewrite the equations (9.31) as
$$\Bigl(I - \frac{k}{2}\nabla_h^2\Bigr)U_{ij}^{n+1} = \Bigl(I + \frac{k}{2}\nabla_h^2\Bigr)U_{ij}^n. \tag{9.32}$$
The matrix for this linear system has the same pattern of nonzeros as the matrix for $\nabla_h^2$ (see Chapter 3), but the values are scaled by $k/2$ and then subtracted from the identity matrix, so that the diagonal elements are fundamentally different. If we call this matrix $A$,
$$A = I - \frac{k}{2}\nabla_h^2,$$
then we find that the eigenvalues of $A$ are
$$\lambda_{p,q} = 1 - \frac{k}{h^2}\bigl[(\cos(p\pi h) - 1) + (\cos(q\pi h) - 1)\bigr]$$
for $p, q = 1, 2, \ldots, m$, where we have used the expression for the eigenvalues of $\nabla_h^2$ from Section 3.4. The largest eigenvalue of the matrix $A$ thus has magnitude $O(k/h^2)$, while the ones closest to the origin are at $1 + O(k)$. As a result the condition number of $A$ is $O(k/h^2)$. By contrast, the discrete Laplacian $\nabla_h^2$ alone has condition number $O(1/h^2)$, as we found in Section 3.4. The smaller condition number in the present case can be expected to lead to faster convergence of iterative methods. Moreover, we have an excellent starting guess for the solution $U^{n+1}$ to (9.31), a fact that we can use to good advantage with iterative methods but not with direct methods.
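The condition number comparison can be illustrated directly. The sketch below is my own code, not the text's; it builds dense matrices via Kronecker products (practical only for small grids), forms the 5-point Laplacian and the Crank–Nicolson system matrix with $k = h$, and compares their 2-norm condition numbers:

```python
import numpy as np

def laplacian_5pt(m):
    # dense 5-point Laplacian on the m x m interior grid, via Kronecker products
    h = 1.0 / (m + 1)
    T = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1)) / h**2
    I = np.eye(m)
    return np.kron(I, T) + np.kron(T, I), h

L5, h = laplacian_5pt(15)
k = h                                        # time step comparable to h
A = np.eye(L5.shape[0]) - 0.5 * k * L5       # Crank-Nicolson system matrix
print(np.linalg.cond(A))                     # O(k/h^2): noticeably smaller than
print(np.linalg.cond(L5))                    # O(1/h^2) for the Laplacian itself
```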
Since $U_{ij}^{n+1} = U_{ij}^n + O(k)$, we can use $U_{ij}^{[0]} = U_{ij}^n$, the values from the previous time step, as initial values for an iterative method. We might do even better by extrapolating forward in time, using, say, $U_{ij}^{[0]} = 2U_{ij}^n - U_{ij}^{n-1}$, or by using an explicit method, say,
$$U_{ij}^{[0]} = (I + k\nabla_h^2)U_{ij}^n.$$
This explicit method (forward Euler) would probably be unstable as a time-marching procedure if we used only this with the value of $k$ we have in mind, but it can sometimes be used successfully as a way to generate initial data for an iterative procedure. Because of the combination of a reasonably well-conditioned system and a very good initial guess, we can often get away with taking only one or two iterations in each time step and still get global second order accuracy.

9.8 The locally one-dimensional method

Rather than solving the coupled sparse matrix equation for all the unknowns on the grid simultaneously as in (9.32), an alternative approach is to replace this fully coupled single time step with a sequence of steps, each of which is coupled in only one space direction, resulting in a set of tridiagonal systems which can be solved much more easily. One example is the locally one-dimensional (LOD) method:
$$U_{ij}^* = U_{ij}^n + \frac{k}{2}\bigl(D_x^2 U_{ij}^n + D_x^2 U_{ij}^*\bigr), \tag{9.33}$$
$$U_{ij}^{n+1} = U_{ij}^* + \frac{k}{2}\bigl(D_y^2 U_{ij}^* + D_y^2 U_{ij}^{n+1}\bigr), \tag{9.34}$$
or, in matrix form,
$$\Bigl(I - \frac{k}{2}D_x^2\Bigr)U^* = \Bigl(I + \frac{k}{2}D_x^2\Bigr)U^n, \tag{9.35}$$
$$\Bigl(I - \frac{k}{2}D_y^2\Bigr)U^{n+1} = \Bigl(I + \frac{k}{2}D_y^2\Bigr)U^*. \tag{9.36}$$
In (9.33) we apply Crank–Nicolson in the $x$-direction only, solving $u_t = u_{xx}$ alone over time $k$, and we call the result $U^*$. Then in (9.34) we take this result and apply Crank–Nicolson in the $y$-direction to it, solving $u_t = u_{yy}$ alone, again over time $k$. Physically this corresponds to modeling diffusion in the $x$- and $y$-directions over time $k$ as a decoupled process in which we first allow $u$ to diffuse only in the $x$-direction and then only in the $y$-direction. If the time steps are very short, then this might be expected to give similar physical behavior and hence convergence to the correct behavior as $k \to 0$. In fact, for the constant coefficient diffusion problem, it can even be shown that (in the absence of boundaries at least) this alternating diffusion approach gives exactly the same behavior as the original two-dimensional diffusion. Diffusing first in $x$ alone over time $k$ and then in $y$ alone over time $k$ gives the same result as if the diffusion occurs simultaneously in both directions. (This is because the differential operators $\partial_x^2$ and $\partial_y^2$ commute, as discussed further in Example 11.1.)

Numerically there is a great advantage in using (9.35) and (9.36) rather than the fully coupled (9.32). In (9.35) the unknowns $U_{ij}^*$ are coupled together only across each row of the grid.
For any fixed value of $j$ we have a tridiagonal system of equations to solve for $U_{ij}^*$ ($i = 1, 2, \ldots, m$). The system obtained for each value of $j$ is completely decoupled from the system obtained for other values of $j$. Hence we have a set of $m+2$ tridiagonal systems to solve (for $j = 0, 1, \ldots, m+1$), each of dimension $m$, rather than a single coupled system with $m^2$ unknowns as in (9.32). Since each of these systems is tridiagonal, it is easily solved in $O(m)$ operations by Gaussian elimination, and there is no need for iterative methods. (In the next section we will see why we need to solve these for $j = 0$ and $j = m+1$ as well as at the interior grid points.) Similarly, (9.34) decouples into a set of $m$ tridiagonal systems in the $y$-direction for $i = 1, 2, \ldots, m$. Hence taking a single time step requires solving $2m+2$ tridiagonal systems of size $m$, and thus $O(m^2)$ work. Since there are $m^2$ grid points, this is the optimal order and no worse than an explicit method, except for a constant factor.
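The row-by-row structure of (9.35) can be sketched as follows. This is illustrative code of my own, not the text's: it assumes homogeneous Dirichlet data, takes the array's first index as $i$, and uses dense solves standing in for an $O(m)$ tridiagonal (Thomas) solver:

```python
import numpy as np

def lod_x_sweep(U, k, h):
    """One x-direction LOD half-step, (9.35): for each grid column of fixed j,
    solve (I - (k/2)Dx^2) Ustar[:, j] = (I + (k/2)Dx^2) U[:, j]."""
    m = U.shape[0]
    r = k / (2.0 * h**2)
    D = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
         + np.diag(np.ones(m - 1), -1))        # h^2 times Dx^2
    Aimp = np.eye(m) - r * D                   # implicit (left-hand) operator
    Aexp = np.eye(m) + r * D                   # explicit (right-hand) operator
    Ustar = np.empty_like(U)
    for j in range(U.shape[1]):                # each j decouples: m small solves
        Ustar[:, j] = np.linalg.solve(Aimp, Aexp @ U[:, j])
    return Ustar

m = 10; h = 1.0 / (m + 1); k = h
x = np.linspace(h, 1.0 - h, m)
U = np.outer(np.sin(np.pi * x), np.sin(np.pi * x))   # smooth interior data
Ustar = lod_x_sweep(U, k, h)
print(np.abs(Ustar).max())       # amplitude drops: diffusion in x alone
```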

9.8.1 Boundary conditions

In solving the second set of systems (9.34), we need boundary values $U_{i0}^*$ and $U_{i0}^{n+1}$ along the bottom boundary and $U_{i,m+1}^*$ and $U_{i,m+1}^{n+1}$ along the top boundary, for terms that go on the right-hand side of each tridiagonal system. The values at level $n+1$ are available from the given boundary data for the heat equation, by evaluating the boundary conditions at time $t_{n+1}$ (assuming Dirichlet boundary conditions are given). To obtain the values $U_{i0}^*$ and $U_{i,m+1}^*$, we solve (9.33) for $j = 0$ and $j = m+1$ (along the boundaries) in addition to the systems along each row interior to the grid.

To solve the first set of systems (9.33), we need boundary values $U_{0j}^n$ and $U_{0j}^*$ along the left boundary and values $U_{m+1,j}^n$ and $U_{m+1,j}^*$ along the right boundary. The values at level $n$ come from the given boundary conditions, but we must determine the intermediate boundary conditions at level $*$ along these boundaries. It is not immediately clear what values should be used. One might be tempted to think of level $*$ as being halfway between $t_n$ and $t_{n+1}$, since $U^*$ is generated in the middle of the two-step procedure used to obtain $U^{n+1}$ from $U^n$. If this were valid, then evaluating the given boundary data at time $t_{n+1/2} = t_n + k/2$ might provide values for $U^*$ on the boundary. This is not a good idea, however, and would lead to a degradation of accuracy. The problem is that in the first step, (9.33) does not model the full heat equation over time $k/2$ but rather models part of the equation (diffusion in $x$ alone) over the full time step $k$. The values along the boundary will in general evolve quite differently in the two different cases. To determine proper values for $U_{0j}^*$ and $U_{m+1,j}^*$, we can use (9.34) along the left and right boundaries.
At $i = 0$, for example, equation (9.34) gives a system of equations along the left boundary that can be viewed as a tridiagonal linear system for the unknowns $U_{0j}^*$ in terms of the values $U_{0j}^{n+1}$, which are already known from the boundary conditions at time $t_{n+1}$. Note that we are solving this equation backward from the way it will be used in the second step of the LOD process on the interior of the grid, and this works only because we already know $U_{0j}^{n+1}$ from boundary data. Since we are solving this equation backward, we can view this as solving the diffusion equation $u_t = u_{yy}$ over a time step of length $k$, backward in time. This makes sense physically: the intermediate solution $U^*$ represents what is obtained from $U^n$ by doing diffusion in $x$ alone, with no diffusion yet in $y$. There are in principle two ways to get this, either by starting with $U^n$ and diffusing in $x$ or by starting with $U^{n+1}$ and undiffusing in $y$. We are using the latter approach along the boundaries to generate data for $U^*$. Equivalently we can view this as solving the backward heat equation $u_t = -u_{yy}$ over time $k$. This may be cause for concern, since the backward heat equation is ill posed (see Section E.3.4). However, since we are doing this only over one time step, starting with given values $U_{0j}^{n+1}$ in each time step, this turns out to be a stable procedure.

There is still a difficulty at the corners. To solve (9.34) for $U_{0j}^*$, $j = 1, 2, \ldots, m$, we need to know the values of $U_{00}^*$ and $U_{0,m+1}^*$ that are the boundary values for this system. These can be approximated using some sort of explicit and uncentered approximation to either $u_t = u_{xx}$ starting with $U^n$, or to $u_t = -u_{yy}$ starting with $U^{n+1}$. For example, we might use

U*_{00} = U^{n+1}_{00} - (k/h^2)(U^{n+1}_{00} - 2U^{n+1}_{01} + U^{n+1}_{02}),

which uses the approximation to u_yy centered at (x_0, y_1). Alternatively, rather than solving the tridiagonal systems obtained from (9.34) for U*_{0j}, we could simply use an explicit approximation to the backward heat equation along this boundary,

U*_{0j} = U^{n+1}_{0j} - (k/h^2)(U^{n+1}_{0,j-1} - 2U^{n+1}_{0j} + U^{n+1}_{0,j+1})   (9.37)

for j = 1, 2, ..., m. This eliminates the need for values of U* in the corners. Again, since this is not iterated but done only once starting with given (and presumably smooth) boundary data U^{n+1} in each time step, this yields a stable procedure.

With proper treatment of the boundary conditions, it can be shown that the LOD method is second order accurate (see Example 11.1). It can also be shown that this method, like full Crank-Nicolson, is unconditionally stable for any time step.

9.8.2 The alternating direction implicit method

A modification of the LOD method is also often used, in which the two steps each involve discretization in only one spatial direction at the advanced time level (giving decoupled tridiagonal systems again) but coupled with discretization in the opposite direction at the old time level. The classical method of this form is

U*_{ij} = U^n_{ij} + (k/2)(D_y^2 U^n_{ij} + D_x^2 U*_{ij}),   (9.38)
U^{n+1}_{ij} = U*_{ij} + (k/2)(D_x^2 U*_{ij} + D_y^2 U^{n+1}_{ij}).   (9.39)

This is called the alternating direction implicit (ADI) method and was first introduced by Douglas and Rachford [26]. This again gives decoupled tridiagonal systems to solve in each step:

(I - (k/2) D_x^2) U* = (I + (k/2) D_y^2) U^n,   (9.40)
(I - (k/2) D_y^2) U^{n+1} = (I + (k/2) D_x^2) U*.   (9.41)

With this method, each of the two steps involves diffusion in both the x- and the y-direction. In the first step the diffusion in x is modeled implicitly, while diffusion in y is modeled explicitly, with the roles reversed in the second step.
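As a concrete illustration, here is a minimal NumPy sketch of one ADI step in the form (9.40)-(9.41). The grid size, time step, and the use of dense solves (rather than tridiagonal ones) are illustrative choices, and homogeneous Dirichlet boundary data is assumed so no boundary terms appear on the right-hand sides:

```python
import numpy as np

m = 20                    # interior grid points in each direction (illustrative)
h = 1.0 / (m + 1)         # grid spacing
k = 0.01                  # time step

# standard second-difference operator with homogeneous Dirichlet boundaries
D2 = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
      + np.diag(np.ones(m - 1), -1)) / h**2
I = np.eye(m)

def adi_step(U, k):
    """One ADI step: (I - (k/2)Dx^2) U* = (I + (k/2)Dy^2) U^n, then
    (I - (k/2)Dy^2) U^{n+1} = (I + (k/2)Dx^2) U*.  Axis 0 is x, axis 1 is y."""
    rhs1 = U + 0.5 * k * (U @ D2)                       # explicit diffusion in y
    Ustar = np.linalg.solve(I - 0.5 * k * D2, rhs1)     # implicit solves along x
    rhs2 = Ustar + 0.5 * k * (D2 @ Ustar)               # explicit diffusion in x
    Unew = np.linalg.solve(I - 0.5 * k * D2, rhs2.T).T  # implicit solves along y
    return Unew

# one step applied to the Fourier mode sin(pi x) sin(pi y), which should
# decay by roughly exp(-2 pi^2 k) per step for u_t = u_xx + u_yy
xg = np.arange(1, m + 1) * h
U0 = np.outer(np.sin(np.pi * xg), np.sin(np.pi * xg))
U1 = adi_step(U0, k)
decay = np.max(np.abs(U1)) / np.max(np.abs(U0))
```

Each linear solve above is block-diagonal (one tridiagonal system per grid line), so a production code would use a banded solver rather than a dense np.linalg.solve.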
In this case each of the two steps can be shown to give a first order accurate approximation to the full heat equation over time k/2, so that U* represents a first order accurate approximation to the solution at time t_{n+1/2}. Because of the symmetry of the two steps, however, the local error introduced in the second step almost exactly cancels the local error introduced in the first step, so that the combined method is in fact second order accurate over the full time step.

Because U* does approximate the solution at time t_{n+1/2} in this case, it is possible to simply evaluate the given boundary conditions at time t_{n+1/2} to generate the necessary boundary values for U*. This will maintain second order accuracy. A better error constant

can be achieved by using slightly modified boundary data, which introduces into the boundary data the error expected in U* so that it is canceled out by the second step.

9.9 Other discretizations

For illustration purposes we have considered only the classic Crank-Nicolson method, consisting of the second order centered approximation to u_xx coupled with the trapezoidal method for time stepping. However, an infinite array of other combinations of spatial approximation and time stepping methods could be considered, some of which may be preferable. The following are a few possibilities:

- The second order accurate spatial difference operator could be replaced by a higher order method, such as the fourth order accurate approximations (in one dimension, or those of Section 3.5 in more dimensions).

- A spectral method could be used in the spatial dimension(s), as discussed in Section 2.21. Note that in this case the linear system that must be solved in each time step will be dense. On the other hand, for many problems it is possible to use a much coarser grid for spectral methods, leading to relatively small linear algebra problems.

- The time-stepping procedure could be replaced by a different implicit method suitable for stiff equations, of the sort discussed in Chapter 8. In particular, for some problems it is desirable to use an L-stable method. While the trapezoidal method is A-stable, it does not handle underresolved transients well (recall Figure 8.4). For some problems where diffusion is coupled with other processes, high-frequency oscillations or discontinuities are constantly introduced that should be smoothed by diffusion, and Crank-Nicolson can suffer from oscillations in time.

- The time stepping could be done by using a method such as the Runge-Kutta-Chebyshev method described in Section 8.6.
This is an explicit method that works for mildly stiff problems with real eigenvalues, such as the heat equation.

- The time stepping could be done using the exponential time differencing (ETD) methods described in Section 11.6. The heat equation with constant coefficients and time-varying boundary conditions leads to a MOL discretization of the form (9.1), where A is a constant matrix. If the centered difference approximation is used in one dimension, then (9.2) holds, but even with other discretizations, or in more dimensions, the semidiscrete system still has the form U'(t) = AU(t) + g(t). The exact solution can be written in terms of the matrix exponential e^{At}, and this form is used in the ETD methods. The manner in which this is computed depends on whether A is large and sparse (the typical case with a finite difference discretization) or small and dense (as it might be if a spectral discretization is used in space). See Section 11.6.1 for more discussion of this.
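To make the ETD idea concrete, here is a small sketch (my own illustration, not from the text) for the constant-coefficient, constant-forcing case U'(t) = AU(t) + g, whose exact one-step formula is U(t+k) = e^{Ak} U(t) + A^{-1}(e^{Ak} - I) g. The matrix A below is a small 1D second-difference operator, and the exponential is formed by eigendecomposition since A is symmetric; both are illustrative assumptions:

```python
import numpy as np

m = 8
h = 1.0 / (m + 1)
k = 0.01
# 1D second-difference matrix (symmetric, negative definite)
A = (np.diag(-2.0 * np.ones(m)) + np.diag(np.ones(m - 1), 1)
     + np.diag(np.ones(m - 1), -1)) / h**2
g = np.ones(m)                                  # constant forcing (e.g. boundary terms)
U0 = np.sin(np.pi * np.arange(1, m + 1) * h)

lam, Q = np.linalg.eigh(A)                      # A = Q diag(lam) Q^T
E = Q @ np.diag(np.exp(k * lam)) @ Q.T          # matrix exponential e^{Ak}
U_etd = E @ U0 + np.linalg.solve(A, (E - np.eye(m)) @ g)   # exact step

# reference solution: many tiny forward Euler steps on the same system
n = 20000
dt = k / n
U_ref = U0.copy()
for _ in range(n):
    U_ref = U_ref + dt * (A @ U_ref + g)
err = np.max(np.abs(U_etd - U_ref))
```

For a large sparse A one would instead approximate the action of e^{Ak} on a vector (e.g. by Krylov methods) rather than forming the exponential explicitly.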

Parabolic PDE Example Problem in 1D:

1. Model Equation
Consider the temporal evolution of the classical homogeneous heat equation (or diffusion equation) of the form

u_t = κ u_xx   (8.1)

with κ > 0. (Note that if κ < 0 then Eq. 8.1 would be a backward heat equation, which is an ill-posed problem. See Table 1 in Chapter 7 of our lecture note.)

2. Initial and boundary conditions
Along with this equation, let us impose an initial condition at t = 0,

u(x, 0) = f(x),   (8.2)

and also Dirichlet boundary conditions on a bounded domain x_a <= x <= x_b,

u(x_a, t) = g_a(t) and u(x_b, t) = g_b(t), for t > 0.   (8.3)

3. Discretization in space and time
Let us take a discretization with a spatial resolution of N and a temporal resolution of M:

x_i = x_a + (i - 1/2) Δx, i = 1, ..., N,   (8.4)
t_n = n Δt, n = 0, ..., M.   (8.5)

Notice that the cell interface-centered grid points are written using the half-integer indices:

x_{i+1/2} = x_i + Δx/2.   (8.6)

4. Difference Equation of order O(Δt, Δx^2)
In this example, we discretize the temporal and spatial derivatives with schemes that are first and second order accurate, respectively. For this we choose the forward difference scheme for the temporal discretization and the standard second-order central difference scheme for the spatial discretization:

u_t = (u(x, t + Δt) - u(x, t))/Δt + O(Δt),   (8.7)

and

u_xx(x, t) = (u(x + Δx, t) - 2u(x, t) + u(x - Δx, t))/Δx^2 + O(Δx^2),   (8.8)

which gives the final discrete form of our explicit finite difference scheme for the heat equation:

u_i^{n+1} = u_i^n + (κ Δt / Δx^2)(u_{i+1}^n - 2u_i^n + u_{i-1}^n).   (8.9)

5. Stability Limit of time-step Δt
As seen in Chapter 8 of our lecture note, we need to choose a Δt that satisfies von Neumann stability theory. The linear stability theory for our linear heat equation provides that, in order for the method in Eq. 8.9 to be stable, the time-step Δt needs to satisfy

κ Δt <= Δx^2 / 2.   (8.10)

6. Imposing Boundary Conditions via Guard-cells (or ghost-cells)
We can introduce so-called guard-cell or ghost-cell (simply GC) points, one extra GC point on each end:

x_0 = x_a - Δx/2,   (8.11)
x_{N+1} = x_b + Δx/2.   (8.12)

With these two extra GC points over the spatial domain, the difference equation is evolved only over the interior points, whereas the boundary conditions are explicitly imposed at the two GC points,

U_0^n = g_a(t_n), U_{N+1}^n = g_b(t_n).   (8.13)

7. Example Matlab Code
Let's numerically solve Eq. 8.1 using Matlab. The boundary condition is given so as to hold the temperature u at zero at x = 0 and at 100 F at x = 1 for t > 0 (i.e., g_a = 0 F and g_b = 100 F). We can solve three different temporal evolutions for three materials: (i) iron with κ = 0.230 cm^2/sec, (ii) aluminum with κ = 0.975 cm^2/sec, and (iii) copper with κ = 1.156 cm^2/sec. The initial condition in all three cases describes the same initial temperature profile

f(x) = 0 F for 0 <= x < 1, 100 F for x = 1.   (8.14)

(a) Use grid sizes of 16, 32, 64, 128, and 256 and compare your results. What can you say about the grid resolution study in the diffusion equation?
(b) What happens if your Δt fails to satisfy Eq. 8.10 for each κ?
(c) What are your values of Δt_max for the three different materials?
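For part (c), the stability bound (8.10) gives Δt_max = Δx^2/(2κ) directly. A short Python sketch (diffusivity values as quoted above; N = 16 is one of the suggested grid sizes):

```python
# stability limit (8.10): kappa * dt <= dx^2 / 2  =>  dt_max = dx^2 / (2 * kappa)
kappas = {"iron": 0.230, "aluminum": 0.975, "copper": 1.156}  # cm^2/sec, as in the notes
xa, xb, N = 0.0, 1.0, 16
dx = (xb - xa) / N
dt_max = {name: dx**2 / (2.0 * kap) for name, kap in kappas.items()}
# the most diffusive material (copper) forces the smallest stable time step
```

Halving Δx cuts Δt_max by a factor of four, which is the practical cost of the explicit scheme in the grid resolution study of part (a).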

8. Matlab Code

% -------------------------------------------
% AMS 213 - Spring Quarter, 2015
% MATLAB code for 1D heat diffusion
%    u_t = kappa*u_xx
% Written by Prof. Dongwook Lee
% AMSC, UCSC
% -------------------------------------------
clf; clear all;

% grid resolution
xa=0.; xb=1.; N=16;
dx=(xb-xa)/N;

% discrete domain (cell-centered points)
x=linspace(xa+0.5*dx, xb-0.5*dx, N);

% set convenient indices
ngc=1;          % number of guard cells

ibeg=ngc+1;     % first interior point
iend=ngc+N;     % last interior point
igc0=ngc;       % first guard cell
igcN=2*ngc+N;   % last guard cell

% fixed BC
g0=0.; g1=100.;
u(igc0)=g0; u(ibeg:iend)=0.; u(igcN)=g1;

% diffusion coefficient (copper)
kappa=1.156;

% CFL & dt
Ca=0.8;
dt=Ca*0.5*dx^2/kappa;

t=0.; tmax=0.45;

%hold on
plot(x,u(ibeg:iend),'r'); pause(1);

while t<tmax
    % solve heat diffusion for interior points
    for i=ibeg:iend
        unew(i)=u(i)+kappa*dt/dx^2*(u(i+1)-2.*u(i)+u(i-1));
    end

    % update t
    t=t+dt;

    % update BC
    unew(igc0)=g0; unew(igcN)=g1;

    % store your solution array
    u=unew;

    % plot
    plot(x,u(ibeg:iend))
    pause(0.1)
end

Chapter 9

Hyperbolic PDEs

We selectively follow Chapter 10 of the book Finite Difference Methods for Ordinary and Partial Differential Equations by Prof. Randy LeVeque, University of Washington.

Chapter 10. Advection Equations and Hyperbolic Systems

Hyperbolic partial differential equations (PDEs) arise in many physical problems, typically whenever wave motion is observed. Acoustic waves, electromagnetic waves, seismic waves, shock waves, and many other types of waves can be modeled by hyperbolic equations. Often these are modeled by linear hyperbolic equations (for the propagation of sufficiently small perturbations), but modeling large motions generally requires solving nonlinear hyperbolic equations. Hyperbolic equations also arise in advective transport, when a substance is carried along with a flow, giving rise to an advection equation. This is a scalar linear first order hyperbolic PDE, the simplest possible case. See Appendix E for more discussion of hyperbolic problems and a derivation of the advection equation in particular.

In this chapter we will primarily consider the advection equation. This is sufficient to illustrate many (although certainly not all) of the issues that arise in the numerical solution of hyperbolic equations. Section 10.10 contains a very brief introduction to hyperbolic systems, still in the linear case. A much more extensive discussion of hyperbolic problems and numerical methods, including nonlinear problems and multidimensional methods, can be found in [66]. Those interested in solving more challenging hyperbolic problems may also look at the CLAWPACK software [64], which was designed primarily for hyperbolic problems. There are also a number of other books devoted to nonlinear hyperbolic equations and their solution, e.g., [58], [88].

10.1 Advection

In this section we consider numerical methods for the scalar advection equation

u_t + a u_x = 0,   (10.1)

where a is a constant. See Section E.2 for a discussion of this equation. For the Cauchy problem we also need initial data

u(x, 0) = η(x).

This is the simplest example of a hyperbolic equation, and it is so simple that we can write down the exact solution,

u(x, t) = η(x - at).   (10.2)

One can verify directly that this is the solution (see also Appendix E). However, many of the issues that arise more generally in discretizing hyperbolic equations can be most easily seen with this equation.

The first approach we might consider is the analogue of the method (9.4) for the heat equation. Using the centered difference in space,

u_x(x, t) = (u(x + h, t) - u(x - h, t))/2h + O(h^2),   (10.3)

and the forward difference in time results in the numerical method

(U_j^{n+1} - U_j^n)/k = -(a/2h)(U_{j+1}^n - U_{j-1}^n),   (10.4)

which can be rewritten as

U_j^{n+1} = U_j^n - (ak/2h)(U_{j+1}^n - U_{j-1}^n).   (10.5)

This again has the stencil shown in Figure 9.1(a). In practice this method is not useful because of stability considerations, as we will see in the next section.

A minor modification gives a more useful method. If we replace U_j^n on the right-hand side of (10.5) by the average (1/2)(U_{j-1}^n + U_{j+1}^n), then we obtain the Lax-Friedrichs method,

U_j^{n+1} = (1/2)(U_{j-1}^n + U_{j+1}^n) - (ak/2h)(U_{j+1}^n - U_{j-1}^n).   (10.6)

Because of its low accuracy, this method is not commonly used in practice, but it serves to illustrate some stability issues, and so we will study this method along with (10.5) before describing higher order methods, such as the well-known Lax-Wendroff method. We will see in the next section that Lax-Friedrichs is Lax-Richtmyer stable (see Section 9.5) and convergent provided

|ak/h| <= 1.   (10.7)

Note that this stability restriction allows us to use a time step k = O(h) although the method is explicit, unlike the case of the heat equation. The basic reason is that the advection equation involves only the first order derivative u_x rather than u_xx, and so the difference equation involves 1/h rather than 1/h^2.
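The instability of (10.5) and the stability of (10.6) are easy to observe numerically. The following Python sketch (grid size, Courant number, and initial data are illustrative choices) advances a sine wave with both methods on a periodic grid:

```python
import numpy as np

m, a = 50, 1.0
h = 1.0 / m
k = 0.8 * h                  # Courant number a*k/h = 0.8, satisfying (10.7)
nu = a * k / h
x = np.arange(m) * h
u_ftcs = np.sin(2 * np.pi * x)
u_lf = u_ftcs.copy()

for _ in range(500):
    up, um = np.roll(u_ftcs, -1), np.roll(u_ftcs, 1)    # U_{j+1}, U_{j-1} (periodic)
    u_ftcs = u_ftcs - 0.5 * nu * (up - um)              # forward Euler + centered (10.5)
    up, um = np.roll(u_lf, -1), np.roll(u_lf, 1)
    u_lf = 0.5 * (um + up) - 0.5 * nu * (up - um)       # Lax-Friedrichs (10.6)

growth_ftcs = np.max(np.abs(u_ftcs))    # grows steadily: unstable
growth_lf = np.max(np.abs(u_lf))        # stays bounded (and in fact decays)
```

The centered/forward-Euler solution grows even though the time step satisfies (10.7), while the Lax-Friedrichs solution remains bounded, which is exactly the behavior the stability analysis of the next sections predicts.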
The time step restriction (10.7) is consistent with what we would choose anyway based on accuracy considerations, and in this sense the advection equation is not stiff, unlike the heat equation. This is a fundamental difference between hyperbolic and parabolic equations more generally, and it accounts for the fact that hyperbolic equations are typically solved with explicit methods, while the efficient solution of parabolic equations generally requires implicit methods.

To see that (10.7) gives a reasonable time step, note that

u_x(x, t) = η'(x - at), while u_t(x, t) = -a u_x(x, t) = -a η'(x - at).

The time derivative u_t is larger in magnitude than u_x by a factor of a, and so we would expect the time step required to achieve temporal resolution consistent with the spatial resolution h to be smaller by a factor of a. This suggests that the relation k ≈ h/a would be reasonable in practice. This is completely consistent with (10.7).

10.2 Method of lines discretization

To investigate stability further we will again introduce the method of lines (MOL) discretization as we did in Section 9.2 for the heat equation. To obtain a system of equations with finite dimension we must solve the equation on some bounded domain rather than solving the Cauchy problem. However, in a bounded domain, say, 0 <= x <= 1, the advection equation can have a boundary condition specified on only one of the two boundaries. If a > 0, then we need a boundary condition at x = 0, say,

u(0, t) = g_0(t),   (10.8)

which is the inflow boundary in this case. The boundary at x = 1 is the outflow boundary, and the solution there is completely determined by what is advecting to the right from the interior. If a < 0, we instead need a boundary condition at x = 1, which is the inflow boundary in this case.

The symmetric 3-point methods defined above can still be used near the inflow boundary but not at the outflow boundary. Instead the discretization will have to be coupled with some numerical boundary condition at the outflow boundary, say, a one-sided discretization of the equation. This issue complicates the stability analysis and will be discussed in Section 10.12.

For analysis purposes we can obtain a nice MOL discretization if we consider the special case of periodic boundary conditions,

u(0, t) = u(1, t) for t >= 0.

Physically, whatever flows out at the outflow boundary flows back in at the inflow boundary.
This also models the Cauchy problem in the case where the initial data is periodic with period 1, in which case the solution remains periodic and we need to model only a single period 0 <= x <= 1.

In this case the value U_0(t) = U_{m+1}(t) along the boundaries is another unknown, and we must introduce one of these into the vector U(t). If we introduce U_{m+1}(t), then we have the vector of grid values

U(t) = (U_1(t), U_2(t), ..., U_{m+1}(t))^T.

For 2 <= j <= m we have the ordinary differential equation (ODE)

U_j'(t) = -(a/2h)(U_{j+1}(t) - U_{j-1}(t)),

while the first and last equations are modified using the periodicity:

U_1'(t) = -(a/2h)(U_2(t) - U_{m+1}(t)),
U_{m+1}'(t) = -(a/2h)(U_1(t) - U_m(t)).

This system can be written as

U'(t) = A U(t)   (10.9)

with

A = -(a/2h) *
[  0   1                 -1
  -1   0   1
      -1   0   1
           .    .    .
              -1    0    1
   1               -1    0 ]  in R^{(m+1)x(m+1)}.   (10.10)

Note that this matrix is skew-symmetric (A^T = -A), and so its eigenvalues must be pure imaginary. In fact, the eigenvalues are

λ_p = -(ia/h) sin(2πph) for p = 1, 2, ..., m+1.   (10.11)

The corresponding eigenvector u^p has components

u_j^p = e^{2πipjh} for j = 1, 2, ..., m+1.   (10.12)

The eigenvalues lie on the imaginary axis between -ia/h and ia/h. For absolute stability of a time discretization we need the stability region S to include this interval. Any method that includes some interval iy, |y| < b, of the imaginary axis will lead to a stable method for the advection equation provided |ak/h| <= b. For example, looking again at the stability regions plotted in Figures 7.1 through 7.3 and Figure 8.5 shows that the midpoint method or certain Adams methods may be suitable for this problem, whereas the backward differentiation formula (BDF) methods are not.

10.2.1 Forward Euler time discretization

The method (10.5) can be viewed as the forward Euler time discretization of the MOL system of ODEs (10.9). We found in Section 7.3 that this method is stable only if |1 + kλ| <= 1, and the stability region S is the unit circle centered at -1. No matter how small the ratio k/h is, since the eigenvalues λ_p from (10.11) are imaginary, the values kλ_p will not lie in S. Hence the method (10.5) is unstable for any fixed mesh ratio k/h; see Figure 10.1(a).
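The skew-symmetry of (10.10) and the eigenvalue formula (10.11) can be checked directly; here is a small NumPy sketch (matrix size and advection speed are arbitrary choices):

```python
import numpy as np

n, a = 8, 1.0                          # n = m + 1 unknowns (illustrative size)
h = 1.0 / n
S = np.roll(np.eye(n), 1, axis=1)      # periodic shift: (S U)_j = U_{j+1}
A = -(a / (2 * h)) * (S - S.T)         # the matrix (10.10)

ev = np.linalg.eigvals(A)              # computed eigenvalues
p = np.arange(1, n + 1)
formula = -1j * (a / h) * np.sin(2 * np.pi * p * h)   # predicted by (10.11)
```

Comparing the sorted imaginary parts of `ev` with those of `formula` (the real parts vanish to rounding error) confirms that all eigenvalues sit on the imaginary axis between -ia/h and ia/h.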

The method (10.5) will be convergent if we let k → 0 faster than h, since then kλ_p → 0 for all p and the zero-stability of Euler's method is enough to guarantee convergence. Taking k much smaller than h is generally not desirable, and the method is not used in practice. However, it is interesting to analyze this situation also in terms of Lax-Richtmyer stability, since it gives an example where Lax-Richtmyer stability rests on the weaker bound of the form (9.22), ||B|| <= 1 + αk, rather than ||B|| <= 1. Here B = I + kA. Suppose we take k = h^2, for example. Then we have

|1 + kλ_p|^2 <= 1 + (ka/h)^2

for each p (using the fact that λ_p is pure imaginary), and so

|1 + kλ_p|^2 <= 1 + a^2 h^2 = 1 + a^2 k.

Hence ||I + kA||_2^2 <= 1 + a^2 k, and if nk <= T, we have

||(I + kA)^n||_2 <= (1 + a^2 k)^{n/2} <= e^{a^2 T / 2},

showing the uniform boundedness of ||B^n|| (in the 2-norm) needed for Lax-Richtmyer stability.

10.2.2 Leapfrog

A better time discretization is to use the midpoint method (5.23),

U^{n+1} = U^{n-1} + 2kAU^n,

which gives the leapfrog method for the advection equation,

U_j^{n+1} = U_j^{n-1} - (ak/h)(U_{j+1}^n - U_{j-1}^n).   (10.13)

This is a 3-level explicit method and is second order accurate in both space and time.

Recall from Section 7.3 that the stability region of the midpoint method is the interval iα for -1 < α < 1 of the imaginary axis. This method is hence stable on the advection equation provided |ak/h| < 1 is satisfied. On the other hand, note that the kλ_p will always be on the boundary of the stability region (the stability region for midpoint has no interior). This means the method is only marginally stable: there is no growth but also no decay of any eigenmode. The difference equation is said to be nondissipative. In some ways this is good: the true advection equation is also nondissipative, and any initial condition simply translates unchanged, no matter how oscillatory. Leapfrog captures this qualitative behavior well. However, there are problems with this.
All modes translate without decay, but they do not all propagate at the correct velocity, as will be explained in Example 10.12. As a result, initial data that contains high wave number components (e.g., data with steep gradients) will disperse, and the result can be a highly oscillatory numerical approximation. The marginal stability of leapfrog can also turn into instability if a method of this sort is applied to a more complicated problem with variable coefficients or nonlinearities.
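A quick experiment confirms the marginal, nondissipative behavior: with |ak/h| < 1 the leapfrog amplitude neither grows nor decays. In the sketch below (parameters are illustrative) the first time level is bootstrapped from the exact solution, which is one simple way to start a 3-level method:

```python
import numpy as np

m, a = 50, 1.0
h = 1.0 / m
k = 0.8 * h                              # ak/h = 0.8 < 1
x = np.arange(m) * h
u_old = np.sin(2 * np.pi * x)            # U^0
u = np.sin(2 * np.pi * (x - a * k))      # U^1 from the exact solution (10.2)

for _ in range(500):
    # leapfrog (10.13) on a periodic grid
    u_new = u_old - (a * k / h) * (np.roll(u, -1) - np.roll(u, 1))
    u_old, u = u, u_new

amp = np.max(np.abs(u))                  # stays near 1: no growth, no decay
```

Contrast this with the Lax-Friedrichs run above, whose amplitude decays noticeably over the same number of steps.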

10.2.3 Lax-Friedrichs

Again consider the Lax-Friedrichs method (10.6). Note that we can rewrite (10.6) using the fact that

(1/2)(U_{j-1}^n + U_{j+1}^n) = U_j^n + (1/2)(U_{j-1}^n - 2U_j^n + U_{j+1}^n)

to obtain

U_j^{n+1} = U_j^n - (ak/2h)(U_{j+1}^n - U_{j-1}^n) + (1/2)(U_{j-1}^n - 2U_j^n + U_{j+1}^n).   (10.14)

This can be rearranged to give

(U_j^{n+1} - U_j^n)/k + a (U_{j+1}^n - U_{j-1}^n)/(2h) = (h^2/2k) (U_{j-1}^n - 2U_j^n + U_{j+1}^n)/h^2.

If we compute the local truncation error from this form we see, as expected, that it is consistent with the advection equation u_t + a u_x = 0, since the term on the right-hand side vanishes as k, h → 0 (assuming k/h is fixed). However, it looks more like a discretization of the advection-diffusion equation

u_t + a u_x = ε u_xx,

where ε = h^2/2k. Later in this chapter we will study the diffusive nature of many methods for the advection equation.

For our present purposes, however, the crucial part is that we can now view (10.14) as resulting from a forward Euler discretization of the system of ODEs

U'(t) = A_ε U(t)

with

A_ε = -(a/2h) *
[  0   1                 -1
  -1   0   1
           .    .    .
   1               -1    0 ]
+ (ε/h^2) *
[ -2   1                  1
   1  -2   1
           .    .    .
   1                1    -2 ],   (10.15)

where ε = h^2/2k. The matrix A_ε differs from the matrix A of (10.10) by the addition of a small multiple of the second difference operator, which is symmetric rather than skew-symmetric. As a result the eigenvalues of A_ε are shifted off the imaginary axis and now lie

in the left half-plane. There is now some hope that each kμ will lie in the stability region of Euler's method if k is small enough relative to h.

It can be verified that the eigenvectors (10.12) of the matrix A are also eigenvectors of the second difference operator (with periodic boundary conditions) that appears in (10.15), and hence these are also the eigenvectors of the full matrix A_ε. We can easily compute that the eigenvalues of A_ε are

μ_p = -(ia/h) sin(2πph) - (2ε/h^2)(1 - cos(2πph)).   (10.16)

The values kμ_p are plotted in the complex plane for various different values of ε in Figure 10.1. They lie on an ellipse centered at -2εk/h^2 with semi-axes of length 2εk/h^2 in the x-direction and ak/h in the y-direction. For the special case ε = h^2/2k used in Lax-Friedrichs, we have 2εk/h^2 = 1, and this ellipse lies entirely inside the unit circle centered at -1, provided that |ak/h| <= 1. (If |ak/h| > 1, then the top and bottom of the ellipse would extend outside the circle.) The forward Euler method is stable as a time discretization, and hence the Lax-Friedrichs method is Lax-Richtmyer stable, provided |ak/h| <= 1.

10.3 The Lax-Wendroff method

One way to achieve second order accuracy on the advection equation is to use a second order temporal discretization of the system of ODEs (10.9), since this system is based on a second order spatial discretization. This can be done with the midpoint method, for example, which gives rise to the leapfrog scheme (10.13) already discussed. However, this is a three-level method, and for various reasons it is often much more convenient to use two-level methods for PDEs whenever possible: in more than one dimension the need to store several levels of data may be restrictive, boundary conditions can be harder to impose, and combining methods using fractional step procedures (as discussed in Chapter 11) may require two-level methods for each step, to name a few reasons.
Moreover, the leapfrog method is nondissipative, leading to potential stability problems if the method is extended to variable coefficient or nonlinear problems. Another way to achieve second order accuracy in time would be to use the trapezoidal method to discretize the system (10.9), as was done to derive the Crank-Nicolson method for the heat equation. But this is an implicit method, and for hyperbolic equations there is generally no need to introduce this complication and expense. Another possibility is to use a two-stage Runge-Kutta method such as the one in Example 5.11 for the time discretization. This can be done, although some care must be exercised near boundaries, and the use of a multistage method again typically requires additional storage.

One simple way to achieve a two-level explicit method with higher accuracy is to use the idea of Taylor series methods, as described in Section 5.6. Applying this directly to the linear system of ODEs U'(t) = AU(t) (and using U'' = AU' = A^2 U) gives the second order method

U^{n+1} = U^n + kAU^n + (1/2) k^2 A^2 U^n.

Figure 10.1. Eigenvalues of the matrix A_ε in (10.15), for various values of ε, in the case h = 1/50 and k = 0.8h, a = 1, so ak/h = 0.8. (a) shows the case ε = 0, which corresponds to the forward Euler method (10.5). (d) shows the case ε = a^2 k/2, the Lax-Wendroff method (10.18). (e) shows the case ε = h^2/2k, the Lax-Friedrichs method (10.6). The method is stable for ε between a^2 k/2 and h^2/2k, as in (d) through (e). [Plots omitted; panels (b), (c), and (f) show other values of ε.]

Here A is the matrix (10.10), and computing A^2 and writing the method at the typical grid point then gives

U_j^{n+1} = U_j^n - (ak/2h)(U_{j+1}^n - U_{j-1}^n) + (a^2 k^2 / 8h^2)(U_{j-2}^n - 2U_j^n + U_{j+2}^n).   (10.17)

This method is second order accurate and explicit but has a 5-point stencil involving the points U_{j-2}^n and U_{j+2}^n. With periodic boundary conditions this is not a problem, but with other boundary conditions this method needs more numerical boundary conditions than a 3-point method. This makes it less convenient to use and potentially more prone to numerical instability.

Note that the last term in (10.17) is an approximation to (1/2) a^2 k^2 u_xx using a centered difference based on step size 2h. A simple way to achieve a second order accurate 3-point method is to replace this term by the more standard 3-point formula. We then obtain the standard Lax-Wendroff method:

U_j^{n+1} = U_j^n - (ak/2h)(U_{j+1}^n - U_{j-1}^n) + (a^2 k^2 / 2h^2)(U_{j-1}^n - 2U_j^n + U_{j+1}^n).   (10.18)

A cleaner way to derive this method is to use Taylor series expansions directly on the PDE u_t + a u_x = 0, to obtain

u(x, t + k) = u(x, t) + k u_t(x, t) + (1/2) k^2 u_tt(x, t) + ... .

Replacing u_t by -a u_x and u_tt by a^2 u_xx gives

u(x, t + k) = u(x, t) - k a u_x(x, t) + (1/2) k^2 a^2 u_xx(x, t) + ... .

If we now use the standard centered approximations to u_x and u_xx and drop the higher order terms, we obtain the Lax-Wendroff method (10.18).

It is also clear how we could obtain higher order accurate explicit two-level methods by this same approach, by retaining more terms in the series and approximating the spatial derivatives (including the higher order spatial derivatives that will then arise) by suitably high order accurate finite difference approximations. The same approach can also be used with other PDEs.
The key is to replace the time derivatives arising in the Taylor series expansion with spatial derivatives, using expressions obtained by differentiating the original PDE.

10.3.1 Stability analysis

We can analyze the stability of Lax-Wendroff following the same approach used for Lax-Friedrichs in Section 10.2. Note that with periodic boundary conditions, the Lax-Wendroff method (10.18) can be viewed as Euler's method applied to the linear system of ODEs U'(t) = A_ε U(t), where A_ε is given by (10.15) with ε = a^2 k/2 (instead of the value ε = h^2/2k used in Lax-Friedrichs). The eigenvalues of A_ε are given by (10.16) with the appropriate value of ε, and multiplying by the time step k gives

kμ_p = -i (ak/h) sin(2πph) + (ak/h)^2 (cos(2πph) - 1).

These values all lie on an ellipse centered at -(ak/h)^2 with semi-axes of length (ak/h)^2 and |ak/h|. If |ak/h| <= 1, then all of these values lie inside the stability region of Euler's method. Figure 10.1(d) shows an example in the case ak/h = 0.8. The Lax-Wendroff method is stable with exactly the same time step restriction (10.7) as required for Lax-Friedrichs. In Section 10.7 we will see that this is a very natural stability condition to expect for the advection equation and is the best we could hope for when a 3-point method is used.

A close look at Figure 10.1 shows that the values kμ_p near the origin lie much closer to the boundary of the stability region for the Lax-Wendroff method (Figure 10.1(d)) than for the other methods illustrated in this figure. This is a reflection of the fact that Lax-Wendroff is second order accurate, while the others are only first order accurate. Note that a value kμ_p lying inside the stability region indicates that this eigenmode will be damped as the wave propagates, which is unphysical behavior since the true solution advects with no dissipation. For small values of p (low wave numbers, smooth components) the Lax-Wendroff method has relatively little damping, and the method is more accurate. Higher wave numbers are still damped with Lax-Wendroff (unless |ak/h| = 1, in which case all the kμ_p lie on the boundary of S), and resolving the behavior of these modes properly would require a finer grid.

Comparing Figures 10.1(c), (d), and (e) shows that Lax-Wendroff has the minimal amount of numerical damping needed to bring the values kμ_p within the stability region. Any less damping, as in Figure 10.1(c), would lead to instability, while more damping, as in Figure 10.1(e), gives excessive smearing of low wave numbers. Recall that the value of ε used in Lax-Wendroff was determined by doing a Taylor series expansion and requiring second order accuracy, so this makes sense.
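The ellipse claim is easy to verify from (10.16). The sketch below (parameters match the Figure 10.1 caption) scales the eigenvalues by k for both the Lax-Wendroff value ε = a^2 k/2 and the Lax-Friedrichs value ε = h^2/2k and measures their distance from -1, the center of Euler's stability region:

```python
import numpy as np

n, a = 50, 1.0
h = 1.0 / n
k = 0.8 * h                              # ak/h = 0.8
th = 2 * np.pi * np.arange(1, n + 1) * h

def scaled_eigs(eps):
    """k * mu_p from (10.16) for the matrix A_eps."""
    mu = -1j * (a / h) * np.sin(th) - (2 * eps / h**2) * (1 - np.cos(th))
    return k * mu

z_lw = scaled_eigs(0.5 * a**2 * k)       # Lax-Wendroff choice of eps
z_lf = scaled_eigs(h**2 / (2 * k))       # Lax-Friedrichs choice of eps
dist_lw = np.max(np.abs(z_lw + 1.0))     # max distance from -1
dist_lf = np.max(np.abs(z_lf + 1.0))
```

Both maxima stay at or below 1 when |ak/h| <= 1, so every kμ_p lies inside (or on) the unit circle about -1; for the Lax-Friedrichs value the real parts reach exactly -2, the left edge of Euler's stability region.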
10.4 Upwind methods

So far we have considered methods based on symmetric approximations to derivatives. Alternatively, one might use a nonsymmetric approximation to u_x in the advection equation, e.g.,

u_x(x_j, t) ≈ (U_j - U_{j-1})/h   (10.19)

or

u_x(x_j, t) ≈ (U_{j+1} - U_j)/h.   (10.20)

These are both one-sided approximations, since they use data only to one side or the other of the point x_j. Coupling one of these approximations with forward differencing in time gives the following methods for the advection equation:

U_j^{n+1} = U_j^n - (ak/h)(U_j^n - U_{j-1}^n)   (10.21)

or

U_j^{n+1} = U_j^n - (ak/h)(U_{j+1}^n - U_j^n).   (10.22)

These methods are first order accurate in both space and time. One might wonder why we would want to use such approximations, since centered approximations are more accurate.

For the advection equation, however, there is an asymmetry in the equations because the equation models translation at speed a. If a > 0, then the solution moves to the right, while if a < 0 it moves to the left. There are situations where it is best to acknowledge this asymmetry and use one-sided differences in the appropriate direction.

The choice between the two methods (10.21) and (10.22) should be dictated by the sign of a. Note that the true solution over one time step can be written as

u(x_j, t + k) = u(x_j - ak, t),

so that the solution at the point x_j at the next time level is given by data to the left of x_j if a > 0, whereas it is determined by data to the right of x_j if a < 0. This suggests that (10.21) might be a better choice for a > 0 and (10.22) for a < 0.

In fact the stability analysis below shows that (10.21) is stable only if

0 <= ak/h <= 1.   (10.23)

Since k and h are positive, we see that this method can be used only if a > 0. This method is called the upwind method when used on the advection equation with a > 0. If we view the equation as modeling the concentration of some tracer in air blowing past us at speed a, then we are looking in the correct upwind direction to judge how the concentration will change with time. (This is also referred to as an upstream differencing method in some literature.) Conversely, (10.22) is stable only if

-1 <= ak/h <= 0   (10.24)

and can be used only if a < 0. In this case (10.22) is the proper upwind method to use.

10.4.1 Stability analysis

The method (10.21) can be written as

U_j^{n+1} = U_j^n - (ak/2h)(U_{j+1}^n - U_{j-1}^n) + (ak/2h)(U_{j+1}^n - 2U_j^n + U_{j-1}^n),   (10.25)

which puts it in the form (10.15) with ε = ah/2. We have seen previously that methods of this form are stable provided |ak/h| <= 1 and also -2 < -2εk/h^2 < 0. Since k, h > 0, this requires in particular that ε > 0. For Lax-Friedrichs and Lax-Wendroff, this condition was always satisfied, but for upwind the value of ε depends on a, and we see that ε > 0 only if a > 0.
If a < 0, then the eigenvalues of the MOL matrix lie on a circle that lies entirely in the right half-plane, and the method will certainly be unstable. If a > 0, then the above requirements lead to the stability restriction (10.23). If we think of (10.25) as modeling an advection-diffusion equation, then we see that a < 0 corresponds to a negative diffusion coefficient. This leads to an ill-posed equation, as in the backward heat equation (see Section E.3.4).

The method (10.22) can also be written in a form similar to (10.25), but the last term will have a minus sign in front of it. In this case we need a < 0 for any hope of stability, and we then easily derive the stability restriction (10.24).
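The algebraic identity between (10.21) and the centered-plus-diffusion form (10.25) is easy to confirm numerically; the following sketch (sample data and parameters chosen arbitrarily, periodic indexing assumed) checks it:

```python
import numpy as np

# Check that the upwind step (10.21) is identical to the centered-plus-
# diffusion form (10.25) with epsilon = a h / 2.
rng = np.random.default_rng(0)
U = rng.standard_normal(50)        # arbitrary sample data
a, h, k = 1.0, 0.1, 0.08

Up = np.roll(U, -1)                # U_{j+1}, periodic indexing
Um = np.roll(U, 1)                 # U_{j-1}

upwind   = U - (a * k / h) * (U - Um)                      # (10.21)
centered = (U - (a * k / (2 * h)) * (Up - Um)
              + (a * k / (2 * h)) * (Up - 2 * U + Um))     # (10.25)

assert np.allclose(upwind, centered)
```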

Chapter 10. Advection Equations and Hyperbolic Systems

The three methods, Lax-Wendroff, upwind, and Lax-Friedrichs, can all be written in the same form (10.15) with different values of ε. If we call these values ε_LW, ε_up, and ε_LF, respectively, then we have

    ε_LW = a²k/2 = (ah/2)ν,   ε_up = ah/2,   ε_LF = h²/2k = (ah/2)/ν,

where ν = ak/h. Note that ε_LW = ν ε_up and ε_up = ν ε_LF. If 0 < ν < 1, then ε_LW < ε_up < ε_LF, and the method is stable for any value of ε between ε_LW and ε_LF, as suggested by Figure 10.1.

10.4.2 The Beam-Warming method

The upwind method is only first order accurate. A second order accurate method with the same one-sided character can be derived by following the derivation of the Lax-Wendroff method, but using one-sided approximations to the spatial derivatives. This results in the Beam-Warming method, which for a > 0 takes the form

    U_j^{n+1} = U_j^n − (ak/2h)(3U_j^n − 4U_{j−1}^n + U_{j−2}^n) + (a²k²/2h²)(U_j^n − 2U_{j−1}^n + U_{j−2}^n).   (10.26)

For a < 0 the Beam-Warming method is one-sided in the other direction:

    U_j^{n+1} = U_j^n − (ak/2h)(−3U_j^n + 4U_{j+1}^n − U_{j+2}^n) + (a²k²/2h²)(U_j^n − 2U_{j+1}^n + U_{j+2}^n).   (10.27)

These methods are stable for 0 ≤ ν ≤ 2 and −2 ≤ ν ≤ 0, respectively.

10.5 Von Neumann analysis

We have analyzed the stability of various algorithms for the advection equation by viewing them as ODE methods applied to the MOL system (10.9). The same stability criteria can be obtained by using von Neumann analysis as described in Section 9.6. Recall that this is done by replacing U_j^n by g(ξ)^n e^{iξjh} (where i = √−1 in this section). Canceling out common factors results in an expression for the amplification factor g(ξ), and requiring that this be bounded by 1 in magnitude gives the stability bounds for the method. Also recall from Section 9.6 that this can be expected to give the same result as our MOL analysis because of the close relation between the e^{iξjh} factor and the eigenvectors of the matrix A.
In a sense von Neumann analysis simply combines the computation of the eigenvalues of A together with the absolute stability analysis of the time-stepping method being used. Nonetheless we will go through this analysis explicitly for several of the methods already considered to show how it works, since for other methods it may be more convenient to work with this approach than to interpret the method as an MOL method. For the von Neumann analysis in this section we will simplify notation slightly by setting ν = ak/h, the Courant number.

Example 10.1. Following the procedure of Example 9.6 for the upwind method (10.21) gives

    g(ξ) = 1 − ν(1 − e^{−iξh}) = (1 − ν) + νe^{−iξh}.   (10.28)

As the wave number ξ varies, g(ξ) moves around a circle of radius ν centered at 1 − ν. These values stay within the unit circle if and only if 0 ≤ ν ≤ 1, the stability limit that was also found in Section 10.4.1.

Example 10.2. Going through the same procedure for Lax-Friedrichs (10.16) gives

    g(ξ) = (1/2)(e^{iξh} + e^{−iξh}) − (ν/2)(e^{iξh} − e^{−iξh})   (10.29)
         = cos(ξh) − iν sin(ξh),

and so

    |g(ξ)|² = cos²(ξh) + ν² sin²(ξh),   (10.30)

which is bounded by 1 for all ξ only if |ν| ≤ 1.

Example 10.3. For the Lax-Wendroff method (10.18) we obtain

    g(ξ) = 1 − (ν/2)(e^{iξh} − e^{−iξh}) + (ν²/2)(e^{iξh} − 2 + e^{−iξh})
         = 1 − iν sin(ξh) + ν²(cos(ξh) − 1)
         = 1 − iν[2 sin(ξh/2) cos(ξh/2)] − ν²[2 sin²(ξh/2)],   (10.31)

where we have used two trigonometric identities to obtain the last line. This complex number has modulus

    |g(ξ)|² = [1 − 2ν² sin²(ξh/2)]² + 4ν² sin²(ξh/2) cos²(ξh/2)
            = 1 − 4ν²(1 − ν²) sin⁴(ξh/2).   (10.32)

Since 0 ≤ sin⁴(ξh/2) ≤ 1 for all values of ξ, we see that |g(ξ)|² ≤ 1 for all ξ, and hence the method is stable, provided that |ν| ≤ 1, which again gives the expected stability bound (10.17).

Example 10.4. The leapfrog method (10.13) involves three time levels but can still be handled by the same basic approach. If we set U_j^n = g(ξ)^n e^{iξjh} in the leapfrog method we obtain

    g(ξ)^{n+1} e^{iξjh} = g(ξ)^{n−1} e^{iξjh} − ν g(ξ)^n (e^{iξ(j+1)h} − e^{iξ(j−1)h}).   (10.33)

If we now divide by g(ξ)^{n−1} e^{iξjh} we obtain a quadratic equation for g(ξ),

    g(ξ)² = 1 − 2iν sin(ξh) g(ξ).   (10.34)

Examining this in the same manner as the analysis of the stability region for the midpoint method in Example 7.7 yields the stability limit |ν| < 1.
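A quick numerical check of the amplification factors in Examples 10.1-10.3 (a sketch only; the Courant numbers 0.8 and 1.2 are arbitrary choices on either side of the stability limit):

```python
import numpy as np

xi_h = np.linspace(-np.pi, np.pi, 1001)      # resolvable wave numbers xi*h

def g_upwind(nu):                            # (10.28)
    return (1 - nu) + nu * np.exp(-1j * xi_h)

def g_laxfriedrichs(nu):                     # (10.29)
    return np.cos(xi_h) - 1j * nu * np.sin(xi_h)

def g_laxwendroff(nu):                       # from (10.31)
    return 1 - 1j * nu * np.sin(xi_h) + nu**2 * (np.cos(xi_h) - 1)

for g in (g_upwind, g_laxfriedrichs, g_laxwendroff):
    assert np.max(np.abs(g(0.8))) <= 1 + 1e-12   # stable at nu = 0.8
    assert np.max(np.abs(g(1.2))) > 1            # unstable for |nu| > 1
```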

It is important to note the severe limitations of the von Neumann approach just presented. It is strictly applicable only in the constant coefficient linear case (with periodic boundary conditions or on the Cauchy problem). Applying von Neumann analysis to the frozen coefficient problem locally often gives good guidance to the stability properties of a method more generally, but it cannot always be relied on. A great deal of work has been done on proving that methods stable for frozen coefficient problems remain stable for variable coefficient or nonlinear problems when everything is sufficiently smooth; see, for example, [40], [75]. In the nonlinear case, where the solution can contain shocks, a nonlinear stability theory is needed that employs techniques very different from von Neumann analysis; see, e.g., [66].

10.6 Characteristic tracing and interpolation

The solution to the advection equation is given by (10.2). The value of u is constant along each characteristic, which for this example is a straight line with constant slope. Over a single time step we have

    u(x_j, t_{n+1}) = u(x_j − ak, t_n).   (10.35)

Tracing this characteristic back over time step k from the grid point x_j results in the picture shown in Figure 10.2(a). Note that if 0 < ak/h < 1, then the point x_j − ak lies between x_{j−1} and x_j. If we carefully choose k and h so that ak/h = 1 exactly, then x_j − ak = x_{j−1} and we would find that u(x_j, t_{n+1}) = u(x_{j−1}, t_n). The solution would just shift one grid cell to the right in each time step. We could compute the exact solution numerically with the method

    U_j^{n+1} = U_{j−1}^n.   (10.36)

Actually, all the two-level methods that we have considered so far reduce to the formula (10.36) in this special case ak = h, and each of these methods happens to be exact in this case.

Figure 10.2. Tracing the characteristic of the advection equation back in time from the point (x_j, t_{n+1}) to compute the solution according to (10.35). Interpolating the value at this point from neighboring grid values gives the upwind method (for linear interpolation) or the Lax-Wendroff or Beam-Warming methods (quadratic interpolation). (a) shows the case a > 0, (b) shows the case a < 0.

If ak/h < 1, then the point x_j − ak is not exactly at a grid point, as illustrated in Figure 10.2. However, we might attempt to use the relation (10.35) as the basis for a numerical method by computing an approximation to u(x_j − ak, t_n) based on interpolation from the grid values U_i^n at nearby grid points.

For example, we might perform simple linear interpolation between U_{j−1}^n and U_j^n. Fitting a linear function to these points gives the function

    p(x) = U_j^n + (x − x_j)(U_j^n − U_{j−1}^n)/h.   (10.37)

Evaluating this at x_j − ak and using this to define U_j^{n+1} gives

    U_j^{n+1} = p(x_j − ak) = U_j^n − (ak/h)(U_j^n − U_{j−1}^n).

This is precisely the first order upwind method (10.21). Note that this can also be interpreted as a linear combination of the two values U_j^n and U_{j−1}^n:

    U_j^{n+1} = (1 − ak/h) U_j^n + (ak/h) U_{j−1}^n.   (10.38)

Moreover, this is a convex combination (i.e., the coefficients of U_j^n and U_{j−1}^n are both nonnegative and sum to 1) provided the stability condition (10.23) is satisfied, which is also the condition required to ensure that x_j − ak lies between the two points x_{j−1} and x_j. In this case we are interpolating between these points with the function p(x). If the stability condition is violated, then we would be using p(x) to extrapolate outside of the interval where the data lie. It is easy to see that this sort of extrapolation can lead to instability; consider what happens if the data U^n are oscillatory with U_j^n = (−1)^j, for example.

To obtain better accuracy, we might try using a higher order interpolating polynomial based on more data points. If we define a quadratic polynomial p(x) by interpolating the values U_{j−1}^n, U_j^n, and U_{j+1}^n, and then define U_j^{n+1} by evaluating p(x_j − ak), we simply obtain the Lax-Wendroff method (10.18). Note that in this case we are properly interpolating provided that the stability restriction |ak/h| ≤ 1 is satisfied.
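The interpolation viewpoint can be verified directly: evaluating the linear interpolant (10.37) at x_j − ak reproduces the convex combination (10.38). A small sketch, with sample data chosen arbitrarily and periodic indexing assumed:

```python
import numpy as np

a, h, k = 1.0, 0.1, 0.07           # 0 <= ak/h = 0.7 <= 1, so we interpolate
nu = a * k / h

U = np.sin(2 * np.pi * np.linspace(0, 1, 30, endpoint=False))
Um = np.roll(U, 1)                 # U_{j-1}

# p(x_j - a k) with p the linear fit (10.37) through (x_{j-1}, U_{j-1}), (x_j, U_j):
interp = U + (-a * k) * (U - Um) / h

upwind = (1 - nu) * U + nu * Um    # the convex combination (10.38)
assert np.allclose(interp, upwind)
```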
If we instead base our quadratic interpolation on the three points U_{j−2}^n, U_{j−1}^n, and U_j^n, then we obtain the Beam-Warming method (10.26), and we are properly interpolating provided 0 ≤ ak/h ≤ 2.

10.7 The Courant-Friedrichs-Lewy condition

The discussion of Section 10.6 suggests that for the advection equation, the point x_j − ak must be bracketed by points used in the stencil of the finite difference method if the method is to be stable and convergent. This turns out to be a necessary condition in general for any method developed for the advection equation: if U_j^{n+1} is computed based on values U_{j+p}^n, U_{j+p+1}^n, ..., U_{j+q}^n with p ≤ q (negative values are allowed for p and q), then we must have x_{j+p} ≤ x_j − ak ≤ x_{j+q} or the method cannot be convergent. Since x_i = ih, this requires

    −q ≤ ak/h ≤ −p.

This result for the advection equation is one special case of a much more general principle that is called the CFL condition. This condition is named after Courant, Friedrichs, and Lewy, who wrote a fundamental paper in 1928 that was the first paper on the stability and convergence of finite difference methods for PDEs. (The original paper [7] is in German, but an English translation is available in [8].) The value ν = ak/h is often called the Courant number.

To understand this general condition, we must discuss the domain of dependence of a time-dependent PDE. (See, e.g., [55], [66] for more details.) For the advection equation, the solution u(X, T) at some fixed point (X, T) depends on the initial data at only a single point: u(X, T) = η(X − aT). We say that the domain of dependence of the point (X, T) is the point X − aT:

    D(X, T) = {X − aT}.

If we modify the data at this point, then the solution u(X, T) will change, while modifying the data at any other point will have no effect on the solution at this point.

This is a rather unusual situation for a PDE. More generally we might expect the solution at (X, T) to depend on the data at several points or over a whole interval. In Section 10.10 we consider hyperbolic systems of equations of the form u_t + Au_x = 0, where u ∈ R^s and A ∈ R^{s×s} is a matrix with real eigenvalues λ_1, λ_2, ..., λ_s. If these values are distinct, then we will see that the solution u(X, T) depends on the data at the s distinct points X − λ_1 T, ..., X − λ_s T, and hence

    D(X, T) = {X − λ_p T for p = 1, 2, ..., s}.   (10.39)

The heat equation u_t = u_xx has a much larger domain of dependence. For this equation the solution at any point (X, T) depends on the data everywhere, and the domain of dependence is the whole real line,

    D(X, T) = (−∞, ∞).

This equation is said to have infinite propagation speed, since data at any point affects the solution everywhere at any small time in the future (although its effect of course decays exponentially away from this point, as seen from the Green's function (E.37)).

A finite difference method also has a domain of dependence. On a particular fixed grid we define the domain of dependence of a grid point (x_j, t_n) to be the set of grid points x_i at the initial time t = 0 with the property that the data U_i^0 at x_i has an effect on the solution U_j^n. For example, with the Lax-Wendroff method (10.18) or any other 3-point method, the value U_j^n depends on U_{j−1}^{n−1}, U_j^{n−1}, and U_{j+1}^{n−1}. These values depend in turn on U_{j−2}^{n−2} through U_{j+2}^{n−2}. Tracing back to the initial time we obtain a triangular array of grid points as seen in Figure 10.3(a), and we see that U_j^n depends on the initial data at the points x_{j−n}, ..., x_{j+n}.

Now consider what happens if we refine the grid, keeping k/h fixed. Figure 10.3(b) shows the situation when k and h are reduced by a factor of 2, focusing on the same value of (X, T), which now corresponds to U_{2j}^{2n} on the finer grid. This value depends on twice as many values of the initial data, but these values all lie within the same interval and are merely twice as dense.

Figure 10.3. (a) Numerical domain of dependence of a grid point when using a 3-point explicit method. (b) On a finer grid.

If the grid is refined further with k/h ≡ r fixed, then clearly the numerical domain of dependence of the point (X, T) will fill in the interval [X − T/r, X + T/r].

As we refine the grid, we hope that our computed solution at (X, T) will converge to the true solution u(X, T) = η(X − aT). Clearly this can be possible only if

    X − T/r ≤ X − aT ≤ X + T/r.   (10.40)

Otherwise, the true solution will depend only on a value η(X − aT) that is never seen by the numerical method, no matter how fine a grid we take. We could change the data at this point, and hence change the true solution, without having any effect on the numerical solution, so the method cannot be convergent for general initial data.

Note that the condition (10.40) translates into |a| ≤ 1/r and hence |ak/h| ≤ 1. This can also be written as |ak| ≤ h, which just says that over a single time step the characteristic we trace back must lie within one grid point of x_j. (Recall the discussion of interpolation versus extrapolation in Section 10.6.) The CFL condition generalizes this idea:

The CFL condition: A numerical method can be convergent only if its numerical domain of dependence contains the true domain of dependence of the PDE, at least in the limit as k and h go to zero.

For the Lax-Friedrichs, leapfrog, and Lax-Wendroff methods the condition on k and h required by the CFL condition is exactly the stability restriction we derived earlier in this chapter. But it is important to note that in general the CFL condition is only a necessary condition. If it is violated, then the method cannot be convergent. If it is satisfied, then the method might be convergent, but a proper stability analysis is required to prove this or to determine the proper stability restriction on k and h.
(And of course consistency is also required for convergence; stability alone is not enough.)

Example 10.5. The 3-point method (10.5) has the same stencil and numerical domain of dependence as Lax-Wendroff but is unstable for any fixed value of k/h, even though the CFL condition is satisfied for |ak/h| ≤ 1.

Example 10.6. The upwind methods (10.21) and (10.22) each have a 2-point stencil, and the stability restrictions of these methods, (10.23) and (10.24), respectively, agree precisely with what the CFL condition requires.

Example 10.7. The Beam-Warming method (10.26) has a 3-point one-sided stencil. The CFL condition is satisfied if 0 ≤ ak/h ≤ 2. When a < 0 the method (10.27) is used, and the CFL condition requires −2 ≤ ak/h ≤ 0. These are also the stability regions for the methods, which must be verified by appropriate stability analysis.

Example 10.8. For the heat equation the true domain of dependence is the whole real line. It appears that any 3-point explicit method violates the CFL condition, and indeed it does if we fix k/h as the grid is refined. However, recall from Section 10.2.1 that the 3-point explicit method (9.5) is convergent as we refine the grid, provided we have k/h² ≤ 1/2. In this case when we make the grid finer by a factor of 2 in space it will become finer by a factor of 4 in time, and hence the numerical domain of dependence will cover a wider interval at time t = 0. As k → 0 the numerical domain of dependence will spread to cover the entire real line, and hence the CFL condition is satisfied in this case.

An implicit method such as the Crank-Nicolson method (9.7) satisfies the CFL condition for any time step k. In this case the numerical domain of dependence is the entire real line, because the tridiagonal linear system couples together all points in such a manner that the solution at each point depends on the data at all points (i.e., the inverse of a tridiagonal matrix is dense).

10.8 Some numerical results

Figure 10.4 shows typical numerical results obtained with three of the methods discussed in the previous sections. The initial data at time t = 0, shown in Figure 10.4(a), are smooth and consist of two Gaussian peaks, one sharper than the other:

    u(x, 0) = η(x) = exp(−20(x − 2)²) + exp(−(x − 5)²).   (10.41)

The remaining frames in this figure show the results obtained when solving the advection equation u_t + u_x = 0 up to time t = 17, so the exact solution is simply the initial data shifted by 17 units.
Note that only part of the computational domain is shown; the computation was done on the interval 0 ≤ x ≤ 25. The grid spacing h = 0.05 was used, with time step k = 0.8h, so the Courant number is ak/h = 0.8. On this grid one peak is fairly well resolved and the other is poorly resolved.

Figure 10.4(b) shows the result obtained with the upwind method (10.21) and illustrates the extreme numerical dissipation of this method. Figure 10.4(c) shows the result obtained with Lax-Wendroff. The broader peak remains well resolved, while the dispersive nature of Lax-Wendroff is apparent near the sharper peak. Dispersion is even more apparent when the leapfrog method is used, as seen in Figure 10.4(d). The modified equation analysis of the next section sheds more light on these results.

10.9 Modified equations

Our standard tool for estimating the accuracy of a finite difference method has been the local truncation error. Seeing how well the true solution of the PDE satisfies the difference equation gives an indication of the accuracy of the difference equation. Now we will study a slightly different approach that can be very illuminating, since it reveals much more about the structure and behavior of the numerical solution.
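A smaller version of this experiment can be run in a few lines. The sketch below departs from the text's setup in two labeled ways: periodic boundaries are used, and the final time is T = 5 rather than the time used in the figure; it nevertheless exhibits the same dissipation/dispersion contrast between upwind and Lax-Wendroff:

```python
import numpy as np

def step_upwind(U, nu):            # (10.21)
    return U - nu * (U - np.roll(U, 1))

def step_laxwendroff(U, nu):       # (10.18)
    Up, Um = np.roll(U, -1), np.roll(U, 1)
    return U - 0.5 * nu * (Up - Um) + 0.5 * nu**2 * (Up - 2 * U + Um)

L, m = 25.0, 1000
h = L / m
x = np.arange(m) * h
a, nu = 1.0, 0.8                   # Courant number ak/h = 0.8, as in the text
k = nu * h / a
T = 5.0                            # shorter final time (a departure from the text)
steps = int(round(T / k))

eta = lambda s: np.exp(-20 * (s - 2) ** 2) + np.exp(-(s - 5) ** 2)  # (10.41)
U_up, U_lw = eta(x), eta(x)
for _ in range(steps):
    U_up = step_upwind(U_up, nu)
    U_lw = step_laxwendroff(U_lw, nu)

exact = eta((x - a * T) % L)       # exact solution on the periodic domain
err_up = np.max(np.abs(U_up - exact))
err_lw = np.max(np.abs(U_lw - exact))
assert err_lw < err_up             # Lax-Wendroff is more accurate overall
```

The upwind error is dominated by the smearing of the sharp peak, while the Lax-Wendroff error shows up as trailing oscillations near it.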

Figure 10.4. The numerical experiments on the advection equation described in Section 10.8: (a) initial data, (b) upwind, (c) Lax-Wendroff, (d) leapfrog.

The idea is to ask the following question: is there a PDE v_t = ··· such that our numerical approximation U_j^n is actually the exact solution to this PDE, U_j^n = v(x_j, t_n)? Or, less ambitiously, can we at least find a PDE that is better satisfied by U_j^n than the original PDE we were attempting to model? If so, then studying the behavior of solutions to this PDE should tell us much about how the numerical approximation is behaving. This can be advantageous because it is often easier to study the behavior of PDEs than of finite difference formulas. In fact it is possible to find a PDE that is exactly satisfied by the U_j^n by doing Taylor series expansions as we do to compute the local truncation error. However, this PDE will have an infinite number of terms involving higher and higher powers of k and h. By truncating this series at some point we will obtain a PDE that is simple enough to study and yet gives a good indication of the behavior of the U_j^n.

The procedure of determining a modified equation is best illustrated with an example. See [100] for a more detailed discussion of the derivation of modified equations.

Example 10.9. Consider the upwind method (10.21) for the advection equation u_t + au_x = 0 in the case a > 0,

    U_j^{n+1} = U_j^n − (ak/h)(U_j^n − U_{j−1}^n).   (10.42)

The process of deriving the modified equation is very similar to computing the local truncation error, only now we insert the formula v(x, t) into the difference equation. This is supposed to be a function that agrees exactly with U_j^n at the grid points and so, unlike u(x, t), the function v(x, t) satisfies (10.42) exactly:

    v(x, t + k) = v(x, t) − (ak/h)(v(x, t) − v(x − h, t)).

Expanding these terms in Taylor series about (x, t) and simplifying gives

    v_t + (1/2)k v_tt + (1/6)k² v_ttt + ··· + a(v_x − (1/2)h v_xx + (1/6)h² v_xxx − ···) = 0.

We can rewrite this as

    v_t + a v_x = (1/2)(ah v_xx − k v_tt) − (1/6)(ah² v_xxx + k² v_ttt) + ···.

This is the PDE that v satisfies. If we take k/h fixed, then the terms on the right-hand side are O(k), O(k²), etc., so that for small k we can truncate this series to get a PDE that is quite well satisfied by the U_j^n.

If we drop all the terms on the right-hand side, we just recover the original advection equation. Since we have then dropped terms of O(k), we expect that U_j^n satisfies this equation to O(k), as we know to be true since this upwind method is first order accurate. If we keep the O(k) terms, then we get something more interesting:

    v_t + a v_x = (1/2)(ah v_xx − k v_tt).   (10.43)

This involves second derivatives in both x and t, but we can derive a slightly different modified equation with the same accuracy by differentiating (10.43) with respect to t to obtain

    v_tt = −a v_xt + (1/2)(ah v_xxt − k v_ttt)

and with respect to x to obtain

    v_tx = −a v_xx + (1/2)(ah v_xxx − k v_ttx).

Combining these gives

    v_tt = a² v_xx + O(k).

Inserting this in (10.43) gives

    v_t + a v_x = (1/2)(ah v_xx − a²k v_xx) + O(k²).

Since we have already decided to drop terms of O(k²), we can drop these terms here also to obtain

    v_t + a v_x = (1/2)ah (1 − ak/h) v_xx.   (10.44)

This is now a familiar advection-diffusion equation. The grid values U_j^n can be viewed as giving a second order accurate approximation to the true solution of this equation (whereas they give only first order accurate approximations to the true solution of the advection equation).

The fact that the modified equation is an advection-diffusion equation tells us a great deal about how the numerical solution behaves. Solutions to the advection-diffusion equation translate at the proper speed a but also diffuse and are smeared out. This is clearly visible in Figure 10.4(b).

Note that the diffusion coefficient in (10.44) is (1/2)(ah − a²k), which vanishes in the special case ak = h. In this case we already know that the exact solution to the advection equation is recovered by the upwind method. Also note that the diffusion coefficient is positive only if 0 < ak/h < 1. This is precisely the stability limit of upwind. If it is violated, then the diffusion coefficient in the modified equation is negative, giving an ill-posed problem with exponentially growing solutions. Hence we see that even some information about stability can be extracted from the modified equation.

Example 10.10. If the same procedure is followed for the Lax-Wendroff method, we find that all O(k) terms drop out of the modified equation, as is expected since this method is second order accurate on the advection equation. The modified equation obtained by retaining the O(k²) term and then replacing time derivatives by spatial derivatives is

    v_t + a v_x + (1/6)ah² (1 − (ak/h)²) v_xxx = 0.   (10.45)

The Lax-Wendroff method produces a third order accurate solution to this equation. This equation has a very different character from (10.43). The v_xxx term leads to dispersive behavior rather than diffusion. This is clearly seen in Figure 10.4(c), where the U_j^n computed with Lax-Wendroff are compared to the true solution of the advection equation.
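The diffusion coefficient in (10.44) makes a quantitative prediction: an advected Gaussian of variance σ² should have its peak reduced by the factor √(σ²/(σ² + 2Dt)) with D = (1/2)ah(1 − ak/h). The following sketch (grid parameters chosen arbitrarily, periodic boundaries assumed) checks this against the upwind method:

```python
import numpy as np

a, h, nu = 1.0, 0.01, 0.8
k = nu * h / a
D = 0.5 * a * h * (1 - nu)         # diffusion coefficient in (10.44)

sigma2 = 0.025                     # variance of exp(-20 (x - 2)^2), as in (10.41)
T = 5.0
steps = int(round(T / k))

x = np.arange(1000) * h            # periodic domain [0, 10)
U = np.exp(-((x - 2.0) ** 2) / (2 * sigma2))
for _ in range(steps):
    U = U - nu * (U - np.roll(U, 1))   # upwind step (10.21)

# Peak height predicted by the advection-diffusion modified equation:
predicted_peak = np.sqrt(sigma2 / (sigma2 + 2 * D * T))
```

For these parameters the observed peak agrees with the prediction (about 0.85) to within a few percent, consistent with the claim that U_j^n is second order accurate to (10.44).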
The magnitude of the error is smaller than with the upwind method for a given set of k and h, since it is a higher order method, but the dispersive term leads to an oscillating solution and also a shift in the location of the main peak, a phase error. This is similar to the dispersive behavior seen in Figure E.1 for an equation very similar to (10.45).

In Section E.3.6 the propagation properties of dispersive waves are analyzed in terms of the dispersion relation of the PDE and the phase and group velocities of different wave numbers. Following the discussion there, we find that for the modified equation (10.45), the group velocity for wave number ξ is

    c_g = a − (1/2)ah² (1 − (ak/h)²) ξ²,

which is less than a for all wave numbers. As a result the numerical result can be expected to develop a train of oscillations behind the peak, with the high wave numbers lagging farthest behind the correct location.

Some care must be used here, however, when looking at highly oscillatory waves (relative to the grid, i.e., waves for which ξh is far from 0). For ξh sufficiently small the modified equation (10.45) is a reasonable model, but for larger ξh the terms we have

neglected in this modified equation may play an equally important role. Rather than determining the dispersion relation for a method from its modified equation, it is more reliable to determine it directly from the numerical method, which is essentially what we have done in von Neumann stability analysis. This is pursued further in Example 10.13 below.

If we retain one more term in the modified equation for Lax-Wendroff, we find that the U_j^n are fourth order accurate solutions to an equation of the form

    v_t + a v_x + (1/6)ah² (1 − (ak/h)²) v_xxx = −ε v_xxxx,   (10.46)

where the ε in the fourth order dissipative term is O(k³ + h³) and positive when the stability bound holds. This higher order dissipation causes the highest wave numbers to be damped (see Section E.3.7), so that there is a limit to the oscillations seen in practice.

The fact that this method can produce oscillatory approximations is one of the reasons that the first order upwind method is sometimes preferable in practice. In some situations nonphysical oscillations may be disastrous, for example, if the value of u represents a concentration that cannot become negative or exceed some limit without difficulties arising elsewhere in the modeling process.

Example 10.11. The Beam-Warming method (10.26) has a similar modified equation,

    v_t + a v_x = (1/6)ah² (2 − 3(ak/h) + (ak/h)²) v_xxx.   (10.47)

In this case the group velocity is greater than a for all wave numbers in the case 0 < ak/h < 1, so that the oscillations move ahead of the main hump. If 1 < ak/h < 2, then the group velocity is less than a and the oscillations fall behind. (Again the dispersion relation for (10.47) gives an accurate idea of the dispersive properties of the numerical method only for ξh sufficiently small.)

Example 10.12. The modified equation for the leapfrog method (10.13) can be derived by writing

    (v(x, t + k) − v(x, t − k)) / 2k + a (v(x + h, t) − v(x − h, t)) / 2h = 0   (10.48)

and expanding in Taylor series. As in Example 10.9 we then further differentiate the resulting equation (which has an infinite number of terms) to express higher time derivatives of v in terms of spatial derivatives. The dominant terms look just like Lax-Wendroff, and (10.45) is again obtained.

However, from the symmetric form of (10.48) in both x and t we see that all even-order derivatives drop out. If we derive the next term in the modified equation we will find an equation of the form

    v_t + a v_x + (1/6)ah² (1 − (ak/h)²) v_xxx = ε v_xxxxx + ···   (10.49)

for some ε = O(h⁴ + k⁴), and higher order modified equations will also involve only odd-order derivatives and even powers of h and k. Hence the numerical solution produced with the leapfrog method is a fourth order accurate solution to the modified equation (10.45).

Moreover, recall from Section E.3.6 that all higher odd-order derivatives give dispersive terms. We conclude that the leapfrog method is nondissipative at all orders. This conclusion is consistent with the observation in Section 10.2.2 that kλ_p is on the boundary of the stability region for all eigenvalues λ_p of A from (10.10), and so we see neither growth nor decay of any mode. However, we also see from the form of (10.49) that high wave number modes will not propagate with the correct velocity. This was also true of Lax-Wendroff, but there the fourth order dissipation damps out the worst offenders, whereas with leapfrog this dispersion is often much more apparent in computational results, as observed in Figure 10.4(d).

Example 10.13. Since leapfrog is nondissipative, it serves as a nice example for calculating the true dispersion relation of the numerical method. The approach is very similar to the von Neumann stability analysis of Section 10.5, only now we use e^{i(ξx_j − ωt_n)} as our ansatz (so that the g(ξ) from von Neumann analysis is replaced by e^{−iωk}). Following the same procedure as in Example 10.4, we find that

    e^{−iωk} = e^{iωk} − (ak/h)(e^{iξh} − e^{−iξh}),   (10.50)

which can be simplified to yield

    sin(ωk) = (ak/h) sin(ξh).   (10.51)

This is the dispersion relation relating ω to ξ. Note that |ξh| ≤ π for waves that can be resolved on our grid and that for each such ξh there are two corresponding values of ωk. The dispersion relation is multivalued because leapfrog is a three-level method, and different temporal behavior of the same spatial wave can be seen, depending on the relation between the initial data chosen on the two initial levels. For well-resolved waves (|ξh| small) and reasonable initial data we expect ωk also near zero (not near π, where the other solution is in this case). Solving for ω as a function of ξ and expanding in Taylor series for small h would show that this agrees with the dispersion relation of the infinite modified equation (10.49).
We do not need to do this, however, if our goal is to compute the group velocity for the leapfrog method. We can differentiate (10.51) with respect to ξ and solve for

    dω/dξ = a cos(ξh)/cos(ωk) = ± a cos(ξh)/√(1 − ν² sin²(ξh)),   (10.52)

where ν = ak/h is the Courant number and again the ± arises from the multivalued dispersion relation. The velocity observed would depend on how the initial two levels are set. Note that the group velocity can be negative, and near −a for |ξh| ≈ π. This is not surprising, since the leapfrog method has a 3-point centered stencil, and it is possible for numerical waves to travel from right to left although physically there is advection only to the right. This can be observed in some computations, for example, in Figure 10.5 as discussed in Example 10.14.
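The group velocity formula (10.52) is easy to tabulate; the sketch below takes the branch with cos(ωk) > 0 (the relevant solution for well-resolved waves) and illustrative values a = 1, ν = 0.8:

```python
import numpy as np

def group_velocity(xi_h, a=1.0, nu=0.8):
    """Leapfrog group velocity (10.52), branch with cos(omega k) > 0."""
    return a * np.cos(xi_h) / np.sqrt(1 - nu**2 * np.sin(xi_h) ** 2)

assert abs(group_velocity(0.01) - 1.0) < 1e-3    # well-resolved: speed ~ a
assert abs(group_velocity(np.pi) + 1.0) < 1e-12  # sawtooth waves move at -a
```

The second assertion shows the backward-propagating numerical waves described above: at ξh = π the group velocity is exactly −a.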

10.10 Hyperbolic systems

The advection equation u_t + au_x = 0 can be generalized to a first order linear system of equations of the form

    u_t + Au_x = 0,
    u(x, 0) = η(x),   (10.53)

where u : R × R → R^s and A ∈ R^{s×s} is a constant matrix. (Note that this is not the matrix A from earlier in this chapter, e.g., (10.10).) This is a system of conservation laws (see Section E.2) with the flux function f(u) = Au. This system is called hyperbolic if A is diagonalizable with real eigenvalues, so that we can decompose

    A = RΛR^{−1},   (10.54)

where Λ = diag(λ_1, λ_2, ..., λ_s) is a diagonal matrix of eigenvalues and R = [r_1 | r_2 | ··· | r_s] is the matrix of right eigenvectors. Note that AR = RΛ, i.e.,

    A r_p = λ_p r_p   for p = 1, 2, ..., s.   (10.55)

The system is called strictly hyperbolic if the eigenvalues are distinct.

10.10.1 Characteristic variables

We can solve (10.53) by changing to the characteristic variables

    w = R^{−1} u,   (10.56)

in much the same way we solved linear systems of ODEs earlier. Multiplying (10.53) by R^{−1} and using (10.54) gives

    R^{−1} u_t + ΛR^{−1} u_x = 0   (10.57)

or, since R^{−1} is constant,

    w_t + Λ w_x = 0.   (10.58)

Since Λ is diagonal, this decouples into s independent scalar equations

    (w_p)_t + λ_p (w_p)_x = 0,   p = 1, 2, ..., s.   (10.59)

Each of these is a constant coefficient linear advection equation, with solution

    w_p(x, t) = w_p(x − λ_p t, 0).   (10.60)

Since w = R^{−1}u, the initial data for w_p is simply the pth component of the vector

    w(x, 0) = R^{−1} η(x).   (10.61)

The solution to the original system is finally recovered via (10.56):

    u(x, t) = R w(x, t).   (10.62)
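The solution procedure (10.56)-(10.62) is directly implementable. The sketch below uses a hypothetical 2x2 matrix A (not from the text) whose first component satisfies the second order wave equation, so the result can be checked against d'Alembert's formula:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # sample system with eigenvalues -1 and +1
lam, R = np.linalg.eig(A)             # A = R diag(lam) R^{-1}, as in (10.54)
Rinv = np.linalg.inv(R)

eta = lambda x: np.array([np.exp(-x**2), np.zeros_like(x)])   # initial data

def solve(x, t):
    """u(x,t) = sum_p w_p(x - lam_p t, 0) r_p, with w(., 0) = R^{-1} eta."""
    u = np.zeros((2, len(x)))
    for p in range(2):
        w0 = Rinv[p] @ eta(x - lam[p] * t)    # p-th characteristic variable
        u += np.outer(R[:, p], w0)            # coefficient of r_p, see (10.62)
    return u

x = np.linspace(-5.0, 5.0, 201)
u = solve(x, 1.0)
# For this A, u_1 satisfies u_tt = u_xx with zero initial time derivative,
# so u[0] should equal 0.5 * (eta_1(x - t) + eta_1(x + t)).
```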

Note that the value w_p(x, t) is the coefficient of r_p in an eigenvector expansion of the vector u(x, t), i.e., (0.62) can be written out as

    u(x, t) = Σ_{p=1}^{s} w_p(x, t) r_p.    (0.63)

Combining this with the solutions (0.60) of the decoupled scalar equations gives

    u(x, t) = Σ_{p=1}^{s} w_p(x − λ_p t, 0) r_p.    (0.64)

Note that u(x, t) depends only on the initial data at the s points x − λ_p t. This set of points is the domain of dependence D(x, t) of (0.39). The curves x = x_0 + λ_p t satisfying x'(t) = λ_p are the characteristics of the pth family, or simply p-characteristics. These are straight lines in the case of a constant coefficient system. Note that for a strictly hyperbolic system, s distinct characteristic curves pass through each point in the x-t plane. The coefficient w_p(x, t) of the eigenvector r_p in the eigenvector expansion (0.63) of u(x, t) is constant along any p-characteristic.

0. Numerical methods for hyperbolic systems

Most of the methods discussed earlier for the advection equation can be extended directly to a general hyperbolic system by replacing a with A in the formulas. For example, the Lax-Wendroff method becomes

    U_j^{n+1} = U_j^n − (k/2h) A (U_{j+1}^n − U_{j−1}^n) + (k²/2h²) A² (U_{j+1}^n − 2U_j^n + U_{j−1}^n).    (0.65)

This is second order accurate and is stable provided the Courant number is no larger than 1, where the Courant number is defined to be

    ν = max_{1 ≤ p ≤ s} |λ_p k / h|.    (0.66)

For the scalar advection equation there is only one eigenvalue, equal to a, and the Courant number is simply |ak/h|. The Lax-Friedrichs and leapfrog methods can be generalized in the same way to systems of equations and remain stable for ν ≤ 1. The upwind method for the scalar advection equation is based on a one-sided approximation to u_x, using data in the upwind direction. The one-sided formulas (0.2) and (0.22) generalize naturally to

    U_j^{n+1} = U_j^n − (k/h) A (U_j^n − U_{j−1}^n)    (0.67)

and

    U_j^{n+1} = U_j^n − (k/h) A (U_{j+1}^n − U_j^n).    (0.68)
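The system form of Lax-Wendroff is a direct transcription of (0.65). Below is a sketch on a periodic grid, assuming NumPy; the grid parameters are illustrative, and the scalar advection equation is recovered by passing a 1x1 matrix.

```python
import numpy as np

def lax_wendroff_system(A, U, k, h):
    """One Lax-Wendroff step for u_t + A u_x = 0; U has shape (s, m),
    with periodic boundary conditions."""
    Up = np.roll(U, -1, axis=1)        # U_{j+1}
    Um = np.roll(U, 1, axis=1)         # U_{j-1}
    return (U - (k / (2 * h)) * (A @ (Up - Um))
              + (k**2 / (2 * h**2)) * (A @ A @ (Up - 2 * U + Um)))

a, h, k = 1.0, 0.01, 0.008             # Courant number a*k/h = 0.8 <= 1
x = np.arange(0.0, 1.0, h)
U = np.exp(-100.0 * (x - 0.5)**2)[None, :]
for _ in range(25):                    # advance to t = 0.2
    U = lax_wendroff_system(np.array([[a]]), U, k, h)
```

After 25 steps the pulse has traveled a distance a·t = 0.2, so its peak should sit near x = 0.7, with only mild dispersion since the method is second order.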

For a system of equations, however, neither of these is useful unless all the eigenvalues of A have the same sign, so that the upwind direction is the same for all characteristic variables. The method (0.67) is stable only if

    0 ≤ λ_p k / h ≤ 1  for all p = 1, 2, ..., s,    (0.69)

while (0.68) is stable only if

    −1 ≤ λ_p k / h ≤ 0  for all p = 1, 2, ..., s.    (0.70)

It is possible to generalize the upwind method to more general systems with eigenvalues of both signs, but to do so requires decomposing the system into the characteristic variables and upwinding each of these in the appropriate direction. The resulting method can also be generalized to nonlinear hyperbolic systems and generally goes by the name of Godunov's method. These methods are described in much more detail in [66].

0.2 Initial boundary value problems

So far we have studied only numerical methods for hyperbolic problems on a domain with periodic boundary conditions (or the Cauchy problem, if we could use a grid with an infinite number of grid points). Most practical problems are posed on a bounded domain with nonperiodic boundary conditions, which must be specified in addition to the initial conditions to march forward in time. These problems are called initial boundary value problems (IBVPs). Consider the advection equation u_t + a u_x = 0 with a > 0, corresponding to flow to the right, on the domain 0 ≤ x ≤ 1, where some initial conditions u(x, 0) = φ(x) are given. This data completely determines the solution via (0.2) in the triangular region 0 ≤ x − at ≤ 1 of the x-t plane. Outside this region, however, the solution is determined only if we also impose boundary conditions at x = 0, say, (0.8). Then the solution is

    u(x, t) = φ(x − at)    if 0 ≤ x − at ≤ 1,
              g_0(t − x/a)  otherwise.    (0.71)

Note that boundary data are required only at the inflow boundary x = 0, not at the outflow boundary x = 1, where the solution is determined via (0.71).
Trying to impose a different value on u(1, t) would lead to a problem with no solution. If a < 0 in the advection equation, then x = 1 is the inflow boundary, the solution is transported to the left, and x = 0 is the outflow boundary.

Analysis of upwind on the initial boundary value problem

Now suppose we apply the upwind method (0.2) to this IBVP with a > 0 on a grid with h = 1/(m + 1) and x_i = ih for i = 0, 1, ..., m + 1. The formula (0.2) can be applied for i = 1, ..., m + 1 in each time step, while U_0^n = g_0(t_n) is set by the boundary condition. Hence the method is completely specified.

When is this method stable? Intuitively we expect it to be stable if 0 ≤ ak/h ≤ 1. This is the stability condition for the problem with periodic boundary conditions, and here we are using the same method at every point except i = 0, where the exact solution is being set in each time step. Our intuition is correct in this case, and the method is stable if 0 ≤ ak/h ≤ 1. Notice, however, that von Neumann analysis cannot be used in this case, as discussed already in Section 0.5: the Fourier modes e^{iξx} are no longer eigengridfunctions. But von Neumann analysis is still useful because it generally gives a necessary condition for stability. In most cases a method that is unstable on the periodic domain or Cauchy problem will not be useful on a bounded domain either, since locally on a fine grid, away from the boundaries, any instability indicated by von Neumann analysis is bound to show up.

Instead we can use MOL stability analysis, although it is sometimes subtle to do so correctly. We have a system of ODEs similar to (0.9) for the vector U(t), which again has m + 1 components corresponding to u(x_i, t). But now we must incorporate the boundary conditions, and so we have a system of the form

    U'(t) = A U(t) + g(t),    (0.72)

where A is the lower bidiagonal matrix

    A = (a/h) [ −1              ]
              [  1  −1          ]
              [     ...  ...    ]      g(t) = [ g_0(t) a/h, 0, ..., 0 ]^T.    (0.73)
              [          1  −1  ],

The upwind method corresponds to using Euler's method on the ODE (0.72). The change from the matrix of (0.0) to (0.73) may seem trivial, but it completely changes the character of the matrix. The matrix (0.0) is circulant and normal and has eigenvalues uniformly distributed on the circle of radius a/h in the complex plane centered at z = −a/h. The matrix (0.73) is a defective Jordan block with all its eigenvalues at the point −a/h. The eigenvalues have moved a distance a/h → ∞ as h → 0.
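The change of character is easy to see numerically. The sketch below (NumPy; the values of a and m are illustrative) builds the bidiagonal matrix of (0.73): being lower triangular, its exact eigenvalues all sit at the single point −a/h, yet a perturbation of norm 1e-12 scatters the computed eigenvalues far from that point, which is precisely the pseudospectral behavior discussed next.

```python
import numpy as np

a, m = 1.0, 50
h = 1.0 / (m + 1)
# Lower bidiagonal matrix of (0.73): -a/h on the diagonal, a/h below it.
A = (a / h) * (np.eye(m + 1, k=-1) - np.eye(m + 1))

# A does not commute with its transpose, so it is not normal.
nonnormal = np.linalg.norm(A @ A.T - A.T @ A) > 0.0

# Perturb by a random matrix with entries of size ~1e-12: the eigenvalues
# of the perturbed matrix scatter onto a large circle about -a/h,
# illustrating that the eps-pseudospectra nearly fill the circle of
# radius a/h even for tiny eps.
rng = np.random.default_rng(0)
E = 1e-12 * rng.standard_normal(A.shape)
scatter = np.max(np.abs(np.linalg.eigvals(A + E) - (-a / h)))
```

A perturbation of size eps moves the eigenvalues of an order-n Jordan block a distance on the order of eps^(1/n) times the scale of the matrix, which for n = 51 is a large fraction of a/h even at eps = 1e-12.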
Suppose we attempt to apply the usual stability analysis of Chapter 7 to this system and require that kλ ∈ S for all eigenvalues λ of A, where S is the stability region for Euler's method. Since S contains the interval [−2, 0] on the real axis, this would suggest the stability restriction

    0 ≤ ak/h ≤ 2    (0.74)

for the upwind method on the IBVP. This is wrong by a factor of 2. It is a necessary condition but not sufficient. The problem is that A in (0.73) is highly nonnormal. It is essentially a Jordan block of the sort discussed in Section D.5, and on a fine grid its ε-pseudospectra roughly fill up the circle of radius a/h about −a/h, even for very small ε. This is a case where we need to apply a more stringent requirement than simply requiring that kλ be inside the stability region for all eigenvalues; we also need to require that

    dist(kλ_ε, S) ≤ Cε    (0.75)

holds for the ε-pseudoeigenvalues λ_ε (see Section D.5), where C is a modest constant, as suggested in [47], [92]. Requiring (0.75) shows that the expected requirement 0 ≤ ak/h ≤ 1 is needed rather than (0.74).

Outflow boundary conditions

When the upwind method is used for the advection equation, as in the previous section, the outflow boundary poses no problem. The finite difference formula is one sided, and the value U_{m+1}^n at the rightmost grid point is computed by the same formula that is used in the interior. However, if we use a finite difference method whose stencil extends to the right as well as the left, we will need to use a different formula at the rightmost point in the domain. This is called a numerical boundary condition or artificial boundary condition, since it is required by the method, not by the PDE. Numerical boundary conditions may also be required at boundaries where a physical boundary condition is imposed, if the numerical method requires more conditions than the equation. For example, if we use the Beam-Warming method for advection, which has a stencil that extends two grid points in the upwind direction, then we can use this only for updating U_2^{n+1}, U_3^{n+1}, .... The value U_0^{n+1} will be set by the physical boundary condition, but U_1^{n+1} will have to be set by some other method, which can be viewed as a numerical boundary condition for Beam-Warming. Naturally some care must be used in choosing numerical boundary conditions, both in terms of the accuracy and the stability of the resulting method. This is a difficult topic, particularly the stability analysis of numerical methods for IBVPs, and we will not pursue it here. See, for example, [40], [84], [89].

Example 0.4. We will look at one example simply to give a flavor of the potential difficulties. Suppose we use the leapfrog method to solve the IBVP for the advection equation u_t + a u_x = 0.
At the left (inflow) boundary we can use the given boundary condition, but at the right we will need a numerical boundary condition. Suppose we use the first order upwind method at this point,

    U_{m+1}^{n+1} = U_{m+1}^n − (ak/h) (U_{m+1}^n − U_m^n),    (0.76)

which is also consistent with the advection equation. Figure 0.5 shows four snapshots of the solution when the initial data is u(x, 0) = φ(x) = exp(−5(x − 2)²) and the first two time levels for leapfrog are initialized based on the exact solution u(x, t) = φ(x − at). We see that as the wave passes out the right boundary, a reflection is generated that moves to the left, back into the domain. The dispersion relation for the leapfrog method found in Example 0.3 shows that waves with ξh ≈ π can move to the left with group velocity approximately equal to −a, and this wave number corresponds exactly to the sawtooth wave seen in the figure. For an interesting discussion of the relation of numerical dispersion relations and group velocity to the stability of numerical boundary conditions, see [89].

Outflow boundaries are often particularly troublesome. Even if a method is formally stable (as the leapfrog method in the previous example is with the upwind boundary condition), it is often hard to avoid spurious reflections. We have seen this even for the advection equation, where in principle flow is entirely to the right, and it can be even more difficult

Figure 0.5. Numerical solution of the advection equation using the leapfrog method in the interior and the upwind method at the right boundary. The solution is shown at four equally spaced times, illustrating the generation and leftward propagation of a sawtooth mode.

to develop appropriate boundary conditions for wave propagation or fluid dynamics equations that admit wave motion in all directions. Yet in practice we always have to compute over a finite domain, and this often requires setting artificial boundaries around the region of interest. The assumption is that the phenomena of interest happen within this region, and the hope is that any waves hitting the artificial boundary will leave the domain with no reflection. Numerical boundary conditions that attempt to achieve this are often called nonreflecting or absorbing boundary conditions.
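The experiment of Example 0.4 can be sketched as follows, assuming NumPy. The grid, the Gaussian data centered at x = 0.3, and the zero inflow value are illustrative choices, not the exact setup of Figure 0.5. Leapfrog is used in the interior and the upwind formula (0.76) at the outflow; running well beyond the time at which the pulse exits would show the reflected sawtooth.

```python
import numpy as np

a, m, T = 1.0, 200, 0.5
h = 1.0 / (m + 1)
k = 0.8 * h / a                          # Courant number 0.8
x = np.linspace(0.0, 1.0, m + 2)
exact = lambda t: np.exp(-100.0 * (x - 0.3 - a * t)**2)

Uold, U = exact(0.0), exact(k)           # two starting levels for leapfrog
t = k
while t < T:
    Unew = np.empty_like(U)
    Unew[1:-1] = Uold[1:-1] - (a * k / h) * (U[2:] - U[:-2])     # leapfrog
    Unew[0] = 0.0                        # inflow data g_0(t) = 0 here
    Unew[-1] = U[-1] - (a * k / h) * (U[-1] - U[-2])   # upwind at outflow
    Uold, U = U, Unew
    t += k
```

Up to T = 0.5 the pulse is still inside the domain, so the computed solution should track the exact one closely; the interesting reflection appears only after the pulse reaches x = 1.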

0.3 Other discretizations

As in the previous chapter on parabolic equations, we have concentrated on a few basic methods in order to explore some fundamental ideas. We have also considered only the simplest case of constant coefficient linear hyperbolic equations, whereas in practice most hyperbolic problems of interest have variable coefficients (e.g., linear wave propagation in heterogeneous media) or are nonlinear. These problems give rise to a host of new difficulties, not least of which is the fact that the solutions of interest are often discontinuous, since nonlinearity can lead to shock formation. There is a well-developed theory of numerical methods for such problems that we will not delve into here; see, for example, [66]. Even in the case of constant coefficient linear problems, there are many other discretizations possible beyond the ones presented here. We end with a brief overview of just a few of these:

Higher order discretizations of u_x can be used in place of the discretizations considered so far. If we write the MOL system for the advection equation as

    U_j'(t) = −a W_j(t),    (0.77)

where W_j(t) is some approximation to u_x(x_j, t), then there are many ways to approximate W_j(t) from the U_j values beyond the centered approximation (0.3). One-sided approximations are one possibility, as in the upwind method. For sufficiently smooth solutions we might instead use higher order accurate centered approximations, e.g.,

    W_j = (4/3) (U_{j+1} − U_{j−1}) / (2h) − (1/3) (U_{j+2} − U_{j−2}) / (4h).    (0.78)

This and other approximations can be determined using the fdcoeffV.m routine discussed in Section .5; e.g., fdcoeffV(1,0,-2:2) produces (0.78). Centered approximations such as (0.78) generally lead to skew-symmetric matrices with pure imaginary eigenvalues, at least when applied to the problem with periodic boundary conditions.
In practice most problems are on a finite domain with nonperiodic boundary conditions, and other issues arise as already seen in Section 0.2. Note that the discretization (0.78) requires more numerical boundary conditions than the upwind or second order centered operator. An interesting approach to obtaining better accuracy in W_j is to use a so-called compact method, in which the W_j are determined by solving a linear system rather than given explicitly. A simple example is

    (1/4) W_{j−1} + W_j + (1/4) W_{j+1} = (3/2) (U_{j+1} − U_{j−1}) / (2h).    (0.79)

This gives a tridiagonal system of equations to solve for the W_j values, and it can be shown that the resulting values will be O(h^4) approximations to u_x(x_j, t). Higher order methods of this form also exist; see Lele [63] for an in-depth discussion. In addition to giving higher order of accuracy with a compact stencil, these approximations also typically have much better dispersion properties than standard finite difference approximations of the same order.
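On a periodic grid the compact system (0.79) is a small linear solve. The following sketch assumes NumPy; the grid size and the test function sin x are illustrative choices, and the observed error is consistent with the O(h^4) accuracy claimed above.

```python
import numpy as np

m = 64
h = 2.0 * np.pi / m
x = h * np.arange(m)
U = np.sin(x)

# Circulant tridiagonal system (1/4)W_{j-1} + W_j + (1/4)W_{j+1}
#   = (3/2)(U_{j+1} - U_{j-1})/(2h), with periodic wrap-around entries.
M = np.eye(m) + 0.25 * (np.eye(m, k=1) + np.eye(m, k=-1))
M[0, -1] = M[-1, 0] = 0.25
rhs = 1.5 * (np.roll(U, -1) - np.roll(U, 1)) / (2.0 * h)
W = np.linalg.solve(M, rhs)

err = np.max(np.abs(W - np.cos(x)))      # exact derivative is cos(x)
```

With 64 points the maximum error is already below 1e-4, far better than the roughly 1e-3 error a second order centered difference gives on the same grid.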

Spectral approximations to the first derivative can be used, based on the same ideas as in Section 2.2. In this case W = DU is used to obtain the approximations to the first derivative, where D is the dense spectral differentiation matrix. These methods can also be generalized to variable coefficient and even nonlinear problems and often work very well for problems with smooth solutions. Stability analysis of these methods can be tricky. To obtain reasonable results a nonuniform distribution of grid points must be used, such as the Chebyshev extreme points as discussed in Section 2.2. In this case the eigenvalues of the matrix D turn out to be O(1/h²), rather than O(1/h) as is expected for a fixed-stencil discretization of the first derivative. If instead we use the roots of the Legendre polynomial (see Section B.3), another popular choice of grid points for spectral methods, it can be shown that the eigenvalues are O(1/h), which appears to be better. In both cases, however, the matrix D is highly nonnormal and the eigenvalues are misleading, and in fact a time step k = O(h²) is generally required if an explicit method is used for either choice of grid points; see, e.g., [92].

Other time discretizations can be used in place of the ones discussed in this chapter. In particular, for spectral methods the MOL system is stiff and it may be beneficial to use an implicit method as discussed in Chapters 8 and 9. Another possibility is to use an exponential time differencing method, as discussed in Section .6.

For conservation laws (see Section E.2) numerical methods are often more naturally derived using the integral form (E.9) than by using finite difference approximations to derivatives. Such methods are particularly important for nonlinear hyperbolic problems, where shock waves (discontinuous solutions) can develop spontaneously even from smooth initial data.
In this case the discrete value U_i^n is viewed as an approximation to the cell average of u(x, t_n) over the grid cell [x_{i−1/2}, x_{i+1/2}] of length h centered about x_i,

    U_i^n ≈ (1/h) ∫_{x_{i−1/2}}^{x_{i+1/2}} u(x, t_n) dx.    (0.80)

According to (E.9) this cell average evolves at a rate given by the difference of fluxes at the cell edges, and a particular numerical method is obtained by approximating these fluxes based on the current cell averages. Methods of this form are often called finite volume methods, since the spatial domain is partitioned into volumes of finite size. Simple finite volume methods often look identical to finite difference methods, but the change in viewpoint allows the development of more sophisticated methods that are better suited to solving nonlinear conservation laws. These methods also have some advantages for linear problems, particularly if they have variable coefficients with jump discontinuities, as often arises in solving wave propagation problems in heterogeneous media. See [66] for a detailed description of such methods.

Chapter 0

Short Note on Hyperbolic PDEs - Linear Scalar Case

1. Linear scalar equations

We consider two types of linear scalar advection equations, one with a constant velocity a, and the other with a variable velocity a(x(t)). Let's first take a look at the 1D linear scalar advection equation for t ≥ 0 written as

    u_t + a u_x = 0    (0.1)

with a constant advection velocity a, together with initial conditions on R,

    u(x, 0) = u_0(x).    (0.2)

As shown in the previous chapter, we know the solution is given by

    u(x, t) = u_0(x − at)    (0.3)

for t ≥ 0. Recall that x − at = x_0 is called the characteristic line with a given constant x_0 and with the propagation velocity a. Depending on the sign of a, the initial data u_0(x) is advected (or transported), hence the name advection equation, to the right (if a > 0) or left (if a < 0). Note that there are infinitely many characteristic lines in the x-t plane, as there are infinitely many choices of x_0 ∈ R. See Fig. 1.

In general, the characteristics are curves (or simply "the characteristics") in the x-t plane satisfying the ODEs

    x'(t) = a  and  x(0) = x_0.    (0.4)

One very important property of the characteristics is that for a constant velocity a the solution u(x, t) remains constant along them. To see this,

    d/dt u(x(t), t) = ∂_t u(x(t), t) + ∂_x u(x(t), t) x'(t) = u_t + a u_x = 0,    (0.5)

confirming the claim.

280 280 Figure. Characteristic curves and the advection of the solution. All information is simply advected to the later time solution u(x, t) along the characteristic curves in the x-t plane without any shape changes from the initial condition u 0 (x). In the more general case of the scalar equation with the variable velocity a(x(t)), we consider ( ) u t + a(x(t))u = 0. (0.6) x In this case, the characteristics are no longer straight lines satisfying x (t) = a(x(t)) and x(0) = x 0, (0.7) and the solution u(x, t) is no longer constant along the characteristics. This can be easily verified if we rewrite Eq. 0.6 as therefore we obtain u t + a(x(t))u x = a (x(t))u, (0.8) d dt u(x(t), t) = a (x(t))u 0. (0.9) In both cases of the constant and variable velocities, the solution can be easily determined by solving sets of ODEs. Remark: In words, the characteristic curves track the motion of material particles. Remark: We can see that if u 0 (x) C k (R) then u(x, t) C k (R) (0, ). Remark: So far, we have assumed differentiability of u(x, t) in manipulating the above relations. Note that this assumption makes it possible to seeks for a classical solution u(x, t) of the differential equations.

1.1 Domain of dependence & range of influence

We now make an important observation about solutions to the linear advection equations:

The solution u(x, t) at any point (x̄, t̄) depends only on the initial data u_0 at a single point, namely the x_0 such that (x̄, t̄) lies on the characteristic through x_0. This means that the solution u(x̄, t̄) will remain unchanged no matter how we change the initial data at any point other than x_0.

We now define two related regions; the first is called the domain of dependence, and the second is called the range of influence.

Definition: The set D(x̄, t̄) = { x̄ − λ_m t̄ : m = 1, 2, ..., p } is called the domain of dependence of the point (x̄, t̄), where p is the total number of characteristic velocities (or the number of equations of the hyperbolic PDE system). See Fig. 2 for an illustration.

Remark: For convenience, let us assume λ_1 ≤ ... ≤ λ_m ≤ ... ≤ λ_p. Note that p = 1 for scalar hyperbolic equations, whereas p > 1 for systems of hyperbolic equations. For instance, p = 3 for the system of 1D Euler equations (continuity equation, momentum equation, and energy equation).

Note: What are the values of p for the systems of 2D Euler and 3D Euler equations?

Definition: The region R = { (x, t) : λ_1 t ≤ x − x_0 ≤ λ_p t } is called the range of influence of the point x_0. See Fig. 3 for an illustration.

Note: One can always find a bounded set D̄ = { x : |x − x̄| ≤ max_m |λ_m| t̄ } such that D(x̄, t̄) ⊂ D̄. The existence of D̄ and R is a consequence of the fact that hyperbolic equations have finite propagation speed; information can travel with speed at most max_m { |λ_m| : m = 1, ..., p }.

2. A List of Finite Difference Methods for the Linear Problem

In this section, we provide several finite difference (FD) methods for solving our model PDE, u_t + a u_x = 0. We assume a > 0 for the Beam-Warming and Fromm methods. One can easily derive the corresponding forms of these two methods for a < 0.
Forward Euler (FTCS, Forward Time Centered Space)

    U_i^{n+1} = U_i^n − (aΔt/2Δx) (U_{i+1}^n − U_{i−1}^n)    (0.10)

One-sided (FTBS, Forward Time Backward Space)

    U_i^{n+1} = U_i^n − (aΔt/Δx) (U_i^n − U_{i−1}^n)    (0.11)

Figure 2. The domain of dependence of the point (x̄, t̄) for a typical hyperbolic system of three equations with λ_1 < 0 < λ_2 < λ_3. Note that one can always find a bounded domain D̄ such that D ⊂ D̄, because the propagation velocities (or characteristic velocities) of hyperbolic PDEs are always finite.

One-sided (FTFS, Forward Time Forward Space)

    U_i^{n+1} = U_i^n − (aΔt/Δx) (U_{i+1}^n − U_i^n)    (0.12)

Leapfrog

    U_i^{n+1} = U_i^{n−1} − (aΔt/Δx) (U_{i+1}^n − U_{i−1}^n)    (0.13)

Lax-Friedrichs (LF)

    U_i^{n+1} = (1/2) (U_{i+1}^n + U_{i−1}^n) − (aΔt/2Δx) (U_{i+1}^n − U_{i−1}^n)    (0.14)

Lax-Wendroff (LW)

    U_i^{n+1} = U_i^n − (aΔt/2Δx) (U_{i+1}^n − U_{i−1}^n) + (1/2) (aΔt/Δx)² (U_{i+1}^n − 2U_i^n + U_{i−1}^n)    (0.15)

Figure 3. The range of influence R = { x : λ_1 t ≤ x − x_0 ≤ λ_3 t } of the point x_0 for the same problem as in Fig. 2. Notice that the conic region R is a mirror image of D with respect to (x̄, t̄), shifted to the t = 0 axis.

Beam-Warming (BW) for a > 0

    U_i^{n+1} = U_i^n − (aΔt/2Δx) (3U_i^n − 4U_{i−1}^n + U_{i−2}^n) + (1/2) (aΔt/Δx)² (U_i^n − 2U_{i−1}^n + U_{i−2}^n)    (0.16)

Fromm's method for a > 0

    U_i^{n+1} = U_i^n − (aΔt/Δx) (U_i^n − U_{i−1}^n) − (aΔt/4Δx) (1 − aΔt/Δx) (U_{i+1}^n − U_i^n − U_{i−1}^n + U_{i−2}^n)    (0.17)

3. The Fundamental Theorem of Numerical Methods: The Lax Equivalence Theorem for Linear PDEs

Recall that, as in the case of ODEs, our ultimate goal in PDE theory is to establish the following result, called the Lax equivalence theorem. It is important to note that this theorem is only valid for linear PDEs and does not hold for nonlinear PDEs.

    Consistency + Stability ⟹ Convergence

Consistency: lim_{Δt,Δx→0} ||E_LT^n|| = 0.
Stability: ||U^n|| ≤ C for all n with nΔt ≤ T, for each fixed T. (Equivalently, one bounds ||U^{n+1}|| in terms of ||U^n|| step by step.)
Convergence: lim_{Δx,Δt→0} ||E_g^n|| = 0.
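The one-step updates listed in Section 2 can be sketched and compared on a periodic grid. This is an illustrative NumPy experiment (grid size, Courant number, and data are assumptions, not from the notes): with C = aΔt/Δx = 0.8, 125 steps advect the profile exactly one period, and the run also illustrates the role of stability in the theorem above, since FTCS is consistent but unstable and blows up, while the upwind method (FTBS) obeys a discrete maximum principle.

```python
import numpy as np

def step(U, C, method):
    """One update of u_t + a u_x = 0 (a > 0), C = a*dt/dx, periodic grid."""
    Up, Um = np.roll(U, -1), np.roll(U, 1)       # U_{i+1}, U_{i-1}
    if method == "FTBS":     # one-sided upwind, stable for 0 <= C <= 1
        return U - C * (U - Um)
    if method == "LF":       # Lax-Friedrichs
        return 0.5 * (Up + Um) - 0.5 * C * (Up - Um)
    if method == "LW":       # Lax-Wendroff
        return U - 0.5 * C * (Up - Um) + 0.5 * C**2 * (Up - 2 * U + Um)
    if method == "FTCS":     # centered in space; unconditionally unstable
        return U - 0.5 * C * (Up - Um)
    raise ValueError(method)

m, C, nsteps = 100, 0.8, 125                     # 125 steps = one period
x = np.linspace(0.0, 1.0, m, endpoint=False)
U0 = np.exp(-100.0 * (x - 0.5)**2)
U = {name: U0.copy() for name in ("FTBS", "LF", "LW", "FTCS")}
for _ in range(nsteps):
    for name in U:
        U[name] = step(U[name], C, name)
```

After one full period the second order LW solution is close to the initial profile, the first order FTBS and LF solutions are visibly diffused but bounded, and the FTCS solution has grown far beyond the initial maximum of 1.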

Remark: With this theorem, we often show consistency and stability of numerical methods in order to show the convergence of the methods.

Example: Consistency. We consider the first-order upwind method and check that the difference equation (DE) is consistent. The upwind DE for a > 0 is

    U_i^{n+1} = U_i^n − (aΔt/Δx) (U_i^n − U_{i−1}^n).    (0.18)

Let's now apply Taylor expansion to obtain the local truncation error E_{LT,i}^{n+1}:

    E_{LT,i}^{n+1} = (1/Δt) [ u(x_i, t_{n+1}) − N(u_i^n) ],    (0.19)

where

    N(u_i^n) = u(x_i, t_n) − (aΔt/Δx) [ u(x_i, t_n) − u(x_{i−1}, t_n) ].    (0.20)

Here, please note that we differentiate between u_i^n and U_i^n. Using Taylor expansions of u(x_i, t_{n+1}) and u(x_{i−1}, t_n):

    u(x_i, t_{n+1}) = u(x_i, t_n) + u_t(x_i, t_n) Δt + u_tt(x_i, t_n) Δt²/2 + O(Δt³),    (0.21)

and

    u(x_{i−1}, t_n) = u(x_i, t_n) − u_x(x_i, t_n) Δx + u_xx(x_i, t_n) Δx²/2 + O(Δx³).    (0.22)

Substituting Eq. 0.21 and Eq. 0.22 into Eq. 0.19 gives

    E_{LT,i}^{n+1} = u_t(x_i, t_n) + a u_x(x_i, t_n) + u_tt(x_i, t_n) Δt/2 − a u_xx(x_i, t_n) Δx/2 + O(Δt², Δx²).    (0.23)

Note that the first two terms vanish since u(x, t) is the exact solution to the PDE. Using the so-called Cauchy-Kowalewski procedure, we get

    u_tt = −a u_tx = −a(−a u_x)_x = a² u_xx,    (0.24)

and we finally arrive at

    E_{LT,i}^{n+1} = (1/2) a (aΔt − Δx) u_xx(x_i, t_n) + O(Δt², Δx²) = O(Δt, Δx).    (0.25)

This means that the local truncation error E_{LT,i}^n is dominated by O(Δt, Δx), whereby we show that the method is first-order accurate in both space and time. It also proves that the method is consistent, because E_{LT,i}^{n+1} approaches zero as Δt and Δx go to zero, as long as u(x, t) is at least twice differentiable in both space and time.
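The first-order behavior of the truncation error can also be checked numerically, by evaluating E_LT for a known smooth solution at two grid spacings with the Courant number held fixed. This is an illustrative NumPy sketch; the solution u(x, t) = sin(x − at) and the sample point are assumptions made for the test.

```python
import numpy as np

a, C, x0, t0 = 1.0, 0.8, 0.3, 0.0
u = lambda xx, tt: np.sin(xx - a * tt)     # exact smooth solution

def lte(dx):
    """E_LT of the upwind method at (x0, t0), with dt = C*dx/a held fixed."""
    dt = C * dx / a
    N = u(x0, t0) - C * (u(x0, t0) - u(x0 - dx, t0))   # one upwind update
    return (u(x0, t0 + dt) - N) / dt

e1, e2 = abs(lte(1e-3)), abs(lte(5e-4))
order = np.log2(e1 / e2)     # should approach 1 for a first order method
```

Halving Δx (and hence Δt) halves the truncation error, so the observed order is close to 1, in agreement with Eq. 0.25.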

Note: In the above we actually have the pointwise limit

    lim_{Δt,Δx→0} E_{LT,i}^{n+1} = 0,    (0.26)

as long as the solution u(x, t) is twice differentiable in space and time. Hence it is natural that its norm ||E_LT^n|| approaches zero in the limit, regardless of the choice of norm, even in the max-norm. This is what is to be expected for smooth continuous solutions.

Homework 1: Show that the LF method is consistent. The Lax-Friedrichs (LF) method for u_t + a u_x = 0 can be written as

    U_i^{n+1} = (1/2) (U_{i+1}^n + U_{i−1}^n) − (Δt/2Δx) ( f(U_{i+1}^n) − f(U_{i−1}^n) ),    (0.27)

where the flux function is given by f(u) = au, with a > 0 or a < 0.

Homework 2: The Lax-Wendroff (LW) method for u_t + a u_x = 0 is

    U_i^{n+1} = U_i^n − (C_a/2) (U_{i+1}^n − U_{i−1}^n) + (C_a²/2) (U_{i+1}^n − 2U_i^n + U_{i−1}^n),    (0.28)

where C_a = aΔt/Δx. Again, we have a > 0 or a < 0. Show that LW is consistent.

4. The CFL Condition

We can write an explicit DE for our model PDE in conservative form,

    U_i^{n+1} = U_i^n − (Δt/Δx) ( F_{i+1/2}^n − F_{i−1/2}^n ),    (0.29)

where the numerical fluxes can take the form

    F_{i+1/2}^n = F(U_i^n, U_{i+1}^n) = { a U_i^n      if a > 0,
                                          a U_{i+1}^n  if a < 0.    (0.30)

Similarly,

    F_{i−1/2}^n = F(U_{i−1}^n, U_i^n) = { a U_{i−1}^n  if a > 0,
                                           a U_i^n      if a < 0.    (0.31)

Note that the method in Eq. 0.29 is said to be the upwind method, whose stability is achieved by taking the proper upwind direction in the flux evaluations in Eq. 0.30 and Eq. 0.31. One can easily see that the upwind method is the same as the one-sided methods in Eq. 0.11 or Eq. 0.12, depending on the upwind direction, a > 0 or a < 0, respectively. In general, when the model PDE is no longer a linear constant scalar equation but a nonlinear system (e.g., the Euler equations), one needs a conditional statement to consider the signs of a_{i±1/2} and produce the cell-interface fluxes of the form a_{i±1/2}^n U_{i±1/2}^n. We consider these more sophisticated cases much later, when we

Figure 4. Characteristics for the model advection equation with a > 0. Left panel: for a small Δt satisfying Δt < Δx/a, the characteristic information travels less than a single grid cell distance in a single time step Δt; hence the numerical flux F_{i+1/2}^n depends on U_i^n only. Right panel: for a large Δt failing to satisfy Δt < Δx/a, the characteristic information travels more than a single grid cell distance in a single time step Δt, giving an extended dependency of the numerical flux F_{i+1/2}^n on U_{i−1}^n as well as U_i^n.

study numerical methods for nonlinear systems, and just focus on the simple linear constant scalar case for now.

We see that the update scheme of the DE in Eq. 0.29 uses three neighboring cell values, U_{i−1}^n, U_i^n, U_{i+1}^n, in order to update U_i^{n+1} over Δt. One can think of two different situations in choosing Δt:

    (1) a Δt < Δx,    (0.32)
    (2) a Δt > Δx.    (0.33)

In the first case, Eq. 0.32, information propagates less than one grid cell distance in a single time step, whereas in the second case, Eq. 0.33, information travels a longer distance than one grid cell. As shown in the right panel of Fig. 4, the way we formulated the numerical flux F_{i+1/2}^n in Eq. 0.30 becomes unstable because it does not include U_{i−1}^n for the large choice of Δt > Δx/a. As a result, the numerical scheme in Eq. 0.29 becomes unstable for such a large time step Δt, and the instability grows exponentially. This is a consequence of violating the CFL condition, named after Courant, Friedrichs, and Lewy. See those three genius faces on top of the Pantheon pillars in Fig. 6

Figure 5. The green triangles represent the numerical domain of dependence of the point marked by the red star. The red triangles illustrate the analytical domain of dependence of the red point. The solid triangles, with the two characteristic lines x = ±at, show the maximum CFL stability region C_a = 1. Top figure: illustration of a stable case, where the numerical domain of dependence (green triangle) includes all of the analytical domain of dependence (red triangle). Bottom figure: illustration of an unstable case, where the numerical domain of dependence (green triangle) does not include all of the analytical domain of dependence (red triangle).

of the cover page. The CFL condition is a necessary stability condition for any numerical method and is stated as follows:

A numerical method can be convergent only if its numerical domain of dependence contains the true domain of dependence of the given PDE, at least in the limit as Δt and Δx go to zero.

The CFL condition therefore provides a necessary condition for choosing the length of Δt, depending on the PDE under consideration. If we let C_a be the CFL number, satisfying 0 < C_a ≤ 1, then for the advection case

    C_a = max_p |λ_p| Δt / Δx,    (0.34)

Figure 6. Some early pioneers of CFD in the era since WWII. Top level: Jay Boris, Vladimir Kolgan, Bram van Leer, Antony Jameson. Ground level: Richard Courant, Kurt Friedrichs, Hans Lewy, Robert MacCormack, Philip Roe, John von Neumann, Stanley Osher, Ami Harten, Peter Lax, Sergei Godunov. Courtesy of Bram van Leer.

and for the diffusion case,

    C_a = max_p κ_p Δt / Δx²,    (0.35)

where p is the number of available wave speeds λ_p or diffusion coefficients κ_p, respectively. Note that p = 1 for the linear scalar equations. It is important to note that the CFL condition is only a necessary condition for stability (and hence convergence). It is not always sufficient to guarantee stability, and a numerical method satisfying the CFL condition can still become unstable.

So far, we have discussed stability of the numerical schemes in relation to the CFL condition. What can we say about numerical accuracy with regard to the CFL condition? Let us give a brief discussion of the relation between C_a and the numerical accuracy. Consider the stable case shown in the top figure of Fig. 5. We know from the domain of dependence that the properties at the red star depend only on those points inside the red triangle. However, the grid points x_{i−1} and x_{i+1} are outside the domain of dependence (the red triangle) of the red star, and hence theoretically they should not influence the properties of the red star. On the other hand, the numerical domain of dependence (the region of the green triangle) actually takes information only from the two locations at


More information

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation

More information

Senior astrophysics Lab 2: Evolution of a 1 M star

Senior astrophysics Lab 2: Evolution of a 1 M star Senior astrophysics Lab 2: Evolution of a 1 M star Name: Checkpoints due: Friday 13 April 2018 1 Introduction This is the rst of two computer labs using existing software to investigate the internal structure

More information

AMS 132: Discussion Section 2

AMS 132: Discussion Section 2 Prof. David Draper Department of Applied Mathematics and Statistics University of California, Santa Cruz AMS 132: Discussion Section 2 All computer operations in this course will be described for the Windows

More information

Using Microsoft Excel

Using Microsoft Excel Using Microsoft Excel Objective: Students will gain familiarity with using Excel to record data, display data properly, use built-in formulae to do calculations, and plot and fit data with linear functions.

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

Number Systems III MA1S1. Tristan McLoughlin. December 4, 2013

Number Systems III MA1S1. Tristan McLoughlin. December 4, 2013 Number Systems III MA1S1 Tristan McLoughlin December 4, 2013 http://en.wikipedia.org/wiki/binary numeral system http://accu.org/index.php/articles/1558 http://www.binaryconvert.com http://en.wikipedia.org/wiki/ascii

More information

Computational Modeling for Physical Sciences

Computational Modeling for Physical Sciences Computational Modeling for Physical Sciences Since the invention of computers the use of computational modeling and simulations have revolutionized the way we study physical systems. Their applications

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Motion II. Goals and Introduction

Motion II. Goals and Introduction Motion II Goals and Introduction As you have probably already seen in lecture or homework, and if you ve performed the experiment Motion I, it is important to develop a strong understanding of how to model

More information

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations Lab 2 Worksheet Problems Problem : Geometry and Linear Equations Linear algebra is, first and foremost, the study of systems of linear equations. You are going to encounter linear systems frequently in

More information

MATH 345 Differential Equations

MATH 345 Differential Equations MATH 345 Differential Equations Spring 2018 Instructor: Time: Dr. Manuela Girotti; office: Weber 223C email: manuela.girotti@colostate.edu Mon-Tue-Wed-Fri 1:00pm-1:50pm Location: Engineering E 206 Office

More information

Moving into the information age: From records to Google Earth

Moving into the information age: From records to Google Earth Moving into the information age: From records to Google Earth David R. R. Smith Psychology, School of Life Sciences, University of Hull e-mail: davidsmith.butterflies@gmail.com Introduction Many of us

More information

Physics E-1ax, Fall 2014 Experiment 3. Experiment 3: Force. 2. Find your center of mass by balancing yourself on two force plates.

Physics E-1ax, Fall 2014 Experiment 3. Experiment 3: Force. 2. Find your center of mass by balancing yourself on two force plates. Learning Goals Experiment 3: Force After you finish this lab, you will be able to: 1. Use Logger Pro to analyze video and calculate position, velocity, and acceleration. 2. Find your center of mass by

More information

Algebra. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

Algebra. Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed. This document was written and copyrighted by Paul Dawkins. Use of this document and its online version is governed by the Terms and Conditions of Use located at. The online version of this document is

More information

Experiment 1: The Same or Not The Same?

Experiment 1: The Same or Not The Same? Experiment 1: The Same or Not The Same? Learning Goals After you finish this lab, you will be able to: 1. Use Logger Pro to collect data and calculate statistics (mean and standard deviation). 2. Explain

More information

MATH 137 : Calculus 1 for Honours Mathematics. Online Assignment #2. Introduction to Sequences

MATH 137 : Calculus 1 for Honours Mathematics. Online Assignment #2. Introduction to Sequences 1 MATH 137 : Calculus 1 for Honours Mathematics Online Assignment #2 Introduction to Sequences Due by 9:00 pm on WEDNESDAY, September 19, 2018 Instructions: Weight: 2% This assignment covers the topics

More information

Welcome to Physics 161 Elements of Physics Fall 2018, Sept 4. Wim Kloet

Welcome to Physics 161 Elements of Physics Fall 2018, Sept 4. Wim Kloet Welcome to Physics 161 Elements of Physics Fall 2018, Sept 4 Wim Kloet 1 Lecture 1 TOPICS Administration - course web page - contact details Course materials - text book - iclicker - syllabus Course Components

More information

Prealgebra. Edition 5

Prealgebra. Edition 5 Prealgebra Edition 5 Prealgebra, Edition 5 2009, 2007, 2005, 2004, 2003 Michelle A. Wyatt (M A Wyatt) 2009, Edition 5 Michelle A. Wyatt, author Special thanks to Garry Knight for many suggestions for the

More information

CHEMICAL INVENTORY ENTRY GUIDE

CHEMICAL INVENTORY ENTRY GUIDE CHEMICAL INVENTORY ENTRY GUIDE Version Date Comments 1 October 2013 Initial A. SUMMARY All chemicals located in research and instructional laboratories at George Mason University are required to be input

More information

Calibration Routine. Store in HDD. Switch "Program Control" Ref 1/ Ref 2 Manual Automatic

Calibration Routine. Store in HDD. Switch Program Control Ref 1/ Ref 2 Manual Automatic 4.2 IMPLEMENTATION LABVIEW 4.2.1 LabVIEW features LabVIEW (short for Laboratory Virtual Instrument Engineering Workbench) originally released for the Apple Macintosh in 1986. It is a highly productive

More information

CGWAS 2013: Simulating Stellar Collapse

CGWAS 2013: Simulating Stellar Collapse CGWAS 2013: Simulating Stellar Collapse C. D. Ott July 22, 2013 1 Introduction In this exercise, we will be using the code GR1D by O Connor & Ott (2010) [1] to simulate stellar collapse to a neutron star

More information

Fundamentals of Computational Science

Fundamentals of Computational Science Fundamentals of Computational Science Dr. Hyrum D. Carroll August 23, 2016 Introductions Each student: Name Undergraduate school & major Masters & major Previous research (if any) Why Computational Science

More information

MA 580; Numerical Analysis I

MA 580; Numerical Analysis I MA 580; Numerical Analysis I C. T. Kelley NC State University tim kelley@ncsu.edu Version of October 23, 2016 NCSU, Fall 2016 c C. T. Kelley, I. C. F. Ipsen, 2016 MA 580, Fall 2016 1 / 43 Contents 1 Introduction

More information

MAT 211, Spring 2015, Introduction to Linear Algebra.

MAT 211, Spring 2015, Introduction to Linear Algebra. MAT 211, Spring 2015, Introduction to Linear Algebra. Lecture 04, 53103: MWF 10-10:53 AM. Location: Library W4535 Contact: mtehrani@scgp.stonybrook.edu Final Exam: Monday 5/18/15 8:00 AM-10:45 AM The aim

More information

STUDY GUIDE Math 20. To accompany Intermediate Algebra for College Students By Robert Blitzer, Third Edition

STUDY GUIDE Math 20. To accompany Intermediate Algebra for College Students By Robert Blitzer, Third Edition STUDY GUIDE Math 0 To the students: To accompany Intermediate Algebra for College Students By Robert Blitzer, Third Edition When you study Algebra, the material is presented to you in a logical sequence.

More information

Intermediate Algebra. Gregg Waterman Oregon Institute of Technology

Intermediate Algebra. Gregg Waterman Oregon Institute of Technology Intermediate Algebra Gregg Waterman Oregon Institute of Technology c 2017 Gregg Waterman This work is licensed under the Creative Commons Attribution 4.0 International license. The essence of the license

More information

Astronomy 1010: Survey of Astronomy. University of Toledo Department of Physics and Astronomy

Astronomy 1010: Survey of Astronomy. University of Toledo Department of Physics and Astronomy Astronomy 1010: Survey of Astronomy University of Toledo Department of Physics and Astronomy Information Kathy Shan Office: MH 4008 Phone: 530 2226 Email: kathy.shan@utoledo.edu Email is the best way to

More information

Experiment 0 ~ Introduction to Statistics and Excel Tutorial. Introduction to Statistics, Error and Measurement

Experiment 0 ~ Introduction to Statistics and Excel Tutorial. Introduction to Statistics, Error and Measurement Experiment 0 ~ Introduction to Statistics and Excel Tutorial Many of you already went through the introduction to laboratory practice and excel tutorial in Physics 1011. For that reason, we aren t going

More information

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory

More information

DIFFERENTIAL EQUATIONS

DIFFERENTIAL EQUATIONS DIFFERENTIAL EQUATIONS Basic Concepts Paul Dawkins Table of Contents Preface... Basic Concepts... 1 Introduction... 1 Definitions... Direction Fields... 8 Final Thoughts...19 007 Paul Dawkins i http://tutorial.math.lamar.edu/terms.aspx

More information

Introduction to Computer Tools and Uncertainties

Introduction to Computer Tools and Uncertainties Experiment 1 Introduction to Computer Tools and Uncertainties 1.1 Objectives To become familiar with the computer programs and utilities that will be used throughout the semester. To become familiar with

More information

Chemistry 14CL. Worksheet for the Molecular Modeling Workshop. (Revised FULL Version 2012 J.W. Pang) (Modified A. A. Russell)

Chemistry 14CL. Worksheet for the Molecular Modeling Workshop. (Revised FULL Version 2012 J.W. Pang) (Modified A. A. Russell) Chemistry 14CL Worksheet for the Molecular Modeling Workshop (Revised FULL Version 2012 J.W. Pang) (Modified A. A. Russell) Structure of the Molecular Modeling Assignment The molecular modeling assignment

More information

Lab 2: Photon Counting with a Photomultiplier Tube

Lab 2: Photon Counting with a Photomultiplier Tube Lab 2: Photon Counting with a Photomultiplier Tube 1 Introduction 1.1 Goals In this lab, you will investigate properties of light using a photomultiplier tube (PMT). You will assess the quantitative measurements

More information

Big Bang, Black Holes, No Math

Big Bang, Black Holes, No Math ASTR/PHYS 109 Dr. David Toback Lecture 5 1 Prep For Today (is now due) L5 Reading: No new reading Unit 2 reading assigned at the end of class Pre-Lecture Reading Questions: Unit 1: Grades have been posted

More information

Comp 11 Lectures. Mike Shah. July 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures July 26, / 40

Comp 11 Lectures. Mike Shah. July 26, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures July 26, / 40 Comp 11 Lectures Mike Shah Tufts University July 26, 2017 Mike Shah (Tufts University) Comp 11 Lectures July 26, 2017 1 / 40 Please do not distribute or host these slides without prior permission. Mike

More information

Lab Slide Rules and Log Scales

Lab Slide Rules and Log Scales Name: Lab Slide Rules and Log Scales [EER Note: This is a much-shortened version of my lab on this topic. You won t finish, but try to do one of each type of calculation if you can. I m available to help.]

More information

Lecture 5. September 4, 2018 Math/CS 471: Introduction to Scientific Computing University of New Mexico

Lecture 5. September 4, 2018 Math/CS 471: Introduction to Scientific Computing University of New Mexico Lecture 5 September 4, 2018 Math/CS 471: Introduction to Scientific Computing University of New Mexico 1 Review: Office hours at regularly scheduled times this week Tuesday: 9:30am-11am Wed: 2:30pm-4:00pm

More information

Algebra Year 10. Language

Algebra Year 10. Language Algebra Year 10 Introduction In Algebra we do Maths with numbers, but some of those numbers are not known. They are represented with letters, and called unknowns, variables or, most formally, literals.

More information

Numerical Methods for Partial Differential Equations CAAM 452. Spring 2005

Numerical Methods for Partial Differential Equations CAAM 452. Spring 2005 Numerical Methods for Partial Differential Equations Instructor: Tim Warburton Class Location: Duncan Hall 1046 Class Time: 9:5am to 10:40am Office Hours: 10:45am to noon in DH 301 CAAM 45 Spring 005 Homeworks

More information

Lab 1: Handout GULP: an Empirical energy code

Lab 1: Handout GULP: an Empirical energy code Lab 1: Handout GULP: an Empirical energy code We will be using the GULP code as our energy code. GULP is a program for performing a variety of types of simulations on 3D periodic solids, gas phase clusters,

More information

Mathematical Methods in Engineering and Science Prof. Bhaskar Dasgupta Department of Mechanical Engineering Indian Institute of Technology, Kanpur

Mathematical Methods in Engineering and Science Prof. Bhaskar Dasgupta Department of Mechanical Engineering Indian Institute of Technology, Kanpur Mathematical Methods in Engineering and Science Prof. Bhaskar Dasgupta Department of Mechanical Engineering Indian Institute of Technology, Kanpur Module - I Solution of Linear Systems Lecture - 01 Introduction

More information

ON SITE SYSTEMS Chemical Safety Assistant

ON SITE SYSTEMS Chemical Safety Assistant ON SITE SYSTEMS Chemical Safety Assistant CS ASSISTANT WEB USERS MANUAL On Site Systems 23 N. Gore Ave. Suite 200 St. Louis, MO 63119 Phone 314-963-9934 Fax 314-963-9281 Table of Contents INTRODUCTION

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Introduction, basic but important concepts

Introduction, basic but important concepts Introduction, basic but important concepts Felix Kubler 1 1 DBF, University of Zurich and Swiss Finance Institute October 7, 2017 Felix Kubler Comp.Econ. Gerzensee, Ch1 October 7, 2017 1 / 31 Economics

More information

Lab 1: Handout GULP: an Empirical energy code

Lab 1: Handout GULP: an Empirical energy code 3.320/SMA 5.107/ Atomistic Modeling of Materials Spring 2003 1 Lab 1: Handout GULP: an Empirical energy code We will be using the GULP code as our energy code. GULP is a program for performing a variety

More information

Project 3: Molecular Orbital Calculations of Diatomic Molecules. This project is worth 30 points and is due on Wednesday, May 2, 2018.

Project 3: Molecular Orbital Calculations of Diatomic Molecules. This project is worth 30 points and is due on Wednesday, May 2, 2018. Chemistry 362 Spring 2018 Dr. Jean M. Standard April 20, 2018 Project 3: Molecular Orbital Calculations of Diatomic Molecules In this project, you will investigate the molecular orbitals and molecular

More information

Astronomy 101 Lab: Stellarium Tutorial

Astronomy 101 Lab: Stellarium Tutorial Name: Astronomy 101 Lab: Stellarium Tutorial Please install the Stellarium software on your computer using the instructions in the procedure. If you own a laptop, please bring it to class. You will submit

More information

Chemistry 883 Computational Quantum Chemistry

Chemistry 883 Computational Quantum Chemistry Chemistry 883 Computational Quantum Chemistry Instructor Contact Information Professor Benjamin G. Levine levine@chemistry.msu.edu 215 Chemistry Building 517-353-1113 Office Hours Tuesday 9:00-11:00 am

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Spatial Analysis using Vector GIS THE GOAL: PREPARATION:

Spatial Analysis using Vector GIS THE GOAL: PREPARATION: PLAN 512 GIS FOR PLANNERS Department of Urban and Environmental Planning University of Virginia Fall 2006 Prof. David L. Phillips Spatial Analysis using Vector GIS THE GOAL: This tutorial explores some

More information

Hani Mehrpouyan, California State University, Bakersfield. Signals and Systems

Hani Mehrpouyan, California State University, Bakersfield. Signals and Systems Hani Mehrpouyan, Department of Electrical and Computer Engineering, California State University, Bakersfield Lecture 1 (Intro, History and Background) April 4 th, 2013 The material in these lectures is

More information

Solar Open House Toolkit

Solar Open House Toolkit A Solar Open House is an informal meet and greet at a solar homeowner s home. It is an opportunity for homeowners who are considering going solar to see solar energy at work, ask questions about the process

More information

LINEAR ALGEBRA: M340L EE, 54300, Fall 2017

LINEAR ALGEBRA: M340L EE, 54300, Fall 2017 LINEAR ALGEBRA: M340L EE, 54300, Fall 2017 TTh 3:30 5:00pm Room: EER 1.516 Click for printable PDF Version Click for Very Basic Matlab Pre requisite M427J Instructor: John E Gilbert E mail: gilbert@math.utexas.edu

More information

SYLLABUS SEFS 540 / ESRM 490 B Optimization Techniques for Natural Resources Spring 2017

SYLLABUS SEFS 540 / ESRM 490 B Optimization Techniques for Natural Resources Spring 2017 SYLLABUS SEFS 540 / ESRM 490 B Optimization Techniques for Natural Resources Spring 2017 Lectures: Winkenwerder Hall 107, 4:50-5:50pm, MW Labs: Mary Gates Hall 030, 1:30-2:50pm, Th Course Web Site: http://faculty.washington.edu/toths/course.shtml

More information

Conservation of Momentum

Conservation of Momentum Learning Goals Conservation of Momentum After you finish this lab, you will be able to: 1. Use Logger Pro to analyze video and calculate position, velocity, and acceleration. 2. Use the equations for 2-dimensional

More information

CSCE 155N Fall Homework Assignment 2: Stress-Strain Curve. Assigned: September 11, 2012 Due: October 02, 2012

CSCE 155N Fall Homework Assignment 2: Stress-Strain Curve. Assigned: September 11, 2012 Due: October 02, 2012 CSCE 155N Fall 2012 Homework Assignment 2: Stress-Strain Curve Assigned: September 11, 2012 Due: October 02, 2012 Note: This assignment is to be completed individually - collaboration is strictly prohibited.

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course

More information

CSE 241 Class 1. Jeremy Buhler. August 24,

CSE 241 Class 1. Jeremy Buhler. August 24, CSE 41 Class 1 Jeremy Buhler August 4, 015 Before class, write URL on board: http://classes.engineering.wustl.edu/cse41/. Also: Jeremy Buhler, Office: Jolley 506, 314-935-6180 1 Welcome and Introduction

More information

Appendix 4 Weather. Weather Providers

Appendix 4 Weather. Weather Providers Appendix 4 Weather Using weather data in your automation solution can have many benefits. Without weather data, your home automation happens regardless of environmental conditions. Some things you can

More information

COMS 6100 Class Notes

COMS 6100 Class Notes COMS 6100 Class Notes Daniel Solus September 20, 2016 1 General Remarks The Lecture notes submitted by the class have been very good. Integer division seemed to be a common oversight when working the Fortran

More information

Physics 212E Spring 2004 Classical and Modern Physics. Computer Exercise #2

Physics 212E Spring 2004 Classical and Modern Physics. Computer Exercise #2 Physics 212E Spring 2004 Classical and Modern Physics Chowdary Computer Exercise #2 Launch Mathematica by clicking on the Start menu (lower left hand corner of the screen); from there go up to Science

More information

Fast Multipole Methods: Fundamentals & Applications. Ramani Duraiswami Nail A. Gumerov

Fast Multipole Methods: Fundamentals & Applications. Ramani Duraiswami Nail A. Gumerov Fast Multipole Methods: Fundamentals & Applications Ramani Duraiswami Nail A. Gumerov Week 1. Introduction. What are multipole methods and what is this course about. Problems from physics, mathematics,

More information

MITOCW watch?v=dztkqqy2jn4

MITOCW watch?v=dztkqqy2jn4 MITOCW watch?v=dztkqqy2jn4 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience CSE2330 Introduction to Computational Neuroscience Basic computational tools and concepts Tutorial 1 Duration: two weeks 1.1 About this tutorial The objective of this tutorial is to introduce you to: the

More information

Partner s Name: EXPERIMENT MOTION PLOTS & FREE FALL ACCELERATION

Partner s Name: EXPERIMENT MOTION PLOTS & FREE FALL ACCELERATION Name: Partner s Name: EXPERIMENT 500-2 MOTION PLOTS & FREE FALL ACCELERATION APPARATUS Track and cart, pole and crossbar, large ball, motion detector, LabPro interface. Software: Logger Pro 3.4 INTRODUCTION

More information

Math 5a Reading Assignments for Sections

Math 5a Reading Assignments for Sections Math 5a Reading Assignments for Sections 4.1 4.5 Due Dates for Reading Assignments Note: There will be a very short online reading quiz (WebWork) on each reading assignment due one hour before class on

More information

INTENSIVE COMPUTATION. Annalisa Massini

INTENSIVE COMPUTATION. Annalisa Massini INTENSIVE COMPUTATION Annalisa Massini 2015-2016 Course topics The course will cover topics that are in some sense related to intensive computation: Matlab (an introduction) GPU (an introduction) Sparse

More information

Gravity: How fast do objects fall? Teacher Advanced Version (Grade Level: 8 12)

Gravity: How fast do objects fall? Teacher Advanced Version (Grade Level: 8 12) Gravity: How fast do objects fall? Teacher Advanced Version (Grade Level: 8 12) *** Experiment with Audacity and Excel to be sure you know how to do what s needed for the lab*** Kinematics is the study

More information

Modern Physics notes Paul Fendley Lecture 1

Modern Physics notes Paul Fendley Lecture 1 Modern Physics notes Paul Fendley fendley@virginia.edu Lecture 1 What is Modern Physics? Topics in this Class Books Their Authors Feynman 1.1 What is Modern Physics? This class is usually called modern

More information

EOS 102: Dynamic Oceans Exercise 1: Navigating Planet Earth

EOS 102: Dynamic Oceans Exercise 1: Navigating Planet Earth EOS 102: Dynamic Oceans Exercise 1: Navigating Planet Earth YOU MUST READ THROUGH THIS CAREFULLY! This exercise is designed to familiarize yourself with Google Earth and some of its basic functions while

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

LAB PHYSICS MIDTERM EXAMINATION STUDY GUIDE

LAB PHYSICS MIDTERM EXAMINATION STUDY GUIDE Freehold Regional High School District 2011-12 LAB PHYSICS MIDTERM EXAMINATION STUDY GUIDE About the Exam The Lab Physics Midterm Examination consists of 32 multiple choice questions designed to assess

More information

Math 38: Graph Theory Spring 2004 Dartmouth College. On Writing Proofs. 1 Introduction. 2 Finding A Solution

Math 38: Graph Theory Spring 2004 Dartmouth College. On Writing Proofs. 1 Introduction. 2 Finding A Solution Math 38: Graph Theory Spring 2004 Dartmouth College 1 Introduction On Writing Proofs What constitutes a well-written proof? A simple but rather vague answer is that a well-written proof is both clear and

More information

Dear Teacher, Overview Page 1

Dear Teacher, Overview Page 1 Dear Teacher, You are about to involve your students in one of the most exciting frontiers of science the search for other worlds and life in solar systems beyond our own! Using the MicroObservatory telescopes,

More information

QUANTUM CHEMISTRY WITH GAUSSIAN : A VERY BRIEF INTRODUCTION (PART 2)

QUANTUM CHEMISTRY WITH GAUSSIAN : A VERY BRIEF INTRODUCTION (PART 2) QUANTUM CHEMISTRY WITH GAUSSIAN : A VERY BRIEF INTRODUCTION (PART 2) TARAS V. POGORELOV AND MIKE HALLOCK SCHOOL OF CHEMICAL SCIENCES, UIUC This tutorial continues introduction to Gaussian [2]. Here we

More information

9.2 Multiplication Properties of Radicals

9.2 Multiplication Properties of Radicals Section 9.2 Multiplication Properties of Radicals 885 9.2 Multiplication Properties of Radicals Recall that the equation x 2 = a, where a is a positive real number, has two solutions, as indicated in Figure

More information

Your web browser (Safari 7) is out of date. For more security, comfort and. the best experience on this site: Update your browser Ignore

Your web browser (Safari 7) is out of date. For more security, comfort and. the best experience on this site: Update your browser Ignore Your web browser (Safari 7) is out of date. For more security, comfort and Activityengage the best experience on this site: Update your browser Ignore Introduction to GIS What is a geographic information

More information

Linear Motion with Constant Acceleration

Linear Motion with Constant Acceleration Linear Motion 1 Linear Motion with Constant Acceleration Overview: First you will attempt to walk backward with a constant acceleration, monitoring your motion with the ultrasonic motion detector. Then

More information

Wednesday, 10 September 2008

Wednesday, 10 September 2008 MA211 : Calculus, Part 1 Lecture 2: Sets and Functions Dr Niall Madden (Mathematics, NUI Galway) Wednesday, 10 September 2008 MA211 Lecture 2: Sets and Functions 1/33 Outline 1 Short review of sets 2 Sets

More information

Outline. Wednesday, 10 September Schedule. Welcome to MA211. MA211 : Calculus, Part 1 Lecture 2: Sets and Functions
