Adaptive Spike-Based Solver 1.0 User Guide


Intel® Adaptive Spike-Based Solver 1.0 User Guide

[Cover figure: the spike matrix S, an identity matrix with spike blocks V_1, W_2, V_2, W_3, V_3, W_4]

Contents

1 Overview
    1.1 A Quick What, Why, and How
    1.2 A Hello World Example
    1.3 Future Developments
    1.4 User Guide Outline
2 The SPIKE Subroutine
    2.1 Setting the environment
    2.2 Autoadapt
    2.3 Data
    2.4 Disabling Spike_Adapt
    2.5 Running the spike_adapt.exe command
3 Separate calls
4 Banded Preconditioner
5 Manual Data Partition
    5.1 Dense Banded Format
    5.2 Sparse CSR Format
6 SPIKE Examples
    Example1: Automatic Partitioning
    Example2: Automatic Partitioning and Multiple RHS
    Example3: Automatic Partitioning and Multiple RHS with Separate Factorization and Solution
    Example4: Manual Partitioning
    Example5: Automatic Partitioning Using the CSR Input Format
    Example 6: Automatic Partitioning Using the CSR Input Format with a Preconditioner
    Toeplitz Matrix Example
    Sparse Banded Matrix Example
    Calling SPIKE from C Programs
7 Reference guide
    7.1 SPIKE 1.0 directory structure
    7.2 SPIKE and ScaLAPACK
    7.3 Spike_Default
    7.4 Spike
    7.5 Spike_Begin
    7.6 Spike_Preprocess
    7.7 Spike_Process
    7.8 Spike_End
    7.9 spike_param details
    7.10 matrix_data details
    7.11 info details
Bibliography
A Mathematical Description of Key Strategies
    A.1 Az = r via TU
    A.2 Az = r via FL
    A.3 Az = r via RL/RP
    A.4 Az = r via TA
    A.5 Az = r via EA
B How Spike_Adapt Works
    B.1 Why is Spike_Adapt Necessary?
    B.2 How Does Spike_Adapt Work?
    B.3 Spike_Adapt Return Codes
C MPI Compatibility Library

Chapter 1 Overview

1.1 A Quick What, Why, and How

Intel® Adaptive Spike-Based Solver (SPIKE for short) is a software package for solving large, banded linear systems on parallel computers. Solving banded linear systems is a crucial step in many high-performance computing (HPC) applications. For example, banded systems frequently arise after a general sparse matrix is reordered. In other instances, banded systems serve as effective preconditioners for general sparse systems that are solved via iterative methods.

Existing parallel software that uses direct methods for banded matrices is mostly based on LU factorizations. In contrast, SPIKE is based on a different decomposition method that increases arithmetic costs but naturally leads to lower communication overhead, which is advantageous on modern parallel architectures where arithmetic performance has outpaced memory and network performance. Thus, SPIKE offers HPC users a new and valuable tool.

The central idea behind SPIKE is a different decomposition of a matrix [10, 4, 6, 2, 5, 8, 9]. The common LU decomposition represents a matrix $A$ as a product of lower and upper triangular matrices, $A = LU$, so that solving $AX = F$ reduces to solving two triangular systems, $LG = F$ and $UX = G$. In contrast, SPIKE is based on a decomposition motivated by the important case where $A$ is a banded matrix. Figure 1.1 shows a banded matrix and its partitioning for parallel processing.

\[
A = \begin{pmatrix} A_1 & B_1 & \\ C_2 & A_2 & B_2 \\ & C_3 & A_3 \end{pmatrix}
\]

Figure 1.1: A banded matrix with a conceptual partition

The decomposition takes the form $A = DS$. Here, $D$ is the block diagonal matrix consisting of all the $A_j$ blocks (see Figure 1.1) and $S$ is $D^{-1}A$, assuming for the moment that the $A_j$ blocks are non-singular. Matrix $S$ has the structure of an identity matrix with some extra "spikes", hence the name of the package (Figure 1.2).

\[
\underbrace{\begin{pmatrix} A_1 & B_1 & \\ C_2 & A_2 & B_2 \\ & C_3 & A_3 \end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix}}_{D}
\underbrace{\begin{pmatrix} I & V_1 & \\ W_2 & I & V_2 \\ & W_3 & I \end{pmatrix}}_{S}
\]

Figure 1.2: Decomposition where $A = DS$, $S = D^{-1}A$

In practice, $D$ and $S$ may not be obtained exactly, either intentionally or due to limitations such as singularity. Instead, the numerical algorithm yields $\tilde{D}$ and $\tilde{S}$ that resemble the structures of $D$ and $S$ in Figure 1.2 and satisfy an equation of the form $A = \tilde{D}\tilde{S} + R$ for some residual $R$. Even when $R$ is non-zero, it is by design small in some sense. The basic method employed in SPIKE is as follows:

    solve (D~ S~ + R) X = F via a preconditioned iterative method (with preconditioner M = D~ S~);
        solve systems of the form M Z = Y for varying Y's;
    end

The key step of this iterative method is the solution of systems with the $\tilde{D}\tilde{S}$ matrix. Solving $AX = F$ can now be seen as involving three steps conceptually:

1. Solving the block-diagonal system $\tilde{D}G = F$. Because $\tilde{D}$ consists of decoupled systems, one for each diagonal block $\tilde{A}_i$, they can be solved in parallel without synchronization between the individual systems. A number of strategies based on the LU decomposition of each $\tilde{A}_i$ can be applied here. These include variants such as LU without pivoting, LU with pivoting, as well as a combination of LU and UL decompositions with or without pivoting.

2. Solving the system $\tilde{S}Y = G$. This system is also largely decoupled: except for a reduced system near the junctions between the identity blocks, the remaining unknowns are independent. The natural way to tackle this system is to first solve the reduced system using parallel algorithms that require interprocessor communication, and then retrieve the rest of the solution without further interprocessor communication. Here again, a number of different strategies exist for solving the reduced system.

3. Depending on how $\tilde{D}$ and $\tilde{S}$ were obtained earlier, which is related to the exact strategy used in the two previous steps, $R$ can be zero or non-zero. If $R$ is zero, the $Y$ obtained is the desired solution to $AX = F$. Otherwise, some corrections must be computed. This can be accomplished by a number of standard iterative methods such as iterative refinement, GMRES, or BiCGStab.

All in all, a large variety of strategies can be applied based on the basic decomposition $A = DS$ and the realization of the approximations $\tilde{D}$ and $\tilde{S}$; i.e., $A = \tilde{D}\tilde{S} + R$ in which $R$ is a correction, where $M = \tilde{D}\tilde{S}$ is an effective preconditioner for a variety of iterative schemes. SPIKE offers a number of choices to solve $AX = F$ based on the framework of this decomposition.

SPIKE can compute the solution of $AX = F$ by a single call where the specific strategy can be selected automatically or manually. A user can also solve a system by issuing several step-by-step calls, similar to separating the LU factorization and the forward/backward substitutions in LAPACK [1]. In this case, the user can handle more interesting situations, including the solution of different right-hand sides (RHS) at different times, $AX_i = F_i$, while amortizing the one-time computation costs related to the same matrix $A$.

To summarize, Intel® Adaptive Spike-Based Solver 1.0 aims to solve $AX = F$ in parallel where $A$ is a banded matrix. It currently supports users using MPI to express parallelism. The algorithmic framework is based on a decomposition of the form $A = \tilde{D}\tilde{S} + R$. This framework allows many different strategies that can exploit special properties of the underlying processor architectures, network properties, as well as the numerical nature of the input matrix $A$.

SPIKE 1.0 consists of two main layers: a computational layer called Spike_Core and a strategy selection layer called Spike_Adapt. Spike_Core consists of the necessary linear algebra software to support different solution strategies, whereas Spike_Adapt is an independent layer that selects an efficient strategy based on the characteristics of the input matrix $A$ and the underlying computer system. By default, Spike_Adapt automatically picks a strategy on the user's behalf. Nevertheless, expert users have the option to pick a strategy manually. A strategy is defined by algorithmic choices for each of the three steps (involving $\tilde{D}$, $\tilde{S}$, and, as needed, a non-zero $R$) outlined previously.

A user can ask for the solution to the problem $AX = F$ via a single call to SPIKE. This is covered in Chapter 2. Alternatively, this single function call can be replaced by separate calls, similar to separating the calls to triangular factorization and the subsequent triangular solves. This added complexity is especially worthwhile when solutions with different RHS for the same matrix $A$ are needed at different times, allowing the common preprocessing cost pertaining to $A$ to be amortized. Invoking SPIKE with multiple function calls is covered in Chapter 3.

Finally, concerning data distribution, the user can provide the complete matrix $A$ and the RHS in the MPI master process and rely on SPIKE to distribute the data to the remaining MPI processes. Alternatively, the user can manually distribute the data. Chapter 5 covers the data distribution options in greater detail.
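The structure of $S = D^{-1}A$ described in this section can be checked on a tiny case. The following is a minimal serial sketch in plain Python, not part of SPIKE and not how Spike_Core is implemented: it assumes an 8-by-8 tridiagonal matrix split into two 4-by-4 diagonal blocks, and the helper names `solve_dense` and `spike_matrix` are made up for this illustration. It forms $S$ block row by block row and prints it, so the identity blocks and the two spike columns become visible.

```python
def solve_dense(M, B):
    """Solve M X = B for small dense M by Gaussian elimination with partial pivoting."""
    n = len(M)
    m = len(B[0])
    # Work on an augmented copy [M | B] so the inputs stay untouched.
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for r in range(k + 1, n):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, n + m):
                aug[r][c] -= factor * aug[k][c]
    X = [[0.0] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m):
            s = aug[i][n + j] - sum(aug[i][c] * X[c][j] for c in range(i + 1, n))
            X[i][j] = s / aug[i][i]
    return X


def spike_matrix(A, sizes):
    """Return S = D^{-1} A, where D holds the diagonal blocks of A with the given sizes."""
    S = []
    start = 0
    for nj in sizes:
        rows = range(start, start + nj)
        Aj = [[A[i][j] for j in rows] for i in rows]  # diagonal block A_j
        block_row = [A[i] for i in rows]              # block row j of A
        S.extend(solve_dense(Aj, block_row))          # A_j^{-1} times the block row
        start += nj
    return S


# Assumed test matrix: tridiagonal with 6 on the diagonal and -1 next to it.
n = 8
A = [[6.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
S = spike_matrix(A, [4, 4])
for row in S:
    print(" ".join(f"{v:7.3f}" for v in row))
```

For a tridiagonal matrix each coupling block has a single non-zero entry, so each spike is one dense column next to the partition boundary; wider bands give wider spikes.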

1.2 A Hello World Example

This example solves a 32-by-32 tridiagonal Toeplitz system with 6 on the diagonal, -1 on the two off-diagonals, and the constant vector of ones as the RHS. That is, solve for $X$ where

\[
\begin{pmatrix}
6 & -1 & & & \\
-1 & 6 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 6 & -1 \\
 & & & -1 & 6
\end{pmatrix}
X =
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 1 \end{pmatrix}.
\]

A single call to the SPIKE subroutine takes care of data distribution and strategy selection. The user only needs to set a few global parameters such as the number of processors, the local MPI rank, and the structure and bandwidth of the matrix. The matrix and RHS data are stored initially on the MPI master process (i.e., process-0). The source code of hello_world.f90 is listed in Figure 1.3.

    INCLUDE 'spike.fi'
    program hello_world_code
      use spike_module
      use mpi   ! before the MPI_INIT calling sequences
      integer :: i, rank, nb_procs, code
      integer :: info
      type(spike_param) :: pspike   ! Spike parameter data structure
      type(matrix_data) :: mat      ! Spike matrix data structure
      double precision, dimension(:,:), allocatable :: f   ! rhs

      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! set up Spike parameter data structure on all processors
      pspike%nbprocs = nb_procs; pspike%rank = rank
      call SPIKE_DEFAULT(pspike)   ! default values for pspike
      pspike%autoadapt = .true.    ! autoadapt is on

      ! set up Spike matrix data parameters on all processors
      mat%format = 'D'; mat%astru = 'G'; mat%diagdo = 'Y'
      mat%n = 32; mat%kl = 1; mat%ku = 1

      ! create input matrix and rhs on Processor 0
      if (rank == 0) then
        allocate(mat%a(1:mat%kl+mat%ku+1, mat%n))
        allocate(f(1:mat%n, 1:1))
        mat%a(1,:) = -1.0d0; mat%a(2,:) = 6.0d0; mat%a(3,:) = -1.0d0
        f = 1.0d0
      end if

      ! one call to Spike for solving Ax=f
      call SPIKE(pspike, mat, f, info)

      ! solution is in f which resides in Processor 0
      if (info >= 0) then
        if (rank == 0) then
          do i = 1, mat%n
            print *, i, f(i,1)
          end do
        end if
      end if

      call MPI_FINALIZE(code)
    end program hello_world_code

Figure 1.3: A very simple example

To create the executable, compile the source program and link it with the Intel® Adaptive Spike-Based Solver 1.0 libraries, which also provide BLAS and LAPACK libraries. Assuming that SPIKE has been installed in a directory called <SPIKE directory> and the source program is called hello_world.f90:

    mpiifort hello_world.f90 -o hello_world.exe \
        -I<SPIKE directory>/include \
        -L<SPIKE directory>/lib/<arch> \
        -lspike -lspike_mpi_comm \
        -lspike_adapt -lspike_adapt_de -lspike_adapt_grid_f \
        -lmkl_solver -lmkl_lapack -lmkl -lguide -lpthread

where mpiifort is the Fortran compiler driver for the Intel MPI Library and <arch> is either 64, for IA-64 architecture, or em64t, for Intel® 64 architecture. A run of the resulting executable hello_world.exe may look like

    mpirun -np 4 hello_world.exe

and the following is the output of the run:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                   EA3
    TIME FOR PARTITIONING            ...e-02
    TIME FOR SPIKE BANDED FACT       ...e-02
    TIME FOR SPIKE BANDED SOLV       ...e-03
    TIME FOR SPIKE (FACT+SOLV)       ...e-02
    RESIDUAL                         ...e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

1.3 Future Developments

Enhancements to SPIKE 1.0 will be made in several orthogonal areas: the kinds of sparse matrices handled via added utility functions, the set of solution strategies it encompasses, and the variety of parallel environments it supports.

When $A$ is a general sparse matrix, reordering can often transform it either into a banded matrix or into a low-rank perturbation of a banded matrix. We intend to offer utilities for matrix reordering and capabilities to handle more general sparse matrices.

In addition to the current LU-based strategies for handling the diagonal blocks of the $D$ matrix, we intend to add other strategies (e.g., based on least-squares) to handle very ill-conditioned systems. Other data distribution strategies that exhibit better load-balancing properties will also be added.

MPI is the only parallel environment supported currently, but alternative parallel environments may be considered in future releases.

1.4 User Guide Outline

The remainder of this guide describes the usage of SPIKE 1.0 in greater detail.

Chapter 2 focuses on invoking SPIKE with a single function call to obtain the solution $X$ to the equation $AX = F$ where both $A$ and $F$ are stored in the MPI master process.

Chapter 3 describes how to solve $AX = F$ using multiple SPIKE functions. The motivating example is the solution of multiple RHS for $AX_k = F_k$ where the $F_k$ are available at different times. This way the step that performs setup related to $A$ can be done just once. We assume $A$ and $F_k$ are initially stored in the MPI master process.

Chapter 5 describes how the user can distribute $A$ and $F$ across multiple MPI processes. This avoids the overhead of data distribution and allows the solver to use the aggregate memory of a distributed-memory parallel computer. SPIKE supports several distribution schemes including ScaLAPACK's format. Thus, ScaLAPACK programs can be modified to use SPIKE with very little effort.

Chapter 6 presents a number of SPIKE examples illustrating its uses.

Chapter 7 provides detailed reference material on the SPIKE directory structure and each SPIKE function.
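The Hello World system of Section 1.2 is small enough to cross-check on one core without MPI. The following is a minimal sketch in plain Python, independent of the SPIKE library: it stores the matrix in a 3-by-32 band array laid out like the LAPACK banded format that the Fortran listing assigns to mat%a (superdiagonal, diagonal, subdiagonal), and solves it with the serial Thomas algorithm, which stands in here for SPIKE's parallel strategies. All function names are hypothetical helpers for this illustration.

```python
def band_from_tridiagonal(n, sub, diag, sup):
    """Build a 3-by-n band array for kl = ku = 1.

    Row 0 holds the superdiagonal, row 1 the diagonal, row 2 the subdiagonal,
    so entry A[i][j] of the full matrix sits at ab[ku + i - j][j] (0-based).
    """
    return [[sup] * n, [diag] * n, [sub] * n]


def entry(ab, i, j, ku=1):
    """Read A[i][j] from the band array (valid only inside the band)."""
    return ab[ku + i - j][j]


def thomas_solve(ab, f):
    """Serial tridiagonal solve (Thomas algorithm, no pivoting)."""
    n = len(f)
    c = [0.0] * n  # scaled superdiagonal after forward elimination
    d = [0.0] * n  # scaled right-hand side after forward elimination
    c[0] = entry(ab, 0, 1) / entry(ab, 0, 0)
    d[0] = f[0] / entry(ab, 0, 0)
    for i in range(1, n):
        denom = entry(ab, i, i) - entry(ab, i, i - 1) * c[i - 1]
        if i < n - 1:
            c[i] = entry(ab, i, i + 1) / denom
        d[i] = (f[i] - entry(ab, i, i - 1) * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def max_residual(ab, x, f):
    """Largest entry of |f - A x| for the tridiagonal matrix stored in ab."""
    n = len(f)
    worst = 0.0
    for i in range(n):
        s = entry(ab, i, i) * x[i]
        if i > 0:
            s += entry(ab, i, i - 1) * x[i - 1]
        if i < n - 1:
            s += entry(ab, i, i + 1) * x[i + 1]
        worst = max(worst, abs(f[i] - s))
    return worst


n = 32
ab = band_from_tridiagonal(n, sub=-1.0, diag=6.0, sup=-1.0)
f = [1.0] * n
x = thomas_solve(ab, f)
for i, value in enumerate(x, start=1):
    print(i, value)
print("max residual:", max_residual(ab, x, f))
```

Away from the two ends of the vector the entries settle near 1/4, since an interior row reads 6x - 2x = 1 when neighbouring unknowns are equal. This gives a quick sanity check for the values printed by the Fortran program.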

Chapter 2 The SPIKE Subroutine

SPIKE 1.0 contains two main components. Spike_Core implements the underlying numerical methods, including for example the solution of the $\tilde{S}$ system in $A = \tilde{D}\tilde{S} + R$, the factorization of the $\tilde{D}$ system, and the outer iterations that deal with a non-zero $R$. The second component, Spike_Adapt, implements a strategy selection method based on information about the underlying architecture, computer platform, and the linear system in question.

The single driver Spike integrates the functionality of these two components and makes it available to the user via a single call. In brief, this driver exercises the strategy selection mechanism and then proceeds to solve $AX = F$ for $X$, given $A$ and $F$, using the selected strategy. The user can find out which strategy was chosen by examining several parameters in the program, or by running the standalone binary executable spike_adapt.exe (at the command line) that comes with SPIKE. The user also has the option of selecting a strategy manually by setting several parameters, but this requires more detailed knowledge of how the strategies work. To this end, this chapter also gives a brief guideline on choosing strategies, but defers to the Appendix for a more mathematical description. The single driver call is

    call Spike(pspike, mat, f, info)

Related details are given in the rest of this chapter.

2.1 Setting the environment

SPIKE provides scripts to automatically initialize the user environment. They are located in the <SPIKE directory>/tools/environment directory, where <SPIKE directory> is the SPIKE main directory after installation. For example, it could be /opt/intel/spike/1.0.

These scripts set environment variables that are needed to build and run SpikePACK applications. Select the appropriate script for the Linux shell and architecture. For example, to initialize SPIKE for the BASH shell on an Intel® EM64T system, execute the following command:

    > source spikevarsem64t.sh

To initialize SPIKE for CSH on an Itanium® processor system, use the following command:

    > source spikevars64.csh

It is recommended that the initialization command be placed in the appropriate shell startup file in $HOME: .cshrc or .bashrc for the CSH and BASH shells, respectively.

2.2 Autoadapt

As illustrated in the hello world program in Figure 1.3, the derived type spike_param variable pspike has many components, but only two need to be set manually by the user; the rest can be assigned default values by calling the routine Spike_Default. The two components that need to be set are:

    Component   Type      Description
    nbprocs     integer   number of processors - MPI related
    rank        integer   rank of the local processor - MPI related

For example,

    call Spike_Default(pspike)

sets the remaining components of pspike to their default values. These default values are given in Table 2.1. Note that some of these components are inout in nature, which means SPIKE may overwrite the input values as a result of executing the software. The spike_param derived type also contains a host of output components. Refer to Section 7.9 for comprehensive information.

    Component    Type     Default   Description
    autoadapt    logical  .true.    strategy automatically picked if .true.
    RSS          char     'R'       Reduced System Strategy: 'R', 'T', or 'F'
    DFS          char     'P'       Diagonal Factorization Strategy: 'P', 'L', 'U', or 'A'
    OIS          integer  3         Outer Iteration Strategy: 3 (more options in the future)

        The three components above together specify a strategy for solving a
        banded system using the Spike framework. When autoadapt is set to
        .true. (which is the default value), the input values of these three
        components are ignored and overwritten to record the automatically
        chosen strategy. Section 2.4 has more details on manual strategy
        selection.

    BPS          integer  0         Banded Preconditioner Strategy:
                                       0  User does not specify a banded preconditioner
                                      -1  A banded preconditioner is specified by user
    threads      integer  1         value of the OpenMP environment variable
                                    OMP_NUM_THREADS if mat%format='S' (i.e. # of
                                    threads for the PARDISO solver on each partition)
    nbit_out     integer  50        max # of outer iterations
    eps_out      double   10^-7     accuracy of the residual, outer iteration
    nbit_in      integer  100       max # of inner iterations
    eps_in       double   10^-7     accuracy of the residual, inner iteration
    nzero        double   10^-9     new zero value for diagonal boosting: a pivot
                                    smaller in magnitude than the threshold derived
                                    from this value is shifted by plus or minus
                                    that threshold
    tp           integer  0         data distribution:
                                       0  data in Proc 0
                                       1  data on each Proc (cf. Chapter 5)
    memfree      logical  .false.   deallocate memory for matrix (the case when tp=0)
    residual     logical  .true.    compute the relative residual norm
    timing       logical  .false.   provide timing information
    comd         logical  .false.   provide detailed running information
    file_output  integer  6         print information to screen if 6, file ID for
                                    spikeoutput otherwise

Table 2.1: List of input components for the derived type spike_param. Note: RSS, DFS, OIS are inout whereas the rest are input only.

2.3 Data

This section explains how to set up the parameters within the type matrix_data variable mat used in the calling sequence example. The main purpose of the type matrix_data is to hold the matrix in one of a number of popular representations. In SPIKE 1.0, both the LAPACK banded-type storage format (without additional storage for pivoting) and the CSR (Compressed Sparse Row) format are supported.

Depending on the value of pspike%tp in the variable pspike passed to the routine, either the mat variable on Processor 0 holds the full original matrix, or the mat variable on each processor holds part of the original matrix. In the former case, Spike_Core partitions the data held on Processor 0 and distributes them to the other processors under the hood. In the latter case, the user needs to manually put the appropriate part of the matrix on each of the processors. Chapter 5 gives the necessary details for this task. For now, Table 2.2 gives details of the matrix data structure relevant for pspike%tp=0, that is, when the user puts the complete matrix into the mat variable on Processor 0.

    mat%      Type                Distribution     Description
    format    char                (in) global      matrix format: 'D': Dense; 'S': Sparse CSR
    astru     char                (in) global      matrix structure: 'G': General non-symmetric
    diagdo    char                (inout) global   diagonal dominance: 'Y': Yes; 'N': No; 'I': Investigate
    vdiagdo   double              (out) global     SPIKE computed diagonal dominance value
                                                   if mat%diagdo='I' or pspike%autoadapt=.true.
    n         integer             (in) global      matrix dimension

    The input fields below are for the case mat%format='D':

    kl        integer             (inout) global   # of subdiagonals in matrix
    ku        integer             (inout) global   # of superdiagonals in matrix
    A         double(bwd,mat%n)   rank 0           LAPACK banded matrix format, no extra
                                                   pivoting space; bwd = mat%kl+mat%ku+1

    The input fields below are associated with the sparse CSR format (mat%format='S'):

    nbsa      integer             rank 0           # of non-zero matrix elements
    sa        double(mat%nbsa)    rank 0           CSR format, matrix elements
    jsa       integer(mat%nbsa)   rank 0           CSR format, column indices
    isa       integer(mat%n+1)    rank 0           CSR format, start-of-row indices

Table 2.2: List of parameter fields of the type matrix_data variable mat. Here all the matrix data are stored in Processor 0.

If space for mat%a in Processor 0 is allocated dynamically, the user may want to have it deallocated automatically by setting pspike%memfree = .true. All the other parameter fields must be declared as global (i.e., common to all processors). Finally, if the matrix data have been defined in rank 0, the rhs parameter should also be defined in rank 0 using:

    Parameter   Type                 Distribution     Description
    f           double(mat%n,nrhs)   (inout) rank 0   Right-hand side f (in)
                                                      Solution x of Ax=f (out)
                                                      nrhs stands for # of RHS

Table 2.3: Definition of the RHS (in) and solution (out) stored in rank 0

2.4 Disabling Spike_Adapt

While we recommend that the user set the autoadapt component to .true., it is possible to disable automatic strategy selection by setting the autoadapt component to .false. In this case, the strategy is defined by the values set in the three components (RSS, DFS, OIS), which are set to ('R', 'P', 3) by Spike_Default. This section explains what these parameters mean and offers a general guideline on how to set a strategy manually.

Recall that the computational framework of SPIKE is based on the decomposition $A = \tilde{D}\tilde{S} + R$, where the structures of $D$ and $S$ are depicted in Figure 1.2. The generic way used to solve the system $AX = F$ can be described as:

    Solve AX = F by a preconditioned iterative method
        Use M as preconditioner where M = D~ S~
        The preconditioning step: solving MZ = Y

The three components of a strategy are:

Reduced System Strategy RSS: The crux of a preconditioned iterative scheme is the solution involving the preconditioner $M$. The key parallel algorithm is the handling of the $\tilde{S}$ matrix. The portion $\tilde{S}_{red}$ of $\tilde{S}$ near the partition boundaries constitutes a reduced system, and the key to solving systems with $\tilde{S}$ lies in the solution of this reduced system $\tilde{S}_{red}$. There are several strategies for solving the reduced system:

    R: stands for recursive. A recursive algorithm can be applied to the reduced system.

    F: stands for on the fly. The reduced system can be solved using an iterative method. In this situation, there is no need to compute the $\tilde{S}_{red}$ matrix explicitly, as one only needs to compute the action of $\tilde{S}_{red}$ on vectors. These are computed on the fly based mostly on the $A$ matrix itself.

    E: stands for explicit. Here the $V_j$ and $W_j$ blocks of the $S$ matrix are explicitly computed. The reduced system is solved in an iterative manner.

    T: stands for truncated. This is based on an exploitation of the special structure of $S$. Should the top and bottom portions of suitable sizes of the $V_j$ and $W_j$ blocks be zero, solution of the reduced system $\tilde{S}_{red}$ becomes extremely easy. This strategy sets those blocks to zero deliberately (hence truncating the $V_j$ and $W_j$ submatrices) and trades the ease of solving this slightly inaccurate $\tilde{S}_{red}$ system for corrective effort elsewhere.

Diagonal Factorization Strategy DFS: Solving $\tilde{D}\tilde{S}Z = Y$ naturally involves, in one form or another, solutions of systems with the $\tilde{D}$ matrix, which is block diagonal in structure. For SPIKE 1.0, we rely on various direct factorization algorithms to tackle this problem. The strategies here correspond to factorizations of the diagonal block matrices. Note however that while these strategies normally correspond to familiar methods designed for dense matrices, they can be overloaded to represent direct sparse matrix factorizations motivated by the corresponding dense versions. For example, in the case of sparse bands, L refers to the factorization provided by the popular package PARDISO [11].

    P: stands for pivoting. This is LU factorization with partial pivoting.

    L: stands for LU. This is the LU factorization without pivoting.

    U: stands for UL. This obtains both the LU and UL factorizations, neither with pivoting.

    A: stands for alternate. This alternates from block to block between LU and UL factorizations, without pivoting.

Outer Iteration Strategy OIS: represents the iterative method used in the outermost layer. An integer value selects a specific choice. The current release, SPIKE 1.0, supports only the BiCGStab iterative scheme, which corresponds to the value 3.

While RSS and DFS are mostly orthogonal, they are not completely so. Some factorization strategies are motivated by, and consequently applicable only to, particular reduced system strategies. Therefore, not all combinations of choices in RSS with DFS are supported or in fact meaningful. In the current release, the following six combinations of (RSS,DFS) are supported: (T,U), (F,L), (R,L), (R,P), (T,A), (E,A). Moreover, if mat%format='D', the setting of the tp component of the spike_param variable as well as the number of processors also affect the applicability of these six choices. In this case, Table 2.4 tabulates the applicable strategies under different tp and nbprocs settings. In the case where mat%format='S', only the combination (F,L) is allowed while Spike_Adapt is turned off.
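The combination rules stated in this section can be captured in a small pre-flight check before a manual strategy is passed to the solver. The sketch below is plain Python and purely illustrative: `check_manual_strategy` is a hypothetical helper, not a SPIKE routine, and it encodes only the rules spelled out in the text above (the six supported (RSS,DFS) pairs, OIS = 3, and the (F,L)-only rule for the sparse format). It deliberately does not encode the tp and nbprocs restrictions of Table 2.4.

```python
# Supported (RSS, DFS) pairs, as listed for SPIKE 1.0 in this section.
SUPPORTED_PAIRS = {
    ("T", "U"), ("F", "L"), ("R", "L"),
    ("R", "P"), ("T", "A"), ("E", "A"),
}
SUPPORTED_OIS = {3}  # 3 selects BiCGStab, the only outer scheme in this release


def check_manual_strategy(rss, dfs, ois, matrix_format="D"):
    """Return a list of problems; an empty list means the documented checks pass.

    Hypothetical helper for illustration only. The tp/nbprocs restrictions
    of Table 2.4 are not checked here.
    """
    problems = []
    if (rss, dfs) not in SUPPORTED_PAIRS:
        problems.append(f"(RSS,DFS)=({rss},{dfs}) is not a supported combination")
    if ois not in SUPPORTED_OIS:
        problems.append(f"OIS={ois} is not supported; use 3")
    if matrix_format == "S" and (rss, dfs) != ("F", "L"):
        problems.append("sparse format allows only (F,L) when autoadapt is off")
    return problems


# The defaults set by Spike_Default pass for the dense banded format.
print(check_manual_strategy("R", "P", 3))
# Pivoting combined with the truncated reduced system is not in the list.
print(check_manual_strategy("T", "P", 3))
# The defaults are rejected for a sparse CSR matrix.
print(check_manual_strategy("R", "P", 3, matrix_format="S"))
```

Returning a list of messages rather than raising on the first failure lets a caller report every violated rule at once.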

    pspike%tp pspike%nbprocs n (n > 1) Even ( 2 n ) Odd TU FL 0 RL RP All All TU FL EA TA TU FL 1 None All TU FL TU FL RL RP

Table 2.4: This table illustrates how the type of matrix partitioning and the number of MPI processes affect the choice of (RSS,DFS) for the Spike_Core strategy. In future developments of SPIKE 1.0, the choice of (RSS,DFS) will be independent of the setting of the tp component.

2.5 Running the spike_adapt.exe command

User applications do not call Spike_Adapt directly. Rather, Spike_Core calls Spike_Adapt if the autoadapt component of the spike_param structure is set to .true. Note that in this case the user-specified (RSS,DFS,OIS) values are ignored and will in fact be overwritten. In addition, SPIKE 1.0 provides a standalone executable, spike_adapt.exe, in the location <SPIKE directory>/bin/<arch>, where <arch> is either 64, for IA-64 architecture, or em64t, for Intel® 64 architecture. Given a set of input characteristics (matrix size, bandwidth, number of MPI processes, sparsity, diagonal dominance, number of right-hand sides, type of matrix partitioning), this executable suggests a Spike_Core strategy.

Edit the Fortran NAMELIST file, ivars.nml, to specify the matrix parameters, e.g.:

    &IVAR
    matrix_size =
    bandwidth = 161
    n_proc = 4
    sparsity = 0.9d0
    diagonal_dominance = 1.2d0
    n_rhs = 1
    tp = 0
    /

Simply run spike_adapt.exe in the same directory as ivars.nml to get a recommended Spike_Core strategy, e.g.:

    [cluster0]$ spike_adapt.exe ./spike_adapt.exe
    Bandwidth = 161
    Diagonal dominance =
    Matrix size =
    Sparsity =
    # RHS = 1
    # Procs = 4
    Type of partition: 0
    The Spike_Adapt performance models selected fl3

Chapter 3 Separate calls

A single call to Spike

    CALL Spike(pspike,mat,f,info)

can be split into a calling sequence of four separate operations:

    CALL Spike_Begin(pspike,mat,pre,info)
    CALL Spike_Preprocess(pspike,pre,info)
    CALL Spike_Process(pspike,mat,pre,f,info)
    CALL Spike_End(pspike,mat,pre,info)

where

    Spike_Begin: beginning of the calling sequence;
    Spike_Preprocess: preprocessing of the preconditioner data structure;
    Spike_Process: processing of the matrix, preconditioner and the right-hand side;
    Spike_End: ending of the calling sequence.

In addition to pspike, mat, f and info, the split calls need a new parameter, pre. This parameter is of type matrix_data and pertains to a preconditioner. However, the user need not set any of its component values; consider it a work array that the software uses internally.

Splitting a single call to SPIKE is useful for applications that iterate over changing right-hand sides while using the same original matrix. Such a program invokes Spike_Process multiple times rather than invoking Spike multiple times. Figure 3.1 presents a program solving two different right-hand sides: $(1, 0, 0, 0, 0, 0, 0, 0)^T$ and then $(0, 1, 0, 0, 0, 0, 0, 0)^T$. Note that the program uses the global partitioning scheme, so the right-hand sides are set up in node 0. In the program, Spike_Begin, Spike_Preprocess and Spike_End are called once while Spike_Process is called twice (once for each right-hand side).

    ! Declare variables used by SpikePACK
    integer :: info
    type(spike_param) :: pspike
    type(matrix_data) :: mat, pre
    double precision, dimension(8,1) :: f

    ! Set up pspike and mat as usual

    ! The following two calls are called once
    call Spike_Begin(pspike, mat, pre, info)
    call Spike_Preprocess(pspike, pre, info)

    ! Solve for the first right hand side
    if (rank == 0) then
      f = 0.0d0
      f(1,1) = 1.0d0
    end if

    ! Spike_Process() is invoked for the first right hand side
    call Spike_Process(pspike, mat, pre, f, info)
    ! The solution of the first RHS is stored in f after Spike_Process()

    ! Solve for the second right hand side
    if (rank == 0) then
      f = 0.0d0
      f(2,1) = 1.0d0
    end if

    ! Spike_Process() is invoked for the second right hand side
    call Spike_Process(pspike, mat, pre, f, info)
    ! The solution of the second RHS is stored in f after Spike_Process()

    ! The following call is called once
    call Spike_End(pspike, mat, pre, info)

Figure 3.1: A program solving two right-hand sides using separate Spike calls

This program is expected to run faster than an equivalent one with two Spike calls, because it initializes and frees the Spike data structures only once, whereas a program calling Spike twice would duplicate this work.
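The benefit of the split calling sequence is the usual factor-once, solve-many pattern. The following serial sketch in plain Python shows that pattern outside of SPIKE: a tridiagonal matrix is factored a single time (LU without pivoting) and the factors are reused for the two right-hand sides of Figure 3.1. The 8-by-8 matrix with 6 on the diagonal and -1 on the off-diagonals is an assumption borrowed from the Hello World example, and `factor_tridiagonal` and `solve_factored` are hypothetical helpers, not SPIKE routines.

```python
def factor_tridiagonal(sub, diag, sup):
    """LU factorization without pivoting of a tridiagonal matrix.

    sub[i] = A[i+1][i], diag[i] = A[i][i], sup[i] = A[i][i+1].
    Returns (l, u): the multipliers of L and the pivots (diagonal of U).
    """
    n = len(diag)
    l = [0.0] * (n - 1)
    u = [0.0] * n
    u[0] = diag[0]
    for i in range(1, n):
        l[i - 1] = sub[i - 1] / u[i - 1]
        u[i] = diag[i] - l[i - 1] * sup[i - 1]
    return l, u


def solve_factored(l, u, sup, f):
    """Forward and backward substitution that reuses an existing factorization."""
    n = len(u)
    y = list(f)
    for i in range(1, n):
        y[i] -= l[i - 1] * y[i - 1]
    x = [0.0] * n
    x[n - 1] = y[n - 1] / u[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - sup[i] * x[i + 1]) / u[i]
    return x


n = 8
sub = [-1.0] * (n - 1)
diag = [6.0] * n
sup = [-1.0] * (n - 1)

# One-time work, analogous in spirit to the setup and preprocessing calls.
l, u = factor_tridiagonal(sub, diag, sup)

# Repeated work, analogous in spirit to calling the processing step per RHS.
solutions = []
for k in (0, 1):
    f = [0.0] * n
    f[k] = 1.0
    solutions.append(solve_factored(l, u, sup, f))

for k, x in enumerate(solutions, start=1):
    print("RHS", k, x)
```

Each additional right-hand side costs only the two substitution sweeps, which is why the saving grows with the number of right-hand sides.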

Chapter 4 Banded Preconditioner

SPIKE can be used as a framework for solving banded systems that serve as effective preconditioners for general sparse systems, which are solved via iterative methods. In future releases, SPIKE will offer different options for automatically deriving a robust banded preconditioner from an arbitrary general sparse system. The component %BPS of the derived type spike_param in Table 2.1 has been introduced for this purpose. In the current SPIKE version 1.0, the component %BPS can take only two values: 0 (no preconditioner, the default value) or -1, where the banded preconditioner has to be set by the user. Some users may take advantage of this option when banded preconditioners can be constructed directly from the application at hand, such as in nanoelectronics nanowire simulations [7].

Using the separate SPIKE calling sequence presented in Chapter 3, one can supply a preconditioner pre that is factorized by the preprocessing step, while the processing step uses the resulting factorization of the preconditioner to accelerate the outer iterative scheme. Therefore, with the option %BPS=-1 the user can define their own banded preconditioner (either dense or sparse within the band) for iteratively solving an original system whose matrix can be general sparse. Depending on the data distribution format (component %tp), the user must define the preconditioner pre using the derived type matrix_data, in the same way as the original matrix mat, using either Table 2.2 (%tp=0) or Table 5.1 (%tp=1). In Chapter 6, Example 6 illustrates the use of the option %BPS=-1.
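The idea of using a band of the matrix as a preconditioner can be tried out on a small dense example. The sketch below is plain Python and does not use SPIKE: it assumes a 12-by-12 diagonally dominant matrix with a few small entries outside the tridiagonal band, takes the tridiagonal part as the preconditioner M, and runs simple preconditioned iterative refinement (x becomes x + M^{-1}(f - Ax)). That outer scheme is a deliberate simplification; SPIKE 1.0 uses BiCGStab. The matrix, the function names, and the tolerance are all assumptions made for this illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def tridiagonal_part(A):
    """Extract the sub-, main and superdiagonal of A as the banded preconditioner."""
    n = len(A)
    sub = [A[i + 1][i] for i in range(n - 1)]
    diag = [A[i][i] for i in range(n)]
    sup = [A[i][i + 1] for i in range(n - 1)]
    return sub, diag, sup


def solve_tridiagonal(sub, diag, sup, r):
    """Thomas algorithm: solve M z = r for tridiagonal M (no pivoting)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = sup[0] / diag[0]
    d[0] = r[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - sub[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = sup[i] / denom
        d[i] = (r[i] - sub[i - 1] * d[i - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def preconditioned_refinement(A, f, max_iter=50, tol=1e-12):
    """Iterate x <- x + M^{-1} (f - A x) with M the tridiagonal part of A."""
    band = tridiagonal_part(A)
    x = [0.0] * len(f)
    for it in range(max_iter):
        r = [fi - axi for fi, axi in zip(f, matvec(A, x))]
        if max(abs(v) for v in r) <= tol:
            return x, it
        z = solve_tridiagonal(*band, r)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter


# Assumed test matrix: tridiagonal (6, -1) plus weak couplings four columns away.
n = 12
A = [[6.0 if i == j else -1.0 if abs(i - j) == 1 else 0.1 if abs(i - j) == 4 else 0.0
      for j in range(n)] for i in range(n)]
f = [1.0] * n
x, iterations = preconditioned_refinement(A, f)
print("iterations:", iterations)
print("solution:", x)
```

The iteration converges quickly here because the entries left out of the band are small relative to the diagonally dominant band; a weaker band preconditioner would need a more robust outer scheme.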

Chapter 5

Manual Data Partition

Until now it has been assumed that all of the matrix and RHS data reside in the MPI master process (i.e., process 0). This is specified by setting the tp component of spike_param to zero. When the matrix and RHS data are entirely in process 0, SPIKE automatically distributes a portion of the data to each MPI process before invoking the solver. The price paid for this convenience is the overhead of the data distribution and a limit on the overall problem size: the problem must fit in the memory available to process 0.

Alternatively, SPIKE allows the user to partition the matrix and the RHSs among the MPI processes before calling Spike_Core. This chapter describes the local partitioning schemes supported by SPIKE 1.0. Let pspike and mat be the variables of type spike_param and matrix_data, respectively, used in calls to Spike_Default, Spike, Spike_Begin, Spike_Process, etc. The dense banded format is specified by mat%format='D', while the sparse CSR format is specified by mat%format='S'. In the following, pspike%tp is set to 1 to distribute the matrix and RHS manually to the MPI processes.

5.1 Dense Banded Format

Consider a (complete) matrix of dimension n and bandwidth bwd, where bwd = mat%kl + mat%ku + 1. If SPIKE were to distribute the data automatically (i.e., tp=0), one would allocate a bwd-by-n array for mat%a. Table 5.1 gives details of the matrix data structure relevant for pspike%tp=1, that is, when the user manually distributes the complete matrix into the local mat variable on each processor. Figure 5.1 illustrates this partitioning scheme. The user must distribute the bwd-by-n array into pspike%nbprocs arrays of dimension bwd-by-n_j, where the values of n_j, which satisfy

$n = \sum_{j=1}^{nbprocs} n_j$,

are set by the user. The values of n_j are stored globally (i.e., common to all processors) in the integer array mat%sizea of dimension nbprocs,

such that mat%sizea = (n_1, n_2, ..., n_nbprocs). The matrix elements are stored locally on each processor in mat%a. The RHSs are distributed by rows in a natural way: each MPI process j-1 holds an array of dimension n_j-by-nrhs, for j = 1, 2, ..., nbprocs.

Figure 5.1: Illustration of a matrix in LAPACK banded storage format distributed to four MPI processes

5.2 Sparse CSR Format

Consider a (complete) sparse matrix of dimension n. If SPIKE were to distribute the data automatically (i.e., tp=0), one would use the CSR format and allocate on processor 0 the set of arrays mat%sa, mat%jsa, mat%isa. With tp=1, however, the user must distribute the complete sparse matrix by blocks of rows into %nbprocs sets of arrays in CSR format, where the number of non-zero elements nnz_j of each submatrix and the numbers of rows n_j, which satisfy

$n = \sum_{j=1}^{nbprocs} n_j$,

are set by the user. Figure 5.2 illustrates this partitioning scheme and Table 5.1 gives details of the matrix data structure relevant for pspike%tp=1.

Figure 5.2: Illustration of a matrix in CSR sparse storage format distributed to four MPI processes

The values of nnz_j are stored locally (i.e., on each processor) in the integer mat%nbsa. The matrix elements are also stored locally on each processor in the arrays mat%sa, mat%jsa, and mat%isa, of dimension mat%nbsa, mat%nbsa, and n_j + 1, respectively. The RHSs are distributed by rows in a natural way: each MPI process j-1 holds an array of dimension n_j-by-nrhs, for j = 1, 2, ..., nbprocs.

mat%      Type                                Distribution  Description
format    char (in)                           global        matrix format: 'D': Dense; 'S': Sparse CSR
astru     char (in)                           global        matrix structure: 'G': General non-symmetric
diagdo    char (inout)                        global        diagonal dominance: 'Y': Yes; 'N': No; 'I': Investigate
vdiagdo   double (out)                        global        SPIKE-computed diagonal dominance value if mat%diagdo='I' or pspike%autoadapt=.true.
n         integer (in)                        global        matrix dimension
sizea     integer(pspike%nbprocs) (in)        global        set of partition dimensions, with mat%sizea=(n_1, n_2, ..., n_nbprocs)

The input fields below are for the case mat%format='D'
kl        integer (inout)                     global        # of subdiagonals in matrix
ku        integer (inout)                     global        # of superdiagonals in matrix
A         double(bwd,mat%sizea(i+1))          rank i        LAPACK banded matrix format, no extra pivoting space; bwd = mat%kl+mat%ku+1

The input fields below are associated with the sparse CSR format (mat%format='S')
nbsa      integer                             rank i        # of non-zero matrix elements nnz_j for partition j=i+1
sa        double(mat%nbsa)                    rank i        CSR format, matrix elements
jsa       integer(mat%nbsa)                   rank i        CSR format, column indices
isa       integer(mat%sizea(i+1)+1)           rank i        CSR format, start-of-row indices

Table 5.1: List of parameter fields of the type matrix_data variable mat. Here all the matrix data are distributed across the processors with pspike%tp=1.
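The two partitioning schemes of this chapter can be sketched in C as follows. This is an illustration under stated assumptions, not SPIKE code: the function names are hypothetical, partitions are numbered from 0 like MPI ranks, the band array is column-major, and the CSR arrays use 1-based indices. The sketch keeps global column indices in the local jsa array; the text above does not state whether SPIKE expects global or local column indices, so this point in particular should be checked against the SPIKE reference before use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of rows (or band columns) owned by partitions 0..part-1. */
static int partition_offset(const int *sizea, int part)
{
    int first = 0;
    for (int p = 0; p < part; ++p)
        first += sizea[p];
    return first;
}

/* Dense banded format: copy the bwd-by-n_j column block of partition
   `part` out of the global bwd-by-n band array (column-major). */
void band_local_block(int bwd, const int *sizea, int part,
                      const double *a_global, double *a_local)
{
    int first = partition_offset(sizea, part);
    memcpy(a_local, a_global + (size_t)first * (size_t)bwd,
           sizeof(double) * (size_t)bwd * (size_t)sizea[part]);
}

/* Sparse CSR format: extract the block of rows of partition `part`.
   isa_loc has n_j + 1 entries and is rebased so that it starts at 1.
   Column indices are copied unchanged (assumption, see the lead-in).
   Returns the local number of non-zero elements (nnz_j). */
int csr_local_block(const int *sizea, int part,
                    const int *isa, const int *jsa, const double *sa,
                    int *isa_loc, int *jsa_loc, double *sa_loc)
{
    int first = partition_offset(sizea, part);  /* 0-based first global row */
    int nloc = sizea[part];
    int base = isa[first];                      /* 1-based first local entry */
    int nnz = isa[first + nloc] - base;

    assert(nnz >= 0);
    for (int r = 0; r <= nloc; ++r)
        isa_loc[r] = isa[first + r] - base + 1;
    for (int k = 0; k < nnz; ++k) {
        jsa_loc[k] = jsa[base - 1 + k];
        sa_loc[k] = sa[base - 1 + k];
    }
    return nnz;
}
```

In both cases the partition sizes play the role of mat%sizea, and the value returned by `csr_local_block` is what would be stored in mat%nbsa on the corresponding process.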

Chapter 6

SPIKE Examples

This section shows sample programs illustrating the SPIKE calling sequences. In examples 1, 2, 3, and 4, SPIKE solves the following linear system of size n = 8:

$$
\begin{pmatrix}
 6 & -1 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 & -1 &  0 &  0 &  0 &  0 \\
-1 & -1 &  6 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 & -1 &  6 & -1 & -1 &  0 &  0 \\
 0 &  0 & -1 & -1 &  6 & -1 & -1 &  0 \\
 0 &  0 &  0 & -1 & -1 &  6 & -1 & -1 \\
 0 &  0 &  0 &  0 & -1 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 & -1 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \\ f_7 \\ f_8 \end{pmatrix}
$$

Note that examples 1, 2, 3, and 5 can use 1, 2, or 4 MPI processes. Example 4 is designed for exactly 2 MPI processes.

6.1 Example1: Automatic Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done by SPIKE. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)^T. This example calls the SPIKE subroutine.

    INCLUDE 'spike.fi'
    program example1
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs

      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt=.false.
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!
      !! All processors
      mat%format='D'
      mat%astru='G'
      mat%diagdo='Y'
      mat%n=8
      mat%kl=2
      mat%ku=2
      !! Global matrix is defined only on processor 0
      if (rank==0) then
        !! only on processor 0 (global matrix)
        allocate(mat%a(1:mat%kl+mat%ku+1,mat%n))
        mat%a(mat%ku+1,:)=6.0d0
        mat%a(mat%ku-1,:)=-1.0d0
        mat%a(mat%ku,:)=-1.0d0
        mat%a(mat%ku+2,:)=-1.0d0
        mat%a(mat%ku+3,:)=-1.0d0
        !! RHS
        allocate(f(1:mat%n,1:1))
        f=1.0d0
      end if
      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE(pspike,mat,f,info)
      if (info>=0) then
        !!!!!! Global Solution
        if (rank==0) then
          print *,'Global solution'
          do i=1,mat%n
            print *,i,f(i,1)
          end do
        end if
      end if
      call MPI_FINALIZE(code)
    end program example1

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RP3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-04
    TIME FOR SPIKE BANDED SOLV   [...]e-04
    TIME FOR SPIKE (FACT+SOLV)   [...]e-03
    RESIDUAL                     [...]e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution
    [...]

6.2 Example2: Automatic Partitioning and Multiple RHS

In this example, two systems with the same coefficient matrix are solved. The RHSs are (1, 0, 0, 0, 0, 0, 0, 0)^T and (0, 1, 0, 0, 0, 0, 0, 0)^T. This example calls

the SPIKE subroutine.

    INCLUDE 'spike.fi'
    program example2
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs
      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt=.false.
      pspike%dfs='L'
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!
      mat%format='D'
      mat%astru='G'
      mat%diagdo='Y'
      mat%n=8
      mat%kl=2
      mat%ku=2
      if (rank==0) then
        allocate(mat%a(1:mat%kl+mat%ku+1,mat%n))
        mat%a(mat%kl+1,:)=6.0d0
        mat%a(mat%kl-1,:)=-1.0d0
        mat%a(mat%kl,:)=-1.0d0
        mat%a(mat%kl+2,:)=-1.0d0
        mat%a(mat%kl+3,:)=-1.0d0
        !! RHS
        allocate(f(1:mat%n,1:2))
        f=0.0d0
        f(1,1)=1.0d0
        f(2,2)=1.0d0
      end if
      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE(pspike,mat,f,info)
      if (info>=0) then
        !!!!!! Global Solution
        if (rank==0) then
          print *,'Global solution'
          do i=1,mat%n
            print *,i,f(i,1),f(i,2)
          end do
        end if
        !!!!!!!!!!
      end if
      call MPI_FINALIZE(code)
    end program example2

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RL3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-04

    TIME FOR SPIKE BANDED SOLV   [...]e-04
    TIME FOR SPIKE (FACT+SOLV)   [...]e-03
    RESIDUAL                     [...]e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution
    [...]

6.3 Example3: Automatic Partitioning and Multiple RHS with Separate Factorization and Solution

In this example we again use two RHSs, but this time the SPIKE calling sequence is separated into factorization and solution: the factorization is done once, followed by one solve for each of the RHSs (1, 0, 0, 0, 0, 0, 0, 0)^T and (0, 1, 0, 0, 0, 0, 0, 0)^T.

    INCLUDE 'spike.fi'
    program example3
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat, pre
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs
      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt=.false.
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!
      mat%format='D'  ! D for Dense Banded, S for Sparse banded, G for General Sparse
      mat%astru='G'   !!! General structure (non symmetric)
      mat%diagdo='Y'
      mat%n=8
      mat%kl=2
      mat%ku=2
      if (rank==0) then
        allocate(mat%a(1:mat%kl+mat%ku+1,mat%n))
        mat%a(mat%ku+1,:)=6.0d0
        mat%a(mat%ku-1,:)=-1.0d0
        mat%a(mat%ku,:)=-1.0d0
        mat%a(mat%ku+2,:)=-1.0d0
        mat%a(mat%ku+3,:)=-1.0d0
        !! RHS
        allocate(f(1:mat%n,1:1))
      end if

      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE_BEGIN(pspike,mat,pre,info)
      if ((rank==0).and.(info<0)) then
        print *,'1 Spike INFO exit/Error Code:',info,pspike%error_code
      end if
      call SPIKE_PREPROCESS(pspike,pre,info)
      if ((rank==0).and.(info<0)) then
        print *,'2 Spike INFO exit/Error Code:',info,pspike%error_code
      end if
      if (rank==0) then
        f=0.0d0
        f(1,1)=1.0d0
      end if
      call SPIKE_PROCESS(pspike,mat,pre,f,info)
      if ((rank==0).and.(info<0)) then
        print *,'3 Spike INFO exit/Error Code:',info,pspike%error_code
      end if
      if (info>=0) then
        !!!!!! Global Solution 1
        if (rank==0) then
          print *,'Global solution 1'
          do i=1,mat%n
            print *,i,f(i,1)
          end do
        end if
        !!!!!!!!!!
      end if
      if (rank==0) then
        f=0.0d0
        f(2,1)=1.0d0
      end if
      call SPIKE_PROCESS(pspike,mat,pre,f,info)
      if ((rank==0).and.(info<0)) then
        print *,'4 Spike INFO exit/Error Code:',info,pspike%error_code
      end if
      if (info>=0) then
        !!!!!! Global Solution 2
        if (rank==0) then
          print *,'Global solution 2'
          do i=1,mat%n
            print *,i,f(i,1)
          end do
        end if
        !!!!!!!!!!
      end if
      call SPIKE_END(pspike,mat,pre,info)
      call MPI_FINALIZE(code)
    end program example3

We get the following output:

    Global solution 1
    [...]
    Global solution 2
    [...]

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RP3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-04
    TIME FOR SPIKE BANDED SOLV   [...]e-04
    TIME FOR SPIKE (FACT+SOLV)   [...]e-03
    RESIDUAL                     [...]e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1

6.4 Example4: Manual Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done manually on 2 processors. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)^T.

    INCLUDE 'spike.fi'
    program example4
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs
      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%tp=1   !! customized local partitioning of type 1
      pspike%autoadapt=.false.
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!
      mat%format='D'   !! dense banded format
      mat%astru='G'
      mat%diagdo='Y'
      ! global data
      mat%n=8
      mat%kl=2
      mat%ku=2
      allocate(mat%sizeA(1:2))   !! only 2 partitions are considered
      mat%sizeA(1)=4
      mat%sizeA(2)=4
      ! local data for partition number rank+1
      allocate(mat%a(1:mat%kl+mat%ku+1,mat%sizea(rank+1)))
      mat%a(mat%ku+1,:)=6.0d0
      mat%a(mat%ku-1,:)=-1.0d0
      mat%a(mat%ku,:)=-1.0d0
      mat%a(mat%ku+2,:)=-1.0d0
      mat%a(mat%ku+3,:)=-1.0d0
      !! RHS (local)
      allocate(f(1:mat%sizeA(rank+1),1:1))
      f=1.0d0
      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE(pspike,mat,f,info)

      if (info>=0) then
        !!!!!! Local Solution
        print *,'Local solution for partition',rank+1
        do i=1,mat%sizeA(rank+1)
          print *,i,f(i,1)
        end do
      endif
      call MPI_FINALIZE(code)
    end program example4

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RP3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-04
    TIME FOR SPIKE BANDED SOLV   [...]e-04
    TIME FOR SPIKE (FACT+SOLV)   [...]e-03
    RESIDUAL                     [...]e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Local solution for partition [...]
    [...]
    Local solution for partition [...]
    [...]

6.5 Example5: Automatic Partitioning Using the CSR Input Format

The following system in compressed sparse row (CSR) format is solved using the SPIKE subroutine.

$$
\begin{pmatrix}
 6 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 &  6 &  0 & -1 &  0 &  0 &  0 &  0 \\
-1 &  0 &  6 &  0 & -1 &  0 &  0 &  0 \\
 0 & -1 &  0 &  6 &  0 & -1 &  0 &  0 \\
 0 &  0 & -1 &  0 &  6 &  0 & -1 &  0 \\
 0 &  0 &  0 & -1 &  0 &  6 &  0 & -1 \\
 0 &  0 &  0 &  0 & -1 &  0 &  6 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
$$

    INCLUDE 'spike.fi'
    program example5
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat

      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs
      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt=.false.
      pspike%rss='F'
      pspike%dfs='L'
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!
      mat%format='S'   !! CSR
      mat%astru='G'
      mat%diagdo='Y'
      mat%n=8
      if (rank==0) then
        mat%nbsa=20                     !! number of non zero elements in CSR format
        allocate(mat%sa(1:mat%nbsa))    ! array for values
        allocate(mat%jsa(1:mat%nbsa))   ! array for column indexes
        allocate(mat%isa(1:mat%n+1))    ! array for row CSR indexes
        mat%sa=(/6,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,-1,6,-1,6/)
        mat%jsa=(/1,3,2,4,1,3,5,2,4,6,3,5,7,4,6,8,5,7,6,8/)
        mat%isa=(/1,3,5,8,11,14,17,19,21/)
        !! RHS
        allocate(f(1:mat%n,1:1))
        f=1.0d0
      end if
      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE(pspike,mat,f,info)
      if (info>=0) then
        !!!!!! Global Solution
        if (rank==0) then
          print *,'Global solution'
          do i=1,mat%n
            print *,i,f(i,1)
          end do
        end if
      endif
      call MPI_FINALIZE(code)
    end program example5

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy FL3
    TIME FOR PARTITIONING        [...]e-04
    TIME FOR SPIKE BANDED FACT   [...]e-01
    TIME FOR SPIKE BANDED SOLV   [...]e-03
    TIME FOR SPIKE (FACT+SOLV)   [...]e-01
    RESIDUAL                     [...]e-16
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution
    [...]
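Hand-written CSR arrays such as those in Example 5 are easy to get wrong, so a quick consistency check before calling the solver can be useful. The C sketch below is an illustration only: the function name `csr_bandwidth` is hypothetical and not part of SPIKE. It assumes 1-based isa and jsa arrays as in the Fortran listing, verifies that the row pointers are non-decreasing and that the column indices are in range, and reports the lower and upper bandwidths implied by the sparsity pattern.

```c
#include <assert.h>

/* Hypothetical helper (not a SPIKE routine): validate 1-based CSR index
   arrays of an n-by-n matrix and compute its lower (kl) and upper (ku)
   bandwidths. Returns 0 if the arrays are consistent, -1 otherwise. */
int csr_bandwidth(int n, const int *isa, const int *jsa, int *kl, int *ku)
{
    assert(n > 0);
    *kl = 0;
    *ku = 0;
    if (isa[0] != 1)
        return -1;                      /* first row must start at entry 1 */
    for (int i = 1; i <= n; ++i) {      /* 1-based row index */
        if (isa[i] < isa[i - 1])
            return -1;                  /* row pointers must not decrease */
        for (int p = isa[i - 1]; p < isa[i]; ++p) {
            int j = jsa[p - 1];         /* 1-based column index */
            if (j < 1 || j > n)
                return -1;              /* column index out of range */
            if (i - j > *kl) *kl = i - j;
            if (j - i > *ku) *ku = j - i;
        }
    }
    return 0;
}
```

For the arrays of Example 5 this check reports kl = ku = 2, and isa(n+1) - 1 equals the 20 non-zero elements stored in mat%nbsa.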

6.6 Example 6: Automatic Partitioning Using the CSR Input Format with a Preconditioner

Let us define the following general sparse system:

$$
\begin{pmatrix}
 6 &  0 &  0 &  0 &  0 &  0 &  0 & -1 \\
 0 &  6 &  0 &  0 &  0 &  0 & -1 &  0 \\
 0 &  0 &  6 &  0 &  0 & -1 &  0 &  0 \\
 0 &  0 &  0 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 &  0 &  0 &  0 \\
 0 &  0 & -1 &  0 &  0 &  6 &  0 &  0 \\
 0 & -1 &  0 &  0 &  0 &  0 &  6 &  0 \\
-1 &  0 &  0 &  0 &  0 &  0 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
$$

This linear system is solved iteratively with the following dense, banded preconditioner:

$$
M =
\begin{pmatrix}
 6 & -1 &  0 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 & -1 &  6 & -1 &  0 &  0 &  0 &  0 \\
 0 &  0 & -1 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 & -1 &  0 &  0 \\
 0 &  0 &  0 &  0 & -1 &  6 & -1 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 &  0 & -1 &  6
\end{pmatrix}
$$

    INCLUDE 'spike.fi'
    program example6
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat, pre
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!
      pspike%nbprocs=nb_procs
      pspike%rank=rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt=.false.
      pspike%dfs='L'
      pspike%bps=1   ! a banded preconditioner is provided by the user
      !!!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!!

      mat%format='S'   !! CSR
      mat%astru='G'
      mat%diagdo='Y'
      mat%n=8
      if (rank==0) then
        mat%nbsa=16
        allocate(mat%sa(1:mat%nbsa))
        allocate(mat%jsa(1:mat%nbsa))
        allocate(mat%isa(1:mat%n+1))
        mat%sa=(/6,-1,6,-1,6,-1,6,-1,-1,6,-1,6,-1,6,-1,6/)
        mat%jsa=(/1,8,2,7,3,6,4,5,4,5,3,6,2,7,1,8/)
        mat%isa=(/1,3,5,7,9,11,13,15,17/)
        !! RHS
        allocate(f(1:mat%n,1:1))
        f=0.0d0
        f(1,1)=1.0d0
      end if
      !!!!!!!!!!!!!!!! INPUT PARAMETER PRECONDITIONER !!!!!!!!!!!!!!!!
      pre%format='D'   ! Dense Banded format
      pre%astru='G'
      pre%diagdo='Y'
      pre%n=8
      pre%kl=1
      pre%ku=1
      if (rank==0) then
        allocate(pre%a(1:pre%kl+pre%ku+1,pre%n))
        pre%a(pre%ku+1,:)=6.0d0
        pre%a(pre%ku,:)=-1.0d0
        pre%a(pre%ku+2,:)=-1.0d0
      end if
      !!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!
      call SPIKE_BEGIN(pspike,mat,pre,info)
      call SPIKE_PREPROCESS(pspike,pre,info)
      call SPIKE_PROCESS(pspike,mat,pre,f,info)
      call SPIKE_END(pspike,mat,pre,info)
      if (info>=0) then
        !!!!!! Global Solution
        if (rank==0) then
          print *,'Global solution'
          do i=1,mat%n
            print *,i,f(i,1)
          end do
        end if
      endif
      call MPI_FINALIZE(code)
    end program example6

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RL3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-03
    TIME FOR SPIKE BANDED SOLV   [...]e-04
    TIME FOR SPIKE (FACT+SOLV)   [...]e-03
    RESIDUAL                     [...]e-08
    # Outside iterations: 4
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    Global solution
    [...]

6.7 Toeplitz Matrix Example

This example solves a large Toeplitz system with RHS (1, 1, ..., 1, 1)^T. The source code is not shown here; it can be found in <SPIKE dir>/examples/examples_f90/source. The input matrix elements and properties must be defined in the file <SPIKE dir>/examples/examples_f90/data/matrix_toeplitz.in. The following is a sample input file for a banded matrix (n = 48,000) with 3 on the main diagonal, 4 on the first upper and lower off-diagonals, 0.1 on the other off-diagonals, and upper and lower bandwidths of 80 (total bandwidth 161):

    48000   !! n, size of the matrix
    80      !! kl, Lower band
    80      !! ku, Upper band
    3.0d0   !! diagonal element
    4.0d0   !! first lower off diagonal element
    4.0d0   !! first upper off diagonal element
    0.1d0   !! OTHERS off diagonal element
    1       !! s, number of RHS (THE value of the RHS are generated by the code)
    N       !! DIAGDO? Y (Yes), N (No), I (Investigate)

Some of the components of the derived type spike_param variable can be changed from their default values by modifying the input file <SPIKE dir>/examples/examples_f90/data/spike_toeplitz.in. Here is a sample input file that selects the (R,L) strategy:

    R       !! RSS? E (Explicit), F (on the Fly), T (Truncated), R (Recursive)
    L       !! DFS? L (LU), U (LU, and UL), P (LU with pivoting)
    3       !! OIS? 0 (DIRECT), 2 (ITREFINEM), 3 (BiCGStab)
    1D-7    !! eps_out  !! ACCURACY BiCGstab OUTSIDE
    50      !! nbit_out !! NBRE MAX of ITERATIONS OUTSIDE
    1D-5    !! eps_in   !! ACCURACY BiCGstab INSIDE
    30      !! nbit_in  !! NBRE MAX of ITERATIONS INSIDE
    1D-10   !! New zero machine for diagonal BOOSTing procedure
    0       !! type of partitioning (0: global, 1: local)
    true    !! timing
    true    !! detailed information of the simulation
    6       !! info printed on screen if =6, or on file spike output if /=6
    false   !! to enable spike adapt

Finally, one can run the Toeplitz example program with the command mpirun -np 4 toeplitz to get the following output:

    SPIKE INFO
    !! NB PROCESSORS?    4
    !! NB PARTITIONS?    4
    !! SPIKE ADAPT?      F
    !! ALGORITHM?        R
    !! FACTORIZATION?    L
    !! TYPE OF SOLVER?   3
    !! ACCURACY OUT?     [...]e-07
    !! NB ITMAX OUT?     50
    !! ACCURACY IN?      [...]e-05
    !! NB ITMAX IN?      30
    !! NEW ZERO PIVOT?   [...]e-09
    !! BOOST?            [...]e-10

    !! Orign Partition?             0
    !! Size first last partition?   12000
    !! Size partition middle?       12000
    !! Free memory?                 T
    !! Compute Residual?            T
    !! ADD MEMORY NEEDED (Mb)       [...]e+02
    MATRIX INFO
    !! MATRIX FORMAT?   D
    !! MATRIX STRUCT?   G
    !! Diag Dominant?   N
    !! SIZE MATRIX?     [...]
    DENSE BANDED MATRIX
    !! Lower band?      80
    !! Upper band?      80
    DETAILED TIME of PREPROCESS
    NORM L1 of Aj (1st partition)                 [...]e+01
    TIME FACT LU (< to copy UL+FACT LU, if any)   [...]e-01
    TIME FOR COMPUTING THE SPIKES                 [...]e-01
    > TIME FOR SPIKE PREPROCESSING                [...]e-01
    RHS INFO
    !! Number of RHS?   1
    DETAILED TIME of PROCESS
    TIME FOR MODIFIED RHS                [...]e-01
    TIME FOR REDUCED SYSTEM              [...]e-02
    TIME FOR RETRIEVE                    [...]e-03
    RESIDUAL BEFORE OUTSIDE ITERATION    [...]e-10
    TIME postprocess MATMUL              [...]e+00
    TIME postprocess SOLVE               [...]e+00
    > TIME FOR SPIKE PROCESSING          [...]e-01
    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy RL3
    TIME FOR PARTITIONING        [...]e-01
    TIME FOR SPIKE BANDED FACT   [...]e-01
    TIME FOR SPIKE BANDED SOLV   [...]e-01
    TIME FOR SPIKE (FACT+SOLV)   [...]e-01
    RESIDUAL                     [...]e-10
    # Outside iterations: 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

6.8 Sparse Banded Matrix Example

This example reads and solves a sparse banded matrix in CSR format. The source code is provided in <SPIKE dir>/examples/examples_f90/source/sparse.f90, together with a sample input matrix. The input matrix file is defined in <SPIKE dir>/examples/examples_f90/data/matrix_sparse.in and it contains the following fields:

    csrfile   !! generic name of sparse format
    I         !! DIAGDO? Y (Yes), N (No), I (Investigate)
    false     !! sparse2dense banded (true or false)

The sparse system matrix is stored in four files whose generic name is given by the first line of the input file above (here the name is csrfile). The four files, located in the same directory as the input file, are: csrfilesa for the matrix elements, csrfilejsa for the column indices, csrfileisa for the start-of-row indices, and csrfilesf for the right-hand-side elements. The number of non-zero elements is given at the beginning of the first two files, while the beginning of the last two gives the number of rows. In addition, the first line of csrfilesf also contains the number of right-hand sides (if this number is greater than one, the elements should be stored in multiple columns).

As in the Toeplitz example, some of the components of the derived type spike_param variable can be changed from their default values by modifying the input file <SPIKE dir>/examples/examples_f90/data/spike_sparse.in. In SPIKE 1.0, only the (F,L) strategy is allowed for solving sparse banded systems. However, the last field of the file matrix_sparse.in enables a utility routine that transforms the CSR input matrix into a dense banded matrix. The routine then sets mat%format='D', which enables the use of all the other strategies for dense banded systems. Finally, one can run the sparse example program with the command mpirun -np 4 sparse to get the following output:

    Matrix loaded n= [...] nnz= [...]
    SPIKE INFO
    !! NB PROCESSORS?    4
    !! NB PARTITIONS?    4
    !! SPIKE ADAPT?      F
    !! ALGORITHM?        F
    !! FACTORIZATION?    L
    !! TYPE OF SOLVER?   3
    !! ACCURACY OUT?     [...]e-07
    !! NB ITMAX OUT?     50
    !! ACCURACY IN?      [...]e-05
    !! NB ITMAX IN?      30
    !! NEW ZERO PIVOT?   [...]e-09
    !! BOOST?            [...]e-10
    !! Orign Partition?             0
    !! Size first last partition?   240
    !! Size partition middle?       240
    !! Free memory?                 T
    !! Compute Residual?            T
    !! ADD MEMORY NEEDED (Mb)       [...]e-01
    MATRIX INFO
    !! MATRIX FORMAT?   S
    !! MATRIX STRUCT?   G
    !! Diag Dominant?   N
    !! Degree of Diag Dominant?                  [...]e-01
    !! Degree of Sparsity (within the band)?     [...]e-01
    !! SIZE MATRIX?     960
    SPARSE BANDED MATRIX
    !! Lower band?      43
    !! Upper band?      43
    !! # of non zero el? [...]

    DETAILED TIME of PREPROCESS
    Pardiso Reorder                               [...]e-01
    Pardiso Factor                                [...]e-02
    TIME FACT LU (< to copy UL+FACT LU, if any)   [...]e-01
    TIME FOR COMPUTING THE SPIKES                 [...]e-05
    > TIME FOR SPIKE PREPROCESSING                [...]e-01
    RHS INFO
    !! Number of RHS?   1
    DETAILED TIME of PROCESS
    RESIDUAL BEFORE BICGSTAB IN ITERATION   [...]e+00
    [...]
    TIME postprocess MATMUL              [...]e-02
    TIME postprocess SOLVE               [...]e-05
    TIME FOR MODIFIED RHS                [...]e-04
    TIME FOR REDUCED SYSTEM              [...]e-02
    TIME FOR RETRIEVE                    [...]e-04
    RESIDUAL BEFORE OUTSIDE ITERATION    [...]e-06
    RESIDUAL BEFORE BICGSTAB IN ITERATION   [...]e+00
    [...]

    [...]
    TIME postprocess MATMUL              [...]e-02
    TIME postprocess SOLVE               [...]e-06
    TIME FOR MODIFIED RHS                [...]e-04
    TIME FOR REDUCED SYSTEM              [...]e-02
    TIME FOR RETRIEVE                    [...]
    [...]e-11
    TIME postprocess MATMUL              [...]e-05
    TIME postprocess SOLVE               [...]e-02
    > TIME FOR SPIKE PROCESSING          [...]e-01
    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy FL3
    TIME FOR PARTITIONING        [...]e-03
    TIME FOR SPIKE BANDED FACT   [...]e-01
    TIME FOR SPIKE BANDED SOLV   [...]e-01
    TIME FOR SPIKE (FACT+SOLV)   [...]e-01
    RESIDUAL                     [...]e-11
    # Outside iterations: 1
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

6.9 Calling SPIKE from C Programs

SPIKE can also be called from C programs. The data structures of the C interface are declared in the header file <SPIKE dir>/include/spike.h. It is important to understand the difference between the Fortran and C input formats. In spike.h, the integer variables comd, autoadapt, failed, timing, memfree, residual, singular, blocked, boost, and custom_pre of the spike_param_c_interface data structure are actually logical variables in Fortran. Therefore, to initialize these variables in C, set them to -1 for true and 0 for false. To call SPIKE, add the following lines to the C code:

    #include <mpi.h>
    #include "spike.h"

    /* Before the MPI_INIT calling sequence */
    int rank, nb_procs, code, info[4];
    spike_param_c_interface pspike;   /* Data structure associated with a given SPIKE environment */
    matrix_data_c_interface mat, pre; /* Data structure associated with the original matrix mat
                                         and pre (if separate calling is used) */

    /* Inside main function */
    /* ... */
    code = MPI_Init(&argc, &argv);
    code = MPI_Comm_size(MPI_COMM_WORLD, &nb_procs);
    code = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* ... */

    /* After the MPI_INIT calling sequences */
    pspike.nbprocs=nb_procs;
    pspike.rank=rank;
    spike_default(&pspike);   /* Default values for pspike */

    /* CALL FOR SPIKE with DEFINITION of INPUT PARAMETERS */

    /* End of main function */

The C version of the Toeplitz example program, as well as C versions of examples 1-5, are available in the directory <SPIKE dir>/examples/examples_c. If necessary, modify both makefile and makefile.target to use the desired compiler and MPI implementation. The examples use the Intel compilers and MPI library by default. Moreover, the makefile should (i) link the libspike.a library, (ii) link the BLAS and LAPACK libraries, and (iii) specify the path to the spike.h header file.
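Two details tend to trip up C callers of a Fortran library: the logical convention described above and the column-major array layout. The sketch below illustrates both under stated assumptions. The helper names `to_fortran_logical` and `fill_example_band` are hypothetical and not part of spike.h, and the sketch assumes that the C interface expects the band array in the same column-major layout as the Fortran array mat%a, which should be confirmed against spike.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a C truth value to the convention stated in
   this section for the logical fields of spike_param_c_interface
   (-1 for true, 0 for false). */
int to_fortran_logical(int c_bool)
{
    return c_bool ? -1 : 0;
}

/* Hypothetical helper: fill a (kl+ku+1)-by-n band array, stored
   column-major in a flat C array, with the matrix of examples 1 to 4
   (6 on the diagonal, -1 on every other stored diagonal).
   The diagonal sits in 0-based row ku of each column, i.e. Fortran row
   ku+1, as in mat%a(mat%ku+1,:)=6.0d0. */
void fill_example_band(int n, int kl, int ku, double *ab)
{
    int bwd = kl + ku + 1;

    assert(n > 0 && kl >= 0 && ku >= 0);
    for (int j = 0; j < n; ++j)          /* 0-based column */
        for (int r = 0; r < bwd; ++r)    /* 0-based row inside the band */
            ab[(size_t)j * (size_t)bwd + (size_t)r] = (r == ku) ? 6.0 : -1.0;
}
```

A field such as autoadapt would then be set with an assignment of the form `pspike.autoadapt = to_fortran_logical(0);` after the call to spike_default.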

Chapter 7
Reference guide

7.1 SPIKE 1.0 directory structure

High-level Directory Structure

The table below shows the high-level directory structure of SPIKE 1.0 after installation. All directories are under the SPIKE main directory, for example /opt/intel/spike/1.0.

    Directory              Comment
    bin/64                 Itanium2® binary executable
    bin/em64t              Intel64® binary executable
    doc                    Documentation
    examples/examples_c    C source code and data for examples
    examples/examples_f90  Fortran 90 source code and data for examples
    include                C headers, Fortran 90 module interfaces, and MPI wrappers
    lib/64                 Itanium2® static libraries
    lib/em64t              Intel64® static libraries
    spike_adapt/64         Spike Adapt data files, Itanium2®
    spike_adapt/em64t      Spike Adapt data files, Intel64®

Detailed Directory Structure

The table below shows the detailed structure of the SPIKE directories. Again, all directories are under the SPIKE main directory, for example /opt/intel/spike/1.0.

    Directory and Files               Contents

    bin/64                            Binaries directory, Itanium2®
      ivars.nml                       Fortran NAMELIST file storing the input characteristics used by spike_adapt.exe
      spike_adapt.exe                 Standalone executable to query Spike Adapt
    bin/em64t                         Binaries directory, Intel64®
      ivars.nml                       Fortran NAMELIST file storing the input characteristics used by spike_adapt.exe
      spike_adapt.exe                 Standalone executable to query Spike Adapt
    doc                               Documentation directory
      Install.txt                     Installation Guide
      spikeeula.txt                   SPIKE license
      spike_ug.pdf                    The SPIKE User Guide (PDF format)
      spike_ug.ps                     The SPIKE User Guide (PostScript format)
    examples/examples_c               C example source code and data
      source                          Source code subdirectory
      data                            Data files subdirectory
      makefile[.target]               Makefiles to build examples
    examples/examples_f90             Fortran 90 example source code and data
      source                          Source code subdirectory
      data                            Data files subdirectory
      makefile[.target]               Makefiles to build examples
    include                           Headers, interfaces, wrappers
      spike.fi                        Fortran 90 module interface
      spike.h, spike_c_wrapper.h      C headers
      spike_mpi_comm.f90              Source of MPI wrapper
    lib/64                            Itanium2® static libraries
      libguide.a                      Intel® Legacy OpenMP run-time library for static linking
      libguide.so                     Intel® Legacy OpenMP run-time library for dynamic linking
      libmkl_core.a                   Kernel library for IA-64 architecture
      libmkl_core.so                  Library dispatcher for dynamic load of processor-specific kernel library
      libmkl_intel_lp64.a             LP64 interface library for Intel compiler
      libmkl_intel_lp64.so            LP64 interface library for Intel compiler
      libmkl_intel_thread.a           Parallel drivers library supporting Intel compiler
      libmkl_intel_thread.so          Parallel drivers library supporting Intel compiler
      libmkl_lapack.a                 Dummy library. Contains references to Intel MKL libraries
      libmkl.so                       Dummy library. Contains references to Intel MKL libraries
      libmkl_solver.a                 Dummy library. Contains references to Intel MKL libraries
      libmkl_solver_lp64.a            Sparse Solver, Interval Solver, and GMP routines library supporting LP64 interface
      libspike.a                      Spike Core routines
      libspike_adapt.a                Spike Adapt routines
      libspike_adapt_de.so            Spike Adapt routines, performance model specific
      libspike_adapt_grid_f.a         Spike Adapt routines, grid specific
      libspike_mpi_comm.a             Default MPI wrapper, copied from libspike_mpi_comm_intelmpi.a. Users can build their own; see Appendix C for details
      libspike_mpi_comm_intelmpi.a    MPI wrapper supporting Intel MPI Library for Linux
      libspike_mpi_comm_mpich1.a      MPI wrapper supporting MPICH 1
      libspike_mpi_comm_mpich2.a      MPI wrapper supporting MPICH 2
      libspike_mpi_comm_openmpi.a     MPI wrapper supporting Open MPI
    lib/em64t                         Intel64® static libraries
      libguide.a                      Intel® Legacy OpenMP run-time library for static linking
      libguide.so                     Intel® Legacy OpenMP run-time library for dynamic linking
      libmkl_core.a                   Kernel library for Intel64® architecture
      libmkl_core.so                  Library dispatcher for dynamic load of processor-specific kernel library
      libmkl_intel_lp64.a             LP64 interface library for Intel compiler
      libmkl_intel_lp64.so            LP64 interface library for Intel compiler
      libmkl_intel_thread.a           Parallel drivers library supporting Intel compiler
      libmkl_intel_thread.so          Parallel drivers library supporting Intel compiler
      libmkl_lapack.a                 Dummy library. Contains references to Intel MKL libraries
      libmkl.so                       Dummy library. Contains references to Intel MKL libraries
      libmkl_solver.a                 Dummy library. Contains references to Intel MKL libraries

      libmkl_solver_lp64.a            Sparse Solver, Interval Solver, and GMP routines library supporting LP64 interface
      libspike.a                      Spike Core routines
      libspike_adapt.a                Spike Adapt routines
      libspike_adapt_de.so            Spike Adapt routines, performance model specific
      libspike_adapt_grid_f.a         Spike Adapt routines, grid specific
      libspike_mpi_comm.a             Default MPI wrapper, copied from libspike_mpi_comm_intelmpi.a. Users can build their own; see Appendix C for details
      libspike_mpi_comm_intelmpi.a    MPI wrapper supporting Intel MPI Library for Linux
      libspike_mpi_comm_mpich1.a      MPI wrapper supporting MPICH 1
      libspike_mpi_comm_mpich2.a      MPI wrapper supporting MPICH 2
      libspike_mpi_comm_openmpi.a     MPI wrapper supporting Open MPI
    spike_adapt/64                    Itanium2® Spike Adapt data files
      de                              Subdirectory, calibration data files
    spike_adapt/em64t                 Intel64® Spike Adapt data files
      de                              Subdirectory, calibration data files
    tools/environment                 Initialization shell scripts
      spikevars64.csh                 Itanium2® platforms; C shell
      spikevars64.sh                  Itanium2® platforms; Bourne shell
      spikevarsem64t.csh              Intel64® platforms; C shell
      spikevarsem64t.sh               Intel64® platforms; Bourne shell

Table 7.1: Detailed SPIKE directory structure

7.2 SPIKE and ScaLAPACK

This section is addressed to ScaLAPACK users who would like to experiment with SPIKE while making only minor changes to their code for solving dense banded linear systems (data in double precision). We describe a practical way to insert SPIKE calling sequences in place of ScaLAPACK ones. The ScaLAPACK calling sequences concerned by this migration procedure are:

For non-diagonally dominant systems
    PDGBSV: single calling sequence, Factorization+Solve
    PDGBTRF, PDGBTRS: separated calling sequences, Factorization and Solve

For diagonally dominant systems
    PDDBSV: single calling sequence, Factorization+Solve
    PDDBTRF, PDDBTRS: separated calling sequences, Factorization and Solve

As described in the documentation, SPIKE can also handle single or separated calling sequences. In contrast to ScaLAPACK, the diagonally dominant property does not involve new calling sequences but is defined in the data structure matrix_data through the parameter mat%diagdo. Let us consider the following ScaLAPACK code:

    Call PDGBSV(N, BWL, BWU, NRHS, A, JA, DESCA, IPIV, B, IB, DESCB, WORK, LWORK, INFO)

where we assume that the user is familiar with all the above parameters (as described in the ScaLAPACK user guide [3]). This calling sequence can be replaced by the following one:

    Call Spike(pspike, mat, B, info_spike)

where the parameters pspike, mat, and info_spike need to be declared at the beginning of the program as described in this documentation, while the parameter B, which contains the RHS and the solution, is identical to the ScaLAPACK one. Before the call to SPIKE, the other parameters need to be set as follows:

    pspike%rank = rank          ! with rank the user variable name for processor rank
    pspike%nbprocs = nb_procs   ! with nb_procs the user variable name
                                ! for # of processors
    call Spike_Default(pspike)
    pspike%tp = 1               ! data local distribution of type 1 is compatible with ScaLAPACK
    ! if the user wants to turn off Spike Adapt by pspike%autoadapt = .false.
    ! the user can select here his own Spike Core strategy (RSS, DFS, OIS)
    mat%format = 'D'            ! double precision data
    mat%astru = 'G'             ! general non symmetric
    mat%n = N                   ! N as in ScaLAPACK
    mat%kl = BWL                ! BWL as in ScaLAPACK
    mat%ku = BWU                ! BWU as in ScaLAPACK
    mat%diagdo = 'N'            ! 'N' if ScaLAPACK command starts with PDGB
                                ! 'Y' if ScaLAPACK command starts with PDDB
    mat%aj = AA                 ! AA is the matrix A in ScaLAPACK without extra space for pivoting
    ! if mat%diagdo = 'Y', AA is identical to A and one can simply
    !     use mat%aj => A (with the target attribute for A)
    ! if mat%diagdo = 'N', the user may first want to suppress the extra
    !     storage space in the allocation of A and then
    !     use mat%aj = A
    allocate(mat%sizeA(1:nb_procs))
    mat%sizeA(1:nb_procs-1) = DESCA(4)                   ! ScaLAPACK variable
                                                         ! size of the local partition
    mat%sizeA(nb_procs) = n-(nb_procs-1)*mat%sizeA(1)    ! size of the last partition

In the case of separated calling sequences, the setup of the above parameters is identical. Also, the BLACS commands introduced in ScaLAPACK are unnecessary for SPIKE and can be removed (SPIKE is independent of the BLACS library).

7.3 Spike_Default

Set the default values on all the applicable components within the type spike_param variable.

Syntax

    CALL Spike_Default(pspike)

Description

The routine assigns default values to those input and inout components of the type spike_param variable pspike that have defaults. Other components remain unchanged.

Input Parameters

    pspike    SPIKE data structure of type spike_param described in Section 2.2.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2. On exit, the components of pspike tabulated in Table 2.1 are assigned the default values specified there.

7.4 Spike

Spike solver driver: solves the complete system via one call.

Syntax

    CALL Spike(pspike,mat,f,info)

Description

The routine solves the system specified by the matrix contained in mat with the right-hand side(s) contained in f.

Input Parameters

    pspike    SPIKE data structure of type spike_param described in Section 2.2.
    mat       Matrix data structure of type matrix_data described in Section 2.3 and Chapter 5.
    f         Double precision array containing the right-hand side(s). Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2.
    f         The computed solution of the system.
    info      Returns the error code. If info=0, the execution is successful. If info≠0, SPIKE encountered a problem and has stopped unexpectedly; the meaning of the error codes is described in detail in Section 7.11.

7.5 Spike_Begin

Begin the calling sequence.

Syntax

    CALL Spike_Begin(pspike,mat,pre,info)

Description

The routine partitions the matrix and allocates a work table for SPIKE. Moreover, Spike Adapt may be invoked in this routine.

Input Parameters

    pspike    SPIKE data structure of type spike_param described in Section 2.2. On entry, if pspike%autoadapt is true, Spike Adapt will be invoked to select a SPIKE strategy.
    mat       Matrix data structure of type matrix_data described in Section 2.3 and Chapter 5.
    pre       Preconditioner data structure of type matrix_data. The use of a banded preconditioner is described in Chapter 4.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2. On exit, if Spike Adapt was invoked, pspike%dfs, pspike%rss and pspike%ois will be updated.
    mat       Matrix data structure described in Section 2.3. If the matrix is defined with global data as input, on exit mat contains the local partitioning of the matrix on each processor (the memory of the global matrix on rank 0 is deallocated if pspike%memfree is set to true).
    pre       Contents set by Spike_Begin. It contains the local partitioning of the preconditioner (it may just be a copy of the matrix) that will be used in Spike_Preprocess.
    info      Returns the error code. If info=0, the execution is successful. If info≠0, SPIKE encountered a problem and has stopped unexpectedly; the meaning of the error codes is described in detail in Section 7.11.

7.6 Spike_Preprocess

Preprocess the preconditioner data.

Syntax

    CALL Spike_Preprocess(pspike,pre,info)

Description

The routine factorizes the preconditioner pre using the SPIKE strategy specified in pspike. Note that pre can be either an explicit preconditioner supplied by the user or simply a copy (made automatically by SPIKE) of the original system.

Input Parameters

    pspike    SPIKE data structure described in Section 2.2.
    pre       The output of the preceding Spike_Begin call.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2.
    pre       Contents modified. It contains the factorization of the preconditioner, ready to be used in Spike_Process any number of times.
    info      Returns the error code. If info=0, the execution is successful. If info≠0, SPIKE encountered a problem and has stopped unexpectedly; the meaning of the error codes is described in detail in Section 7.11.

7.7 Spike_Process

Process the matrix, preconditioner and the right-hand side.

Syntax

    CALL Spike_Process(pspike,mat,pre,f,info)

Description

The routine solves the reduced system and then retrieves the overall solution. In this version of SPIKE, the solver includes outer iterations. The preconditioner is defined by pre, and the original matrix is defined by mat. The routine Spike_Process can be called repeatedly for applications that involve iterations with changing right-hand sides f but the same original matrix of coefficients.

Input Parameters

    pspike    SPIKE data structure described in Section 2.2.
    mat       Matrix data structure. On entry, the matrix data should have been processed by a previous Spike_Begin call, so that the data have been distributed to all processors.
    pre       Set up by Spike_Preprocess in a previous call.
    f         On entry, f stores the right-hand side. Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2.
    f         On exit, f stores the solution of the system. Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.
    info      Returns the error code. If info=0, the execution is successful. If info≠0, SPIKE encountered a problem and has stopped unexpectedly; the meaning of the error codes is described in detail in Section 7.11.

7.8 Spike_End

End of the calling sequence.

Syntax

    CALL Spike_End(pspike,mat,pre,info)

Description

The routine clears the memory space, deallocating all local partitioning for mat and pre.

Input Parameters

    pspike    SPIKE data structure described in Section 2.2.
    mat       Matrix data structure described in Section 2.3.
    pre       Preconditioner data structure.

Output Parameters

    pspike    SPIKE data structure described in Section 2.2.
    mat       Matrix data structure described in Section 2.3. On exit, several components of mat are deallocated.
    pre       On exit, pre is deallocated.
    info      Returns the error code. If info=0, the execution is successful. If info≠0, SPIKE encountered a problem and has stopped unexpectedly; the meaning of the error codes is described in detail in Section 7.11.

7.9 spike_param details

The type spike_param has a number of input components whose possible default values are listed in Table 2.1. Furthermore, this type has a number of output components, which are listed in Table 7.2.

7.10 matrix_data details

The derived type matrix_data is used for the storage of matrices. In SPIKE 1.0, it is used exclusively for the matrix representing the linear system. In the future, the user will be able to use this type to explicitly store a separate matrix used as a preconditioner for the linear system. The components of this type and their meaning are given in an earlier chapter.

7.11 info details

Errors and warnings encountered during a run of SPIKE are stored in an integer variable, info. All MPI, LAPACK and PARDISO errors are fatal;

    Component            Type (Intent)    Distribution  Description
    boost                logical (out)    local         Returns true if a zero-pivot is detected (pivot > 0 ɛ)
    nb_boost             integer (out)    global        # of boosts performed
    nbit_out0            integer (out)    global        # of outer iterations
    nbit_in0             integer (out)    global        # of inner iterations
    memory               double (out)     global        Total amount of memory (in Mb) needed by Spike Core
    maxres               double (out)     global        If the component residual is set to true, returns the maximum relative residual over all RHS
    failed               logical (out)    global        Returns true if Spike Core fails to reach the accuracy specified in the eps_out component
    error_code           integer (out)    global        If info≠0 in the SPIKE calling sequences, returns the error code as presented in Section 7.11

    Output components for timing information (if the timing component is set to true):

    tspike_adapt         double (out)     global        Time spent in Spike Adapt
    tspike_preparation   double (out)     global        Preparation time (with Spike Adapt)
    tspike_prep          double (out)     global        Preprocessing time
    tspike_process       double (out)     global        Processing time
    tspike_residual      double (out)     global        Time spent computing the residual

Table 7.2: List of output components for the derived type spike_param. A variable of this type can be local to each partition or global (i.e., common to all partitions).

in other words, execution of the program is terminated if such an error is encountered. Other possible sources of warnings and errors are Spike Core and Spike Adapt. If the output info parameter is not zero, either an error (info<0) or a warning (info>0) was encountered. The possible return values for the info parameter are given in Table 7.3.

    info   Classification    Description
     3     Warning           Spike Adapt could not make a prediction
     2     Warning           A zero-pivot has been detected; OIS has been set to 3 due to boosting
     1     Warning           This matrix (or preconditioner, if any) is not narrow banded; this will affect SPIKE performance
     0     Successful exit
    -1     Error             Spike Core error
    -2     Error             Spike Adapt error
    -3     Error             MPI error
    -4     Error             LAPACK error
    -5     Error             PARDISO error

Table 7.3: SPIKE return code descriptions for the parameter info

If info<0, the user can determine whether Spike Core, Spike Adapt, MPI, LAPACK, or PARDISO is responsible for the unexpected termination. The corresponding error code is stored in the component pspike%error_code. Please refer to Table 7.4 for the possible values of pspike%error_code if a fatal error is encountered in Spike Core (info=-1), and similarly refer to Table 7.5 if a fatal error is encountered in Spike Adapt (info=-2). When info equals -3, -4, or -5, the error code is also stored in pspike%error_code, and the user should consult the MPI, LAPACK, or PARDISO documentation, respectively.
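The sign convention and the values in Table 7.3 lend themselves to a small dispatch helper on the caller's side. The sketch below is illustrative only: the enum and function names are hypothetical (they are not provided by spike.h), and it assumes the caller has the info value available as a plain int.

```c
#include <assert.h>

/* Hypothetical caller-side types; only the numeric info values come from Table 7.3. */
typedef enum { SPIKE_INFO_OK, SPIKE_INFO_WARNING, SPIKE_INFO_ERROR } spike_info_severity_t;

typedef enum {
    SPIKE_SRC_CORE,      /* info = -1, see Table 7.4 for error_code */
    SPIKE_SRC_ADAPT,     /* info = -2, see Table 7.5 for error_code */
    SPIKE_SRC_MPI,       /* info = -3, consult the MPI documentation */
    SPIKE_SRC_LAPACK,    /* info = -4, consult the LAPACK documentation */
    SPIKE_SRC_PARDISO,   /* info = -5, consult the PARDISO documentation */
    SPIKE_SRC_UNKNOWN    /* any other negative value */
} spike_error_source_t;

/* Classify info by sign: zero is success, positive is a warning, negative is an error. */
spike_info_severity_t spike_info_severity(int info)
{
    if (info == 0) return SPIKE_INFO_OK;
    return (info > 0) ? SPIKE_INFO_WARNING : SPIKE_INFO_ERROR;
}

/* For an error (info < 0), identify which component reported it. */
spike_error_source_t spike_error_source(int info)
{
    assert(info < 0);   /* only meaningful for errors */
    switch (info) {
    case -1: return SPIKE_SRC_CORE;
    case -2: return SPIKE_SRC_ADAPT;
    case -3: return SPIKE_SRC_MPI;
    case -4: return SPIKE_SRC_LAPACK;
    case -5: return SPIKE_SRC_PARDISO;
    default: return SPIKE_SRC_UNKNOWN;
    }
}
```

A caller would typically test the severity first and only inspect the error_code component when the severity is an error.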

    info=-1   Description
       0      Successful exit - Default value
    -200      memory allocation error
    -201      rho = 0, BiCGStab(out) failed
    -202      omega = 0, BiCGStab(out) failed
    -303      cannot select Spike Adapt if you want to use your own preconditioner %BPS= the format of the preconditioner is incorrect, it should be pre%format= D or S
    -305      the preconditioner should be banded
    -306      the preconditioner should be the same size as the matrix
    -307      if preconditioner (option %BPS= 1), one needs to use iterative methods %OIS 0)
    -308      the preconditioner cannot be used with DFS= P
    -309      either upper or lower bandwidth is too small for the size of the partitions
    -310      number of processors has to be even for RSS= A or P
    -313      the size of the matrix mat%n must be > mat%kl and mat%ku must be the format of the matrix is incorrect, it should be mat%format= D or S
    -320      Spike Adapt cannot be selected if only one processor
    -399      wrong value for %tp
    -400      combinations (DFS, RSS) not supported by SPIKE DFS= L or P are only possible options if one processor is used
    -402      DFS= A cannot be used here see Table RSS= R cannot be used here see Table only tp=0 can handle one processor run

Table 7.4: SPIKE return code descriptions for %error_code

    info=-2   Classification   Description
       1      Information      Spike Core strategy selected by grid lookup
       2      Information      Spike Core strategy selected by performance models
       3      Warning          Spike Core strategy selected arbitrarily
    -310      Error            pspike%tp=2 requires an even number of MPI processes
    -312      Error            pspike%tp=2 requires RSS = A
    -313      Error            pspike%tp=1 cannot be used when RSS = A
    -402      Error            Memory allocation failed during model evaluation
    -403      Error            SPIKE_ADAPT_DATA environment variable not set
    -404      Error            Error reading directory specified by SPIKE_ADAPT_DATA environment variable
    -405      Error            Performance models not found in directory specified by SPIKE_ADAPT_DATA environment variable
    -406      Error            Could not open performance models
    -407      Error            Could not read performance models

Table 7.5: Spike Adapt return code descriptions for %error_code
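The three consistency rules in Table 7.5 that relate pspike%tp, RSS and the number of MPI processes can be checked by the caller before SPIKE is invoked. The sketch below is a hypothetical pre-check, not the library's own validation: the function name, the representation of RSS as a single character, and the order in which the rules are tested are all assumptions.

```c
#include <assert.h>

/* Hypothetical pre-check of the tp/RSS rules listed in Table 7.5.
 * Returns 0 if no rule is violated, otherwise the matching error code.
 * Assumption: RSS is passed as a single character such as 'A'. */
int spike_check_tp_rss(int tp, char rss, int nb_procs)
{
    assert(nb_procs >= 1);
    if (tp == 2 && nb_procs % 2 != 0) return -310;  /* tp=2 needs an even process count */
    if (tp == 2 && rss != 'A')        return -312;  /* tp=2 requires RSS = A */
    if (tp == 1 && rss == 'A')        return -313;  /* tp=1 cannot be used with RSS = A */
    return 0;
}
```

For instance, under these assumptions a run with tp=2 on three processes is rejected with -310 regardless of the RSS value, because the parity rule is tested first.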

Bibliography

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. Technical report, Knoxville, 1990.

[2] Michael W. Berry and Ahmed Sameh. Multiprocessor schemes for solving block tridiagonal linear systems. The International Journal of Supercomputer Applications, 1(3):37-57, 1988.

[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: a linear algebra library for message-passing computers. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing (Minneapolis, MN, 1997), page 15 (electronic), Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.

[4] S. C. Chen, D. J. Kuck, and A. H. Sameh. Practical parallel band triangular system solvers. ACM Transactions on Mathematical Software, 4(3), 1978.

[5] Jack J. Dongarra and Ahmed H. Sameh. On some parallel banded system solvers. Parallel Computing, 1(3), 1984.

[6] D. H. Lawrie and A. H. Sameh. The computation and communication complexity of a parallel banded system solver. ACM Trans. Math. Softw., 10(2), 1984.

[7] E. Polizzi and A. Sameh. Numerical parallel algorithms for large-scale nanoelectronics simulations using nessie. Journal of Computational Electronics, (3), 3-4, 2005.

[8] Eric Polizzi and Ahmed H. Sameh. A parallel hybrid banded system solver: the spike algorithm. Parallel Comput., 32(2), 2006.

[9] Eric Polizzi and Ahmed H. Sameh. Spike: A parallel environment for solving banded linear systems. Computers & Fluids, 36(1), 2007.

[10] A. H. Sameh and D. J. Kuck. On stable parallel linear system solvers. J. ACM, 25(1):81-91.

[11] O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of linear equations with pardiso. Journal of Future Generation Computer Systems, 20(3).

Appendix A
Mathematical Description of Key Strategies

In the following sections, we outline the algorithms corresponding to the six (RSS,DFS) combinations supported in SPIKE 1.0. Since OIS is always 3 in the current release, and since BiCGStab is a well-documented method, we will not explain it here. The following descriptions assume four MPI processes.

(RSS, DFS, 3): Refine the solution of $Ax = f$ using the BiCGStab iterative solver

    solve Ax = f via preconditioned BiCGStab (with preconditioner M);
        solve Mz = r using (RSS,DFS);
    end

The exact SPIKE factorization is $A = DS$. Each computational scheme, however, only produces an approximation $\tilde{D}$ of $D$ and $\tilde{S}$ of $S$. In other words, for solving $Ax = f$ via an iterative scheme we use $M = \tilde{D}\tilde{S}$ as a preconditioner. Here $A = M + R$, where $R$ is a correction term. The preconditioner $M$ is defined in Table A.1 for each (RSS,DFS) pair.

Table A.1: Preconditioners for different schemes

    (RSS,DFS)   Preconditioner
    TA          $M_{TA} = D_{TA} S_{TA}$
    TU          $M_{TU} = D_{TU} S_{TU}$
    FL          $M_{FL} = D_{FL} S_{FL}$
    RL          $M_{RL} = D_{RL} S_{RL}$
    RP          $M_{RP} = D S$
    EA          $M_{EA} = D_{EA} S_{EA}$

Note that $D_{TA} = D_{EA}$ and $D_{FL} = D_{RL}$. The reduced system in FL is solved iteratively without forming the coefficient matrix explicitly. In EA, the reduced system is also solved iteratively, but it is formed explicitly. The details of how the diagonal and spike systems are treated are given in the following sections. Throughout, we present the solution process of $Az = r$, in which $z$ is the action $M^{-1}r$.

A.1 Az = r via TU

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1.

\[
A = \begin{pmatrix} A_1 & B_1 & & \\ C_2 & A_2 & B_2 & \\ & C_3 & A_3 & B_3 \\ & & C_4 & A_4 \end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix}, \qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{pmatrix}
\]

Figure A.1: Illustration of the partitioning of the linear system. Block row $j$ resides on MPI process $j$ ($j = 1, 2, 3, 4$).

The TU scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting if needed):
\[ L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2, 3, \qquad U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3, 4. \]

2. Compute the tips of the spikes $V$, $W$ in Figure A.2 as follows. Solve for $V_j^{(b)}$, the bottom $m \times m$ block of $V_j$, in
\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad \text{for } j = 1, 2, 3, \]
and solve for $W_j^{(t)}$, the top $m \times m$ block of $W_j$, in
\[ U_j L_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad \text{for } j = 2, 3, 4. \]
This process is described in detail in Figure A.3.

\[
S = \begin{pmatrix} I & V_1 & & \\ W_2 & I & V_2 & \\ & W_3 & I & V_3 \\ & & W_4 & I \end{pmatrix}
\]

Figure A.2: SPIKE matrix. Block row $j$ resides on MPI process $j$.

Figure A.3: The bottom of the $V_j$ spike can be computed using only the bottom $m \times m$ blocks of $L$ and $U$. Similarly, the top of the $W_j$ spike may be obtained if one performs the UL factorization.

3. Modify the RHS by solving
\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad U_j L_j g_j = r_j \quad (j = 3, 4). \]

4. Solve the truncated reduced system (block diagonal) via a direct scheme, where each block has the following form:
\[
\begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
= \begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix} \quad (j = 1, 2, 3).
\]

5. Solve
\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]
using the LU or UL factorization of $A_j$ ($j = 1, 2, 3, 4$; $C_1 = 0$; and $B_4 = 0$).
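To make the block solve in the truncated reduced system concrete, the sketch below treats the simplest case $m = 1$, where $V_j^{(b)}$ and $W_{j+1}^{(t)}$ are scalars and the block can be solved in closed form by eliminating $z_j^{(b)}$. This is an illustration under that assumption only; the function name is hypothetical, and a real implementation would work with $m \times m$ blocks and a dense solver.

```c
#include <assert.h>
#include <stddef.h>

/* Solve the 2-by-2 system
 *     [ 1  v ] [ zb ]   [ gb ]
 *     [ w  1 ] [ zt ] = [ gt ]
 * which is one block of the truncated reduced system for m = 1.
 * v plays the role of V_j^(b), w the role of W_{j+1}^(t).
 * Returns 0 on success and -1 if 1 - w*v is (numerically) zero. */
int solve_truncated_block(double v, double w, double gb, double gt,
                          double *zb, double *zt)
{
    assert(zb != NULL && zt != NULL);
    double det = 1.0 - w * v;                 /* scalar form of I - W V */
    double mag = (det < 0.0) ? -det : det;
    if (mag < 1e-14) return -1;               /* block is singular */
    *zt = (gt - w * gb) / det;                /* eliminate zb from the second row */
    *zb = gb - v * (*zt);                     /* back-substitute into the first row */
    return 0;
}
```

Because the truncated system is block diagonal, each block of this form can be solved independently of the others.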

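A.2 Az = r via FL

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1. The FL scheme consists of the following steps:

1. Compute the LU factorization without pivoting (apply diagonal boosting, if needed):
\[ L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2, 3, 4. \]

2. Modify the RHS by solving
\[ L_j U_j g_j = r_j \quad (j = 1, 2, 3, 4). \]

3. Solve the reduced system iteratively:
\[
\begin{pmatrix}
I_m & V_1^{(b)} & & & & \\
W_2^{(t)} & I_m & & V_2^{(t)} & & \\
W_2^{(b)} & & I_m & V_2^{(b)} & & \\
 & & W_3^{(t)} & I_m & & V_3^{(t)} \\
 & & W_3^{(b)} & & I_m & V_3^{(b)} \\
 & & & & W_4^{(t)} & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)} \end{pmatrix}
\]
where the actions of the multiplications with $W_j^{(t)}$, $W_j^{(b)}$, $V_j^{(t)}$ and $V_j^{(b)}$ are realized via
\[
\begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1} \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_j, \quad
\begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1} \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_j, \quad
\begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1} \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_j, \quad
\begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1} \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_j,
\]
respectively. This requires solving systems involving $A_j$ using the previously computed LU factorizations.

4. Solve
\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]
using the LU factorization of $A_j$ ($j = 1, 2, 3, 4$; $C_1 = 0$; and $B_4 = 0$).

A.3 Az = r via RL/RP

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1. The RL/RP scheme consists of the following steps:

1. Compute the LU factorization with pivoting (RP) or without pivoting (RL); if no pivoting is used, apply diagonal boosting if needed:
\[ L_j U_j \leftarrow P_j A_j \quad \text{for } j = 1, 2, 3, 4 \quad (P_j = I \text{ for RL}). \]

2. Solve for $V_j$:
\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad \text{for } j = 1, 2, 3. \]

3. Solve for $W_j$:
\[ L_j U_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad \text{for } j = 2, 3, 4. \]

4. Modify the RHS by solving
\[ L_j U_j g_j = r_j \quad (j = 1, 2, 3, 4). \]

5. Form the reduced system
\[
\begin{pmatrix}
I_m & & V_1^{(t)} & & & & & \\
 & I_m & V_1^{(b)} & & & & & \\
 & W_2^{(t)} & I_m & & V_2^{(t)} & & & \\
 & W_2^{(b)} & & I_m & V_2^{(b)} & & & \\
 & & & W_3^{(t)} & I_m & & V_3^{(t)} & \\
 & & & W_3^{(b)} & & I_m & V_3^{(b)} & \\
 & & & & & W_4^{(t)} & I_m & \\
 & & & & & W_4^{(b)} & & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(t)} \\ z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)} \\ z_4^{(b)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(t)} \\ g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)} \\ g_4^{(b)} \end{pmatrix}
\]
and partition it into two halves as follows:
\[
\begin{pmatrix} \tilde{A}_1 & \tilde{B}_1 \\ \tilde{C}_2 & \tilde{A}_2 \end{pmatrix}
\begin{pmatrix} \tilde{z}_1 \\ \tilde{z}_2 \end{pmatrix}
= \begin{pmatrix} \tilde{g}_1 \\ \tilde{g}_2 \end{pmatrix},
\]
where $\tilde{z}_1$ collects $z_1^{(t)}, z_1^{(b)}, z_2^{(t)}, z_2^{(b)}$, $\tilde{z}_2$ collects $z_3^{(t)}, z_3^{(b)}, z_4^{(t)}, z_4^{(b)}$, and $\tilde{g}_1$, $\tilde{g}_2$ are partitioned likewise.

6. Solve for $\tilde{V}_1$ and $\tilde{W}_2$ in
\[
\tilde{A}_1 \tilde{V}_1 = \begin{pmatrix} 0 \\ 0 \\ \tilde{B}_1 \end{pmatrix}, \qquad
\tilde{A}_2 \tilde{W}_2 = \begin{pmatrix} \tilde{C}_2 \\ 0 \\ 0 \end{pmatrix}.
\]

7. Modify the RHS:
\[ \tilde{h}_1 = \tilde{A}_1^{-1} \tilde{g}_1 \qquad \text{and} \qquad \tilde{h}_2 = \tilde{A}_2^{-1} \tilde{g}_2. \]

8. Solve the reduced system via a direct scheme:
\[
\begin{pmatrix} I_m & \tilde{V}_1^{(b)} \\ \tilde{W}_2^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} \tilde{z}_1^{(b)} \\ \tilde{z}_2^{(t)} \end{pmatrix}
= \begin{pmatrix} \tilde{h}_1^{(b)} \\ \tilde{h}_2^{(t)} \end{pmatrix}.
\]

9. Retrieve $\tilde{z}_1$ and $\tilde{z}_2$:
\[ \tilde{z}_1 = \tilde{h}_1 - \tilde{V}_1 \tilde{z}_2^{(t)}, \qquad \tilde{z}_2 = \tilde{h}_2 - \tilde{W}_2 \tilde{z}_1^{(b)}. \]

10. Retrieve $z_j$ ($j = 1, 2, 3, 4$):
\[ z_j = g_j - V_j z_{j+1}^{(t)} - W_j z_{j-1}^{(b)} \quad (V_4 = 0 \text{ and } W_1 = 0). \]

A.4 Az = r via TA

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.4.

\[
A = \begin{pmatrix} A_1 & B_1 & \\ C_2 & A_2 & B_2 \\ & C_3 & A_3 \end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix}, \qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}
\]

Figure A.4: Illustration of the partitioning of the linear system. Block row 1 resides on process 1, block row 2 on processes 2 and 4, and block row 3 on process 3.

The TA scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting, if needed):
\[ L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2 \ (\text{processes } 1, 2), \qquad U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3 \ (\text{processes } 4, 3). \]

2. Solve for $V_j^{(b)}$, the bottom $m \times m$ block of $V_j$:
\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad \text{for } j = 1, 2. \]

3. Solve for $W_j^{(t)}$, the top $m \times m$ block of $W_j$:
\[ U_j L_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad \text{for } j = 2, 3. \]
This process is described in detail in Figure A.3.

4. Modify the RHS by solving
\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad U_j L_j g_j = r_j \quad (j = 3). \]

5. Solve the truncated reduced system (block diagonal) via a direct scheme, where each block has the following form:
\[
\begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
= \begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix} \quad (j = 1, 2).
\]

6. Solve
\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]
using the LU or UL factorization of $A_j$ ($j = 1, 2, 3$; $C_1 = 0$; and $B_3 = 0$).

A.5 Az = r via EA

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.4. The EA scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting if needed):
\[ L_j U_j \leftarrow A_j \quad \text{for } j = 1, 2 \ (\text{processes } 1, 2), \qquad U_j L_j \leftarrow A_j \quad \text{for } j = 2, 3 \ (\text{processes } 4, 3). \]

2. Solve for $V_j$:
\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad \text{for } j = 1, 2. \]

3. Solve for $W_j$:
\[ U_j L_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad \text{for } j = 2, 3. \]

4. Modify the RHS by solving
\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad U_j L_j g_j = r_j \quad (j = 3). \]

5. Solve the reduced system via preconditioned BiCGStab:
\[
\begin{pmatrix}
I_m & V_1^{(b)} & & \\
W_2^{(t)} & I_m & & V_2^{(t)} \\
W_2^{(b)} & & I_m & V_2^{(b)} \\
 & & W_3^{(t)} & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \end{pmatrix}
\]
with a truncated preconditioner
\[
M_r = \begin{pmatrix}
I_m & V_1^{(b)} & & \\
W_2^{(t)} & I_m & & \\
 & & I_m & V_2^{(b)} \\
 & & W_3^{(t)} & I_m
\end{pmatrix}.
\]

6. Solve
\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]
using the LU or UL factorization of $A_j$ ($j = 1, 2, 3$; $C_1 = 0$; and $B_3 = 0$).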
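The final retrieval step shared by the schemes above first corrects the local right-hand side with the neighbouring interface values and then solves with $A_j$. The sketch below shows only the right-hand-side correction, under stated assumptions: the function name and argument layout are hypothetical, $B_j$ and $C_j$ are taken to be dense $m \times m$ blocks stored row-major, and a NULL pointer stands for a zero coupling block (the first and last partitions).

```c
#include <assert.h>
#include <stddef.h>

/* In-place update of the local right-hand side for the retrieval step:
 *     r <- r - [0; 0; B] * zt_next - [C; 0; 0] * zb_prev
 * r has length n. B and C are m-by-m blocks stored row-major.
 * zt_next is the top tip of the next partition's solution (length m),
 * zb_prev is the bottom tip of the previous partition's solution (length m).
 * Pass C == NULL on the first partition and B == NULL on the last one. */
void retrieval_rhs(double *r, size_t n, size_t m,
                   const double *B, const double *zt_next,
                   const double *C, const double *zb_prev)
{
    assert(r != NULL && m <= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t k = 0; k < m; ++k) {
            if (C != NULL) {
                r[i] -= C[i * m + k] * zb_prev[k];          /* top m rows */
            }
            if (B != NULL) {
                r[n - m + i] -= B[i * m + k] * zt_next[k];  /* bottom m rows */
            }
        }
    }
}
```

After this update, the local solution is obtained by a solve with the stored factors of $A_j$, which is not shown here.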

Appendix B
How Spike Adapt Works

B.1 Why is Spike Adapt Necessary?

Spike Core is a poly-algorithm implementing many different strategies. The RSS, DFS, and OIS parameters can take many different values, leading to numerous possibilities. Selecting an optimal strategy requires detailed knowledge of Spike Core. For example, what strategies are best when the matrix is not diagonally dominant? How does the matrix bandwidth affect the choice of strategy? Spike Adapt relieves users of questions like these. It is designed to select an optimal strategy based on the following matrix characteristics: matrix size, bandwidth, sparsity, and diagonal dominance. It also takes the number of MPI processes and the type of partitioning into account when making a decision (Table 2.4).

B.2 How Does Spike Adapt Work?

Spike Adapt automatically sets the RSS, DFS, and OIS parameters when the autoadapt element of the spike_param structure is set to true. It currently supports six Spike Core strategies (RSS,DFS): TU, RL, RP, FL, TA, and EA. Note that OIS is basically orthogonal to (RSS,DFS). Moreover, for SPIKE 1.0, OIS is always set to 3 (BiCGStab), and FL is always chosen when the input matrix is in CSR format.

Spike Adapt uses a three-step selection process. It first checks the type of matrix partitioning and the number of MPI processes to determine which strategies are allowed (Table 2.4). Next, it performs a grid lookup based on the matrix size, bandwidth, and diagonal dominance (Figure B.1). The optimal Spike Core strategy for some matrices is best determined by a grid lookup. However, if the grid does not enclose the current matrix, Spike Adapt evaluates performance models for the relevant Spike Core strategies and decides which is best. If neither the grid lookup nor the performance models can make a selection, a Spike Core strategy is chosen arbitrarily. This should be rare, however, and usually indicates a problem in Spike Adapt.
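The fallback order described above (grid lookup, then performance models, then an arbitrary choice) can be sketched as a simple cascade. This is not Spike Adapt's implementation: the function-pointer interface, the names, and the stand-in selectors are all hypothetical, and the first step (restricting the candidates to the strategies allowed by Table 2.4) is assumed to have happened already. The returned values 1, 2 and 3 mirror the informational codes in Table 7.5.

```c
#include <assert.h>
#include <stddef.h>

/* A selector writes a strategy label and returns 1 if it can decide, else 0. */
typedef int (*strategy_selector)(const void *matrix_info, const char **strategy);

/* Hypothetical cascade: try the grid lookup, then the performance models,
 * and otherwise fall back to an arbitrary allowed strategy.
 * Returns 1, 2 or 3 to report how the strategy was selected. */
int select_strategy(strategy_selector grid_lookup,
                    strategy_selector performance_models,
                    const char *fallback,
                    const void *matrix_info,
                    const char **strategy)
{
    assert(strategy != NULL && fallback != NULL);
    if (grid_lookup != NULL && grid_lookup(matrix_info, strategy)) {
        return 1;   /* selected by grid lookup */
    }
    if (performance_models != NULL && performance_models(matrix_info, strategy)) {
        return 2;   /* selected by performance models */
    }
    *strategy = fallback;
    return 3;       /* selected arbitrarily */
}

/* Stand-in selectors for illustration: one never decides, one always picks FL. */
int selector_declines(const void *matrix_info, const char **strategy)
{
    (void)matrix_info;
    (void)strategy;
    return 0;
}

int selector_picks_fl(const void *matrix_info, const char **strategy)
{
    (void)matrix_info;
    *strategy = "FL";
    return 1;
}
```

With both stand-ins declining, the cascade returns 3 and the fallback label, which corresponds to the arbitrary selection that the text describes as a sign of a problem.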

Figure B.1: This schematic illustrates how Spike Adapt might select an optimal Spike Core strategy using grid lookup. The horizontal and vertical axes represent two of the relevant matrix characteristics (e.g., matrix size and bandwidth). If the grid encloses this matrix, an optimal Spike Core strategy, represented by the different colors, is selected based on proximity.

B.3 Spike Adapt Return Codes

In the event of an error, Spike Adapt sets info=-1 and returns to Spike Core. The actual error code is stored in the ierr_spike_adapt parameter of the spike_param structure. Spike Adapt error codes range from -499 to -400. The meaning of each error code is shown below. Spike Adapt sets info=0 if it is able to select a Spike Core strategy.

In general, knowing how Spike Adapt selects a particular Spike Core strategy is unimportant. However, this knowledge can be useful if the user suspects that Spike Adapt is choosing a suboptimal strategy. The spike_adapt_status parameter of the spike_param structure tells how the Spike Core strategy was selected:

    spike_adapt_status   Description
       1                 Grid lookup used to select Spike Core strategy
       2                 Performance models used to select Spike Core strategy
       3                 The Spike Core strategy was selected arbitrarily
    -402                 Spike Adapt could not allocate memory
    -403                 SPIKE_ADAPT_DATA environment variable not set
    -404                 Directory containing Spike Adapt performance models not found
    -405                 Spike Adapt model files not found
    -406                 Could not open Spike Adapt model files
    -407                 Error reading Spike Adapt model files

Table B.1: Spike Adapt Return Codes

As mentioned above, arbitrary selection usually indicates a Spike Adapt problem that should be reported to technical support.


More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

F08BEF (SGEQPF/DGEQPF) NAG Fortran Library Routine Document

F08BEF (SGEQPF/DGEQPF) NAG Fortran Library Routine Document NAG Fortran Library Routine Document Note. Before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

ARPACK. A c++ implementation of ARPACK eigenvalue package.

ARPACK. A c++ implementation of ARPACK eigenvalue package. !"#!%$& '()%*,+-. (/10)24365 78$9;:!?A@ B2CED8F?GIHKJ1GML NPORQ&L8S TEJUD8F"VXWYD8FFZ\[]O ^ L8CETIFJ1T_F ; Y`baU c?d%de%c +-)?+-=fc%cgd hjikmlnio

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information