Adaptive Spike-Based Solver 1.0 User Guide



Intel® Adaptive Spike-Based Solver 1.0 User Guide

Contents

1 Overview
    1.1 A Quick What, Why, and How
    1.2 A Hello World Example
    1.3 Future Developments
    1.4 User Guide Outline
2 The SPIKE Subroutine
    2.1 Setting the environment
    2.2 Autoadapt
    2.3 Data
    2.4 Disabling Spike_Adapt
    2.5 Running the spike_adapt.exe command
3 Separate calls
4 Banded Preconditioner
5 Manual Data Partition
    5.1 Dense Banded Format
    5.2 Sparse CSR Format
6 SPIKE Examples
    6.1 Example 1: Automatic Partitioning
    6.2 Example 2: Automatic Partitioning and Multiple RHS
    6.3 Example 3: Automatic Partitioning and Multiple RHS with Separate Factorization and Solution
    6.4 Example 4: Manual Partitioning
    6.5 Example 5: Automatic Partitioning Using the CSR Input Format
    6.6 Example 6: Automatic Partitioning Using the CSR Input Format with a Preconditioner
    6.7 Toeplitz Matrix Example
    6.8 Sparse Banded Matrix Example
    6.9 Calling SPIKE from C Programs
7 Reference guide
    7.1 SPIKE 1.0 directory structure
    7.2 SPIKE and ScaLAPACK
    7.3 Spike_Default
    7.4 Spike
    7.5 Spike_Begin
    7.6 Spike_Preprocess
    7.7 Spike_Process
    7.8 Spike_End
    7.9 spike_param details
    7.10 matrix_data details
    7.11 info details
Bibliography
A Mathematical Description of Key Strategies
    A.1 Az = r via TU
    A.2 Az = r via FL
    A.3 Az = r via RL/RP
    A.4 Az = r via TA
    A.5 Az = r via EA
B How Spike_Adapt Works
    B.1 Why is Spike_Adapt Necessary?
    B.2 How Does Spike_Adapt Work?
    B.3 Spike_Adapt Return Codes
C MPI Compatibility Library

Chapter 1 Overview

1.1 A Quick What, Why, and How

Intel® Adaptive Spike-Based Solver (SPIKE for short) is a software package for solving large banded linear systems on parallel computers. Solving banded linear systems is a crucial step in many high-performance computing (HPC) applications. For example, such systems frequently arise after a general sparse matrix is reordered in some fashion. In other instances, banded systems serve as effective preconditioners for general sparse systems that are solved via iterative methods.

Existing parallel software that uses direct methods for banded matrices is mostly based on LU factorization. In contrast, SPIKE is based on a different decomposition that increases arithmetic cost but naturally leads to lower communication overhead. This is advantageous on modern parallel architectures, where arithmetic performance has outpaced memory and network performance. Thus, SPIKE offers HPC users a new and valuable tool.

The central idea behind SPIKE is a decomposition of a matrix [10, 4, 6, 2, 5, 8, 9] that differs from the common LU decomposition. The LU decomposition represents a matrix $A$ as a product of lower and upper triangular matrices, $A = LU$, so that solving $AX = F$ reduces to solving two triangular systems, $LG = F$ and $UX = G$. SPIKE instead uses a decomposition motivated by the important case where $A$ is a banded matrix. Figure 1.1 shows a banded matrix and its partitioning for parallel processing.

$$
A =
\begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix}
$$

Figure 1.1: A banded matrix with a conceptual partition

The decomposition takes the form $A = DS$. Here, $D$ is the block-diagonal matrix consisting of all the $A_j$ blocks (see Figure 1.1) and $S = D^{-1}A$, assuming for the moment that the $A_j$ blocks are nonsingular. Matrix $S$ has the structure of an identity matrix with some extra "spikes", hence the name of the package (Figure 1.2).

$$
\underbrace{\begin{pmatrix}
A_1 & B_1 &     \\
C_2 & A_2 & B_2 \\
    & C_3 & A_3
\end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix}
A_1 &     &     \\
    & A_2 &     \\
    &     & A_3
\end{pmatrix}}_{D}
\underbrace{\begin{pmatrix}
I   & V_1 &     \\
W_2 & I   & V_2 \\
    & W_3 & I
\end{pmatrix}}_{S}
$$

Figure 1.2: Decomposition where $A = DS$, $S = D^{-1}A$

In practice, $D$ and $S$ may not be obtained exactly, either intentionally or due to limitations such as singularity. Instead, the numerical algorithm yields $\tilde{D}$ and $\tilde{S}$ that resemble the structures of $D$ and $S$ in Figure 1.2 and satisfy an equation of the form $A = \tilde{D}\tilde{S} + R$ for some residual $R$. Even when $R$ is nonzero, it is by design small in some sense. The basic method employed in SPIKE is as follows:

    solve $(\tilde{D}\tilde{S} + R)X = F$ via a preconditioned iterative method (with preconditioner $M = \tilde{D}\tilde{S}$);
        solve systems of the form $MZ = Y$ for varying $Y$'s;
    end

The key step of this iterative method is the solution of systems with the $\tilde{D}\tilde{S}$ matrix. Solving $AX = F$ can now be seen as involving three conceptual steps:

1. Solving the block-diagonal system $\tilde{D}G = F$. Because $\tilde{D}$ consists of decoupled systems, one for each diagonal block $\tilde{A}_i$, these systems can be solved in parallel without synchronization between them. A number of strategies based on the LU decomposition of each $\tilde{A}_i$ can be applied here. These include LU without pivoting, LU with pivoting, and a combination of LU and UL decompositions with or without pivoting.

2. Solving the system $\tilde{S}Y = G$. This system has the useful characteristic that it is also largely decoupled: except for a reduced system near the junctions between the identity blocks, the remaining unknowns are independent. The natural way to tackle this system is to first solve the reduced system using parallel algorithms that require interprocessor communication, and then to retrieve the rest of the solution without further interprocess communication. Here again, a number of different strategies exist for solving the reduced system.

3. Depending on how $\tilde{D}$ and $\tilde{S}$ were obtained earlier, which is related to the exact strategy used in the two previous steps, $R$ can be zero or nonzero. If $R$ is zero, then the $Y$ obtained is the desired solution to $AX = F$. Otherwise, some corrections must be computed. This can be accomplished by a number of standard iterative methods such as iterative refinement, GMRES, or BiCGStab.

All in all, a large variety of strategies can be applied based on the basic decomposition $A = DS$ and the realization of the approximations $\tilde{D}$ and $\tilde{S}$; i.e., $A = \tilde{D}\tilde{S} + R$, in which $R$ is a correction and $M = \tilde{D}\tilde{S}$ is an effective preconditioner for a variety of iterative schemes. SPIKE offers a number of choices to solve $AX = F$ based on the framework of this decomposition.

SPIKE can compute the solution of $AX = F$ by a single call in which the specific strategy is selected automatically or manually. A user can also solve a system by issuing several step-by-step calls, similar to separating the LU factorization and the forward/backward substitutions in LAPACK [1]. In this case, the user can handle more interesting situations, including the solution of different right-hand sides (RHS) at different times, $AX_i = F_i$, while amortizing the one-time computation costs related to the same matrix $A$.

To summarize, Intel® Adaptive Spike-Based Solver 1.0 aims to solve $AX = F$ in parallel, where $A$ is a banded matrix. It currently supports users who express parallelism with MPI. The algorithmic framework is based on a decomposition of the form $A = \tilde{D}\tilde{S} + R$. This framework allows many different strategies that can exploit special properties of the underlying processor architecture, the network, and the numerical nature of the input matrix $A$.

SPIKE 1.0 consists of two main layers: a computational layer called Spike_Core and a strategy selection layer called Spike_Adapt. Spike_Core consists of the linear algebra software needed to support the different solution strategies, whereas Spike_Adapt is an independent layer that selects an efficient strategy based on the characteristics of the input matrix $A$ and the underlying computer system. By default, Spike_Adapt automatically picks a strategy on the user's behalf. Nevertheless, expert users have the option to pick a strategy manually. A strategy is defined by algorithmic choices for each of the three steps (involving $\tilde{D}$, $\tilde{S}$, and, as needed, a nonzero $R$) outlined previously.

A user can ask for the solution to the problem $AX = F$ via a single call to SPIKE. This is covered in Chapter 2. Alternatively, this single function call can be replaced by separate calls, similar to separating the calls to triangular factorization and the subsequent triangular solves. This added complexity is especially worthwhile when solutions with different RHS for the same matrix $A$ are needed at different times, because the common preprocessing cost pertaining to $A$ can then be amortized. Invoking SPIKE with multiple function calls is covered in Chapter 3.

Finally, concerning data distribution, the user can provide the complete matrix $A$ and the RHS in the MPI master process and rely on SPIKE to distribute the data to the remaining MPI processes. Alternatively, the user can distribute the data manually. Chapter 5 covers the data distribution options in greater detail.
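As an illustration of the decomposition $A = DS$ described in this section, the following minimal Fortran sketch builds a small tridiagonal matrix, splits it into two partitions, and forms $S = D^{-1}A$ explicitly. It does not call SPIKE. The program name, the helper `solve_in_place`, and the 6-by-6 test matrix are assumptions chosen for illustration, and SPIKE itself does not necessarily form $S$ this way.

```fortran
program spike_decomposition_demo
  implicit none
  integer, parameter :: n = 6, m = 3          ! matrix size, partition size
  double precision, parameter :: tol = 1.0d-12
  double precision :: a(n,n), d(n,n), s(n,n)
  integer :: i

  ! Test matrix (an assumption): 6 on the diagonal, -1 on both off-diagonals
  a = 0.0d0
  do i = 1, n
     a(i,i) = 6.0d0
     if (i > 1) a(i,i-1) = -1.0d0
     if (i < n) a(i,i+1) = -1.0d0
  end do

  ! D keeps only the diagonal blocks A_1 and A_2
  d = 0.0d0
  d(1:m,1:m)     = a(1:m,1:m)
  d(m+1:n,m+1:n) = a(m+1:n,m+1:n)

  ! S = D^{-1} A, one block row at a time (the block rows are independent)
  s = a
  call solve_in_place(d(1:m,1:m),     s(1:m,:))
  call solve_in_place(d(m+1:n,m+1:n), s(m+1:n,:))

  do i = 1, n
     print '(6f9.4)', s(i,:)
  end do

  ! Check 1: the product D*S reproduces A
  if (maxval(abs(matmul(d, s) - a)) > tol) stop 1
  ! Check 2: outside the identity blocks only the spike columns are nonzero,
  ! i.e. column m+1 in block row 1 (V_1) and column m in block row 2 (W_2)
  if (maxval(abs(s(1:m,m+2:n))) > tol) stop 2
  if (maxval(abs(s(m+1:n,1:m-1))) > tol) stop 3
  print *, 'A = D*S verified'

contains

  ! Solve blk * x = rhs by Gaussian elimination without pivoting and
  ! overwrite rhs with x (adequate here because blk is diagonally dominant)
  subroutine solve_in_place(blk, rhs)
    double precision, intent(in)    :: blk(:,:)
    double precision, intent(inout) :: rhs(:,:)
    double precision :: w(size(blk,1), size(blk,2)), factor
    integer :: k, r, nb

    w  = blk
    nb = size(blk, 1)
    do k = 1, nb                       ! forward elimination
       do r = k + 1, nb
          factor    = w(r,k) / w(k,k)
          w(r,k:nb) = w(r,k:nb) - factor * w(k,k:nb)
          rhs(r,:)  = rhs(r,:)  - factor * rhs(k,:)
       end do
    end do
    do k = nb, 1, -1                   ! back substitution
       do r = k + 1, nb
          rhs(k,:) = rhs(k,:) - w(k,r) * rhs(r,:)
       end do
       rhs(k,:) = rhs(k,:) / w(k,k)
    end do
  end subroutine solve_in_place

end program spike_decomposition_demo
```

The printed matrix shows identity blocks on the diagonal plus one nonzero column on each side of the partition boundary, which are the spikes $V_1$ and $W_2$ of Figure 1.2 for the two-partition case.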

1.2 A Hello World Example

This example solves a 32-by-32 tridiagonal Toeplitz system with 6 on the diagonal, -1 on the two off-diagonals, and the constant vector 1 as the RHS. That is, solve for $X$ where

$$
\begin{pmatrix}
 6 & -1     &        &    \\
-1 &  6     & \ddots &    \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  6
\end{pmatrix}
X =
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
$$

A single call to the SPIKE subroutine takes care of data distribution and strategy selection. The user only needs to set a few global parameters such as the number of processors, the local MPI rank, and the structure and bandwidth of the matrix. The matrix and RHS data are stored initially on the MPI master process (i.e., process 0). The source code of hello_world.f90 is listed in Figure 1.3.

    INCLUDE 'spike.fi'
    program hello_world_code
      use spike_module
      use mpi
      ! before the MPI_INIT calling sequences
      integer :: i, rank, nb_procs, code
      integer :: info
      type(spike_param) :: pspike   ! Spike parameter data structure
      type(matrix_data) :: mat      ! Spike matrix data structure
      double precision, dimension(:,:), allocatable :: f   ! rhs

      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! set up Spike parameter data structure on all processors
      pspike%nbprocs = nb_procs ; pspike%rank = rank
      call SPIKE_DEFAULT(pspike)     ! default values for pspike
      pspike%autoadapt = .true.      ! autoadapt is on

      ! set up Spike matrix data parameters on all processors
      mat%format = 'D' ; mat%astru = 'G' ; mat%diagdo = 'Y'
      mat%n = 32 ; mat%kl = 1 ; mat%ku = 1

      ! create input matrix and rhs on Processor 0
      if (rank == 0) then
        allocate(mat%a(1:mat%kl+mat%ku+1, mat%n))
        allocate(f(1:mat%n, 1:1))
        mat%a(1,:) = -1.0d0 ; mat%a(2,:) = 6.0d0 ; mat%a(3,:) = -1.0d0
        f = 1.0d0
      end if

      ! one call to Spike for solving Ax=f
      call SPIKE(pspike, mat, f, info)

      ! solution is in f which resides in Processor 0
      if (info >= 0) then
        if (rank == 0) then
          do i = 1, mat%n
            print *, i, f(i,1)
          end do
        end if
      end if

      call MPI_FINALIZE(code)
    end program hello_world_code

Figure 1.3: A very simple example

To create the executable, compile the source program and link it with the Intel® Adaptive Spike-Based Solver 1.0 libraries, which also provide the BLAS and LAPACK libraries. Assuming that SPIKE has been installed in a directory called <SPIKE directory> and that the source program is called hello_world.f90:

    mpiifort hello_world.f90 -o hello_world.exe \
        -I<SPIKE directory>/include \
        -L<SPIKE directory>/lib/<arch> \
        -lspike -lspike_mpi_comm \
        -lspike_adapt -lspike_adapt_de -lspike_adapt_grid_f \
        -lmkl_solver -lmkl_lapack -lmkl -lguide -lpthread

where mpiifort is the Fortran compiler driver for the Intel MPI Library and <arch> is either 64, for the IA-64 architecture, or em64t, for the Intel® 64 architecture. A run of the resulting executable hello_world.exe may look like

    mpirun -np 4 hello_world.exe

and the following is the output of the run (the leading digits of the timings and of the residual are missing from this copy):

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                EA3
    TIME FOR PARTITIONING         ...e-02
    TIME FOR SPIKE BANDED FACT    ...e-02
    TIME FOR SPIKE BANDED SOLV    ...e-03
    TIME FOR SPIKE (FACT+SOLV)    ...e-02
    RESIDUAL                      ...e-16
    # Outside iterations : 0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

1.3 Future Developments

Enhancements to SPIKE 1.0 will be made in several orthogonal areas: the kinds of sparse matrices handled via added utility functions, the set of solution strategies it encompasses, and the variety of parallel environments it supports.

When $A$ is a general sparse matrix, reordering can often transform it either into a banded matrix or into a low-rank perturbation of a banded matrix. We intend to offer utilities for matrix reordering and capabilities to handle more general sparse matrices. In addition to the current LU-based strategies for handling the diagonal blocks of the $D$ matrix, we intend to add other strategies (e.g., based on least squares) to handle very ill-conditioned systems. Other data distribution strategies that exhibit better load-balancing properties will also be added. MPI is the only parallel environment currently supported, but alternative parallel environments may be considered in future releases.

1.4 User Guide Outline

The remainder of this guide describes the usage of SPIKE 1.0 in greater detail.

Chapter 2 focuses on invoking SPIKE with a single function call to obtain the solution $X$ to the equation $AX = F$, where both $A$ and $F$ are stored in the MPI master process.

Chapter 3 describes how to solve $AX = F$ using multiple SPIKE functions. The motivating example is the solution of multiple RHS, $AX_k = F_k$, where the $F_k$ become available at different times. This way, the step that performs the setup related to $A$ is done just once. We assume $A$ and $F_k$ are initially stored in the MPI master process.

Chapter 5 describes how the user can distribute $A$ and $F$ across multiple MPI processes. This avoids the overhead of data distribution and allows the solver to use the aggregate memory of a distributed-memory parallel computer. SPIKE supports several distribution schemes, including ScaLAPACK's format. Thus, ScaLAPACK programs can be modified to use SPIKE with very little effort.

Chapter 6 presents a number of SPIKE examples illustrating its uses.

Chapter 7 provides detailed reference material on the SPIKE directory structure and each SPIKE function.

Chapter 2 The SPIKE Subroutine

SPIKE 1.0 contains two main components. Spike_Core implements the underlying numerical methods, including for example the factorization of the $\tilde{D}$ system, the solution of the $\tilde{S}$ system in $A = \tilde{D}\tilde{S} + R$, and the outer iterations that deal with a nonzero $R$. Spike_Adapt implements a strategy selection method based on information about the underlying architecture, the computer platform, and the linear system in question. The single driver Spike integrates the functionality of these two components and makes it available to the user via a single call. In brief, this driver exercises the strategy selection mechanism and then solves $AX = F$ for $X$, given $A$ and $F$, using the selected strategy.

The user can find out which strategy was chosen by examining several parameters in the program, or by running the standalone executable spike_adapt.exe (at the command line) that comes with SPIKE. The user also has the option of selecting a strategy manually by setting several parameters, but this requires more detailed knowledge of how the strategies work. To this end, this chapter also gives a brief guideline on choosing strategies, but defers to the Appendix for a more mathematical description. The single driver call is

    call Spike(pspike, mat, f, info)

Related details are given in the rest of this chapter.

2.1 Setting the environment

SPIKE provides scripts that automatically initialize the user environment. They are located in the <SPIKE directory>/tools/environment directory, where <SPIKE directory> is the SPIKE main directory after installation. For example, it could be /opt/intel/spike/1.0.

These scripts set the environment variables that are needed to build and run SpikePACK applications. Select the appropriate script for the Linux shell and architecture. For example, to initialize SPIKE for the BASH shell on an Intel® EM64T system, execute the following command:

    > source spikevarsem64t.sh

To initialize SPIKE for CSH on an Itanium® processor system, use the following command:

    > source spikevars64.csh

It is recommended that the initialization command be placed in the appropriate shell startup file in $HOME: .cshrc or .bashrc for the CSH and BASH shells, respectively.

2.2 Autoadapt

As illustrated in the hello world program in Figure 1.3, the type spike_param has many components, but only two need to be set manually by the user in the variable pspike. The two components that need to be set are

    Component  Type     Description
    nbprocs    integer  number of processors (MPI related)
    rank       integer  rank of the local processor (MPI related)

The rest of the components can be set to their defaults by calling the routine Spike_Default. For example,

    call Spike_Default(pspike)

sets those components in the spike_param variable pspike to their default values. These default values are given in Table 2.1. Note that some of these components are inout in nature, which means SPIKE may overwrite the input values as a result of executing the software. The spike_param derived type also contains a host of output components. Refer to Section 7.9 for comprehensive information.

    Component    Type     Default   Description
    autoadapt    logical  .true.    strategy automatically picked if true
    RSS          char     'R'       Reduced System Strategy: R, T, or F
    DFS          char     'P'       Diagonal Factorization Strategy: P, L, U, or A
    OIS          integer  3         Outer Iteration Strategy: 3 (more options in the future)

      The three components above together specify a strategy for solving a
      banded system using the Spike framework. When autoadapt is set to .true.
      (which is the default value), the input values of these three components
      are ignored and overwritten to record the automatically chosen strategy.
      Section 2.4 has more details on manual strategy selection.

    BPS          integer  0         Banded Preconditioner Strategy:
                                      0   user does not specify a banded preconditioner
                                      -1  a banded preconditioner is specified by the user
    threads      integer  1         value of the OpenMP environment variable OMP_NUM_THREADS
                                    if mat%format='S' (i.e., # of threads for the PARDISO
                                    solver on each partition)
    nbit_out     integer  50        max # of outer iterations
    eps_out      double   10^{-7}   accuracy (residual) of the outer iterations
    nbit_in      integer  100       max # of inner iterations
    eps_in       double   10^{-7}   accuracy (residual) of the inner iterations
    nzero        double   10^{-9}   new zero value $0_\epsilon$ for diagonal boosting:
                                    if |pivot| < $0_\epsilon \|A\|_1$ then
                                    pivot <- pivot ± $0_\epsilon \|A\|_1$
    tp           integer  0         data distribution:
                                      0  data in Proc 0
                                      1  data on each Proc (cf. Chapter 5)
    memfree      logical  .false.   deallocate memory for matrix (the case when tp=0)
    residual     logical  .true.    compute the L relative residual norm
    timing       logical  .false.   provide timing information
    comd         logical  .false.   provide detailed running information
    file_output  integer  6         print information to screen if 6, file ID for
                                    spike.output otherwise

Table 2.1: List of input components for the derived type spike_param. Note: RSS, DFS, and OIS are inout, whereas the rest are input only.

2.3 Data

This section explains how to set up the parameters within the matrix_data variable mat used in the calling sequence example. The main purpose of the type matrix_data is to hold the matrix in one of a number of popular representations. In SPIKE 1.0, both the LAPACK banded storage format (without additional storage for pivoting) and the CSR (Compressed Sparse Row) format are supported.

Depending on the value of pspike%tp in the variable pspike passed to the routine, either the mat variable on Processor 0 holds the full original matrix, or the mat variable on each processor holds part of the original matrix. In the former case, Spike_Core partitions the data held on Processor 0 and distributes them to the other processors under the hood. In the latter case, the user needs to place the appropriate part of the matrix on each of the processors manually. Chapter 5 gives the details needed to perform this task. For now, Table 2.2 gives details of the matrix data structure relevant for pspike%tp=0, that is, when the user puts the complete matrix into the mat variable on Processor 0.

    mat%     Type                Distribution    Description
    format   char (in)           global          matrix format: 'D' Dense; 'S' Sparse CSR
    astru    char (in)           global          matrix structure: 'G' General non-symmetric
    diagdo   char (inout)        global          diagonal dominance: 'Y' Yes; 'N' No; 'I' Investigate
    vdiagdo  double (out)        global          SPIKE-computed diagonal dominance value
                                                 if mat%diagdo='I' or pspike%autoadapt=.true.
    n        integer (in)        global          matrix dimension

      The input fields below are for the case mat%format='D'
    kl       integer (inout)     global          # of subdiagonals in matrix
    ku       integer (inout)     global          # of superdiagonals in matrix
    A        double(bwd,mat%n)   rank 0          LAPACK banded matrix format, no extra pivoting
                                                 space; bwd = mat%kl+mat%ku+1

      The input fields below are associated with the sparse CSR format (mat%format='S')
    nbsa     integer             rank 0          # of nonzero matrix elements
    sa       double(mat%nbsa)    rank 0          CSR format, matrix elements
    jsa      integer(mat%nbsa)   rank 0          CSR format, column indices
    isa      integer(mat%n+1)    rank 0          CSR format, start-of-row indices

Table 2.2: List of parameter fields of the matrix_data variable mat. Here all the matrix data are stored in Processor 0.

If space for mat%a on Processor 0 is allocated dynamically, the user may want to have it deallocated automatically by setting pspike%memfree = .true. All the other parameter fields must be declared as global (i.e., common to all processors). Finally, if the matrix data have been defined on rank 0, the RHS parameter should also be defined on rank 0, as described in Table 2.3.
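To make the two storage layouts of Table 2.2 concrete, the sketch below fills a LAPACK-style band array and a set of CSR arrays for a small tridiagonal matrix. It is a standalone illustration that does not call SPIKE. The local names `dense` and `ab` and the 5-by-5 size are assumptions, and the CSR arrays are assumed to use 1-based indices with `isa(n+1) = nbsa + 1`, a common Fortran convention that should be checked against the reference chapter.

```fortran
program storage_formats_demo
  implicit none
  integer, parameter :: n = 5, kl = 1, ku = 1
  double precision :: dense(n,n), ab(kl+ku+1,n)
  double precision, allocatable :: sa(:)
  integer, allocatable :: jsa(:)
  integer :: isa(n+1), nbsa, i, j, k

  ! Small tridiagonal Toeplitz matrix: 6 on the diagonal, -1 next to it
  dense = 0.0d0
  do i = 1, n
     dense(i,i) = 6.0d0
     if (i > 1) dense(i,i-1) = -1.0d0
     if (i < n) dense(i,i+1) = -1.0d0
  end do

  ! LAPACK band storage without extra pivoting rows:
  ! entry A(i,j) of the band is stored in ab(ku+1+i-j, j)
  ab = 0.0d0
  do j = 1, n
     do i = max(1, j-ku), min(n, j+kl)
        ab(ku+1+i-j, j) = dense(i,j)
     end do
  end do

  ! CSR storage (1-based, assumed): nonzeros stored row by row
  nbsa = count(dense /= 0.0d0)
  allocate(sa(nbsa), jsa(nbsa))
  k = 0
  do i = 1, n
     isa(i) = k + 1                 ! position in sa of the first entry of row i
     do j = 1, n
        if (dense(i,j) /= 0.0d0) then
           k = k + 1
           sa(k)  = dense(i,j)      ! value
           jsa(k) = j               ! column index
        end if
     end do
  end do
  isa(n+1) = nbsa + 1

  print '(a)', 'band array ab (rows: superdiagonal, diagonal, subdiagonal):'
  do i = 1, kl+ku+1
     print '(100f6.1)', ab(i,:)
  end do
  print '(a,i0)',     'nbsa = ', nbsa
  print '(a,100i3)',  'isa  = ', isa
  print '(a,100i3)',  'jsa  = ', jsa
  print '(a,100f5.1)', 'sa   = ', sa

  ! Sanity checks on the band layout
  if (any(ab(ku+1,:) /= 6.0d0)) stop 1
  if (ab(1,2) /= -1.0d0 .or. ab(3,1) /= -1.0d0) stop 2
end program storage_formats_demo
```

In the band array, `ab(1,1)` and `ab(kl+ku+1,n)` correspond to positions outside the matrix. This is why the hello world program in Figure 1.3 can simply fill whole rows of `mat%a` with constants.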

    Parameter  Type                         Distribution  Description
    f          double(mat%n,nrhs) (inout)   rank 0        Right-hand side f (in)
                                                          Solution x of Ax=f (out)

    nrhs stands for the # of RHS

Table 2.3: Definition of the RHS (in) and solution (out) stored in rank 0

2.4 Disabling Spike_Adapt

While we recommend that the user set the autoadapt component to .true., it is possible to disable automatic strategy selection by setting the autoadapt component to .false. In this case, the strategy is defined by the values of the three components (RSS, DFS, OIS), which are set to ('R','P',3) by Spike_Default. This section explains what these parameters mean and offers a general guideline on how to set a strategy manually.

Recall that the computational framework of SPIKE is based on the decomposition $A = \tilde{D}\tilde{S} + R$, where the structures of $D$ and $S$ are depicted in Figure 1.2. The generic way used to solve the system $AX = F$ can be described as:

    Solve $AX = F$ by a preconditioned iterative method.
    Use $M$ as preconditioner, where $M = \tilde{D}\tilde{S}$.
    The preconditioning step solves $MZ = Y$.

The three components of a strategy are:

Reduced System Strategy (RSS): The crux of a preconditioned iterative scheme is the solution involving the preconditioner $M$. The key parallel algorithm is the handling of the $\tilde{S}$ matrix. The portion $S_{red}$ of $\tilde{S}$ near the partition boundaries constitutes a reduced system, and the key to solving systems with $\tilde{S}$ lies in the solution of this reduced system. There are several strategies for solving the reduced system:

    R: stands for recursive. A recursive algorithm can be applied to the reduced system.
    F: stands for on the fly. The reduced system can be solved using an iterative method. In this situation, there is no need to compute the $S_{red}$ matrix explicitly, as one only needs to compute the action of $S_{red}$ on vectors. These are computed on the fly, based mostly on the $A$ matrix itself.
    E: stands for explicit. Here the $V_j$ and $W_j$ blocks of the $S$ matrix are explicitly computed. The reduced system is solved in an iterative manner.
    T: stands for truncated. This is based on an exploitation of the special structure of $S$. Should the top and bottom portions of suitable sizes of the $V_j$ and $W_j$ blocks be zero, the solution of the reduced system $S_{red}$ becomes extremely easy. This strategy sets those blocks to zero deliberately (hence truncating the $V_j$ and $W_j$ submatrices) and trades the ease of solving this slightly perturbed $S_{red}$ system for corrective effort elsewhere.

Diagonal Factorization Strategy (DFS): Solving $\tilde{D}\tilde{S}Z = Y$ naturally involves, in one form or another, solutions of systems with the $\tilde{D}$ matrix, which is block diagonal in structure. For SPIKE 1.0, we rely on various direct factorization algorithms to tackle this problem. The strategies here correspond to factorizations of the diagonal block matrices. Note, however, that while these strategies normally correspond to familiar methods designed for dense matrices, they can be overloaded to represent direct sparse matrix factorizations motivated by the corresponding dense versions. For example, in the case of sparse bands, L refers to the factorization provided by the popular package PARDISO [11].

    P: stands for pivoting. This is LU factorization with partial pivoting.
    L: stands for LU. This is LU factorization without pivoting.
    U: stands for UL. This obtains both the LU and UL factorizations, neither with pivoting.
    A: stands for alternate. This alternates from block to block between LU and UL factorizations, without pivoting.

Outer Iteration Strategy (OIS): represents the iterative method used in the outermost layer. An integer value directs the specific choice. The current release, SPIKE 1.0, only supports the BiCGStab iterative scheme, which corresponds to the value 3.

While RSS and DFS are mostly orthogonal, they are not completely so. Indeed, some factorization strategies are motivated by, and consequently applicable only to, particular reduced system strategies. Therefore, not all combinations of RSS and DFS are supported or in fact meaningful. In the current release, the following six combinations of (RSS,DFS) are supported: (T,U), (F,L), (R,L), (R,P), (T,A), (E,A). Moreover, if mat%format='D', the setting of the tp component of the spike_param variable and the number of processors also affect the applicability of these six choices. In this case, Table 2.4 tabulates the applicable strategies under different tp and nbprocs settings. In the case where mat%format='S', only the combination (F,L) is allowed while Spike_Adapt is turned off.
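Before turning autoadapt off, it can help to check a hand-picked pair against the six supported combinations listed above. The following sketch is a standalone helper, not part of SPIKE: the function name `is_supported` is hypothetical, and the check deliberately ignores the additional tp and nbprocs restrictions of Table 2.4 and the (F,L)-only rule for mat%format='S'.

```fortran
program check_strategy
  implicit none
  ! The six (RSS,DFS) pairs listed as supported in Section 2.4
  character(len=2), parameter :: supported(6) = &
       (/ 'TU', 'FL', 'RL', 'RP', 'TA', 'EA' /)

  print *, '(R,P) supported: ', is_supported('R', 'P')
  print *, '(T,L) supported: ', is_supported('T', 'L')

  ! (R,P) is the Spike_Default choice; (T,L) is not in the list
  if (.not. is_supported('R', 'P')) stop 1
  if (is_supported('T', 'L')) stop 2

  ! In a SPIKE program, a validated choice would then be set with the
  ! components described in Table 2.1, for example:
  !   pspike%autoadapt = .false.
  !   pspike%RSS = 'R' ; pspike%DFS = 'P' ; pspike%OIS = 3

contains

  ! Hypothetical helper: true if the pair (rss,dfs) is one of the six
  ! combinations supported in the current release
  logical function is_supported(rss, dfs)
    character(len=1), intent(in) :: rss, dfs
    is_supported = any(supported == rss // dfs)
  end function is_supported

end program check_strategy
```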

    pspike%tp pspike%nbprocs n (n > 1) Even ( 2 n ) Odd
    TU FL 0 RL RP All All TU FL EA TA TU FL 1 None All TU FL TU FL RL RP

Table 2.4: How the type of matrix partitioning and the number of MPI processes affect the choice of (RSS,DFS) for the Spike_Core strategy. In future developments of SPIKE 1.0, the choice of (RSS,DFS) will be independent of the setting of the tp component.

2.5 Running the spike_adapt.exe command

User applications do not call Spike_Adapt directly. Rather, Spike_Core calls Spike_Adapt if the autoadapt component of the spike_param structure is set to .true. Note that in this case the user-specified (RSS,DFS,OIS) values are ignored and will in fact be overwritten.

Nevertheless, SPIKE 1.0 provides a standalone executable, spike_adapt.exe, in the location <SPIKE directory>/bin/<arch>, where <arch> is either 64, for the IA-64 architecture, or em64t, for the Intel® 64 architecture. Given a set of input characteristics (matrix size, bandwidth, number of MPI processes, sparsity, diagonal dominance, number of right-hand sides, type of matrix partitioning), this executable suggests an optimal Spike_Core strategy.

Edit the Fortran NAMELIST file, ivars.nml, to specify the matrix parameters (the matrix_size value is missing from this copy), e.g.:

    &IVAR
      matrix_size =
      bandwidth = 161
      n_proc = 4
      sparsity = 0.9d0
      diagonal_dominance = 1.2d0
      n_rhs = 1
      tp = 0
    /

Simply run spike_adapt.exe in the same directory as ivars.nml to get a recommended Spike_Core strategy, e.g.:

    [cluster0]$ ./spike_adapt.exe
    Bandwidth = 161
    Diagonal dominance =
    Matrix size =
    Sparsity =
    # RHS = 1
    # Procs = 4
    Type of partition: 0
    The Spike_Adapt performance models selected fl3
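For readers unfamiliar with Fortran NAMELIST input, the following standalone sketch writes a file in the same shape as ivars.nml and reads it back. It only illustrates the file syntax and is not how spike_adapt.exe is implemented. The variable types are assumptions inferred from the listing, the file name ivars_demo.nml is hypothetical, and the matrix_size value of 100000 is a made-up placeholder, since the value is not given in this guide.

```fortran
program read_ivars_demo
  implicit none
  ! Names follow the ivars.nml listing; the types are assumptions
  integer :: matrix_size, bandwidth, n_proc, n_rhs, tp
  double precision :: sparsity, diagonal_dominance
  namelist /IVAR/ matrix_size, bandwidth, n_proc, sparsity, &
                  diagonal_dominance, n_rhs, tp
  integer :: ios

  ! Write a sample file (matrix_size is a made-up placeholder value)
  open(unit=10, file='ivars_demo.nml', status='replace', action='write')
  write(10,'(a)') ' &IVAR'
  write(10,'(a)') '  matrix_size = 100000'
  write(10,'(a)') '  bandwidth = 161'
  write(10,'(a)') '  n_proc = 4'
  write(10,'(a)') '  sparsity = 0.9d0'
  write(10,'(a)') '  diagonal_dominance = 1.2d0'
  write(10,'(a)') '  n_rhs = 1'
  write(10,'(a)') '  tp = 0'
  write(10,'(a)') ' /'
  close(10)

  ! Read the whole group back with a single namelist READ
  open(unit=10, file='ivars_demo.nml', status='old', action='read')
  read(10, nml=IVAR, iostat=ios)
  close(10)
  if (ios /= 0) stop 1

  print '(a,i0)',   'bandwidth = ', bandwidth
  print '(a,i0)',   'n_proc    = ', n_proc
  print '(a,f6.2)', 'sparsity  = ', sparsity
  if (bandwidth /= 161 .or. n_proc /= 4 .or. tp /= 0) stop 2
end program read_ivars_demo
```

The order of the entries inside the group does not matter to a namelist READ, and any variable omitted from the file keeps whatever value it had before the READ.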

Chapter 3 Separate calls

A single call to Spike

    CALL Spike(pspike,mat,f,info)

can be split into a calling sequence of four separate operations:

    CALL Spike_Begin(pspike,mat,pre,info)
    CALL Spike_Preprocess(pspike,pre,info)
    CALL Spike_Process(pspike,mat,pre,f,info)
    CALL Spike_End(pspike,mat,pre,info)

where

    Spike_Begin: beginning of the calling sequence;
    Spike_Preprocess: preprocessing of the preconditioner data structure;
    Spike_Process: processing of the matrix, preconditioner, and right-hand side;
    Spike_End: ending of the calling sequence.

In addition to pspike, mat, f, and info, the split calls need a new parameter, pre. This parameter is of type matrix_data and pertains to a preconditioner. However, the user need not set any of its component values; consider it a work array that the software uses internally.

Splitting a single call to SPIKE is useful for applications that iterate with changing right-hand sides but use the same original matrix. The following program invokes Spike_Process multiple times rather than invoking Spike multiple times.

Figure 3.1 presents a program solving two different right-hand sides: $(1, 0, 0, 0, 0, 0, 0, 0)^T$ and then $(0, 1, 0, 0, 0, 0, 0, 0)^T$. Note that the program uses the global partitioning scheme, so the right-hand sides are set up on node 0. In the program, Spike_Begin, Spike_Preprocess, and Spike_End are called once, while Spike_Process is called twice (once for each right-hand side).

    ! Declare variables used by SpikePACK
    integer :: info
    type(spike_param) :: pspike
    type(matrix_data) :: mat, pre
    double precision, dimension(8,1) :: f

    ! Set up pspike and mat as usual

    ! The following two calls are called once
    call Spike_Begin(pspike, mat, pre, info)
    call Spike_Preprocess(pspike, pre, info)

    ! Solve for the first right hand side
    if (rank == 0) then
      f = 0.0d0
      f(1,1) = 1.0d0
    end if

    ! Spike_Process() is invoked for the first right hand side
    call Spike_Process(pspike, mat, pre, f, info)
    ! The solution of the first RHS is stored in f after Spike_Process()

    ! Solve for the second right hand side
    ! (f holds the first solution at this point, so it is reset first)
    if (rank == 0) then
      f = 0.0d0
      f(2,1) = 1.0d0
    end if

    ! Spike_Process() is invoked for the second right hand side
    call Spike_Process(pspike, mat, pre, f, info)
    ! The solution of the second RHS is stored in f after Spike_Process()

    ! The following call is called once
    call Spike_End(pspike, mat, pre, info)

Figure 3.1: A program solving two right-hand sides using separate Spike calls

This program is expected to run faster than an equivalent one with two Spike calls, because it initializes and frees the Spike data structures only once, whereas a program calling Spike twice would duplicate this work.
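The benefit of splitting the work follows the same pattern as separating a factorization from its triangular solves. The standalone sketch below illustrates that pattern only by analogy: it factors a tridiagonal matrix once and reuses the factors for the two right-hand sides of Figure 3.1. It does not call SPIKE and does not reflect what Spike_Preprocess does internally. The 8-by-8 Toeplitz test matrix and all names are assumptions.

```fortran
program factor_once_solve_twice
  implicit none
  integer, parameter :: n = 8
  double precision :: low(n), diag(n), up(n)   ! sub-, main and superdiagonal of A
  double precision :: l(n), u(n)               ! LU factors: multipliers and pivots
  double precision :: f(n), x(n)
  integer :: i, k

  ! Assumed test matrix: 6 on the diagonal, -1 on both off-diagonals
  ! (low(i) = A(i,i-1) for i >= 2, up(i) = A(i,i+1) for i <= n-1)
  low = -1.0d0 ; diag = 6.0d0 ; up = -1.0d0

  ! Done once (the role played by the preprocessing step): factor A = L*U
  l = 0.0d0
  u(1) = diag(1)
  do i = 2, n
     l(i) = low(i) / u(i-1)
     u(i) = diag(i) - l(i) * up(i-1)
  end do

  ! Done once per right-hand side (the role played by the processing step)
  do k = 1, 2
     f = 0.0d0 ; f(k) = 1.0d0                  ! unit vectors e_1, then e_2
     call solve(f, x)
     print '(a,i0,a,es10.3)', 'RHS e', k, ': max residual = ', residual(x, f)
     if (residual(x, f) > 1.0d-12) stop 1
  end do

contains

  ! Forward and backward substitution with the stored factors
  subroutine solve(rhs, sol)
    double precision, intent(in)  :: rhs(n)
    double precision, intent(out) :: sol(n)
    double precision :: y(n)
    integer :: j
    y(1) = rhs(1)
    do j = 2, n
       y(j) = rhs(j) - l(j) * y(j-1)
    end do
    sol(n) = y(n) / u(n)
    do j = n - 1, 1, -1
       sol(j) = (y(j) - up(j) * sol(j+1)) / u(j)
    end do
  end subroutine solve

  ! Largest entry of |A*sol - rhs|, computed from the original diagonals
  double precision function residual(sol, rhs)
    double precision, intent(in) :: sol(n), rhs(n)
    double precision :: r
    integer :: j
    residual = 0.0d0
    do j = 1, n
       r = diag(j) * sol(j) - rhs(j)
       if (j > 1) r = r + low(j) * sol(j-1)
       if (j < n) r = r + up(j) * sol(j+1)
       residual = max(residual, abs(r))
    end do
  end function residual

end program factor_once_solve_twice
```

The factorization loop runs once regardless of how many right-hand sides follow, which is the cost that the separate SPIKE calls are meant to amortize.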

Chapter 4 Banded Preconditioner

SPIKE can be used as a framework for solving banded systems that serve as effective preconditioners for general sparse systems, which are solved via iterative methods. In future releases, SPIKE will offer different options for automatically deriving a robust banded preconditioner from an arbitrary general sparse system. In particular, the component %BPS of the derived type spike_param in Table 2.1 has been introduced for this purpose. For the current SPIKE version 1.0, the component %BPS can only take two values: 0 (no preconditioner, the default value) or 1, where the banded preconditioner has to be set by the user. Some users may take advantage of this option when banded preconditioners can be constructed directly from the application at hand, such as in nanoelectronics nanowire simulations [7].

Using the separate SPIKE calling sequences presented in Chapter 3, one can decide on a preconditioner pre that is used by the preprocessing sequence, while the processing sequence takes advantage of the resulting factorization of the preconditioner to accelerate the outer iterative schemes. Therefore, with the option %BPS= 1, users can define their own banded preconditioner (either dense or sparse within the band) for iteratively solving an original system whose matrix can be general sparse. Depending on the data distribution format (component %tp), the user must define the preconditioner pre, which is of the derived type matrix_data, in the same way as the original matrix mat, using either Table 2.2 (%tp=0) or Table 5.1 (%tp=1). In Chapter 6, Example 6 illustrates the use of the option %BPS= 1.

Chapter 5 Manual Data Partition

It has been assumed until now that all of the matrix and RHS data reside in the MPI master process (i.e., process-0). This is specified by setting the spike_param tp parameter to zero. When the matrix and RHS data are entirely in process-0, SPIKE automatically distributes a portion of the data to each MPI process before invoking the solver. The price paid for this convenience is the overhead of the data distribution and a potential limit on the overall problem size: the problem must fit in the memory available to process-0. Alternatively, SPIKE allows the user to partition dense matrices and RHSs among the MPI processes before calling Spike_Core. This chapter describes the local partitioning schemes supported by SPIKE 1.0.

Let pspike and mat be the variables of type spike_param and matrix_data, respectively, used in calls to Spike_Default, Spike, Spike_Begin, Spike_Process, etc. The dense banded format is specified by mat%format='D', and the sparse CSR format by mat%format='S'. In the following, pspike%tp is set to 1 so that the user distributes the matrix and RHS to the MPI processes manually.

5.1 Dense Banded Format

Consider a (complete) matrix of dimension n and bandwidth bwd, where bwd = mat%kl + mat%ku + 1. If SPIKE were to distribute the data automatically (i.e., tp=0), one would allocate a bwd-by-n array for mat%a. Table 5.1 gives details of the matrix data structure relevant for pspike%tp=1, that is, when the user manually distributes the complete matrix into the local mat variable on each processor. Figure 5.1 illustrates this partitioning scheme. The user must distribute the bwd-by-n array into pspike%nbprocs arrays of dimension bwd-by-n_j, where the values of n_j, satisfying

    $n = \sum_{j=1}^{\mathrm{nbprocs}} n_j$,

are set by the user. The values of n_j are stored globally (i.e., common to all processors) in the integer array mat%sizea of dimension nbprocs,

such that mat%sizea = (n_1, n_2, ..., n_nbprocs). The matrix elements are stored locally on each processor in mat%a. The RHSs are distributed by rows in a natural way: MPI process j-1 holds an array of dimension n_j-by-nrhs, for j = 1, 2, ..., nbprocs.

Figure 5.1: Illustration of a matrix in LAPACK banded storage format distributed to four MPI processes

5.2 Sparse CSR Format

Consider a (complete) sparse matrix of dimension n. If SPIKE were to distribute the data automatically (i.e., tp=0), one would use the CSR format and allocate on processor 0 the arrays mat%sa, mat%jsa, and mat%isa. With tp=1, however, the user must distribute the complete sparse matrix by blocks of rows into %nbprocs sets of arrays in CSR format, where the number of non-zero elements nnz_j of each submatrix and the numbers of rows n_j, satisfying

    $n = \sum_{j=1}^{\mathrm{nbprocs}} n_j$,

are set by the user. Figure 5.2 illustrates this partitioning scheme and Table 5.1 gives details of the matrix data structure relevant for pspike%tp=1.

Figure 5.2: Illustration of a matrix in CSR sparse storage format distributed to four MPI processes

The value of nnz_j is stored locally (i.e., on each processor) in the integer mat%nbsa. The matrix elements are also stored locally on each processor in the arrays mat%sa, mat%jsa, and mat%isa, of dimension mat%nbsa, mat%nbsa, and n_j + 1, respectively. The RHSs are distributed by rows in a natural way: MPI process j-1 holds an array of dimension n_j-by-nrhs, for j = 1, 2, ..., nbprocs.

Table 5.1: List of parameter fields of the type matrix_data variable mat. Here all the matrix data are distributed on each processor with pspike%tp=1.

    mat%      Type                               Distribution  Description
    --------  ---------------------------------  ------------  ------------------------------------------------
    format    char (in)                          global        matrix format: 'D': Dense; 'S': Sparse CSR
    astru     char (in)                          global        matrix structure: 'G': General non-symmetric
    diagdo    char (inout)                       global        diagonal dominance: 'Y': Yes; 'N': No; 'I': Investigate
    vdiagdo   double (out)                       global        diagonal dominance value computed by SPIKE
                                                               if mat%diagdo='I' or pspike%autoadapt=.true.
    n         integer (in)                       global        matrix dimension
    sizea     integer(pspike%nbprocs) (in)       global        set of partition dimensions,
                                                               mat%sizea=(n_1, n_2, ..., n_nbprocs)

    The input fields below are for the case mat%format='D':
    kl        integer (inout)                    global        number of subdiagonals in the matrix
    ku        integer (inout)                    global        number of superdiagonals in the matrix
    A         double(bwd,mat%sizea(i+1))         rank i        LAPACK banded matrix format, no extra pivoting
                                                               space; bwd = mat%kl+mat%ku+1

    The input fields below are associated with the sparse CSR format (mat%format='S'):
    nbsa      integer                            rank i        number of non-zero matrix elements nnz_j
                                                               for partition j=i+1
    sa        double(mat%nbsa)                   rank i        CSR format, matrix elements
    jsa       integer(mat%nbsa)                  rank i        CSR format, column indices
    isa       integer(mat%sizea(i+1)+1)          rank i        CSR format, start-of-row indices
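The row-block distribution described in this chapter can be sketched as follows. These are hypothetical helpers, not SPIKE routines: `partition_sizes` chooses balanced values of n_j that sum to n, and `csr_row_block` extracts the CSR arrays of one partition from a global CSR matrix with 1-based indices, rebasing the start-of-row indices so that they begin at 1. Column indices are copied unchanged (that is, they remain global); whether SPIKE expects global or local column indices is not stated in this chapter, so treat that choice as an assumption to check against the reference guide.

```c
#include <stddef.h>

/* Hypothetical helper: balanced partition sizes n_j with sum n.
 * The first (n % nbprocs) partitions receive one extra row. */
void partition_sizes(int n, int nbprocs, int *sizea)
{
    for (int j = 0; j < nbprocs; ++j)
        sizea[j] = n / nbprocs + (j < n % nbprocs ? 1 : 0);
}

/* Hypothetical helper: extract the row block of partition `part`
 * (0-based, i.e. the MPI rank) from a global CSR matrix.
 *   sa, jsa, isa : global CSR arrays, 1-based indices
 *   local_isa    : output, sizea[part] + 1 entries, rebased to start at 1
 *   local_sa,
 *   local_jsa    : outputs, large enough for the block's non-zeros
 * Returns nnz_j, the number of non-zeros in the block. */
int csr_row_block(const int *sizea, int part,
                  const double *sa, const int *jsa, const int *isa,
                  double *local_sa, int *local_jsa, int *local_isa)
{
    int first_row = 0;                    /* 0-based index of first local row */
    for (int j = 0; j < part; ++j)
        first_row += sizea[j];

    const int nrows = sizea[part];
    const int start = isa[first_row];     /* 1-based position of first entry  */
    const int nnz = isa[first_row + nrows] - start;

    for (int r = 0; r <= nrows; ++r)
        local_isa[r] = isa[first_row + r] - start + 1;

    for (int k = 0; k < nnz; ++k) {
        local_sa[k] = sa[start - 1 + k];
        local_jsa[k] = jsa[start - 1 + k]; /* global column index (assumption) */
    }
    return nnz;
}
```

On rank i, the outputs would correspond to mat%sizea, mat%nbsa (the return value), mat%sa, mat%jsa, and mat%isa of Table 5.1.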

Chapter 6 SPIKE Examples

This chapter shows sample programs illustrating the SPIKE calling sequences. In examples 1, 2, 3, and 4, SPIKE solves the following linear system of size n = 8:

$$
\begin{pmatrix}
 6 & -1 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 & -1 &  0 &  0 &  0 &  0 \\
-1 & -1 &  6 & -1 & -1 &  0 &  0 &  0 \\
 0 & -1 & -1 &  6 & -1 & -1 &  0 &  0 \\
 0 &  0 & -1 & -1 &  6 & -1 & -1 &  0 \\
 0 &  0 &  0 & -1 & -1 &  6 & -1 & -1 \\
 0 &  0 &  0 &  0 & -1 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 & -1 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ f_4 \\ f_5 \\ f_6 \\ f_7 \\ f_8 \end{pmatrix}
$$

Note that examples 1, 2, 3, and 5 can use 1, 2, or 4 MPI processes. Example 4 is designed for only 2 MPI processes.

6.1 Example 1: Automatic Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done by SPIKE. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)^T. This example calls the SPIKE subroutine.

    INCLUDE 'spike.fi'
    program example1
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs

      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt = .false.
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!
      !! All processors
      mat%format = 'D'
      mat%astru = 'G'
      mat%diagdo = 'Y'
      mat%n = 8
      mat%kl = 2
      mat%ku = 2
      !! Global matrix is defined only on processor 0
      if (rank == 0) then
         !! only on processor 0 (global matrix)
         allocate(mat%a(1:mat%kl+mat%ku+1, mat%n))
         mat%a(mat%ku+1,:) = 6.0d0
         mat%a(mat%ku-1,:) = -1.0d0
         mat%a(mat%ku,:)   = -1.0d0
         mat%a(mat%ku+2,:) = -1.0d0
         mat%a(mat%ku+3,:) = -1.0d0
         !! RHS
         allocate(f(1:mat%n, 1:1))
         f = 1.0d0
      end if
      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE(pspike, mat, f, info)
      if (info >= 0) then
         !!!!!! Global Solution
         if (rank == 0) then
            print *, 'Global solution'
            do i = 1, mat%n
               print *, i, f(i,1)
            end do
         end if
      end if
      call MPI_FINALIZE(code)
    end program example1

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RP3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-04
    TIME FOR SPIKE BANDED SOLV     e-04
    TIME FOR SPIKE (FACT+SOLV)     e-03
    RESIDUAL                       e-16
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution

6.2 Example 2: Automatic Partitioning and Multiple RHS

In this example, two systems with the same coefficient matrix are solved. The RHSs are (1, 0, 0, 0, 0, 0, 0, 0)^T and (0, 1, 0, 0, 0, 0, 0, 0)^T. This example calls

the SPIKE subroutine.

    INCLUDE 'spike.fi'
    program example2
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs
      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt = .false.
      pspike%dfs = 'L'
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!
      mat%format = 'D'
      mat%astru = 'G'
      mat%diagdo = 'Y'
      mat%n = 8
      mat%kl = 2
      mat%ku = 2
      if (rank == 0) then
         allocate(mat%a(1:mat%kl+mat%ku+1, mat%n))
         mat%a(mat%kl+1,:) = 6.0d0
         mat%a(mat%kl-1,:) = -1.0d0
         mat%a(mat%kl,:)   = -1.0d0
         mat%a(mat%kl+2,:) = -1.0d0
         mat%a(mat%kl+3,:) = -1.0d0
         !! RHS
         allocate(f(1:mat%n, 1:2))
         f = 0.0d0
         f(1,1) = 1.0d0
         f(2,2) = 1.0d0
      end if
      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE(pspike, mat, f, info)
      if (info >= 0) then
         !!!!!! Global Solution
         if (rank == 0) then
            print *, 'Global solution'
            do i = 1, mat%n
               print *, i, f(i,1), f(i,2)
            end do
         end if
         !!!!!!!!!!
      end if
      call MPI_FINALIZE(code)
    end program example2

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RL3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-04

    TIME FOR SPIKE BANDED SOLV     e-04
    TIME FOR SPIKE (FACT+SOLV)     e-03
    RESIDUAL                       e-16
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution

6.3 Example 3: Automatic Partitioning and Multiple RHS with Separate Factorization and Solution

In this example, we again use two RHSs, but this time the SPIKE calling sequence is separated into factorization and solution: the factorization is done once and is followed by one solve for each RHS, (1, 0, 0, 0, 0, 0, 0, 0)^T and (0, 1, 0, 0, 0, 0, 0, 0)^T.

    INCLUDE 'spike.fi'
    program example3
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat, pre
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs
      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt = .false.
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!
      mat%format = 'D'   ! D for Dense Banded, S for Sparse banded, G for General Sparse
      mat%astru = 'G'    !!! General structure (non symmetric)
      mat%diagdo = 'Y'
      mat%n = 8
      mat%kl = 2
      mat%ku = 2
      if (rank == 0) then
         allocate(mat%a(1:mat%kl+mat%ku+1, mat%n))
         mat%a(mat%ku+1,:) = 6.0d0
         mat%a(mat%ku-1,:) = -1.0d0
         mat%a(mat%ku,:)   = -1.0d0
         mat%a(mat%ku+2,:) = -1.0d0
         mat%a(mat%ku+3,:) = -1.0d0
         !! RHS
         allocate(f(1:mat%n, 1:1))
      end if

      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE_BEGIN(pspike, mat, pre, info)
      if ((rank == 0) .and. (info < 0)) then
         print *, '1 Spike INFO exit / Error Code:', info, pspike%error_code
      end if
      call SPIKE_PREPROCESS(pspike, pre, info)
      if ((rank == 0) .and. (info < 0)) then
         print *, '2 Spike INFO exit / Error Code:', info, pspike%error_code
      end if
      if (rank == 0) then
         f = 0.0d0
         f(1,1) = 1.0d0
      end if
      call SPIKE_PROCESS(pspike, mat, pre, f, info)
      if ((rank == 0) .and. (info < 0)) then
         print *, '3 Spike INFO exit / Error Code:', info, pspike%error_code
      end if
      if (info >= 0) then
         !!!!!! Global Solution 1
         if (rank == 0) then
            print *, 'Global solution 1'
            do i = 1, mat%n
               print *, i, f(i,1)
            end do
         end if
         !!!!!!!!!!
      end if
      if (rank == 0) then
         f = 0.0d0
         f(2,1) = 1.0d0
      end if
      call SPIKE_PROCESS(pspike, mat, pre, f, info)
      if ((rank == 0) .and. (info < 0)) then
         print *, '4 Spike INFO exit / Error Code:', info, pspike%error_code
      end if
      if (info >= 0) then
         !!!!!! Global Solution 2
         if (rank == 0) then
            print *, 'Global solution 2'
            do i = 1, mat%n
               print *, i, f(i,1)
            end do
         end if
         !!!!!!!!!!
      end if
      call SPIKE_END(pspike, mat, pre, info)
      call MPI_FINALIZE(code)
    end program example3

We get the following output:

    Global solution 1
    Global solution 2

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RP3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-04
    TIME FOR SPIKE BANDED SOLV     e-04
    TIME FOR SPIKE (FACT+SOLV)     e-03
    RESIDUAL                       e-16
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1

6.4 Example 4: Manual Partitioning

In this example, partitioning of the coefficient matrix and the RHS is done manually on 2 processors. The RHS is (1, 1, 1, 1, 1, 1, 1, 1)^T.

    INCLUDE 'spike.fi'
    program example4
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs
      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%tp = 1      !! customized local partitioning of type 1
      pspike%autoadapt = .false.
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!
      mat%format = 'D'   !! dense banded format
      mat%astru = 'G'
      mat%diagdo = 'Y'
      ! global data
      mat%n = 8
      mat%kl = 2
      mat%ku = 2
      allocate(mat%sizeA(1:2))   !! only 2 partitions are considered
      mat%sizeA(1) = 4
      mat%sizeA(2) = 4
      ! local data for partition number rank+1
      allocate(mat%a(1:mat%kl+mat%ku+1, mat%sizeA(rank+1)))
      mat%a(mat%ku+1,:) = 6.0d0
      mat%a(mat%ku-1,:) = -1.0d0
      mat%a(mat%ku,:)   = -1.0d0
      mat%a(mat%ku+2,:) = -1.0d0
      mat%a(mat%ku+3,:) = -1.0d0
      !! RHS (local)
      allocate(f(1:mat%sizeA(rank+1), 1:1))
      f = 1.0d0
      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE(pspike, mat, f, info)

      if (info >= 0) then
         !!!!!! Local Solution
         print *, 'Local solution for partition', rank+1
         do i = 1, mat%sizeA(rank+1)
            print *, i, f(i,1)
         end do
      endif
      call MPI_FINALIZE(code)
    end program example4

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RP3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-04
    TIME FOR SPIKE BANDED SOLV     e-04
    TIME FOR SPIKE (FACT+SOLV)     e-03
    RESIDUAL                       e-16
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Local solution for partition
    Local solution for partition

6.5 Example 5: Automatic Partitioning Using the CSR Input Format

The following system in compressed sparse row (CSR) format is solved using the SPIKE subroutine:

$$
\begin{pmatrix}
 6 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 &  6 &  0 & -1 &  0 &  0 &  0 &  0 \\
-1 &  0 &  6 &  0 & -1 &  0 &  0 &  0 \\
 0 & -1 &  0 &  6 &  0 & -1 &  0 &  0 \\
 0 &  0 & -1 &  0 &  6 &  0 & -1 &  0 \\
 0 &  0 &  0 & -1 &  0 &  6 &  0 & -1 \\
 0 &  0 &  0 &  0 & -1 &  0 &  6 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
$$

    INCLUDE 'spike.fi'
    program example5
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat

      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs
      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt = .false.
      pspike%rss = 'F'
      pspike%dfs = 'L'
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!
      mat%format = 'S'   !! CSR
      mat%astru = 'G'
      mat%diagdo = 'Y'
      mat%n = 8
      if (rank == 0) then
         mat%nbsa = 20                   !! number of non zero elements in CSR format
         allocate(mat%sa(1:mat%nbsa))    ! array for values
         allocate(mat%jsa(1:mat%nbsa))   ! array for column indexes
         allocate(mat%isa(1:mat%n+1))    ! array for row CSR indexes
         mat%sa  = (/6.0d0,-1.0d0, 6.0d0,-1.0d0, -1.0d0,6.0d0,-1.0d0, -1.0d0,6.0d0,-1.0d0, &
                     -1.0d0,6.0d0,-1.0d0, -1.0d0,6.0d0,-1.0d0, -1.0d0,6.0d0, -1.0d0,6.0d0/)
         mat%jsa = (/1,3, 2,4, 1,3,5, 2,4,6, 3,5,7, 4,6,8, 5,7, 6,8/)
         mat%isa = (/1,3,5,8,11,14,17,19,21/)
         !! RHS
         allocate(f(1:mat%n, 1:1))
         f = 1.0d0
      end if
      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE(pspike, mat, f, info)
      if (info >= 0) then
         !!!!!! Global Solution
         if (rank == 0) then
            print *, 'Global solution'
            do i = 1, mat%n
               print *, i, f(i,1)
            end do
         end if
      endif
      call MPI_FINALIZE(code)
    end program example5

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 FL3
    TIME FOR PARTITIONING          e-04
    TIME FOR SPIKE BANDED FACT     e-01
    TIME FOR SPIKE BANDED SOLV     e-03
    TIME FOR SPIKE (FACT+SOLV)     e-01
    RESIDUAL                       e-16
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    SPIKE WARNING 1
    Global solution
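A solution returned by a CSR-based solve can be checked independently by recomputing the residual. The sketch below uses hypothetical helpers, not SPIKE routines; it assumes the 1-based CSR arrays used in the Fortran examples and measures the residual in the infinity norm, which is an assumption, since the norm behind the RESIDUAL line of the SPIKE summary is not stated here.

```c
#include <stddef.h>

/* Hypothetical helper: y = A*x for a CSR matrix with 1-based jsa and isa. */
void csr_matvec(int n, const double *sa, const int *jsa, const int *isa,
                const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int p = isa[i]; p < isa[i + 1]; ++p)   /* 1-based positions */
            sum += sa[p - 1] * x[jsa[p - 1] - 1];
        y[i] = sum;
    }
}

/* Hypothetical helper: max_i |f_i - (A x)_i|, the infinity norm of the residual. */
double csr_residual_inf(int n, const double *sa, const int *jsa, const int *isa,
                        const double *x, const double *f)
{
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int p = isa[i]; p < isa[i + 1]; ++p)
            sum += sa[p - 1] * x[jsa[p - 1] - 1];
        double r = f[i] - sum;
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

Because SPIKE overwrites f with the solution, a copy of the original right-hand side has to be kept if such a check is wanted after the call.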

6.6 Example 6: Automatic Partitioning Using the CSR Input Format with a Preconditioner

Let us define the following general sparse system:

$$
\begin{pmatrix}
 6 &  0 &  0 &  0 &  0 &  0 &  0 & -1 \\
 0 &  6 &  0 &  0 &  0 &  0 & -1 &  0 \\
 0 &  0 &  6 &  0 &  0 & -1 &  0 &  0 \\
 0 &  0 &  0 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 &  0 &  0 &  0 \\
 0 &  0 & -1 &  0 &  0 &  6 &  0 &  0 \\
 0 & -1 &  0 &  0 &  0 &  0 &  6 &  0 \\
-1 &  0 &  0 &  0 &  0 &  0 &  0 &  6
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
$$

This linear system is solved iteratively with the following dense, banded preconditioner:

$$
M =
\begin{pmatrix}
 6 & -1 &  0 &  0 &  0 &  0 &  0 &  0 \\
-1 &  6 & -1 &  0 &  0 &  0 &  0 &  0 \\
 0 & -1 &  6 & -1 &  0 &  0 &  0 &  0 \\
 0 &  0 & -1 &  6 & -1 &  0 &  0 &  0 \\
 0 &  0 &  0 & -1 &  6 & -1 &  0 &  0 \\
 0 &  0 &  0 &  0 & -1 &  6 & -1 &  0 \\
 0 &  0 &  0 &  0 &  0 & -1 &  6 & -1 \\
 0 &  0 &  0 &  0 &  0 &  0 & -1 &  6
\end{pmatrix}
$$

    INCLUDE 'spike.fi'
    program example6
      use spike_module
      use mpi
      implicit none
      integer :: rank, code, nb_procs, i
      double precision, dimension(:,:), allocatable :: f
      !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      integer :: info
      type(spike_param) :: pspike
      type(matrix_data) :: mat, pre
      !
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      call MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, code)
      !!!!!!!!!!!!!!!!!!!! INPUT PARAMETER SPIKE !!!!!!!!!!!!!!!!!!!!
      pspike%nbprocs = nb_procs
      pspike%rank = rank
      call SPIKE_DEFAULT(pspike)
      !! changes from default
      pspike%autoadapt = .false.
      pspike%dfs = 'L'
      pspike%bps = 1     ! a banded preconditioner is provided by the user
      !!!!!!!!!!!!!!! INPUT PARAMETER MATRIX and RHS !!!!!!!!!!!!!!!

      mat%format = 'S'   !! CSR
      mat%astru = 'G'
      mat%diagdo = 'Y'
      mat%n = 8
      if (rank == 0) then
         mat%nbsa = 16
         allocate(mat%sa(1:mat%nbsa))
         allocate(mat%jsa(1:mat%nbsa))
         allocate(mat%isa(1:mat%n+1))
         mat%sa  = (/6.0d0,-1.0d0, 6.0d0,-1.0d0, 6.0d0,-1.0d0, 6.0d0,-1.0d0, &
                     -1.0d0,6.0d0, -1.0d0,6.0d0, -1.0d0,6.0d0, -1.0d0,6.0d0/)
         mat%jsa = (/1,8, 2,7, 3,6, 4,5, 4,5, 3,6, 2,7, 1,8/)
         mat%isa = (/1,3,5,7,9,11,13,15,17/)
         !! RHS
         allocate(f(1:mat%n, 1:1))
         f = 0.0d0
         f(1,1) = 1.0d0
      end if
      !!!!!!!!!!!!!!! INPUT PARAMETER PRECONDITIONER !!!!!!!!!!!!!!!
      pre%format = 'D'   ! Dense Banded format
      pre%astru = 'G'
      pre%diagdo = 'Y'
      pre%n = 8
      pre%kl = 1
      pre%ku = 1
      if (rank == 0) then
         allocate(pre%a(1:pre%kl+pre%ku+1, pre%n))
         pre%a(pre%ku+1,:) = 6.0d0
         pre%a(pre%ku,:)   = -1.0d0
         pre%a(pre%ku+2,:) = -1.0d0
      end if
      !!!!!!!!!!!!!!!!!!!!!!!! CALLING SPIKE !!!!!!!!!!!!!!!!!!!!!!!!
      call SPIKE_BEGIN(pspike, mat, pre, info)
      call SPIKE_PREPROCESS(pspike, pre, info)
      call SPIKE_PROCESS(pspike, mat, pre, f, info)
      call SPIKE_END(pspike, mat, pre, info)
      if (info >= 0) then
         !!!!!! Global Solution
         if (rank == 0) then
            print *, 'Global solution'
            do i = 1, mat%n
               print *, i, f(i,1)
            end do
         end if
      endif
      call MPI_FINALIZE(code)
    end program example6

We get the following output:

    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RL3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-03
    TIME FOR SPIKE BANDED SOLV     e-04
    TIME FOR SPIKE (FACT+SOLV)     e-03
    RESIDUAL                       e-08
    # Outside iterations:          4
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)
    Global solution

6.7 Toeplitz Matrix Example

This example solves a large Toeplitz system with RHS (1, 1, ..., 1, 1)^T. The source code is not shown here; it can be found in <SPIKE dir>/examples/examples_f90/source. The input matrix elements and properties must be defined in the file <SPIKE dir>/examples/examples_f90/data/matrix_toeplitz.in. The following is a sample input file for a banded matrix (n = 48,000) with 3 on the main diagonal, 4 on the first upper and lower off-diagonals, 0.1 on the other off-diagonals, and upper and lower bandwidths of 80 (total bandwidth is 161):

    48000     !! n, size of the matrix
    80        !! kl, Lower band
    80        !! ku, Upper band
    3.0d0     !! diagonal element
    4.0d0     !! first lower off diagonal element
    4.0d0     !! first upper off diagonal element
    0.1d0     !! OTHERS off diagonal element
    1         !! s, number of RHS (THE value of the RHS are generated by the code)
    N         !! DIAGDO? Y (Yes), N (No), I (Investigate)

Some of the components of the derived type spike_param variable can be changed from their default values by modifying the input file <SPIKE dir>/examples/examples_f90/data/spike_toeplitz.in. Here is a sample input file which selects the (R,L) strategy:

    R         !! RSS? E (Explicit), F (on the Fly), T (Truncated), R (Recursive)
    L         !! DFS? L (LU), U (LU, and UL), P (LU with pivoting)
    3         !! OIS? 0 (DIRECT), 2 (ITREFINEM), 3 (BiCGStab)
    1D-7      !! eps_out   !! ACCURACY BiCGstab OUTSIDE
    50        !! nbit_out  !! NBRE MAX of ITERATIONS OUTSIDE
    1D-5      !! eps_in    !! ACCURACY BiCGstab INSIDE
    30        !! nbit_in   !! NBRE MAX of ITERATIONS INSIDE
    1D-10     !! New zero machine for diagonal BOOSTing procedure
    0         !! type of partitioning (0: global, 1: local)
    .true.    !! timing
    .true.    !! detailed information of the simulation
    6         !! info printed on screen if =6, or on file spike output if /=6
    .false.   !! to enable spike_adapt

Finally, one can run the Toeplitz example program with the command mpirun -np 2 toeplitz to get the following output:

    SPIKE INFO
    !! NB PROCESSORS?               4
    !! NB PARTITIONS?               4
    !! SPIKE ADAPT?                 F
    !! ALGORITHM?                   R
    !! FACTORIZATION?               L
    !! TYPE OF SOLVER?              3
    !! ACCURACY OUT?                e-07
    !! NB ITMAX OUT?                50
    !! ACCURACY IN?                 e-05
    !! NB ITMAX IN?                 30
    !! NEW ZERO PIVOT?              e-09
    !! BOOST?                       e-10

    !! Orign Partition?             0
    !! Size first last partition?   12000
    !! Size partition middle?       12000
    !! Free memory?                 T
    !! Compute Residual?            T
    !! ADD MEMORY NEEDED (Mb)       e+02
    MATRIX INFO
    !! MATRIX FORMAT?               D
    !! MATRIX STRUCT?               G
    !! Diag Dominant?               N
    !! SIZE MATRIX?
    DENSE BANDED MATRIX
    !! Lower band?                  80
    !! Upper band?                  80
    DETAILED TIME of PREPROCESS
    NORM L1 of Aj (1st partition)                  e+01
    TIME FACT LU (< to copy UL+FACT LU, if any)    e-01
    TIME FOR COMPUTING THE SPIKES                  e-01
    > TIME FOR SPIKE PREPROCESSING                 e-01
    RHS INFO
    !! Number of RHS?               1
    DETAILED TIME of PROCESS
    TIME FOR MODIFIED RHS                          e-01
    TIME FOR REDUCED SYSTEM                        e-02
    TIME FOR RETRIEVE                              e-03
    RESIDUAL BEFORE OUTSIDE ITERATION              e-10
    TIME postprocess MATMUL                        e+00
    TIME postprocess SOLVE                         e+00
    > TIME FOR SPIKE PROCESSING                    e-01
    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 RL3
    TIME FOR PARTITIONING          e-01
    TIME FOR SPIKE BANDED FACT     e-01
    TIME FOR SPIKE BANDED SOLV     e-01
    TIME FOR SPIKE (FACT+SOLV)     e-01
    RESIDUAL                       e-10
    # Outside iterations:          0
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

6.8 Sparse Banded Matrix Example

This example reads and solves a sparse banded matrix in CSR format. The source code is provided in <SPIKE dir>/examples/examples_f90/source/sparse.f90, together with a sample input matrix. The input matrix file is defined in <SPIKE dir>/examples/examples_f90/data/matrix_sparse.in and it contains the following fields:

    csrfile   !! generic name of sparse format
    I         !! DIAGDO? Y (Yes), N (No), I (Investigate)
    .false.   !! sparse2dense banded (true or false)

The sparse system matrix is stored in four files whose generic name is given by the first line of the input file above (here the name is csrfile). These four files, located in the same directory as the input file, are: csrfile.sa for the matrix elements, csrfile.jsa for the column indices, csrfile.isa for the start-of-row indices, and csrfile.sf for the right-hand-side elements. The number of non-zero elements is given at the beginning of the first two files, and the number of rows at the beginning of the last two. In addition, the first line of csrfile.sf also contains the number of right-hand sides (if this number is greater than one, the elements should be stored in multiple columns).

As in the Toeplitz example, some of the components of the derived type spike_param variable can be changed from their default values by modifying the input file <SPIKE dir>/examples/examples_f90/data/spike_sparse.in. In SPIKE 1.0, only the (F,L) strategy is allowed for solving sparse banded systems. However, the last field of the file matrix_sparse.in enables a utility routine that transforms the CSR input matrix into a dense banded matrix. The routine then sets mat%format='D' for SPIKE, which enables all the other strategies available for dense banded systems.

Finally, one can run the sparse example program with the command mpirun -np 4 sparse to get the following output:

    Matrix loaded n=   nnz=
    SPIKE INFO
    !! NB PROCESSORS?               4
    !! NB PARTITIONS?               4
    !! SPIKE ADAPT?                 F
    !! ALGORITHM?                   F
    !! FACTORIZATION?               L
    !! TYPE OF SOLVER?              3
    !! ACCURACY OUT?                e-07
    !! NB ITMAX OUT?                50
    !! ACCURACY IN?                 e-05
    !! NB ITMAX IN?                 30
    !! NEW ZERO PIVOT?              e-09
    !! BOOST?                       e-10
    !! Orign Partition?             0
    !! Size first last partition?   240
    !! Size partition middle?       240
    !! Free memory?                 T
    !! Compute Residual?            T
    !! ADD MEMORY NEEDED (Mb)       e-01
    MATRIX INFO
    !! MATRIX FORMAT?               S
    !! MATRIX STRUCT?               G
    !! Diag Dominant?               N
    !! Degree of Diag Dominant?                 e-01
    !! Degree of Sparsity (within the band)?    e-01
    !! SIZE MATRIX?                 960
    SPARSE BANDED MATRIX
    !! Lower band?                  43
    !! Upper band?                  43
    !! # of non zero el?

    DETAILED TIME of PREPROCESS
    Pardiso Reorder                                e-01
    Pardiso Factor                                 e-02
    TIME FACT LU (< to copy UL+FACT LU, if any)    e-01
    TIME FOR COMPUTING THE SPIKES                  e-05
    > TIME FOR SPIKE PREPROCESSING                 e-01
    RHS INFO
    !! Number of RHS?               1
    DETAILED TIME of PROCESS
    RESIDUAL BEFORE BICGSTAB IN ITERATION          e+00
    TIME postprocess MATMUL                        e-02
    TIME postprocess SOLVE                         e-05
    TIME FOR MODIFIED RHS                          e-04
    TIME FOR REDUCED SYSTEM                        e-02
    TIME FOR RETRIEVE                              e-04
    RESIDUAL BEFORE OUTSIDE ITERATION              e-06
    RESIDUAL BEFORE BICGSTAB IN ITERATION          e+00

    TIME postprocess MATMUL                        e-02
    TIME postprocess SOLVE                         e-06
    TIME FOR MODIFIED RHS                          e-04
    TIME FOR REDUCED SYSTEM                        e-02
    TIME FOR RETRIEVE
    TIME postprocess MATMUL                        e-05
    TIME postprocess SOLVE                         e-02
    > TIME FOR SPIKE PROCESSING                    e-01
    >>>>>>>>>>>>>>>>>>> SPIKE SUMMARY >>>>>>>>>>>>>>>>>>>>>
    Spike Strategy                 FL3
    TIME FOR PARTITIONING          e-03
    TIME FOR SPIKE BANDED FACT     e-01
    TIME FOR SPIKE BANDED SOLV     e-01
    TIME FOR SPIKE (FACT+SOLV)     e-01
    RESIDUAL                       e-11
    # Outside iterations:          1
    SPIKE has succeeded (to reach the accuracy pspike%eps_out)

6.9 Calling SPIKE from C Programs

SPIKE can also be called from C programs. The data structures of the C interface are available in the header file <SPIKE dir>/include/spike.h. It is very important to know the difference between the Fortran and C input formats. Inside the spike.h header file, the integer variables comd, autoadapt, failed, timing, memfree, residual, singular, blocked, boost, and custom_pre of the spike_param_c_interface data structure are actually logical variables in Fortran. Therefore, to initialize these variables in C, set them to -1 for true and 0 for false. To call SPIKE, add the following lines to the C code:

    #include <mpi.h>
    #include "spike.h"

    /* Before the MPI_INIT calling sequence */
    int rank, nb_procs, code, info[4];
    spike_param_c_interface pspike;   /* Data structure associated with a given SPIKE environment */
    matrix_data_c_interface mat, pre; /* Data structure associated with the original matrix mat
                                         and pre (if separate calling is used) */

    /* Inside main function */
    /**********/
    code = MPI_Init(&argc, &argv);
    code = MPI_Comm_size(MPI_COMM_WORLD, &nb_procs);
    code = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /**********/
    /* After the MPI_INIT calling sequences */
    pspike.nbprocs = nb_procs;
    pspike.rank = rank;
    spike_default(&pspike);           /* Default values for pspike */

    /* CALL FOR SPIKE with DEFINITION of INPUT PARAMETERS */

    /* End of main function */

The C version of the Toeplitz example program as well as examples 1-5 are available in the directory <SPIKE dir>/examples/examples_c. If necessary, modify both the makefile and makefile.target to use the desired compiler and MPI implementation. The examples use the Intel compilers and MPI library by default. Moreover, the makefile should (i) link the libspike.a library, (ii) link the BLAS and LAPACK libraries, and (iii) specify the path to the spike.h header file.
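Because the logical fields of the C interface use -1 for true rather than the value 1 that C comparisons produce, a small conversion layer can help avoid mistakes. The helpers below are hypothetical and not part of spike.h; they only encode the convention stated in the previous section (-1 for true, 0 for false).

```c
/* Hypothetical helpers, not part of spike.h.
 * Convert between C truth values and the integer encoding this guide
 * prescribes for Fortran logical fields in the C interface. */

/* Any non-zero C value maps to -1 (true); zero maps to 0 (false). */
int to_fortran_logical(int c_truth)
{
    return c_truth ? -1 : 0;
}

/* Any non-zero stored value is read back as C true (1); zero as false (0). */
int from_fortran_logical(int f_logical)
{
    return f_logical != 0;
}
```

A field such as autoadapt would then be set with an assignment like `pspike.autoadapt = to_fortran_logical(0);` instead of assigning the result of a C comparison directly.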

Chapter 7
Reference guide

7.1 SPIKE 1.0 directory structure

High-level Directory Structure

The table below shows the high-level structure of SPIKE 1.0 after installation. All directories are under the SPIKE main directory, for example /opt/intel/spike/1.0.

Directory                Comment
bin/64                   Itanium2® binary executable
bin/em64t                Intel64® binary executable
doc                      Documentation
examples/examples_c      C source code and data for examples
examples/examples_f90    Fortran 90 source code and data for examples
include                  C headers, Fortran 90 module interfaces, and MPI wrappers
lib/64                   Itanium2® static libraries
lib/em64t                Intel64® static libraries
spike_adapt/64           Spike Adapt data files, Itanium2®
spike_adapt/em64t        Spike Adapt data files, Intel64®

Detailed Directory Structure

The table below shows the detailed structure of the SPIKE directories. Again, all directories are under the SPIKE main directory, for example /opt/intel/spike/1.0.

Directory and Files            Contents
bin/64                         Binaries directory, Itanium2®
  ivars.nml                    Fortran NAMELIST file storing the input characteristics used by spike_adapt.exe
  spike_adapt.exe              Standalone executable to query Spike Adapt
bin/em64t                      Binaries directory, Intel64®
  ivars.nml                    Fortran NAMELIST file storing the input characteristics used by spike_adapt.exe
  spike_adapt.exe              Standalone executable to query Spike Adapt
doc                            Documentation directory
  Install.txt                  Installation Guide
  spikeeula.txt                SPIKE license
  spike_ug.pdf                 The SPIKE User Guide (PDF format)
  spike_ug.ps                  The SPIKE User Guide (PostScript format)
examples/examples_c            C example source code and data
  source                       Source code subdirectory
  data                         Data files subdirectory
  makefile[.target]            Makefiles to build examples
examples/examples_f90          Fortran 90 example source code and data
  source                       Source code subdirectory
  data                         Data files subdirectory
  makefile[.target]            Makefiles to build examples
include                        Headers, interfaces, wrappers
  spike.fi                     Fortran 90 module interface
  spike.h, spike_c_wrapper.h   C headers
  spike_mpi_comm.f90           Source of MPI wrapper
lib/64                         Itanium2® static libraries
  libguide.a                   Intel® Legacy OpenMP run-time library for static linking
  libguide.so                  Intel® Legacy OpenMP run-time library for dynamic linking
  libmkl_core.a                Kernel library for IA-64 architecture
  libmkl_core.so               Library dispatcher for dynamic load of processor-specific kernel library
  libmkl_intel_lp64.a          LP64 interface library for Intel compiler
  libmkl_intel_lp64.so         LP64 interface library for Intel compiler
  libmkl_intel_thread.a        Parallel drivers library supporting Intel compiler
  libmkl_intel_thread.so       Parallel drivers library supporting Intel compiler

  libmkl_lapack.a              Dummy library. Contains references to Intel MKL libraries
  libmkl.so                    Dummy library. Contains references to Intel MKL libraries
  libmkl_solver.a              Dummy library. Contains references to Intel MKL libraries
  libmkl_solver_lp64.a         Sparse Solver, Interval Solver, and GMP routines library supporting LP64 interface
  libspike.a                   Spike Core routines
  libspike_adapt.a             Spike Adapt routines
  libspike_adapt_de.so         Spike Adapt routines, performance model specific
  libspike_adapt_grid_f.a      Spike Adapt routines, grid specific
  libspike_mpi_comm.a          Default MPI wrapper copied from libspike_mpi_comm_intelmpi.a. Users can build their own. See Appendix C for details
  libspike_mpi_comm_intelmpi.a MPI wrapper supporting Intel MPI Library for Linux
  libspike_mpi_comm_mpich1.a   MPI wrapper supporting MPICH 1
  libspike_mpi_comm_mpich2.a   MPI wrapper supporting MPICH 2
  libspike_mpi_comm_openmpi.a  MPI wrapper supporting Open MPI
lib/em64t                      Intel64® static libraries
  libguide.a                   Intel® Legacy OpenMP run-time library for static linking
  libguide.so                  Intel® Legacy OpenMP run-time library for dynamic linking
  libmkl_core.a                Kernel library for Intel64® architecture
  libmkl_core.so               Library dispatcher for dynamic load of processor-specific kernel library
  libmkl_intel_lp64.a          LP64 interface library for Intel compiler
  libmkl_intel_lp64.so         LP64 interface library for Intel compiler
  libmkl_intel_thread.a        Parallel drivers library supporting Intel compiler
  libmkl_intel_thread.so       Parallel drivers library supporting Intel compiler
  libmkl_lapack.a              Dummy library. Contains references to Intel MKL libraries
  libmkl.so                    Dummy library. Contains references to Intel MKL libraries
  libmkl_solver.a              Dummy library. Contains references to Intel MKL libraries

  libmkl_solver_lp64.a         Sparse Solver, Interval Solver, and GMP routines library supporting LP64 interface
  libspike.a                   Spike Core routines
  libspike_adapt.a             Spike Adapt routines
  libspike_adapt_de.so         Spike Adapt routines, performance model specific
  libspike_adapt_grid_f.a      Spike Adapt routines, grid specific
  libspike_mpi_comm.a          Default MPI wrapper copied from libspike_mpi_comm_intelmpi.a. Users can build their own. See Appendix C for details
  libspike_mpi_comm_intelmpi.a MPI wrapper supporting Intel MPI Library for Linux
  libspike_mpi_comm_mpich1.a   MPI wrapper supporting MPICH 1
  libspike_mpi_comm_mpich2.a   MPI wrapper supporting MPICH 2
  libspike_mpi_comm_openmpi.a  MPI wrapper supporting Open MPI
spike_adapt/64                 Itanium2® Spike Adapt data files
  de                           Subdirectory, calibration data files
spike_adapt/em64t              Intel64® Spike Adapt data files
  de                           Subdirectory, calibration data files
tools/environment              Initialization shell scripts
  spikevars64.csh              Itanium2® platforms; C shell
  spikevars64.sh               Itanium2® platforms; Bourne shell
  spikevarsem64t.csh           Intel64® platforms; C shell
  spikevarsem64t.sh            Intel64® platforms; Bourne shell

Table 7.1: Detailed SPIKE directory structure

7.2 SPIKE and ScaLAPACK

This section is addressed to ScaLAPACK users who would like to experiment with SPIKE while making only minor changes to their code for solving dense banded linear systems (data in double precision). We describe a practical way to insert SPIKE calling sequences in place of ScaLAPACK ones. The ScaLAPACK calling sequences concerned by this migration procedure are:

For non-diagonally dominant systems
  PDGBSV: single calling sequence (factorization + solve)
  PDGBTRF, PDGBTRS: separated calling sequences (factorization and solve)

For diagonally dominant systems
  PDDBSV: single calling sequence (factorization + solve)
  PDDBTRF, PDDBTRS: separated calling sequences (factorization and solve)

As described in the documentation, SPIKE can also handle single or separated calling sequences. In contrast to ScaLAPACK, the diagonally dominant property does not involve new calling sequences but can be defined in the data structure matrix_data within the parameter mat%diagdo. Let us consider the following ScaLAPACK code:

    Call PDGBSV(N, BWL, BWU, NRHS, A, JA, DESCA, IPIV, B, IB, DESCB, WORK, LWORK, INFO)

where we assume that users are familiar with all the above parameters (as described in the ScaLAPACK user guide [3]). This calling sequence can be replaced by the following one:

    Call Spike(pspike, mat, B, info_spike)

where the parameters pspike, mat, and info_spike need to be declared at the beginning of the program as described in this documentation, while the parameter B, which contains the RHS and the solution, is identical to the ScaLAPACK one. Before the call to SPIKE, the other parameters need to be set as follows:

    pspike%rank = rank          ! with rank the user variable name for processor rank
    pspike%nbprocs = nb_procs   ! with nb_procs the user variable name
                                ! for # of processors
    call Spike_Default(pspike)
    pspike%tp = 1    ! data local distribution of type 1 is compatible with ScaLAPACK
    ! if the user wants to turn off spike adapt by pspike%autoadapt = .false.
    ! the user can select here his own Spike Core strategy (RSS, DFS, OIS)
    mat%format = 'D'   ! double precision data
    mat%astru = 'G'    ! general non symmetric
    mat%n = n          ! N as in ScaLAPACK
    mat%kl = bwl       ! BWL as in ScaLAPACK
    mat%ku = bwu       ! BWU as in ScaLAPACK
    mat%diagdo = 'N'   ! 'N' if ScaLAPACK command starts with PDGB
                       ! 'Y' if ScaLAPACK command starts with PDDB
    mat%aj = aa        ! AA is the matrix A in ScaLAPACK without extra space for pivoting
                       ! if mat%diagdo = 'Y', AA is identical to A and one can simply
                       ! use mat%aj => a (with attribute target for A)
                       ! if mat%diagdo = 'N', the user may first want to suppress the extra
                       ! storage space in the allocation of A and then
                       ! use mat%aj = a
    allocate(mat%sizeA(1:nb_procs))
    mat%sizeA(1:nb_procs-1) = DESCA(4)   ! ScaLAPACK variable
                                         ! size of the local partition
    mat%sizeA(nb_procs) = n - (nb_procs-1)*mat%sizeA(1)   ! size of the last partition

In the case of separated calling sequences, the setup of the above parameters is identical. Also, the BLACS commands introduced in ScaLAPACK are unnecessary for SPIKE and can be removed (SPIKE is independent of the BLACS library).

7.3 Spike_Default

Set the default values on all the applicable components within the type spike_param variable.

Syntax

    CALL Spike_Default(pspike)

Description

The routine assigns default values to those input and inout components of the type spike_param variable pspike that have a default. Other components remain unchanged.

Input Parameters

pspike   SPIKE data structure of type spike_param described in Section 2.2.

Output Parameters

pspike   SPIKE data structure described in Section 2.2. On exit, the components of pspike tabulated in Table 2.1 are assigned the default values specified there.

7.4 Spike

Spike solver driver: solves the complete system via one call.

Syntax

    CALL Spike(pspike,mat,f,info)

Description

The routine solves the system specified by a matrix contained in mat with the right-hand side(s) contained in f.

Input Parameters

pspike   SPIKE data structure of type spike_param described in Section 2.2.
mat      matrix data structure of type matrix_data described in Section 2.3 and Chapter 5.
f        double precision array containing the right-hand side(s). Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.

Output Parameters

pspike   SPIKE data structure described in Section 2.2.
f        the computed solution of the system.
info     returns the error code. If info = 0, the execution is successful. If info ≠ 0, SPIKE encountered a problem and has stopped unexpectedly; a detailed description of the error codes is presented in Section 7.11.

7.5 Spike_Begin

Begin the calling sequence.

Syntax

    CALL Spike_Begin(pspike,mat,pre,info)

Description

The routine partitions the matrix and allocates a work table for SPIKE. Moreover, Spike Adapt may be invoked in this routine.

Input Parameters

pspike   SPIKE data structure of type spike_param described in Section 2.2. On entry, if pspike%autoadapt is true, Spike Adapt will be invoked to select a SPIKE strategy.
mat      matrix data structure of type matrix_data described in Section 2.3 and Chapter 5.
pre      preconditioner data structure of type matrix_data. The use of a banded preconditioner is described in Chapter 4.

Output Parameters

pspike   SPIKE data structure described in Section 2.2. On exit, if Spike Adapt was invoked, pspike%dfs, pspike%rss and pspike%ois will be updated.
mat      matrix data structure described in Section 2.3. If the matrix is defined with global data as input, on exit mat will contain the local partitioning of the matrix on each processor (the memory of the global matrix on rank 0 is deallocated if pspike%memfree is set to true).
pre      Contents set by Spike_Begin. It contains the local partitioning of the preconditioner (it may just be a copy of the matrix) that will be used in Spike_Preprocess.
info     returns the error code. If info = 0, the execution is successful. If info ≠ 0, SPIKE encountered a problem and has stopped unexpectedly; a detailed description of the error codes is presented in Section 7.11.

7.6 Spike_Preprocess

Preprocess the preconditioner data.

Syntax

    CALL Spike_Preprocess(pspike,pre,info)

Description

The routine factorizes the preconditioner pre using the SPIKE strategy specified in pspike. Note that pre can be an explicit preconditioner supplied by the user or simply a copy (made automatically by SPIKE) of the original system.

Input Parameters

pspike   SPIKE data structure described in Section 2.2.
pre      the output of the preceding Spike_Begin call.

Output Parameters

pspike   SPIKE data structure described in Section 2.2.
pre      Contents modified. It contains the factorization of the preconditioner, ready to be used in Spike_Process any number of times.
info     returns the error code. If info = 0, the execution is successful. If info ≠ 0, SPIKE encountered a problem and has stopped unexpectedly; a detailed description of the error codes is presented in Section 7.11.

7.7 Spike_Process

Process the matrix, preconditioner and the right-hand side.

Syntax

    CALL Spike_Process(pspike,mat,pre,f,info)

Description

The routine solves the reduced system and then retrieves the overall solution. In this version of SPIKE, the solver includes outer iterations. The preconditioner is defined by pre, and the original matrix is defined by mat. The routine Spike_Process can be called repeatedly for applications that involve iterations with changing right-hand sides f but with the same original matrix of coefficients.

Input Parameters

pspike   SPIKE data structure described in Section 2.2.
mat      matrix data structure. On entry, the matrix data should have been processed by a previous Spike_Begin call, so that the data have been distributed to all processors.
pre      set up by Spike_Preprocess in a previous call.
f        On entry, f stores the right-hand side. Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.

Output Parameters

pspike   SPIKE data structure described in Section 2.2.
f        On exit, f stores the solution of the system. Depending on the value of pspike%tp, f may be global on rank 0 or locally distributed on each processor.
info     returns the error code. If info = 0, the execution is successful. If info ≠ 0, SPIKE encountered a problem and has stopped unexpectedly; a detailed description of the error codes is presented in Section 7.11.
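The split between Spike_Preprocess and Spike_Process follows the common factor-once, solve-many pattern: the expensive factorization is done a single time and every new right-hand side only pays for the solve. The sketch below illustrates that pattern in plain C with a serial tridiagonal LU factorization without pivoting. It is a stand-in for illustration only and does not use the SPIKE API; tri_factors, tri_factor, and tri_solve are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TRI_MAX 8   /* fixed capacity keeps the sketch free of malloc */

/* LU factors of a tridiagonal matrix, computed without pivoting. */
typedef struct {
    size_t n;
    double l[TRI_MAX];   /* multipliers, l[i] belongs to row i (i >= 1) */
    double u[TRI_MAX];   /* pivots */
    double c[TRI_MAX];   /* superdiagonal, copied unchanged */
} tri_factors;

/* Factor once. a = subdiagonal (a[0] unused), b = diagonal,
   c = superdiagonal (c[n-1] unused). */
static void tri_factor(size_t n, const double *a, const double *b,
                       const double *c, tri_factors *f)
{
    assert(n >= 1 && n <= TRI_MAX);
    f->n = n;
    f->u[0] = b[0];
    for (size_t i = 1; i < n; ++i) {
        f->l[i] = a[i] / f->u[i - 1];
        f->u[i] = b[i] - f->l[i] * c[i - 1];
    }
    for (size_t i = 0; i + 1 < n; ++i)
        f->c[i] = c[i];
}

/* Solve many: overwrite rhs with the solution, reusing the stored factors. */
static void tri_solve(const tri_factors *f, double *rhs)
{
    size_t n = f->n;
    for (size_t i = 1; i < n; ++i)          /* forward substitution */
        rhs[i] -= f->l[i] * rhs[i - 1];
    rhs[n - 1] /= f->u[n - 1];
    for (size_t i = n - 1; i-- > 0; )       /* back substitution */
        rhs[i] = (rhs[i] - f->c[i] * rhs[i + 1]) / f->u[i];
}
```

Here tri_factor plays the role of the preprocessing stage and tri_solve that of the processing stage: it can be called for any number of right-hand sides without refactoring.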

7.8 Spike_End

End of the calling sequence.

Syntax

    CALL Spike_End(pspike,mat,pre,info)

Description

The routine clears the memory space, deallocating all local partitioning for mat and pre.

Input Parameters

pspike   SPIKE data structure described in Section 2.2.
mat      matrix data structure described in Section 2.3.
pre      preconditioner data structure.

Output Parameters

pspike   SPIKE data structure described in Section 2.2.
mat      matrix data structure described in Section 2.3. On exit, several components of mat are deallocated.
pre      On exit, pre is deallocated.
info     returns the error code. If info = 0, the execution is successful. If info ≠ 0, SPIKE encountered a problem and has stopped unexpectedly; a detailed description of the error codes is presented in Section 7.11.

7.9 spike_param details

The type spike_param has a number of input components whose possible default values are listed in Table 2.1. Furthermore, this type has a number of output components, which are listed in Table 7.2.

Component            Type (Intent)    Distribution   Description
boost                logical (out)    local          Returns true if a zero pivot is detected (the pivot is compared against a threshold ε)
nb_boost             integer (out)    global         # of boosts performed
nbit_out0            integer (out)    global         # of outer iterations
nbit_in0             integer (out)    global         # of inner iterations
memory               double (out)     global         Total amount of memory (in Mb) needed by Spike Core
maxres               double (out)     global         If the component residual is set to true, returns the maximum relative residual over all right-hand sides
failed               logical (out)    global         Returns true if Spike Core fails to reach the accuracy specified in the eps_out component
error_code           integer (out)    global         If info ≠ 0 in the SPIKE calling sequences, returns the error code as presented in Section 7.11

Output components for timing information, available if the timing component is set to true:

tspike_adapt         double (out)     global         Time spent in Spike Adapt
tspike_preparation   double (out)     global         Preparation time (with Spike Adapt)
tspike_prep          double (out)     global         Preprocessing time
tspike_process       double (out)     global         Processing time
tspike_residual      double (out)     global         Time spent computing the residual

Table 7.2: List of output components for the derived type spike_param. A variable of this type can be local on each partition or global (i.e. common to all partitions).

7.10 matrix_data details

The derived type matrix_data is used for storage of matrices. In SPIKE 1.0, it is used exclusively for the matrix representing the linear system. In the future, the user will be able to use this type to explicitly store a separate matrix used as a preconditioner to the linear system. The components of this type and their meaning are given earlier in this guide (see Section 2.3 and Chapter 5).

7.11 info details

Errors and warnings encountered during a run of SPIKE are stored in an integer variable, info. All MPI, LAPACK and PARDISO errors are fatal; in other words, execution of the program is terminated if such an error is encountered. Other possible sources of warnings and errors are Spike Core and Spike Adapt. If the output info parameter is not zero, either an error (info < 0) or a warning (info > 0) was encountered. The possible return values for the info parameter are given in Table 7.3.

info   Classification   Description
 3     Warning          Spike Adapt could not make a prediction
 2     Warning          A zero pivot has been detected; OIS has been set to 3 due to boosting
 1     Warning          This matrix (or preconditioner, if any) is not narrow banded; this will affect SPIKE performance
 0                      Successful exit
-1     Error            Spike Core error
-2     Error            Spike Adapt error
-3     Error            MPI error
-4     Error            LAPACK error
-5     Error            PARDISO error

Table 7.3: SPIKE return code descriptions for the parameter info

If info < 0, the user can determine whether Spike Core, Spike Adapt, MPI, LAPACK, or PARDISO is responsible for the unexpected termination. The corresponding error code is stored in the component pspike%error_code. Please refer to Table 7.4 for possible return codes in pspike%error_code if a fatal error is encountered in Spike Core (info = -1), and similarly refer to Table 7.5 if a fatal error is encountered in Spike Adapt (info = -2). When info equals -3, -4, or -5, the error code is also stored in pspike%error_code, and the user should consult the MPI, LAPACK, or PARDISO documentation, respectively.
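The classification in Table 7.3 can be applied mechanically in calling code. The following is a minimal sketch in C, assuming only the sign convention and the values -1 to -5 given in that table. The type info_source and the helper functions are hypothetical names and are not part of the SPIKE headers.

```c
#include <assert.h>

/* Hypothetical classification of the component behind a negative info value. */
typedef enum {
    INFO_SRC_NONE = 0,      /* info >= 0: success or warning */
    INFO_SRC_SPIKE_CORE,    /* info = -1 */
    INFO_SRC_SPIKE_ADAPT,   /* info = -2 */
    INFO_SRC_MPI,           /* info = -3 */
    INFO_SRC_LAPACK,        /* info = -4 */
    INFO_SRC_PARDISO,       /* info = -5 */
    INFO_SRC_UNKNOWN        /* any other negative value */
} info_source;

static int info_is_warning(int info) { return info > 0; }
static int info_is_error(int info)   { return info < 0; }

/* Map info to the component that reported the error. */
static info_source info_error_source(int info)
{
    if (info >= 0)
        return INFO_SRC_NONE;
    if (info >= -5)
        return (info_source)(-info);   /* -1..-5 map onto enum values 1..5 */
    return INFO_SRC_UNKNOWN;
}

/* Human-readable name, e.g. for a log message. */
static const char *info_source_name(info_source s)
{
    static const char *const names[] = {
        "none", "Spike Core", "Spike Adapt", "MPI", "LAPACK", "PARDISO", "unknown"
    };
    assert((int)s >= INFO_SRC_NONE && (int)s <= INFO_SRC_UNKNOWN);
    return names[s];
}
```

A caller would typically test info_is_error(info) first and, if it holds, look up the detailed code in pspike%error_code (or the corresponding field of the C structure) using the table that matches the reported source.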

%error_code (info = -1)   Description
   0     Successful exit - Default value
-200     memory allocation error
-201     rho = 0, BiCGStab(out) failed
-202     omega = 0, BiCGStab(out) failed
-303     cannot select Spike Adapt if you want to use your own preconditioner (%BPS = [...])
[...]    the format of the preconditioner is incorrect, it should be pre%format = D or S
-305     the preconditioner should be banded
-306     the preconditioner should be the same size as the matrix
-307     if a preconditioner is used (option %BPS = 1), one needs to use iterative methods (%OIS ≠ 0)
-308     the preconditioner cannot be used with DFS = P
-309     either upper or lower bandwidth is too small for the size of the partitions
-310     number of processors has to be even for RSS = A or P
-313     the size of the matrix mat%n must be > [...]
[...]    mat%kl and mat%ku must be [...]
[...]    the format of the matrix is incorrect, it should be mat%format = D or S
-320     Spike Adapt cannot be selected if only one processor is used
-399     wrong value for %tp
-400     combinations (DFS, RSS) not supported by SPIKE
[...]    DFS = L or P are the only possible options if one processor is used
-402     DFS = A cannot be used here, see Table [...]
[...]    RSS = R cannot be used here, see Table [...]
[...]    only tp = 0 can handle a one-processor run

Table 7.4: SPIKE return code descriptions for %error_code

%error_code (info = -2)   Classification   Description
   1     Information   Spike Core strategy selected by grid lookup
   2     Information   Spike Core strategy selected by performance models
   3     Warning       Spike Core strategy selected arbitrarily
-310     Error         pspike%tp=2 requires an even number of MPI processes
-312     Error         pspike%tp=2 requires RSS = A
-313     Error         pspike%tp=1 cannot be used when RSS = A
-402     Error         Memory allocation failed during model evaluation
-403     Error         SPIKE_ADAPT_DATA environment variable not set
-404     Error         Error reading directory specified by SPIKE_ADAPT_DATA environment variable
-405     Error         Performance models not found in directory specified by SPIKE_ADAPT_DATA environment variable
-406     Error         Could not open performance models
-407     Error         Could not read performance models

Table 7.5: Spike Adapt return code descriptions for %error_code

Bibliography

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A portable linear algebra library for high-performance computers. Technical report, Knoxville, 1990.

[2] Michael W. Berry and Ahmed Sameh. Multiprocessor schemes for solving block tridiagonal linear systems. The International Journal of Supercomputer Applications, 1(3):37-57, 1988.

[3] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK: a linear algebra library for message-passing computers. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing (Minneapolis, MN, 1997), page 15 (electronic), Philadelphia, PA, USA, 1997. Society for Industrial and Applied Mathematics.

[4] S. C. Chen, D. J. Kuck, and A. H. Sameh. Practical parallel band triangular system solvers. ACM Transactions on Mathematical Software, 4(3), 1978.

[5] Jack J. Dongarra and Ahmed H. Sameh. On some parallel banded system solvers. Parallel Computing, 1(3), 1984.

[6] D. H. Lawrie and A. H. Sameh. The computation and communication complexity of a parallel banded system solver. ACM Trans. Math. Softw., 10(2), 1984.

[7] E. Polizzi and A. Sameh. Numerical parallel algorithms for large-scale nanoelectronics simulations using NESSIE. Journal of Computational Electronics, 3(3-4), 2005.

[8] Eric Polizzi and Ahmed H. Sameh. A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput., 32(2), 2006.

[9] Eric Polizzi and Ahmed H. Sameh. SPIKE: A parallel environment for solving banded linear systems. Computers & Fluids, 36(1), 2007.

[10] A. H. Sameh and D. J. Kuck. On stable parallel linear system solvers. J. ACM, 25(1):81-91.

[11] O. Schenk and K. Gärtner. Solving unsymmetric sparse systems of linear equations with PARDISO. Journal of Future Generation Computer Systems, 20(3).

Appendix A
Mathematical Description of Key Strategies

In the following sections, we outline the algorithms corresponding to the six (RSS,DFS) combinations supported in SPIKE 1.0. Since OIS is always 3 in the current release, and since BiCGStab is a well-documented method, we will not explain it here. The following descriptions assume four MPI processes.

(RSS, DFS, 3): Refine the solution of Ax = f using the BiCGStab iterative solver

    solve Ax = f via preconditioned BiCGStab (with preconditioner M);
        solve Mz = r using (RSS,DFS);
    end

The exact SPIKE factorization consists of $A = DS$. Each computational scheme, however, only produces an approximation $\tilde{D}$ of $D$ and $\tilde{S}$ of $S$. In other words, for solving $Ax = f$ via an iterative scheme we use $M = \tilde{D}\tilde{S}$ as a preconditioner. Here, $A = M + R$, where $R$ is a correction term. The preconditioner $M$ is defined as shown in Table A.1 for each (RSS, DFS) pair.

Table A.1: Preconditioners for different schemes (RSS,DFS)

Scheme   Preconditioner
TA       $M_{TA} = D_{TA} S_{TA}$
TU       $M_{TU} = D_{TU} S_{TU}$
FL       $M_{FL} = D_{FL} S_{FL}$
RL       $M_{RL} = D_{RL} S_{RL}$
RP       $M_{RP} = D S$
EA       $M_{EA} = D_{EA} S_{EA}$

Note that $D_{TA} = D_{EA}$ and $D_{FL} = D_{RL}$. The reduced system in FL is solved iteratively without forming the coefficient matrix explicitly. Also, in EA, the reduced system is solved iteratively and formed explicitly. The details of how diagonal and spike systems are treated are given in the following sections. Throughout, we present the solution process of $Az = r$, in which $z$ is the action $M^{-1} r$.

A.1 Az = r via TU

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1.

\[
A = \begin{pmatrix} A_1 & B_1 & & \\ C_2 & A_2 & B_2 & \\ & C_3 & A_3 & B_3 \\ & & C_4 & A_4 \end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix}, \qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{pmatrix}
\]

Figure A.1: Illustration of the partitioning of the linear system. Block row $j$ resides on process $j$ ($j = 1, \dots, 4$).

The TU scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting if needed):

\[ L_j U_j \leftarrow A_j \quad (j = 1, 2, 3), \qquad \hat{U}_j \hat{L}_j \leftarrow A_j \quad (j = 2, 3, 4). \]

2. Compute the tips of the spikes $V$, $W$ in Figure A.2 as follows. Solve for $V_j^{(b)}$, the bottom tip of the spike $V_j$, and for $W_j^{(t)}$, the top tip of the spike $W_j$, where

\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad (j = 1, 2, 3), \qquad
\hat{U}_j \hat{L}_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad (j = 2, 3, 4). \]

This process is described in detail in Figure A.3.

\[
S = \begin{pmatrix} I & V_1 & & \\ W_2 & I & V_2 & \\ & W_3 & I & V_3 \\ & & W_4 & I \end{pmatrix}
\]

Figure A.2: SPIKE matrix. Block row $j$ resides on process $j$.

Figure A.3: The bottom of the $V_j$ spike can be computed using only the bottom $m \times m$ blocks of $L$ and $U$. Similarly, the top of the $W_j$ spike may be obtained if one performs the UL-factorization.

3. Modify the RHS by solving:

\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad \hat{U}_j \hat{L}_j g_j = r_j \quad (j = 3, 4). \]

4. Solve the truncated, reduced system (block diagonal) via a direct scheme, where each block has the following form:

\[
\begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix}
\quad (j = 1, 2, 3).
\]

5. Solve

\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]

using the LU or UL factorization of $A_j$ ($j = 1, 2, 3, 4$; $C_1 = 0$; and $B_4 = 0$).
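The interplay of the spike tips, the reduced system, and the final retrieval in the TU steps above can be seen on the smallest possible case. The sketch below assumes a 4x4 tridiagonal system split into two partitions of size 2, so that m = 1 and the reduced system is a single 2x2 block. With only two partitions nothing is truncated, so the result is exact up to rounding. The 2x2 solves use Cramer's rule as a stand-in for the LU and UL solves, and all function names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Solve the 2x2 system A x = y by Cramer's rule (stands in for the LU solve). */
static void solve2(double A[2][2], const double y[2], double x[2])
{
    double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    assert(det != 0.0);
    x[0] = (y[0] * A[1][1] - A[0][1] * y[1]) / det;
    x[1] = (A[0][0] * y[1] - y[0] * A[1][0]) / det;
}

/* Two-partition SPIKE solve of [A1 B1; C2 A2] z = r, where B1 has the single
   entry b (last row of partition 1) and C2 the single entry c (first row of
   partition 2). */
static void spike2_solve(double A1[2][2], double A2[2][2],
                         double b, double c, const double r[4], double z[4])
{
    double g1[2], g2[2], V[2], W[2];
    double eb[2] = {0.0, b};   /* right-hand side (0; B_1) of the V spike */
    double ec[2] = {c, 0.0};   /* right-hand side (C_2; 0) of the W spike */

    solve2(A1, r, g1);         /* modified RHS g_1 */
    solve2(A2, r + 2, g2);     /* modified RHS g_2 */
    solve2(A1, eb, V);         /* spike V_1; V[1] is its bottom tip */
    solve2(A2, ec, W);         /* spike W_2; W[0] is its top tip */

    /* Reduced system [1 V_b; W_t 1] (z1_b; z2_t) = (g1_b; g2_t). */
    double det = 1.0 - V[1] * W[0];
    assert(det != 0.0);
    double z1b = (g1[1] - V[1] * g2[0]) / det;
    double z2t = (g2[0] - W[0] * g1[1]) / det;

    /* Retrieval: z_1 = g_1 - V_1 z2_t and z_2 = g_2 - W_2 z1_b. */
    z[0] = g1[0] - V[0] * z2t;
    z[1] = g1[1] - V[1] * z2t;
    z[2] = g2[0] - W[0] * z1b;
    z[3] = g2[1] - W[1] * z1b;
}

/* Infinity norm of A z - r for the same 4x4 system, used as a check. */
static double spike2_residual(double A1[2][2], double A2[2][2],
                              double b, double c, const double r[4],
                              const double z[4])
{
    double res[4] = {
        A1[0][0] * z[0] + A1[0][1] * z[1] - r[0],
        A1[1][0] * z[0] + A1[1][1] * z[1] + b * z[2] - r[1],
        c * z[1] + A2[0][0] * z[2] + A2[0][1] * z[3] - r[2],
        A2[1][0] * z[2] + A2[1][1] * z[3] - r[3]
    };
    double m = 0.0;
    for (int i = 0; i < 4; ++i)
        if (fabs(res[i]) > m)
            m = fabs(res[i]);
    return m;
}
```

In the four-partition TU scheme the same 2x2 block appears once per interface, and the coupling between neighboring blocks is what the truncation discards.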

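A.2 Az = r via FL

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1. The FL scheme consists of the following steps:

1. Compute the LU factorization without pivoting (apply diagonal boosting, if needed):

\[ L_j U_j \leftarrow A_j \quad (j = 1, 2, 3, 4). \]

2. Modify the RHS by solving:

\[ L_j U_j g_j = r_j \quad (j = 1, 2, 3, 4). \]

3. Solve the reduced system iteratively:

\[
\begin{pmatrix}
I_m & V_1^{(b)} & & & & \\
W_2^{(t)} & I_m & & V_2^{(t)} & & \\
W_2^{(b)} & & I_m & V_2^{(b)} & & \\
 & & W_3^{(t)} & I_m & & V_3^{(t)} \\
 & & W_3^{(b)} & & I_m & V_3^{(b)} \\
 & & & & W_4^{(t)} & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)} \end{pmatrix}
\]

where the actions of the multiplications with $W_j^{(t)}$, $W_j^{(b)}$, $V_j^{(t)}$ and $V_j^{(b)}$ are realized via

\[
\begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1} \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_j, \quad
\begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1} \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_j, \quad
\begin{pmatrix} I_m & 0 \end{pmatrix} A_j^{-1} \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_j, \quad
\begin{pmatrix} 0 & I_m \end{pmatrix} A_j^{-1} \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_j,
\]

respectively. This requires solving systems involving $A_j$ using the previously computed LU factorizations.

4. Solve

\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]

using the LU factorization of $A_j$ ($j = 1, 2, 3, 4$; $C_1 = 0$; and $B_4 = 0$).

A.3 Az = r via RL/RP

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.1. The RL/RP scheme consists of the following steps:

1. Compute the LU factorization with pivoting (RP) or without pivoting (RL); if no pivoting is used, apply diagonal boosting if needed:

\[ L_j U_j \leftarrow P_j A_j \quad (j = 1, 2, 3, 4; \; P_j = I \text{ for RL}). \]

2. Solve for $V_j$:

\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad (j = 1, 2, 3). \]

3. Solve for $W_j$:

\[ L_j U_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad (j = 2, 3, 4). \]

4. Modify the RHS by solving:

\[ L_j U_j g_j = r_j \quad (j = 1, 2, 3, 4). \]

5. Form the reduced system and partition it as follows:

\[
\begin{pmatrix}
I_m & & V_1^{(t)} & & & & & \\
 & I_m & V_1^{(b)} & & & & & \\
 & W_2^{(t)} & I_m & & V_2^{(t)} & & & \\
 & W_2^{(b)} & & I_m & V_2^{(b)} & & & \\
 & & & W_3^{(t)} & I_m & & V_3^{(t)} & \\
 & & & W_3^{(b)} & & I_m & V_3^{(b)} & \\
 & & & & & W_4^{(t)} & I_m & \\
 & & & & & W_4^{(b)} & & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(t)} \\ z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \\ z_3^{(b)} \\ z_4^{(t)} \\ z_4^{(b)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(t)} \\ g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \\ g_3^{(b)} \\ g_4^{(t)} \\ g_4^{(b)} \end{pmatrix},
\]

which, split into its first and last four block rows and columns, reads

\[
\begin{pmatrix} \tilde{A}_1 & \tilde{B}_1 \\ \tilde{C}_2 & \tilde{A}_2 \end{pmatrix}
\begin{pmatrix} \tilde{z}_1 \\ \tilde{z}_2 \end{pmatrix}
=
\begin{pmatrix} \tilde{g}_1 \\ \tilde{g}_2 \end{pmatrix}.
\]

6. Solve for $\tilde{V}_1$ and $\tilde{W}_2$ in

\[ \tilde{A}_1 \tilde{V}_1 = \begin{pmatrix} 0 \\ \tilde{B}_1 \end{pmatrix}, \qquad \tilde{A}_2 \tilde{W}_2 = \begin{pmatrix} \tilde{C}_2 \\ 0 \end{pmatrix}. \]

7. Modify the RHS:

\[ h_1 = \tilde{A}_1^{-1} \tilde{g}_1 \qquad \text{and} \qquad h_2 = \tilde{A}_2^{-1} \tilde{g}_2. \]

8. Solve the reduced system via a direct scheme:

\[
\begin{pmatrix} I_m & \tilde{V}_1^{(b)} \\ \tilde{W}_2^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} \tilde{z}_1^{(b)} \\ \tilde{z}_2^{(t)} \end{pmatrix}
=
\begin{pmatrix} h_1^{(b)} \\ h_2^{(t)} \end{pmatrix}.
\]

9. Retrieve $\tilde{z}_1$ and $\tilde{z}_2$:

\[ \tilde{z}_1 = h_1 - \tilde{V}_1 \tilde{z}_2^{(t)}, \qquad \tilde{z}_2 = h_2 - \tilde{W}_2 \tilde{z}_1^{(b)}. \]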

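10. Retrieve $z_j$ ($j = 1, 2, 3, 4$):

\[ z_j = g_j - V_j z_{j+1}^{(t)} - W_j z_{j-1}^{(b)} \qquad (V_4 = 0 \text{ and } W_1 = 0). \]

A.4 Az = r via TA

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.4.

\[
A = \begin{pmatrix} A_1 & B_1 & \\ C_2 & A_2 & B_2 \\ & C_3 & A_3 \end{pmatrix}, \qquad
z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix}, \qquad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}
\]

Figure A.4: Illustration of the partitioning of the linear system. Block row 1 resides on process 1, block row 2 on processes 2 and 4, and block row 3 on process 3.

The TA scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting, if needed):

\[ L_j U_j \leftarrow A_j \quad (j = 1, 2; \text{ processes } 1, 2), \qquad \hat{U}_j \hat{L}_j \leftarrow A_j \quad (j = 2, 3; \text{ processes } 4, 3). \]

2. Solve for $V_j^{(b)}$, the bottom tip of the spike $V_j$, where

\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad (j = 1, 2). \]

3. Solve for $W_j^{(t)}$, the top tip of the spike $W_j$, where

\[ \hat{U}_j \hat{L}_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad (j = 2, 3). \]

This process is described in detail in Figure A.3.

4. Modify the RHS by solving:

\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad \hat{U}_j \hat{L}_j g_j = r_j \quad (j = 3). \]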

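5. Solve the truncated reduced system (block diagonal) via a direct scheme, where each block has the following form:

\[
\begin{pmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} z_j^{(b)} \\ z_{j+1}^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{pmatrix}
\quad (j = 1, 2).
\]

6. Solve

\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]

using the LU or UL factorization of $A_j$ ($j = 1, 2, 3$; $C_1 = 0$; and $B_3 = 0$).

A.5 Az = r via EA

The matrix, RHS, and solution are distributed among the MPI processes as shown in Figure A.4. The EA scheme consists of the following steps:

1. Compute the LU and UL factorizations without pivoting (apply diagonal boosting if needed):

\[ L_j U_j \leftarrow A_j \quad (j = 1, 2; \text{ processes } 1, 2), \qquad \hat{U}_j \hat{L}_j \leftarrow A_j \quad (j = 2, 3; \text{ processes } 4, 3). \]

2. Solve for $V_j$:

\[ L_j U_j V_j = \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} \quad (j = 1, 2). \]

3. Solve for $W_j$:

\[ \hat{U}_j \hat{L}_j W_j = \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} \quad (j = 2, 3). \]

4. Modify the RHS by solving:

\[ L_j U_j g_j = r_j \quad (j = 1, 2) \qquad \text{and} \qquad \hat{U}_j \hat{L}_j g_j = r_j \quad (j = 3). \]

5. Solve the reduced system via preconditioned BiCGStab:

\[
\begin{pmatrix}
I_m & V_1^{(b)} & & \\
W_2^{(t)} & I_m & & V_2^{(t)} \\
W_2^{(b)} & & I_m & V_2^{(b)} \\
 & & W_3^{(t)} & I_m
\end{pmatrix}
\begin{pmatrix} z_1^{(b)} \\ z_2^{(t)} \\ z_2^{(b)} \\ z_3^{(t)} \end{pmatrix}
=
\begin{pmatrix} g_1^{(b)} \\ g_2^{(t)} \\ g_2^{(b)} \\ g_3^{(t)} \end{pmatrix}
\]

with a truncated preconditioner

\[
M_r = \begin{pmatrix}
I_m & V_1^{(b)} & & \\
W_2^{(t)} & I_m & & \\
 & & I_m & V_2^{(b)} \\
 & & W_3^{(t)} & I_m
\end{pmatrix}.
\]

6. Solve

\[
A_j z_j = r_j - \begin{pmatrix} 0 \\ 0 \\ B_j \end{pmatrix} z_{j+1}^{(t)} - \begin{pmatrix} C_j \\ 0 \\ 0 \end{pmatrix} z_{j-1}^{(b)}
\]

using the LU or UL factorization of $A_j$ ($j = 1, 2, 3$; $C_1 = 0$; and $B_3 = 0$).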
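Most schemes in this appendix begin by factoring the diagonal blocks without pivoting and boosting the diagonal if needed. The guide does not spell out the boosting rule here, so the sketch below assumes a common form: a pivot whose magnitude falls below eps times the 1-norm of the block is pushed away from zero by that amount, keeping its sign. The function name lu_boost, the fixed size LU_N, and the exact threshold are assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>

#define LU_N 3   /* small fixed size keeps the sketch self-contained */

/* In-place LU factorization without pivoting. On exit the strict lower
   triangle holds L (unit diagonal implied) and the upper triangle holds U.
   Returns the number of pivots that had to be boosted. */
static int lu_boost(double a[LU_N][LU_N], double eps)
{
    assert(eps > 0.0);

    /* 1-norm of the block: maximum absolute column sum. */
    double norm1 = 0.0;
    for (int j = 0; j < LU_N; ++j) {
        double s = 0.0;
        for (int i = 0; i < LU_N; ++i)
            s += fabs(a[i][j]);
        if (s > norm1)
            norm1 = s;
    }
    assert(norm1 > 0.0);

    int nb_boost = 0;
    for (int k = 0; k < LU_N; ++k) {
        /* Assumed boosting rule: shift a tiny pivot by eps * ||A||_1. */
        if (fabs(a[k][k]) < eps * norm1) {
            a[k][k] += (a[k][k] >= 0.0 ? eps : -eps) * norm1;
            ++nb_boost;
        }
        for (int i = k + 1; i < LU_N; ++i) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < LU_N; ++j)
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
    return nb_boost;
}
```

Because a boosted factorization is the factorization of a slightly perturbed matrix, the result is only approximate, which is consistent with the schemes being used as preconditioners inside an outer iteration.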

Appendix B
How Spike Adapt Works

B.1 Why is Spike Adapt Necessary?

Spike Core is a poly-algorithm implementing many different strategies. The RSS, DFS, and OIS parameters can take many different values, leading to numerous possibilities. Selecting an optimal strategy requires detailed knowledge of Spike Core. For example, what strategies are best when the matrix is not diagonally dominant? How does the matrix bandwidth affect the choice of strategy? Spike Adapt relieves users of questions like these. It is designed to select an optimal strategy based on the following matrix characteristics: matrix size, bandwidth, sparsity, and diagonal dominance. It also takes the number of MPI processes and the type of partitioning into account when making a decision (Table 2.4).

B.2 How Does Spike Adapt Work?

Spike Adapt automatically sets the RSS, DFS, and OIS parameters when the autoadapt element of the spike_param structure is set to true. It currently supports six Spike Core strategies (RSS,DFS): TU, RL, RP, FL, TA, and EA. Note that OIS is basically orthogonal to (RSS,DFS). Moreover, for SPIKE 1.0, OIS is always set to 3 (BiCGStab) and FL is always chosen when the input matrix is in CSR format.

Spike Adapt uses a three-step selection process. It first checks the type of matrix partitioning and the number of MPI processes to determine which strategies are allowed (Table 2.4). Next, it performs a grid lookup based on the matrix size, bandwidth, and diagonal dominance (Figure B.1). The optimal Spike Core strategy for some matrices is best determined by a grid lookup. However, if the grid does not enclose the current matrix, Spike Adapt evaluates performance models for the relevant Spike Core strategies and decides which is best. If neither the grid lookup nor the performance models can make a selection, a Spike Core strategy will be chosen arbitrarily. However, this should be rare and usually indicates a problem in Spike Adapt.
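The three-step selection described above amounts to a cascade of fallbacks. The sketch below assumes that the results of each step are already available in a small input structure. The types strategy and adapt_inputs, the function select_strategy, the encoding of a missing prediction as a negative time, and the choice of the first allowed strategy as the arbitrary pick are all hypothetical. Only the status values 1, 2, and 3 are taken from this guide.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    STRAT_NONE = -1,
    STRAT_TU, STRAT_RL, STRAT_RP, STRAT_FL, STRAT_TA, STRAT_EA,
    STRAT_COUNT
} strategy;

typedef struct {
    int allowed[STRAT_COUNT];        /* step 1: nonzero if partitioning and
                                        process count permit the strategy */
    strategy grid_choice;            /* step 2: grid lookup result, STRAT_NONE
                                        if the grid does not enclose the matrix */
    double model_time[STRAT_COUNT];  /* step 3: predicted run time, negative
                                        if the model gives no prediction */
} adapt_inputs;

/* Returns the selected strategy and writes how it was selected to *status:
   1 = grid lookup, 2 = performance models, 3 = arbitrary. */
static strategy select_strategy(const adapt_inputs *in, int *status)
{
    assert(in != NULL && status != NULL);

    /* Grid lookup wins when it yields an allowed strategy. */
    if (in->grid_choice != STRAT_NONE && in->allowed[in->grid_choice]) {
        *status = 1;
        return in->grid_choice;
    }

    /* Otherwise take the allowed strategy with the smallest predicted time. */
    strategy best = STRAT_NONE;
    for (int s = 0; s < STRAT_COUNT; ++s) {
        if (!in->allowed[s] || in->model_time[s] < 0.0)
            continue;
        if (best == STRAT_NONE || in->model_time[s] < in->model_time[best])
            best = (strategy)s;
    }
    if (best != STRAT_NONE) {
        *status = 2;
        return best;
    }

    /* Last resort: an arbitrary allowed strategy (here simply the first). */
    for (int s = 0; s < STRAT_COUNT; ++s) {
        if (in->allowed[s]) {
            *status = 3;
            return (strategy)s;
        }
    }

    *status = 0;   /* nothing allowed: a case the guide does not describe */
    return STRAT_NONE;
}
```

Checking the allowed set before every choice keeps a grid entry or a fast model prediction from selecting a strategy that the partitioning type or the process count rules out.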


Figure B.1: This schematic illustrates how Spike Adapt might select an optimal Spike Core strategy using grid lookup. The horizontal and vertical axes represent two of the relevant matrix characteristics (e.g., matrix size and bandwidth). If the grid encloses this matrix, an optimal Spike Core strategy, represented by the different colors, is selected based on proximity.

B.3 Spike Adapt Return Codes

In the event of an error, Spike Adapt sets info=-1 and returns to Spike Core. The actual error code is stored in the ierr_spike_adapt parameter of the spike_param structure. Spike Adapt error codes range from -499 to -400. The meaning of each error code is shown below. Spike Adapt sets info=0 if it is able to select a Spike Core strategy.

In general, knowing how Spike Adapt selects a particular Spike Core strategy is unimportant. However, this knowledge could be useful if the user suspects that Spike Adapt is choosing a suboptimal strategy. The spike_adapt_status parameter of the spike_param structure tells how the Spike Core strategy was selected:

spike_adapt_status   Description
   1                 Grid lookup used to select Spike Core strategy
   2                 Performance models used to select Spike Core strategy
   3                 The Spike Core strategy was selected arbitrarily
-402                 Spike Adapt could not allocate memory
-403                 SPIKE_ADAPT_DATA environment variable not set
-404                 Directory containing Spike Adapt performance models not found
-405                 Spike Adapt model files not found
-406                 Could not open Spike Adapt model files
-407                 Error reading Spike Adapt model files

Table B.1: Spike Adapt Return Codes

As mentioned above, arbitrary selection usually indicates a Spike Adapt problem that should be reported to technical support.
