Sparse Direct Solvers


1 Sparse Direct Solvers — Alfredo Buttari (with slides from Prof. P. Amestoy, Dr. J.-Y. L'Excellent and Dr. B. Uçar), alfredo.buttari@enseeiht.fr

2 Sparse Matrix Factorizations
$A \in \mathbb{R}^{m \times m}$, symmetric positive definite $\rightarrow LL^T = A$, solve $Ax = b$
$A \in \mathbb{R}^{m \times m}$, symmetric $\rightarrow LDL^T = A$, solve $Ax = b$
$A \in \mathbb{R}^{m \times m}$, unsymmetric $\rightarrow LU = A$, solve $Ax = b$
$A \in \mathbb{R}^{m \times n}$, $m \neq n$ $\rightarrow QR = A$: $\min_x \|Ax - b\|$ if $m > n$; $\min \|x\|$ such that $Ax = b$ if $n > m$

3 Factorization of sparse matrices: problems
The factorization of a sparse matrix is problematic due to the presence of fill-in. The basic LU step:
$a^{(k+1)}_{i,j} = a^{(k)}_{i,j} - a^{(k)}_{i,k}\, a^{(k)}_{k,j} / a^{(k)}_{k,k}$
Even if $a^{(k)}_{i,j}$ is null, $a^{(k+1)}_{i,j}$ can be a nonzero (a fill-in).
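
To make fill-in concrete, here is a small NumPy sketch (an illustration, not from the slides) using the classic arrowhead example: eliminating the dense variable first fills the factor completely, while eliminating it last produces no fill-in at all.

```python
import numpy as np

# Arrowhead matrix: nonzeros only in the first row/column and on the
# diagonal; strictly diagonally dominant, hence SPD.
n = 5
A = 5.0 * np.eye(n)
A[0, :] = 1.0
A[:, 0] = 1.0
A[0, 0] = 5.0

L = np.linalg.cholesky(A)
print(np.count_nonzero(np.abs(L) > 1e-12))   # 15: fully dense factor

# Eliminating the "arrow" variable last avoids all fill-in:
p = np.arange(n)[::-1]
Lp = np.linalg.cholesky(A[np.ix_(p, p)])
print(np.count_nonzero(np.abs(Lp) > 1e-12))  # 9: diagonal + last row only
```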

4 Factorization of sparse matrices: problems
Which kind of problems does fill-in pose?
- the factorization is more expensive
- a higher amount of memory is required
- more complicated algorithms are needed to achieve the factorization
The amount of fill-in must be
- predicted, using elimination trees
- reduced (possibly), using ordering algorithms
These steps, moreover, must complete much faster than the actual factorization. The basic tools to achieve all this are GRAPHS.

5 Graph theory definitions and notions

6 Graph notations and definitions
A graph $G = (V, E)$ consists of a finite set $V$, called the vertex set, and a finite, binary relation $E$ on $V$, called the edge set. Three standard graph models:
- Undirected graph: the edges are unordered pairs of vertices, i.e., $\{u, v\} \in E$ for some $u, v \in V$.
- Directed graph: the edges are ordered pairs of vertices, that is, $(u, v)$ and $(v, u)$ are two different edges.
- Bipartite graph: $G = (U \cup V, E)$ consists of two disjoint vertex sets $U$ and $V$ such that for each edge $(u, v) \in E$, $u \in U$ and $v \in V$.
An ordering or labelling of $G = (V, E)$ having $n$ vertices, i.e., $|V| = n$, is a mapping of $V$ onto $1, 2, \ldots, n$.

7 Matrices and graphs: Rectangular matrices
The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph. For rectangular matrices the model is a bipartite graph: the nodes corresponding to the rows of the matrix are grouped into a vertex set $R$ and the columns into the other vertex set $C$, such that for each $a_{ij} \neq 0$, $(r_i, c_j)$ is an edge.

8 Matrices and graphs: Square unsymmetric pattern
The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph. For square matrices with unsymmetric pattern two models are possible: a bipartite graph as before, or a directed graph in which the set of rows/cols corresponds to the vertex set $V$ such that for each $a_{ij} \neq 0$, $(v_i, v_j)$ is an edge. A transposed view is possible too, i.e., the edge $(v_i, v_j)$ directed from column $i$ to row $j$. Usually self-loops are omitted.

9 Matrices and graphs: Symmetric pattern
The rows/columns and nonzeros of a given sparse matrix correspond (with natural labelling) to the vertices and edges, respectively, of a graph. For square matrices with symmetric pattern, bipartite and directed graphs can be used as before; the natural model is an undirected graph in which the set of rows/cols corresponds to the vertex set $V$ such that for each $a_{ij}, a_{ji} \neq 0$, $\{v_i, v_j\}$ is an edge. No self-loops.
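
As an illustration of the undirected graph model, a minimal sketch (assuming SciPy) that builds the adjacency lists of a symmetric-pattern matrix, omitting self-loops:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Pattern of a small symmetric matrix (the diagonal is ignored, since
# the undirected graph model has no self-loops).
A = csr_matrix(np.array([[4., 1., 0., 1.],
                         [1., 4., 1., 0.],
                         [0., 1., 4., 1.],
                         [1., 0., 1., 4.]]))

def adjacency(A):
    """Undirected graph of a symmetric-pattern matrix: adj[i] = neighbors of i."""
    n = A.shape[0]
    adj = {i: set() for i in range(n)}
    rows, cols = A.nonzero()
    for i, j in zip(rows, cols):
        if i != j:                      # drop self-loops
            adj[i].add(j)
            adj[j].add(i)
    return adj

print(adjacency(A))   # {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
```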

10 Definitions: Edges, degrees, and paths
Many definitions for directed and undirected graphs are the same. We will use $(u, v)$ to refer to an edge of an undirected or directed graph to avoid repeated definitions.
An edge $(u, v)$ is said to be incident on the vertices $u$ and $v$. For any vertex $u$, the vertices in $adj(u) = \{v : (u, v) \in E\}$ are called the neighbors of $u$. The vertices in $adj(u)$ are said to be adjacent to $u$. The degree of a vertex is the number of edges incident on it.
A path $p$ of length $k$ is a sequence of vertices $\langle v_0, v_1, \ldots, v_k \rangle$ where $(v_{i-1}, v_i) \in E$ for $i = 1, \ldots, k$. The two end points $v_0$ and $v_k$ are said to be connected by the path $p$, and the vertex $v_k$ is said to be reachable from $v_0$.

11 Definitions: Components
An undirected graph is said to be connected if every pair of vertices is connected by a path. The connected components of an undirected graph are the equivalence classes of vertices under the "is reachable from" relation.
A directed graph is said to be strongly connected if every pair of vertices is reachable from each other. The strongly connected components of a directed graph are the equivalence classes of vertices under the "are mutually reachable" relation.

12 Definitions: Trees and spanning trees
A tree is a connected, acyclic, undirected graph. If an undirected graph is acyclic but disconnected, then it is a forest. Properties of trees: any two vertices are connected by a unique path, and $|E| = |V| - 1$.
A rooted tree is a tree with a distinguished vertex $r$, called the root. There is a unique path from the root $r$ to every other vertex $v$. Any vertex $y$ in that path is called an ancestor of $v$. If $y$ is an ancestor of $v$, then $v$ is a descendant of $y$. The subtree rooted at $v$ is the tree induced by the descendants of $v$, rooted at $v$.
A spanning tree of a connected graph $G = (V, E)$ is a tree $T = (V, F)$ such that $F \subseteq E$.

13 Ordering of the vertices of a rooted tree
A topological ordering of a rooted tree is an ordering that numbers children vertices before their parent. A preorder of a rooted tree is an ordering that numbers children vertices after their parent. A postorder is a topological ordering which numbers the vertices in any subtree consecutively.
[Figure: a connected graph G on vertices u, v, w, x, y, z; a rooted spanning tree with a topological ordering; the same tree with a postordering.]
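
A short sketch of a postorder computation on a rooted tree (a hypothetical helper; the children-list input format is an assumption):

```python
def postorder(children, root):
    """Postorder of a rooted tree given children lists: every subtree
    is numbered consecutively, parents after all their children."""
    order, stack = [], [(root, iter(children.get(root, ())))]
    while stack:
        node, it = stack[-1]
        child = next(it, None)
        if child is None:
            order.append(node)      # all children done -> emit node
            stack.pop()
        else:
            stack.append((child, iter(children.get(child, ()))))
    return order

# Tree rooted at 'z': z -> {y, v}, y -> {x}, x -> {u, w}
children = {'z': ['y', 'v'], 'y': ['x'], 'x': ['u', 'w']}
print(postorder(children, 'z'))   # ['u', 'w', 'x', 'y', 'v', 'z']
```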

14 Permutation matrices
A permutation matrix is a square $(0,1)$-matrix where each row and each column has a single 1. If $P$ is a permutation matrix, $PP^T = I$, i.e., it is an orthogonal matrix.
Let $A$ be a $3 \times 3$ matrix and suppose we want to permute its columns as $[2, 1, 3]$. Define $p_{2,1} = 1$, $p_{1,2} = 1$, $p_{3,3} = 1$, and $B = AP$ (if column $j$ is to be at position $i$, set $p_{ji} = 1$).
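
A small NumPy sketch of this rule (the matrix entries are made up for illustration; the slide's original example matrix did not survive transcription):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])

# Permute columns as [2, 1, 3] (1-based): column j goes to position i
# by setting P[j-1, i-1] = 1.
perm = [2, 1, 3]
P = np.zeros((3, 3))
for i, j in enumerate(perm):
    P[j - 1, i] = 1.0

B = A @ P
print(B)                                  # columns of A reordered as 2, 1, 3
print(np.allclose(P @ P.T, np.eye(3)))    # True: P is orthogonal
```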

15 Definitions: Reducibility
Reducible matrix: an $n \times n$ square matrix is reducible if there exists an $n \times n$ permutation matrix $P$ such that
$PAP^T = \begin{pmatrix} A_{11} & A_{12} \\ O & A_{22} \end{pmatrix}$,
where $A_{11}$ is an $r \times r$ submatrix and $A_{22}$ is an $(n-r) \times (n-r)$ submatrix, with $1 \le r < n$.
Irreducible matrix: there is no such permutation matrix.
Theorem: an $n \times n$ square matrix is irreducible iff its directed graph is strongly connected. Proof: follows by definition.
Why is reducibility important?

16 Definitions: Cliques and independent sets
Clique: in an undirected graph $G = (V, E)$, a set of vertices $S \subseteq V$ is a clique if for all $s, t \in S$ we have $(s, t) \in E$. In a symmetric matrix $A$, a clique corresponds to a subset of rows $R$ and the corresponding columns such that the matrix $A(R, R)$ is full.

17 Depth First Search of a graph
It is a searching algorithm that advances as deep as possible by exploring the children of a node before its siblings. It produces a spanning forest of the graph, and the order in which the nodes are visited corresponds to its preorder.

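A minimal iterative DFS sketch that returns both the preorder and the spanning forest (as a parent map); the adjacency-dict input format is an assumption:

```python
def dfs_forest(adj):
    """DFS over all vertices: returns the preorder and the spanning
    forest as a parent map (roots map to None)."""
    preorder, parent, seen = [], {}, set()
    for start in adj:
        if start in seen:
            continue
        stack = [(start, None)]
        while stack:
            u, p = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            parent[u] = p                 # tree edge of the spanning forest
            preorder.append(u)            # visit order = preorder
            for v in reversed(adj[u]):    # keep the natural child order
                if v not in seen:
                    stack.append((v, u))
    return preorder, parent

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(dfs_forest(adj))   # ([0, 1, 3, 2], {0: None, 1: 0, 3: 1, 2: 0})
```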

30 Analysis

31 Matrices and graphs
Predicting the structure of the factors helps in reducing the memory requirements, in achieving high performance, and in simplifying the algorithms. We will consider the Cholesky factorization $A = LL^T$: in this case, structural and numerical aspects are neatly separated. For the general case (e.g., LU factorization) pivoting is necessary and depends on the actual numerical values; there are combinatorial tools for these cases, but we will not cover them. Structure prediction algorithms should run, preferably, faster than the numerical computations that will follow.

32 Symmetric matrices and graphs
Assumptions: $A$ symmetric, and pivots are chosen on the diagonal. The structure of $A$ is represented by the graph $G = (V, E)$:
- vertices are associated to columns: $V = \{1, \ldots, n\}$
- edges $E$ are defined by: $(i, j) \in E \leftrightarrow a_{ij} \neq 0$
- $G$ is undirected (symmetry of $A$)

33 Symmetric matrices and graphs
Remarks:
- number of nonzeros in column $j$ = $|adj_G(j)|$
- symmetric permutation ≡ renumbering the graph
[Figure: a symmetric matrix and its corresponding graph.]

34 The elimination model for symmetric matrices
A symmetric, positive definite matrix can be factorized by means of the Cholesky algorithm:
for k = 1,...,n do
  $l_{kk} = \sqrt{a_{kk}^{(k-1)}}$
  for i = k+1,...,n do
    $l_{ik} = a_{ik}^{(k-1)} / l_{kk}$
    for j = k+1,...,i do
      $a_{ij}^{(k)} = a_{ij}^{(k-1)} - l_{ik} l_{jk}$
    end for
  end for
end for
[Example: a $4 \times 4$ symmetric matrix with lower-triangular entries $a_{11}, a_{21}, \ldots, a_{44}$.]
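
A direct NumPy transcription of the (right-looking) algorithm above — a dense sketch for illustration; sparse codes obviously avoid touching the zeros:

```python
import numpy as np

def cholesky_right_looking(A):
    """Right-looking Cholesky sketch: returns L (lower triangular)
    with A = L @ L.T for a symmetric positive definite A."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])                 # l_kk
        A[k+1:, k] /= A[k, k]                      # scale column k
        for j in range(k + 1, n):                  # rank-1 update of the
            A[j:, j] -= A[j:, k] * A[j, k]         # trailing submatrix
    return np.tril(A)

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
L = cholesky_right_looking(A)
print(np.allclose(L @ L.T, A))   # True
```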

35 The elimination graph model for symmetric matrices
Let $A$ be a symmetric positive definite matrix of order $n$. The $LL^T$ factorization can be described by the equation:
$A = A_0 = \begin{pmatrix} d_1 & v_1^T \\ v_1 & \bar A_1 \end{pmatrix} = \begin{pmatrix} \sqrt{d_1} & 0 \\ v_1/\sqrt{d_1} & I_{n-1} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & A_1 \end{pmatrix} \begin{pmatrix} \sqrt{d_1} & v_1^T/\sqrt{d_1} \\ 0 & I_{n-1} \end{pmatrix} = L_1 \begin{pmatrix} 1 & 0 \\ 0 & A_1 \end{pmatrix} L_1^T$,
where $A_1 = \bar A_1 - \frac{v_1 v_1^T}{d_1}$.
The basic step is then applied to $A_1$ to obtain $A_2$, and so on:
$A = (L_1 L_2 \cdots L_{n-1})\, I_n\, (L_{n-1}^T \cdots L_2^T L_1^T) = LL^T$

36 The basic step: $A_1 = \bar A_1 - \frac{v_1 v_1^T}{d_1}$
What is $v_1 v_1^T$ in terms of structure? $v_1$ is a column of $A$, hence it contains the neighbors of the corresponding vertex. $v_1 v_1^T$ results in a dense subblock in $A_1$, i.e., the elimination of a node results in the creation of a clique that connects all the neighbors of the eliminated node. If any of the nonzeros in this dense submatrix are not in $A$, then we have fill-ins.

37 The elimination process in the graphs
$G^0 \leftarrow (V, E)$ undirected graph of $A$
for k = 1 : n-1 do
  $V \leftarrow V - \{k\}$  {remove vertex k}
  $E \leftarrow E - \{(k, \ell) : \ell \in adj(k)\} \cup \{(x, y) : x \in adj(k)$ and $y \in adj(k)\}$
  $G^k \leftarrow (V, E)$
end for
The $G^k$ are the so-called elimination graphs (Parter, 61).
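
One elimination step can be sketched in a few lines (hypothetical helper names): eliminating a vertex merges its neighborhood into a clique and reports the fill edges created.

```python
def eliminate(adj, k):
    """One elimination-graph step: remove vertex k and turn its
    neighborhood into a clique. Returns the set of fill edges created."""
    nbrs = adj.pop(k)
    fill = set()
    for x in nbrs:
        adj[x].discard(k)
        for y in nbrs:
            if y != x and y not in adj[x]:
                adj[x].add(y)
                fill.add((min(x, y), max(x, y)))
    return fill

# Graph of a 4x4 "ring" matrix: edges 0-1, 1-2, 2-3, 3-0.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(eliminate(adj, 0))   # {(1, 3)}: eliminating 0 connects 1 and 3
print(adj)                 # {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
```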

38 A sequence of elimination graphs
[Figure: elimination graphs $G^0, G^1, G^2, G^3$ and the corresponding reduced matrices $A_0, A_1, A_2, A_3$, showing the progressive construction of the factor entries $l_{11}, l_{21}, l_{22}, \ldots, l_{66}$.]

39 Elimination process: Formal definitions
Deficiency of a vertex: $D(v)$ is the set of edges defined by
$D(v) = \{(x, y) : x \in adj(v),\ y \in adj(v),\ y \notin adj(x),\ x \neq y\}$
v-elimination graph: apply the elimination process to the vertex $v$ of $G$ to obtain
$G_v = (V - \{v\},\ E(V - \{v\}) \cup D(v))$.

40 Elimination process: Formal definitions
For a graph $G = (V, E)$, the elimination process
$P(G) = [G = G_0, G_1, G_2, \ldots, G_{n-1}]$
is the sequence of elimination graphs defined by $G_0 = G$, $G_i = (G_{i-1})_i$.
Let $G_i = (V_i, E_i)$ for $i = 0, 1, \ldots, n-1$. The fill-in $F(G)$ is defined by
$F(G) = \bigcup_{i=1}^{n-1} \Delta_i$, where $\Delta_i = D(i)$ in $G_{i-1}$,
and the filled graph is defined by $G^+ = (V, E \cup F(G))$.
For a matrix $A$, $\Delta_i$ corresponds to the new nonzero elements, the fill-ins, created during the $i$-th step of elimination.

41 Elimination process: Formal definitions
Continuing from the previous sample matrix, we have the filled graph $G^+(A)$.
[Figure: $G^+(A) = G(F)$, where $F = L + L^T$.]

42 Elimination process
Fill-path theorem [Rose, Tarjan, Lueker 76]: Let $G = (V, E)$ be an ordered graph. Then $(v, w)$ is an edge of $G^+ = (V, E \cup F(G))$ iff there exists a path $\mu = [v = v_1, v_2, \ldots, v_{k+1} = w]$ in $G$ such that $v_i < \min\{v, w\}$ for $2 \le i \le k$.

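A brute-force sketch of the theorem (illustration only, with an adjacency-dict input): an edge $(v, w)$ is filled iff $v$ and $w$ are joined by a path running through vertices numbered below $\min\{v, w\}$.

```python
def fill_edges(adj, n):
    """Fill-in via the fill-path theorem: (v, w) is a filled edge iff v
    and w are connected through vertices all numbered below min(v, w)."""
    fill = set()
    for v in range(n):
        for w in range(v + 1, n):
            if w in adj[v]:
                continue                      # already an edge of G
            stack, seen, found = [v], {v}, False
            while stack and not found:
                u = stack.pop()
                for x in adj[u]:
                    if x == w:
                        found = True
                        break
                    if x < v and x not in seen:   # intermediate < min(v, w)
                        seen.add(x)
                        stack.append(x)
            if found:
                fill.add((v, w))
    return fill

# Ring 0-1-2-3-0: eliminating in natural order fills (1, 3).
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(fill_edges(adj, 4))   # {(1, 3)}
```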

46 Set-up
Reminder:
- A spanning tree of a connected graph $G = (V, E)$ is a tree $T = (V, F)$ such that $F \subseteq E$.
- A topological ordering of a rooted tree is an ordering that numbers children vertices before their parent.
- A postorder is a topological ordering which numbers the vertices in any subtree consecutively.
Let $A$ be an $n \times n$ symmetric positive-definite and irreducible matrix, $A = LL^T$ its Cholesky factorization, and $G^+(A)$ its filled graph (the graph of $F = L + L^T$).

47 A first definition
Since $A$ is irreducible, each of the first $n-1$ columns of $L$ has at least one off-diagonal nonzero (prove?).
For each column $j < n$ of $L$, remove all the nonzeros in column $j$ except the first one below the diagonal. Let $L_t$ denote the remaining structure and consider the matrix $F_t = L_t + L_t^T$. The graph $G(F_t)$ is a tree called the elimination tree.

48 A first definition
The elimination tree of $A$ is a spanning tree of $G^+(A)$ satisfying the relation
$PARENT[j] = \min\{i > j : \ell_{ij} \neq 0\}$.
[Figure: a 10-node example on vertices a(1) through j(10) showing $G(A)$, the filled graph $G^+(A) = G(F)$, and the elimination tree $T(A)$.]
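
A sketch that reads the parents off an explicit factor using the relation above; computing the tree from $L$ is purely illustrative, since real codes derive it symbolically from $A$ in near-linear time:

```python
import numpy as np

def etree_from_L(L):
    """Elimination tree from an explicit Cholesky factor:
    parent[j] = min{ i > j : L[i, j] != 0 }, or -1 for a root."""
    n = L.shape[0]
    parent = np.full(n, -1)
    for j in range(n):
        below = np.nonzero(np.abs(L[j+1:, j]) > 1e-12)[0]
        if below.size:
            parent[j] = j + 1 + below[0]
    return parent

A = np.array([[4., 1., 0., 0.],
              [1., 4., 1., 0.],
              [0., 1., 4., 1.],
              [0., 0., 1., 4.]])
L = np.linalg.cholesky(A)
print(etree_from_L(L))   # [1 2 3 -1]: a chain, since A is tridiagonal
```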

49 A second definition: Represents column dependencies
Dependency between columns of $L$: column $i > j$ depends on column $j$ iff $\ell_{ij} \neq 0$.
- Use a directed graph to express this dependency (edge from $j$ to $i$ if column $i$ depends on column $j$).
- Simplify redundant dependencies (transitive reduction): remove a directed edge $(j, i)$ if there is a path of length greater than one from $j$ to $i$.
The transitive reduction of the directed filled graph gives the elimination tree structure.

50 Directed filled graph and its transitive reduction
[Figure: the directed filled graph on nodes a(1) through j(10), its transitive reduction, and the resulting elimination tree $T(A)$.]

51 A third definition: DFS tree
Theorem: the elimination tree $T(A)$ of a connected graph $G(A)$ is a depth-first search tree of the filled graph $G^+(A)$.
Proof: let $x_1, x_2, \ldots, x_n$ be the node ordering of $G^+(A)$. Consider the depth-first search subject to the following tie-breaking rule: when there is a choice of more than one node to explore next, always pick the one with the largest subscript. With this additional rule the depth-first search will construct $T(A)$.

52 A depth-first search tree of the filled graph
Any DFS on an undirected graph produces only tree and back edges.
[Figure: DFS of the filled graph on nodes a through j, annotated with discovery/finish times.]

53 Path characterization of filled edges
Because there is no cross edge in the DFS tree of an undirected graph, two nodes that belong to two distinct subtrees cannot be connected.
Nonzeros of $L$: if $\ell_{ij} \neq 0$, then node $x_i$ is an ancestor of $x_j$ in $T(A)$.
Some zeros of $L$: let $T[x_i]$ and $T[x_j]$ be two disjoint subtrees of $T(A)$. Then $\ell_{st} = 0$ for any $x_s \in T[x_i]$ and $x_t \in T[x_j]$.

54 Fill-in entries
Fill-path theorem [Rose, Tarjan, Lueker 76]: Let $G = (V, E)$ be an ordered graph. Then $(v, w)$ is an edge of $G^+ = (V, E \cup F(G))$ iff there exists a path $\mu = [v = v_1, v_2, \ldots, v_{k+1} = w]$ in $G$ such that $v_i < \min\{v, w\}$ for $2 \le i \le k$.
Restating using the elimination tree: let $i > j$. Then $\ell_{ij} \neq 0$ iff there exists a path $x_i, x_{p_1}, \ldots, x_{p_k}, x_j$ in the graph of $A$ such that $\{x_{p_1}, \ldots, x_{p_k}\} \subseteq T[x_j]$. Here $T[x_j]$ is the set of nodes in the subtree rooted at $x_j$.

55 Uses of the elimination tree
The elimination tree has several uses in the factorization of a sparse matrix:
- it expresses the order in which variables can be eliminated: because the elimination of a variable only affects the elimination of its ancestors, any topological order of the elimination tree will lead to a correct result and to the same fill-in
- it expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel
- it can be used to characterize the structure of the factors: elimination graphs can obviously be used to determine the structure and size of the factors, but this results in a complexity proportional to the number of nonzeros in the factors. Instead, the elimination tree can be used to compute the row and column counts of the factors with a cost proportional to $O(nnz(A))$ for big enough matrices

56 The analysis phase
The determination of the structure of the factors is commonly referred to as symbolic factorization because it only involves symbolic computations and no numerical operations. In modern software packages, the symbolic factorization is commonly done in a preprocessing phase called the analysis phase. The analysis phase is essential for the actual factorization of a matrix and may include many other (symbolic) operations, as we will see later. Once the analysis phase is complete, the actual matrix factorization can take place.

57 Matrix factorization

58 Cholesky on a dense matrix
left-looking Cholesky
for k = 1,...,n do
  for i = k,...,n do
    for j = 1,...,k-1 do
      $a_{ik}^{(k)} = a_{ik}^{(k)} - l_{ij} l_{kj}$
    end for
  end for
  $l_{kk} = \sqrt{a_{kk}^{(k-1)}}$
  for i = k+1,...,n do
    $l_{ik} = a_{ik}^{(k-1)} / l_{kk}$
  end for
end for

right-looking Cholesky
for k = 1,...,n do
  $l_{kk} = \sqrt{a_{kk}^{(k-1)}}$
  for i = k+1,...,n do
    $l_{ik} = a_{ik}^{(k-1)} / l_{kk}$
    for j = k+1,...,i do
      $a_{ij}^{(k)} = a_{ij}^{(k-1)} - l_{ik} l_{jk}$
    end for
  end for
end for

In the left-looking variant, column k is modified using the previously computed columns before it is factored; in the right-looking variant, column k, once computed, immediately updates the columns to its right.
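
A NumPy sketch of the left-looking variant, for comparison with the right-looking one shown earlier (dense, for illustration):

```python
import numpy as np

def cholesky_left_looking(A):
    """Left-looking Cholesky sketch: column k is updated with all the
    previously computed columns just before it is factored."""
    n = A.shape[0]
    L = np.tril(A.astype(float))
    for k in range(n):
        for j in range(k):                     # "look left": apply the
            L[k:, k] -= L[k:, j] * L[k, j]     # updates from column j
        L[k, k] = np.sqrt(L[k, k])
        L[k+1:, k] /= L[k, k]
    return L

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
L = cholesky_left_looking(A)
print(np.allclose(L @ L.T, A))   # True
```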

59 Cholesky on a sparse matrix
The Cholesky factorization of a sparse matrix can be achieved with a left-looking, right-looking or multifrontal method.
Reference case: a regular 3x3 grid ordered by nested dissection. Nodes in the separators are ordered last (see the section on orderings).
Notation:
- cdiv(j): divide column j by a scalar
- cmod(j,k): update column j with column k, k < j
- struct(L(1:k,j)): the structure of the L(1:k,j) submatrix

60 Sparse left-looking Cholesky
left-looking
for j = 1 to n do
  for k in struct(L(j,1:j-1)) do
    cmod(j,k)
  end for
  cdiv(j)
end for
In the left-looking method, before variable j is eliminated, column j is updated with all the columns that have a nonzero on row j. In the example above, struct(L(7,1:6)) = {1, 3, 4, 6}. This corresponds to receiving updates from nodes lower in the subtree rooted at j. The filled graph is necessary to determine the structure of each row.

61 Sparse right-looking Cholesky
right-looking
for k = 1 to n do
  cdiv(k)
  for j in struct(L(k+1:n,k)) do
    cmod(j,k)
  end for
end for
In the right-looking method, after variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9,3)) = {7, 8, 9}. This corresponds to sending updates to nodes higher in the elimination tree. The filled graph is necessary to determine the structure of each column.

62 The Multifrontal method
Take as an example a simple 3x3 sparse matrix where no fill-in is generated:
$A = \begin{pmatrix} a_{11} & & a_{13} \\ & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$
Its factorization can be achieved in three simple steps. The right-looking approach results in:
Step 1: $l_{11} = \sqrt{a_{11}}$, $l_{31} = a_{31}/l_{11}$, $a'_{33} = a_{33} - l_{31} l_{31}$
Step 2: $l_{22} = \sqrt{a_{22}}$, $l_{32} = a_{32}/l_{22}$, $a''_{33} = a'_{33} - l_{32} l_{32}$
Step 3: $l_{33} = \sqrt{a''_{33}}$
These computations are inefficient: heavy use of indirect addressing, no vectorization nor cache reuse.

63 BLAS operations
High efficiency can be achieved if the computations on a sparse matrix can be rearranged as operations on dense matrices/blocks. This allows the use of efficient BLAS routines:
- Level-1 BLAS: vector-vector operations like inner product or vector sum. O(n) operations are performed on O(n) data. Vectorizable but limited by bus speed.
- Level-2 BLAS: matrix-vector operations like the matrix-vector product. O(n^2) operations are performed on O(n^2) data. Vectorizable but limited by bus speed.
- Level-3 BLAS: matrix-matrix operations like the matrix-matrix product or rank-k update. O(n^3) operations are performed on O(n^2) data. Vectorizable and very efficient thanks to good exploitation of the memory hierarchy.

64 The Multifrontal Method
REMEMBER: each time a pivot is eliminated, a clique is formed in the graph. A clique is a set of fully connected nodes, i.e., a graph associated to a dense submatrix. The nonzero values concerned by an elimination step can thus be stored in a dense matrix, and the operations can be carried out by means of BLAS operations.

65 The Multifrontal Method
Thanks to the associativity of addition, the three steps before can be rewritten as:
Step 1: $\begin{pmatrix} a_{11} & a_{13} \\ a_{31} & 0 \end{pmatrix} \rightarrow \begin{pmatrix} l_{11} \\ l_{31} \end{pmatrix},\ b = -l_{31} l_{31}$
Step 2: $\begin{pmatrix} a_{22} & a_{23} \\ a_{32} & 0 \end{pmatrix} \rightarrow \begin{pmatrix} l_{22} \\ l_{32} \end{pmatrix},\ c = -l_{32} l_{32}$
Step 3: $l_{33} = \sqrt{a_{33} + b + c}$

66 The Multifrontal Method
In the general case b and c are dense submatrices (Schur complements), called contribution blocks, which will be assembled in some sophisticated way into other dense matrices:
$\begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ f_{21} & f_{22} & \cdots & f_{2n} \\ \vdots & & & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nn} \end{pmatrix} \rightarrow \begin{pmatrix} l_{11} & & & \\ l_{21} & cb_{22} & \cdots & cb_{2n} \\ \vdots & & & \vdots \\ l_{n1} & cb_{n2} & \cdots & cb_{nn} \end{pmatrix}$
where $l_{11} = \sqrt{f_{11}}$, $\begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix} = \begin{pmatrix} f_{21} \\ \vdots \\ f_{n1} \end{pmatrix} / l_{11}$, and
$\begin{pmatrix} cb_{22} & \cdots & cb_{2n} \\ \vdots & & \vdots \\ cb_{n2} & \cdots & cb_{nn} \end{pmatrix} = \begin{pmatrix} f_{22} & \cdots & f_{2n} \\ \vdots & & \vdots \\ f_{n2} & \cdots & f_{nn} \end{pmatrix} - \begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix} \begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix}^T$

67 The Multifrontal Method
The elimination tree can be regarded as a graph of dependencies which defines where/how to assemble the contribution blocks and which variable to eliminate at each step.

68 The Multifrontal Method
Front associated with variable 4:
$\begin{pmatrix} a_{44} & a_{46} & a_{47} \\ a_{64} & & \\ a_{74} & & \end{pmatrix} \rightarrow \begin{pmatrix} l_{44} & & \\ l_{64} & b_{66} & b_{67} \\ l_{74} & b_{76} & b_{77} \end{pmatrix}$

69 The Multifrontal Method
Front associated with variable 5:
$\begin{pmatrix} a_{55} & a_{56} & a_{59} \\ a_{65} & & \\ a_{95} & & \end{pmatrix} \rightarrow \begin{pmatrix} l_{55} & & \\ l_{65} & c_{66} & c_{69} \\ l_{95} & c_{96} & c_{99} \end{pmatrix}$

70 The Multifrontal Method
Front associated with variable 6, which assembles the contribution blocks b and c produced by its children:
$\begin{pmatrix} a_{66} & 0 & a_{68} & a_{69} \\ 0 & & & \\ a_{86} & & & \\ a_{96} & & & \end{pmatrix} + \begin{pmatrix} b_{66} & b_{67} & & \\ b_{76} & b_{77} & & \\ & & & \\ & & & \end{pmatrix} + \begin{pmatrix} c_{66} & & & c_{69} \\ & & & \\ & & & \\ c_{96} & & & c_{99} \end{pmatrix} \rightarrow \begin{pmatrix} l_{66} & & & \\ l_{76} & d_{77} & d_{78} & d_{79} \\ l_{86} & d_{87} & d_{88} & d_{89} \\ l_{96} & d_{97} & d_{98} & d_{99} \end{pmatrix}$

71 The Multifrontal Method
A dense matrix, called the frontal matrix, is associated with each node of the elimination tree. The multifrontal method consists in a bottom-up traversal of the tree where at each node two operations are done:
- Assembly: nonzeros from the original matrix are assembled together with the contribution blocks from the children nodes into the frontal matrix.
- Elimination: a partial factorization of the frontal matrix is done. The variables associated with the node of the tree (called fully assembled) can be eliminated. This step produces part of the final factors and a Schur complement (contribution block) that will be assembled into the parent node's frontal matrix.
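
The elimination step on a front can be sketched as a partial dense factorization (a hypothetical helper; npiv is the number of fully assembled variables):

```python
import numpy as np

def partial_factor(F, npiv):
    """Eliminate the first npiv (fully assembled) variables of a frontal
    matrix F: returns the factor columns and the contribution block."""
    F = F.astype(float).copy()
    for k in range(npiv):
        F[k, k] = np.sqrt(F[k, k])
        F[k+1:, k] /= F[k, k]
        F[k+1:, k+1:] -= np.outer(F[k+1:, k], F[k+1:, k])
    return np.tril(F[:, :npiv]), F[npiv:, npiv:]   # L-part, Schur complement

# Front of order 3 with one fully assembled variable.
F = np.array([[4., 2., 2.], [2., 1., 0.], [2., 0., 3.]])
Lpart, cb = partial_factor(F, 1)
print(Lpart.ravel())   # [2. 1. 1.]: one column of the factor
print(cb)              # [[ 0. -1.] [-1.  2.]]: contribution block
```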

72 The Multifrontal Method: example

73 Solve

74 Solve
Once the matrix is factorized, the problem can be solved against one or more right-hand sides:
$AX = LL^T X = B, \quad A, L \in \mathbb{R}^{n \times n},\ X \in \mathbb{R}^{n \times k},\ B \in \mathbb{R}^{n \times k}$
The solution of this problem can be achieved in two steps:
- forward substitution: $LZ = B$
- backward substitution: $L^T X = Z$
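
A small sketch of the two substitution steps (dense and unblocked, for illustration):

```python
import numpy as np

def forward_sub(L, b):
    """Solve L z = b, L lower triangular."""
    n = L.shape[0]
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def backward_sub(L, z):
    """Solve L.T x = z, traversing from the bottom up."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - L[i+1:, i] @ x[i+1:]) / L[i, i]
    return x

A = np.array([[4., 2., 2.], [2., 5., 3.], [2., 3., 6.]])
b = np.array([1., 2., 3.])
L = np.linalg.cholesky(A)
x = backward_sub(L, forward_sub(L, b))
print(np.allclose(A @ x, b))   # True
```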

75 Solve: left-looking

76 Solve: right-looking

77 Direct solvers: resume
The solution of a sparse linear system can be achieved in three phases that include at least the following operations:
- Analysis: symbolic factorization, elimination tree computation (see note below)
- Factorization: the actual matrix factorization
- Solve: forward substitution, backward substitution
These phases, especially the analysis, can include many other operations. Some of them are presented next.
Note: the elimination tree is strictly needed only for the multifrontal method, but in every case it can always be used to compute the symbolic factorization in a cheap way.

78 Orderings

79 A sequence of elimination graphs
The nodes don't have to be eliminated in the natural order:
[Figure: elimination graphs and the corresponding reduced matrices when the nodes are eliminated in the order 4, 2, 3, 1, 5, 6, producing factor entries $l_{44}, l_{24}, l_{22}, l_{32}, l_{33}, l_{12}, l_{13}, l_{11}, l_{53}, l_{51}, l_{55}, l_{61}, l_{65}, l_{66}$.]

80 Elimination process: ordered graphs
For an undirected graph $G = (V, E)$ with $|V| = n$, an ordering $\sigma$ of $V$ is a bijection $\sigma: \{1, \ldots, n\} \leftrightarrow V$.
For an ordered graph $G_\sigma = (V, E, \sigma)$, the elimination process
$P(G_\sigma) = [G = G_0, G_1, G_2, \ldots, G_{n-1}]$
is the sequence of elimination graphs defined by $G_0 = G$, $G_i = (G_{i-1})_{\sigma(i)}$.
Let $G_i = (V_i, E_i)$ for $i = 0, 1, \ldots, n-1$. The fill-in $F(G_\sigma)$ is defined by
$F(G_\sigma) = \bigcup_{i=1}^{n-1} \Delta_i$, where $\Delta_i = D(\sigma(i))$ in $G_{i-1}$,
and the filled graph is defined by $G_\sigma^+ = (V, E \cup F(G_\sigma))$.
For a matrix $A$, $\Delta_i$ corresponds to the new nonzero elements, the fill-ins, created during the $i$-th step of elimination.

81 Elimination process
Fill-path theorem for ordered graphs: Let $G_\sigma = (V, E, \sigma)$ be an ordered graph. Then $(v, w)$ is an edge of $G_\sigma^+ = (V, E \cup F(G_\sigma))$ iff there exists a path $\mu = [v = v_1, v_2, \ldots, v_{k+1} = w]$ in $G$ such that $\sigma^{-1}(v_i) < \min\{\sigma^{-1}(v), \sigma^{-1}(w)\}$ for $2 \le i \le k$.
[Example: $\sigma = \{4, 2, 3, 1, 5, 6\}$, with filled graph $G_\sigma^+(A) = G(F)$, $F = L + L^T$.]

82 Elimination process: Formal definitions
Given a graph $G = (V, E)$, an ordering $\sigma$ of $V$ is a perfect elimination ordering of $G$ if $F(G_\sigma) = \emptyset$.
The ordering $\sigma$ is a perfect elimination ordering if $w \in adj(v)$, $x \in adj(v)$, and $\sigma^{-1}(v) < \min\{\sigma^{-1}(w), \sigma^{-1}(x)\}$ in $G_\sigma$ imply either $(w, x) \in E$ or $w = x$. In other words, when $v$ is to be eliminated (both $w$ and $x$ not eliminated yet), there is an edge $(w, x)$.
A graph which has a perfect elimination ordering is a perfect elimination graph. Any filled graph $G_\sigma^+$ is a perfect elimination graph, since $\sigma$ is a perfect elimination ordering for it.

83 Fill-reducing ordering methods
Three main classes of methods for minimizing fill-in during factorization:
- Local approaches: at each step of the factorization, select the pivot that is likely to minimize fill-in. The method is characterized by the way pivots are selected: Markowitz criterion (for a general matrix), minimum degree (for symmetric matrices).
- Global approaches: the matrix is permuted so as to confine the fill-in within certain parts of the permuted matrix: Cuthill-McKee, reverse Cuthill-McKee, nested dissection.
- Hybrid approaches: first permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics.

84 Local heuristics to reduce fill-in during factorization
Let $G(V, E)$ be the graph associated with a matrix $A$ that we want to order using local heuristics, and let Metric be such that $Metric(v_i) < Metric(v_j)$ implies $v_i$ is better than $v_j$:
1: $G^0 \leftarrow (V, E)$ undirected graph of $A$
2: for i = 1 : n-1 do
3:   let k be a vertex that minimizes the metric
4:   $V \leftarrow V - \{k\}$  {remove vertex k}
5:   $E \leftarrow E - \{(k, \ell) : \ell \in adj(k)\} \cup \{(x, y) : x \in adj(k)$ and $y \in adj(k)\}$
6:   update $Metric(v_j)$ for all non-selected nodes $v_j$
7: end for
Step 6 should only be applied to nodes for which the metric value might have changed.

85 Reordering unsymmetric matrices: Markowitz criterion
At step k of Gaussian elimination, let
$r_i^{(k)}$ = number of nonzeros in row i of $A^{(k)}$
$c_j^{(k)}$ = number of nonzeros in column j of $A^{(k)}$
Markowitz criterion: the candidate pivot $a_{ij}$ should minimize $(r_i^{(k)} - 1)(c_j^{(k)} - 1)$ over all $i, j \ge k$.
Minimum degree, i.e., the Markowitz criterion for symmetric matrices: the candidate (diagonal) pivot $a_{jj}$ should minimize $(c_j^{(k)} - 1)$ over all $j \ge k$.

86 Minimum degree algorithm
Step 1: select the vertex that possesses the smallest number of neighbors in $G^0$.
[Figure: (a) sparse symmetric matrix, (b) elimination graph. The node/variable selected is 1, of degree 2.]

87 Illustration
Step 1: elimination of the pivot.
[Figure: (a) elimination graph, (b) factors and active submatrix, distinguishing initial nonzeros, nonzeros in the factors, and fill-in.]

88 Minimum degree algorithm based on elimination graphs
$\forall i \in [1, n]$: $d_i = |adj_{G^0}(i)|$
for k = 1 to n do
  $p = \arg\min_{i \in V^{k-1}} (d_i)$
  for each $i \in adj_{G^{k-1}}(p)$ do
    $adj_{G^k}(i) = (adj_{G^{k-1}}(i) \cup adj_{G^{k-1}}(p)) \setminus \{i, p\}$
    $d_i = |adj_{G^k}(i)|$
  end for
  $V^k = V^{k-1} \setminus \{p\}$
end for
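
A compact sketch of this basic loop on dict-of-sets elimination graphs (illustration only; production codes use quotient graphs and approximate degrees, as discussed next):

```python
def minimum_degree_order(adj):
    """Basic minimum degree on elimination graphs: repeatedly eliminate
    a vertex of smallest degree and turn its neighborhood into a clique."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while adj:
        p = min(adj, key=lambda v: (len(adj[v]), v))   # smallest degree
        nbrs = adj.pop(p)
        for x in nbrs:
            adj[x].discard(p)
            adj[x] |= nbrs - {x}     # merge: the neighborhood becomes a clique
        order.append(p)
    return order

# Small grid-like example: the degree-2 vertices go first.
adj = {1: {2, 4}, 2: {1, 3, 5}, 3: {2, 6}, 4: {1, 5}, 5: {2, 4, 6}, 6: {3, 5}}
print(minimum_degree_order(adj))   # [1, 3, 4, 2, 5, 6]
```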

89 Illustration (cont'd)
[Figure: graphs $G^1, G^2, G^3$ and the corresponding reduced matrices: (a) elimination graphs, (b) factors and active submatrices, distinguishing original nonzeros, modified original nonzeros, fill-in, and nonzeros in the factors.]

90 Minimum degree algorithm
Minimum degree does not always minimize fill-in! Consider the following matrix. Remark: using the initial ordering, there is no fill-in.
Step 1 of minimum degree: select pivot 5 (minimum degree = 2). In the updated graph the edge (4,6) is added, i.e., fill-in.
[Figure: the matrix, its elimination graph, and the updated graph after eliminating node 5.]

91 Efficient implementation of minimum degree
Reduce time complexity:
1. Accelerate the selection of pivots and the update of the graph:
1.1 Supervariables (or indistinguishable nodes): if several variables have the same adjacency structure in $G^k$, they can be eliminated simultaneously.
1.2 Two non-adjacent nodes of the same degree can be eliminated simultaneously (multiple eliminations).
1.3 The degree update of the neighbours of the pivot can be done in an approximate way (Approximate Minimum Degree).

92 Efficient implementation of minimum degree
Reduce memory complexity:
2. Decrease the size of the working space. Using the elimination graph, the working space is of order O(nnz(L)).
Fill-in: let pivot be the pivot at step k. If $i \in Adj_{G^{k-1}}(pivot)$ then $Adj_{G^{k-1}}(pivot) \subseteq Adj_{G^k}(i)$: the structure of the pivot column is included in the filled structure of column i. We can then use an implicit representation of fill-in by defining the notion of element (a variable already eliminated) and the quotient graph. A variable of the quotient graph is adjacent to variables and elements. One can show that for all $k \in [1 \ldots n]$, the size of the quotient graph is O(nnz(A)).

93 Influence on the structure of factors
Harwell-Boeing matrix: dwt_592.rua, structural computing on a submarine. NZ(LU factors) = 58202.
[Figure: the matrix pattern and the pattern of its factors.]

94 Influence on the structure of factors
Structure of the factors after a minimum degree (MMD) permutation.
[Figure: permuted matrix and its factors.]
Detection of supervariables allows one to build more regularly structured factors (easier factorization).

95 Comparison of 3 implementations of minimum degree
V0 is the initial algorithm (based on the elimination graph); MMD is V0 + 1.2 + 2 (Multiple Minimum Degree, Liu, 85, 89); AMD is V0 + 1.3 + 2 (Approximate Minimum Degree, Amestoy et al., 95).
[Table: execution times on a SunSparc 10 for matrices dwt, Wang, Orani; minimum memory sizes 250KB (V0), 110KB (MMD), 110KB (AMD).]
The fill-in is similar. Memory space for MMD and AMD: 2 NZ integers. V0 was not able to perform the reordering for the 2 last matrices (lack of memory after 2 hours of computation).

96 Minimum fill based algorithm
$Metric(v_i)$ is the amount of fill-in that $v_i$ would introduce if it were selected as a pivot. This corresponds to the cardinality of the deficiency of the vertex.
Illustration: r has degree d = 4 and a fill-in metric of d(d-1)/2 = 6, whereas s has degree d = 5 but a fill-in metric of d(d-1)/2 - 9 = 1.
[Figure: vertices r and s with neighborhoods j1..j4 and i1..i5.]

97 Minimum fill-in properties
The situation typically occurs when $\{i_1, i_2, i_3\}$ and $\{i_2, i_3, i_4, i_5\}$ were adjacent to two already selected nodes (here $e_2$ and $e_1$).
[Figure: vertices r, s, i1..i5, j1..j4; e1 and e2 are previously selected nodes.]
The elimination of a node $v_k$ affects the degree of the nodes adjacent to $v_k$. The fill-in metric of $adj(adj(v_k))$ is also affected. Illustration: selecting r affects the fill-in metric of $i_1$ (because of the fill edge $(j_3, j_4)$).

98 How to compute the fill-in metrics
Only nodes adjacent to the current pivot are updated, and only approximated metrics (using clique structures) are computed.
Let $d_k$ be the degree of node k; $d_k(d_k - 1)/2$ is an upper bound of the fill (for s: $d_s = 5 \rightarrow d_s(d_s-1)/2 = 10$). Several possibilities:
1. Deduct the clique area of the last selected pivot adjacent to k (for s: the clique of $e_2$).
2. Deduct the largest clique area among all adjacent selected pivots (for s: the clique of $e_1$).
3. If for $d_k$ we use AMD instead, then the cliques of all adjacent selected pivots can be deducted.

99 CM and RCM: Definitions
Bandwidth: a structurally symmetric matrix A is said to have bandwidth 2m + 1 if m is the smallest integer such that $a_{ij} = 0$ whenever $|i - j| > m$. If no interchanges are performed during elimination, fill-in occurs only within the band.
Profile: define a bandwidth for each row i: m(i) is the smallest integer such that $a_{ij} = 0$ whenever $i - j > m(i)$, for $j < i$. The profile of a symmetric matrix is $\sum_i m(i)$. If no interchanges are performed during elimination, no fill-in occurs ahead of the first entry in each row.
Block tridiagonal form: nonzeros are in the diagonal blocks or in a block just above or just below the diagonal.

100 CM: Algorithm
Level sets are built from a vertex of minimum degree. At any level, priority is given to a vertex with a smaller number of neighbors.
pick a vertex v and order it as the first vertex
S ← {v}
while S ≠ V do
  S' ← all vertices in V \ S which are adjacent to S
  order vertices in S' in increasing order of degree
  S ← S ∪ S'
end while
(example from Duff, Erisman, Reid)

101 CM vs RCM
RCM: simply reverse the order found by the CM algorithm. It does not change the bandwidth but improves the storage requirements. (Liu and Sherman, 76)
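
SciPy ships an RCM implementation (scipy.sparse.csgraph.reverse_cuthill_mckee); a small sketch that scrambles a banded pattern and recovers a small bandwidth:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# A tridiagonal pattern scrambled by a random symmetric permutation,
# then recovered with RCM to shrink the bandwidth back.
n = 12
band = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
p = np.random.default_rng(0).permutation(n)
A = csr_matrix(band[np.ix_(p, p)])

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm, :][:, perm]

def bandwidth(M):
    i, j = M.nonzero()
    return int(np.max(np.abs(i - j)))

print(bandwidth(A), "->", bandwidth(B))   # the bandwidth shrinks back to 1
```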

102 Illustration: Reverse Cuthill-McKee on matrix dwt_592
Harwell-Boeing matrix: dwt_592, structural computing on a submarine. NZ(LU factors) = 58202.
[Figure: original matrix and factorized matrix patterns.]

103 Illustration: Reverse Cuthill-McKee on matrix dwt_592
NZ(LU factors) = 16924.
[Figure: permuted matrix (RCM) and factorized permuted matrix patterns.]

104 Nested dissection
Fill-path theorem [Rose, Tarjan, Lueker 76]: Let $G = (V, E)$ be an ordered graph. Then $(v, w)$ is an edge of $G^+ = (V, E \cup F(G))$ iff there exists a path $\mu = [v = v_1, v_2, \ldots, v_{k+1} = w]$ in $G$ such that $v_i < \min\{v, w\}$ for $2 \le i \le k$.
All the paths connecting one subdomain to the other are contained in the separator (bisector). Thus, there cannot be any $\ell_{ij}$ where $v_i \in D_1$ and $v_j \in D_2$.

105 ND of a regular square mesh
The nested dissection method aims at partitioning the domain so that fill-in is only generated internally within each subdomain and on the interfaces, by recursively computing bisectors. The nested dissection method also produces an elimination tree.
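
The recursive principle can be sketched on the simplest possible mesh, a 1D path, where the middle vertex is the bisector and is numbered after the two halves (illustration only):

```python
def nested_dissection_1d(lo, hi, order):
    """Recursive ND sketch on a path graph lo..hi: the middle vertex is
    the separator and is numbered after both halves."""
    if lo > hi:
        return
    mid = (lo + hi) // 2
    nested_dissection_1d(lo, mid - 1, order)   # left subdomain first
    nested_dissection_1d(mid + 1, hi, order)   # then right subdomain
    order.append(mid)                          # separator last

order = []
nested_dissection_1d(0, 6, order)
print(order)   # [0, 2, 1, 4, 6, 5, 3]: vertex 3 (the top separator) is last
```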

106 Nested dissection
A matrix from the UFL sparse matrix collection, with 4045 nonzeros.
[Figure: the matrix and its factors under the natural ordering, RCM, and nested dissection; nnz(L_ND) = 4287.]

107 Orderings: comparison of methods
Millions of entries in the original matrix and in the factors:
[Table: matrices gupta, ship, twotone, wang, xenon; columns A, METIS, AMF, AMD.]
METIS (Karypis and Kumar) and SCOTCH (Pellegrini) are global strategies (recursive nested dissection based orderings). PORD (Schulze, Paderborn Univ.) is a recursive dissection based on a bottom-up strategy to build the separator. AMD (Amestoy, Davis and Duff) is a local strategy based on Approximate Minimum Degree. AMF (Amestoy) is a local strategy based on Approximate Minimum Fill.

108 Impact of fill-reducing heuristics
Number of operations (millions):
[Table: matrices gupta, ship, twotone, wang, xenon; columns METIS, SCOTCH, PORD, AMF, AMD.]
METIS (Karypis and Kumar) and SCOTCH (Pellegrini) are global strategies (recursive nested dissection based orderings). PORD (Schulze, Paderborn Univ.) is a recursive dissection based on a bottom-up strategy to build the separator. AMD (Amestoy, Davis and Duff) is a local strategy based on Approximate Minimum Degree. AMF (Amestoy) is a local strategy based on Approximate Minimum Fill.

109 Tree Amalgamation

110 Tree Amalgamation
The whole factorization is recast into a sequence of partial dense factorizations of the type:
$l_{11} = \sqrt{f_{11}}, \quad \begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix} = \begin{pmatrix} f_{21} \\ \vdots \\ f_{n1} \end{pmatrix} / l_{11}, \quad \begin{pmatrix} cb_{22} & \cdots & cb_{2n} \\ \vdots & & \vdots \\ cb_{n2} & \cdots & cb_{nn} \end{pmatrix} = \begin{pmatrix} f_{22} & \cdots & f_{2n} \\ \vdots & & \vdots \\ f_{n2} & \cdots & f_{nn} \end{pmatrix} - \begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix} \begin{pmatrix} l_{21} \\ \vdots \\ l_{n1} \end{pmatrix}^T$
These are still only Level-2 BLAS operations. How to get the efficiency of Level-3 BLAS?

111 Tree Amalgamation
Amalgamation without fill-in consists in merging all the frontal matrices related to pivots whose columns in the factor L have the same structure. The subset of nodes containing these pivots is called a supernode. All the pivots in a supernode can thus be eliminated at once within the same frontal matrix.

112 Tree Amalgamation
Amalgamation with fill-in is based on the same principle, except that it groups together pivots whose column structure in L is not exactly the same. If the generated fill-in does not exceed a certain threshold, the extra cost is overcome by the gained efficiency.

113 Tree Amalgamation
After amalgamation:
$A = \begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} \rightarrow \begin{pmatrix} L_{11} & \\ L_{21} & CB \end{pmatrix}$
$L_{11} L_{11}^T = A_{11}$ (Cholesky factorization)
$L_{21} = A_{21} L_{11}^{-T}$
$CB = A_{22} - L_{21} L_{21}^T$
All the operations related to the frontal matrix can then be done through Level-3 BLAS routines.
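
A sketch of the amalgamated (blocked) partial factorization using dense kernels (assuming SciPy; the matrix and block sizes are made up for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def partial_factor_blocked(F, npiv):
    """Blocked (supernodal) partial factorization: eliminate the first
    npiv pivots of a front with matrix-matrix (BLAS-3) kernels."""
    A11 = F[:npiv, :npiv]
    A21 = F[npiv:, :npiv]
    A22 = F[npiv:, npiv:]
    L11 = cholesky(A11, lower=True)                       # dense Cholesky
    L21 = solve_triangular(L11, A21.T, lower=True).T      # L21 = A21 L11^{-T}
    CB = A22 - L21 @ L21.T                                # Schur complement
    return L11, L21, CB

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
F = M @ M.T + 5 * np.eye(5)          # a generic SPD front
L11, L21, CB = partial_factor_blocked(F, 2)

# CB equals the trailing Schur complement of the full factorization:
L = np.linalg.cholesky(F)
print(np.allclose(CB, F[2:, 2:] - L[2:, :2] @ L[2:, :2].T))   # True
```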

114 Tree Amalgamation
Liu, Ng, Peyton 93: column j is the first node of a fundamental supernode if and only if node j has two or more children in the elimination tree, or j is a leaf of some row subtree of the elimination tree of A.

115 Tree traversal orders

116 Equivalent reorderings
A tree can be regarded as a dependency graph where each node produces data that is then processed by its parent. By this definition, a tree has to be traversed in topological order, but there are still many topological traversals of the same tree.
Definition: two orderings P and Q are equivalent if the structures of the filled graphs of $PAP^T$ and $QAQ^T$ are the same (that is, they are isomorphic). Equivalent orderings result in the same amount of fill-in and computation during the factorization.
To ease the notation, we discuss only one ordering with respect to A, i.e., P is an equivalent ordering of A if the filled graph of A and that of $PAP^T$ are isomorphic.

117 Equivalent reorderings
All topological orderings of T(A) are equivalent:
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then $G^+(PAP^T)$ and $G^+(A)$ are isomorphic.
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then the elimination trees $T(PAP^T)$ and $T(A)$ are isomorphic.
Because the fill-in won't change, we have the freedom to choose any specific topological order that will provide other properties.

118 Tree traversal orders
Which specific topological order should we choose? A postorder. Why? Because the data produced by nodes is consumed by their parents in LIFO order. In the multifrontal method, we can thus use a stack memory where contribution blocks are pushed as soon as they are produced by the elimination on a frontal matrix, and popped at the moment the parent node is assembled. This provides better data locality and makes the memory management easier.

119 The multifrontal method (Duff, Reid 83)
[Figure: a matrix A, its factors L+U with fill-in, and the elimination tree.]
Memory is divided into two parts (that can overlap in time): the factors, and the active memory (the active frontal matrix plus the stack of contribution blocks). The elimination tree represents the task dependencies.

120 Example 1: Processing a wide tree
[Figure: evolution of the memory (unused space, stack space, factor space, non-free space) while traversing a wide tree.]

121 Example 2: Processing a deep tree
[Figure: evolution of the memory while traversing a deep tree: allocation, assembly, factorization, and stacking steps; unused, factor, stack, and non-free memory space.]

122 Postorder traversals: memory
A postorder provides good data locality and better memory consumption than a general topological order, since parent nodes are assembled as soon as their children have been processed. But there are still many postorders of the same tree. Which one should we choose? The one that minimizes memory consumption.
[Figure: the same tree processed with the best postorder (abcdefghi) and the worst one (hfdbacegi); root i, internal nodes g, e, c, leaves a, b, d, f, h.]

123 Modelization of the problem
$M_i$: memory peak for the complete subtree rooted at i; $temp_i$: temporary memory produced by node i; $m_{parent}$: memory for storing the parent.
[Figure: a parent node with children producing $temp_1, temp_2, temp_3$ and subtree peaks $M_1, M_2, M_3$.]
$M_{parent} = \max\Big( \max_{j=1}^{nbchildren} \big(M_j + \sum_{k=1}^{j-1} temp_k\big),\; m_{parent} + \sum_{j=1}^{nbchildren} temp_j \Big) \quad (1)$

124 Modelization of the problem
Objective: order the children so as to minimize $M_{parent}$ in Formula (1).

125 Memory-minimizing schedules
Theorem [Liu, 86]: the minimum of $\max_j (x_j + \sum_{i=1}^{j-1} y_i)$ is obtained when the sequence $(x_i, y_i)$ is sorted in decreasing order of $x_i - y_i$.
Corollary: an optimal child sequence is obtained by rearranging the children nodes in decreasing order of $M_i - temp_i$.
Interpretation: at each level of the tree, a child with a relatively large peak of memory in its subtree ($M_i$ large with respect to $temp_i$) should be processed first.
⇒ Apply on the complete tree starting from the leaves (or from the root with a recursive approach).

126 Optimal tree reordering
Objective: minimize the peak of stack memory.
Tree_Reorder(T):
  for all i in the set of root nodes do
    Process_Node(i)
  end for
Process_Node(i):
  if i is a leaf then
    $M_i = m_i$
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    reorder the children of i in decreasing order of $(M_j - temp_j)$
    compute $M_i$ at node i using Formula (1)
  end if
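
A recursive sketch of this reordering (hypothetical helper names; it returns the peak of Formula (1) after sorting the children by $M_j - temp_j$):

```python
def reorder_tree(children, m, temp, root):
    """Liu's reordering: process children in decreasing order of
    M_j - temp_j and return the resulting peak M_root (Formula (1))."""
    kids = children.get(root, [])
    if not kids:
        return m[root]
    M = {j: reorder_tree(children, m, temp, j) for j in kids}
    kids.sort(key=lambda j: M[j] - temp[j], reverse=True)
    peak, stacked = 0.0, 0.0
    for j in kids:                        # children processed in sequence:
        peak = max(peak, stacked + M[j])  # peak of child j on top of the
        stacked += temp[j]                # contribution blocks stacked so far
    return max(peak, m[root] + stacked)   # assembling the parent front

children = {0: [1, 2]}                    # root 0 with two leaf children
m = {0: 4.0, 1: 6.0, 2: 3.0}              # front sizes
temp = {1: 1.0, 2: 2.0}                   # contribution block sizes
print(reorder_tree(children, m, temp, 0))  # 7.0: child 1 (M - temp = 5) first
```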

127 Parallelism

128 Parallelization: two levels of parallelism
- Tree parallelism: arising from sparsity, it is formalized by the fact that nodes in separate subtrees of the elimination tree can be eliminated at the same time.
- Node parallelism: within each node, a parallel dense LU factorization (BLAS).
[Figure: fronts along the tree, with decreasing tree parallelism and increasing node parallelism towards the root.]

129 Exploiting the second level of parallelism is crucial
Multifrontal factorization of matrix BCSSTK15, MFlops (speed-up):

  Computer          (1) tree only    (2) both levels
  Alliant FX/            (1.9)          34 (4.3)
  IBM 3090J/6VF          (2.1)         227 (3.8)
  CRAY                   (1.8)         404 (2.3)
  CRAY Y-MP              (2.3)        1119 (4.8)

In column (1), we exploit only parallelism from the tree. In column (2), we combine the two levels of parallelism.

130 Other features
Dynamic management of parallelism:
- a pool of tasks for exploiting the two levels of parallelism
- assembly operations are also parallel (but involve indirect addressing)
Dynamic management of data:
- storage of the LU factors, frontal and contribution matrices
- the amount of memory available may conflict with exploiting maximum parallelism

131 Task mapping and scheduling
Assign tasks to processors to achieve a goal: makespan minimization, memory minimization, ... There are many approaches:
- static: build the schedule before the execution and follow it at run-time. Advantage: very efficient, since it has a global view of the system. Drawback: requires a very good modelization of the platform.
- dynamic: take scheduling decisions dynamically at run-time. Advantage: reactive to the evolution of the platform and easy to use on several platforms. Drawback: decisions are taken with local criteria (a decision which seems good at time t can have very bad consequences at time t+1).

132 Influence of scheduling on the makespan Objective: Assign processes/tasks to processors so that the completion time, also called the makespan is minimized. (We may also say that we minimize the maximum total processing time on any processor.)

133 Task scheduling on shared memory computers
The data can be shared between processors without any communication. Dynamic scheduling of the tasks (pool of ready tasks): each processor selects a task, and the order can influence the performance.
[Figure: an example of a topological ordering that is good with respect to time but not so good in terms of working memory.]

134 Static scheduling: proportional mapping
Main objective: reduce the volume of communication between processors. Recursively partition the processors equally between the children of a given node; initially all processors are assigned to the root node. Good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.
[Figure: mapping of the tasks of a tree onto 5 processors: the root gets {1,2,3,4,5}, its children {1,2,3} and {4,5}, and so on.]
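
A toy sketch of proportional mapping (all names and the rounding rule are assumptions; real schedulers refine this with work estimates and relaxations):

```python
def proportional_mapping(children, work, procs, node, mapping):
    """Proportional mapping: split the processor set among the children
    of a node proportionally to the work in their subtrees."""
    mapping[node] = procs
    kids = children.get(node, [])
    if not kids:
        return
    subtree = lambda v: work[v] + sum(subtree(c) for c in children.get(v, []))
    total = sum(subtree(c) for c in kids)
    start = 0
    for c in kids:
        share = max(1, round(len(procs) * subtree(c) / total))
        end = min(len(procs), start + share)
        proportional_mapping(children, work, procs[start:end] or procs[-1:],
                             c, mapping)
        start = end

children = {0: [1, 2], 1: [3, 4]}
work = {0: 10, 1: 6, 2: 4, 3: 3, 4: 3}
mapping = {}
proportional_mapping(children, work, list(range(5)), 0, mapping)
print(mapping)  # {0: [0,1,2,3,4], 1: [0,1,2,3], 3: [0,1], 4: [2,3], 2: [4]}
```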

135 Mapping of the tree onto the processors
Objective: find a layer $L_0$ such that the subtrees rooted at $L_0$ can be mapped onto the processors with a good balance.
Construction and mapping of the initial level $L_0$:
Let $L_0 \leftarrow$ roots of the assembly tree
repeat
  (Step A) find the node q in $L_0$ whose subtree has the largest computational cost
  (Step B) set $L_0 \leftarrow (L_0 \setminus \{q\}) \cup \{$children of q$\}$
  (Step C) greedily map the nodes of $L_0$ onto the processors and estimate the load unbalance
until load unbalance < threshold

136 Decomposition of the tree into levels
The level $L_0$ is determined based on subtree costs.
[Figure: a tree with the subtree roots forming $L_0$.]
The mapping of the top of the tree can be dynamic. This could be useful for both shared and distributed memory algorithms.

137 Assumptions and Notations
Assumptions: we assume that each column of L / each node of the tree is assigned to a single processor. Each processor is in charge of computing cdiv(j) for the columns j that it owns.
Notation:
- mycols(p) is the set of columns owned by processor p
- map(j) gives the processor owning column j (or task j)
- procs(L(:,k)) = {map(j) : j ∈ struct(L(:,k))} (only the processors in procs(L(:,k)) require updates from column k; they correspond to ancestors of k in the tree)
- father(j) is the father of node j in the elimination tree

138 Computational strategies for parallel direct solvers
The parallel algorithm is characterized by its computational dependency graph and its communication graph. There are three classical approaches to distributed memory parallelism:
1. Fan-in: the fan-in algorithm is very similar to the left-looking approach and is demand-driven: the data required are aggregated update columns computed by the sending processor.
2. Fan-out: the fan-out algorithm is very similar to the right-looking approach and is data-driven: data is sent as soon as it is produced.
3. Multifrontal: the communication pattern follows a bottom-up traversal of the tree. Messages are contribution blocks and are sent to the processor mapped onto the father node.

139 Fan-in variant (similar to left-looking)
fan-in
for j = 1 to n
  u = 0
  for all k in (struct(L(j,1:j-1)) ∩ mycols(p))
    cmod(u,k)
  end for
  if map(j) != p
    send u to processor map(j)
  else
    incorporate u in column j
    receive all the updates on column j and incorporate them
    cdiv(j)
  end if
end for

140 Fan-in variant
[Figure: a subtree mapped onto processors P0, P1, P2, P3 with the parent task on P4; the following slides animate the computation and the communication of the aggregated updates towards P4.]


146 Fan-in variant
If $map(i) = P0$ for all children i and $map(father) \neq P0$, (only) one message is sent by P0: the fan-in variant exploits the data locality of the proportional mapping.
[Figure: all children mapped on P0, the father on another processor.]

147 Fan-out variant (similar to right-looking)
fan-out
for all leaf nodes j in mycols(p)
  cdiv(j)
  send column L(:,j) to procs(L(:,j))
  mycols(p) = mycols(p) - {j}
end for
while mycols(p) != ∅
  receive any column (say L(:,k))
  for j in struct(L(:,k)) ∩ mycols(p)
    cmod(j,k)
    if column j is completely updated
      cdiv(j)
      send column L(:,j) to procs(L(:,j))
      mycols(p) = mycols(p) - {j}
    end if
  end for
end while


More information

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS JIANLIN XIA Abstract. We propose multi-layer hierarchically semiseparable MHS structures for the fast factorizations of dense matrices arising from

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Preliminaries and Complexity Theory

Preliminaries and Complexity Theory Preliminaries and Complexity Theory Oleksandr Romanko CAS 746 - Advanced Topics in Combinatorial Optimization McMaster University, January 16, 2006 Introduction Book structure: 2 Part I Linear Algebra

More information

QR FACTORIZATIONS USING A RESTRICTED SET OF ROTATIONS

QR FACTORIZATIONS USING A RESTRICTED SET OF ROTATIONS QR FACTORIZATIONS USING A RESTRICTED SET OF ROTATIONS DIANNE P. O LEARY AND STEPHEN S. BULLOCK Dedicated to Alan George on the occasion of his 60th birthday Abstract. Any matrix A of dimension m n (m n)

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

LINEAR SYSTEMS (11) Intensive Computation

LINEAR SYSTEMS (11) Intensive Computation LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY

More information

Research Reports on Mathematical and Computing Sciences

Research Reports on Mathematical and Computing Sciences ISSN 1342-284 Research Reports on Mathematical and Computing Sciences Exploiting Sparsity in Linear and Nonlinear Matrix Inequalities via Positive Semidefinite Matrix Completion Sunyoung Kim, Masakazu

More information

Classical Complexity and Fixed-Parameter Tractability of Simultaneous Consecutive Ones Submatrix & Editing Problems

Classical Complexity and Fixed-Parameter Tractability of Simultaneous Consecutive Ones Submatrix & Editing Problems Classical Complexity and Fixed-Parameter Tractability of Simultaneous Consecutive Ones Submatrix & Editing Problems Rani M. R, Mohith Jagalmohanan, R. Subashini Binary matrices having simultaneous consecutive

More information

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)}

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} Preliminaries Graphs G = (V, E), V : set of vertices E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) 1 2 3 5 4 V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} 1 Directed Graph (Digraph)

More information

Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures

Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998 109 Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures Cong Fu, Xiangmin Jiao,

More information

AM205: Assignment 2. i=1

AM205: Assignment 2. i=1 AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although

More information

ENHANCING PERFORMANCE AND ROBUSTNESS OF ILU PRECONDITIONERS BY BLOCKING AND SELECTIVE TRANSPOSITION

ENHANCING PERFORMANCE AND ROBUSTNESS OF ILU PRECONDITIONERS BY BLOCKING AND SELECTIVE TRANSPOSITION SIAM J. SCI. COMPUT. Vol. 39, No., pp. A303 A332 c 207 Society for Industrial and Applied Mathematics ENHANCING PERFORMANCE AND ROBUSTNESS OF ILU PRECONDITIONERS BY BLOCKING AND SELECTIVE TRANSPOSITION

More information

A Review of Matrix Analysis

A Review of Matrix Analysis Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value

More information

SCALABLE HYBRID SPARSE LINEAR SOLVERS

SCALABLE HYBRID SPARSE LINEAR SOLVERS The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi

More information

An exploration of matrix equilibration

An exploration of matrix equilibration An exploration of matrix equilibration Paul Liu Abstract We review three algorithms that scale the innity-norm of each row and column in a matrix to. The rst algorithm applies to unsymmetric matrices,

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs Nathan Lindzey, Ross M. McConnell Colorado State University, Fort Collins CO 80521, USA Abstract. Tucker characterized

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

A Sparse QS-Decomposition for Large Sparse Linear System of Equations

A Sparse QS-Decomposition for Large Sparse Linear System of Equations A Sparse QS-Decomposition for Large Sparse Linear System of Equations Wujian Peng 1 and Biswa N. Datta 2 1 Department of Math, Zhaoqing University, Zhaoqing, China, douglas peng@yahoo.com 2 Department

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix EE507 - Computational Techniques for EE 7. LU factorization Jitkomut Songsiri factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization

More information

Algorithms for Fast Linear System Solving and Rank Profile Computation

Algorithms for Fast Linear System Solving and Rank Profile Computation Algorithms for Fast Linear System Solving and Rank Profile Computation by Shiyun Yang A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric

More information

FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS ON GENERAL MESHES

FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS ON GENERAL MESHES Proceedings of the Project Review, Geo-Mathematical Imaging Group Purdue University, West Lafayette IN, Vol. 1 2012 pp. 123-132. FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Multifrontal QR factorization in a multiprocessor. environment. Abstract

Multifrontal QR factorization in a multiprocessor. environment. Abstract Multifrontal QR factorization in a multiprocessor environment P. Amestoy 1, I.S. Du and C. Puglisi Abstract We describe the design and implementation of a parallel QR decomposition algorithm for a large

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Downloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see

Downloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see SIAMJ.MATRIX ANAL. APPL. c 1997 Society for Industrial and Applied Mathematics Vol. 18, No. 1, pp. 140 158, January 1997 012 AN UNSYMMETRIC-PATTERN MULTIFRONTAL METHOD FOR SPARSE LU FACTORIZATION TIMOTHY

More information

Computation of the mtx-vec product based on storage scheme on vector CPUs

Computation of the mtx-vec product based on storage scheme on vector CPUs BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on

More information

Downloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see

Downloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see SIAM J. MATRIX ANAL. APPL. Vol. 22, No. 4, pp. 997 1013 c 2001 Society for Industrial and Applied Mathematics MULTIPLE-RANK MODIFICATIONS OF A SPARSE CHOLESKY FACTORIZATION TIMOTHY A. DAVIS AND WILLIAM

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Fundamentals of Engineering Analysis (650163)

Fundamentals of Engineering Analysis (650163) Philadelphia University Faculty of Engineering Communications and Electronics Engineering Fundamentals of Engineering Analysis (6563) Part Dr. Omar R Daoud Matrices: Introduction DEFINITION A matrix is

More information

Numerical Methods - Numerical Linear Algebra

Numerical Methods - Numerical Linear Algebra Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear

More information

A sparse multifrontal solver using hierarchically semi-separable frontal matrices

A sparse multifrontal solver using hierarchically semi-separable frontal matrices A sparse multifrontal solver using hierarchically semi-separable frontal matrices Pieter Ghysels Lawrence Berkeley National Laboratory Joint work with: Xiaoye S. Li (LBNL), Artem Napov (ULB), François-Henry

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems

Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems Using Laplacian Eigenvalues and Eigenvectors in the Analysis of Frequency Assignment Problems Jan van den Heuvel and Snežana Pejić Department of Mathematics London School of Economics Houghton Street,

More information

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ . p.1/21 11 Dec. 2014, LJLL, Paris FreeFem++ workshop A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ Atsushi Suzuki Atsushi.Suzuki@ann.jussieu.fr Joint work with François-Xavier

More information

An approximate minimum degree algorithm for matrices with dense rows

An approximate minimum degree algorithm for matrices with dense rows RT/APO/08/02 An approximate minimum degree algorithm for matrices with dense rows P. R. Amestoy 1, H. S. Dollar 2, J. K. Reid 2, J. A. Scott 2 ABSTRACT We present a modified version of the approximate

More information

Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format

Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format Amestoy, Patrick and Buttari, Alfredo and L Excellent, Jean-Yves and Mary, Theo 2018 MIMS EPrint: 2018.12

More information

Direct Methods for Solving Linear Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le

Direct Methods for Solving Linear Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le Direct Methods for Solving Linear Systems Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le 1 Overview General Linear Systems Gaussian Elimination Triangular Systems The LU Factorization

More information

arxiv: v1 [cs.na] 20 Jul 2015

arxiv: v1 [cs.na] 20 Jul 2015 AN EFFICIENT SOLVER FOR SPARSE LINEAR SYSTEMS BASED ON RANK-STRUCTURED CHOLESKY FACTORIZATION JEFFREY N. CHADWICK AND DAVID S. BINDEL arxiv:1507.05593v1 [cs.na] 20 Jul 2015 Abstract. Direct factorization

More information

Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems

Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Technical Report RAL-TR-2005-007 Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Iain S. Duff and Stéphane Pralet April 22, 2005

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 208 LECTURE 3 need for pivoting we saw that under proper circumstances, we can write A LU where 0 0 0 u u 2 u n l 2 0 0 0 u 22 u 2n L l 3 l 32, U 0 0 0 l n l

More information

Undirected Graphs. V = { 1, 2, 3, 4, 5, 6, 7, 8 } E = { 1-2, 1-3, 2-3, 2-4, 2-5, 3-5, 3-7, 3-8, 4-5, 5-6 } n = 8 m = 11

Undirected Graphs. V = { 1, 2, 3, 4, 5, 6, 7, 8 } E = { 1-2, 1-3, 2-3, 2-4, 2-5, 3-5, 3-7, 3-8, 4-5, 5-6 } n = 8 m = 11 Undirected Graphs Undirected graph. G = (V, E) V = nodes. E = edges between pairs of nodes. Captures pairwise relationship between objects. Graph size parameters: n = V, m = E. V = {, 2, 3,,,, 7, 8 } E

More information

Pivoting. Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3

Pivoting. Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3 Pivoting Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3 In the previous discussions we have assumed that the LU factorization of A existed and the various versions could compute it in a stable manner.

More information

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods

Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley

More information

Directed Graphs (Digraphs) and Graphs

Directed Graphs (Digraphs) and Graphs Directed Graphs (Digraphs) and Graphs Definitions Graph ADT Traversal algorithms DFS Lecturer: Georgy Gimel farb COMPSCI 220 Algorithms and Data Structures 1 / 74 1 Basic definitions 2 Digraph Representation

More information

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE Sparse factorization using low rank submatrices Cleve Ashcraft LSTC cleve@lstc.com 21 MUMPS User Group Meeting April 15-16, 21 Toulouse, FRANCE ftp.lstc.com:outgoing/cleve/mumps1 Ashcraft.pdf 1 LSTC Livermore

More information

Parallel Scientific Computing

Parallel Scientific Computing IV-1 Parallel Scientific Computing Matrix-vector multiplication. Matrix-matrix multiplication. Direct method for solving a linear equation. Gaussian Elimination. Iterative method for solving a linear equation.

More information

Multifrontal Incomplete Factorization for Indefinite and Complex Symmetric Systems

Multifrontal Incomplete Factorization for Indefinite and Complex Symmetric Systems Multifrontal Incomplete Factorization for Indefinite and Complex Symmetric Systems Yong Qu and Jacob Fish Departments of Civil, Mechanical and Aerospace Engineering Rensselaer Polytechnic Institute, Troy,

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

MTH 464: Computational Linear Algebra

MTH 464: Computational Linear Algebra MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University February 6, 2018 Linear Algebra (MTH

More information

Exploiting Fill-in and Fill-out in Gaussian-like Elimination Procedures on the Extended Jacobian Matrix

Exploiting Fill-in and Fill-out in Gaussian-like Elimination Procedures on the Extended Jacobian Matrix 2nd European Workshop on AD 1 Exploiting Fill-in and Fill-out in Gaussian-like Elimination Procedures on the Extended Jacobian Matrix Andrew Lyons (Vanderbilt U.) / Uwe Naumann (RWTH Aachen) 2nd European

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

Numerical Methods I Solving Square Linear Systems: GEM and LU factorization

Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 18th,

More information

MAA507, Power method, QR-method and sparse matrix representation.

MAA507, Power method, QR-method and sparse matrix representation. ,, and representation. February 11, 2014 Lecture 7: Overview, Today we will look at:.. If time: A look at representation and fill in. Why do we need numerical s? I think everyone have seen how time consuming

More information

AN EFFICIENT APPROACH FOR MULTIFRONTAL AL- GORITHM TO SOLVE NON-POSITIVE-DEFINITE FI- NITE ELEMENT EQUATIONS IN ELECTROMAGNETIC PROBLEMS

AN EFFICIENT APPROACH FOR MULTIFRONTAL AL- GORITHM TO SOLVE NON-POSITIVE-DEFINITE FI- NITE ELEMENT EQUATIONS IN ELECTROMAGNETIC PROBLEMS Progress In Electromagnetics Research, PIER 95, 2 33, 29 AN EFFICIENT APPROACH FOR MULTIFRONTAL AL- GORITHM TO SOLVE NON-POSITIVE-DEFINITE FI- NITE ELEMENT EQUATIONS IN ELECTROMAGNETIC PROBLEMS J. Tian,

More information

The Solution of Linear Systems AX = B

The Solution of Linear Systems AX = B Chapter 2 The Solution of Linear Systems AX = B 21 Upper-triangular Linear Systems We will now develop the back-substitution algorithm, which is useful for solving a linear system of equations that has

More information

Notes on the Matrix-Tree theorem and Cayley s tree enumerator

Notes on the Matrix-Tree theorem and Cayley s tree enumerator Notes on the Matrix-Tree theorem and Cayley s tree enumerator 1 Cayley s tree enumerator Recall that the degree of a vertex in a tree (or in any graph) is the number of edges emanating from it We will

More information

Graph Theorizing Peg Solitaire. D. Paul Hoilman East Tennessee State University

Graph Theorizing Peg Solitaire. D. Paul Hoilman East Tennessee State University Graph Theorizing Peg Solitaire D. Paul Hoilman East Tennessee State University December 7, 00 Contents INTRODUCTION SIMPLE SOLVING CONCEPTS 5 IMPROVED SOLVING 7 4 RELATED GAMES 5 5 PROGENATION OF SOLVABLE

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

Chapter 7 Network Flow Problems, I

Chapter 7 Network Flow Problems, I Chapter 7 Network Flow Problems, I Network flow problems are the most frequently solved linear programming problems. They include as special cases, the assignment, transportation, maximum flow, and shortest

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

APPARC PaA3a Deliverable. ESPRIT BRA III Contract # Reordering of Sparse Matrices for Parallel Processing. Achim Basermannn.

APPARC PaA3a Deliverable. ESPRIT BRA III Contract # Reordering of Sparse Matrices for Parallel Processing. Achim Basermannn. APPARC PaA3a Deliverable ESPRIT BRA III Contract # 6634 Reordering of Sparse Matrices for Parallel Processing Achim Basermannn Peter Weidner Zentralinstitut fur Angewandte Mathematik KFA Julich GmbH D-52425

More information