Sparse Linear Algebra: Direct Methods, advanced features


1 Sparse Linear Algebra: Direct Methods, advanced features
P. Amestoy and A. Buttari (INPT(ENSEEIHT)-IRIT), A. Guermouche (Univ. Bordeaux-LaBRI), J.-Y. L'Excellent and B. Uçar (INRIA-CNRS/LIP-ENS Lyon), F.-H. Rouet (Lawrence Berkeley National Laboratory), C. Weisbecker (INPT(ENSEEIHT)-IRIT)

2 Sparse Matrix Factorizations
A ∈ R^(m×m), symmetric positive definite: LL^T = A, solve Ax = b
A ∈ R^(m×m), symmetric: LDL^T = A, solve Ax = b
A ∈ R^(m×m), unsymmetric: LU = A, solve Ax = b
A ∈ R^(m×n), m ≠ n: QR = A, min_x ||Ax - b|| if m > n; min ||x|| such that Ax = b if n > m

3 Cholesky on a dense matrix
(Example: 4×4 symmetric matrix with lower triangle a11; a21 a22; a31 a32 a33; a41 a42 a43 a44.)

left-looking Cholesky (columns to the left are used for the modification)
for k = 1, ..., n do
  for i = k, ..., n do
    for j = 1, ..., k-1 do
      a^(k)_ik = a^(k)_ik - l_ij l_kj
    end for
  end for
  l_kk = sqrt(a^(k-1)_kk)
  for i = k+1, ..., n do
    l_ik = a^(k-1)_ik / l_kk
  end for
end for

right-looking Cholesky (a factored column immediately updates the trailing submatrix)
for k = 1, ..., n do
  l_kk = sqrt(a^(k-1)_kk)
  for i = k+1, ..., n do
    l_ik = a^(k-1)_ik / l_kk
    for j = k+1, ..., i do
      a^(k)_ij = a^(k)_ij - l_ik l_jk
    end for
  end for
end for
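
To make the two variants concrete, here is a minimal NumPy sketch (illustrative, not part of the original slides) of the left-looking and right-looking loops above; both are checked against numpy.linalg.cholesky.

import numpy as np

def cholesky_left_looking(A):
    """Left-looking: column k is updated by all previous columns just before it is factored."""
    A = A.copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(n):
        for i in range(k, n):
            for j in range(k):
                A[i, k] -= L[i, j] * L[k, j]
        L[k, k] = np.sqrt(A[k, k])
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / L[k, k]
    return L

def cholesky_right_looking(A):
    """Right-looking: as soon as column k is factored, it updates the whole trailing submatrix."""
    A = A.copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(n):
        L[k, k] = np.sqrt(A[k, k])
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / L[k, k]
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                A[i, j] -= L[i, k] * L[j, k]
    return L

# quick check on a random SPD matrix
M = np.random.rand(5, 5); A = M @ M.T + 5 * np.eye(5)
assert np.allclose(cholesky_left_looking(A), np.linalg.cholesky(A))
assert np.allclose(cholesky_right_looking(A), np.linalg.cholesky(A))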

4 Factorization of sparse matrices: problems
The factorization of a sparse matrix is problematic due to the presence of fill-in.
The basic LU step: a^(k+1)_ij = a^(k)_ij - a^(k)_ik a^(k)_kj / a^(k)_kk
Even if a^(k)_ij is null, a^(k+1)_ij can be a nonzero. As a consequence:
- the factorization is more expensive than O(nz)
- a higher amount of memory is required than O(nz)
- more complicated algorithms are needed to achieve the factorization
(Figure from Liu: an example of matrix structures.)

5 Factorization of sparse matrices: problems
Because of the fill-in, the matrix factorization is commonly preceded by an analysis phase where:
- the fill-in is reduced through matrix permutations
- the fill-in is predicted through the use of (e.g.) elimination graphs
- computations are structured using elimination trees
The analysis must complete much faster than the actual factorization.

6 Analysis

7 Fill-reducing ordering methods
Three main classes of methods for minimizing fill-in during factorization:
Local approaches: at each step of the factorization, selection of the pivot that is likely to minimize fill-in. The method is characterized by the way pivots are selected: Markowitz criterion (for a general matrix), minimum degree (for symmetric matrices).
Global approaches: the matrix is permuted so as to confine the fill-in within certain parts of the permuted matrix: Cuthill-McKee, reverse Cuthill-McKee, nested dissection.
Hybrid approaches: first permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics.

8 The elimination process in the graphs
G(V, E): undirected graph of A
for k = 1 : n-1 do
  V ← V \ {k}  (remove vertex k)
  E ← (E \ {(k, l) : l ∈ adj(k)}) ∪ {(x, y) : x ∈ adj(k) and y ∈ adj(k)}
  G_k = (V, E)
end for
The G_k are the so-called elimination graphs (Parter, 61).
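
A small Python sketch of this elimination process (illustrative; the adjacency-dictionary representation is an assumption): vertex k is removed and its remaining neighbours are pairwise connected, which is exactly where fill edges come from.

def elimination_graphs(adj, n):
    """adj: dict mapping vertex -> set of neighbours (undirected graph of A).
    Returns the successive elimination graphs G_1, ..., G_{n-1} and the fill edges."""
    adj = {v: set(nb) for v, nb in adj.items()}
    fill = set()
    graphs = []
    for k in range(1, n):                    # eliminate vertices in order 1..n-1
        nbrs = adj.pop(k, set())
        for v in nbrs:
            adj[v].discard(k)
        # connect all pairs of former neighbours of k (this is the fill-in)
        for x in nbrs:
            for y in nbrs:
                if x < y and y not in adj[x]:
                    adj[x].add(y); adj[y].add(x)
                    fill.add((x, y))
        graphs.append({v: set(nb) for v, nb in adj.items()})
    return graphs, fill

# example: a 4-cycle 1-2-3-4-1
G0 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
graphs, fill = elimination_graphs(G0, 4)
print(fill)    # {(2, 4)}: eliminating vertex 1 connects its neighbours 2 and 4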

9 The Elimination Tree
The elimination tree is a graph that expresses all the dependencies in the factorization in a compact way.
Elimination tree: basic idea
If i depends on j (i.e., l_ij ≠ 0, i > j) and k depends on i (i.e., l_ki ≠ 0, k > i), then k depends on j by transitivity; therefore the direct dependency expressed by l_kj ≠ 0, if it exists, is redundant.
For each column j < n of L, remove all the nonzeros in column j except the first one below the diagonal. Let L_t denote the remaining structure and consider the matrix F_t = L_t + L_t^T. The graph G(F_t) is a tree called the elimination tree.
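
Given the (filled) structure of L, the elimination tree can be read off directly: the parent of j is the row index of the first off-diagonal nonzero in column j. A hedged NumPy sketch:

import numpy as np

def elimination_tree_from_L(L):
    """parent[j] = smallest i > j with L[i, j] != 0 (first subdiagonal nonzero),
    or -1 if column j has none (j is a root)."""
    n = L.shape[0]
    parent = np.full(n, -1, dtype=int)
    for j in range(n):
        rows = np.nonzero(L[j + 1:, j])[0]
        if rows.size > 0:
            parent[j] = j + 1 + rows[0]
    return parent

# tiny example on an assumed lower triangular structure
L = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1]], dtype=float)
print(elimination_tree_from_L(L))   # [ 2  2  3 -1]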

10 Uses of the elimination tree
The elimination tree has several uses in the factorization of a sparse matrix:
- it expresses the order in which variables can be eliminated: the elimination of a variable only affects (directly or indirectly) its ancestors and only depends on its descendants. Therefore, any topological order of the elimination tree leads to a correct result and to the same fill-in
- it expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel

11 Matrix factorization

12 Cholesky on a sparse matrix
The Cholesky factorization of a sparse matrix can be achieved with a left-looking, right-looking or multifrontal method.
Reference case: regular 3×3 grid ordered by nested dissection. Nodes in the separators are ordered last (see the section on orderings).
Notation:
cdiv(j): A(j, j) = sqrt(A(j, j)) and then A(j+1:n, j) = A(j+1:n, j) / A(j, j)
cmod(j, k): A(j:n, j) = A(j:n, j) - A(j, k) * A(j:n, k)
struct(L(1:k, j)): the structure of the L(1:k, j) submatrix

13 Sparse left-looking Cholesky
left-looking
for j = 1 to n do
  for k in struct(L(j, 1:j-1)) do
    cmod(j, k)
  end for
  cdiv(j)
end for
In the left-looking method, before variable j is eliminated, column j is updated with all the columns that have a nonzero on line j. In the example above, struct(L(7, 1:6)) = {1, 3, 4, 6}.
- this corresponds to receiving updates from nodes lower in the subtree rooted at j
- the filled graph is necessary to determine the structure of each row

14 Sparse right-looking Cholesky
right-looking
for k = 1 to n do
  cdiv(k)
  for j in struct(L(k+1:n, k)) do
    cmod(j, k)
  end for
end for
In the right-looking method, after variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9, 3)) = {7, 8, 9}.
- this corresponds to sending updates to nodes higher in the elimination tree
- the filled graph is necessary to determine the structure of each column
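
The cdiv/cmod kernels and the two sparse loops can be written down directly from the notation of slide 12; this dense-storage Python sketch (illustrative, not the slides' code) drives the loops with the nonzero structure of L.

import numpy as np

def cdiv(A, j):
    """Scale column j: A[j,j] = sqrt(A[j,j]); A[j+1:,j] /= A[j,j]."""
    A[j, j] = np.sqrt(A[j, j])
    A[j + 1:, j] /= A[j, j]

def cmod(A, j, k):
    """Update column j with column k (k < j): A[j:,j] -= A[j,k] * A[j:,k]."""
    A[j:, j] -= A[j, k] * A[j:, k]

def sparse_left_looking(A):
    A = np.tril(A).astype(float)
    n = A.shape[0]
    for j in range(n):
        for k in range(j):
            if A[j, k] != 0.0:        # k in struct(L(j, 1:j-1))
                cmod(A, j, k)
        cdiv(A, j)
    return A

def sparse_right_looking(A):
    A = np.tril(A).astype(float)
    n = A.shape[0]
    for k in range(n):
        cdiv(A, k)
        for j in range(k + 1, n):
            if A[j, k] != 0.0:        # j in struct(L(k+1:n, k))
                cmod(A, j, k)
    return A

M = np.random.rand(6, 6); S = M @ M.T + 6 * np.eye(6)
assert np.allclose(sparse_left_looking(S), np.linalg.cholesky(S))
assert np.allclose(sparse_right_looking(S), np.linalg.cholesky(S))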

15 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization

16 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example front: entries a44, a46, a47 and the symmetric ones; its partial factorization produces l44, l64, l74 and the contribution block entries b66, b67, b76, b77.)

17 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example front: entries a55, a56, a59 and the symmetric ones; its partial factorization produces l55, l65, l95 and the contribution block entries c66, c69, c96, c99.)

18 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example front: row/column 6 of A assembled with the contribution blocks b and c from the children; its partial factorization produces l66, l76, l86 and the contribution block entries d77, ..., d99.)

19 The Multifrontal Method
A dense matrix, called frontal matrix, is associated with each node of the elimination tree. The multifrontal method consists in a bottom-up traversal of the tree where at each node two operations are done:
Assembly: nonzeros from the original matrix (the arrowhead of the node) are assembled together with the contribution blocks from the children nodes into the frontal matrix.
Elimination: a partial factorization of the frontal matrix is done. The variables associated with the node of the tree (called fully assembled) can be eliminated. This step produces part of the final factors and a Schur complement (contribution block) that will be assembled into the frontal matrix of the father node.
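
The two operations performed at each node, partial factorization of a front and extend-add of the contribution block into the parent front, can be sketched as follows (assumed data layout: a front is a dense symmetric matrix plus the list of global variables indexing its rows/columns; illustrative code, not from the slides).

import numpy as np

def partial_factorize(F, nelim):
    """Eliminate the first nelim (fully assembled) variables of front F.
    Returns the factor panels and the Schur complement (contribution block)."""
    F11 = F[:nelim, :nelim]
    F21 = F[nelim:, :nelim]
    L11 = np.linalg.cholesky(F11)            # an LU would be used in the unsymmetric case
    L21 = np.linalg.solve(L11, F21.T).T      # L21 = F21 * L11^{-T}
    cb = F[nelim:, nelim:] - L21 @ L21.T     # Schur complement
    return L11, L21, cb

def extend_add(front, front_vars, cb, cb_vars):
    """Assemble contribution block cb (indexed by cb_vars) into the parent front."""
    pos = [front_vars.index(v) for v in cb_vars]
    for a, ia in enumerate(pos):
        for b, ib in enumerate(pos):
            front[ia, ib] += cb[a, b]
    return front

# child front on variables {4, 6, 7}: eliminate variable 4, then assemble its CB
F = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 0.5], [1.0, 0.5, 3.0]])
L11, L21, cb = partial_factorize(F, 1)
parent = np.zeros((3, 3))                    # parent front on variables {6, 7, 8}
extend_add(parent, [6, 7, 8], cb, [6, 7])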

20 The Multifrontal Method: example

21 Solve

22 Solve
Once the matrix is factorized, the problem can be solved against one or more right-hand sides:
AX = LL^T X = B,  A, L ∈ R^(n×n),  X, B ∈ R^(n×k)
The solution of this problem can be achieved in two steps:
forward substitution: L Z = B
backward substitution: L^T X = Z
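
A minimal sketch of the two triangular solves (dense storage, one right-hand side; illustrative code):

import numpy as np

def forward_substitution(L, b):
    """Solve L z = b with L lower triangular."""
    n = L.shape[0]
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def backward_substitution(L, z):
    """Solve L^T x = z."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - L[i + 1:, i] @ x[i + 1:]) / L[i, i]
    return x

# A x = b solved as L (L^T x) = b
M = np.random.rand(4, 4); A = M @ M.T + 4 * np.eye(4); b = np.ones(4)
L = np.linalg.cholesky(A)
x = backward_substitution(L, forward_substitution(L, b))
assert np.allclose(A @ x, b)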

23 Solve: left-looking
(Figures: forward substitution and backward substitution on the example matrix.)

24 Solve: right-looking
(Figures: forward substitution and backward substitution on the example matrix.)

25 Direct solvers: summary
The solution of a sparse linear system can be achieved in three phases that include at least the following operations:
Analysis: fill-reducing permutation, symbolic factorization, elimination tree computation
Factorization: the actual matrix factorization
Solve: forward substitution, backward substitution
These phases, especially the analysis, can include many other operations. Some of them are presented next.

26 Parallelism

27 Parallelization: two levels of parallelism
tree parallelism: arising from sparsity, it is formalized by the fact that nodes in separate subtrees of the elimination tree can be eliminated at the same time
node parallelism: within each node, parallel dense LU factorization (BLAS)
(Figure: going up the tree, tree parallelism decreases while node parallelism increases.)

28 Exploiting the second level of parallelism is crucial
Performance summary of the multifrontal factorization on matrix BCSSTK15. In column (1), we exploit only parallelism from the tree. In column (2), we combine the two levels of parallelism.
Computer        (1) MFlops (speed-up)   (2) MFlops (speed-up)
Alliant FX/     - (1.9)                 34 (4.3)
IBM 3090J/6VF   - (2.1)                 227 (3.8)
CRAY            - (1.8)                 404 (2.3)
CRAY Y-MP       - (2.3)                 1119 (4.8)

29 Task mapping and scheduling
Task mapping and scheduling: objective
Organize work to achieve a goal like makespan (total execution time) minimization, memory minimization, or a mixture of the two:
Mapping: where to execute a task?
Scheduling: when and where to execute a task?
Several approaches are possible:
static: build the schedule before the execution and follow it at run-time. Advantage: very efficient since it has a global view of the system. Drawback: requires a very good performance model for the platform.
dynamic: take scheduling decisions dynamically at run-time. Advantage: reactive to the evolution of the platform and easy to use on several platforms. Drawback: decisions taken with local criteria (a decision which seems good at time t can have very bad consequences at time t+1).
hybrid: try to combine the advantages of static and dynamic.

30 Task mapping and scheduling
The mapping/scheduling is commonly guided by a number of parameters:
- first of all, a mapping/scheduling which is good for concurrency is commonly not good in terms of memory consumption
- especially in distributed-memory environments, data transfers have to be limited as much as possible

31 Dynamic scheduling on shared memory computers The data can be shared between processors without any communication, therefore a dynamic scheduling is relatively simple to implement and efficient: The dynamic scheduling can be done through a pool of ready tasks (a task can be, for example, seen as the assembly and factorization of a front) as soon as a task becomes ready (i.e., when all the tasks associated with the child nodes are done), it is queued in the pool processes pick up ready tasks from the pool and execute them. As soon as one process finishes a task, it goes back to the pool to pick up a new one and this continues until all the nodes of the tree are done the difficulty lies in choosing one task among all those in the pool because the order can influence both performance and memory consumption
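
A toy shared-memory version of this pool-of-ready-tasks scheme (illustrative sketch: the tree representation, the helper names and the task body are assumptions, not the slides' implementation):

import threading, queue

def dynamic_tree_traversal(children, process_node, nthreads=4):
    """children: dict node -> list of children; every node of the tree must appear
    as a key (leaves map to []). A node becomes ready when all its children are
    done; worker threads repeatedly pick ready tasks from a shared pool."""
    parent = {c: p for p, cs in children.items() for c in cs}
    remaining = {p: len(cs) for p, cs in children.items()}
    pool = queue.Queue()
    for node in children:
        if remaining[node] == 0:
            pool.put(node)                    # leaves are ready immediately
    nnodes = len(children)
    done = threading.Semaphore(0)
    lock = threading.Lock()

    def worker():
        while True:
            node = pool.get()
            if node is None:
                return
            process_node(node)                # e.g. assemble + factorize the front
            with lock:
                p = parent.get(node)
                if p is not None:
                    remaining[p] -= 1
                    if remaining[p] == 0:
                        pool.put(p)           # parent becomes ready
            done.release()

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads: t.start()
    for _ in range(nnodes): done.acquire()
    for _ in threads: pool.put(None)          # shut the workers down
    for t in threads: t.join()

# example: a small tree 5 <- {3, 4}, 3 <- {1, 2}
tree = {1: [], 2: [], 3: [1, 2], 4: [], 5: [3, 4]}
dynamic_tree_traversal(tree, lambda n: print("processing node", n), nthreads=2)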

32 Decomposition of the tree into levels
Determination of level L0 based on subtree cost.
(Figure: elimination tree on nodes 1, 2, ..., 8; the subtree roots below level L0 are mapped onto the processors.)
Scheduling of the top of the tree can be dynamic.

33 Simplifying the mapping/scheduling problem
The tree commonly contains many more nodes than the available processes.
Objective: find a layer L0 such that the subtrees rooted at L0 can be mapped onto the processors with a good balance.
Construction and mapping of the initial level L0:
Let L0 = {roots of the assembly tree}
repeat
  Find the node q in L0 whose subtree has the largest computational cost
  Set L0 = (L0 \ {q}) ∪ {children of q}
  Greedy mapping of the nodes of L0 onto the processors
  Estimate the load unbalance
until load unbalance < threshold
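
The layer-construction loop can be sketched as follows (illustrative Python; the children and cost dictionaries, the greedy mapping and the balance test are simplified assumptions):

import heapq

def find_level_L0(roots, children, cost, nprocs, threshold=1.2):
    """Repeatedly replace the heaviest subtree of the current layer by its
    children until a greedy mapping of the layer is balanced enough.
    cost[n]: computational cost of the whole subtree rooted at n."""
    # max-heap on subtree cost (heapq is a min-heap, hence the minus sign)
    layer = [(-cost[r], r) for r in roots]
    heapq.heapify(layer)
    while True:
        # greedy mapping: give the next heaviest subtree to the least loaded proc
        loads = [0.0] * nprocs
        for c, _ in sorted(layer):
            loads[loads.index(min(loads))] += -c
        if max(loads) <= threshold * (sum(loads) / nprocs):
            return [n for _, n in layer]
        # otherwise split the most expensive subtree
        _, q = heapq.heappop(layer)
        kids = children.get(q, [])
        if not kids:                          # cannot split a leaf: accept the layer
            heapq.heappush(layer, (-cost[q], q))
            return [n for _, n in layer]
        for child in kids:
            heapq.heappush(layer, (-cost[child], child))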

34 Static scheduling on distributed memory computers
Main objective: reduce the volume of communication between processors, because data is transferred through the (slow) network.
Proportional mapping:
- initially assigns all processes to the root node
- performs a top-down traversal of the tree where the processes assigned to a node are subdivided among its children in a way that is proportional to their relative weight
(Figure: mapping of the tasks onto 5 processors.)
Good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.
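
The proportional-mapping rule itself fits in a few lines (recursive Python sketch; weight[n] is assumed to hold the total work of the subtree rooted at n, and rounding is handled naively):

def proportional_mapping(node, procs, children, weight, mapping=None):
    """Assign the list `procs` to `node`, then split it among the children
    proportionally to their subtree weights."""
    if mapping is None:
        mapping = {}
    mapping[node] = list(procs)
    kids = children.get(node, [])
    if kids:
        total = sum(weight[c] for c in kids)
        start = 0
        for i, c in enumerate(kids):
            # at least one process per child, proportional share otherwise
            share = max(1, round(len(procs) * weight[c] / total))
            end = len(procs) if i == len(kids) - 1 else min(start + share, len(procs))
            proportional_mapping(c, procs[start:end], children, weight, mapping)
            start = end
    return mapping

# example: 4 processes on a root with two children of weights 3 and 1
children = {0: [1, 2], 1: [], 2: []}
weight = {1: 3, 2: 1}
print(proportional_mapping(0, [0, 1, 2, 3], children, weight))
# {0: [0, 1, 2, 3], 1: [0, 1, 2], 2: [3]}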

35 Assumptions and Notations
Assumptions: we assume that each column of L / each node of the tree is assigned to a single processor. Each processor is in charge of computing cdiv(j) for the columns j that it owns.
Notation:
mycols(p): the set of columns owned by processor p
map(j): the processor owning column j (or task j)
procs(L(:,k)) = {map(j) | j ∈ struct(L(:,k))} (only processors in procs(L(:,k)) require updates from column k; they correspond to ancestors of k in the tree)
father(j): the father of node j in the elimination tree

36 Computational strategies for parallel direct solvers
The parallel algorithm is characterized by:
- the computational graph dependency
- the communication graph
There are three classical approaches to distributed memory parallelism:
1. Fan-in: the fan-in algorithm is very similar to the left-looking approach and is demand-driven: the data required are aggregated update columns computed by the sending processor.
2. Fan-out: the fan-out algorithm is very similar to the right-looking approach and is data-driven: data is sent as soon as it is produced.
3. Multifrontal: the communication pattern follows a bottom-up traversal of the tree. Messages are contribution blocks and are sent to the processor mapped on the father node.

37 Fan-in variant (similar to left-looking)
fan-in
for j = 1 to n
  u = 0
  for all k in (struct(L(j, 1:j-1)) ∩ mycols(p))
    cmod(u, k)
  end for
  if map(j) != p
    send u to processor map(j)
  else
    incorporate u in column j
    receive all the updates on column j and incorporate them
    cdiv(j)
  end if
end for

38-43 Fan-in variant
(Figures: successive steps of the fan-in factorization on an example tree mapped onto processors P0-P4, showing where communication occurs.)

44 Fan-in variant
Communication: if ∀ i ∈ children, map(i) = P0 and map(father) ≠ P0, then (only) one message is sent by P0: the fan-in variant exploits the data locality of the proportional mapping.

45 Fan-out variant (similar to right-looking)
fan-out
for all leaf nodes j in mycols(p)
  cdiv(j)
  send column L(:,j) to procs(L(:,j))
  mycols(p) = mycols(p) - {j}
end for
while mycols(p) != ∅
  receive any column (say L(:,k))
  for j in struct(L(:,k)) ∩ mycols(p)
    cmod(j, k)
    if column j is completely updated
      cdiv(j)
      send column L(:,j) to procs(L(:,j))
      mycols(p) = mycols(p) - {j}
    end if
  end for
end while

46 Fan-out variant
(Figures: successive steps of the fan-out factorization on processors P0-P4.)

47-50 Fan-out variant
Communication: if ∀ i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.

51 Fan-out variant
Properties of fan-out:
- Historically the first implemented.
- Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of total number of messages and of total volume.
- Does not exploit the data locality of the proportional mapping.
Improved algorithm (local aggregation):
- send aggregated update columns instead of individual factor columns for columns mapped on a single processor;
- improves the exploitation of the data locality of the proportional mapping;
- but the memory increase to store aggregates can be critical (as in fan-in).

52 Multifrontal
multifrontal
for all leaf nodes j in mycols(p)
  assemble front j
  partially factorize front j
  send the Schur complement to procs(father(j))
  mycols(p) = mycols(p) - {j}
end for
while mycols(p) != ∅
  receive any contribution block (say for node j)
  assemble contribution block into front j
  if front j is completely assembled
    partially factorize front j
    send the Schur complement to procs(father(j))
    mycols(p) = mycols(p) - {j}
  end if
end while

53-57 Multifrontal variant
(Figures: successive steps of the factorization on an example tree mapped onto processors P0-P2, comparing the communication patterns of fan-in, fan-out and multifrontal.)

58 Importance of the shape of the tree
Suppose that each node in the tree corresponds to a task that:
- consumes temporary data from the children,
- produces temporary data, that is passed to the parent node.
Wide tree: good parallelism, but many temporary blocks to store and large memory usage.
Deep tree: less parallelism, but smaller memory usage.

59 Impact of fill reduction on the shape of the tree
Reordering technique   Shape of the tree          Observations
AMD                    Deep, well-balanced        Large frontal matrices on top
AMF                    Very deep, unbalanced      Small frontal matrices

60 Impact of fill reduction on the shape of the tree
PORD                   Deep, unbalanced           Small frontal matrices
SCOTCH                 Very wide, well-balanced   Large frontal matrices
METIS                  Wide, well-balanced        Smaller frontal matrices (than SCOTCH)

61 Memory consumption in the multifrontal method

62 Equivalent reorderings
A tree can be regarded as a dependency graph where each node produces data that is then processed by its parent. By this definition, a tree has to be traversed in topological order. But there are still many topological traversals of the same tree.
Definition: two orderings P and Q are equivalent if the structures of the filled graphs of PAP^T and QAQ^T are the same (that is, they are isomorphic). Equivalent orderings result in the same amount of fill-in and computation during the factorization.
To ease the notation, we discuss only one ordering with respect to A, i.e., P is an equivalent ordering of A if the filled graph of A and that of PAP^T are isomorphic.

63 Equivalent reorderings
All topological orderings of T(A) are equivalent:
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then G+(PAP^T) and G+(A) are isomorphic.
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then the elimination trees T(PAP^T) and T(A) are isomorphic.
Because the fill-in won't change, we have the freedom to choose any specific topological order that provides other properties.

64 Tree traversal orders Which specific topological order to choose? postorder: why? Because data produced by nodes is consumed by parents in a LIFO order. In the multifrontal method, we can thus use a stack memory where contribution blocks are pushed as soon as they are produced by the elimination on a frontal matrix and popped at the moment where the father node is assembled. This provides a better data locality and makes the memory management easier.

65 The multifrontal method [Duff & Reid 83]
(Figures: the matrix A, the filled matrix L+U-I, and the elimination tree.)
Storage is divided into two parts:
- Factors
- Active memory = active frontal matrix + stack of contribution blocks

66-78 The multifrontal method [Duff & Reid 83]
(Figures: successive steps of the factorization of the same example, showing at each step the factors already computed and the active memory, i.e., the active frontal matrix plus the stack of contribution blocks.)

79 Example 1: Processing a wide tree
(Figure: evolution of the memory; legend: unused memory space, stack memory space, factor memory space, non-free memory space; the active memory is highlighted.)

80 Example 2: Processing a deep tree
(Figure: evolution of the memory, showing the allocation, assembly, factorization and stacking steps for each node; legend: unused memory space, factor memory space, stack memory space, non-free memory space.)

81 Postorder traversals: memory
A postorder provides good data locality and better memory consumption than a general topological order, since father nodes are assembled as soon as their children have been processed. But there are still many postorders of the same tree. Which one to choose? The one that minimizes memory consumption.
(Figure: on the example tree with leaves a, b, d, f, h and internal nodes c, e, g, i, the best postorder is (abcdefghi) and the worst is (hfdbacegi).)

82-83 Problem model
M_i: memory peak for the complete subtree rooted at i
temp_i: temporary memory produced by node i
m_parent: memory for storing the parent
M_parent = max( max_{j=1..nbchildren} (M_j + sum_{k=1}^{j-1} temp_k), m_parent + sum_{j=1}^{nbchildren} temp_j )    (1)
Objective: order the children to minimize M_parent.

84 Memory-minimizing schedules
Theorem [Liu, 86]: the minimum of max_j (x_j + sum_{i=1}^{j-1} y_i) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i - y_i.
Corollary: an optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i - temp_i.
Interpretation: at each level of the tree, a child with a relatively large memory peak in its subtree (M_i large with respect to temp_i) should be processed first.
Apply on the complete tree starting from the leaves (or from the root with a recursive approach).
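
Formula (1) and the corollary translate directly into a few lines of Python (an illustrative sketch; M_j and temp_j are the quantities defined above):

def peak_for_order(children_M, children_temp, m_parent):
    """Evaluate formula (1) for a given child order."""
    peak, stacked = 0.0, 0.0
    for M_j, t_j in zip(children_M, children_temp):
        peak = max(peak, stacked + M_j)    # while processing child j
        stacked += t_j                     # its CB stays on the stack
    return max(peak, stacked + m_parent)   # allocating the parent front

def best_child_order(children):
    """children: list of (M_i, temp_i); sort in decreasing order of M_i - temp_i."""
    return sorted(children, key=lambda c: c[0] - c[1], reverse=True)

# example: processing the 'large peak, small CB' child first minimizes the peak
kids = [(10.0, 1.0), (4.0, 3.0)]
ordered = best_child_order(kids)
print(peak_for_order([M for M, t in ordered], [t for M, t in ordered], m_parent=2.0))  # 10.0
print(peak_for_order([4.0, 10.0], [3.0, 1.0], m_parent=2.0))                           # 13.0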

85 Optimal tree reordering
Objective: minimize the peak of stack memory.
Tree_Reorder(T):
  for all i in the set of root nodes do
    Process_Node(i)
  end for
Process_Node(i):
  if i is a leaf then
    M_i = m_i
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    Reorder the children of i in decreasing order of (M_j - temp_j)
    Compute M_parent at node i using Formula (1)
  end if

86 Memory consumption in parallel
In parallel, multiple branches have to be traversed at the same time in order to feed all the processes. This means that a higher number of CBs will have to be stored in memory.
Metric: memory efficiency e(p) = S_seq / (p * S_max(p))
We would like e(p) ≈ 1, i.e., a peak of S_seq / p on each processor.
Common mappings/schedulings provide a poor memory efficiency: for proportional mapping, lim_{p→∞} e(p) = 0 on regular problems.

87-90 Proportional mapping
Proportional mapping: top-down traversal of the tree, where the processors assigned to a node are distributed among its children proportionally to the weight of their respective subtrees.
(Figures: 64 processors at the root recursively split among the subtrees.)
Targets run time but poor memory efficiency. Usually a relaxed version is used: more memory-friendly but unreliable estimates.

91 Proportional mapping
Proportional mapping: assuming that the sequential peak is 5 GB,
S_max(p) ≥ max{4 GB / 26, 1 GB / 6, 5 GB / ...} = 0.16 GB,
so e(p) is well below 1.
(Figure: subtrees with peaks 4 GB, 1 GB and 5 GB mapped on the 64 processors.)

92 A more memory-friendly strategy...
All-to-all mapping: postorder traversal of the tree, where all the processors work at every node. Optimal memory scalability (S_max(p) = S_seq / p) but no tree parallelism and prohibitive amounts of communication.

93-96 Memory-aware mappings
Memory-aware mapping (Agullo 08): aims at enforcing a given memory constraint (M0, maximum memory per processor):
1. Try to apply proportional mapping.
2. Enough memory for each subtree? If not, serialize them, and update M0: processors stack equal shares of the CBs from previous nodes.
Ensures the given memory constraint and provides reliable estimates. Tends to assign many processors to nodes at the top of the tree, which leads to performance issues on parallel nodes.
(Figures: example tree with subtree peaks 4 GB, 1 GB and 5 GB, mapped on 64 processors.)

97 Memory-aware mappings
subroutine ma_mapping(r, p, M, S)
  ! r    : root of the subtree
  ! p(r) : number of processes mapped on r
  ! M(r) : memory constraint on r
  ! S(:) : peak memory consumption for all nodes
  call prop_mapping(r, p, M)
  forall (i children of r)
    ! check if the constraint is respected
    if (S(i)/p(i) > M(i)) then
      ! reject PM and revert to tree serialization
      call tree_serialization(r, p, M)
      exit ! from forall block
    end if
  end forall
  forall (i children of r)
    ! apply MA mapping recursively to all siblings
    call ma_mapping(i, p, M, S)
  end forall
end subroutine ma_mapping

98 Memory-aware mappings
subroutine prop_mapping(r, p, M)
  forall (i children of r)
    p(i) = share of the p(r) processes from PM
    M(i) = M(r)
  end forall
end subroutine prop_mapping

subroutine tree_serialization(r, p, M)
  stack_siblings = 0
  forall (i children of r) ! in appropriate order
    p(i) = p(r)
    ! update the memory constraint for the subtree
    M(i) = M(r) - stack_siblings
    stack_siblings = stack_siblings + cb(i)/p(r)
  end forall
end subroutine tree_serialization

99 Memory-aware mappings
A finer memory-aware mapping? Serializing all the children at once is very constraining: more tree parallelism can be found.
Find groups on which proportional mapping works, and serialize these groups. Heuristic: follow a given order (e.g. the serial postorder) and form groups as large as possible.

100 Experiments Matrix: finite-difference model of acoustic wave propagation, 27-point stencil, grid; Seiscope consortium. N = 7 M, nnz = 189 M, factors=152 GB (METIS). Sequential peak of active memory: 46 GB. Machine: 64 nodes with two quad-core Xeon X5560 per node. We use 256 MPI processes. Perfect memory scalability: 46 GB/256=180MB.

101 Experiments
(Table: maximum stack peak (MB), average stack peak (MB) and time (s) for the proportional mapping and for the memory-aware mapping with M0 = 225 MB and M0 = 380 MB, without and with groups.)
- proportional mapping delivers the fastest factorization because it targets parallelism but, at the same time, it leads to a poor memory scaling, with e(256) = 0.09
- the memory-aware mapping, with and without groups, respects the imposed memory constraint, with a speed that depends on how tight the constraint is
- grouping clearly provides a benefit in terms of speed because it delivers more concurrency while still respecting the constraint

102 Low-Rank approximations

103-106 Low-rank matrices
Take a dense matrix B of size n × n and compute its SVD B = XSY:
B = X1 S1 Y1 + X2 S2 Y2 with S1(k,k) = σ_k > ε, S2(1,1) = σ_{k+1} ≤ ε
If B̃ = X1 S1 Y1, then ||B - B̃||_2 = ||X2 S2 Y2||_2 = σ_{k+1} ≤ ε.
If the singular values of B decay very fast (e.g., exponentially), then k ≪ n even for very small ε; memory and CPU consumption can be reduced considerably, with a controlled loss of accuracy (≤ ε), if B̃ is used instead of B.
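
A hedged NumPy illustration of the truncation just described: keep the singular values above ε and check that the 2-norm error is bounded by the first discarded singular value (the kernel matrix used here is only an example of a matrix with fast singular-value decay).

import numpy as np

def low_rank_approx(B, eps):
    """Truncate the SVD of B at absolute threshold eps on the singular values."""
    X, s, Yt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > eps))                     # numerical rank at accuracy eps
    B_tilde = X[:, :k] @ np.diag(s[:k]) @ Yt[:k, :]
    return B_tilde, k

# a matrix with quickly decaying singular values (smooth kernel evaluated on a grid)
x = np.linspace(1.0, 2.0, 200)
B = 1.0 / (x[:, None] + x[None, :] + 3.0)
B_tilde, k = low_rank_approx(B, eps=1e-10)
print(k, np.linalg.norm(B - B_tilde, 2))         # small k, error below 1e-10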

107 Multifrontal BLR LU factorization
Frontal matrices are not low-rank, but in many applications they exhibit low-rank blocks.
(Figure: a frontal matrix partitioned into blocks L_{1,1}/U_{1,1}, ..., L_{d,d}/U_{d,d} and the CB part.)
Because each row and column of a frontal matrix is associated with a point of the discretization mesh, blocking the frontal matrix amounts to grouping the corresponding points of the domain. The objective is thus to group (cluster) the points in such a way that the number of blocks with the low-rank property is maximized and their respective ranks are minimized.

108-113 Clustering of points/variables
A block represents the interaction between two subdomains σ and τ. If they have small diameters and are far away from each other, the interaction is weak and the rank is low.
(Figures: the rank of the block decreases as the distance between σ and τ increases.)
Guidelines for computing the grouping of points:
- ∀σ, c_min ≤ |σ| ≤ c_max: we want many compressible blocks, but not too small ones, to maintain a good compression rate and a good efficiency of operations.
- ∀σ, minimize diam(σ): for a given cluster size, this gives well-shaped clusters and consequently maximizes mutual distances.

114 Clustering For example, if the points associated to the rows/columns of a frontal matrix lie on a square surface, it is much better to cluster in a checkerboard fashion (right), rather than into slices (left). In practice and in an algebraic context (where the discretization mesh is not known) this can be achieved by extracting the subgraph associated with the front variables and by computing a k-way partitioning (with a tool like METIS or SCOTCH, for example).

115 Clustering
Here is an example of variable clustering achieved with the k-way partitioning method in the SCOTCH package. Although irregular, the partitioning provides relatively well-shaped clusters.

116 Admissibility condition
Once the clustering of variables is defined, it is possible to compress all the blocks F(σ, τ) that satisfy a (strong) admissibility condition:
Adm_s(σ, τ): max(diam(σ), diam(τ)) ≤ η dist(σ, τ)
For simplicity, we can use a relaxed version called the weak admissibility condition:
Adm_w(σ, τ): dist(σ, τ) > 0.
The far field of a cluster σ is defined as F(σ) = {τ | Adm(σ, τ) is true}.
In practice it is easier to consider all blocks (except the diagonal ones) eligible, try to compress them, and eventually revert to the non-compressed form if the rank is too high.

117-128 Dense BLR LU factorization
Off-diagonal blocks are compressed in low-rank form right before updating the trailing submatrix. As a result, updating a block A_ij of size b using blocks L_ik, U_kj of rank r only takes O(b r^2) operations instead of O(b^3).
The factorization proceeds block column by block column (figures: successive steps on a block-partitioned matrix):
FACTOR:  A_kk = L_kk U_kk
SOLVE:   U_kj = L_kk^{-1} A_kj,  L_ik = A_ik U_kk^{-1}
UPDATE:  A_ij = A_ij - L_ik U_kj

129-143 Dense BLR LU factorization
Off-diagonal blocks are compressed in low-rank form right before updating the trailing submatrix. As a result, updating a block A_ij of size b using blocks L_ik, U_kj of rank r only takes O(b r^2) operations instead of O(b^3).
Same sequence of steps, now with an explicit compression step (figures: successive steps on a block-partitioned matrix):
FACTOR, SOLVE, COMPRESS (L_k = X_k Y_k^T, U_k = X_k Y_k^T), UPDATE
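
To fix ideas, here is a compact NumPy sketch (not the slides' implementation) of the block LU loop with the FACTOR/SOLVE/COMPRESS/UPDATE steps; the helper names (dense_lu, svd_compress, blr_lu), the block size and the test matrix are illustrative assumptions, and pivoting is omitted for brevity.

import numpy as np

def dense_lu(A):
    """Unpivoted dense LU of a small block (assumed well conditioned)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

def svd_compress(B, eps):
    """Truncated SVD: return (X, Y) with B ~ X @ Y, rank chosen by eps."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = max(1, int(np.sum(s > eps)))
    return U[:, :k] * s[:k], Vt[:k, :]

def blr_lu(A, bs, eps):
    """Block LU where off-diagonal panels are compressed right before the
    low-rank update of the trailing blocks. A is overwritten by its factors."""
    A = A.copy()
    p = A.shape[0] // bs
    blk = lambda i, j: A[i*bs:(i+1)*bs, j*bs:(j+1)*bs]
    for k in range(p):
        Lkk, Ukk = dense_lu(blk(k, k))                                 # FACTOR
        blk(k, k)[:] = np.tril(Lkk, -1) + Ukk
        for i in range(k + 1, p):                                      # SOLVE
            blk(i, k)[:] = np.linalg.solve(Ukk.T, blk(i, k).T).T       # L_ik
            blk(k, i)[:] = np.linalg.solve(Lkk, blk(k, i))             # U_ki
        lowL = {i: svd_compress(blk(i, k), eps) for i in range(k + 1, p)}   # COMPRESS
        lowU = {j: svd_compress(blk(k, j), eps) for j in range(k + 1, p)}
        for i in range(k + 1, p):                                      # UPDATE in O(b r^2)
            Xi, Yi = lowL[i]
            for j in range(k + 1, p):
                Xj, Yj = lowU[j]
                blk(i, j)[:] -= Xi @ ((Yi @ Xj) @ Yj)
    return A

# sanity check on a diagonally dominant matrix (random blocks are not low-rank,
# so this only verifies the correctness of the block algorithm itself)
n, bs = 128, 32
A = np.random.rand(n, n) + n * np.eye(n)
F = blr_lu(A, bs, eps=1e-8)
L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
print(np.linalg.norm(A - L @ U) / np.linalg.norm(A))                   # small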

144 Experimental MF complexity
Setting:
1. Poisson: N^3 grid with a 7-point stencil, with u = 1 on the boundary:
   Δu = f
2. Helmholtz: N^3 grid with a 27-point stencil, where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):
   (-Δ - ω^2 / v(x)^2) u(x, ω) = s(x, ω)

145 Experimental MF complexity: entries in factors
(Plots: number of entries in the factors vs problem size N, for Poisson and Helmholtz; full-rank complexity O(n^1.3); the fitted BLR complexities are of the form n^1.04 log(n), n^1.20 log(n) and n^1.07 log(n).)
- ε only plays a role in the constant factor for Poisson
- a factor 3 gain with almost no loss of accuracy

146 Experimental MF complexity: operations
(Plots: flop count vs problem size N, for Poisson and Helmholtz; full-rank complexity O(n^2); the fitted BLR complexities have exponents well below 2.)
- ε only plays a role in the constant factor for Poisson
- a factor 9 gain with almost no loss of accuracy

147 Experiments: Poisson
Poisson equation, 7-point stencil.
(Plots: % flops and % time with respect to the full-rank factorization, and componentwise scaled residual with and without iterative refinement, as functions of the BLR accuracy ε from 1e-14 to 1e-2.)
- Compression improves with higher ε values
- Time reduction is not as good as the flop reduction due to the lower efficiency of the operations
- The componentwise scaled residual is in good accordance with ε
- Iterative refinement can be used to recover the lost accuracy
- For very high values of ε, the factorization can be used as a preconditioner

148 Experiments: Helmholtz
Helmholtz equation, 27-point stencil.
(Plots: % flops and % time with respect to the full-rank factorization, and componentwise scaled residual with and without iterative refinement, as functions of the BLR accuracy ε from 1e-14 to 1e-2.)
- Compression improves with higher ε values
- Time reduction is not as good as the flop reduction due to the lower efficiency of the operations
- The componentwise scaled residual is in good accordance with ε
- Iterative refinement can be used to recover the lost accuracy
- For very high values of ε, the factorization can be used as a preconditioner

149 Sparse QR Factorization

150 Overdetermined Problems
In the case Ax = b with A ∈ R^(m×n), m > n, the problem is overdetermined and the existence of a solution is not guaranteed (a solution only exists if b ∈ R(A)). In this case, the desired solution may be the one that minimizes the norm of the residual (least squares problem):
min_x ||Ax - b||_2
(Figure: b, Ax and the residual r with respect to span{A}.)
The figure above shows, intuitively, that the residual has minimum norm if it is orthogonal to R(A), i.e., Ax is the projection of b onto R(A).

151 Least Squares Problems: normal equations
min_x ||Ax - b||_2  is equivalent to  min_x ||Ax - b||_2^2
||Ax - b||_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2 x^T A^T b + b^T b
The first derivative with respect to x is equal to zero at x = (A^T A)^{-1} A^T b. This is the solution of the n × n fully determined problem
A^T A x = A^T b
called the normal equations system:
- A^T A is symmetric, positive definite if rank(A) = n (nonnegative definite otherwise)
- it can be solved with a Cholesky factorization
- not necessarily backward stable because κ(A^T A) = κ(A)^2
A^T (Ax - b) = -A^T r = 0 means that the residual is orthogonal to R(A), which delivers the desired solution.
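
A direct transcription of the normal-equations approach as a NumPy sketch (illustrative only; adequate when A is well conditioned, keeping in mind that κ(A^T A) = κ(A)^2):

import numpy as np

def least_squares_normal_equations(A, b):
    """Solve min_x ||Ax - b||_2 via A^T A x = A^T b (Cholesky)."""
    L = np.linalg.cholesky(A.T @ A)
    y = np.linalg.solve(L, A.T @ b)       # solve L y = A^T b
    return np.linalg.solve(L.T, y)        # then  L^T x = y

A = np.random.rand(8, 3); b = np.random.rand(8)
x = least_squares_normal_equations(A, b)
print(np.allclose(A.T @ (A @ x - b), 0.0))   # the residual is orthogonal to R(A)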

152 Least Squares Problems: the QR factorization
The QR factorization of a matrix A ∈ R^(m×n) consists in finding an orthonormal matrix Q ∈ R^(m×m) and an upper triangular matrix R ∈ R^(n×n) such that
A = Q [ R ]
      [ 0 ]
Assuming Q = [Q1 Q2], where Q1 is made up of the first n columns of Q and Q2 of the remaining m - n, and writing
Q^T b = [ Q1^T b ] = [ c ]
        [ Q2^T b ]   [ d ]
we have ||Ax - b||_2^2 = ||Q^T A x - Q^T b||_2^2 = ||Rx - c||_2^2 + ||d||_2^2.
In the case of a full-rank matrix A, this quantity is minimized if Rx = c.

153 Least Squares Problems: the QR factorization
Assume Q = [Q1 Q2] where Q1 contains the first n and Q2 the last m - n columns of Q:
- Q1 is an orthogonal basis for R(A) and Q2 is an orthogonal basis for N(A^T)
- P1 = Q1 Q1^T and P2 = Q2 Q2^T are projectors onto R(A) and N(A^T). Thus, Ax = P1 b and r = b - Ax = P2 b.
(Figure: b, Ax and r with respect to span{A}.)
Properties of the solution by QR factorization:
- it is backward stable
- it requires 3n^2 (m - n/3) flops (much more expensive than Cholesky)
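
The same least squares problem solved through the QR route, with the projector interpretation checked explicitly (NumPy sketch, not from the slides):

import numpy as np

def least_squares_qr(A, b):
    """Solve min_x ||Ax - b||_2 through A = Q [R; 0]."""
    Q, R = np.linalg.qr(A, mode='reduced')   # Q is m x n (= Q1), R is n x n
    return np.linalg.solve(R, Q.T @ b)       # Rx = c with c = Q1^T b

A = np.random.rand(8, 3); b = np.random.rand(8)
x = least_squares_qr(A, b)
Q, _ = np.linalg.qr(A, mode='reduced')
print(np.allclose(A @ x, Q @ (Q.T @ b)))     # Ax = P1 b, the projection of b onto R(A)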

154 Underdetermined Problems
In the case Ax = b with A ∈ R^(m×n), m < n, the problem is underdetermined and admits infinitely many solutions; if x_p is any solution of Ax = b, then the complete set of solutions is
{x | Ax = b} = {x_p + z | z ∈ N(A)}
In this case, the desired solution may be the one with minimum 2-norm (minimum 2-norm solution problem):
minimize ||x||_2 subject to Ax = b

155 Minimum 2-norm solution problems
The minimum 2-norm solution is x_mn = A^T (A A^T)^{-1} b.
In fact, take any other solution x; then A(x - x_mn) = 0 and
(x - x_mn)^T x_mn = (x - x_mn)^T A^T (A A^T)^{-1} b = (A(x - x_mn))^T (A A^T)^{-1} b = 0
which means (x - x_mn) ⊥ x_mn. Thus
||x||_2^2 = ||x_mn + x - x_mn||_2^2 = ||x_mn||_2^2 + ||x - x_mn||_2^2
i.e., x_mn has the smallest 2-norm of any solution.
As for the normal equations system, this approach incurs instability due to κ(A A^T).

156 Minimum 2-norm solution problems: the QR factorization
Assume
QR = [Q1 Q2] [ R1 ] = A^T
             [ 0  ]
is the QR factorization of A^T, where Q1 is made of the first m columns of Q and Q2 of the last n - m ones. Then
Ax = R^T Q^T x = [R1^T 0] [ z1 ] = b,  i.e.,  R1^T z1 = b
                          [ z2 ]
The minimum 2-norm solution follows by setting z2 = 0. Note that Q2 is an orthogonal basis for N(A); thus, the minimum 2-norm solution is achieved by removing from any solution x_p all its components in N(A).

157-158 Householder QR
A matrix H of the form
H = I - 2 v v^T / (v^T v)
is called a Householder reflection and v a Householder vector. The transformation P_v = v v^T / (v^T v) is the projection onto the span of v.
Householder reflections can be used to annihilate all the coefficients of a vector x except one, which corresponds to reflecting x onto one of the original axes. If v = x ± ||x||_2 e_1, then Hx has all coefficients equal to zero except the first.

159 Householder QR
Note that in practice the H matrix is never explicitly formed. Instead, it is applied through products with the vector v. Also, note that if x is sparse, the v vector (and thus the H matrix) follows its structure.
Imagine a sparse vector x and a compressed vector x̃ built by removing all the zeros in x:
x = [ x1 ]      x̃ = [ x1 ]
    [ 0  ]          [ x2 ]
    [ x2 ]
If H̃ is the Householder reflector that annihilates all the elements of x̃ but the first, with Householder vector ṽ = [v1; v2], then the equivalent transformation for x is obtained by expanding ṽ with zeros in the positions of the zeros of x:
v = [ v1 ]      H = [ H11  0  H12 ]
    [ 0  ]          [ 0    I  0   ]
    [ v2 ]          [ H21  0  H22 ]
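
A small NumPy sketch of the Householder reflection: build v so that Hx is a multiple of e_1 and apply H without ever forming it (the sign of v_1 is chosen to avoid cancellation; illustrative code, not from the slides):

import numpy as np

def householder_vector(x):
    """Return v such that Hx has all entries equal to zero except the first."""
    sigma = np.linalg.norm(x)
    v = x.astype(float).copy()
    v[0] += sigma if x[0] >= 0 else -sigma
    return v

def apply_householder(v, x):
    """Compute Hx = x - 2 v (v^T x) / (v^T v) without forming H."""
    return x - 2.0 * v * (v @ x) / (v @ v)

x = np.array([3.0, 1.0, 2.0])
v = householder_vector(x)
print(apply_householder(v, x))   # approximately [-3.742, 0, 0], i.e. -||x||_2 e_1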


More information

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX

Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format

Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format Bridging the gap between flat and hierarchical low-rank matrix formats: the multilevel BLR format Amestoy, Patrick and Buttari, Alfredo and L Excellent, Jean-Yves and Mary, Theo 2018 MIMS EPrint: 2018.12

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ . p.1/21 11 Dec. 2014, LJLL, Paris FreeFem++ workshop A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ Atsushi Suzuki Atsushi.Suzuki@ann.jussieu.fr Joint work with François-Xavier

More information

arxiv: v1 [cs.na] 20 Jul 2015

arxiv: v1 [cs.na] 20 Jul 2015 AN EFFICIENT SOLVER FOR SPARSE LINEAR SYSTEMS BASED ON RANK-STRUCTURED CHOLESKY FACTORIZATION JEFFREY N. CHADWICK AND DAVID S. BINDEL arxiv:1507.05593v1 [cs.na] 20 Jul 2015 Abstract. Direct factorization

More information

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b AM 205: lecture 7 Last time: LU factorization Today s lecture: Cholesky factorization, timing, QR factorization Reminder: assignment 1 due at 5 PM on Friday September 22 LU Factorization LU factorization

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Dense LU factorization and its error analysis

Dense LU factorization and its error analysis Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks CHAPTER 2 Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks Chapter summary: The chapter describes techniques for rapidly performing algebraic operations on dense matrices

More information

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS

MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS MULTI-LAYER HIERARCHICAL STRUCTURES AND FACTORIZATIONS JIANLIN XIA Abstract. We propose multi-layer hierarchically semiseparable MHS structures for the fast factorizations of dense matrices arising from

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures

Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures Patrick R. Amestoy, Alfredo Buttari, Jean-Yves L Excellent, Théo Mary To cite this version: Patrick

More information

Factoring Matrices with a Tree-Structured Sparsity Pattern

Factoring Matrices with a Tree-Structured Sparsity Pattern TEL-AVIV UNIVERSITY RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES SCHOOL OF COMPUTER SCIENCE Factoring Matrices with a Tree-Structured Sparsity Pattern Thesis submitted in partial fulfillment of

More information

SCALABLE HYBRID SPARSE LINEAR SOLVERS

SCALABLE HYBRID SPARSE LINEAR SOLVERS The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi

More information

IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM

IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM IMPROVING THE PERFORMANCE OF SPARSE LU MATRIX FACTORIZATION USING A SUPERNODAL ALGORITHM Bogdan OANCEA PhD, Associate Professor, Artife University, Bucharest, Romania E-mail: oanceab@ie.ase.ro Abstract:

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix EE507 - Computational Techniques for EE 7. LU factorization Jitkomut Songsiri factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization

More information

A Column Pre-ordering Strategy for the Unsymmetric-Pattern Multifrontal Method

A Column Pre-ordering Strategy for the Unsymmetric-Pattern Multifrontal Method A Column Pre-ordering Strategy for the Unsymmetric-Pattern Multifrontal Method TIMOTHY A. DAVIS University of Florida A new method for sparse LU factorization is presented that combines a column pre-ordering

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Linear Analysis Lecture 16

Linear Analysis Lecture 16 Linear Analysis Lecture 16 The QR Factorization Recall the Gram-Schmidt orthogonalization process. Let V be an inner product space, and suppose a 1,..., a n V are linearly independent. Define q 1,...,

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Lecture 3: QR-Factorization

Lecture 3: QR-Factorization Lecture 3: QR-Factorization This lecture introduces the Gram Schmidt orthonormalization process and the associated QR-factorization of matrices It also outlines some applications of this factorization

More information

Communication avoiding parallel algorithms for dense matrix factorizations

Communication avoiding parallel algorithms for dense matrix factorizations Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

AM205: Assignment 2. i=1

AM205: Assignment 2. i=1 AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although

More information

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8

More information

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices DICEA DEPARTMENT OF CIVIL, ENVIRONMENTAL AND ARCHITECTURAL ENGINEERING PhD SCHOOL CIVIL AND ENVIRONMENTAL ENGINEERING SCIENCES XXX CYCLE A robust multilevel approximate inverse preconditioner for symmetric

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

Scaling and Pivoting in an Out-of-Core Sparse Direct Solver

Scaling and Pivoting in an Out-of-Core Sparse Direct Solver Scaling and Pivoting in an Out-of-Core Sparse Direct Solver JENNIFER A. SCOTT Rutherford Appleton Laboratory Out-of-core sparse direct solvers reduce the amount of main memory needed to factorize and solve

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 208 LECTURE 3 need for pivoting we saw that under proper circumstances, we can write A LU where 0 0 0 u u 2 u n l 2 0 0 0 u 22 u 2n L l 3 l 32, U 0 0 0 l n l

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Rank Revealing QR factorization. F. Guyomarc h, D. Mezher and B. Philippe

Rank Revealing QR factorization. F. Guyomarc h, D. Mezher and B. Philippe Rank Revealing QR factorization F. Guyomarc h, D. Mezher and B. Philippe 1 Outline Introduction Classical Algorithms Full matrices Sparse matrices Rank-Revealing QR Conclusion CSDA 2005, Cyprus 2 Situation

More information

LINEAR SYSTEMS (11) Intensive Computation

LINEAR SYSTEMS (11) Intensive Computation LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY

More information

Using Postordering and Static Symbolic Factorization for Parallel Sparse LU

Using Postordering and Static Symbolic Factorization for Parallel Sparse LU Using Postordering and Static Symbolic Factorization for Parallel Sparse LU Michel Cosnard LORIA - INRIA Lorraine Nancy, France Michel.Cosnard@loria.fr Laura Grigori LORIA - Univ. Henri Poincaré Nancy,

More information

An exploration of matrix equilibration

An exploration of matrix equilibration An exploration of matrix equilibration Paul Liu Abstract We review three algorithms that scale the innity-norm of each row and column in a matrix to. The rst algorithm applies to unsymmetric matrices,

More information

12. Cholesky factorization

12. Cholesky factorization L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric

More information

Towards parallel bipartite matching algorithms

Towards parallel bipartite matching algorithms Outline Towards parallel bipartite matching algorithms Bora Uçar CNRS and GRAAL, ENS Lyon, France Scheduling for large-scale systems, 13 15 May 2009, Knoxville Joint work with Patrick R. Amestoy (ENSEEIHT-IRIT,

More information

Computation of the mtx-vec product based on storage scheme on vector CPUs

Computation of the mtx-vec product based on storage scheme on vector CPUs BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on

More information

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and

More information

CSC 576: Linear System

CSC 576: Linear System CSC 576: Linear System Ji Liu Department of Computer Science, University of Rochester September 3, 206 Linear Equations Consider solving linear equations where A R m n and b R n m and n could be extremely

More information

Multifrontal QR factorization in a multiprocessor. environment. Abstract

Multifrontal QR factorization in a multiprocessor. environment. Abstract Multifrontal QR factorization in a multiprocessor environment P. Amestoy 1, I.S. Du and C. Puglisi Abstract We describe the design and implementation of a parallel QR decomposition algorithm for a large

More information

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for Cholesky Notes for 2016-02-17 2016-02-19 So far, we have focused on the LU factorization for general nonsymmetric matrices. There is an alternate factorization for the case where A is symmetric positive

More information

Block-tridiagonal matrices

Block-tridiagonal matrices Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages PIERS ONLINE, VOL. 6, NO. 7, 2010 679 An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages Haixin Liu and Dan Jiao School of Electrical

More information

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as

More information

Communication Avoiding LU Factorization using Complete Pivoting

Communication Avoiding LU Factorization using Complete Pivoting Communication Avoiding LU Factorization using Complete Pivoting Implementation and Analysis Avinash Bhardwaj Department of Industrial Engineering and Operations Research University of California, Berkeley

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

ECE133A Applied Numerical Computing Additional Lecture Notes

ECE133A Applied Numerical Computing Additional Lecture Notes Winter Quarter 2018 ECE133A Applied Numerical Computing Additional Lecture Notes L. Vandenberghe ii Contents 1 LU factorization 1 1.1 Definition................................. 1 1.2 Nonsingular sets

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

Orthogonalization and least squares methods

Orthogonalization and least squares methods Chapter 3 Orthogonalization and least squares methods 31 QR-factorization (QR-decomposition) 311 Householder transformation Definition 311 A complex m n-matrix R = [r ij is called an upper (lower) triangular

More information

Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems

Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Technical Report RAL-TR-2005-007 Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Iain S. Duff and Stéphane Pralet April 22, 2005

More information

Practical Linear Algebra: A Geometry Toolbox

Practical Linear Algebra: A Geometry Toolbox Practical Linear Algebra: A Geometry Toolbox Third edition Chapter 12: Gauss for Linear Systems Gerald Farin & Dianne Hansford CRC Press, Taylor & Francis Group, An A K Peters Book www.farinhansford.com/books/pla

More information

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

6. Iterative Methods for Linear Systems. The stepwise approach to the solution... 6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse

More information

July 18, Abstract. capture the unsymmetric structure of the matrices. We introduce a new algorithm which

July 18, Abstract. capture the unsymmetric structure of the matrices. We introduce a new algorithm which An unsymmetrized multifrontal LU factorization Λ Patrick R. Amestoy y and Chiara Puglisi z July 8, 000 Abstract A well-known approach to compute the LU factorization of a general unsymmetric matrix A is

More information