Sparse Linear Algebra: Direct Methods, advanced features
1 Sparse Linear Algebra: Direct Methods, advanced features
P. Amestoy and A. Buttari (INPT(ENSEEIHT)-IRIT), A. Guermouche (Univ. Bordeaux-LaBRI), J.-Y. L'Excellent and B. Uçar (INRIA-CNRS/LIP-ENS Lyon), F.-H. Rouet (Lawrence Berkeley National Laboratory), C. Weisbecker (INPT(ENSEEIHT)-IRIT)
2 Sparse Matrix Factorizations
A ∈ R^(m×m), symmetric positive definite: LL^T = A, to solve Ax = b
A ∈ R^(m×m), symmetric: LDL^T = A, to solve Ax = b
A ∈ R^(m×m), unsymmetric: LU = A, to solve Ax = b
A ∈ R^(m×n), m ≠ n: QR = A, to solve min_x ||Ax − b|| if m > n, or min ||x|| such that Ax = b if n > m
3 Cholesky on a dense matrix
(Example: a 4×4 symmetric matrix with lower triangle a11; a21 a22; a31 a32 a33; a41 a42 a43 a44.)
left-looking Cholesky:
for k = 1, ..., n do
  for i = k, ..., n do
    for j = 1, ..., k−1 do
      a_ik^(k) = a_ik^(k) − l_ij l_kj
    end for
  end for
  l_kk = sqrt(a_kk^(k−1))
  for i = k+1, ..., n do
    l_ik = a_ik^(k−1) / l_kk
  end for
end for
(Left-looking: a column is modified just before it is factored.)
right-looking Cholesky:
for k = 1, ..., n do
  l_kk = sqrt(a_kk^(k−1))
  for i = k+1, ..., n do
    l_ik = a_ik^(k−1) / l_kk
    for j = k+1, ..., i do
      a_ij^(k) = a_ij^(k) − l_ik l_jk
    end for
  end for
end for
(Right-looking: the trailing submatrix is modified as soon as column k is factored.)
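The left-looking loop above can be sketched in Python with NumPy (a minimal sketch; the function name and layout are ours, not from the slides):

```python
import numpy as np

def cholesky_left_looking(A):
    """Left-looking Cholesky: column k receives the updates from all
    previously factored columns j < k just before it is factored."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(n):
        # apply updates l_ij * l_kj from columns j < k to column k
        for j in range(k):
            A[k:, k] -= L[k, j] * L[k:, j]
        L[k, k] = np.sqrt(A[k, k])
        L[k + 1:, k] = A[k + 1:, k] / L[k, k]
    return L
```

The right-looking variant performs the same arithmetic but pushes the updates to the trailing submatrix immediately after each column is factored.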
4 Factorization of sparse matrices: problems
The factorization of a sparse matrix is problematic due to the presence of fill-in. The basic LU step is:
a_ij^(k+1) = a_ij^(k) − a_ik^(k) a_kj^(k) / a_kk^(k)
Even if a_ij^(k) is null, a_ij^(k+1) can be a nonzero. As a consequence:
- the factorization is more expensive than O(nz)
- a higher amount of memory is required than O(nz)
- more complicated algorithms are needed to achieve the factorization
(Figure: an example of matrix structures, from J. W. H. Liu.)
5 Factorization of sparse matrices: problems
Because of the fill-in, the matrix factorization is commonly preceded by an analysis phase where:
- the fill-in is reduced through matrix permutations
- the fill-in is predicted through the use of (e.g.) elimination graphs
- computations are structured using elimination trees
The analysis must complete much faster than the actual factorization.
6 Analysis
7 Fill-reducing ordering methods
Three main classes of methods for minimizing fill-in during factorization:
Local approaches: at each step of the factorization, select the pivot that is likely to minimize fill-in. The method is characterized by the way pivots are selected: Markowitz criterion (for a general matrix), minimum degree (for symmetric matrices).
Global approaches: the matrix is permuted so as to confine the fill-in within certain parts of the permuted matrix. Examples: Cuthill-McKee, reverse Cuthill-McKee, nested dissection.
Hybrid approaches: first permute the matrix globally to confine the fill-in, then reorder small parts using local heuristics.
8 The elimination process in the graphs
G(V, E): undirected graph of A
for k = 1 : n−1 do
  V ← V \ {k}  (remove vertex k)
  E ← (E \ {(k, l) : l ∈ adj(k)}) ∪ {(x, y) : x ∈ adj(k) and y ∈ adj(k)}
end for
The graphs G_k = (V, E) obtained this way are the so-called elimination graphs (Parter, 61).
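Parter's rule above (remove vertex k, connect its neighbours into a clique) can be sketched with plain Python sets; the function name and the edge representation are our own:

```python
def eliminate_vertex(V, E, k):
    """One elimination step: remove vertex k from (V, E) and add an
    edge between every pair of former neighbours of k.
    E is a set of frozenset({x, y}) edges."""
    nbrs = {x for e in E if k in e for x in e} - {k}
    E = {e for e in E if k not in e}                      # drop edges touching k
    E |= {frozenset((x, y)) for x in nbrs for y in nbrs if x != y}
    return V - {k}, E
```

Eliminating the middle vertex of the path 1-2-3, for example, creates the fill edge {1, 3}.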
9 The Elimination Tree
The elimination tree is a graph that expresses all the dependencies in the factorization in a compact way.
Elimination tree, basic idea: if i depends on j (i.e., l_ij ≠ 0, i > j) and k depends on i (i.e., l_ki ≠ 0, k > i), then k depends on j by transitivity; therefore, the direct dependency expressed by l_kj ≠ 0, if it exists, is redundant.
For each column j < n of L, remove all the nonzeros in column j except the first one below the diagonal. Let L_t denote the remaining structure and consider the matrix F_t = L_t + L_t^T. The graph G(F_t) is a tree called the elimination tree.
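The definition above can be read off directly from the filled structure: the parent of node j is the row index of the first subdiagonal nonzero of column j of L. A minimal sketch (the data layout and names are assumptions, not from the slides):

```python
def etree_from_filled(Lstruct, n):
    """Elimination tree from the filled pattern of L.
    Lstruct[j] = iterable of the nonzero row indices i > j in column j.
    Returns parent[j]; None marks a root."""
    parent = {}
    for j in range(n):
        below = sorted(i for i in Lstruct.get(j, ()) if i > j)
        parent[j] = below[0] if below else None
    return parent
```

In practice the tree is computed from the structure of A with a near-linear-time algorithm, without forming the filled graph first.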
10 Uses of the elimination tree
The elimination tree has several uses in the factorization of a sparse matrix:
- it expresses the order in which variables can be eliminated: the elimination of a variable only affects (directly or indirectly) its ancestors and only depends on its descendants. Therefore, any topological order of the elimination tree leads to a correct result and to the same fill-in
- it expresses concurrency: because variables in separate subtrees do not affect each other, they can be eliminated in parallel
11 Matrix factorization
12 Cholesky on a sparse matrix
The Cholesky factorization of a sparse matrix can be achieved with a left-looking, right-looking or multifrontal method.
Reference case: regular 3×3 grid ordered by nested dissection. Nodes in the separators are ordered last (see the section on orderings).
Notation:
cdiv(j): A(j, j) ← sqrt(A(j, j)) and then A(j+1 : n, j) ← A(j+1 : n, j) / A(j, j)
cmod(j, k): A(j : n, j) ← A(j : n, j) − A(j, k) · A(j : n, k)
struct(L(1:k, j)): the structure of the L(1:k, j) submatrix
13 Sparse left-looking Cholesky
for j = 1 to n do
  for k in struct(L(j, 1:j−1)) do
    cmod(j, k)
  end for
  cdiv(j)
end for
In the left-looking method, before variable j is eliminated, column j is updated with all the columns that have a nonzero in row j. In the example above, struct(L(7, 1:6)) = {1, 3, 4, 6}.
- this corresponds to receiving updates from nodes lower in the subtree rooted at j
- the filled graph is necessary to determine the structure of each row
14 Sparse right-looking Cholesky
for k = 1 to n do
  cdiv(k)
  for j in struct(L(k+1:n, k)) do
    cmod(j, k)
  end for
end for
In the right-looking method, after variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9, 3)) = {7, 8, 9}.
- this corresponds to sending updates to nodes higher in the elimination tree
- the filled graph is necessary to determine the structure of each column
15 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
16 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example: the front for variables {4, 6, 7} is assembled from the entries a44, a46, a47, a64, a74; eliminating variable 4 produces the factor entries l44, l64, l74 and the contribution block entries b66, b67, b76, b77.)
17 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example: the front for variables {5, 6, 9} is assembled from the entries a55, a56, a59, a65, a95; eliminating variable 5 produces the factor entries l55, l65, l95 and the contribution block entries c66, c69, c96, c99.)
18 The Multifrontal Method
Each node of the elimination tree is associated with a dense front. The tree is traversed bottom-up and at each node:
1. assembly of the front
2. partial factorization
(Example: the front for variables {6, 7, 8, 9} is assembled from the entries of A (a66, ...) together with the contribution blocks b and c produced by the two children; eliminating variable 6 produces the factor entries l66, l76, l86, l96 and the contribution block entries d77, d78, d79, d87, d88, d89, d97, d98, d99.)
19 The Multifrontal Method
A dense matrix, called frontal matrix, is associated with each node of the elimination tree. The multifrontal method consists in a bottom-up traversal of the tree where, at each node, two operations are done:
Assembly: nonzeros from the original matrix (an arrowhead) are assembled together with the contribution blocks from the children nodes into the frontal matrix.
Elimination: a partial factorization of the frontal matrix is done. The variables associated with the node of the tree (called fully assembled) can be eliminated. This step produces part of the final factors and a Schur complement (contribution block) that will be assembled into the frontal matrix of the father node.
20 The Multifrontal Method: example
21 Solve
22 Solve
Once the matrix is factorized, the problem can be solved against one or more right-hand sides:
AX = LL^T X = B,  with A, L ∈ R^(n×n), X, B ∈ R^(n×k)
The solution of this problem can be achieved in two steps:
forward substitution: LZ = B
backward substitution: L^T X = Z
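The two-step solve can be sketched with dense, column-oriented substitutions (a simplified dense sketch that ignores sparsity; the function name is ours):

```python
import numpy as np

def solve_cholesky(L, B):
    """Solve A X = B with A = L L^T:
    forward substitution L Z = B, then backward substitution L^T X = Z."""
    n = L.shape[0]
    Z = B.astype(float).copy()
    for j in range(n):                    # forward: eliminate column j
        Z[j] /= L[j, j]
        Z[j + 1:] -= np.outer(L[j + 1:, j], Z[j])
    X = Z
    for j in range(n - 1, -1, -1):        # backward: row j of L^T
        X[j] -= L[j + 1:, j] @ X[j + 1:]
        X[j] /= L[j, j]
    return X
```

With multiple right-hand sides (B with k columns) the same loops apply row-block-wise, which is why the solve phase is usually blocked in practice.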
23 Solve: left-looking
(Figure: forward substitution LZ = B and backward substitution L^T X = Z performed left-looking on the example matrix.)
24 Solve: right-looking
(Figure: forward substitution LZ = B and backward substitution L^T X = Z performed right-looking on the example matrix.)
25 Direct solvers: resume
The solution of a sparse linear system can be achieved in three phases that include at least the following operations:
Analysis: fill-reducing permutation, symbolic factorization, elimination tree computation
Factorization: the actual matrix factorization
Solve: forward substitution, backward substitution
These phases, especially the analysis, can include many other operations. Some of them are presented next.
26 Parallelism
27 Parallelization: two levels of parallelism
Tree parallelism: arising from sparsity, it is formalized by the fact that nodes in separate subtrees of the elimination tree can be eliminated at the same time.
Node parallelism: within each node, a parallel dense LU factorization (BLAS).
(Figure: going up the tree, tree parallelism decreases while node parallelism increases.)
28 Exploiting the second level of parallelism is crucial
Performance summary of the multifrontal factorization on matrix BCSSTK15, in MFlops (speed-up). In column (1), only parallelism from the tree is exploited. In column (2), the two levels of parallelism are combined.
Computer        (1)        (2)
Alliant FX/     … (1.9)    34 (4.3)
IBM 3090J/6VF   … (2.1)    227 (3.8)
CRAY            … (1.8)    404 (2.3)
CRAY Y-MP       … (2.3)    1119 (4.8)
29 Task mapping and scheduling
Objective: organize the work to achieve a goal like makespan (total execution time) minimization, memory minimization, or a mixture of the two:
Mapping: where to execute a task?
Scheduling: when and where to execute a task?
Several approaches are possible:
static: build the schedule before the execution and follow it at run-time. Advantage: very efficient, since it has a global view of the system. Drawback: requires a very good performance model for the platform.
dynamic: take scheduling decisions dynamically at run-time. Advantage: reactive to the evolution of the platform and easy to use on several platforms. Drawback: decisions are taken with local criteria (a decision which seems good at time t can have very bad consequences at time t+1).
hybrid: try to combine the advantages of static and dynamic.
30 Task mapping and scheduling The mapping/scheduling is commonly guided by a number of parameters: First of all, a mapping/scheduling which is good for concurrency is commonly not good in terms of memory consumption Especially in distributed-memory environments, data transfers have to be limited as much as possible
31 Dynamic scheduling on shared memory computers The data can be shared between processors without any communication, therefore a dynamic scheduling is relatively simple to implement and efficient: The dynamic scheduling can be done through a pool of ready tasks (a task can be, for example, seen as the assembly and factorization of a front) as soon as a task becomes ready (i.e., when all the tasks associated with the child nodes are done), it is queued in the pool processes pick up ready tasks from the pool and execute them. As soon as one process finishes a task, it goes back to the pool to pick up a new one and this continues until all the nodes of the tree are done the difficulty lies in choosing one task among all those in the pool because the order can influence both performance and memory consumption
32 Decomposition of the tree into levels
Determination of level L0 based on subtree cost.
(Figure: the node set {1, ..., 8} is recursively split by nested dissection; a horizontal layer L0 separates the subtree roots from the top of the tree.)
Scheduling of the top of the tree can be dynamic.
33 Simplifying the mapping/scheduling problem
The tree commonly contains many more nodes than the available processes.
Objective: find a layer L0 such that the subtrees rooted at L0 can be mapped onto the processors with a good balance.
Construction and mapping of the initial level L0:
Let L0 ← roots of the assembly tree
repeat
  Step A: find the node q in L0 whose subtree has the largest computational cost
  Step B: set L0 ← (L0 \ {q}) ∪ {children of q}
  Step C: greedily map the nodes of L0 onto the processors and estimate the load unbalance
until load unbalance < threshold
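The repeat/until loop above can be sketched as follows. This is a simplified model: `cost` holds precomputed subtree costs, the greedy mapping assigns the heaviest subtree to the least-loaded processor, and all names are our own assumptions:

```python
def build_level_L0(roots, children, cost, nproc, threshold=0.2):
    """Greedy construction of layer L0: repeatedly split the most
    expensive subtree until a greedy mapping of the L0 subtrees onto
    nproc processors is balanced enough."""
    L0 = set(roots)
    while True:
        # greedy mapping: heaviest subtree first, onto the least-loaded proc
        loads = [0.0] * nproc
        for q in sorted(L0, key=cost.get, reverse=True):
            loads[loads.index(min(loads))] += cost[q]
        avg = sum(loads) / nproc
        if avg and (max(loads) - avg) / avg < threshold:
            return L0
        q = max(L0, key=cost.get)          # split the heaviest subtree
        if not children.get(q):            # a leaf: cannot refine further
            return L0
        L0 = (L0 - {q}) | set(children[q])
```

On a root of cost 10 with children of cost 6 and 3, two processors, and a 50% threshold, the root is split once and L0 becomes its two children.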
34 Static scheduling on distributed memory computers
Main objective: reduce the volume of communication between processors, because data is transferred through the (slow) network.
Proportional mapping:
- initially assigns all processes to the root node
- performs a top-down traversal of the tree where the processes assigned to a node are subdivided among its children in a way that is proportional to their relative weight
(Figure: mapping of the tasks onto 5 processors.)
Good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.
35 Assumptions and Notations
Assumptions: we assume that each column of L / each node of the tree is assigned to a single processor. Each processor is in charge of computing cdiv(j) for the columns j that it owns.
Notation:
mycols(p): the set of columns owned by processor p
map(j): the processor owning column j (or task j)
procs(L(:,k)) = {map(j) : j ∈ struct(L(:,k))} (only the processors in procs(L(:,k)) require updates from column k; they correspond to ancestors of k in the tree)
father(j): the father of node j in the elimination tree
36 Computational strategies for parallel direct solvers
The parallel algorithm is characterized by:
- its computational dependency graph
- its communication graph
There are three classical approaches to distributed memory parallelism:
1. Fan-in: very similar to the left-looking approach and demand-driven: the data required are aggregated update columns computed by the sending processor.
2. Fan-out: very similar to the right-looking approach and data-driven: data is sent as soon as it is produced.
3. Multifrontal: the communication pattern follows a bottom-up traversal of the tree. Messages are contribution blocks and are sent to the processor mapped on the father node.
37 Fan-in variant (similar to left-looking)
for j = 1 to n do
  u = 0
  for all k in (struct(L(j, 1:j−1)) ∩ mycols(p)) do
    cmod(u, k)
  end for
  if map(j) ≠ p then
    send u to processor map(j)
  else
    incorporate u in column j
    receive all the updates on column j and incorporate them
    cdiv(j)
  end if
end for
38 Fan-in variant
(Figure: the example tree mapped onto processors P0, P1, P2, P3, with P4 at the root.)
44 Fan-in variant
If for all i ∈ children, map(i) = P0 and map(father) ≠ P0, (only) one message is sent by P0: the fan-in variant exploits the data locality of proportional mapping.
45 Fan-out variant (similar to right-looking)
for all leaf nodes j in mycols(p) do
  cdiv(j)
  send column L(:,j) to procs(L(:,j))
  mycols(p) = mycols(p) − {j}
end for
while mycols(p) ≠ ∅ do
  receive any column (say L(:,k))
  for j in struct(L(:,k)) ∩ mycols(p) do
    cmod(j, k)
    if column j is completely updated then
      cdiv(j)
      send column L(:,j) to procs(L(:,j))
      mycols(p) = mycols(p) − {j}
    end if
  end for
end while
46 Fan-out variant
(Figure: the example tree mapped onto processors P0, P1, P2, P3, with P4 at the root.)
47 Fan-out variant
If for all i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.
51 Fan-out variant
Properties of fan-out:
- historically the first implemented
- incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of total number of messages and total volume
- does not exploit the data locality of proportional mapping
Improved algorithm (local aggregation): send aggregated update columns instead of individual factor columns for columns mapped on a single processor. This improves the exploitation of the data locality of proportional mapping, but the memory increase needed to store the aggregates can be critical (as in fan-in).
52 Multifrontal
for all leaf nodes j in mycols(p) do
  assemble front j
  partially factorize front j
  send the Schur complement to procs(father(j))
  mycols(p) = mycols(p) − {j}
end for
while mycols(p) ≠ ∅ do
  receive any contribution block (say, for node j)
  assemble the contribution block into front j
  if front j is completely assembled then
    partially factorize front j
    send the Schur complement to procs(father(j))
    mycols(p) = mycols(p) − {j}
  end if
end while
53 Multifrontal variant
(Figure: communication patterns of the fan-in, fan-out and multifrontal variants on the same tree, mapped onto processors P0, P1 and P2.)
58 Importance of the shape of the tree
Suppose that each node in the tree corresponds to a task that:
- consumes temporary data from the children
- produces temporary data that is passed to the parent node
Wide tree: good parallelism, but many temporary blocks to store and large memory usage.
Deep tree: less parallelism, but smaller memory usage.
59 Impact of fill reduction on the shape of the tree
AMD: deep, well-balanced tree; large frontal matrices on top
AMF: very deep, unbalanced tree; small frontal matrices
60 Impact of fill reduction on the shape of the tree
PORD: deep, unbalanced tree; small frontal matrices
SCOTCH: very wide, well-balanced tree; large frontal matrices
METIS: wide, well-balanced tree; smaller frontal matrices (than SCOTCH)
61 Memory consumption in the multifrontal method
62 Equivalent reorderings
A tree can be regarded as a dependency graph where each node produces data that is then processed by its parent. By this definition, a tree has to be traversed in topological order, but there are still many topological traversals of the same tree.
Definition: two orderings P and Q are equivalent if the structures of the filled graphs of PAP^T and QAQ^T are the same (that is, they are isomorphic). Equivalent orderings result in the same amount of fill-in and computation during factorization.
To ease the notation, we discuss only one ordering with respect to A, i.e., P is an equivalent ordering of A if the filled graphs of A and PAP^T are isomorphic.
63 Equivalent reorderings
All topological orderings of T(A) are equivalent:
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then G+(PAP^T) and G+(A) are isomorphic.
- Let P be the permutation matrix corresponding to a topological ordering of T(A). Then the elimination trees T(PAP^T) and T(A) are isomorphic.
Because the fill-in won't change, we have the freedom to choose any specific topological order that provides other properties.
64 Tree traversal orders
Which specific topological order to choose? A postorder: why? Because the data produced by the nodes is consumed by their parents in LIFO order. In the multifrontal method, we can thus use a stack memory where contribution blocks are pushed as soon as they are produced by the elimination on a frontal matrix, and popped at the moment the father node is assembled. This provides better data locality and makes the memory management easier.
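The stack mechanism can be illustrated with a toy simulation. This is a simplified model that ignores the size of the active frontal matrix and only tracks contribution blocks; the names are ours:

```python
def stack_peak(tree, cb, order):
    """Simulate the multifrontal CB stack for a topological order.
    tree[i] = list of children of i; cb[i] = size of i's contribution
    block (0 for a root). Returns the peak stack size."""
    stack_size = peak = 0
    for node in order:
        # assembly: pop the children's contribution blocks
        for c in tree.get(node, ()):
            stack_size -= cb[c]
        # elimination: push this node's contribution block
        stack_size += cb[node]
        peak = max(peak, stack_size)
    return peak
```

For a parent with two children of CB sizes 2 and 3, both children's blocks coexist on the stack just before the parent is assembled, so the peak is 5.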
65 The multifrontal method [Duff & Reid 83]
(Figure: a matrix A, its factors L+U−I, and the associated elimination tree.)
Storage is divided into two parts:
- Factors
- Active memory = active frontal matrix + stack of contribution blocks
79 Example 1: Processing a wide tree
(Figure: evolution over time of the unused, stack, factor and non-free memory space while processing a wide tree.)
80 Example 2: Processing a deep tree
(Figure: evolution of the memory across the allocation, assembly, factorization and stack steps, distinguishing unused, factor, stack and non-free memory space.)
81 Postorder traversals: memory
A postorder provides good data locality and better memory consumption than a general topological order, since a father node is assembled as soon as its children have been processed. But there are still many postorders of the same tree. Which one to choose? The one that minimizes memory consumption.
(Figure: a tree with root i and leaves a, b, d, f, h; the traversal (abcdefghi) is the best, (hfdbacegi) the worst.)
82 Problem model
M_i: memory peak for the complete subtree rooted at i
temp_i: temporary memory produced by node i
m_parent: memory for storing the parent
M_parent = max( max_{j=1..nbchildren} ( M_j + Σ_{k=1}^{j−1} temp_k ), m_parent + Σ_{j=1}^{nbchildren} temp_j )   (1)
Objective: order the children to minimize M_parent.
84 Memory-minimizing schedules
Theorem [Liu, 86]: the minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i.
Corollary: an optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i − temp_i.
Interpretation: at each level of the tree, a child with a relatively large peak of memory in its subtree (M_i large with respect to temp_i) should be processed first.
Apply on the complete tree starting from the leaves (or from the root with a recursive approach).
85 Optimal tree reordering
Objective: minimize the peak of stack memory
Tree_Reorder(T):
  for all i in the set of root nodes do
    Process_Node(i)
  end for
Process_Node(i):
  if i is a leaf then
    M_i = m_i
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    Reorder the children of i in decreasing order of (M_j − temp_j)
    Compute M_parent at node i using Formula (1)
  end if
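Liu's ordering and Formula (1) can be sketched directly. Children are given as (M_j, temp_j) pairs and the function names are ours:

```python
def order_children(children_stats):
    """Liu's ordering: sorting the (M_j, temp_j) pairs in decreasing
    order of M_j - temp_j minimizes the parent's stack peak."""
    return sorted(children_stats, key=lambda s: s[0] - s[1], reverse=True)

def parent_peak(children, m_parent):
    """Formula (1): peak over the children processed in sequence, then
    the parent assembled on top of all the contribution blocks."""
    stacked = peak = 0.0
    for M, temp in children:
        peak = max(peak, stacked + M)   # child j runs over the stacked CBs
        stacked += temp                 # its CB stays on the stack
    return max(peak, m_parent + stacked)
```

With children (10, 2) and (8, 7) and m_parent = 1, the optimal order (10, 2) first gives a peak of 10, while the reverse order gives 17.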
86 Memory consumption in parallel
In parallel, multiple branches have to be traversed at the same time in order to feed all the processes. This means that a higher number of CBs will have to be stored in memory.
Metric: memory efficiency e(p) = S_seq / (p · S_max(p))
We would like e(p) ≈ 1, i.e., S_seq/p on each processor. Common mappings/schedulings provide a poor memory efficiency. Proportional mapping: lim_{p→∞} e(p) = 0 on regular problems.
87 Proportional mapping Proportional mapping: top-down traversal of the tree, where processors assigned to a node are distributed among its children proportionally to the weight of their respective subtrees.
90 Proportional mapping
Proportional mapping: top-down traversal of the tree, where the processors assigned to a node are distributed among its children proportionally to the weight of their respective subtrees.
(Figure: 64 processors at the root are split among three subtrees of relative weight 40%, 10% and 50%.)
It targets run time but has poor memory efficiency. Usually a relaxed version is used: more memory-friendly, but with unreliable estimates.
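The top-down splitting can be sketched as follows. This is a simplified version: the process share is rounded, the last child takes the remainder (so very small subtrees may end up with an empty share), and all names are our own assumptions:

```python
def proportional_mapping(tree, weight, node, procs, mapping=None):
    """Top-down proportional mapping: the processes given to a node are
    split among its children proportionally to their subtree weights.
    procs is a list of process ids; weight[i] = subtree cost of i."""
    if mapping is None:
        mapping = {}
    mapping[node] = list(procs)
    kids = tree.get(node, [])
    if kids:
        total = sum(weight[c] for c in kids)
        start = 0
        for j, c in enumerate(kids):
            share = max(1, round(len(procs) * weight[c] / total))
            # the last child takes whatever processes remain
            end = len(procs) if j == len(kids) - 1 else min(start + share, len(procs))
            proportional_mapping(tree, weight, c, procs[start:end], mapping)
            start = end
    return mapping
```

On a root with children of weight 3 and 1 and four processes, the heavy child gets three processes and the light child one, which is exactly the localization property the slides describe.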
91 Proportional mapping
Proportional mapping: assuming that the sequential peak is 5 GB and that the three subtrees have peaks of 4 GB, 1 GB and 5 GB,
S_max(p) ≈ max{ 4 GB / 26, 1 GB / 6, 5 GB / 32 } ≈ 0.16 GB,
well above the ideal S_seq / p, so the memory efficiency e(p) is well below 1.
92 A more memory-friendly strategy...
All-to-all mapping: postorder traversal of the tree, where all the processors work at every node. Optimal memory scalability (S_max(p) = S_seq/p), but no tree parallelism and a prohibitive amount of communication.
93 Memory-aware mappings
Memory-aware mapping (Agullo 08): aims at enforcing a given memory constraint (M0, the maximum memory per processor):
1. Try to apply proportional mapping.
2. Is there enough memory for each subtree?
3. If not, serialize the subtrees and update M0: the processors stack equal shares of the CBs from the previous nodes.
This ensures the given memory constraint and provides reliable estimates, but it tends to assign many processors to the nodes at the top of the tree, which leads to performance issues on parallel nodes.
97 Memory-aware mappings
subroutine ma_mapping(r, p, M, S)
  ! r    : root of the subtree
  ! p(r) : number of processes mapped on r
  ! M(r) : memory constraint on r
  ! S(:) : peak memory consumption for all nodes
  call prop_mapping(r, p, M)
  forall (i = children of r)
    ! check if the constraint is respected
    if (S(i)/p(i) > M(i)) then
      ! reject PM and revert to tree serialization
      call tree_serialization(r, p, M)
      exit ! from forall block
    end if
  end forall
  forall (i = children of r)
    ! apply MA mapping recursively to all children
    call ma_mapping(i, p, M, S)
  end forall
end subroutine ma_mapping
98 Memory-aware mappings
subroutine prop_mapping(r, p, M)
  forall (i = children of r)
    p(i) = share of the p(r) processes from PM
    M(i) = M(r)
  end forall
end subroutine prop_mapping

subroutine tree_serialization(r, p, M)
  stack_siblings = 0
  forall (i = children of r) ! in appropriate order
    p(i) = p(r)
    ! update the memory constraint for the subtree
    M(i) = M(r) - stack_siblings
    stack_siblings = stack_siblings + cb(i)/p(r)
  end forall
end subroutine tree_serialization
99 Memory-aware mappings
A finer memory-aware mapping? Serializing all the children at once is very constraining: more tree parallelism can be found. Instead, find groups of children on which proportional mapping works, and serialize only these groups. Heuristic: follow a given order (e.g., the serial postorder) and form groups as large as possible.
100 Experiments
Matrix: finite-difference model of acoustic wave propagation, 27-point stencil grid; Seiscope consortium. N = 7 M, nnz = 189 M, factors = 152 GB (METIS). Sequential peak of active memory: 46 GB.
Machine: 64 nodes with two quad-core Xeon X5560 per node. We use 256 MPI processes.
Perfect memory scalability: 46 GB/256 = 180 MB.
101 Experiments
(Table: maximum and average stack peak (MB) and factorization time (s) for proportional mapping and for the memory-aware mapping with M0 = 225 MB and M0 = 380 MB, with and without groups.)
- proportional mapping delivers the fastest factorization because it targets parallelism but, at the same time, it leads to poor memory scaling, with e(256) = 0.09
- the memory-aware mapping, with and without groups, respects the imposed memory constraint, with a speed that depends on how tight the constraint is
- grouping clearly provides a benefit in terms of speed because it delivers more concurrency while still respecting the constraint
102 Low-Rank approximations
103 Low-rank matrices
Take a dense matrix B of size n×n and compute its SVD B = XSY:
B = X1 S1 Y1 + X2 S2 Y2, with S1(k, k) = σ_k > ε and S2(1, 1) = σ_{k+1} ≤ ε
If B' = X1 S1 Y1, then ||B − B'||_2 = ||X2 S2 Y2||_2 = σ_{k+1} ≤ ε
If the singular values of B decay very fast (e.g., exponentially), then k ≪ n even for very small ε: memory and CPU consumption can be reduced considerably, with a controlled loss of accuracy (≤ ε), if B' is used instead of B.
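The truncation can be reproduced with NumPy's SVD. This is a sketch; the test matrix with exponentially decaying singular values is our own construction:

```python
import numpy as np

def truncate(B, eps):
    """Keep the singular values larger than eps, so that the 2-norm of
    the truncation error equals sigma_{k+1} <= eps."""
    X, s, Yt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > eps))
    return X[:, :k], s[:k], Yt[:k, :]

# a matrix with exponentially decaying singular values
n = 50
U, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)
B = (U * sigma) @ U.T
X1, s1, Y1 = truncate(B, 1e-8)   # the kept rank k is much smaller than n
```

Storing X1, s1 and Y1 costs O(nk) instead of O(n^2), which is the source of the memory and CPU savings mentioned above.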
107 Multifrontal BLR LU factorization
Frontal matrices are not low-rank, but in many applications they exhibit low-rank blocks.
(Figure: a front partitioned into blocks L^i_{1,1}, U^i_{1,1}, ..., L^i_{d_i,d_i}, U^i_{d_i,d_i} and a CB part.)
Because each row and column of a frontal matrix is associated with a point of the discretization mesh, blocking the frontal matrix amounts to grouping the corresponding points of the domain. The objective is thus to group (cluster) the points in such a way that the number of blocks with the low-rank property is maximized and their respective ranks are minimized.
108 Clustering of points/variables
A block represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away from each other, the interaction is weak and the rank is low: the rank of the block decreases as the distance between σ and τ grows.
113 Clustering of points/variables
A block represents the interaction between two subdomains σ and τ. If they have a small diameter and are far away from each other, the interaction is weak and the rank is low.
Guidelines for computing the grouping of points:
- for all σ, c_min ≤ |σ| ≤ c_max: we want many compressible blocks, but not too small, to maintain a good compression rate and a good efficiency of operations
- for all σ, minimize diam(σ): for a given cluster size, this gives well-shaped clusters and consequently maximizes mutual distances
114 Clustering
For example, if the points associated with the rows/columns of a frontal matrix lie on a square surface, it is much better to cluster in a checkerboard fashion (right) rather than into slices (left). In practice, and in an algebraic context (where the discretization mesh is not known), this can be achieved by extracting the subgraph associated with the front variables and by computing a k-way partitioning (with a tool like METIS or SCOTCH, for example).
115 Clustering Here is an example of variable clustering achieved with the k-way partitioning method of the SCOTCH package. Although irregular, the partitioning provides relatively well-shaped clusters.
116 Admissibility condition Once the clustering of variables is defined, it is possible to compress all the blocks F(σ, τ) that satisfy a (strong) admissibility condition:

Adm_s(σ, τ): max(diam(σ), diam(τ)) ≤ η dist(σ, τ)

For simplicity, we can use a relaxed version called the weak admissibility condition:

Adm_w(σ, τ): dist(σ, τ) > 0.

The far field of a cluster σ is defined as F(σ) = {τ : Adm(σ, τ) is true}. In practice it is easier to consider all blocks (except the diagonal ones) eligible, try to compress them, and possibly revert to the non-compressed form if the rank is too high.
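To make the admissibility test concrete, here is a minimal sketch in Python (all function names and the point data are illustrative, not part of any solver API): it evaluates Adm_s and Adm_w for clusters given as lists of coordinate tuples.

```python
# Minimal sketch of the admissibility tests (illustrative names and data):
# clusters are lists of coordinate tuples; diam and dist are the geometric
# diameter and inter-cluster distance used by Adm_s and Adm_w.
import math

def diam(cluster):
    """Largest distance between any two points of the cluster."""
    return max((math.dist(p, q) for p in cluster for q in cluster), default=0.0)

def dist(sigma, tau):
    """Smallest distance between a point of sigma and a point of tau."""
    return min(math.dist(p, q) for p in sigma for q in tau)

def admissible_strong(sigma, tau, eta=1.0):
    """Adm_s: max(diam(sigma), diam(tau)) <= eta * dist(sigma, tau)."""
    return max(diam(sigma), diam(tau)) <= eta * dist(sigma, tau)

def admissible_weak(sigma, tau):
    """Adm_w: the clusters are merely required to be disjoint."""
    return dist(sigma, tau) > 0.0

# Two small, well-separated clusters on a line: strongly admissible.
sigma = [(0.0,), (1.0,)]
tau = [(10.0,), (11.0,)]
print(admissible_strong(sigma, tau))  # True: diam 1 <= dist 9
```

Two adjacent clusters would still be weakly admissible (positive distance) but fail the strong test, which is why the relaxed condition accepts many more blocks.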
117 Dense BLR LU factorization The factorization proceeds panel by panel through FACTOR, SOLVE and UPDATE steps; off-diagonal blocks are compressed in low-rank form right before updating the trailing submatrix. As a result, updating a block A_ij of size b using blocks L_ik, U_kj of rank r only takes O(br^2) operations instead of O(b^3). At stage k:

FACTOR: A_kk = L_kk U_kk
SOLVE: U_kj = L_kk^-1 A_kj, L_ik = A_ik U_kk^-1
UPDATE: A_ij = A_ij - L_ik U_kj
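The flop saving claimed for the UPDATE step can be checked on a toy example. The sketch below (pure Python, illustrative data and names) computes a block update both from the full b × b factors and from their low-rank forms X Y^T, grouping the low-rank product around the small r × r core:

```python
# Illustrative sketch (toy sizes, made-up data): the BLR UPDATE
# A_ij <- A_ij - L_ik U_kj when both factors are stored in low-rank form
# L_ik = XL YL^T and U_kj = XU YU^T. Grouping the product around the small
# r x r core YL^T XU keeps every intermediate of size b x r: O(b r^2)
# work instead of the O(b^3) full-rank product.
def matmul(A, B):
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

b_sz, r = 4, 1  # block size b, rank r (r << b is where the gain comes from)
XL = [[1.0], [2.0], [3.0], [4.0]]; YL = [[1.0], [0.0], [1.0], [0.0]]
XU = [[2.0], [0.0], [1.0], [1.0]]; YU = [[1.0], [1.0], [0.0], [2.0]]

# Full-rank route: two b x b products, O(b^3) work.
full = matmul(matmul(XL, transpose(YL)), matmul(XU, transpose(YU)))

# Low-rank route: contract the inner r-dimensional core first.
core = matmul(transpose(YL), XU)               # r x r
low = matmul(matmul(XL, core), transpose(YU))  # still the b x b update

print(full == low)  # True: same update, far fewer operations when r << b
```

The two routes produce the same b × b update; only the operation count changes, which is the whole point of keeping the factors in low-rank form until they are needed.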
129 Dense BLR LU factorization In a variant with an explicit COMPRESS step, the off-diagonal blocks of the current panel are compressed right after the SOLVE, so that the trailing submatrix is updated with the low-rank forms. At stage k:

FACTOR: A_kk = L_kk U_kk
SOLVE: U_kj = L_kk^-1 A_kj, L_ik = A_ik U_kk^-1
COMPRESS: L_ik = X_ik Y_ik^T, U_kj = X_kj Y_kj^T
UPDATE: A_ij = A_ij - L_ik U_kj

As before, updating a block A_ij of size b using blocks of rank r takes O(br^2) operations instead of O(b^3).
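The COMPRESS step itself can be sketched with a simple tolerance-based Gram-Schmidt sweep; a real solver would rather use a truncated rank-revealing QR or an SVD, so the function below is purely illustrative:

```python
# Illustrative COMPRESS step: build a low-rank form B ~ X Y^T with a
# tolerance-based Gram-Schmidt sweep over the columns of the block.
# All names and the 4x4 test block below are made up for the example.
def compress(B, tol=1e-12):
    m = len(B)
    basis, Y_rows = [], []            # orthonormal columns of X, rows of Y
    for col in zip(*B):               # sweep the columns of B
        c = list(col)
        y = []
        for q in basis:               # orthogonalize against current basis
            d = sum(qi * ci for qi, ci in zip(q, c))
            y.append(d)
            c = [ci - d * qi for ci, qi in zip(c, q)]
        nrm = sum(ci * ci for ci in c) ** 0.5
        if nrm > tol:                 # new direction found: grow the rank
            basis.append([ci / nrm for ci in c])
            y.append(nrm)
        Y_rows.append(y)
    r = len(basis)
    Y = [row + [0.0] * (r - len(row)) for row in Y_rows]
    X = [[basis[k][i] for k in range(r)] for i in range(m)]  # m x r
    return X, Y, r

u = [1.0, 2.0, 3.0, 4.0]; w = [1.0, 0.0, 2.0, 1.0]
B = [[ui * wj for wj in w] for ui in u]  # an exactly rank-1 block
X, Y, r = compress(B)
print(r)  # 1: the numerical rank is detected
```

The tolerance plays the role of the BLR accuracy ε: loosening it lowers the retained ranks, trading accuracy for compression.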
144 Experimental MF complexity Setting: 1. Poisson: N × N × N grid with a 7-point stencil; ∆u = f on the domain Ω, with u = 1 on the boundary ∂Ω. 2. Helmholtz: N × N × N grid with a 27-point stencil, where ω is the angular frequency, v(x) is the seismic velocity field, and u(x, ω) is the time-harmonic wavefield solution to the forcing term s(x, ω):

(−∆ − ω² / v(x)²) u(x, ω) = s(x, ω)
145 Experimental MF complexity: entries in factors [Figure: number of entries in the factors as a function of the problem size N, for Poisson (left) and Helmholtz (right); the Full Rank factors grow as O(n^1.3), while the BLR fits grow as n^1.04 log(n), n^1.20 log(n) and n^1.07 log(n).] ε only plays a role in the constant factor; for Poisson, a factor 3 gain is obtained with almost no loss of accuracy.
146 Experimental MF complexity: operations [Figure: flop counts as a function of the problem size N, for Poisson (left) and Helmholtz (right); the Full Rank factorization grows as O(n^2), while the BLR variants follow noticeably lower fits.] ε only plays a role in the constant factor; for Poisson, a factor 9 gain is obtained with almost no loss of accuracy.
147 Experiments: Poisson Poisson equation, 7-point stencil. [Figure: percentage of flops and of time with respect to Full Rank, and componentwise scaled residual (with and without iterative refinement), as a function of the BLR accuracy ε ranging from 1e-14 to 1e-2.] Compression improves with higher ε values. The time reduction is not as good as the flop reduction because of the lower efficiency of the operations. The componentwise scaled residual is in good accordance with ε. Iterative refinement can be used to recover the lost accuracy. For very high values of ε, the factorization can be used as a preconditioner.
148 Experiments: Helmholtz Helmholtz equation, 27-point stencil. [Figure: percentage of flops and of time with respect to Full Rank, and componentwise scaled residual (with and without iterative refinement), as a function of the BLR accuracy ε ranging from 1e-14 to 1e-2.] Compression improves with higher ε values. The time reduction is not as good as the flop reduction because of the lower efficiency of the operations. The componentwise scaled residual is in good accordance with ε. Iterative refinement can be used to recover the lost accuracy. For very high values of ε, the factorization can be used as a preconditioner.
149 Sparse QR Factorization
150 Overdetermined Problems In the case Ax = b with A ∈ R^{m×n}, m > n, the problem is overdetermined and the existence of a solution is not guaranteed (a solution exists only if b ∈ R(A)). In this case, the desired solution may be the one that minimizes the norm of the residual (least squares problem):

min_x ||Ax − b||_2

[Figure: b, its projection Ax onto Span{A}, and the residual r.] The figure shows, intuitively, that the residual has minimum norm when it is orthogonal to R(A), i.e., when Ax is the projection of b onto R(A).
151 Least Squares Problems: normal equations

min_x ||Ax − b||_2 ⇔ min_x ||Ax − b||_2^2, with ||Ax − b||_2^2 = (Ax − b)^T (Ax − b) = x^T A^T Ax − 2x^T A^T b + b^T b

Setting the derivative with respect to x to zero gives x = (A^T A)^{-1} A^T b. This is the solution of the n × n fully determined problem

A^T Ax = A^T b

called the normal equations system. A^T A is symmetric, positive definite if rank(A) = n (nonnegative definite otherwise), so the system can be solved with a Cholesky factorization. The approach is not necessarily backward stable because κ(A^T A) = κ(A)^2. Note that A^T(Ax − b) = A^T r = 0 means that the residual is orthogonal to R(A), which delivers the desired solution.
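A tiny worked instance of the normal equations (the data is illustrative): fit a line to three points by forming A^T A x = A^T b and solving the resulting 2 × 2 system with Cramer's rule.

```python
# Tiny worked example of the normal equations (illustrative data):
# solve min_x ||Ax - b||_2 for a full-rank 3x2 matrix A by forming the
# 2x2 system A^T A x = A^T b and solving it with Cramer's rule.
def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # fit a line c0 + c1*t at t = 0, 1, 2
b = [1.0, 2.0, 2.0]

At = transpose(A)
AtA = [[sum(ai * aj for ai, aj in zip(ci, cj)) for cj in At] for ci in At]
Atb = [sum(ai * bi for ai, bi in zip(ci, b)) for ci in At]

det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]  # SPD since rank(A) = 2
x = [(Atb[0] * AtA[1][1] - AtA[0][1] * Atb[1]) / det,
     (AtA[0][0] * Atb[1] - Atb[0] * AtA[1][0]) / det]

# The residual is orthogonal to R(A): A^T r = 0 (up to rounding).
r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
print(max(abs(v) for v in matvec(At, r)) < 1e-12)  # True
```

The orthogonality check A^T r ≈ 0 is exactly the characterization of the least-squares solution stated above.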
152 Least Squares Problems: the QR factorization The QR factorization of a matrix A ∈ R^{m×n} consists in finding an orthogonal matrix Q ∈ R^{m×m} and an upper triangular matrix R ∈ R^{n×n} such that

A = Q [R; 0]

Assuming Q = [Q_1 Q_2], where Q_1 is made up of the first n columns of Q and Q_2 of the remaining m − n, and writing

Q^T b = [Q_1^T b; Q_2^T b] = [c; d]

we have ||Ax − b||_2^2 = ||Q^T Ax − Q^T b||_2^2 = ||Rx − c||_2^2 + ||d||_2^2. In the case of a full-rank matrix A, this quantity is minimized if Rx = c.
153 Least Squares Problems: the QR factorization Assume Q = [Q_1 Q_2], where Q_1 contains the first n and Q_2 the last m − n columns of Q. Then Q_1 is an orthogonal basis for R(A) and Q_2 is an orthogonal basis for N(A^T); P_1 = Q_1 Q_1^T and P_2 = Q_2 Q_2^T are projectors onto R(A) and N(A^T). Thus, Ax = P_1 b and r = b − Ax = P_2 b. [Figure: b, its projection Ax onto Span{A}, and the residual r.] Properties of the solution by QR factorization: it is backward stable; it requires 3n^2(m − n/3) flops (much more expensive than Cholesky).
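The same toy least-squares problem can be solved with Householder QR, as a sketch of the backward-stable route (illustrative, dense, small-matrix code; `householder_qr_solve` is not a library routine):

```python
# Sketch of a dense Householder QR least-squares solve (illustrative).
import math

def householder_qr_solve(A, b):
    """Return x minimizing ||Ax - b||_2 via Householder reflections."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]  # work on copies
    b = b[:]
    for k in range(n):
        # Reflector annihilating column k below the diagonal:
        # v = x + sign(x_1) ||x||_2 e_1 on the trailing part of the column.
        x = [A[i][k] for i in range(k, m)]
        alpha = math.sqrt(sum(xi * xi for xi in x))
        v = x[:]
        v[0] += alpha if x[0] >= 0 else -alpha
        vtv = sum(vi * vi for vi in v)
        if vtv == 0.0:
            continue
        # Apply H = I - 2 v v^T / (v^T v) to the trailing block and to b.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * A[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                A[k + i][j] -= s * v[i]
        s = 2.0 * sum(v[i] * b[k + i] for i in range(m - k)) / vtv
        for i in range(m - k):
            b[k + i] -= s * v[i]
    # A now holds R in its top n rows and b holds [c; d]: solve R x = c.
    xs = [0.0] * n
    for i in range(n - 1, -1, -1):
        xs[i] = (b[i] - sum(A[i][j] * xs[j] for j in range(i + 1, n))) / A[i][i]
    return xs

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
print(householder_qr_solve(A, b))  # same minimizer as the normal equations
```

Note that Q is never formed: the reflections are applied directly to A and b, which is also how the flop count above is achieved in practice.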
154 Underdetermined Problems In the case Ax = b with A ∈ R^{m×n}, m < n, the problem is underdetermined and admits infinitely many solutions; if x_p is any solution of Ax = b, then the complete set of solutions is

{x : Ax = b} = {x_p + z : z ∈ N(A)}

In this case, the desired solution may be the one with minimum 2-norm (minimum 2-norm solution problem): minimize ||x||_2 subject to Ax = b.
155 Minimum 2-norm solution Problems The minimum 2-norm solution is x_mn = A^T (AA^T)^{-1} b. In fact, take any other solution x; then A(x − x_mn) = 0 and

(x − x_mn)^T x_mn = (x − x_mn)^T A^T (AA^T)^{-1} b = (A(x − x_mn))^T (AA^T)^{-1} b = 0

which means (x − x_mn) ⊥ x_mn. Thus

||x||_2^2 = ||x_mn + x − x_mn||_2^2 = ||x_mn||_2^2 + ||x − x_mn||_2^2

i.e., x_mn has the smallest 2-norm of any solution. As for the normal equations system, this approach may incur instability due to κ(AA^T).
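A one-equation example (illustrative data) showing x_mn = A^T (A A^T)^{-1} b and that it beats any other solution in 2-norm:

```python
# Minimum 2-norm solution x_mn = A^T (A A^T)^{-1} b for an illustrative
# underdetermined system with one equation and two unknowns.
A = [[1.0, 1.0]]  # the single equation x1 + x2 = 2
b = [2.0]

AAt = sum(a * a for a in A[0])  # A A^T is 1x1 here, so inverting is a division
y = b[0] / AAt                  # y = (A A^T)^{-1} b
x_mn = [a * y for a in A[0]]    # x_mn = A^T y

print(x_mn)  # [1.0, 1.0]

# Any other solution, e.g. [2, 0], has a larger 2-norm.
norm = lambda v: sum(vi * vi for vi in v) ** 0.5
other = [2.0, 0.0]
print(norm(x_mn) < norm(other))  # True
```

Here [2, 0] − x_mn = [1, −1] lies in N(A), illustrating that any other solution differs from x_mn by a null-space component that can only increase the norm.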
156 Minimum 2-norm solution Problems: the QR factorization Assume

QR = [Q_1 Q_2] [R_1; 0] = A^T

is the QR factorization of A^T, where Q_1 is made up of the first m columns of Q and Q_2 of the last n − m ones. Then, writing z = Q^T x,

Ax = R^T Q^T x = [R_1^T 0] [z_1; z_2] = b

so that R_1^T z_1 = b, and the minimum 2-norm solution follows by setting z_2 = 0. Note that Q_2 is an orthogonal basis for N(A); thus, the minimum 2-norm solution is achieved by removing from any solution x_p all its components in N(A).
157 Householder QR A matrix H of the form

H = I − 2 (v v^T)/(v^T v)

is called a Householder reflection and v a Householder vector. The transformation P_v = (v v^T)/(v^T v) is a projection onto the space spanned by v. Householder reflections can be used to annihilate all the coefficients of a vector x except one, which corresponds to reflecting x onto one of the original axes: if v = x ± ||x||_2 e_1, then Hx has all coefficients equal to zero except the first.
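The annihilation property is easy to verify numerically; the sketch below (illustrative) builds v = x + sign(x_1)||x||_2 e_1 and applies H through products with v, without ever forming the matrix:

```python
# Numerical check of the annihilation property (illustrative): with
# v = x + sign(x_1) ||x||_2 e_1, the reflection Hx = x - 2 v (v^T x)/(v^T v)
# zeroes every entry of x except the first. H is never formed explicitly.
import math

def reflect(x):
    alpha = math.sqrt(sum(xi * xi for xi in x))
    v = x[:]
    v[0] += alpha if x[0] >= 0 else -alpha  # sign choice avoids cancellation
    vtv = sum(vi * vi for vi in v)
    vtx = sum(vi * xi for vi, xi in zip(v, x))
    return [xi - 2.0 * (vtx / vtv) * vi for xi, vi in zip(x, v)]

hx = reflect([3.0, 4.0])
print(hx)  # [-5.0, 0.0]: the first entry becomes -||x||_2, the rest vanish
```

Applying H only through dot products with v is exactly the practical recipe described on the next slide.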
159 Householder QR Note that, in practice, the matrix H is never explicitly formed; instead, it is applied through products with the vector v. Also note that if x is sparse, the vector v (and thus the matrix H) follows its structure. Imagine a sparse vector x and a vector x̃ obtained by removing all the zeros from x:

x = [x_1; 0; x_2], x̃ = [x_1; x_2]

If H̃ is the Householder reflector that annihilates all the elements of x̃ but the first, with

ṽ = [v_1; v_2], H̃ = [H̃_11 H̃_12; H̃_21 H̃_22]

then the equivalent transformation H for x is

v = [v_1; 0; v_2], H = [H̃_11 0 H̃_12; 0 I 0; H̃_21 0 H̃_22]
More informationContents. Preface... xi. Introduction...
Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism
More informationLecture 3: QR-Factorization
Lecture 3: QR-Factorization This lecture introduces the Gram Schmidt orthonormalization process and the associated QR-factorization of matrices It also outlines some applications of this factorization
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationAM205: Assignment 2. i=1
AM05: Assignment Question 1 [10 points] (a) [4 points] For p 1, the p-norm for a vector x R n is defined as: ( n ) 1/p x p x i p ( ) i=1 This definition is in fact meaningful for p < 1 as well, although
More informationBLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product
Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8
More informationA robust multilevel approximate inverse preconditioner for symmetric positive definite matrices
DICEA DEPARTMENT OF CIVIL, ENVIRONMENTAL AND ARCHITECTURAL ENGINEERING PhD SCHOOL CIVIL AND ENVIRONMENTAL ENGINEERING SCIENCES XXX CYCLE A robust multilevel approximate inverse preconditioner for symmetric
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationScaling and Pivoting in an Out-of-Core Sparse Direct Solver
Scaling and Pivoting in an Out-of-Core Sparse Direct Solver JENNIFER A. SCOTT Rutherford Appleton Laboratory Out-of-core sparse direct solvers reduce the amount of main memory needed to factorize and solve
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13
STAT 309: MATHEMATICAL COMPUTATIONS I FALL 208 LECTURE 3 need for pivoting we saw that under proper circumstances, we can write A LU where 0 0 0 u u 2 u n l 2 0 0 0 u 22 u 2n L l 3 l 32, U 0 0 0 l n l
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More informationRank Revealing QR factorization. F. Guyomarc h, D. Mezher and B. Philippe
Rank Revealing QR factorization F. Guyomarc h, D. Mezher and B. Philippe 1 Outline Introduction Classical Algorithms Full matrices Sparse matrices Rank-Revealing QR Conclusion CSDA 2005, Cyprus 2 Situation
More informationLINEAR SYSTEMS (11) Intensive Computation
LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY
More informationUsing Postordering and Static Symbolic Factorization for Parallel Sparse LU
Using Postordering and Static Symbolic Factorization for Parallel Sparse LU Michel Cosnard LORIA - INRIA Lorraine Nancy, France Michel.Cosnard@loria.fr Laura Grigori LORIA - Univ. Henri Poincaré Nancy,
More informationAn exploration of matrix equilibration
An exploration of matrix equilibration Paul Liu Abstract We review three algorithms that scale the innity-norm of each row and column in a matrix to. The rst algorithm applies to unsymmetric matrices,
More information12. Cholesky factorization
L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric
More informationTowards parallel bipartite matching algorithms
Outline Towards parallel bipartite matching algorithms Bora Uçar CNRS and GRAAL, ENS Lyon, France Scheduling for large-scale systems, 13 15 May 2009, Knoxville Joint work with Patrick R. Amestoy (ENSEEIHT-IRIT,
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationCSC 576: Linear System
CSC 576: Linear System Ji Liu Department of Computer Science, University of Rochester September 3, 206 Linear Equations Consider solving linear equations where A R m n and b R n m and n could be extremely
More informationMultifrontal QR factorization in a multiprocessor. environment. Abstract
Multifrontal QR factorization in a multiprocessor environment P. Amestoy 1, I.S. Du and C. Puglisi Abstract We describe the design and implementation of a parallel QR decomposition algorithm for a large
More informationBindel, Spring 2016 Numerical Analysis (CS 4220) Notes for
Cholesky Notes for 2016-02-17 2016-02-19 So far, we have focused on the LU factorization for general nonsymmetric matrices. There is an alternate factorization for the case where A is symmetric positive
More informationBlock-tridiagonal matrices
Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute
More informationScientific Computing: An Introductory Survey
Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction
More informationMatrix Assembly in FEA
Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,
More informationAn H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages
PIERS ONLINE, VOL. 6, NO. 7, 2010 679 An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages Haixin Liu and Dan Jiao School of Electrical
More informationSparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices
Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as
More informationCommunication Avoiding LU Factorization using Complete Pivoting
Communication Avoiding LU Factorization using Complete Pivoting Implementation and Analysis Avinash Bhardwaj Department of Industrial Engineering and Operations Research University of California, Berkeley
More informationB553 Lecture 5: Matrix Algebra Review
B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations
More informationECE133A Applied Numerical Computing Additional Lecture Notes
Winter Quarter 2018 ECE133A Applied Numerical Computing Additional Lecture Notes L. Vandenberghe ii Contents 1 LU factorization 1 1.1 Definition................................. 1 1.2 Nonsingular sets
More informationDirect and Incomplete Cholesky Factorizations with Static Supernodes
Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)
More informationOrthogonalization and least squares methods
Chapter 3 Orthogonalization and least squares methods 31 QR-factorization (QR-decomposition) 311 Householder transformation Definition 311 A complex m n-matrix R = [r ij is called an upper (lower) triangular
More informationTowards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems
Technical Report RAL-TR-2005-007 Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Iain S. Duff and Stéphane Pralet April 22, 2005
More informationPractical Linear Algebra: A Geometry Toolbox
Practical Linear Algebra: A Geometry Toolbox Third edition Chapter 12: Gauss for Linear Systems Gerald Farin & Dianne Hansford CRC Press, Taylor & Francis Group, An A K Peters Book www.farinhansford.com/books/pla
More information6. Iterative Methods for Linear Systems. The stepwise approach to the solution...
6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse
More informationJuly 18, Abstract. capture the unsymmetric structure of the matrices. We introduce a new algorithm which
An unsymmetrized multifrontal LU factorization Λ Patrick R. Amestoy y and Chiara Puglisi z July 8, 000 Abstract A well-known approach to compute the LU factorization of a general unsymmetric matrix A is
More information