Factoring Matrices with a Tree-Structured Sparsity Pattern


TEL-AVIV UNIVERSITY
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
SCHOOL OF COMPUTER SCIENCE

Factoring Matrices with a Tree-Structured Sparsity Pattern

Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree of Tel-Aviv University by Alex Druinsky

The research work for this thesis has been carried out at Tel-Aviv University under the direction of Prof. Sivan Toledo

June, 2007


Abstract

Let A be a matrix whose sparsity pattern is a tree with maximal degree d_max. We show that if the columns of A are ordered using minimum degree on A + A^T, then factoring A using a sparse LU with partial pivoting algorithm generates only O(d_max n) fill, requires only O(d_max n) operations, and is much more stable than LU with partial pivoting on a general matrix. We also propose an even more efficient and just-as-stable algorithm called sibling-dominant pivoting. This algorithm is a strict partial pivoting algorithm that modifies the column preordering locally to minimize fill and work. It leads to only O(n) work and fill. More conventional column pre-ordering methods that are based (usually implicitly) on the sparsity pattern of A^T A are not as efficient as the approaches that we propose in this thesis. An algorithm that computes a symmetric indefinite factorization PAP^T = LBL^T, where B is block-diagonal with diagonal blocks of order 1 or 2, is also proposed. The algorithm generates O(d_max n) fill and requires O(d_max^2 n) operations, which is efficient when d_max is small, but is not as good as the LU algorithms.

Contents

Abstract
Acknowledgments
Chapter 1. Introduction
Chapter 2. Background
  2.1. LU Factorization
  2.2. Sparse LU
  2.3. Numerical Stability of the LU Factorization
Chapter 3. Notation
Chapter 4. LU with Partial Pivoting Using Minimum Degree on G
Chapter 5. Sibling-Dominant Pivoting
Chapter 6. Experimental Results
Chapter 7. Symmetric Indefinite Factorization
Chapter 8. Conclusions
Bibliography

Acknowledgments

This thesis is based on a paper that was written by Prof. Sivan Toledo and myself.

CHAPTER 1

Introduction

This thesis explores the behavior of sparse LU factorizations of matrices whose sparsity pattern is a tree. We begin with an analysis of the fill, work, and stability of the factorization when the columns of the matrix are pre-ordered using a minimum-degree algorithm. This analysis constitutes Chapter 4. The chapter that follows, Chapter 5, describes an algorithm that is even more efficient. This algorithm uses strict partial pivoting and a coarse preordering of the columns, but the final column ordering is determined dynamically by inspecting the numerical values in the reduced matrix. This algorithm performs work that is only linear in the order of the matrix and generates only a linear amount of fill. Chapter 6 shows that our new algorithm is indeed highly efficient, even more so than elimination based on a minimum-degree ordering. The experimental results also show that both algorithms are more efficient than a conservative column preordering algorithm. Another subject of the thesis is the symmetric indefinite factorization of symmetric matrices whose sparsity pattern is a tree. Such factorizations are attractive because they preserve symmetry, but they require careful design to guarantee numerical stability. Chapter 7 presents an efficient algorithm for this type of factorization. Background facts and the notation for the thesis are described in Chapters 2 and 3, and our conclusions from this research in Chapter 8.

CHAPTER 2

Background

This chapter describes the basic ideas that form the background of this thesis. Among the topics are: the LU factorization, sparse LU, and the numerical properties of the LU factorization. This chapter does not present original results.

2.1. LU Factorization

The classical method for solving a system of linear equations is Gaussian elimination. Gaussian elimination solves a square system Ax = b of order n in two conceptual steps: (1) Reduce Ax = b to the triangular form Ux = c; (2) Solve Ux = c by substitution. The reduction to triangular form proceeds in n - 1 steps of elimination. The rule for step k is to eliminate variable x_k from each of the equations k + 1, k + 2, ..., n by subtracting a multiple of equation k. Assuming that this process is carried out in exact arithmetic and that division by zero does not take place, the resulting system is upper-triangular with the same set of solutions as the original system Ax = b.

A close relative of Gaussian elimination is the LU factorization. The LU factorization of a matrix A is a pair of matrices, unit lower-triangular L and upper-triangular U, that satisfy the identity A = LU. One algorithm that computes the LU factorization is the right-looking LU factorization. The computation in right-looking LU proceeds along the same lines as the reduction to triangular form in Gaussian elimination. The memory of right-looking LU holds a pair of n-by-n matrices, L and U. The matrix L is initialized with the identity matrix and U is initialized with the matrix A. The algorithm transforms the two matrices into the LU factorization of A within n - 1 steps. Let L^(k) and U^(k) be the values of L and U after k steps.

The transformation in step k satisfies the identities

    L^(k) = L^(k-1) L_k^{-1},    U^(k) = L_k U^(k-1),

where L_k is an n-by-n matrix. This means that the identity A = L^(k) U^(k) holds throughout the process. The matrix L_k is equal to the identity matrix in all of its elements except for the subdiagonal elements in column k, which are -l_{k+1,k}, -l_{k+2,k}, ..., -l_{n,k} with

    l_{i,k} = U^(k-1)_{i,k} / U^(k-1)_{k,k}.

This definition of L_k guarantees that the first k columns of U^(k) have only zeros below the diagonal. The inverse of L_k is obtained by flipping the signs of the subdiagonal elements in its column k, so that L_k^{-1} carries the entries l_{k+1,k}, l_{k+2,k}, ..., l_{n,k} below the diagonal of column k.

It follows that L^(k) is unit lower-triangular: its first k columns carry the multipliers l_{i,j} below the diagonal (l_{2,1}, ..., l_{n,1} in column 1, and so on, up to l_{k+1,k}, ..., l_{n,k} in column k), and its remaining columns are those of the identity matrix.

The transformation of U in step k is implemented as follows. Partition U according to the formula

    U = [ U_11  u_12  U_13 ;  0  u_22  u_23 ;  0  u_32  U_33 ],

where U_11 has order k - 1, u_22 is a scalar and U_33 has order n - k. The rule for step k replaces u_32 with zeros and replaces U_33 with U_33 - (1/u_22) u_32 u_23. The transformation of L in step k is implemented according to the rule

    L_{i,k} <- U_{i,k} / U_{k,k},    i = k + 1, k + 2, ..., n.

The diagonal block of U^(k) that lies at the intersection of the last n - k rows and columns is called the reduced matrix, and also has the shorter name A^(k). Figure 2.1 shows a Matlab implementation of right-looking LU.

Another algorithm that computes the LU factorization is the left-looking LU factorization. Left-looking LU computes the LU factorization without maintaining an explicit representation of the reduced matrix. This feature is usually considered an advantage of left-looking LU over right-looking LU. The memory of left-looking LU holds a pair of matrices: an n-by-n matrix L and a matrix with n rows, U. The matrix L is initialized with the identity matrix and U is initialized with a matrix with zero columns. The algorithm transforms the two matrices into the LU factorization of A within n steps. Let L^(k) and U^(k) be the values of L and U after k steps.

    function [L, U] = right_looking_lu(A)
        n = size(A, 1);
        % Overwrite A with the matrix L - I + U.
        for k = 1:n - 1
            A(k + 1:n, k) = A(k + 1:n, k) / A(k, k);
            A(k + 1:n, k + 1:n) = A(k + 1:n, k + 1:n) ...
                - A(k + 1:n, k) * A(k, k + 1:n);
        end % for k
        % L <- I + lower triangular part of A
        % U <- upper triangular part of A
        L = eye(n) + tril(A, -1);
        U = triu(A);
    end % function right_looking_lu

Figure 2.1. Right-looking LU factorization

Let u^(k) be the solution of the system L^(k-1) u^(k) = a_k, where a_k is column k of A. The transformation in step k satisfies the identities

    L^(k) = L^(k-1) L_k^{-1},    U^(k) = L_k [ U^(k-1)  u^(k) ].

The matrix L_k in these identities has the same form as its namesake from right-looking LU, with subdiagonal elements in column k defined according to the formula

    l_{i,k} = u^(k)_i / u^(k)_k.

This definition guarantees that the identity A_k = L^(k) U^(k) is invariant throughout the process (the matrix A_k consists of the first k columns of A). Additionally, column j of U^(k) has only zeros below element j (j = 1, 2, ..., k) and L^(k) has the same form as its namesake from right-looking LU. The transformation in step k is implemented as follows:
(1) Solve Lu = a_k by substitution;
(2) Augment U with the column ũ. The column ũ is obtained from u by replacing the elements below element k with zeros;

(3) Assign L_{i,k} <- u_i / u_k, i = k + 1, k + 2, ..., n.
Figure 2.2 shows a Matlab implementation of left-looking LU.

    function [L, U] = left_looking_lu(A)
        n = size(A, 1);
        L = eye(n);
        U = zeros(n);
        for k = 1:n
            % solve L * u = A(:, k) by substitution
            u = L \ A(:, k);
            U(1:k, k) = u(1:k);
            L(k + 1:n, k) = u(k + 1:n) / u(k);
        end % for k
    end % function left_looking_lu

Figure 2.2. Left-looking LU factorization

Algorithms that compute the LU factorization are often enhanced with a pivoting mechanism. Such enhanced algorithms compute a factorization called the LU factorization with pivoting, which consists of a reordering of the rows and columns of A and of the LU factorization of the reordered matrix. The formula for this factorization is PAQ = LU, where P and Q are permutation matrices that effect the reordering of rows and columns, and L and U are the LU factorization of the reordered matrix. The reordering is produced by adjusting the initial ordering as the computation progresses, and the adjustments are made according to a pre-determined rule, called the pivoting strategy. Algorithms that incorporate pivoting are more suitable for computation in floating-point arithmetic and are better adapted to the factorization of sparse matrices.

Right-looking LU can be modified to incorporate pivoting. The memory of the modified algorithm holds a pair of n-by-n permutation matrices, P and Q, in addition to the matrices L and U. The matrices P and Q are initialized with the identity matrix. The algorithm transforms the four matrices into the LU factorization with pivoting of A within n - 1 steps. Let P^(k), Q^(k), L^(k) and U^(k) be the values of the four matrices after k steps.

The transformation in step k of the modified algorithm satisfies the identities

    L^(k) = P_k L^(k-1) P_k^{-1} L_k^{-1},
    U^(k) = L_k P_k U^(k-1) Q_k,
    P^(k) = P_k P^(k-1),
    Q^(k) = Q^(k-1) Q_k.

The matrix L_k in these identities has the same form as its namesake from the algorithm without pivoting, with subdiagonal elements in column k defined by the formula

    l_{i,k} = (P_k U^(k-1) Q_k)_{i,k} / (P_k U^(k-1) Q_k)_{k,k}.

The matrices P_k and Q_k in the identities have the form

    P_k = [ I  0 ;  0  P_k^{22} ],    Q_k = [ I  0 ;  0  Q_k^{22} ],

where I is the identity matrix of order k - 1, the matrix P_k^{22} is obtained from the identity matrix of order n - k + 1 by interchanging the first row with some row, and Q_k^{22} is obtained from the identity matrix of order n - k + 1 by interchanging the first column with some column. The row and column that are interchanged are chosen according to the pivoting strategy. These definitions guarantee that the identity P^(k) A Q^(k) = L^(k) U^(k) is invariant throughout the process. These definitions also guarantee that (1) L^(k) is unit lower-triangular and equal to the identity matrix in all but the first k of its columns; (2) the first k columns of U^(k) have only zeros below the diagonal. In other words, L^(k) and U^(k) have the form

    L^(k) = [ L^(k)_{11}  0 ;  L^(k)_{21}  I ],    U^(k) = [ U^(k)_{11}  U^(k)_{12} ;  0  U^(k)_{22} ],

where L^(k)_{11} is unit lower-triangular of order k, the matrix I is the identity matrix of order n - k, the matrix U^(k)_{11} is upper-triangular of order k, and U^(k)_{22} has order n - k.

That L^(k) and U^(k) do indeed have this form can be seen by considering the more detailed identities for the transformation in step k:

    L^(k) = [ L^(k-1)_{11}  0 ;  P_k^{22} L^(k-1)_{21}  I ] L_k^{-1},
    U^(k) = L_k [ U^(k-1)_{11}  U^(k-1)_{12} Q_k^{22} ;  0  P_k^{22} U^(k-1)_{22} Q_k^{22} ].

The transformation in step k is implemented according to these two identities as follows:
(1) Interchange two rows in each of P, U and the first k - 1 columns of L in accordance with the pivoting strategy;
(2) Interchange two columns in each of Q and U in accordance with the pivoting strategy;
(3) Carry out the rule for step k of right-looking LU without pivoting.
Figure 2.3 shows a Matlab pseudo-code for right-looking LU with pivoting.

Left-looking LU can also be modified to incorporate pivoting. The algorithm is typically used in conjunction with pivoting strategies that reorder rows but not columns. The memory of the modified algorithm holds an n-by-n permutation matrix P, in addition to L and U. The matrix P is initialized with the identity matrix. The algorithm transforms P, L and U into the LU factorization with pivoting of A within n steps. Let P^(k), L^(k) and U^(k) be the values of the three matrices after k steps, and let u^(k) be the solution of the system L^(k-1) u^(k) = P^(k-1) a_k. The transformation in step k satisfies the identities

    L^(k) = P_k L^(k-1) P_k^{-1} L_k^{-1},
    U^(k) = L_k [ P_k U^(k-1)  u^(k) ],
    P^(k) = P_k P^(k-1).

The matrix L_k in these identities has the same form as its namesake from left-looking LU without pivoting, with subdiagonal elements in column k defined by the formula

    l_{i,k} = (P_k u^(k))_i / (P_k u^(k))_k.

    function [P, Q, L, U] = right_looking_lu_pivoting(A)
        n = size(A, 1);
        P = eye(n);
        Q = eye(n);
        % Overwrite A with the matrix L - I + U
        for k = 1:n - 1
            r = a row index chosen from k, k + 1, ..., n according to the pivoting strategy
            c = a column index, chosen likewise
            % Interchange rows in P and A
            P([k r], :) = P([r k], :);
            A([k r], :) = A([r k], :);
            % Interchange columns in Q and A
            Q(:, [k c]) = Q(:, [c k]);
            A(:, [k c]) = A(:, [c k]);
            A(k + 1:n, k) = A(k + 1:n, k) / A(k, k);
            A(k + 1:n, k + 1:n) = A(k + 1:n, k + 1:n) ...
                - A(k + 1:n, k) * A(k, k + 1:n);
        end % for k
        % L <- I + lower triangular part of A
        % U <- upper triangular part of A
        L = eye(n) + tril(A, -1);
        U = triu(A);
    end % function right_looking_lu_pivoting

Figure 2.3. Right-looking LU factorization with pivoting

The matrix P_k has the same form as its namesake from right-looking LU with pivoting. These definitions guarantee that the identity P^(k) A_k = L^(k) U^(k) is invariant throughout the process. These definitions also guarantee that (1) L^(k) is unit lower-triangular and equal to the identity matrix in all but the first k of its columns; (2) column j of U^(k) has only zeros below element j (j = 1, 2, ..., k). The transformation in step k is implemented as follows:

(1) Solve Lu = P a_k by substitution;
(2) Interchange two rows in each of P, u and the first k - 1 columns of L in accordance with the pivoting strategy;
(3) Augment U with the column ũ. The column ũ is obtained from u by replacing the elements below element k with zeros;
(4) Assign L_{i,k} = u_i / u_k, i = k + 1, k + 2, ..., n.
Figure 2.4 shows a Matlab pseudo-code for left-looking LU with pivoting.

    function [P, L, U] = left_looking_lu_pivoting(A)
        n = size(A, 1);
        P = eye(n);
        L = eye(n);
        U = zeros(n);
        for k = 1:n
            % Solve L * u = P * A(:, k) by substitution
            u = L \ (P * A(:, k));
            r = a row index chosen from k, k + 1, ..., n according to the pivoting strategy
            % Interchange rows in P, L and u
            P([k r], :) = P([r k], :);
            L([k r], 1:k - 1) = L([r k], 1:k - 1);
            u([k r]) = u([r k]);
            U(1:k, k) = u(1:k);
            L(k + 1:n, k) = u(k + 1:n) / u(k);
        end % for k
    end % function left_looking_lu_pivoting

Figure 2.4. Left-looking LU factorization with pivoting

Solving a linear system by means of the LU factorization amounts to the solution of two triangular systems. The following routine solves Ax = b by means of the LU factorization with pivoting:
(1) Compute the LU factorization PAQ = LU;
(2) Solve Ly = Pb by substitution;
(3) Solve Uz = y by substitution;
(4) Multiply x = Qz.
One benefit of using the LU factorization is that it can be re-used to solve more systems with the same matrix and different right-hand-side vectors.
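The four-step routine above translates directly into Matlab. The following minimal usage sketch is ours (it assumes the right-looking routine of Figure 2.3 and illustrative variable names):

    [P, Q, L, U] = right_looking_lu_pivoting(A);
    y = L \ (P * b);   % step (2): forward substitution
    z = U \ y;         % step (3): back substitution
    x = Q * z;         % step (4): undo the column reordering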

An abstract way of thinking about the computation of the LU factorization is that this computation amounts to the evaluation of the expressions

    u_{ij} = (PAQ)_{ij} - sum_{k=1}^{i-1} l_{ik} u_{kj},               (i <= j)        (2.1)
    l_{ij} = ( (PAQ)_{ij} - sum_{k=1}^{j-1} l_{ik} u_{kj} ) / u_{jj}.  (i > j)

Consider Equation 2.1 in detail. This equation specifies one expression for each element of L and U. Each expression for an element of U prescribes a sequence of multiply-add operations, and each expression for an element of L prescribes a sequence of multiply-add operations followed by a division. The order of multiply-add operations in each expression is not restricted, and the relative order of operations from different expressions is not unique. This means that the expressions can be correctly evaluated by any one of many different sequences of arithmetic operations. The left-looking and right-looking algorithms in this section provide two different examples of such sequences.

2.2. Sparse LU

The LU factorization of sparse matrices can often benefit from the exploitation of sparsity. Sparsity can be exploited in two ways. Firstly, sparsity presents opportunities to economize memory space. To this end, sparse LU algorithms use special data structures that store only the nonzero elements of sparse matrices. Secondly, the factorization of a sparse matrix often requires much less work than that of a general matrix. It is often the case that of all the arithmetic operations that are required to compute the factorization of a general matrix, only a few are sufficient to compute the factorization of a sparse matrix. Sparse LU algorithms exploit this by carrying out only those arithmetic operations that are necessary to compute the factorization of the matrix at hand. The exploitation of sparsity often allows sparse LU algorithms to solve problems of a dramatically larger scale than the problems that can be solved by general-purpose LU implementations.

One sparse LU algorithm is called gplu.

The algorithm was first presented in the paper [10], and a more recent description is included in the book [5]. The main feature of gplu is that the total amount of work it carries out to compute a factorization is proportional to the number of arithmetic operations on nonzeros in that factorization. This allows us to asymptotically bound the total amount of work in the factorization (work that includes arithmetic but also pointer manipulation and so on) by simply bounding the number of arithmetic operations.

The basic ingredient of gplu is an efficient algorithm for solving sparse triangular systems. The discussion in this section is restricted to unit lower-triangular n-by-n systems Lx = b. An element of the solution, x_i, is said to be a symbolic nonzero if either b_i is nonzero or both L_{ij} is nonzero and x_j is a symbolic nonzero for some j in 1, 2, ..., i - 1. Let X be the set of symbolic-nonzero indices of x, and B the set of nonzero indices of b. Let G(L^T) be the directed graph on n vertices such that there is an edge from j to i if L_{ij} is a subdiagonal nonzero of L. Let Reach(B) be the set of vertices of G(L^T) that lie along paths that originate in the set B. Consider the column-wise substitution algorithm in routine col_subst of Figure 2.5. Since symbolic zeros of x are also numeric zeros, it is safe to replace the for loop in this routine with an iteration over the elements of X. Furthermore, the iteration can execute in any topological order of the subgraph of G(L^T) that is induced by X. Additionally, because the sets Reach(B) and X coincide, a DFS scan from the set B in G(L^T) can compute both the set X and a topological order. These observations result in the routine sparse_col_subst of Figure 2.5.

gplu is based on two data structures: the compressed column storage (CCS) scheme for sparse matrices, and the sparse accumulator (SPA) representation of sparse vectors. The CCS representation of an n-by-n matrix A consists of three vectors: i, x and p. The vector i lists the row indices of the nonzeros of A and x lists their numerical values. The two vectors hold the nonzeros of A in ascending order of column, each nonzero in two elements in matching positions. The vector p holds n + 1 indices into i and x. The elements of p are such that the nonzeros in column j of A are held in positions p(j), p(j) + 1, ..., p(j + 1) - 1 of i and x. The SPA representation of a sparse vector v of length n consists of three vectors: i, x and occupied.

    % Solve Lx = b by column-wise substitution
    function x = col_subst(L, b)
        n = length(b);
        x = b;
        for j = 1:n - 1
            x(j + 1:n) = x(j + 1:n) - L(j + 1:n, j) * x(j);
        end % for j
    end % function col_subst

    % Solve sparse Lx = b by column-wise substitution
    function x = sparse_col_subst(L, b)
        Use a DFS scan from the set B in G(L^T) to compute the set X in topological order.
        x = b;
        for j = X(1), X(2), ..., X(end)
            x(j + 1:n) = x(j + 1:n) - L(j + 1:n, j) * x(j);
        end % for j
    end % function sparse_col_subst

Figure 2.5. Solving sparse lower-triangular systems by column-wise substitution

The vector i lists the row indices of the nonzeros of v, x is the standard dense representation of v by n real numbers, and occupied is a vector of n bits that indicate which elements of v are nonzero.

gplu is an implementation of left-looking LU. The input of the algorithm is the matrix A in CCS format, and the output are the matrices P, L and U, also in CCS format. The internal representation of P, L and U is not CCS. The matrix L is internally represented in indirect-indexing CCS format: the vector i in the representation of L does not list the row indices of the nonzeros, but rather indices into a lookup vector, which lists the actual row indices. Additionally, after k steps, the representation of L holds only the first k of its columns. The matrix U is also represented in indirect-indexing CCS format. The matrix P is represented by a vector, pinv, which is defined by the identity

    pinv = P^T [1, 2, ..., n]^T.
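As a concrete illustration of the CCS scheme described above (this small example is ours, not from the thesis), consider the 3-by-3 matrix with nonzeros A(1,1) = 4, A(3,1) = 2, A(2,2) = 5, A(1,3) = 1 and A(3,3) = 6. Its CCS vectors, using 1-based indices, are:

    i = [1 3 2 1 3];   % row indices, column by column
    x = [4 2 5 1 6];   % numerical values in matching positions
    p = [1 3 4 6];     % column j occupies positions p(j), ..., p(j + 1) - 1
    % For example, column 1 occupies positions p(1) = 1 through p(2) - 1 = 2:
    % rows i(1:2) = [1 3] with values x(1:2) = [4 2].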

The system Lu = P a_k is solved in each step by means of the sparse column-wise substitution algorithm. The solution, u, is held in indirect-indexing SPA format. The vector i in the representation of u does not store the nonzero indices of u, but rather indices into a lookup vector. The vectors x and occupied hold the nonzeros of u in the positions indicated by i, not in their actual positions. The lookup vector in the representations of L, U and u is the vector pinv. A convenient property of this choice is that the CCS representation of A is, in fact, a representation of the matrix PA in indirect-indexing CCS format with lookup vector pinv. The indirect-indexing scheme is necessary because it makes it possible to perform row interchanges in P, L, u and A simply by swapping two elements in pinv.

There is another detail in the implementation of gplu that should be emphasized. The re-initialization of the SPA in each step is done by resetting those elements of x and occupied that are listed in i to their original state and then emptying i. It is important not to reset the other elements of x and occupied, because that would cost O(n^2) work for the whole factorization, which would usually overwhelm the cost of the other aspects of the algorithm.

2.3. Numerical Stability of the LU Factorization

The subject of this section is the influence of floating-point arithmetic on algorithms that solve linear systems by means of the LU factorization. The standard error analysis is given here only in summary form. An exhaustive reference on this subject is the book [11]. One way to analyze the stability of numerical algorithms is through the idea of backward stability. The analysis of solving Ax = b by means of the LU factorization proceeds according to the following outline. The introduction of errors in floating-point computation usually means that an algorithm that computes the solution to Ax = b will obtain the solution x̂ of a perturbed system, (A + δA) x̂ = b, and not the solution of the original system. The relative backward error, ||δA|| / ||A||, measures the size of the perturbation. The scalar κ(A) = ||A|| ||A^{-1}||, called the condition number of the system Ax = b, measures the sensitivity of the solution x to perturbations in A and b. The relative error of the computed solution, ||x̂ - x|| / ||x||, usually has the same order of magnitude as the product of the relative backward error and the condition number of the system.

This analysis splits the error in the computed solution into two factors: the backward error, which is contributed by the algorithm, and the condition number, which reflects the inherent difficulty of the problem. This analysis suggests that algorithms with a small backward error are as accurate as possible. Algorithms that guarantee a backward error as small as machine precision are said to be backward stable.

When is an algorithm backward stable? The backward error is known to satisfy

    ||δA|| / ||A|| <= γ_{3n} ||L|| ||U|| / ||A||,

where γ_{3n} = 3n ε_M / (1 - 3n ε_M), and ε_M is the machine epsilon. This means that an algorithm that makes sure that ||L|| ||U|| / ||A|| is small has a small backward error. The partial pivoting strategy is one method for making sure that this is the case. Partial pivoting is a row-interchange strategy that chooses the interchange in step k so as to place the largest element in absolute value of column k on the diagonal. When partial pivoting is used, the growth factor is defined by

    ρ = max_{i,j} |U_{i,j}| / max_{i,j} |A_{i,j}|.

Because the growth factor and ||L|| ||U|| / ||A|| have the same order of magnitude, making sure that the growth factor is small is another way to guarantee a solution with a small backward error. Partial pivoting is known in practice to have a small growth factor (though not in theory).
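The quantities in this section are easy to measure numerically. The following sketch is ours (not part of the thesis); it uses Matlab's dense lu, which performs partial pivoting, to estimate the growth factor and the relative backward error on a random matrix:

    n = 200;
    A = randn(n);
    [L, U, P] = lu(A);                           % LU with partial pivoting: P*A = L*U
    rho  = max(abs(U(:))) / max(abs(A(:)));      % growth factor
    berr = norm(P * A - L * U, 1) / norm(A, 1);  % relative backward error
    fprintf('growth factor %.2f, backward error %.2e\n', rho, berr);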

CHAPTER 3

Notation

Let B be an arbitrary n-by-n square matrix, possibly unsymmetric. The graph G_B of B is an undirected graph with vertices {1, 2, ..., n} and with the edge set

    E_B = { (i, j) : i ≠ j and (B_{ij} ≠ 0 or B_{ji} ≠ 0) }.

In an undirected graph G, we denote by d_G(i) the degree of vertex i, and by d_max(G) the maximal degree in the graph. If d_G(i) = 1, we denote by p_G(i) the neighbor of i. When the identity of G is clear from context, we write d(i) = d_G(i), d_max = d_max(G) and p(i) = p_G(i).

Consider an elimination algorithm, say Gaussian elimination. We view the elimination algorithm in two different ways. The algebraic view is that the elimination of k rows and columns produces partial factors and a reduced matrix B^(k). The graphical view is that the elimination is a game that defines rules on how to eliminate vertices from a graph. For example, the rule for the Cholesky factorization of a symmetric positive definite matrix is that step k removes vertex k and turns its neighbors into a clique. We denote by G^(k)_B the graph that the game produces after the elimination of vertices 1, ..., k from G = G^(0)_B. When the identity of B is also clear from the context, we denote G^(k) = G^(k)_B. Under this definition, the graph of B^(k) is a subgraph of G^(k)_B, and it may be a proper subgraph if cancellations occurred. The vertices of G^(k)_B are {k + 1, ..., n}. That is, in the graphical representation we always view vertices 1, ..., k as having been eliminated first, even if the elimination ordering performed some pivoting steps.

In the rest of this thesis A always denotes an n-by-n sparse matrix whose graph is a tree.
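For concreteness, the graph quantities defined above can be computed directly from the sparsity pattern. The short sketch below is ours (not from the thesis); it computes d(i) and d_max for a square sparse matrix B:

    n = size(B, 1);
    S = spones(B) + spones(B');      % combined pattern of B and B^T
    S(1:n + 1:end) = 0;              % drop the diagonal (the edge set requires i ≠ j)
    d = full(sum(S > 0, 2));         % d(i) = degree of vertex i in G_B
    d_max = max(d);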

CHAPTER 4

LU with Partial Pivoting Using Minimum Degree on G

This chapter analyzes the behavior of the conventional sparse LU factorization with partial pivoting under effective column orderings. We begin with an analysis of work and fill under the most obvious column ordering, which always eliminates in step k a leaf of G^(k-1). We show that G^(k) is always a tree, that we can compute this ordering before the factorization begins, and that this process requires O(d_max n) arithmetic operations and generates O(d_max n) fill. The ordering induced by this rule is simply the minimum degree ordering applied to G. The key observation is that although pivoting can generate fill when G is a tree, the fill can only occur in U; but pivoting essentially cannot change the graph of the reduced matrix.

Lemma 4.1. Let 1 be a leaf of G^(0). After one step of Gaussian elimination with partial pivoting, the following properties hold:
(1) G^(1) is a subgraph of G^(0) induced by vertices 2, ..., n, and hence it is also a tree.
(2) The first row of U contains at most d(p(1)) + 1 nonzeros.
(3) The first column of L contains at most two nonzeros.
(4) The first elimination step performs at most one comparison, one division, and at most d(p(1)) multiply-add operations.

Proof. Since 1 is a leaf, the first column and row of A contain at most two nonzeros, in positions 1 and p(1). If the diagonal element is larger in absolute value, the algorithm simply modifies one diagonal element in A^(1) without producing any fill. In this case, the first row of U and the first column of L will each have at most two nonzeros, in positions 1 and p(1). If, on the other hand, |A_{p(1),1}| > |A_{1,1}|, Gaussian elimination with partial pivoting will exchange rows 1 and p(1). The algorithm will then subtract a scaled multiple of A_{p(1),:} from A_{1,:}, leading to fill of at most d(p(1)) + 1 nonzeros in A_{1,:}.

The columns of these potential nonzeros include columns 1 and p(1), so the new structure of row 1 is exactly the structure of row p(1). This proves the first claim of the lemma. When pivoting occurs, the structure of the first row of U is the structure of A_{p(1),:}. The structure of the first column of L is always the structure of A_{:,1}. This proves the other claims of the lemma.

This lemma leads to the main result on LU with partial pivoting.

Lemma 4.2. Choose some root for G and use it to define the parent p(i) of every vertex i except the root. Let Q be the permutation matrix of any ordering of {1, ..., n} such that p(i) is ordered after i. If we apply Gaussian elimination with partial pivoting to compute Q^T A Q = P L U, then
(1) L contains at most 2n nonzeros.
(2) U contains at most

    sum_{i=1}^{n} [ (d(i) + 1)(d(i) + 2) / 2 - 3 ] = O(d_max n)

nonzeros.
(3) The algorithm performs n - 1 comparisons and O(d_max n) arithmetic operations.

Proof. The definition of Q along with Lemma 4.1 ensures that in the kth step, when the algorithm eliminates column k of (Q^T A Q)^(k-1), this column has at most two nonzeros (k is a leaf in G^(k-1)). Consider the rows corresponding to the nonzeros in column k of (Q^T A Q)^(k-1). In one of these there are at most two nonzeros (again because k is a leaf in G^(k-1)). In the other the number of nonzeros is bounded by d + 1, where d is the degree of the parent of k in G^(k-1). The bound on the number of nonzeros in L is, therefore, trivial. The bound on the number of nonzeros in U is derived by charging the nonzeros in row k of U to i = p(k). The number of nonzeros that i is charged for is the sum, over its children, of the number of remaining children plus one when they are eliminated. Therefore, i is charged for d(i) + 1 nonzeros when its first child is eliminated, for d(i) when the second child is eliminated, and so on, down to 3 nonzeros when the last child is eliminated. This gives the formula inside the summation.

Note that a leaf i of G contributes 0 to the sum. The asymptotic bound is easy to prove: no row of U contains more than d_max + 1 nonzeros. The bound on the number of nonzeros in a row of U, along with the fact that columns of L have at most two nonzeros, gives the bound on arithmetic operations and comparisons.

It is possible to construct a family of matrices that shows that the bounds in this lemma are asymptotically tight. We omit the details, but we do describe in Chapter 6 experimental results with such matrices, which clearly show the tightness of the results.

We note that although this ordering method is fairly natural for tree-structured matrices, it is different from the orderings that we would get from ordering algorithms designed for general sparse Gaussian elimination with partial pivoting. One such algorithm is column minimum degree [9] or its approximate-degree variants [6, 7]. This algorithm produces minimum-degree-type orderings for A^T A but without computing the nonzero structure of A^T A. The degree of a vertex in the graph of A^T A can reach the number of vertices within distance 2 of it in G_A. Therefore, a column minimum degree algorithm may order internal vertices of G_A before leaves of G_A. Our algorithm does not.

Another algorithm for ordering the columns of matrices for fill reduction in LU with partial pivoting relies on wide separators [2]. Consider a regular degree-d rooted tree. We can partition it hierarchically with wide separators as follows. The root and its children constitute the top-level separator, connected components in the next two levels form the next-level separators, and so on. There are d + 1 vertices in each separator: a vertex and its children in G. The wide-separators ordering algorithm does not specify the ordering within separators, so a vertex might be eliminated before its children. This leads to O(dn) nonzeros in L, O(dn) nonzeros in U, and O(d^2 n) arithmetic operations. The number of operations is a factor of d worse than in our approach.

When Gaussian elimination with partial pivoting always eliminates a leaf of the reduced graph (as is the case in this algorithm and in the algorithm of the next chapter), the growth factor in the elimination is limited to d_max + 1. Unless d_max is huge, this ensures that the elimination is backward stable.
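In Matlab, a factorization in the spirit of this chapter can be set up with built-in routines. The sketch below is ours and only approximates the algorithm analyzed above: symamd computes an approximate minimum-degree ordering of the pattern of A + A^T, and the built-in sparse lu performs row pivoting (threshold pivoting by default, which is not strict partial pivoting).

    q = symamd(spones(A) + spones(A'));   % bottom-up ordering: leaves before their parents
    [L, U, P] = lu(A(q, q));              % LU with row pivoting on the reordered matrix
    fill = nnz(L) + nnz(U) - size(A, 1);  % O(d_max * n) by Lemma 4.2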

Lemma 4.3. Let M be a bound on the magnitude of the elements of A, and let L and U be the triangular factors produced by Gaussian elimination with partial pivoting of AQ for some permutation matrix Q. If the elimination always eliminates leaves of the reduced graph, then the elements of U and of the reduced matrices are bounded in magnitude by (d_max + 1) M (the elements of L are always bounded in magnitude by 1 in Gaussian elimination with partial pivoting).

Proof. Off-diagonal elements in the reduced matrices never grow. If we eliminate a leaf i without exchanging the row of the leaf and the row of its parent, then the only element that is modified in the reduced matrix is the diagonal element associated with the parent. If we do exchange the two rows, then off-diagonals in row p(i) of the new reduced matrix are computed by

    A^(i)_{p(i),j} = -L_{p(i),i} A^(i-1)_{p(i),j}.

Since |L_{p(i),i}| <= 1, the off-diagonals cannot grow in magnitude. Diagonal elements can grow. In an elimination step without a row exchange, we have

    A^(i)_{p(i),p(i)} = A^(i-1)_{p(i),p(i)} - L_{p(i),i} A^(i-1)_{i,p(i)}.

In an elimination step with a row exchange, we have

    A^(i)_{p(i),p(i)} = A^(i-1)_{i,p(i)} - L_{p(i),i} A^(i-1)_{p(i),p(i)}.

Since every diagonal element in the reduced matrices undergoes at most d_max modifications, it is easy to see by induction that elements in the reduced matrices, and therefore in U, are bounded in magnitude by (d_max + 1) M.

If A is not only tree-structured but tridiagonal and no column pivoting is used, then the proof of this lemma gives a normwise growth bound by a factor of 2. This is a known result that is a special case of the analysis of Bohte [1] of growth in Gaussian elimination with partial pivoting of banded matrices (see also [11], Section 9.5).

CHAPTER 5

Sibling-Dominant Pivoting

We now show that we can reduce the work and fill to O(n) using a more sophisticated ordering strategy. In this algorithm, the column ordering depends on the numerical values of A, not only on the structure of G and the reduced graphs. Furthermore, we build this ordering dynamically as the algorithm progresses, not as a preprocessing step. But even with the cost of the dynamic ordering taken into account, the algorithm still performs only O(n) operations.

Definition 5.1. The dominance of column j in a matrix B is

    dom(j) = ∞                                      if |B_{jj}| >= max_{i ≠ j} |B_{ij}|,
    dom(j) = max_{i ≠ j} |B_{ij}| / |B_{jj}|        otherwise.

The dominance is not continuous in B_{jj}; as |B_{jj}| grows, the dominance shrinks towards 1, but then jumps to ∞. We say that column j dominates column k if dom(j) >= dom(k).

Our algorithm works as follows. It selects a vertex i with at most one non-leaf neighbor. It will eliminate next the leaves {j_1, ..., j_l} whose neighbor is i, but in a specific order. To determine the ordering of {j_1, ..., j_l}, the algorithm first computes the dominance of each column in {j_1, ..., j_l}. Next, the algorithm finds the column in {j_1, ..., j_l} with the largest finite dominance (if any). This column is ordered after all the columns with infinite dominance and before all the other columns with finite dominance. Now the algorithm performs the elimination of {j_1, ..., j_l}, breaking ties by not pivoting. That is, if |A^(k)_{j,j}| = |A^(k)_{i,j}| for some j in {j_1, ..., j_l}, the algorithm uses the diagonal element of the reduced matrix as a pivot.

Lemma 5.2. The algorithm given above performs at most one pivoting step during the elimination of {j_1, ..., j_l}.
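Definition 5.1 translates into a few lines of Matlab. The helper below is ours (hypothetical, not part of the thesis code); note that a zero diagonal also yields an infinite dominance, consistent with the proof of Lemma 5.2.

    function d = dominance(B, j)
        % Dominance of column j of B, following Definition 5.1.
        off = max(abs(B([1:j - 1, j + 1:end], j)));   % largest off-diagonal magnitude
        if abs(B(j, j)) >= off
            d = Inf;                                  % the column is diagonally dominant
        else
            d = off / abs(B(j, j));                   % finite dominance (Inf if B(j,j) == 0)
        end
    end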

Proof. The elimination sequence starts with the columns with infinite dominance (perhaps none). There are two kinds of such columns: columns that are diagonally dominant, and columns with a zero diagonal. If we eliminate a leaf column j with a zero diagonal, rows j and i = p(j) are simply exchanged and then row and column j are dropped from the reduced matrix. In this case, no more pivoting can occur during the elimination of the remaining columns in {j_1, ..., j_l}. In the elimination of a leaf column that is diagonally dominant (even weakly dominant), the algorithm does not pivot, and we only modify element A_{i,i} in the reduced matrix. Because only this element is modified, the dominance of the remaining columns in {j_1, ..., j_l} is preserved.

Now consider what happens when we eliminate the first column j in {j_1, ..., j_l} with a finite dominance. Its dominance is larger than that of any other column in {j_1, ..., j_l} with a finite dominance. Because j has finite dominance, the algorithm pivots in column j; it exchanges rows j and i = p(j). In the reduced matrix, we have

    A^(j)_{i,k} = A^(j-1)_{j,k} - ( A^(j-1)_{j,j} / A^(j-1)_{i,j} ) A^(j-1)_{i,k}.

(This formula represents both the row exchange and the numerical modification.) For a sibling k of j, we have A^(j-1)_{j,k} = 0, so

    A^(j)_{i,k} = - ( A^(j-1)_{j,j} / A^(j-1)_{i,j} ) A^(j-1)_{i,k},

and the absolute values satisfy

    |A^(j)_{i,k}| = ( |A^(j-1)_{j,j}| / |A^(j-1)_{i,j}| ) |A^(j-1)_{i,k}| = |A^(j-1)_{i,k}| / dom(j).

We claim that the remaining uneliminated columns in {j_1, ..., j_l} have become diagonally dominant. Let k be one of these columns.

We have dom(j) >= dom(k), so

    |A^(j)_{k,k}| = |A^(j-1)_{k,k}| = |A^(j-1)_{i,k}| / dom(k) >= |A^(j-1)_{i,k}| / dom(j) = |A^(j)_{i,k}|.

This implies that if we eliminate k next, the algorithm will not pivot. Since we have already shown that when the algorithm does not pivot it does not modify other sibling columns, the other siblings will remain diagonally dominant and they will not require pivoting either.

We note that columns with infinite dominance can also be ordered after the column with the largest finite dominance, but the proof becomes longer. Since there is no benefit in this variant ordering, we kept the ordering strict and the proof short.

In our sibling-dominant elimination, the elimination of each column uses the regular partial-pivoting rule. It appears that the column ordering cannot be computed before the numerical factorization begins. Nonetheless, the overall effort to compute the column ordering is O(n) operations.

Lemma 5.3. Ordering the columns in sibling-dominant partial pivoting requires only O(n) operations and O(n) storage.

Proof. We pre-compute a partial column ordering before the numerical factorization begins and refine it into a total ordering during the numerical factorization. We use two integer vectors of size n: column-index and partial-order. The preordering phase begins with an arbitrary choice of root for G_A, say vertex 1. We write the index of the root into column-index[1] and an invalid column index (such as -1 or n + 1) into partial-order[1].

We now perform a breadth-first search (BFS) of G_A starting from the root. When we visit a new vertex, we write its column index into the next unused location in column-index and the index of its parent in the rooted G_A into the first unused location in partial-order. The identity of the parent is always known, because it is the vertex from which we discovered the current one.

The partial ordering is specified by the reverse ordering of the same-parent groups of columns in the two vectors. That is, the first group of columns to be eliminated will be the last group with the same parent to be discovered by the BFS. It is not hard to see that if the elimination ordering respects this partial order, then each group with the same parent is eliminated when all the members of the group are leaves of the reduced graph.

Given the vectors column-index and partial-order we can enumerate the sibling groups in time proportional to their size, by scanning from the last uneliminated column in the vectors towards the beginning of the vectors until we find a column with a different parent (or no parent). During this enumeration, we can find and eliminate all the columns with infinite dominance, and we can compute the dominance of the rest. Once this traversal and the elimination of the infinite-dominance columns are complete, we eliminate the column with the largest finite dominance, if any, and then we eliminate its siblings.

Clearly, the total work and storage required for computing the column ordering is O(n). The keys to computing the ordering efficiently are an efficient pre-computation of the sibling groups (using BFS) and the fact that in each group we only need to find one dominant sibling, not to sort all the siblings by dominance.

The sibling-dominant column ordering results in a partial-pivoting Gaussian elimination algorithm that performs only O(n) work and generates only O(n) fill.

Theorem 5.4. The work and fill in sibling-dominant partial pivoting are both O(n).

Proof. Sibling-dominant partial pivoting is a special case of partial pivoting using a minimum degree ordering. By Lemmas 4.1 and 4.2, each column of L contains at most two nonzeros and requires a constant number of operations to compute. The elimination of a column with infinite dominance because it is diagonally dominant requires no pivoting. Therefore, the corresponding row of U is simply the row of the reduced matrix, which contains at most two nonzeros because the reduced graph is a tree and the eliminated column is a leaf. If we eliminate a column with infinite dominance because its diagonal element is zero, or if there are no such columns and we eliminate the column with the largest finite dominance, then we exchange two rows. This causes fill in the row of U: the number of fill elements and the number of multiply-add operations is bounded by the number of remaining siblings in the group plus 2. However, once we perform such a row exchange, further eliminations in the same sibling group will not require any row exchanges (Lemma 5.2). This implies that no more fill will occur in U within this group, and that the number of arithmetic operations per remaining column in the group is bounded by a small constant. Since we showed that the ordering of the elimination steps also requires only O(n) operations, the theorem holds.
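The preordering phase described in the proof of Lemma 5.3 can be sketched as follows (our sketch, not the thesis code; S is assumed to be the symmetric n-by-n sparsity pattern of A, and vertex 1 is used as the root):

    n = size(S, 1);
    column_index  = zeros(1, n);     % vertices in BFS (discovery) order
    partial_order = zeros(1, n);     % parent of each discovered vertex (0 marks the root)
    visited = false(1, n);
    column_index(1) = 1;  partial_order(1) = 0;  visited(1) = true;
    head = 1;  tail = 1;
    while head <= tail
        v = column_index(head);  head = head + 1;
        for w = find(S(:, v))'                   % neighbors of v in the tree
            if ~visited(w)
                tail = tail + 1;
                column_index(tail)  = w;         % next unused slot, as in the proof
                partial_order(tail) = v;         % parent of w in the rooted tree
                visited(w) = true;
            end
        end
    end
    % Scanning the two vectors from the end toward the beginning enumerates the
    % sibling groups in the reverse of their discovery order.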

CHAPTER 6

Experimental Results

We conducted three sets of experiments in order to demonstrate the effectiveness of the sibling-dominant pivoting method. In all the experiments we used almost-complete regular trees. These trees are degree-d_max complete trees with some of the leaves removed to obtain a specific number of vertices. In these trees, the degree of all the internal vertices is the same except for one or two (the root and perhaps one vertex in the second-to-last level). Figure 6.1 shows an example of such a tree.

The first experiment, whose results are shown in Figures 6.2, 6.3, and 6.4, used trees with 1000 vertices and d_max that ranged from 2 to 999. The matrices are all symmetric and the value of all the nonzero off-diagonal elements is 1. The values of the diagonal elements are constructed so as to cause LU with partial pivoting to fill as much as possible when the column ordering is produced by a minimum-degree algorithm applied to A + A^T. In other words, to construct the matrices we first construct their graph, we then construct a minimum-degree ordering for this graph, permute the matrix according to this ordering, and finally compute the values of the diagonal elements.

The results of the experiments show clearly that the amount of fill and the number of arithmetic operations that LU with partial pivoting performs on these matrices is linear in d_max, even though the elimination was ordered bottom-up using a minimum-degree algorithm.

Figure 6.1. An almost-complete regular tree with d_max = 4.
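Test trees of this kind are easy to generate. The following construction is ours (a hypothetical helper consistent with the description above): it gives every vertex d_max - 1 children in breadth-first order, so all internal vertices other than the root have degree d_max.

    function S = almost_complete_tree(n, d_max)
        % Sparsity pattern (with a full diagonal) of a tree on n vertices in
        % which each vertex receives d_max - 1 children in breadth-first order.
        v = 2:n;
        parent = floor((v - 2) / (d_max - 1)) + 1;        % parent of vertices 2..n
        S = sparse([v, parent], [parent, v], 1, n, n);    % symmetric off-diagonal pattern
        S = S + speye(n);                                 % nonzero diagonal
    end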

Figure 6.2. The number of nonzeros in the LU factors of symmetric matrices whose graphs are almost-complete regular trees (n = 1000, worst-case diagonals; minimum degree vs. sibling-dominant). The values of the elements of the matrices were chosen so as to cause as much fill as possible in LU with partial pivoting when the elimination ordering is bottom-up.

Figure 6.3. The number of arithmetic operations in the factorizations, on the same matrices as in Figure 6.2.

The slight non-linearities are caused by the fact that the trees are not complete. The growth in both algorithms is exactly the same and appears to be slightly sub-linear.

Figure 6.4. The growth in U on the same matrices as in Figure 6.2 (n = 1000, worst-case diagonals). The two data sets overlap each other exactly.

The fact that the growth in the two algorithms is exactly the same is an artifact of the special structure of these matrices; in general, the growth factors would be different.

In the second set of experiments we used random diagonal elements with uniform distribution in [-1/2, 1/2]; the nonzero off-diagonals were still all 1. The trees were still almost complete, with 10,000 vertices. For each of 10 different values of d_max we generated 100 random matrices. The results, shown in Figures 6.5, 6.6, and 6.7, indicate that on these trees, partial pivoting with a bottom-up column preordering fills much less than the worst-case bound. The average amount of fill grows very little with d_max, but the variance in the fill does grow with d_max. Still, in absolute terms the amount of fill and arithmetic in sibling-dominant pivoting is much smaller than in partial pivoting. The growth in U is exactly linear in d_max and is almost completely invariant to the random choice of diagonal elements. In this case too, the growth in the two algorithms is the same (again an artifact of the special construction).

The last set of experiments used similar matrices, but the value of all the diagonal elements was n (the number of vertices and the order of the matrices). These matrices are strongly diagonally dominant.

Figure 6.5. The number of nonzeros in the LU factors of symmetric matrices whose graphs are almost-complete regular trees (n = 10000, random diagonals; minimum degree vs. sibling-dominant). The values of the off-diagonal nonzeros are all 1 and the values of the diagonal elements are random. The markers and the lines show the average of 100 matrices and the error bars show the standard deviation.

Figure 6.6. The number of arithmetic operations in the experiments described in Figure 6.5.

The LU factorization with partial pivoting was computed after a column-preordering phase that started with a random column ordering and then reordered the columns again using colamd [6], a minimum-degree heuristic ordering that tries to minimize fill in the Cholesky factor of A^T A (without explicitly computing A^T A or its nonzero structure).

Figure 6.7. The growth factors in the experiments described in Figure 6.5.

Figure 6.8. The number of nonzeros in the factorization of symmetric diagonally-dominant tree-structured matrices (n = 1000; colamd vs. sibling-dominant). For LU with partial pivoting, the matrices were ordered using colamd following an initial random ordering; the randomness in the results is caused by this initial random ordering.

Multiple initial random orderings of each matrix produced variance in the fill and work of the partial-pivoting code, even with the colamd reordering.

Figure 6.9. The number of arithmetic operations in the experiments shown in Figure 6.8.

The matrices are diagonally dominant, so no pivoting was performed by LU with partial pivoting. Therefore, minimum degree on A + A^T would have produced no fill and linear work. The results in Figures 6.8 and 6.9 indicate that colamd leads to superlinear fill and work in LU with partial pivoting even though the matrix is a diagonally dominant tree. Colamd indeed guards the factorization against catastrophic fill, by accounting for any possible pivoting sequence. But the price of this conservatism is some fill even when no pivoting is performed.


More information

Consider the following example of a linear system:

Consider the following example of a linear system: LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x + 2x 2 + 3x 3 = 5 x + x 3 = 3 3x + x 2 + 3x 3 = 3 x =, x 2 = 0, x 3 = 2 In general we want to solve n equations

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Logistics Notes for 2016-09-14 1. There was a goof in HW 2, problem 1 (now fixed) please re-download if you have already started looking at it. 2. CS colloquium (4:15 in Gates G01) this Thurs is Margaret

More information

Direct Methods for Solving Linear Systems. Matrix Factorization

Direct Methods for Solving Linear Systems. Matrix Factorization Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011

More information

V C V L T I 0 C V B 1 V T 0 I. l nk

V C V L T I 0 C V B 1 V T 0 I. l nk Multifrontal Method Kailai Xu September 16, 2017 Main observation. Consider the LDL T decomposition of a SPD matrix [ ] [ ] [ ] [ ] B V T L 0 I 0 L T L A = = 1 V T V C V L T I 0 C V B 1 V T, 0 I where

More information

Linear Algebraic Equations

Linear Algebraic Equations Linear Algebraic Equations 1 Fundamentals Consider the set of linear algebraic equations n a ij x i b i represented by Ax b j with [A b ] [A b] and (1a) r(a) rank of A (1b) Then Axb has a solution iff

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS We want to solve the linear system a, x + + a,n x n = b a n, x + + a n,n x n = b n This will be done by the method used in beginning algebra, by successively eliminating unknowns

More information

Lecture Note 2: The Gaussian Elimination and LU Decomposition

Lecture Note 2: The Gaussian Elimination and LU Decomposition MATH 5330: Computational Methods of Linear Algebra Lecture Note 2: The Gaussian Elimination and LU Decomposition The Gaussian elimination Xianyi Zeng Department of Mathematical Sciences, UTEP The method

More information

Solving Dense Linear Systems I

Solving Dense Linear Systems I Solving Dense Linear Systems I Solving Ax = b is an important numerical method Triangular system: [ l11 l 21 if l 11, l 22 0, ] [ ] [ ] x1 b1 = l 22 x 2 b 2 x 1 = b 1 /l 11 x 2 = (b 2 l 21 x 1 )/l 22 Chih-Jen

More information

Dense LU factorization and its error analysis

Dense LU factorization and its error analysis Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 208 LECTURE 3 need for pivoting we saw that under proper circumstances, we can write A LU where 0 0 0 u u 2 u n l 2 0 0 0 u 22 u 2n L l 3 l 32, U 0 0 0 l n l

More information

CSE 160 Lecture 13. Numerical Linear Algebra

CSE 160 Lecture 13. Numerical Linear Algebra CSE 16 Lecture 13 Numerical Linear Algebra Announcements Section will be held on Friday as announced on Moodle Midterm Return 213 Scott B Baden / CSE 16 / Fall 213 2 Today s lecture Gaussian Elimination

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 12: Gaussian Elimination and LU Factorization Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 10 Gaussian Elimination

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

COURSE Numerical methods for solving linear systems. Practical solving of many problems eventually leads to solving linear systems.

COURSE Numerical methods for solving linear systems. Practical solving of many problems eventually leads to solving linear systems. COURSE 9 4 Numerical methods for solving linear systems Practical solving of many problems eventually leads to solving linear systems Classification of the methods: - direct methods - with low number of

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Computational Methods. Systems of Linear Equations

Computational Methods. Systems of Linear Equations Computational Methods Systems of Linear Equations Manfred Huber 2010 1 Systems of Equations Often a system model contains multiple variables (parameters) and contains multiple equations Multiple equations

More information

GAUSSIAN ELIMINATION AND LU DECOMPOSITION (SUPPLEMENT FOR MA511)

GAUSSIAN ELIMINATION AND LU DECOMPOSITION (SUPPLEMENT FOR MA511) GAUSSIAN ELIMINATION AND LU DECOMPOSITION (SUPPLEMENT FOR MA511) D. ARAPURA Gaussian elimination is the go to method for all basic linear classes including this one. We go summarize the main ideas. 1.

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Chapter 2 - Linear Equations

Chapter 2 - Linear Equations Chapter 2 - Linear Equations 2. Solving Linear Equations One of the most common problems in scientific computing is the solution of linear equations. It is a problem in its own right, but it also occurs

More information

Pivoting. Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3

Pivoting. Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3 Pivoting Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3 In the previous discussions we have assumed that the LU factorization of A existed and the various versions could compute it in a stable manner.

More information

12/1/2015 LINEAR ALGEBRA PRE-MID ASSIGNMENT ASSIGNED BY: PROF. SULEMAN SUBMITTED BY: M. REHAN ASGHAR BSSE 4 ROLL NO: 15126

12/1/2015 LINEAR ALGEBRA PRE-MID ASSIGNMENT ASSIGNED BY: PROF. SULEMAN SUBMITTED BY: M. REHAN ASGHAR BSSE 4 ROLL NO: 15126 12/1/2015 LINEAR ALGEBRA PRE-MID ASSIGNMENT ASSIGNED BY: PROF. SULEMAN SUBMITTED BY: M. REHAN ASGHAR Cramer s Rule Solving a physical system of linear equation by using Cramer s rule Cramer s rule is really

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

4.2 Floating-Point Numbers

4.2 Floating-Point Numbers 101 Approximation 4.2 Floating-Point Numbers 4.2 Floating-Point Numbers The number 3.1416 in scientific notation is 0.31416 10 1 or (as computer output) -0.31416E01..31416 10 1 exponent sign mantissa base

More information

Roundoff Analysis of Gaussian Elimination

Roundoff Analysis of Gaussian Elimination Jim Lambers MAT 60 Summer Session 2009-0 Lecture 5 Notes These notes correspond to Sections 33 and 34 in the text Roundoff Analysis of Gaussian Elimination In this section, we will perform a detailed error

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

Linear Programming The Simplex Algorithm: Part II Chapter 5

Linear Programming The Simplex Algorithm: Part II Chapter 5 1 Linear Programming The Simplex Algorithm: Part II Chapter 5 University of Chicago Booth School of Business Kipp Martin October 17, 2017 Outline List of Files Key Concepts Revised Simplex Revised Simplex

More information

CS 219: Sparse matrix algorithms: Homework 3

CS 219: Sparse matrix algorithms: Homework 3 CS 219: Sparse matrix algorithms: Homework 3 Assigned April 24, 2013 Due by class time Wednesday, May 1 The Appendix contains definitions and pointers to references for terminology and notation. Problem

More information

Matrix Factorization and Analysis

Matrix Factorization and Analysis Chapter 7 Matrix Factorization and Analysis Matrix factorizations are an important part of the practice and analysis of signal processing. They are at the heart of many signal-processing algorithms. Their

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

CHAPTER 6. Direct Methods for Solving Linear Systems

CHAPTER 6. Direct Methods for Solving Linear Systems CHAPTER 6 Direct Methods for Solving Linear Systems. Introduction A direct method for approximating the solution of a system of n linear equations in n unknowns is one that gives the exact solution to

More information

MTH 464: Computational Linear Algebra

MTH 464: Computational Linear Algebra MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University February 6, 2018 Linear Algebra (MTH

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Chapter 2. Solving Systems of Equations. 2.1 Gaussian elimination

Chapter 2. Solving Systems of Equations. 2.1 Gaussian elimination Chapter 2 Solving Systems of Equations A large number of real life applications which are resolved through mathematical modeling will end up taking the form of the following very simple looking matrix

More information

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker

More information

Gaussian Elimination for Linear Systems

Gaussian Elimination for Linear Systems Gaussian Elimination for Linear Systems Tsung-Ming Huang Department of Mathematics National Taiwan Normal University October 3, 2011 1/56 Outline 1 Elementary matrices 2 LR-factorization 3 Gaussian elimination

More information

Solving Linear Systems of Equations

Solving Linear Systems of Equations Solving Linear Systems of Equations Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.edu These slides are a supplement to the book Numerical Methods with Matlab:

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Direct Methods Philippe B. Laval KSU Fall 2017 Philippe B. Laval (KSU) Linear Systems: Direct Solution Methods Fall 2017 1 / 14 Introduction The solution of linear systems is one

More information

Lecture 9. Errors in solving Linear Systems. J. Chaudhry (Zeb) Department of Mathematics and Statistics University of New Mexico

Lecture 9. Errors in solving Linear Systems. J. Chaudhry (Zeb) Department of Mathematics and Statistics University of New Mexico Lecture 9 Errors in solving Linear Systems J. Chaudhry (Zeb) Department of Mathematics and Statistics University of New Mexico J. Chaudhry (Zeb) (UNM) Math/CS 375 1 / 23 What we ll do: Norms and condition

More information

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8

More information

Chapter 2. Solving Systems of Equations. 2.1 Gaussian elimination

Chapter 2. Solving Systems of Equations. 2.1 Gaussian elimination Chapter 2 Solving Systems of Equations A large number of real life applications which are resolved through mathematical modeling will end up taking the form of the following very simple looking matrix

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

Gaussian Elimination without/with Pivoting and Cholesky Decomposition

Gaussian Elimination without/with Pivoting and Cholesky Decomposition Gaussian Elimination without/with Pivoting and Cholesky Decomposition Gaussian Elimination WITHOUT pivoting Notation: For a matrix A R n n we define for k {,,n} the leading principal submatrix a a k A

More information

The System of Linear Equations. Direct Methods. Xiaozhou Li.

The System of Linear Equations. Direct Methods. Xiaozhou Li. 1/16 The Direct Methods xiaozhouli@uestc.edu.cn http://xiaozhouli.com School of Mathematical Sciences University of Electronic Science and Technology of China Chengdu, China Does the LU factorization always

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

Numerical Methods - Numerical Linear Algebra

Numerical Methods - Numerical Linear Algebra Numerical Methods - Numerical Linear Algebra Y. K. Goh Universiti Tunku Abdul Rahman 2013 Y. K. Goh (UTAR) Numerical Methods - Numerical Linear Algebra I 2013 1 / 62 Outline 1 Motivation 2 Solving Linear

More information

Linear Systems of n equations for n unknowns

Linear Systems of n equations for n unknowns Linear Systems of n equations for n unknowns In many application problems we want to find n unknowns, and we have n linear equations Example: Find x,x,x such that the following three equations hold: x

More information

Sherman-Morrison-Woodbury

Sherman-Morrison-Woodbury Week 5: Wednesday, Sep 23 Sherman-Mrison-Woodbury The Sherman-Mrison fmula describes the solution of A+uv T when there is already a factization f A. An easy way to derive the fmula is through block Gaussian

More information

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks

Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks CHAPTER 2 Fast matrix algebra for dense matrices with rank-deficient off-diagonal blocks Chapter summary: The chapter describes techniques for rapidly performing algebraic operations on dense matrices

More information

Solving Linear Systems of Equations

Solving Linear Systems of Equations 1 Solving Linear Systems of Equations Many practical problems could be reduced to solving a linear system of equations formulated as Ax = b This chapter studies the computational issues about directly

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Fundamentals of Engineering Analysis (650163)

Fundamentals of Engineering Analysis (650163) Philadelphia University Faculty of Engineering Communications and Electronics Engineering Fundamentals of Engineering Analysis (6563) Part Dr. Omar R Daoud Matrices: Introduction DEFINITION A matrix is

More information

(17) (18)

(17) (18) Module 4 : Solving Linear Algebraic Equations Section 3 : Direct Solution Techniques 3 Direct Solution Techniques Methods for solving linear algebraic equations can be categorized as direct and iterative

More information

Practical Linear Algebra: A Geometry Toolbox

Practical Linear Algebra: A Geometry Toolbox Practical Linear Algebra: A Geometry Toolbox Third edition Chapter 12: Gauss for Linear Systems Gerald Farin & Dianne Hansford CRC Press, Taylor & Francis Group, An A K Peters Book www.farinhansford.com/books/pla

More information

Unit 2, Section 3: Linear Combinations, Spanning, and Linear Independence Linear Combinations, Spanning, and Linear Independence

Unit 2, Section 3: Linear Combinations, Spanning, and Linear Independence Linear Combinations, Spanning, and Linear Independence Linear Combinations Spanning and Linear Independence We have seen that there are two operations defined on a given vector space V :. vector addition of two vectors and. scalar multiplication of a vector

More information

Topics. Vectors (column matrices): Vector addition and scalar multiplication The matrix of a linear function y Ax The elements of a matrix A : A ij

Topics. Vectors (column matrices): Vector addition and scalar multiplication The matrix of a linear function y Ax The elements of a matrix A : A ij Topics Vectors (column matrices): Vector addition and scalar multiplication The matrix of a linear function y Ax The elements of a matrix A : A ij or a ij lives in row i and column j Definition of a matrix

More information

ANONSINGULAR tridiagonal linear system of the form

ANONSINGULAR tridiagonal linear system of the form Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal

More information

Example: 2x y + 3z = 1 5y 6z = 0 x + 4z = 7. Definition: Elementary Row Operations. Example: Type I swap rows 1 and 3

Example: 2x y + 3z = 1 5y 6z = 0 x + 4z = 7. Definition: Elementary Row Operations. Example: Type I swap rows 1 and 3 Linear Algebra Row Reduced Echelon Form Techniques for solving systems of linear equations lie at the heart of linear algebra. In high school we learn to solve systems with or variables using elimination

More information

Chapter 12 Block LU Factorization

Chapter 12 Block LU Factorization Chapter 12 Block LU Factorization Block algorithms are advantageous for at least two important reasons. First, they work with blocks of data having b 2 elements, performing O(b 3 ) operations. The O(b)

More information

6 Linear Systems of Equations

6 Linear Systems of Equations 6 Linear Systems of Equations Read sections 2.1 2.3, 2.4.1 2.4.5, 2.4.7, 2.7 Review questions 2.1 2.37, 2.43 2.67 6.1 Introduction When numerically solving two-point boundary value problems, the differential

More information

The doubly negative matrix completion problem

The doubly negative matrix completion problem The doubly negative matrix completion problem C Mendes Araújo, Juan R Torregrosa and Ana M Urbano CMAT - Centro de Matemática / Dpto de Matemática Aplicada Universidade do Minho / Universidad Politécnica

More information

CS513, Spring 2007 Prof. Amos Ron Assignment #5 Solutions Prepared by Houssain Kettani. a mj i,j [2,n] a 11

CS513, Spring 2007 Prof. Amos Ron Assignment #5 Solutions Prepared by Houssain Kettani. a mj i,j [2,n] a 11 CS513, Spring 2007 Prof. Amos Ron Assignment #5 Solutions Prepared by Houssain Kettani 1 Question 1 1. Let a ij denote the entries of the matrix A. Let A (m) denote the matrix A after m Gaussian elimination

More information

NUMERICAL MATHEMATICS & COMPUTING 7th Edition

NUMERICAL MATHEMATICS & COMPUTING 7th Edition NUMERICAL MATHEMATICS & COMPUTING 7th Edition Ward Cheney/David Kincaid c UT Austin Engage Learning: Thomson-Brooks/Cole wwwengagecom wwwmautexasedu/cna/nmc6 October 16, 2011 Ward Cheney/David Kincaid

More information

Numerical Methods I: Numerical linear algebra

Numerical Methods I: Numerical linear algebra 1/3 Numerical Methods I: Numerical linear algebra Georg Stadler Courant Institute, NYU stadler@cimsnyuedu September 1, 017 /3 We study the solution of linear systems of the form Ax = b with A R n n, x,

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix EE507 - Computational Techniques for EE 7. LU factorization Jitkomut Songsiri factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization

More information

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations

Chapter 1: Systems of linear equations and matrices. Section 1.1: Introduction to systems of linear equations Chapter 1: Systems of linear equations and matrices Section 1.1: Introduction to systems of linear equations Definition: A linear equation in n variables can be expressed in the form a 1 x 1 + a 2 x 2

More information

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel

LECTURE NOTES ELEMENTARY NUMERICAL METHODS. Eusebius Doedel LECTURE NOTES on ELEMENTARY NUMERICAL METHODS Eusebius Doedel TABLE OF CONTENTS Vector and Matrix Norms 1 Banach Lemma 20 The Numerical Solution of Linear Systems 25 Gauss Elimination 25 Operation Count

More information

A Review of Linear Algebra

A Review of Linear Algebra A Review of Linear Algebra Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.edu These slides are a supplement to the book Numerical Methods with Matlab: Implementations

More information

Elementary maths for GMT

Elementary maths for GMT Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1

More information

Solving linear systems (6 lectures)

Solving linear systems (6 lectures) Chapter 2 Solving linear systems (6 lectures) 2.1 Solving linear systems: LU factorization (1 lectures) Reference: [Trefethen, Bau III] Lecture 20, 21 How do you solve Ax = b? (2.1.1) In numerical linear

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information