Reconstructing Householder Vectors from Tall-Skinny QR


Reconstructing Householder Vectors from Tall-Skinny QR

Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Hong Diep Nguyen, Edgar Solomonik

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-
http://

October 6, 2013

Copyright 2013, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

We would like to thank Yusaku Yamamoto for sharing his slides from SIAM ALA 2012 with us. We also thank Jack Poulson for help configuring Elemental. Solomonik was supported by a Department of Energy Computational Science Graduate Fellowship, grant number DE-FG02-97ER25308. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We also acknowledge DOE grants DE-SC , DE-SC , DE-SC , DE-SC , DE-SC001000, AC02-05CH11231, and DARPA grant HR .

Reconstructing Householder Vectors from Tall-Skinny QR

October 6, 2013

Grey Ballard 1, James Demmel 2, Laura Grigori 3, Mathias Jacquelin 4, Hong Diep Nguyen 2, and Edgar Solomonik 2

1 Sandia National Laboratories, Livermore, USA
2 University of California, Berkeley, Berkeley, USA
3 INRIA Paris - Rocquencourt, Paris, France
4 Lawrence Berkeley National Laboratory, Berkeley, USA

Abstract

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 system at NERSC.

1 Introduction

Because of the rising costs of communication (i.e., data movement) relative to computation, so-called communication-avoiding algorithms (ones that perform roughly the same computation as alternatives but significantly reduce communication) often run with reduced running times on today's architectures.
In particular, the standard algorithm for computing the QR decomposition of a tall and skinny matrix (one with many more rows than columns) is often bottlenecked by communication costs. A more recent algorithm known as Tall-Skinny QR (TSQR) is presented in [1] (the ideas go back to [2]) and overcomes this bottleneck by reformulating the computation. In fact, the algorithm is communication optimal, attaining the lower bound for communication costs of QR decomposition (up to a logarithmic factor in the number of processors) [3]. Not only is communication reduced in theory, but the algorithm has been demonstrated to perform better on a variety of architectures, including multicore processors [4], graphics processing units [5], and distributed-memory systems [6]. The standard algorithm for QR decomposition, which is implemented in LAPACK [7], ScaLAPACK [8], and Elemental [9], is known as Householder-QR (given below as Algorithm 1). For tall and skinny matrices, the algorithm works column-by-column, computing a Householder vector and applying the corresponding transformation for each column in the matrix. When the matrix is distributed across a parallel machine,

this requires one parallel reduction per column. The TSQR algorithm (given below as Algorithm 2), on the other hand, performs only one reduction during the entire computation. Therefore, TSQR requires asymptotically less inter-processor synchronization than Householder-QR on parallel machines (TSQR also achieves asymptotically higher cache reuse on sequential machines). Computing the QR decomposition of a tall and skinny matrix is an important kernel in many contexts, including standalone least squares problems, eigenvalue and singular value computations, and Krylov subspace and other iterative methods. In addition, the tall and skinny factorization is a standard building block in the computation of the QR decomposition of general (not necessarily tall and skinny) matrices. In particular, most algorithms work by factoring a tall and skinny panel of the matrix, applying the orthogonal factor to the trailing matrix, and then continuing on to the next panel. Although Householder-QR is bottlenecked by communication in the panel factorization, it can apply the orthogonal factor as an aggregated Householder transformation efficiently, using matrix multiplication [10]. The Communication-Avoiding QR (CAQR) [1] algorithm uses TSQR to factor each panel of a general matrix. One difficulty faced by CAQR is that TSQR computes an orthogonal factor that is implicitly represented in a different format than that of Householder-QR. While Householder-QR represents the orthogonal factor as a set of Householder vectors (one per column), TSQR computes a tree of smaller sets of Householder vectors (though with the same total number of nonzero parameters). In CAQR, this difference in representation implies that the trailing matrix update is done using the implicit tree representation rather than matrix multiplication as is possible with Householder-QR. From a software engineering perspective, this means writing and tuning more complicated code.
Furthermore, from a performance perspective, the update of the trailing matrix within CAQR is less communication efficient than the update within Householder-QR by a logarithmic factor in the number of processors. Building on a method introduced by Yamamoto [11], we show that the standard Householder vector representation may be recovered from the implicit TSQR representation for roughly the same cost as the TSQR itself. The key idea is that the Householder vectors that represent an orthonormal matrix can be computed via LU decomposition (without pivoting) of the orthonormal matrix subtracted from a diagonal sign matrix. We prove that this reconstruction is as numerically stable as Householder-QR (independent of the matrix condition number), and we validate this proof with experimental results. This reconstruction method allows us to get the best of the TSQR algorithm (avoiding synchronization) as well as the best of the Householder-QR algorithm (efficient trailing matrix updates via matrix multiplication). By obtaining Householder vectors from the TSQR representation, we can logically decouple the block size of the trailing matrix updates from the number of columns in each TSQR. This abstraction makes it possible to optimize the panel factorization and the trailing matrix updates independently. Our resulting parallel implementation outperforms ScaLAPACK, Elemental, and our own lightly-optimized CAQR implementation on the Hopper Cray XE6 platform at NERSC. While we do not experimentally study sequential performance, we expect our algorithm will also be beneficial in this setting, due to the cache efficiency gained by using TSQR.

2 Performance Cost Model

In this section, we detail our algorithmic performance cost model for parallel execution on p processors. We will use the α-β-γ model, which expresses algorithmic costs in terms of computation and communication costs, the latter being composed of a bandwidth cost as well as a latency cost.
We assume the cost of a message of size w words is α + βw, where α is the per-message latency cost and β is the per-word bandwidth cost. We let γ be the cost per floating point operation. We ignore the network topology and measure the costs in parallel, so that the cost of two disjoint pairs of processors communicating the same-sized message simultaneously is the same as that of one message. We also assume that two processors can exchange equal-sized messages simultaneously, but a processor can communicate with only one other processor at a time. Our algorithmic analysis will depend on the costs of collective communication, particularly broadcasts,

reductions, and all-reductions, and we consider both tree-based and bidirectional-exchange (or recursive doubling/halving) algorithms [12, 13, 14]. For arrays of size w, the collectives scatter, gather, allgather, and reduce-scatter can all be performed with communication cost

α log p + β ((p - 1)/p) w    (1)

(reduce-scatter also incurs a computational cost of γ ((p - 1)/p) w). Since a broadcast can be performed with a scatter and an allgather, a reduction can be performed with a reduce-scatter and a gather, and an all-reduction can be performed with a reduce-scatter and an allgather, the communication costs of those collectives for large arrays are twice that of Equation (1). For small arrays (w < p), it is not possible to use pipelined or recursive-doubling algorithms, which require the subdivision of the message into many pieces; therefore collectives such as a binomial tree broadcast must be used. In this case, the communication cost of broadcast, reduction, and all-reduction is α log p + β w log p.

3 Previous Work

We distinguish between two types of QR factorization algorithms. We call an algorithm that distributes entire rows of the matrix to processors a 1D algorithm. Such algorithms are often used for tall and skinny matrices. Algorithms which distribute the matrix across a 2D grid of p_r × p_c processors are known as 2D algorithms. Many right-looking 2D algorithms for QR decomposition of nearly square matrices divide the matrix into column panels and work panel-by-panel, factoring the panel with a 1D algorithm and then updating the trailing matrix. We consider two such existing algorithms in this section: 2D-Householder-QR (using Householder-QR) and CAQR (using TSQR).

3.1 Householder-QR

We first present Householder-QR in Algorithm 1, following [15] so that each Householder vector has a unit diagonal entry. We use LAPACK [7] notation for the scalar quantities.^1 We present Algorithm 1 in Matlab-style notation as a sequential algorithm.
The algorithm works column-by-column, computing a Householder vector and then updating the trailing matrix to the right. The Householder vectors are stored in an m × b lower triangular matrix Y. Note that we do not include τ as part of the output because it can be recomputed from Y. While the algorithm works for general m and n, it is most commonly used when m ≫ n, such as for a panel factorization within a square QR decomposition. In LAPACK terms, this algorithm corresponds to geqr2 and is used as a subroutine in geqrf. In this case, we also compute an upper triangular matrix T so that

Q = prod_{i=1}^{n} (I - τ_i y_i y_i^T) = I - Y T Y^T,

which allows the application of Q^T to the trailing matrix to be done efficiently using matrix multiplication. Computing T is done in LAPACK with larft, but it can also be computed from Y^T Y by solving the equation Y^T Y = T^{-1} + T^{-T} for T^{-1} (since Y^T Y is symmetric and T^{-1} is triangular, the off-diagonal entries are equivalent and the diagonal entries differ by a factor of 2) [16].

3.1.1 1D Algorithm

We will make use of Householder-QR as a sequential algorithm, but there are parallelizations of the algorithm in libraries such as ScaLAPACK [8] and Elemental [9] against which we will compare our new approach. Assuming a 1D distribution across p processors, the parallelization of Householder-QR (Algorithm 1) requires communication at lines 2 and 8, both of which can be performed using an all-reduction.

^1 However, we depart from the LAPACK code in that there is no check for a zero norm of a subcolumn.
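The relation Y^T Y = T^{-1} + T^{-T} can be checked numerically. The sketch below (our own helper name, dense NumPy, not the paper's code) forms T^{-1} by taking the upper triangle of Y^T Y and halving its diagonal; the resulting Q = I - Y T Y^T is orthogonal for any unit lower triangular Y.

```python
import numpy as np

def t_from_y(Y):
    """Recover the upper triangular T with Q = I - Y T Y^T from the
    identity Y^T Y = T^{-1} + T^{-T} (Y has unit diagonal)."""
    G = Y.T @ Y
    Tinv = np.triu(G)                       # upper triangle of Y^T Y
    np.fill_diagonal(Tinv, np.diag(G) / 2)  # diagonal entries differ by a factor of 2
    return np.linalg.inv(Tinv)

# quick check: Q = I - Y T Y^T is orthogonal
rng = np.random.default_rng(0)
m, b = 8, 3
Y = np.tril(rng.standard_normal((m, b)), -1)
Y[np.arange(b), np.arange(b)] = 1.0          # implicit unit diagonal, made explicit
T = t_from_y(Y)
Q = np.eye(m) - Y @ T @ Y.T
assert np.allclose(Q.T @ Q, np.eye(m))
```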

Algorithm 1 [Y, R] = Householder-QR(A)
Require: A is m × b
1: for i = 1 to b do
   % Compute the Householder vector
2:   α = A(i, i), β = ||A(i:m, i)||_2
3:   if α > 0 then
4:     β = -β
5:   end if
6:   A(i, i) = β, τ(i) = (β - α)/β
7:   A(i+1:m, i) = (1/(α - β)) A(i+1:m, i)
   % Apply the Householder transformation to the trailing matrix
8:   z = τ(i) [A(i, i+1:b) + A(i+1:m, i)^T A(i+1:m, i+1:b)]
9:   A(i, i+1:b) = A(i, i+1:b) - z, A(i+1:m, i+1:b) = A(i+1:m, i+1:b) - A(i+1:m, i) z
10: end for
Ensure: A = (prod_{i=1}^{b} (I - τ_i y_i y_i^T)) R
Ensure: R overwrites the upper triangle and Y (the Householder vectors) has implicit unit diagonal and overwrites the strict lower triangle of A; τ is an array of length b with τ_i = 2/(y_i^T y_i)

Because these steps occur for each column in the matrix, the total latency cost of the algorithm is 2b log p. This synchronization cost is a potential parallel scaling bottleneck, since it grows with the number of columns of the matrix and does not decrease with the number of processors. The bandwidth cost is dominated by the all-reduction of the z vector, which has length b - i at the ith step. Thus, the bandwidth cost of Householder-QR is (b^2/2) log p. The computation within the algorithm is load balanced for a parallel cost of (2mb^2 - 2b^3/3)/p flops.

3.1.2 2D Algorithm

In the context of a 2D algorithm, in order to perform an update with the computed Householder vectors, we must also compute the T matrix from Y in parallel. The leading order cost of computing T^{-1} from Y^T Y is mb^2/p flops plus the cost of reducing a symmetric b × b matrix, α log p + β b^2/2; note that the communication costs are lower order terms compared to the cost of computing Y. We present the costs of parallel Householder-QR in the first row of Table 1, combining the costs of Algorithm 1 with those of computing T. We refer to the 2D algorithm that uses Householder-QR as the panel factorization as 2D-Householder-QR. In ScaLAPACK terms, this algorithm corresponds to pxgeqrf.
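Algorithm 1 translates directly into NumPy. The sketch below (0-based indexing, no zero-norm check, consistent with the footnote above) overwrites a copy of A with R in the upper triangle and the Householder vectors Y strictly below it.

```python
import numpy as np

def householder_qr(A):
    """Algorithm 1: column-by-column Householder QR with unit-diagonal
    Householder vectors. Returns (overwritten A, tau)."""
    A = A.astype(float).copy()
    m, b = A.shape
    tau = np.zeros(b)
    for i in range(b):
        alpha = A[i, i]
        beta = np.linalg.norm(A[i:, i])
        if alpha > 0:
            beta = -beta                      # beta = -sgn(alpha) * ||A(i:m, i)||_2
        A[i, i] = beta
        tau[i] = (beta - alpha) / beta
        A[i+1:, i] /= alpha - beta            # scale vector so y(i) = 1 implicitly
        # apply I - tau * y * y^T to the trailing matrix
        z = tau[i] * (A[i, i+1:] + A[i+1:, i] @ A[i+1:, i+1:])
        A[i, i+1:] -= z
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], z)
    return A, tau
```

Rebuilding Q from the stored vectors (product of the b reflectors) recovers A = QR, which is an easy correctness check.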
The overall cost of 2D-Householder-QR, which includes panel factorizations and trailing matrix updates, is given to leading order by

γ ( (6mnb - 3n^2 b)/(2 p_r) + (n^2 b)/(2 p_c) + (2mn^2 - 2n^3/3)/p )
+ β ( nb log p_r + (mn - n^2/2)/p_r + n^2/p_c )
+ α ( 2n log p_r + (2n/b) log p_c ).

The derivation of these costs is very similar to that of CAQR-HR (see Section 4.3). If we pick p_r = p_c = sqrt(p) (assuming m ≈ n) and b = n/(sqrt(p) log p), then we obtain the leading order costs

γ (2mn^2 - 2n^3/3)/p + β (mn + n^2/2)/sqrt(p) + α n log p.

Note that these costs match those of [1, 8], with exceptions coming from the use of more efficient collectives. The choice of b is made to preserve the leading constants of the parallel computational cost. We present the costs of 2D-Householder-QR in the first row of Table 2.

3.2 Communication-Avoiding QR

In this section we present parallel Tall-Skinny QR (TSQR) [1, Algorithm 1] and Communication-Avoiding QR (CAQR) [1, Algorithm 2], which are algorithms for computing a QR decomposition that are more

communication efficient than Householder-QR, particularly for tall and skinny matrices.

3.2.1 1D Algorithm (TSQR)

We present a simplified version of TSQR in Algorithm 2: we assume the number of processors is a power of two and use a binary reduction tree (TSQR can be performed with any tree). Note also that we present a reduction algorithm rather than an all-reduction (i.e., the final R resides on only one processor at the end of the algorithm). TSQR assumes the tall-skinny matrix A is distributed in block row layout so that each processor owns a (m/p) × b submatrix. After each processor computes a local QR factorization of its submatrix (line 1), the algorithm works by reducing the remaining b × b triangles to one final upper triangular R = Q^T A (lines 2-10). The Q that triangularizes A is stored implicitly as a tree of sets of Householder vectors, given by {Y_{i,k}}. In particular, Y_{i,k} is the set of Householder vectors computed by processor i at the kth level of the tree. The ith leaf of the tree, Y_{i,0}, is the set of Householder vectors which processor i computes by doing a local QR on its part of the initial matrix A. In the case of a binary tree, every internal node of the tree consists of a QR factorization of two stacked b × b triangles (line 6). This sparsity structure can be exploited, saving a constant factor of computation compared to a QR factorization of a dense 2b × b matrix. In fact, as of version 3.4, LAPACK has subroutines for exploiting this and similar sparsity structures (tpqrt). Furthermore, the Householder vectors generated during the QR factorization of stacked triangles have similar sparsity; the structure of the Y_{i,k} for k > 0 is an identity matrix stacked on top of a triangle.
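The reduction pattern of Algorithm 2 can be simulated sequentially in a few lines. The sketch below (our own code, not the paper's; it assumes p is a power of two and m is divisible by p) keeps only the R factors and discards the Householder trees for brevity.

```python
import numpy as np

def tsqr(A, p):
    """Binary-tree TSQR (Algorithm 2), simulated sequentially: QR each of the
    p row blocks, then repeatedly stack pairs of R factors and re-factor.
    The implicit Householder tree {Y_ik} is dropped; only R is returned."""
    Rs = [np.linalg.qr(blk)[1] for blk in np.split(A, p)]  # leaf QRs (line 1)
    while len(Rs) > 1:                                     # one pass per tree level
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]
```

Because the R factor of a full-rank matrix is unique up to the signs of its rows, the result agrees with a direct QR of A up to a diagonal sign matrix.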
Because the set {Y_{i,k}} constitutes the implicit tree representation of the orthogonal factor, it is important to note how the tree is distributed across processors, since the Q factor will be either applied to a matrix (see [1, Algorithm 2]) or constructed explicitly (see Section 3.2.3). For a given Y_{i,k}, i indicates the processor number (both where it is computed and where it is stored), and k indicates the level in the tree. Although 0 ≤ i ≤ p - 1 and 0 ≤ k ≤ log p, many of the Y_{i,k} are null; for example, Y_{i,log p} is null for i > 0 and Y_{1,k} is null for k > 0. In the case of the binary tree in Algorithm 2, processor 0 computes and stores Y_{0,k} for 0 ≤ k ≤ log p and thus requires O(b^2 log p) extra memory. The costs and analysis of TSQR are given in [1, 17]:

γ (2mb^2/p + (2b^3/3) log p) + β ((b^2/2) log p) + α log p.

We tabulate these costs in the second row of Table 1. We note that the TSQR inner tree factorizations require an extra computational cost of O(b^3 log p) and a bandwidth cost of O(b^2 log p). Also note that in the context of a 2D algorithm, using TSQR as the panel factorization implies that there is no b × b T matrix to compute; the update of the trailing matrix is performed differently.

3.2.2 2D Algorithm (CAQR)

The 2D algorithm that uses TSQR for panel factorizations is known as CAQR. In order to update the trailing matrix within CAQR, the implicit orthogonal factor computed from TSQR needs to be applied as a tree in the same order as it was computed. See [1, Algorithm 2] for a description of this process, or see [18, Algorithm 4] for pseudocode that matches the binary tree in Algorithm 2. We refer to this application of the implicit Q^T as Apply-TSQR-QT. The algorithm has the same tree dependency flow structure as TSQR but requires a bidirectional exchange between paired nodes in the tree. We note that in internal nodes of the tree it is possible to exploit the additional sparsity structure of Y_{i,k} (an identity matrix stacked on top of a triangular matrix), which our implementation does via the use of the LAPACK v3.4+ routine tpmqrt.
Further, since A is m × n and intermediate values of rows of A are communicated, the trailing matrix update costs more than TSQR when n > b. In the context of CAQR on a square matrix, Apply-TSQR-QT is performed on a trailing matrix with n ≤ m columns. The extra work in the application of the inner leaves of the tree is proportional to O(n^2 b log(p)/p_c) and the bandwidth cost is proportional to O(n^2 log(p)/p_c). Since the cost of Apply-TSQR-QT is almost leading order in CAQR, it is desirable in practice to optimize

the update routine. However, the tree dependency structure complicates this manual developer or compiler optimization task.

Algorithm 2 [{Y_{i,k}}, R] = TSQR(A)
Require: Number of processors is p and i is the processor index
Require: A is m × b matrix distributed in block row layout; A_i is processor i's block
1: [Y_{i,0}, R_i] = Householder-QR(A_i)
2: for k = 1 to log p do
3:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
4:     j = i + 2^(k-1)
5:     Receive R_j from processor j
6:     [Y_{i,k}, R_i] = Householder-QR([R_i; R_j])
7:   else if i ≡ 2^(k-1) mod 2^k then
8:     Send R_i to processor i - 2^(k-1)
9:   end if
10: end for
11: if i = 0 then
12:   R = R_0
13: end if
Ensure: A = QR with Q implicitly represented by {Y_{i,k}}
Ensure: R is stored by processor 0 and Y_{i,k} is stored by processor i

The overall cost of CAQR is given to leading order by

γ ( (6mnb - 3n^2 b)/(2 p_r) + ((4nb^2)/3 + (3n^2 b)/(2 p_c)) log p_r + (6mn^2 - 2n^3)/(3p) )
+ β ( (nb/2 + n^2/(2 p_c)) log p_r + ((mn - n^2/2)/p_r) log p_r )
+ α ( (3n/b) log p_r + (4n/b) log p_c ).

See [1] for a discussion of these costs and [17] for detailed analysis. Note that the bandwidth cost is slightly lower here due to the use of more efficient broadcasts. If we pick p_r = p_c = sqrt(p) (assuming m ≈ n) and b = n/(sqrt(p) log^2 p), then we obtain the leading order costs

γ (2mn^2 - 2n^3/3)/p + β ((mn + n^2/2) log p)/sqrt(p) + α (7/2) sqrt(p) log^3 p.

Again, we choose b to preserve the leading constants of the computational cost. Note that b needs to be chosen smaller here than in Section 3.1.2 due to the costs associated with Apply-TSQR-QT. It is possible to reduce the costs of Apply-TSQR-QT further using ideas from efficient recursive doubling/halving collectives. See Appendix A for details.

3.2.3 Constructing Explicit Q from TSQR

In many use cases of QR decomposition, an explicit orthogonal factor is not necessary; rather, we need only the ability to apply the matrix (or its transpose) to another matrix, as done in the previous section. For our purposes (see Section 4), we will need to form the explicit m × b orthonormal matrix from the implicit tree representation.
Though it is not necessary within CAQR, we describe it here because it is a known algorithm (see [19, Figure 4]) and the structure of the algorithm is very similar to TSQR. In LAPACK terms, constructing (i.e., generating) the orthogonal factor when it is stored as a set of Householder vectors is done with orgqr.

Algorithm 3 [B] = Apply-TSQR-QT({Y_{i,k}}, A)
Require: Number of processors is p and i is the processor index
Require: A is m × n matrix distributed in block row layout; A_i is processor i's block
Require: {Y_{i,k}} is the implicit tree TSQR representation of b Householder vectors of length m.
1: B_i = Apply-Householder-QT(Y_{i,0}, A_i)
2: Let B̄_i be the first b rows of B_i
3: for k = 1 to log p do
4:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
5:     j = i + 2^(k-1)
6:     Receive B̄_j from processor j
7:     [B̄_i; B̄_j] = Apply-Householder-QT(Y_{i,k}, [B̄_i; B̄_j])
8:     Send B̄_j back to processor j
9:   else if i ≡ 2^(k-1) mod 2^k then
10:    Send B̄_i to processor i - 2^(k-1)
11:    Receive updated rows B̄_i from processor i - 2^(k-1)
12:    Set the first b rows of B_i to B̄_i
13:  end if
14: end for
Ensure: B = Q^T A with processor i owning block B_i, where Q is the orthogonal matrix implicitly represented by {Y_{i,k}}

Algorithm 4 presents the method for constructing the m × b matrix Q by applying the (implicit) square orthogonal factor to the first b columns of the m × m identity matrix. Note that while we present Algorithm 4 assuming a binary tree, any tree shape is possible, as long as the implicit Q is computed using the same tree shape as TSQR. While the nodes of the tree are computed from leaves to root, they will be applied in reverse order from root to leaves. Note that in order to minimize the computational cost, the sparsity of the identity matrix at the root node and the sparsity structure of {Y_{i,k}} at the inner tree nodes is exploited. In particular, since I_b is upper triangular and Y_{0,log p} has the structure of an identity matrix stacked on top of an upper triangle, the output of the root node application in line 7 is a stack of two upper triangles. By induction, since every Y_{i,k} for k > 0 has the same structure, all outputs Q̄_{i,k} are triangular matrices. At the leaf nodes, when Y_{i,0} is applied in line 13, the output is a dense (m/p) × b block.
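The tree flow of Algorithms 2 and 3 can be simulated sequentially with dense kernels. In the sketch below (our own helper names; reduced Q factors stand in for the implicit Y_{i,k} trees, so only the b-row "tips" survive each level, whereas the real algorithm also updates and returns the complementary rows to the paired processors):

```python
import numpy as np

def tsqr_tree(A, p):
    """Leaf QRs plus pairwise reductions; returns (tree of node Q factors, R)."""
    Qs, Rs = zip(*[np.linalg.qr(blk) for blk in np.split(A, p)])
    tree, Rs = [list(Qs)], list(Rs)
    while len(Rs) > 1:
        Qs, Rs = zip(*[np.linalg.qr(np.vstack(pair))
                       for pair in zip(Rs[0::2], Rs[1::2])])
        tree.append(list(Qs))
        Rs = list(Rs)
    return tree, Rs[0]

def apply_tree_qt(tree, A, p):
    """Apply the implicit Q^T to A leaf-to-root, mirroring Apply-TSQR-QT."""
    tips = [Q.T @ blk for Q, blk in zip(tree[0], np.split(A, p))]  # leaf applies
    for level in tree[1:]:
        tips = [Q.T @ np.vstack(pair)                              # inner tree nodes
                for Q, pair in zip(level, zip(tips[0::2], tips[1::2]))]
    return tips[0]  # top b rows of Q^T A; equals R when A itself is factored
```

Applying the tree back to A itself should reproduce R, which gives a direct consistency check between the factorization and the application order.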
Since the communicated matrices Q̄_j are triangular, just as the R_j were triangular in the TSQR algorithm, Construct-TSQR-Q incurs exactly the same computational and communication costs as TSQR. So, we can reconstruct the unique part of the Q matrix from the implicit form given by TSQR for the same cost as the TSQR itself.

Algorithm 4 Q = Construct-TSQR-Q({Y_{i,k}})
Require: Number of processors is p and i is the processor index
Require: {Y_{i,k}} is computed by Algorithm 2 so that Y_{i,k} is stored by processor i
1: if i = 0 then
2:   Q̄_0 = I_b
3: end if
4: for k = log p down to 1 do
5:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
6:     j = i + 2^(k-1)
7:     [Q̄_i; Q̄_j] = Apply-Householder-Q(Y_{i,k}, [Q̄_i; 0])
8:     Send Q̄_j to processor j
9:   else if i ≡ 2^(k-1) mod 2^k then
10:    Receive Q̄_i from processor i - 2^(k-1)
11:  end if
12: end for
13: Q_i = Apply-Q-to-Triangle(Y_{i,0}, [Q̄_i; 0])
Ensure: Q is an orthonormal m × b matrix distributed in block row layout; Q_i is processor i's block

3.3 Yamamoto's Basis-Kernel Representation

The main goal of this work is to combine Householder-QR with CAQR; Yamamoto [11] proposes a scheme to achieve this. As described in Section 3.1, 2D-Householder-QR suffers from a communication bottleneck in the panel factorization. TSQR alleviates that bottleneck but requires a more complicated (and slightly less efficient) trailing matrix update. Motivated in part to improve the performance and programmability of a hybrid CPU/GPU implementation, Yamamoto suggests computing a representation of the orthogonal factor that triangularizes the panel that mimics the representation in Householder-QR. As described by Sun and Bischof [20], there are many so-called basis-kernel representations of an orthogonal matrix. See also [21] for a general discussion of block reflectors. For example, the Householder-QR algorithm computes a lower triangular matrix Y such that A = (I - Y T Y_1^T) R, so that

Q = I - Y T Y^T = I - [Y_1; Y_2] T [Y_1^T Y_2^T].    (2)

Here, Y is called the basis and T is called the kernel in this representation of the square orthogonal factor Q. However, there are many such basis-kernel representations if we do not restrict Y and T to be lower and upper triangular matrices, respectively. Yamamoto [11] chooses a basis-kernel representation that is easy to compute. For an m × b matrix A, let A = [Q_1; Q_2] R where Q_1 and R are b × b. Then define the basis-kernel representation

Q̃ = I - Ỹ T̃ Ỹ^T = I - [Q_1 - I; Q_2] (I - Q_1)^{-T} [(Q_1 - I)^T Q_2^T],    (3)

where I - Q_1 is assumed to be nonsingular. It can be easily verified that Q̃^T Q̃ = I and Q̃^T A = [R; 0]; in fact, this is the representation suggested and validated by [22, Theorem 3]. Note that both the basis and kernel matrices Ỹ and T̃ are dense. The main advantage of basis-kernel representations is that they can be used to apply the orthogonal factor (or its transpose) very efficiently using matrix multiplication. In particular, the computational complexity of applying Q̃^T using any basis-kernel representation is the same to leading order, assuming Y has the same dimensions as A and m ≫ b. Thus, it is not necessary to reconstruct the Householder vectors; from a computational perspective, finding any basis-kernel representation of the orthogonal factor computed by TSQR will do.
Note also that in order to apply Q̃^T with the representation in Equation (3), we need to apply the inverse of I - Q_1, so we need to perform an LU decomposition of that b × b matrix and then apply the inverses of the triangular factors using triangular solves. The assumption that I - Q_1 is nonsingular can be dropped by replacing I with a diagonal sign matrix S chosen so that S - Q_1 is nonsingular [23]; in this case the representation becomes

Q̃S = I - Ỹ T̃ Ỹ^T = I - [Q_1 - S; Q_2] S (S - Q_1)^{-T} [(Q_1 - S)^T Q_2^T].    (4)
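Equation (4) can be verified numerically. In the sketch below (our own dense NumPy check, not the distributed implementation), the basis is Ỹ = [Q_1 - S; Q_2] and the kernel is T̃ = S (S - Q_1)^{-T}; the resulting matrix is orthogonal and its first b columns equal Q S.

```python
import numpy as np

rng = np.random.default_rng(3)
m, b = 7, 3
Q = np.linalg.qr(rng.standard_normal((m, b)))[0]   # m-by-b orthonormal factor
Q1 = Q[:b]
# diagonal sign matrix making S - Q1 safely nonsingular: |S(i,i) - Q1(i,i)| >= 1
S = np.diag(np.where(np.diag(Q1) >= 0, -1.0, 1.0))
Yt = Q - np.vstack([S, np.zeros((m - b, b))])      # basis [Q1 - S; Q2]
Tt = S @ np.linalg.inv((S - Q1).T)                 # kernel S (S - Q1)^{-T}
Qt = np.eye(m) - Yt @ Tt @ Yt.T                    # Equation (4)
assert np.allclose(Qt.T @ Qt, np.eye(m))           # orthogonality
assert np.allclose(Qt[:, :b], Q @ S)               # first b columns reproduce Q S
```

Note how both the basis and kernel here are dense, in contrast to the triangular Y and T of the Householder representation.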

Yamamoto's approach is very closely related to TSQR-HR (Algorithm 6), presented in Section 4. We compare the methods in Section 4.4.

4 New Algorithms

We first present our main contribution, a parallel algorithm that performs TSQR and then reconstructs the Householder vector representation from the TSQR representation of the orthogonal factor. We then show that this reconstruction algorithm may be used as a building block for more efficient 2D QR algorithms. In particular, the algorithm is able to combine two existing approaches for 2D QR factorizations, leveraging the efficiency of TSQR in panel factorizations and the efficiency of Householder-QR in trailing matrix updates. While Householder reconstruction adds some extra cost to the panel factorization, we show that its use in the 2D algorithm reduces overall communication compared to both 2D-Householder-QR and CAQR.

4.1 TSQR with Householder Reconstruction

The basic steps of our 1D algorithm include performing TSQR, constructing the explicit tall-skinny Q factor, and then computing the Householder vectors corresponding to Q. The key idea of our reconstruction algorithm is that performing Householder-QR on an orthonormal matrix Q is the same as performing an LU decomposition on Q - S, where S is a diagonal sign matrix corresponding to the sign choices made inside the Householder-QR algorithm. This claim is proved explicitly in Lemma 5.2. Informally, ignoring signs, if Q = I - Y T Y^T with Y a matrix of Householder vectors, then Y (T Y_1^T) is an LU decomposition of Q - I, since Y is unit lower triangular and T Y_1^T is upper triangular.

In this section we present Modified-LU as Algorithm 5, which can be applied to any orthonormal matrix (not necessarily one obtained from TSQR). Ignoring lines 1, 3, and 4, it is exactly LU decomposition without pivoting. Note that with the choice of S, no pivoting is required, since the effective diagonal entry will be at least 1 in absolute value and all other entries in the column are bounded by 1 (the matrix is orthonormal).^3 This holds true throughout the entire factorization because the trailing matrix remains orthonormal, which we prove within Lemma 5.2.

Algorithm 5 [L, U, S] = Modified-LU(Q)
Require: Q is an m × b orthonormal matrix
1: S = 0
2: for i = 1 to b do
3:   S(i, i) = -sgn(Q(i, i))
4:   Q(i, i) = Q(i, i) - S(i, i)
     % Scale ith column of L by diagonal element
5:   Q(i+1:m, i) = (1/Q(i, i)) Q(i+1:m, i)
     % Perform Schur complement update
6:   Q(i+1:m, i+1:b) = Q(i+1:m, i+1:b) - Q(i+1:m, i) Q(i, i+1:b)
7: end for
Ensure: U overwrites the upper triangle and L has implicit unit diagonal and overwrites the strict lower triangle of Q; S is diagonal so that Q - S = LU

Given the algorithms of the previous sections, we now present the full approach for computing the QR decomposition of a tall-skinny matrix using TSQR and Householder reconstruction. That is, in this section we present an algorithm whose output format is identical to that of Householder-QR. However, we argue that the communication costs of this approach are much less than those of performing Householder-QR.

^3 We use the convention sgn(0) = 1.
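A minimal dense sketch of Algorithm 5 (our own code; the parallel version instead factors the top b × b block and broadcasts U for the remaining triangular solves):

```python
import numpy as np

def modified_lu(Q):
    """Algorithm 5 sketch: LU without pivoting applied to Q - S, where the
    diagonal sign matrix S(i,i) = -sgn(Q(i,i)) makes every pivot at least 1
    in absolute value. Returns (L, U, S) with Q - [S; 0] = L U."""
    W = Q.astype(float).copy()
    m, b = W.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if W[i, i] >= 0 else 1.0   # -sgn(Q(i,i)), with sgn(0) = 1
        W[i, i] -= s[i]                        # effective pivot, |pivot| >= 1
        W[i+1:, i] /= W[i, i]                  # scale ith column of L
        W[i+1:, i+1:] -= np.outer(W[i+1:, i], W[i, i+1:])  # Schur complement
    L = np.tril(W, -1) + np.eye(m, b)
    U = np.triu(W[:b])
    return L, U, np.diag(s)
```

Running this on an orthonormal Q confirms both the factorization Q - [S; 0] = LU and that no pivot falls below 1 in magnitude.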

                  Flops                      Words           Messages
Householder-QR    (3mb^2 - 2b^3/3)/p         (b^2/2) log p   2b log p
TSQR              2mb^2/p + (2b^3/3) log p   (b^2/2) log p   log p
TSQR-HR           5mb^2/p + (4b^3/3) log p   b^2 log p       4 log p

Table 1: Costs of QR factorization of a tall-skinny m × b matrix distributed over p processors in 1D fashion. We assume these algorithms are used as panel factorizations in the context of a 2D algorithm applied to an m × n matrix. Thus, the costs of Householder-QR and TSQR-HR include the costs of computing T.

The method, given as Algorithm 6, is to perform TSQR (line 1), construct the tall-skinny Q factor explicitly (line 2), and then compute the Householder vectors that represent that orthogonal factor using Modified-LU (line 3). The R factor is computed in line 1 and the Householder vectors (the columns of Y) are computed in line 3. An added benefit of the approach is that the triangular T matrix, which allows for block application of the Householder vectors, can be computed very cheaply. That is, a triangular solve involving the upper triangular factor from Modified-LU computes the T so that A = (I - Y T Y_1^T) R. To compute T directly from Y (as is necessary if Householder-QR is used) requires O(mb^2) flops; here the triangular solve involves only O(b^3) flops. Our approach for computing T is given in line 4, and line 5 ensures sign agreement between the columns of the (implicitly stored) orthogonal factor and the rows of R.

Algorithm 6 [Y, T, R] = TSQR-HR(A)
Require: A is m × b matrix distributed in block row layout
1: [{Y_{i,k}}, R̄] = TSQR(A)
2: Q = Construct-TSQR-Q({Y_{i,k}})
3: [Y, U, S] = Modified-LU(Q)
4: T = -U S Y_1^{-T}
5: R = S R̄
Ensure: A = (I - Y T Y_1^T) R, where Y is m × b and unit lower triangular, Y_1 is the top b × b block of Y, and T and R are b × b and upper triangular

On p processors, Algorithm 6 incurs the following costs (ignoring lower order terms):

1. Compute [{Y_{i,k}}, R̄] = TSQR(A)

The computational costs of this step come from lines 1 and 6 in Algorithm 2. Line 1 corresponds to a QR factorization of a (m/p) × b matrix, with a flop count of 2(m/p)b^2 - 2b^3/3 (each processor performs this step simultaneously).
Line 6 corresponds to a QR factorization of a b x b triangle stacked on top of a b x b triangle. Exploiting the sparsity structure, the flop count is 2b^3/3; this occurs at every internal node of the binary tree, so the total cost in parallel is (2b^3/3) log p. The communication costs of Algorithm 2 occur in lines 5 and 8. Since every R_{i,k} is a b x b upper triangular matrix, the cost of a single message is α + β(b^2/2). This occurs at every internal node in the tree, so the total communication cost in parallel is a factor log p larger. Thus, the cost of this step is

γ(2mb^2/p + (2b^3/3) log p) + β((b^2/2) log p) + α(log p).

2. Q = Construct-TSQR-Q({Y_{i,k}}). The computational costs of this step come from lines 7 and 13 in Algorithm 4. Note that for k > 0, Y_{i,k} is a 2b x b matrix: the identity matrix stacked on top of an upper triangular matrix. Furthermore, Q_{i,k} is an upper triangular matrix. Exploiting the structure of Y_{i,k} and Q_{i,k}, the cost of line 7 is 2b^3/3, which occurs at every internal node of the tree. Each Y_{i,0} is an (m/p) x b lower triangular

matrix of Householder vectors, so the cost of applying them to an upper triangular matrix in line 13 is 2(m/p)b^2 - 2b^3/3. Note that the computational cost of these two lines is the same as that of the previous step in Algorithm 2. The communication pattern of Algorithm 4 is identical to that of Algorithm 2, so the communication cost is also the same as that of the previous step. Thus, the cost of constructing Q is

γ(2mb^2/p + (2b^3/3) log p) + β((b^2/2) log p) + α(log p).

3. [Y, U, S] = Modified-LU(Q). Ignoring lines 3-4 in Algorithm 5, Modified-LU is the same as LU without pivoting. In parallel, the algorithm consists of a b x b (modified) LU factorization of the top block followed by parallel triangular solves to compute the rest of the lower triangular factor. The flop count of the b x b LU factorization is 2b^3/3, and the cost of each processor's triangular solve is (m/p)b^2. The communication cost of parallel Modified-LU is only that of a broadcast of the upper triangular b x b U factor (for which we use a bidirectional-exchange algorithm): β(b^2) + α(2 log p). Thus, the cost of this step is

γ(mb^2/p + 2b^3/3) + β(b^2) + α(2 log p).

4. T = -U S Y_1^{-T} and R = S R̄. The last two steps consist of computing T and scaling the final R appropriately. Since S is a sign matrix, computing US and S R̄ requires no floating point operations and can be done locally on one processor. Thus, the only computational cost is performing a b x b triangular solve involving Y_1^T. If we ignore the triangular structure of the output, the flop count of this operation is b^3. However, since this operation occurs on the same processor that computes the top (m/p) x b block of Y and U, it can be overlapped with the previous step (Modified-LU). After the top processor performs the b x b LU factorization and broadcasts the U factor, it computes only m/p - b rows of Y (all other processors update m/p rows). Thus, performing an extra b x b triangular solve on the top processor adds no computational cost to the critical path of the algorithm.
Thus, TSQR-HR(A), where A is m-by-b, incurs the following costs (ignoring lower order terms):

γ(5mb^2/p + (4b^3/3) log p) + β(b^2 log p) + α(4 log p).   (5)

Note that the LU factorization required in Yamamoto's approach (see Section 3.3) is equivalent to performing Modified-LU(-Q̃_1). In Algorithm 6, the Modified-LU algorithm is applied to an m x b matrix rather than to only the top b x b block; since no pivoting is required, the only difference is the update of the bottom m - b rows with a triangular solve. Thus it is not hard to see that, ignoring signs, the Householder basis-kernel representation in Equation (2) can be obtained from the representation given in Equation (3) with two triangular solves: if the LU factorization gives I - Q̃_1 = LU, then Y = Ỹ U^{-1} and T = U L^{-T}. Indeed, performing these two operations and handling the signs correctly gives Algorithm 6. While Yamamoto's approach avoids performing the triangular solve on Q, it still involves performing both TSQR and Construct-TSQR-Q.
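The overall flow of TSQR-HR can be sketched serially in NumPy. In the self-contained sketch below, np.linalg.qr stands in for the distributed TSQR and Construct-TSQR-Q subroutines (an assumption made purely for illustration), and the final assertions check the Householder-format output A = (I - Y T Y_1^T) R:

```python
import numpy as np

def modified_lu(Q):
    """Compact Algorithm 5: unpivoted LU of Q - [S; 0] with S(i,i) = -sgn(Q(i,i))."""
    Q = Q.copy()
    m, b = Q.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if Q[i, i] >= 0 else 1.0
        Q[i, i] -= s[i]
        Q[i + 1:, i] /= Q[i, i]
        Q[i + 1:, i + 1:] -= np.outer(Q[i + 1:, i], Q[i, i + 1:])
    return np.tril(Q, -1) + np.eye(m, b), np.triu(Q[:b, :]), s

def tsqr_hr(A):
    """Sketch of Algorithm 6; np.linalg.qr stands in for TSQR + Construct-TSQR-Q."""
    m, b = A.shape
    Q, Rbar = np.linalg.qr(A)                 # lines 1-2: explicit orthonormal factor
    Y, U, s = modified_lu(Q)                  # line 3
    S = np.diag(s)
    T = -U @ S @ np.linalg.inv(Y[:b, :].T)    # line 4: T = -U S Y1^{-T}
    R = S @ Rbar                              # line 5: fix the row signs of R
    return Y, T, R

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 5))
Y, T, R = tsqr_hr(A)
m, b = A.shape
# Householder-format output: A = [R; 0] - Y T (Y1^T R)
recon = np.vstack([R, np.zeros((m - b, b))]) - Y @ (T @ (Y[:b, :].T @ R))
assert np.allclose(recon, A)
assert np.allclose(Y, np.tril(Y)) and Y[0, 0] == 1.0
assert np.allclose(T, np.triu(T))
```

The explicit inverse of Y_1^T appears here only for brevity; a production code would use a triangular solve, as the cost analysis above assumes.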

                    Flops                 Words                 Messages
2D-Householder-QR   (2mn^2 - 2n^3/3)/p    (mn + n^2)/√p         2n log p
CAQR                (2mn^2 - 2n^3/3)/p    (mn + n^2 log p)/√p   7√p log^3 p
CAQR-HR             (2mn^2 - 2n^3/3)/p    (mn + n^2)/√p         6√p log^2 p

Table 2: Costs of QR factorization of an m x n matrix distributed over p processors in a 2D fashion. Here we assume a square processor grid (p_r = p_c = √p). We also choose block sizes for each algorithm independently to ensure the leading order terms for flops are identical.

4.2 LU Decomposition of A - R

Computing the Householder basis-kernel representation via Algorithm 6 requires forming the explicit m x b orthonormal matrix. It is possible to compute this representation more cheaply, though at the expense of losing numerical stability. Suppose A is full rank, R = [R_1; 0] is the upper triangular factor computed by TSQR, and A - R is also full rank. Then A = (I - Y T Y_1^T) R implies that the unique LU decomposition of A - R is given by

A - R = Y (-T Y_1^T R_1),   (6)

since Y is unit lower triangular and T, Y_1^T, and R_1 are all upper triangular. The assumption that A - R is full rank can be dropped by choosing an appropriate diagonal sign matrix S and computing the LU decomposition of A - [S R_1; 0]. Not surprisingly, choosing the signs to match the choices of the Householder-QR algorithm applied to A guarantees that the matrix is nonsingular. In fact, the signs can be computed during the course of a modified LU decomposition algorithm very similar to Algorithm 5 (designed for general matrices rather than only orthonormal ones). This more general algorithm can also be interpreted as performing Householder-QR on A but using R to provide hints to the algorithm to avoid certain expensive computations; see Appendix B for more details. The advantage of this approach is that Y can be computed without constructing Q explicitly, which is as expensive as TSQR itself. Unfortunately, this method is not numerically stable. An intuition for the instability comes from considering a low-rank matrix A.
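For a well-conditioned A, the representation in Equation (6) can be observed numerically. The sketch below is hypothetical illustration code: it assumes A - R is full rank (a small diagonal shift makes this likely) and uses a plain unpivoted LU in place of the more careful modified algorithm described above:

```python
import numpy as np

def lu_nopivot(M):
    """Right-looking LU without pivoting (unstable for general matrices)."""
    M = M.copy()
    m, b = M.shape
    for i in range(b):
        M[i + 1:, i] /= M[i, i]
        M[i + 1:, i + 1:] -= np.outer(M[i + 1:, i], M[i, i + 1:])
    return np.tril(M, -1) + np.eye(m, b), np.triu(M[:b, :])

rng = np.random.default_rng(2)
m, b = 9, 3
# Diagonal shift keeps A and A - R well behaved (an assumption of this demo).
A = rng.standard_normal((m, b)) + 2 * np.eye(m, b)
_, R1 = np.linalg.qr(A)
Rfull = np.vstack([R1, np.zeros((m - b, b))])
Y, W = lu_nopivot(A - Rfull)   # W plays the role of -T Y1^T R1 in Equation (6)
# The unit lower trapezoidal factor reproduces A in Householder-like form.
assert np.allclose(Rfull + Y @ W, A)
assert np.allclose(Y, np.tril(Y))
```

The Householder vectors are read directly out of the lower trapezoidal LU factor without ever forming the explicit orthonormal factor, which is exactly the savings (and the stability risk) this section describes.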
Because the QR decomposition of A is not unique, the TSQR algorithm and the Householder-QR algorithm may compute different factor pairs (Q_TSQR, R_TSQR) and (Q_Hh, R_Hh). Performing an LU decomposition using A and R_TSQR to recover the Householder vectors that represent Q_Hh is hopeless if R_TSQR ≠ R_Hh, even in exact arithmetic. In floating point arithmetic, an ill-conditioned A corresponds to a factor R which is sensitive to roundoff error, so reconstructing the Householder vectors that would produce the same R as computed by TSQR is unstable. However, if A is well-conditioned, then this method is sufficient; one can view Algorithm 5 as applying the method to an orthonormal matrix (which is perfectly conditioned). One method of decreasing the sensitivity of R is to employ column pivoting. That is, after performing TSQR, apply QR with column pivoting to the b x b upper triangular factor R, yielding a factorization A = Q_TSQR (Q̂ R̂ P^T), or A P = (Q_TSQR Q̂) R̂, where the diagonal entries of R̂ decrease monotonically in absolute value. We have observed experimentally that performing the LU decomposition of A P - [S R̂; 0] is a more stable operation than not using column pivoting, though it is still not as robust as Algorithm 6. See Section 5 for a discussion of experiments where the instability occurs (even with column pivoting).

4.3 CAQR-HR

We refer to the 2D algorithm that uses TSQR-HR for panel factorizations as CAQR-HR. Because Householder-QR and TSQR-HR generate the same representation as output of the panel factorization, the trailing matrix

Algorithm 7 [Y, T, R] = CAQR-HR(A)
Require: A is m x n and distributed block-cyclically on p = p_r · p_c processors with block size b, so that each b x b block A_{ij} is owned by processor Π(i, j) = (i mod p_r) + p_r (j mod p_c)
 1: for i = 0 to n/b - 1 do
      % Compute TSQR and reconstruct Householder representation using a column of p_r processors
 2:   [Y_{i:m/b-1,i}, T_i, R_{ii}] = Hh-Recon-TSQR(A_{i:m/b-1,i})
      % Update trailing matrix using all processors
 3:   Π(i, i) broadcasts T_i to all other processors
 4:   for r in [i, m/b - 1], c in [i + 1, n/b - 1] do in parallel
 5:     Π(r, i) broadcasts Y_{ri} to all processors in its row, Π(r, :)
 6:     Π(r, c) computes W_{rc} = Y_{ri}^T A_{rc}
 7:     Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c
 8:     Π(r, c) computes A_{rc} = A_{rc} - Y_{ri} T_i^T W_c
 9:     Set R_{ic} = A_{ic}
10:   end for
11: end for
Ensure: A = [ ∏_{i=1}^{n/b} (I - Y_{:,i} T_i Y_{:,i}^T) ] R

update can be performed in exactly the same way. Thus, the only difference between 2D-Householder-QR and CAQR-HR, presented in Algorithm 7, is the subroutine call for the panel factorization (line 2). Algorithm 7, with block size b and an m-by-n matrix A, with m ≥ n and m ≡ n ≡ 0 (mod b), distributed on a 2D p_r-by-p_c processor grid, incurs the following costs over all n/b iterations.

1. Compute [Y_{i:m/b-1,i}, T_i, R_{ii}] = Hh-Recon-TSQR(A_{i:m/b-1,i}). Equation (5) gives the cost of a single panel TSQR factorization with Householder reconstruction. We can sum over all iterations to obtain the cost of this step in the 2D QR algorithm (line 2 in Algorithm 7):

Σ_{i=0}^{n/b-1} [ γ(5(m - ib)b^2/p_r + (4b^3/3) log p_r) + β(b^2 log p_r) + α(4 log p_r) ]
  = γ((5mnb - 5n^2 b/2)/p_r + (4nb^2/3) log p_r) + β(nb log p_r) + α((4n/b) log p_r).

2. Π(i, i) broadcasts T_i to all other processors. The matrix T_i is b-by-b and triangular, so we use a bidirectional exchange broadcast. Since there are a total of n/b iterations, the total communication cost for this step of Algorithm 7 is

β(nb) + α((2n/b) log p).

3.
Π(r, i) broadcasts Y_{ri} to all processors in its row, Π(r, :). At this step, each processor which took part in the TSQR and Householder reconstruction sends its local chunk of the panel of Householder vectors to the rest of the processors. At iteration i of Algorithm 7, each processor owns at most (m/b - i)/p_r blocks Y_{ri}. Since all the broadcasts happen along processor rows, we assume they can be done simultaneously on the network. The communication along the critical path is then given by

Σ_{i=0}^{n/b-1} [ β((m/b - i)b^2/p_r) + α(2 log p_c) ] = β((mn - n^2/2)/p_r) + α((2n/b) log p_c).

4. Π(r, c) computes W_{rc} = Y_{ri}^T A_{rc}. Each block W_{rc} is computed on processor Π(r, c), using A_{rc}, which it owns, and Y_{ri}, which it just received from processor Π(r, i). At iteration i, processor j may own up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks W_{rc} with Π(r, c) = j. Each block-by-block multiply incurs 2b^3 flops; therefore, this step incurs a total computational cost of

γ Σ_{i=0}^{n/b-1} (m/b - i)(n/b - i) 2b^3/p = γ((mn^2 - n^3/3)/p).

5. Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c. At iteration i of Algorithm 7, each processor j may own up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks W_{rc}. The local part of the reduction can be done during the computation of W_{rc} on line 6 of Algorithm 7. Therefore, no process should contribute more than (n/b - i)/p_c b-by-b blocks W_{rc} to the reduction. Using a bidirectional exchange all-reduction algorithm and summing over the iterations, we obtain the cost of this step throughout the entire execution of the 2D algorithm:

Σ_{i=0}^{n/b-1} [ β(2(n/b - i)b^2/p_c) + α(2 log p_r) ] = β(n^2/p_c) + α((2n/b) log p_r).

6. Π(r, c) computes A_{rc} = A_{rc} - Y_{ri} T_i^T W_c. Since in our case m ≥ n, it is faster to first multiply T_i^T by W_c rather than Y_{ri} by T_i^T. Any processor j may update up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks of A_{rc} using (m/b - i)/p_r blocks of Y_{ri} and (n/b - i)/p_c blocks of W_c. When multiplying T_i^T by W_c, we can exploit the triangular structure of T_i to lower the flop count by a factor of two. Summed over all iterations, these two multiplications incur a computational cost of

γ Σ_{i=0}^{n/b-1} [ (m/b - i)(n/b - i) 2b^3/p + (n/b - i) b^3/p_c ] = γ((mn^2 - n^3/3)/p + n^2 b/(2p_c)).

The overall costs of CAQR-HR are given to leading order by

γ((10mnb - 5n^2 b)/(2p_r) + (4nb^2/3) log p_r + n^2 b/(2p_c) + (2mn^2 - 2n^3/3)/p)
  + β(nb log p_r + (mn - n^2/2)/p_r + n^2/p_c)
  + α((8n/b) log p_r + (4n/b) log p_c).

If we pick p_r = p_c = √p (assuming m ≥ n) and b = n/(√p log p), then we obtain the leading order costs shown in the third row of Table 2:

γ((2mn^2 - 2n^3/3)/p) + β((mn + n^2)/√p) + α(6√p log^2 p).

Comparing the leading order costs of CAQR-HR with the existing approaches, we note again the O(n log p) latency cost incurred by the 2D-Householder-QR algorithm. CAQR and CAQR-HR eliminate this synchronization bottleneck and reduce the latency cost to be independent of the number of columns of the matrix. Further, both the bandwidth and latency costs of CAQR-HR are factors of O(log p) lower than those of CAQR (when m ≈ n). As previously discussed, CAQR includes an extra leading order bandwidth cost term (β n^2 log p/√p), as well as a computational cost term (γ (n^2 b/p_c) log p_r) that requires the choice of a smaller block size and leads to an increase in the latency cost. Further, the Householder form allows us to decouple the block sizes used for the trailing matrix update from the width of each TSQR. This is a simple algorithmic optimization to add to our 2D algorithm, which has the same leading order costs. We show this nested blocked approach in Algorithm 8, which is very similar to Algorithm 7 and does not require any additional subroutines. While this algorithm does not have a theoretical benefit, it is highly beneficial in practice, as we will demonstrate in the performance section. We note that it is not possible to aggregate the implicit TSQR updates in this way.
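The trailing-matrix update that both Algorithm 7 and Algorithm 8 rely on applies the transposed block reflector with two multiplies per block column: W = Y^T A followed by A ← A - Y T^T W. A serial NumPy sketch of this update follows; the (Y, T) pair here is synthetic, generated from the relation T^{-1} + T^{-T} = Y^T Y purely to obtain a valid reflector for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m, b, nc = 10, 3, 6

# Synthetic unit lower trapezoidal Y and the matching upper triangular T.
Y = np.tril(rng.standard_normal((m, b)), -1) + np.eye(m, b)
M = Y.T @ Y
T = np.linalg.inv(np.triu(M, 1) + np.diag(np.diag(M)) / 2)  # T^{-1} + T^{-T} = Y^T Y
Qfull = np.eye(m) - Y @ T @ Y.T
assert np.allclose(Qfull.T @ Qfull, np.eye(m), atol=1e-8)   # block reflector is orthogonal

# Lines 6-8 of Algorithm 7, serially: W = Y^T A, then A <- A - Y T^T W.
A_trail = rng.standard_normal((m, nc))
W = Y.T @ A_trail
A_new = A_trail - Y @ (T.T @ W)
assert np.allclose(A_new, Qfull.T @ A_trail)
```

Only W is ever reduced across a processor column, which is what lets Algorithm 8 aggregate several reconstructed panels into one wider Ȳ before touching the trailing matrix.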
Algorithm 8 [Y, T, R] = CAQR-HR-Agg(A)
Require: A is m x n and distributed block-cyclically on p = p_r · p_c processors with block size b_1, so that each b_1 x b_1 block A_{ij} is owned by processor Π(i, j) = (i mod p_r) + p_r (j mod p_c)
 1: for i = 0 to n/b_2 - 1 do
 2:   [Y_{i:m/b_2-1,i}, T_i, R_{ii}] = CAQR-HR(A_{i:m/b_2-1,i})
 3:   Aggregate updates from line 5 of CAQR-HR into the (m - ib_2)-by-b_2 matrix Ȳ_{ri}
 4:   for r in [i, m/b_2 - 1], c in [i + 1, n/b_2 - 1] do in parallel
 5:     Π(r, c) computes W_{rc} = Ȳ_{ri}^T A_{rc}
 6:     Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c
 7:     Π(r, c) computes A_{rc} = A_{rc} - Ȳ_{ri} T_i^T W_c
 8:     Set R_{ic} = A_{ic}
 9:   end for
10: end for
Ensure: A = [ ∏_{i=1}^{n/b_2} (I - Y_{:,i} T_i Y_{:,i}^T) ] R

5 Correctness and Stability

In order to reconstruct the lower trapezoidal Householder vectors that constitute an orthonormal matrix (up to column signs), we can use an algorithm for LU decomposition without pivoting on an associated matrix (Algorithm 5). Lemma 5.1 shows by uniqueness that this LU decomposition produces the Householder vectors representing the orthogonal matrix in exact arithmetic. Given the explicit orthogonal factor, the LU method is cheaper in terms of both computation and communication than constructing the vectors via Householder-QR.

Lemma 5.1. Given an orthonormal m x b matrix Q, let the compact QR decomposition of Q given by the Householder-QR algorithm (Algorithm 1) be

Q = ([I_b; 0] - Y T Y_1^T) S,

where Y is unit lower triangular, Y_1 is the top b x b block of Y, and T is the upper triangular b x b matrix satisfying T^{-1} + T^{-T} = Y^T Y. Then S is a diagonal sign matrix, and Q - [S; 0] has a unique LU decomposition given by

Q - [S; 0] = Y (-T Y_1^T S).   (7)
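The first claim of Lemma 5.1, that the triangular factor of an orthonormal matrix is a diagonal sign matrix, is easy to check numerically (a sketch; np.linalg.qr may differ from Algorithm 1 in its sign choices, which does not affect the absolute values):

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # orthonormal 8 x 3 test matrix
_, R = np.linalg.qr(Q)                            # QR of an orthonormal matrix
# The triangular factor of an orthonormal matrix is a diagonal sign matrix.
assert np.allclose(np.abs(R), np.eye(3), atol=1e-12)
```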

Proof. Since Q is orthonormal, it has full rank and therefore has a unique QR decomposition with positive diagonal entries in the upper triangular matrix. This decomposition is Q = Q · I_b, so the Householder-QR algorithm must compute a decomposition that differs only by sign. Thus, S is a diagonal sign matrix. The uniqueness of T is guaranteed by [20, Example 2.4]. We obtain Equation (7) by rearranging the Householder QR decomposition. Since Y is unit lower triangular and T, Y_1^T, and S are all upper triangular, we have an LU decomposition. Since Y has full column rank and T, Y_1^T, and S are all nonsingular, Q - [S; 0] has full column rank and therefore has a unique LU decomposition with a unit lower triangular factor.

Unfortunately, given an orthonormal matrix Q, the sign matrix S produced by Householder-QR is not known, so we cannot run a standard LU algorithm on Q - [S; 0]. Lemma 5.2 shows that by running the Modified-LU algorithm (Algorithm 5), we can cheaply compute the sign matrix S during the course of the algorithm.

Lemma 5.2. In exact arithmetic, Algorithm 5 applied to an orthonormal m x b matrix Q computes the same Householder vectors Y and sign matrix S as the Householder-QR algorithm (Algorithm 1) applied to Q.

Proof. Consider applying one step of Modified-LU (Algorithm 5) to the orthonormal matrix Q, where we first set S_{11} = -sgn(q_{11}) and subtract it from q_{11}. Note that since all other entries of the first column are less than 1 in absolute value and the absolute value of the diagonal entry has been increased by 1, the diagonal entry is the maximum entry. Following the LU algorithm, all entries below the diagonal are scaled by the reciprocal of the diagonal element: for 2 ≤ i ≤ m,

y_{i1} = q_{i1}/(q_{11} + sgn(q_{11})),   (8)

where Y is the computed lower triangular factor. The Schur complement is updated as follows: for 2 ≤ i ≤ m and 2 ≤ j ≤ b,

q̄_{ij} = q_{ij} - q_{i1} q_{1j}/(q_{11} + sgn(q_{11})).   (9)

Now consider applying one step of the Householder-QR algorithm (Algorithm 1).
To match the LAPACK notation, we let α = q_{11}, β = -sgn(α)·||q_1||_2 = -sgn(α), and τ = (β - α)/β = 1 + α·sgn(α), where we use the fact that the columns of Q have unit norm. Note that β = -sgn(α) = S_{11}, which matches the Modified-LU algorithm above. The Householder vector y is computed by setting the diagonal entry to 1 and scaling the other entries of q_1 by 1/(α - β) = 1/(α + sgn(α)). Thus, computing the Householder vector matches computing the column of the lower triangular matrix from Modified-LU in Equation (8) above. Next, we consider the update of the trailing matrix: Q̄ = (I - τ y y^T) Q. Since Q has orthonormal columns (so that Σ_{i=1}^m q_{i1} q_{ij} = 0 for j ≠ 1), the dot product of the Householder vector with the jth column is given by

y^T q_j = Σ_{i=1}^m y_i q_{ij} = q_{1j} + Σ_{i=2}^m (q_{i1}/(α + sgn(α))) q_{ij}
        = q_{1j} - α q_{1j}/(α + sgn(α)) = (sgn(α)/(α + sgn(α))) q_{1j}.

The identity sgn(α)(1 + α sgn(α)) = α + sgn(α) implies τ y^T q_j = q_{1j}. Then the trailing matrix update (2 ≤ i ≤ m and 2 ≤ j ≤ b) is given by

q̄_{ij} = q_{ij} - y_i (τ y^T q_j) = q_{ij} - q_{i1} q_{1j}/(α + sgn(α)),

which matches the Schur complement update from Modified-LU in Equation (9) above. Finally, consider the jth element of the first row of the trailing matrix (j ≥ 2):

q̄_{1j} = q_{1j} - y_1 (τ y^T q_j) = 0.

Note that the computation of the first row is not performed in the Modified-LU algorithm; the corresponding row is not changed since the diagonal element is Y_{11} = 1. However, in the case of Householder-QR, since the first row of the trailing matrix is zero, and because the Householder transformation preserves column norms, the updated (m - 1) x (b - 1) trailing matrix is itself an orthonormal matrix. Thus, by induction, the rest of the two algorithms perform the same computations.

In order to bound the error in the decomposition of a general matrix using Householder reconstruction, we will use Lemma 5.4 below as well as the stability bounds on the TSQR or AllReduce algorithm provided in [4]. We restate those results here.

Lemma 5.3 ([4]). Let R̂ be the computed upper triangular factor of an m x b matrix A obtained via the AllReduce algorithm using a binary tree of height L (with m/2^L ≥ b). Then there exists an orthonormal matrix Q such that

||A - Q R̂||_F ≤ f_1(m, b, L, ε) ||A||_F   and   ||Q - Q̂||_F ≤ f_2(m, b, L, ε),

where f_1(m, b, L, ε) = b γ_{cm/2^L} + L b γ_{c2b} and f_2(m, b, L, ε) = b γ_{cm/2^L} + L b γ_{c2b}, for b γ_{cm/2^L} ≤ 1 and L b γ_{c2b} ≤ 1, where c is a small constant.

The following lemma shows that the computation performed in Algorithm 5 is more stable than general LU decomposition. Because it is performed on an orthonormal matrix, the growth factor in the upper triangular factor is bounded by 2, and we can bound the Frobenius norms of both computed factors by functions of the number of columns. Further, we can stably compute the T factor (needed to apply a blocked Householder update) from the upper triangular LU factor, a computationally cheaper method than using the Householder vectors themselves.

Lemma 5.4. In floating point arithmetic, given an orthonormal m x b matrix Q, Algorithm 5 computes factors S, Ỹ, and T̃ such that

||Q S - ([I_b; 0] - Ỹ T̃ Ỹ_1^T)||_F ≤ f_3(b, ε),
where

f_3(b, ε) = γ_b √(2b^2 · 2b) = 2b^{3/2} γ_b,

and Ỹ_1 is given by the first b rows of Ỹ.

Proof. This result follows from the standard error analysis of LU decomposition and triangular solve with multiple right-hand sides, and from the properties of the computed lower and upper triangular factors. First, we use the fact that for the LU decomposition of Q - [S; 0] into triangular factors Ỹ and Ũ, we have (see [5, Section 9.3])

||Q - [S; 0] - Ỹ Ũ||_F ≤ γ_b ||Ỹ||_F ||Ũ||_F.

Second, we compute T̃ = -Ũ S Ỹ_1^{-T} using the standard algorithm for triangular solve with multiple right-hand sides to obtain (see [5, Section 8.1])

||T̃ Ỹ_1^T + Ũ S||_F ≤ γ_b ||T̃||_F ||Ỹ_1||_F.
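Both properties used in this section, growth bounded by 2 in the upper triangular factor and an accurate implicit representation Q S = [I; 0] - Ỹ T̃ Ỹ_1^T, can be spot-checked with a serial NumPy sketch of Algorithm 5 (hypothetical illustration code, not the paper's implementation):

```python
import numpy as np

def modified_lu(Q):
    """Compact Algorithm 5: unpivoted LU of Q - [S; 0] with S(i,i) = -sgn(Q(i,i))."""
    Q = Q.copy()
    m, b = Q.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if Q[i, i] >= 0 else 1.0
        Q[i, i] -= s[i]
        Q[i + 1:, i] /= Q[i, i]
        Q[i + 1:, i + 1:] -= np.outer(Q[i + 1:, i], Q[i, i + 1:])
    return np.tril(Q, -1) + np.eye(m, b), np.triu(Q[:b, :]), s

rng = np.random.default_rng(5)
m, b = 20, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, b)))
Y, U, s = modified_lu(Q)
assert np.max(np.abs(U)) <= 2 + 1e-12            # growth in U is bounded by 2
T = -U @ np.diag(s) @ np.linalg.inv(Y[:b, :].T)  # T = -U S Y1^{-T}
lhs = Q @ np.diag(s)
rhs = np.vstack([np.eye(b), np.zeros((m - b, b))]) - Y @ T @ Y[:b, :].T
assert np.linalg.norm(lhs - rhs) < 1e-10         # residual far below the f_3(b, eps) scale
```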


Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

A generalization of Amdahl's law and relative conditions of parallelism

A generalization of Amdahl's law and relative conditions of parallelism A generalization of Amdahl's law and relative conditions of arallelism Author: Gianluca Argentini, New Technologies and Models, Riello Grou, Legnago (VR), Italy. E-mail: gianluca.argentini@riellogrou.com

More information

A randomized sorting algorithm on the BSP model

A randomized sorting algorithm on the BSP model A randomized sorting algorithm on the BSP model Alexandros V. Gerbessiotis a, Constantinos J. Siniolakis b a CS Deartment, New Jersey Institute of Technology, Newark, NJ 07102, USA b The American College

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks Analysis of Multi-Ho Emergency Message Proagation in Vehicular Ad Hoc Networks ABSTRACT Vehicular Ad Hoc Networks (VANETs) are attracting the attention of researchers, industry, and governments for their

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization

More information

POINTS ON CONICS MODULO p

POINTS ON CONICS MODULO p POINTS ON CONICS MODULO TEAM 2: JONGMIN BAEK, ANAND DEOPURKAR, AND KATHERINE REDFIELD Abstract. We comute the number of integer oints on conics modulo, where is an odd rime. We extend our results to conics

More information

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review University of Wollongong Research Online Faculty of Informatics - Paers (Archive) Faculty of Engineering and Information Sciences 1999 New weighing matrices and orthogonal designs constructed using two

More information

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R.

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R. 1 Corresondence Between Fractal-Wavelet Transforms and Iterated Function Systems With Grey Level Mas F. Mendivil and E.R. Vrscay Deartment of Alied Mathematics Faculty of Mathematics University of Waterloo

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS Electronic Transactions on Numerical Analysis. Volume 44,. 53 72, 25. Coyright c 25,. ISSN 68 963. ETNA ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations Preconditioning techniques for Newton s method for the incomressible Navier Stokes equations H. C. ELMAN 1, D. LOGHIN 2 and A. J. WATHEN 3 1 Deartment of Comuter Science, University of Maryland, College

More information

A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search

A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search Eduard Babkin and Margarita Karunina 2, National Research University Higher School of Economics Det of nformation Systems and

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are Numer. Math. 68: 215{223 (1994) Numerische Mathemati c Sringer-Verlag 1994 Electronic Edition Bacward errors for eigenvalue and singular value decomositions? S. Chandrasearan??, I.C.F. Isen??? Deartment

More information

PROFIT MAXIMIZATION. π = p y Σ n i=1 w i x i (2)

PROFIT MAXIMIZATION. π = p y Σ n i=1 w i x i (2) PROFIT MAXIMIZATION DEFINITION OF A NEOCLASSICAL FIRM A neoclassical firm is an organization that controls the transformation of inuts (resources it owns or urchases into oututs or roducts (valued roducts

More information

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Luca Bergamaschi 1, Angeles Martinez 1, Giorgio Pini 1, and Flavio Sartoretto 2 1 Diartimento di Metodi e Modelli Matematici er

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Elementary Analysis in Q p

Elementary Analysis in Q p Elementary Analysis in Q Hannah Hutter, May Szedlák, Phili Wirth November 17, 2011 This reort follows very closely the book of Svetlana Katok 1. 1 Sequences and Series In this section we will see some

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

Elliptic Curves and Cryptography

Elliptic Curves and Cryptography Ellitic Curves and Crytograhy Background in Ellitic Curves We'll now turn to the fascinating theory of ellitic curves. For simlicity, we'll restrict our discussion to ellitic curves over Z, where is a

More information

End-to-End Delay Minimization in Thermally Constrained Distributed Systems

End-to-End Delay Minimization in Thermally Constrained Distributed Systems End-to-End Delay Minimization in Thermally Constrained Distributed Systems Pratyush Kumar, Lothar Thiele Comuter Engineering and Networks Laboratory (TIK) ETH Zürich, Switzerland {ratyush.kumar, lothar.thiele}@tik.ee.ethz.ch

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003 SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES Gord Sinnamon The University of Western Ontario December 27, 23 Abstract. Strong forms of Schur s Lemma and its converse are roved for mas

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

16. Binary Search Trees

16. Binary Search Trees Dictionary imlementation 16. Binary Search Trees [Ottman/Widmayer, Ka..1, Cormen et al, Ka. 1.1-1.] Hashing: imlementation of dictionaries with exected very fast access times. Disadvantages of hashing:

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

FE FORMULATIONS FOR PLASTICITY

FE FORMULATIONS FOR PLASTICITY G These slides are designed based on the book: Finite Elements in Plasticity Theory and Practice, D.R.J. Owen and E. Hinton, 1970, Pineridge Press Ltd., Swansea, UK. 1 Course Content: A INTRODUCTION AND

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

Real-Time Computing with Lock-Free Shared Objects

Real-Time Computing with Lock-Free Shared Objects Real-Time Comuting with Lock-Free Shared Objects JAMES H. ADERSO, SRIKATH RAMAMURTHY, and KEVI JEFFAY University of orth Carolina This article considers the use of lock-free shared objects within hard

More information

Efficient Approximations for Call Admission Control Performance Evaluations in Multi-Service Networks

Efficient Approximations for Call Admission Control Performance Evaluations in Multi-Service Networks Efficient Aroximations for Call Admission Control Performance Evaluations in Multi-Service Networks Emre A. Yavuz, and Victor C. M. Leung Deartment of Electrical and Comuter Engineering The University

More information

16. Binary Search Trees

16. Binary Search Trees Dictionary imlementation 16. Binary Search Trees [Ottman/Widmayer, Ka..1, Cormen et al, Ka. 12.1-12.] Hashing: imlementation of dictionaries with exected very fast access times. Disadvantages of hashing:

More information

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms New Schedulability Test Conditions for Non-reemtive Scheduling on Multirocessor Platforms Technical Reort May 2008 Nan Guan 1, Wang Yi 2, Zonghua Gu 3 and Ge Yu 1 1 Northeastern University, Shenyang, China

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS Massimo Vaccarini Sauro Longhi M. Reza Katebi D.I.I.G.A., Università Politecnica delle Marche, Ancona, Italy

More information

Relaxed p-adic Hensel lifting for algebraic systems

Relaxed p-adic Hensel lifting for algebraic systems Relaxed -adic Hensel lifting for algebraic systems Jérémy Berthomieu Laboratoire de Mathématiques Université de Versailles Versailles, France jeremy.berthomieu@uvsq.fr Romain Lebreton Laboratoire d Informatique

More information

Online homotopy algorithm for a generalization of the LASSO

Online homotopy algorithm for a generalization of the LASSO This article has been acceted for ublication in a future issue of this journal, but has not been fully edited. Content may change rior to final ublication. Online homotoy algorithm for a generalization

More information

A Parallel Algorithm for Minimization of Finite Automata

A Parallel Algorithm for Minimization of Finite Automata A Parallel Algorithm for Minimization of Finite Automata B. Ravikumar X. Xiong Deartment of Comuter Science University of Rhode Island Kingston, RI 02881 E-mail: fravi,xiongg@cs.uri.edu Abstract In this

More information

Dimensional perturbation theory for Regge poles

Dimensional perturbation theory for Regge poles Dimensional erturbation theory for Regge oles Timothy C. Germann Deartment of Chemistry, University of California, Berkeley, California 94720 Sabre Kais Deartment of Chemistry, Purdue University, West

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition TNN-2007-P-0332.R1 1 Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information

Proximal methods for the latent group lasso penalty

Proximal methods for the latent group lasso penalty Proximal methods for the latent grou lasso enalty The MIT Faculty has made this article oenly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Villa,

More information

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of..

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of.. IOSR Journal of Mathematics (IOSR-JM) e-issn: 78-578, -ISSN: 319-765X. Volume 1, Issue 4 Ver. III (Jul. - Aug.016), PP 53-60 www.iosrournals.org Asymtotic Proerties of the Markov Chain Model method of

More information