Reconstructing Householder Vectors from Tall-Skinny QR


Reconstructing Householder Vectors from Tall-Skinny QR

Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, Hong Diep Nguyen, Edgar Solomonik

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-
http://

October 6, 2013

Copyright 2013, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

We would like to thank Yusaku Yamamoto for sharing his slides from SIAM ALA 2012 with us. We also thank Jack Poulson for help configuring Elemental. Solomonik was supported by a Department of Energy Computational Science Graduate Fellowship, grant number DE-FG02-97ER25308. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We also acknowledge DOE grants DE-SC , DE-SC , DE-SC , DE-SC , DE-SC001000, AC02-05CH11231, and DARPA grant HR .

Reconstructing Householder Vectors from Tall-Skinny QR

October 6, 2013

Grey Ballard 1, James Demmel 2, Laura Grigori 3, Mathias Jacquelin 4, Hong Diep Nguyen 2, and Edgar Solomonik 2

1 Sandia National Laboratories, Livermore, USA
2 University of California, Berkeley, Berkeley, USA
3 INRIA Paris - Rocquencourt, Paris, France
4 Lawrence Berkeley National Laboratory, Berkeley, USA

Abstract

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 system at NERSC.

1 Introduction

Because of the rising costs of communication (i.e., data movement) relative to computation, so-called communication-avoiding algorithms (ones that perform roughly the same computation as alternatives but significantly reduce communication) often run with reduced running times on today's architectures.
In particular, the standard algorithm for computing the QR decomposition of a tall and skinny matrix (one with many more rows than columns) is often bottlenecked by communication costs. A more recent algorithm known as Tall-Skinny QR (TSQR) is presented in [1] (the ideas go back to [2]) and overcomes this bottleneck by reformulating the computation. In fact, the algorithm is communication optimal, attaining the lower bound for communication costs of QR decomposition (up to a logarithmic factor in the number of processors) [3]. Not only is communication reduced in theory, but the algorithm has been demonstrated to perform better on a variety of architectures, including multicore processors [4], graphics processing units [5], and distributed-memory systems [6]. The standard algorithm for QR decomposition, which is implemented in LAPACK [7], ScaLAPACK [8], and Elemental [9], is known as Householder-QR (given below as Algorithm 1). For tall and skinny matrices, the algorithm works column-by-column, computing a Householder vector and applying the corresponding transformation for each column in the matrix. When the matrix is distributed across a parallel machine,

this requires one parallel reduction per column. The TSQR algorithm (given below as Algorithm 2), on the other hand, performs only one reduction during the entire computation. Therefore, TSQR requires asymptotically less inter-processor synchronization than Householder-QR on parallel machines (TSQR also achieves asymptotically higher cache reuse on sequential machines). Computing the QR decomposition of a tall and skinny matrix is an important kernel in many contexts, including standalone least squares problems, eigenvalue and singular value computations, and Krylov subspace and other iterative methods. In addition, the tall and skinny factorization is a standard building block in the computation of the QR decomposition of general (not necessarily tall and skinny) matrices. In particular, most algorithms work by factoring a tall and skinny panel of the matrix, applying the orthogonal factor to the trailing matrix, and then continuing on to the next panel. Although Householder-QR is bottlenecked by communication in the panel factorization, it can apply the orthogonal factor as an aggregated Householder transformation efficiently, using matrix multiplication [10]. The Communication-Avoiding QR (CAQR) [1] algorithm uses TSQR to factor each panel of a general matrix. One difficulty faced by CAQR is that TSQR computes an orthogonal factor that is implicitly represented in a different format than that of Householder-QR. While Householder-QR represents the orthogonal factor as a set of Householder vectors (one per column), TSQR computes a tree of smaller sets of Householder vectors (though with the same total number of nonzero parameters). In CAQR, this difference in representation implies that the trailing matrix update is done using the implicit tree representation rather than matrix multiplication as is possible with Householder-QR. From a software engineering perspective, this means writing and tuning more complicated code.
Furthermore, from a performance perspective, the update of the trailing matrix within CAQR is less communication efficient than the update within Householder-QR by a logarithmic factor in the number of processors. Building on a method introduced by Yamamoto [11], we show that the standard Householder vector representation may be recovered from the implicit TSQR representation for roughly the same cost as the TSQR itself. The key idea is that the Householder vectors that represent an orthonormal matrix can be computed via LU decomposition (without pivoting) of the orthonormal matrix subtracted from a diagonal sign matrix. We prove that this reconstruction is as numerically stable as Householder-QR (independent of the matrix condition number), and we validate this proof with experimental results. This reconstruction method allows us to get the best of the TSQR algorithm (avoiding synchronization) as well as the best of the Householder-QR algorithm (efficient trailing matrix updates via matrix multiplication). By obtaining Householder vectors from the TSQR representation, we can logically decouple the block size of the trailing matrix updates from the number of columns in each TSQR. This abstraction makes it possible to optimize the panel factorization and the trailing matrix updates independently. Our resulting parallel implementation outperforms ScaLAPACK, Elemental, and our own lightly-optimized CAQR implementation on the Hopper Cray XE6 platform at NERSC. While we do not experimentally study sequential performance, we expect our algorithm will also be beneficial in this setting, due to the cache efficiency gained by using TSQR.

2 Performance Cost Model

In this section, we detail our algorithmic performance cost model for parallel execution on p processors. We will use the α-β-γ model, which expresses algorithmic costs in terms of computation and communication costs, the latter being composed of a bandwidth cost as well as a latency cost.
We assume the cost of a message of size w words is α + βw, where α is the per-message latency cost and β is the per-word bandwidth cost. We let γ be the cost per floating point operation. We ignore the network topology and measure the costs in parallel, so that the cost of two disjoint pairs of processors communicating the same-sized message simultaneously is the same as that of one message. We also assume that two processors can exchange equal-sized messages simultaneously, but a processor can communicate with only one other processor at a time. Our algorithmic analysis will depend on the costs of collective communication, particularly broadcasts,

reductions, and all-reductions, and we consider both tree-based and bidirectional-exchange (or recursive doubling/halving) algorithms [12, 13, 14]. For arrays of size w, the collectives scatter, gather, allgather, and reduce-scatter can all be performed with communication cost

α log p + β ((p - 1)/p) w    (1)

(reduce-scatter also incurs a computational cost of γ ((p - 1)/p) w). Since a broadcast can be performed with a scatter and an allgather, a reduction can be performed with a reduce-scatter and a gather, and an all-reduction can be performed with a reduce-scatter and an allgather, the communication costs of those collectives for large arrays are twice that of Equation (1). For small arrays (w < p), it is not possible to use pipelined or recursive-doubling algorithms, which require the subdivision of the message into many pieces; therefore collectives such as a binomial tree broadcast must be used. In this case, the communication cost of broadcast, reduction, and all-reduction is α log p + β w log p.

3 Previous Work

We distinguish between two types of QR factorization algorithms. We call an algorithm that distributes entire rows of the matrix to processors a 1D algorithm. Such algorithms are often used for tall and skinny matrices. Algorithms which distribute the matrix across a 2D grid of p_r × p_c processors are known as 2D algorithms. Many right-looking 2D algorithms for QR decomposition of nearly square matrices divide the matrix into column panels and work panel-by-panel, factoring the panel with a 1D algorithm and then updating the trailing matrix. We consider two such existing algorithms in this section: 2D-Householder-QR (using Householder-QR) and CAQR (using TSQR).

3.1 Householder-QR

We first present Householder-QR in Algorithm 1, following [15] so that each Householder vector has a unit diagonal entry. We use LAPACK [7] notation for the scalar quantities.^1 We present Algorithm 1 in Matlab-style notation as a sequential algorithm.
The algorithm works column-by-column, computing a Householder vector and then updating the trailing matrix to the right. The Householder vectors are stored in an m × b lower triangular matrix Y. Note that we do not include τ as part of the output because it can be recomputed from Y. While the algorithm works for general m and n, it is most commonly used when m ≫ n, such as for a panel factorization within a square QR decomposition. In LAPACK terms, this algorithm corresponds to geqr2 and is used as a subroutine in geqrf. In this case, we also compute an upper triangular matrix T so that

Q = prod_{i=1}^{n} (I - τ_i y_i y_i^T) = I - Y T Y^T,

which allows the application of Q^T to the trailing matrix to be done efficiently using matrix multiplication. Computing T is done in LAPACK with larft, but it can also be computed from Y^T Y by solving the equation Y^T Y = T^{-1} + T^{-T} for T^{-1} (since Y^T Y is symmetric and T^{-1} is triangular, the off-diagonal entries are equivalent and the diagonal entries differ by a factor of 2) [16].

3.1.1 1D Algorithm

We will make use of Householder-QR as a sequential algorithm, but there are parallelizations of the algorithm in libraries such as ScaLAPACK [8] and Elemental [9] against which we will compare our new approach. Assuming a 1D distribution across p processors, the parallelization of Householder-QR (Algorithm 1) requires communication at lines 2 and 8, both of which can be performed using an all-reduction.

^1 However, we depart from the LAPACK code in that there is no check for a zero norm of a subcolumn.
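The relation Y^T Y = T^{-1} + T^{-T} can be checked numerically. The sketch below (our own helper name, dense NumPy, not the paper's code) forms T^{-1} by taking the upper triangle of Y^T Y and halving its diagonal; the resulting Q = I - Y T Y^T is orthogonal for any unit lower triangular Y.

```python
import numpy as np

def t_from_y(Y):
    """Recover the upper triangular T with Q = I - Y T Y^T from the
    identity Y^T Y = T^{-1} + T^{-T} (Y has unit diagonal)."""
    G = Y.T @ Y
    Tinv = np.triu(G)                       # upper triangle of Y^T Y
    np.fill_diagonal(Tinv, np.diag(G) / 2)  # diagonal entries differ by a factor of 2
    return np.linalg.inv(Tinv)

# quick check: Q = I - Y T Y^T is orthogonal
rng = np.random.default_rng(0)
m, b = 8, 3
Y = np.tril(rng.standard_normal((m, b)), -1)
Y[np.arange(b), np.arange(b)] = 1.0          # implicit unit diagonal, made explicit
T = t_from_y(Y)
Q = np.eye(m) - Y @ T @ Y.T
assert np.allclose(Q.T @ Q, np.eye(m))
```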

Algorithm 1 [Y, R] = Householder-QR(A)
Require: A is m × b
1: for i = 1 to b do
   % Compute the Householder vector
2:   α = A(i, i), β = ||A(i:m, i)||_2
3:   if α > 0 then
4:     β = -β
5:   end if
6:   A(i, i) = β, τ(i) = (β - α)/β
7:   A(i+1:m, i) = (1/(α - β)) A(i+1:m, i)
   % Apply the Householder transformation to the trailing matrix
8:   z = τ(i) [A(i, i+1:b) + A(i+1:m, i)^T A(i+1:m, i+1:b)]
9:   A(i, i+1:b) = A(i, i+1:b) - z, A(i+1:m, i+1:b) = A(i+1:m, i+1:b) - A(i+1:m, i) z
10: end for
Ensure: A = (prod_{i=1}^{b} (I - τ_i y_i y_i^T)) R
Ensure: R overwrites the upper triangle and Y (the Householder vectors) has implicit unit diagonal and overwrites the strict lower triangle of A; τ is an array of length b with τ_i = 2/(y_i^T y_i)

Because these steps occur for each column in the matrix, the total latency cost of the algorithm is 2b log p. This synchronization cost is a potential parallel scaling bottleneck, since it grows with the number of columns of the matrix and does not decrease with the number of processors. The bandwidth cost is dominated by the all-reduction of the z vector, which has length b - i at the ith step. Thus, the bandwidth cost of Householder-QR is (b^2/2) log p. The computation within the algorithm is load balanced for a parallel cost of (2mb^2 - 2b^3/3)/p flops.

3.1.2 2D Algorithm

In the context of a 2D algorithm, in order to perform an update with the computed Householder vectors, we must also compute the T matrix from Y in parallel. The leading order cost of computing T^{-1} from Y^T Y is mb^2/p flops plus the cost of reducing a symmetric b × b matrix, α log p + β b^2/2; note that the communication costs are lower order terms compared to the cost of computing Y. We present the costs of parallel Householder-QR in the first row of Table 1, combining the costs of Algorithm 1 with those of computing T. We refer to the 2D algorithm that uses Householder-QR as the panel factorization as 2D-Householder-QR. In ScaLAPACK terms, this algorithm corresponds to pxgeqrf.
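Algorithm 1 translates directly into NumPy. The sketch below (0-based indexing, no zero-norm check, consistent with the footnote above) overwrites a copy of A with R in the upper triangle and the Householder vectors Y strictly below it.

```python
import numpy as np

def householder_qr(A):
    """Algorithm 1: column-by-column Householder QR with unit-diagonal
    Householder vectors. Returns (overwritten A, tau)."""
    A = A.astype(float).copy()
    m, b = A.shape
    tau = np.zeros(b)
    for i in range(b):
        alpha = A[i, i]
        beta = np.linalg.norm(A[i:, i])
        if alpha > 0:
            beta = -beta                      # beta = -sgn(alpha) * ||A(i:m, i)||_2
        A[i, i] = beta
        tau[i] = (beta - alpha) / beta
        A[i+1:, i] /= alpha - beta            # scale vector so y(i) = 1 implicitly
        # apply I - tau * y * y^T to the trailing matrix
        z = tau[i] * (A[i, i+1:] + A[i+1:, i] @ A[i+1:, i+1:])
        A[i, i+1:] -= z
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], z)
    return A, tau
```

Rebuilding Q from the stored vectors (product of the b reflectors) recovers A = QR, which is an easy correctness check.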
The overall cost of 2D-Householder-QR, which includes panel factorizations and trailing matrix updates, is given to leading order by

γ ( (6mnb - 3n^2 b)/(2 p_r) + (n^2 b)/(2 p_c) + (2mn^2 - 2n^3/3)/p )
+ β ( nb log p_r + (mn - n^2/2)/p_r + n^2/p_c )
+ α ( 2n log p_r + (2n/b) log p_c ).

The derivation of these costs is very similar to that of CAQR-HR (see Section 4.3). If we pick p_r = p_c = sqrt(p) (assuming m ≈ n) and b = n/(sqrt(p) log p), then we obtain the leading order costs

γ (2mn^2 - 2n^3/3)/p + β (mn + n^2/2)/sqrt(p) + α n log p.

Note that these costs match those of [1, 8], with exceptions coming from the use of more efficient collectives. The choice of b is made to preserve the leading constants of the parallel computational cost. We present the costs of 2D-Householder-QR in the first row of Table 2.

3.2 Communication-Avoiding QR

In this section we present parallel Tall-Skinny QR (TSQR) [1, Algorithm 1] and Communication-Avoiding QR (CAQR) [1, Algorithm 2], which are algorithms for computing a QR decomposition that are more

communication efficient than Householder-QR, particularly for tall and skinny matrices.

3.2.1 1D Algorithm (TSQR)

We present a simplified version of TSQR in Algorithm 2: we assume the number of processors is a power of two and use a binary reduction tree (TSQR can be performed with any tree). Note also that we present a reduction algorithm rather than an all-reduction (i.e., the final R resides on only one processor at the end of the algorithm). TSQR assumes the tall-skinny matrix A is distributed in block row layout so that each processor owns a (m/p) × b submatrix. After each processor computes a local QR factorization of its submatrix (line 1), the algorithm works by reducing the remaining b × b triangles to one final upper triangular R = Q^T A (lines 2-10). The Q that triangularizes A is stored implicitly as a tree of sets of Householder vectors, given by {Y_{i,k}}. In particular, Y_{i,k} is the set of Householder vectors computed by processor i at the kth level of the tree. The ith leaf of the tree, Y_{i,0}, is the set of Householder vectors which processor i computes by doing a local QR on its part of the initial matrix A. In the case of a binary tree, every internal node of the tree consists of a QR factorization of two stacked b × b triangles (line 6). This sparsity structure can be exploited, saving a constant factor of computation compared to a QR factorization of a dense 2b × b matrix. In fact, as of version 3.4, LAPACK has subroutines for exploiting this and similar sparsity structures (tpqrt). Furthermore, the Householder vectors generated during the QR factorization of stacked triangles have similar sparsity; the structure of the Y_{i,k} for k > 0 is an identity matrix stacked on top of a triangle.
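The reduction pattern of Algorithm 2 can be simulated sequentially in a few lines. The sketch below (our own code, not the paper's; it assumes p is a power of two and m is divisible by p) keeps only the R factors and discards the Householder trees for brevity.

```python
import numpy as np

def tsqr(A, p):
    """Binary-tree TSQR (Algorithm 2), simulated sequentially: QR each of the
    p row blocks, then repeatedly stack pairs of R factors and re-factor.
    The implicit Householder tree {Y_ik} is dropped; only R is returned."""
    Rs = [np.linalg.qr(blk)[1] for blk in np.split(A, p)]  # leaf QRs (line 1)
    while len(Rs) > 1:                                     # one pass per tree level
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]
```

Because the R factor of a full-rank matrix is unique up to the signs of its rows, the result agrees with a direct QR of A up to a diagonal sign matrix.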
Because the set {Y_{i,k}} constitutes the implicit tree representation of the orthogonal factor, it is important to note how the tree is distributed across processors, since the Q factor will be either applied to a matrix (see [1, Algorithm 2]) or constructed explicitly (see Section 3.2.3). For a given Y_{i,k}, i indicates the processor number (both where it is computed and where it is stored), and k indicates the level in the tree. Although 0 ≤ i ≤ p - 1 and 0 ≤ k ≤ log p, many of the Y_{i,k} are null; for example, Y_{i,log p} is null for i > 0 and Y_{1,k} is null for k > 0. In the case of the binary tree in Algorithm 2, processor 0 computes and stores Y_{0,k} for 0 ≤ k ≤ log p and thus requires O(b^2 log p) extra memory. The costs and analysis of TSQR are given in [1, 17]:

γ (2mb^2/p + (2b^3/3) log p) + β ((b^2/2) log p) + α log p.

We tabulate these costs in the second row of Table 1. We note that the TSQR inner tree factorizations require an extra computational cost of O(b^3 log p) and a bandwidth cost of O(b^2 log p). Also note that in the context of a 2D algorithm, using TSQR as the panel factorization implies that there is no b × b T matrix to compute; the update of the trailing matrix is performed differently.

3.2.2 2D Algorithm (CAQR)

The 2D algorithm that uses TSQR for panel factorizations is known as CAQR. In order to update the trailing matrix within CAQR, the implicit orthogonal factor computed from TSQR needs to be applied as a tree in the same order as it was computed. See [1, Algorithm 2] for a description of this process, or see [18, Algorithm 4] for pseudocode that matches the binary tree in Algorithm 2. We refer to this application of the implicit Q^T as Apply-TSQR-QT. The algorithm has the same tree dependency flow structure as TSQR but requires a bidirectional exchange between paired nodes in the tree. We note that in internal nodes of the tree it is possible to exploit the additional sparsity structure of Y_{i,k} (an identity matrix stacked on top of a triangular matrix), which our implementation does via the use of the LAPACK v3.4+ routine tpmqrt.
Further, since A is m × n and intermediate values of rows of A are communicated, the trailing matrix update costs more than TSQR when n > b. In the context of CAQR on a square matrix, Apply-TSQR-QT is performed on a trailing matrix with n ≤ m columns. The extra work in the application of the inner leaves of the tree is proportional to O(n^2 b log(p)/p_c) and the bandwidth cost is proportional to O(n^2 log(p)/p_c). Since the cost of Apply-TSQR-QT is almost leading order in CAQR, it is desirable in practice to optimize

the update routine. However, the tree dependency structure complicates this manual developer or compiler optimization task.

Algorithm 2 [{Y_{i,k}}, R] = TSQR(A)
Require: Number of processors is p and i is the processor index
Require: A is m × b matrix distributed in block row layout; A_i is processor i's block
1: [Y_{i,0}, R_i] = Householder-QR(A_i)
2: for k = 1 to log p do
3:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
4:     j = i + 2^(k-1)
5:     Receive R_j from processor j
6:     [Y_{i,k}, R_i] = Householder-QR([R_i; R_j])
7:   else if i ≡ 2^(k-1) mod 2^k then
8:     Send R_i to processor i - 2^(k-1)
9:   end if
10: end for
11: if i = 0 then
12:   R = R_0
13: end if
Ensure: A = QR with Q implicitly represented by {Y_{i,k}}
Ensure: R is stored by processor 0 and Y_{i,k} is stored by processor i

The overall cost of CAQR is given to leading order by

γ ( (6mnb - 3n^2 b)/(2 p_r) + ((4nb^2)/3 + (3n^2 b)/(2 p_c)) log p_r + (6mn^2 - 2n^3)/(3p) )
+ β ( (nb/2 + n^2/(2 p_c)) log p_r + ((mn - n^2/2)/p_r) log p_r )
+ α ( (3n/b) log p_r + (4n/b) log p_c ).

See [1] for a discussion of these costs and [17] for detailed analysis. Note that the bandwidth cost is slightly lower here due to the use of more efficient broadcasts. If we pick p_r = p_c = sqrt(p) (assuming m ≈ n) and b = n/(sqrt(p) log^2 p), then we obtain the leading order costs

γ (2mn^2 - 2n^3/3)/p + β ((mn + n^2/2) log p)/sqrt(p) + α (7/2) sqrt(p) log^3 p.

Again, we choose b to preserve the leading constants of the computational cost. Note that b needs to be chosen smaller here than in Section 3.1.2 due to the costs associated with Apply-TSQR-QT. It is possible to reduce the costs of Apply-TSQR-QT further using ideas from efficient recursive doubling/halving collectives. See Appendix A for details.

3.2.3 Constructing Explicit Q from TSQR

In many use cases of QR decomposition, an explicit orthogonal factor is not necessary; rather, we need only the ability to apply the matrix (or its transpose) to another matrix, as done in the previous section. For our purposes (see Section 4), we will need to form the explicit m × b orthonormal matrix from the implicit tree representation.
Though it is not necessary within CAQR, we describe it here because it is a known algorithm (see [19, Figure 4]) and the structure of the algorithm is very similar to TSQR. In LAPACK terms, constructing (i.e., generating) the orthogonal factor when it is stored as a set of Householder vectors is done with orgqr.

Algorithm 3 [B] = Apply-TSQR-QT({Y_{i,k}}, A)
Require: Number of processors is p and i is the processor index
Require: A is m × n matrix distributed in block row layout; A_i is processor i's block
Require: {Y_{i,k}} is the implicit tree TSQR representation of b Householder vectors of length m.
1: B_i = Apply-Householder-QT(Y_{i,0}, A_i)
2: Let B̄_i be the first b rows of B_i
3: for k = 1 to log p do
4:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
5:     j = i + 2^(k-1)
6:     Receive B̄_j from processor j
7:     [B̄_i; B̄_j] = Apply-Householder-QT(Y_{i,k}, [B̄_i; B̄_j])
8:     Send B̄_j back to processor j
9:   else if i ≡ 2^(k-1) mod 2^k then
10:    Send B̄_i to processor i - 2^(k-1)
11:    Receive updated rows B̄_i from processor i - 2^(k-1)
12:    Set the first b rows of B_i to B̄_i
13:  end if
14: end for
Ensure: B = Q^T A with processor i owning block B_i, where Q is the orthogonal matrix implicitly represented by {Y_{i,k}}

Algorithm 4 presents the method for constructing the m × b matrix Q by applying the (implicit) square orthogonal factor to the first b columns of the m × m identity matrix. Note that while we present Algorithm 4 assuming a binary tree, any tree shape is possible, as long as the implicit Q is computed using the same tree shape as TSQR. While the nodes of the tree are computed from leaves to root, they will be applied in reverse order from root to leaves. Note that in order to minimize the computational cost, the sparsity of the identity matrix at the root node and the sparsity structure of {Y_{i,k}} at the inner tree nodes is exploited. In particular, since I_b is upper triangular and Y_{0,log p} has the structure of an identity matrix stacked on top of an upper triangle, the output of the root node application in line 7 is a stack of two upper triangles. By induction, since every Y_{i,k} for k > 0 has the same structure, all outputs Q̄_{i,k} are triangular matrices. At the leaf nodes, when Y_{i,0} is applied in line 13, the output is a dense (m/p) × b block.
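The tree flow of Algorithms 2 and 3 can be simulated sequentially with dense kernels. In the sketch below (our own helper names; reduced Q factors stand in for the implicit Y_{i,k} trees, so only the b-row "tips" survive each level, whereas the real algorithm also updates and returns the complementary rows to the paired processors):

```python
import numpy as np

def tsqr_tree(A, p):
    """Leaf QRs plus pairwise reductions; returns (tree of node Q factors, R)."""
    Qs, Rs = zip(*[np.linalg.qr(blk) for blk in np.split(A, p)])
    tree, Rs = [list(Qs)], list(Rs)
    while len(Rs) > 1:
        Qs, Rs = zip(*[np.linalg.qr(np.vstack(pair))
                       for pair in zip(Rs[0::2], Rs[1::2])])
        tree.append(list(Qs))
        Rs = list(Rs)
    return tree, Rs[0]

def apply_tree_qt(tree, A, p):
    """Apply the implicit Q^T to A leaf-to-root, mirroring Apply-TSQR-QT."""
    tips = [Q.T @ blk for Q, blk in zip(tree[0], np.split(A, p))]  # leaf applies
    for level in tree[1:]:
        tips = [Q.T @ np.vstack(pair)                              # inner tree nodes
                for Q, pair in zip(level, zip(tips[0::2], tips[1::2]))]
    return tips[0]  # top b rows of Q^T A; equals R when A itself is factored
```

Applying the tree back to A itself should reproduce R, which gives a direct consistency check between the factorization and the application order.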
Since the communicated matrices Q̄_j are triangular, just as the R_j were triangular in the TSQR algorithm, Construct-TSQR-Q incurs exactly the same computational and communication costs as TSQR. So, we can reconstruct the unique part of the Q matrix from the implicit form given by TSQR for the same cost as the TSQR itself.

Algorithm 4 Q = Construct-TSQR-Q({Y_{i,k}})
Require: Number of processors is p and i is the processor index
Require: {Y_{i,k}} is computed by Algorithm 2 so that Y_{i,k} is stored by processor i
1: if i = 0 then
2:   Q̄_0 = I_b
3: end if
4: for k = log p down to 1 do
5:   if i ≡ 0 mod 2^k and i + 2^(k-1) < p then
6:     j = i + 2^(k-1)
7:     [Q̄_i; Q̄_j] = Apply-Householder-Q(Y_{i,k}, [Q̄_i; 0])
8:     Send Q̄_j to processor j
9:   else if i ≡ 2^(k-1) mod 2^k then
10:    Receive Q̄_i from processor i - 2^(k-1)
11:  end if
12: end for
13: Q_i = Apply-Q-to-Triangle(Y_{i,0}, [Q̄_i; 0])
Ensure: Q is an orthonormal m × b matrix distributed in block row layout; Q_i is processor i's block

3.3 Yamamoto's Basis-Kernel Representation

The main goal of this work is to combine Householder-QR with CAQR; Yamamoto [11] proposes a scheme to achieve this. As described in Section 3.1, 2D-Householder-QR suffers from a communication bottleneck in the panel factorization. TSQR alleviates that bottleneck but requires a more complicated (and slightly less efficient) trailing matrix update. Motivated in part to improve the performance and programmability of a hybrid CPU/GPU implementation, Yamamoto suggests computing a representation of the orthogonal factor that triangularizes the panel that mimics the representation in Householder-QR. As described by Sun and Bischof [20], there are many so-called basis-kernel representations of an orthogonal matrix. See also [21] for a general discussion of block reflectors. For example, the Householder-QR algorithm computes a lower triangular matrix Y such that A = (I - Y T Y_1^T) R, so that

Q = I - Y T Y^T = I - [Y_1; Y_2] T [Y_1^T Y_2^T].    (2)

Here, Y is called the basis and T is called the kernel in this representation of the square orthogonal factor Q. However, there are many such basis-kernel representations if we do not restrict Y and T to be lower and upper triangular matrices, respectively. Yamamoto [11] chooses a basis-kernel representation that is easy to compute. For an m × b matrix A, let A = [Q_1; Q_2] R where Q_1 and R are b × b. Then define the basis-kernel representation

Q̃ = I - Ỹ T̃ Ỹ^T = I - [Q_1 - I; Q_2] (I - Q_1)^{-T} [(Q_1 - I)^T Q_2^T],    (3)

where I - Q_1 is assumed to be nonsingular. It can be easily verified that Q̃^T Q̃ = I and Q̃^T A = [R; 0]; in fact, this is the representation suggested and validated by [22, Theorem 3]. Note that both the basis and kernel matrices Ỹ and T̃ are dense. The main advantage of basis-kernel representations is that they can be used to apply the orthogonal factor (or its transpose) very efficiently using matrix multiplication. In particular, the computational complexity of applying Q̃^T using any basis-kernel representation is the same to leading order, assuming Y has the same dimensions as A and m ≫ b. Thus, it is not necessary to reconstruct the Householder vectors; from a computational perspective, finding any basis-kernel representation of the orthogonal factor computed by TSQR will do.
Note also that in order to apply Q̃^T with the representation in Equation (3), we need to apply the inverse of I - Q_1, so we need to perform an LU decomposition of that b × b matrix and then apply the inverses of the triangular factors using triangular solves. The assumption that I - Q_1 is nonsingular can be dropped by replacing I with a diagonal sign matrix S chosen so that S - Q_1 is nonsingular [23]; in this case the representation becomes

Q̃S = I - Ỹ T̃ Ỹ^T = I - [Q_1 - S; Q_2] S (S - Q_1)^{-T} [(Q_1 - S)^T Q_2^T].    (4)
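Equation (4) can be verified numerically. In the sketch below (our own dense NumPy check, not the distributed implementation), the basis is Ỹ = [Q_1 - S; Q_2] and the kernel is T̃ = S (S - Q_1)^{-T}; the resulting matrix is orthogonal and its first b columns equal Q S.

```python
import numpy as np

rng = np.random.default_rng(3)
m, b = 7, 3
Q = np.linalg.qr(rng.standard_normal((m, b)))[0]   # m-by-b orthonormal factor
Q1 = Q[:b]
# diagonal sign matrix making S - Q1 safely nonsingular: |S(i,i) - Q1(i,i)| >= 1
S = np.diag(np.where(np.diag(Q1) >= 0, -1.0, 1.0))
Yt = Q - np.vstack([S, np.zeros((m - b, b))])      # basis [Q1 - S; Q2]
Tt = S @ np.linalg.inv((S - Q1).T)                 # kernel S (S - Q1)^{-T}
Qt = np.eye(m) - Yt @ Tt @ Yt.T                    # Equation (4)
assert np.allclose(Qt.T @ Qt, np.eye(m))           # orthogonality
assert np.allclose(Qt[:, :b], Q @ S)               # first b columns reproduce Q S
```

Note how both the basis and kernel here are dense, in contrast to the triangular Y and T of the Householder representation.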

Yamamoto's approach is very closely related to TSQR-HR (Algorithm 6), presented in Section 4. We compare the methods in Section 4.4.

4 New Algorithms

We first present our main contribution, a parallel algorithm that performs TSQR and then reconstructs the Householder vector representation from the TSQR representation of the orthogonal factor. We then show that this reconstruction algorithm may be used as a building block for more efficient 2D QR algorithms. In particular, the algorithm is able to combine two existing approaches for 2D QR factorizations, leveraging the efficiency of TSQR in panel factorizations and the efficiency of Householder-QR in trailing matrix updates. While Householder reconstruction adds some extra cost to the panel factorization, we show that its use in the 2D algorithm reduces overall communication compared to both 2D-Householder-QR and CAQR.

4.1 TSQR with Householder Reconstruction

The basic steps of our 1D algorithm include performing TSQR, constructing the explicit tall-skinny Q factor, and then computing the Householder vectors corresponding to Q. The key idea of our reconstruction algorithm is that performing Householder-QR on an orthonormal matrix Q is the same as performing an LU decomposition on Q - S, where S is a diagonal sign matrix corresponding to the sign choices made inside the Householder-QR algorithm. This claim is proved explicitly in Lemma 5.2. Informally, ignoring signs, if Q = I - Y T Y^T with Y a matrix of Householder vectors, then Y (T Y_1^T) is an LU decomposition of Q - I, since Y is unit lower triangular and T Y_1^T is upper triangular.

In this section we present Modified-LU as Algorithm 5, which can be applied to any orthonormal matrix (not necessarily one obtained from TSQR). Ignoring lines 1, 3, and 4, it is exactly LU decomposition without pivoting. Note that with the choice of S, no pivoting is required, since the effective diagonal entry will be at least 1 in absolute value and all other entries in the column are bounded by 1 (the matrix is orthonormal).^3 This holds true throughout the entire factorization because the trailing matrix remains orthonormal, which we prove within Lemma 5.2.

Algorithm 5 [L, U, S] = Modified-LU(Q)
Require: Q is an m × b orthonormal matrix
1: S = 0
2: for i = 1 to b do
3:   S(i, i) = -sgn(Q(i, i))
4:   Q(i, i) = Q(i, i) - S(i, i)
     % Scale ith column of L by diagonal element
5:   Q(i+1:m, i) = (1/Q(i, i)) Q(i+1:m, i)
     % Perform Schur complement update
6:   Q(i+1:m, i+1:b) = Q(i+1:m, i+1:b) - Q(i+1:m, i) Q(i, i+1:b)
7: end for
Ensure: U overwrites the upper triangle and L has implicit unit diagonal and overwrites the strict lower triangle of Q; S is diagonal so that Q - S = LU

Given the algorithms of the previous sections, we now present the full approach for computing the QR decomposition of a tall-skinny matrix using TSQR and Householder reconstruction. That is, in this section we present an algorithm whose output format is identical to that of Householder-QR. However, we argue that the communication costs of this approach are much less than those of performing Householder-QR.

^3 We use the convention sgn(0) = 1.
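A minimal dense sketch of Algorithm 5 (our own code; the parallel version instead factors the top b × b block and broadcasts U for the remaining triangular solves):

```python
import numpy as np

def modified_lu(Q):
    """Algorithm 5 sketch: LU without pivoting applied to Q - S, where the
    diagonal sign matrix S(i,i) = -sgn(Q(i,i)) makes every pivot at least 1
    in absolute value. Returns (L, U, S) with Q - [S; 0] = L U."""
    W = Q.astype(float).copy()
    m, b = W.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if W[i, i] >= 0 else 1.0   # -sgn(Q(i,i)), with sgn(0) = 1
        W[i, i] -= s[i]                        # effective pivot, |pivot| >= 1
        W[i+1:, i] /= W[i, i]                  # scale ith column of L
        W[i+1:, i+1:] -= np.outer(W[i+1:, i], W[i, i+1:])  # Schur complement
    L = np.tril(W, -1) + np.eye(m, b)
    U = np.triu(W[:b])
    return L, U, np.diag(s)
```

Running this on an orthonormal Q confirms both the factorization Q - [S; 0] = LU and that no pivot falls below 1 in magnitude.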

                  Flops                      Words           Messages
Householder-QR    (3mb^2 - 2b^3/3)/p         (b^2/2) log p   2b log p
TSQR              2mb^2/p + (2b^3/3) log p   (b^2/2) log p   log p
TSQR-HR           5mb^2/p + (4b^3/3) log p   b^2 log p       4 log p

Table 1: Costs of QR factorization of a tall-skinny m × b matrix distributed over p processors in 1D fashion. We assume these algorithms are used as panel factorizations in the context of a 2D algorithm applied to an m × n matrix. Thus, the costs of Householder-QR and TSQR-HR include the costs of computing T.

The method, given as Algorithm 6, is to perform TSQR (line 1), construct the tall-skinny Q factor explicitly (line 2), and then compute the Householder vectors that represent that orthogonal factor using Modified-LU (line 3). The R factor is computed in line 1 and the Householder vectors (the columns of Y) are computed in line 3. An added benefit of the approach is that the triangular T matrix, which allows for block application of the Householder vectors, can be computed very cheaply. That is, a triangular solve involving the upper triangular factor from Modified-LU computes the T so that A = (I - Y T Y_1^T) R. To compute T directly from Y (as is necessary if Householder-QR is used) requires O(mb^2) flops; here the triangular solve involves only O(b^3) flops. Our approach for computing T is given in line 4, and line 5 ensures sign agreement between the columns of the (implicitly stored) orthogonal factor and the rows of R.

Algorithm 6 [Y, T, R] = TSQR-HR(A)
Require: A is m × b matrix distributed in block row layout
1: [{Y_{i,k}}, R̄] = TSQR(A)
2: Q = Construct-TSQR-Q({Y_{i,k}})
3: [Y, U, S] = Modified-LU(Q)
4: T = -U S Y_1^{-T}
5: R = S R̄
Ensure: A = (I - Y T Y_1^T) R, where Y is m × b and unit lower triangular, Y_1 is the top b × b block of Y, and T and R are b × b and upper triangular

On p processors, Algorithm 6 incurs the following costs (ignoring lower order terms):

1. Compute [{Y_{i,k}}, R̄] = TSQR(A)

The computational costs of this step come from lines 1 and 6 in Algorithm 2. Line 1 corresponds to a QR factorization of a (m/p) × b matrix, with a flop count of 2(m/p)b^2 - 2b^3/3 (each processor performs this step simultaneously).
Line 6 corresponds to a QR factorization of a b x b triangle stacked on top of a b x b triangle. Exploiting the sparsity structure, the flop count is 2b^3/3; this occurs at every internal node of the binary tree, so the total cost in parallel is (2b^3/3) log p. The communication costs of Algorithm 2 occur in lines 5 and 8. Since every R_{i,k} is a b x b upper triangular matrix, the cost of a single message is α + β(b^2/2). This occurs at every internal node in the tree, so the total communication cost in parallel is a factor log p larger. Thus, the cost of this step is

γ(2mb^2/p + (2b^3/3) log p) + β((b^2/2) log p) + α(log p).

2. Q = Construct-TSQR-Q({Y_{i,k}}). The computational costs of this step come from lines 7 and 13 in Algorithm 4. Note that for k > 0, Y_{i,k} is a 2b x b matrix: the identity matrix stacked on top of an upper triangular matrix. Furthermore, Q_{i,k} is an upper triangular matrix. Exploiting the structure of Y_{i,k} and Q_{i,k}, the cost of line 7 is 2b^3/3, which occurs at every internal node of the tree. Each Y_{i,0} is an (m/p) x b lower triangular

matrix of Householder vectors, so the cost of applying them to an upper triangular matrix in line 13 is 2(m/p)b^2 - 2b^3/3. Note that the computational cost of these two lines is the same as that of the previous step in Algorithm 2. The communication pattern of Algorithm 4 is identical to that of Algorithm 2, so the communication cost is also the same as that of the previous step. Thus, the cost of constructing Q is

γ(2mb^2/p + (2b^3/3) log p) + β((b^2/2) log p) + α(log p).

3. [Y, U, S] = Modified-LU(Q). Ignoring lines 3-4 in Algorithm 5, Modified-LU is the same as LU without pivoting. In parallel, the algorithm consists of a b x b (modified) LU factorization of the top block followed by parallel triangular solves to compute the rest of the lower triangular factor. The flop count of the b x b LU factorization is 2b^3/3, and the cost of each processor's triangular solve is (m/p)b^2. The communication cost of parallel Modified-LU is only that of a broadcast of the upper triangular b x b U factor (for which we use a bidirectional-exchange algorithm): β(b^2) + α(2 log p). Thus, the cost of this step is

γ(mb^2/p + 2b^3/3) + β(b^2) + α(2 log p).

4. T = -U S Y_1^{-T} and R = S R̄. The last two steps consist of computing T and scaling the final R appropriately. Since S is a sign matrix, computing US and S R̄ requires no floating point operations and can be done locally on one processor. Thus, the only computational cost is performing a b x b triangular solve involving Y_1^T. If we ignore the triangular structure of the output, the flop count of this operation is b^3. However, since this operation occurs on the same processor that computes the top (m/p) x b block of Y and U, it can be overlapped with the previous step (Modified-LU). After the top processor performs the b x b LU factorization and broadcasts the U factor, it computes only m/p - b rows of Y (all other processors update m/p rows). Thus, performing an extra b x b triangular solve on the top processor adds no computational cost to the critical path of the algorithm.
Thus, TSQR-HR(A), where A is m-by-b, incurs the following costs (ignoring lower order terms):

γ(5mb^2/p + (4b^3/3) log p) + β(b^2 log p) + α(4 log p).   (5)

Note that the LU factorization required in Yamamoto's approach (see Section 3.3) is equivalent to performing Modified-LU(-Q̃_1). In Algorithm 6, the Modified-LU algorithm is applied to an m x b matrix rather than to only the top b x b block; since no pivoting is required, the only difference is the update of the bottom m - b rows with a triangular solve. Thus it is not hard to see that, ignoring signs, the Householder basis-kernel representation in Equation (2) can be obtained from the representation given in Equation (3) with two triangular solves: if the LU factorization gives I - Q̃_1 = LU, then Y = Ỹ U^{-1} and T = U L^{-T}. Indeed, performing these two operations and handling the signs correctly gives Algorithm 6. While Yamamoto's approach avoids performing the triangular solve on Q, it still involves performing both TSQR and Construct-TSQR-Q.
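The overall flow of TSQR-HR can be sketched serially in NumPy. In the self-contained sketch below, np.linalg.qr stands in for the distributed TSQR and Construct-TSQR-Q subroutines (an assumption made purely for illustration), and the final assertions check the Householder-format output A = (I - Y T Y_1^T) R:

```python
import numpy as np

def modified_lu(Q):
    """Compact Algorithm 5: unpivoted LU of Q - [S; 0] with S(i,i) = -sgn(Q(i,i))."""
    Q = Q.copy()
    m, b = Q.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if Q[i, i] >= 0 else 1.0
        Q[i, i] -= s[i]
        Q[i + 1:, i] /= Q[i, i]
        Q[i + 1:, i + 1:] -= np.outer(Q[i + 1:, i], Q[i, i + 1:])
    return np.tril(Q, -1) + np.eye(m, b), np.triu(Q[:b, :]), s

def tsqr_hr(A):
    """Sketch of Algorithm 6; np.linalg.qr stands in for TSQR + Construct-TSQR-Q."""
    m, b = A.shape
    Q, Rbar = np.linalg.qr(A)                 # lines 1-2: explicit orthonormal factor
    Y, U, s = modified_lu(Q)                  # line 3
    S = np.diag(s)
    T = -U @ S @ np.linalg.inv(Y[:b, :].T)    # line 4: T = -U S Y1^{-T}
    R = S @ Rbar                              # line 5: fix the row signs of R
    return Y, T, R

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 5))
Y, T, R = tsqr_hr(A)
m, b = A.shape
# Householder-format output: A = [R; 0] - Y T (Y1^T R)
recon = np.vstack([R, np.zeros((m - b, b))]) - Y @ (T @ (Y[:b, :].T @ R))
assert np.allclose(recon, A)
assert np.allclose(Y, np.tril(Y)) and Y[0, 0] == 1.0
assert np.allclose(T, np.triu(T))
```

The explicit inverse of Y_1^T appears here only for brevity; a production code would use a triangular solve, as the cost analysis above assumes.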

                    Flops                 Words                 Messages
2D-Householder-QR   (2mn^2 - 2n^3/3)/p    (mn + n^2)/√p         2n log p
CAQR                (2mn^2 - 2n^3/3)/p    (mn + n^2 log p)/√p   7√p log^3 p
CAQR-HR             (2mn^2 - 2n^3/3)/p    (mn + n^2)/√p         6√p log^2 p

Table 2: Costs of QR factorization of an m x n matrix distributed over p processors in a 2D fashion. Here we assume a square processor grid (p_r = p_c = √p). We also choose block sizes for each algorithm independently to ensure the leading order terms for flops are identical.

4.2 LU Decomposition of A - R

Computing the Householder basis-kernel representation via Algorithm 6 requires forming the explicit m x b orthonormal matrix. It is possible to compute this representation more cheaply, though at the expense of losing numerical stability. Suppose A is full rank, R = [R_1; 0] is the upper triangular factor computed by TSQR, and A - R is also full rank. Then A = (I - Y T Y_1^T) R implies that the unique LU decomposition of A - R is given by

A - R = Y (-T Y_1^T R_1),   (6)

since Y is unit lower triangular and T, Y_1^T, and R_1 are all upper triangular. The assumption that A - R is full rank can be dropped by choosing an appropriate diagonal sign matrix S and computing the LU decomposition of A - [S R_1; 0]. Not surprisingly, choosing the signs to match the choices of the Householder-QR algorithm applied to A guarantees that the matrix is nonsingular. In fact, the signs can be computed during the course of a modified LU decomposition algorithm very similar to Algorithm 5 (designed for general matrices rather than only orthonormal ones). This more general algorithm can also be interpreted as performing Householder-QR on A but using R to provide hints to the algorithm to avoid certain expensive computations; see Appendix B for more details. The advantage of this approach is that Y can be computed without constructing Q explicitly, which is as expensive as TSQR itself. Unfortunately, this method is not numerically stable. An intuition for the instability comes from considering a low-rank matrix A.
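For a well-conditioned A, the representation in Equation (6) can be observed numerically. The sketch below is hypothetical illustration code: it assumes A - R is full rank (a small diagonal shift makes this likely) and uses a plain unpivoted LU in place of the more careful modified algorithm described above:

```python
import numpy as np

def lu_nopivot(M):
    """Right-looking LU without pivoting (unstable for general matrices)."""
    M = M.copy()
    m, b = M.shape
    for i in range(b):
        M[i + 1:, i] /= M[i, i]
        M[i + 1:, i + 1:] -= np.outer(M[i + 1:, i], M[i, i + 1:])
    return np.tril(M, -1) + np.eye(m, b), np.triu(M[:b, :])

rng = np.random.default_rng(2)
m, b = 9, 3
# Diagonal shift keeps A and A - R well behaved (an assumption of this demo).
A = rng.standard_normal((m, b)) + 2 * np.eye(m, b)
_, R1 = np.linalg.qr(A)
Rfull = np.vstack([R1, np.zeros((m - b, b))])
Y, W = lu_nopivot(A - Rfull)   # W plays the role of -T Y1^T R1 in Equation (6)
# The unit lower trapezoidal factor reproduces A in Householder-like form.
assert np.allclose(Rfull + Y @ W, A)
assert np.allclose(Y, np.tril(Y))
```

The Householder vectors are read directly out of the lower trapezoidal LU factor without ever forming the explicit orthonormal factor, which is exactly the savings (and the stability risk) this section describes.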
Because the QR decomposition of A is not unique, the TSQR algorithm and the Householder-QR algorithm may compute different factor pairs (Q_TSQR, R_TSQR) and (Q_Hh, R_Hh). Performing an LU decomposition using A and R_TSQR to recover the Householder vectors that represent Q_Hh is hopeless if R_TSQR ≠ R_Hh, even in exact arithmetic. In floating point arithmetic, an ill-conditioned A corresponds to a factor R which is sensitive to roundoff error, so reconstructing the Householder vectors that would produce the same R as computed by TSQR is unstable. However, if A is well-conditioned, then this method is sufficient; one can view Algorithm 5 as applying the method to an orthonormal matrix (which is perfectly conditioned). One method of decreasing the sensitivity of R is to employ column pivoting. That is, after performing TSQR, apply QR with column pivoting to the b x b upper triangular factor R, yielding a factorization A = Q_TSQR (Q̂ R̂ P^T), or A P = (Q_TSQR Q̂) R̂, where the diagonal entries of R̂ decrease monotonically in absolute value. We have observed experimentally that performing the LU decomposition of A P - [S R̂; 0] is a more stable operation than not using column pivoting, though it is still not as robust as Algorithm 6. See Section 5 for a discussion of experiments where the instability occurs (even with column pivoting).

4.3 CAQR-HR

We refer to the 2D algorithm that uses TSQR-HR for panel factorizations as CAQR-HR. Because Householder-QR and TSQR-HR generate the same representation as output of the panel factorization, the trailing matrix

Algorithm 7 [Y, T, R] = CAQR-HR(A)
Require: A is m x n and distributed block-cyclically on p = p_r · p_c processors with block size b, so that each b x b block A_{ij} is owned by processor Π(i, j) = (i mod p_r) + p_r (j mod p_c)
 1: for i = 0 to n/b - 1 do
      % Compute TSQR and reconstruct Householder representation using a column of p_r processors
 2:   [Y_{i:m/b-1,i}, T_i, R_{ii}] = Hh-Recon-TSQR(A_{i:m/b-1,i})
      % Update trailing matrix using all processors
 3:   Π(i, i) broadcasts T_i to all other processors
 4:   for r in [i, m/b - 1], c in [i + 1, n/b - 1] do in parallel
 5:     Π(r, i) broadcasts Y_{ri} to all processors in its row, Π(r, :)
 6:     Π(r, c) computes W_{rc} = Y_{ri}^T A_{rc}
 7:     Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c
 8:     Π(r, c) computes A_{rc} = A_{rc} - Y_{ri} T_i^T W_c
 9:     Set R_{ic} = A_{ic}
10:   end for
11: end for
Ensure: A = [ ∏_{i=1}^{n/b} (I - Y_{:,i} T_i Y_{:,i}^T) ] R

update can be performed in exactly the same way. Thus, the only difference between 2D-Householder-QR and CAQR-HR, presented in Algorithm 7, is the subroutine call for the panel factorization (line 2). Algorithm 7, with block size b and an m-by-n matrix A, with m ≥ n and m ≡ n ≡ 0 (mod b), distributed on a 2D p_r-by-p_c processor grid, incurs the following costs over all n/b iterations.

1. Compute [Y_{i:m/b-1,i}, T_i, R_{ii}] = Hh-Recon-TSQR(A_{i:m/b-1,i}). Equation (5) gives the cost of a single panel TSQR factorization with Householder reconstruction. We can sum over all iterations to obtain the cost of this step in the 2D QR algorithm (line 2 in Algorithm 7):

Σ_{i=0}^{n/b-1} [ γ(5(m - ib)b^2/p_r + (4b^3/3) log p_r) + β(b^2 log p_r) + α(4 log p_r) ]
  = γ((5mnb - 5n^2 b/2)/p_r + (4nb^2/3) log p_r) + β(nb log p_r) + α((4n/b) log p_r).

2. Π(i, i) broadcasts T_i to all other processors. The matrix T_i is b-by-b and triangular, so we use a bidirectional exchange broadcast. Since there are a total of n/b iterations, the total communication cost for this step of Algorithm 7 is

β(nb) + α((2n/b) log p).

3.
Π(r, i) broadcasts Y_{ri} to all processors in its row, Π(r, :). At this step, each processor which took part in the TSQR and Householder reconstruction sends its local chunk of the panel of Householder vectors to the rest of the processors. At iteration i of Algorithm 7, each processor owns at most (m/b - i)/p_r blocks Y_{ri}. Since all the broadcasts happen along processor rows, we assume they can be done simultaneously on the network. The communication along the critical path is then given by

Σ_{i=0}^{n/b-1} [ β((m/b - i)b^2/p_r) + α(2 log p_c) ] = β((mn - n^2/2)/p_r) + α((2n/b) log p_c).

4. Π(r, c) computes W_{rc} = Y_{ri}^T A_{rc}. Each block W_{rc} is computed on processor Π(r, c), using A_{rc}, which it owns, and Y_{ri}, which it just received from processor Π(r, i). At iteration i, processor j may own up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks W_{rc} with Π(r, c) = j. Each block-by-block multiply incurs 2b^3 flops; therefore, this step incurs a total computational cost of

γ Σ_{i=0}^{n/b-1} (m/b - i)(n/b - i) 2b^3/p = γ((mn^2 - n^3/3)/p).

5. Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c. At iteration i of Algorithm 7, each processor j may own up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks W_{rc}. The local part of the reduction can be done during the computation of W_{rc} on line 6 of Algorithm 7. Therefore, no process should contribute more than (n/b - i)/p_c b-by-b blocks W_{rc} to the reduction. Using a bidirectional exchange all-reduction algorithm and summing over the iterations, we obtain the cost of this step throughout the entire execution of the 2D algorithm:

Σ_{i=0}^{n/b-1} [ β(2(n/b - i)b^2/p_c) + α(2 log p_r) ] = β(n^2/p_c) + α((2n/b) log p_r).

6. Π(r, c) computes A_{rc} = A_{rc} - Y_{ri} T_i^T W_c. Since in our case m ≥ n, it is faster to first multiply T_i^T by W_c rather than Y_{ri} by T_i^T. Any processor j may update up to ((m/b - i)/p_r)((n/b - i)/p_c) blocks of A_{rc} using (m/b - i)/p_r blocks of Y_{ri} and (n/b - i)/p_c blocks of W_c. When multiplying T_i^T by W_c, we can exploit the triangular structure of T_i to lower the flop count by a factor of two. Summed over all iterations, these two multiplications incur a computational cost of

γ Σ_{i=0}^{n/b-1} [ (m/b - i)(n/b - i) 2b^3/p + (n/b - i) b^3/p_c ] = γ((mn^2 - n^3/3)/p + n^2 b/(2p_c)).

The overall costs of CAQR-HR are given to leading order by

γ((10mnb - 5n^2 b)/(2p_r) + (4nb^2/3) log p_r + n^2 b/(2p_c) + (2mn^2 - 2n^3/3)/p)
  + β(nb log p_r + (mn - n^2/2)/p_r + n^2/p_c)
  + α((8n/b) log p_r + (4n/b) log p_c).

If we pick p_r = p_c = √p (assuming m ≥ n) and b = n/(√p log p), then we obtain the leading order costs shown in the third row of Table 2:

γ((2mn^2 - 2n^3/3)/p) + β((mn + n^2)/√p) + α(6√p log^2 p).

Comparing the leading order costs of CAQR-HR with the existing approaches, we note again the O(n log p) latency cost incurred by the 2D-Householder-QR algorithm. CAQR and CAQR-HR eliminate this synchronization bottleneck and reduce the latency cost to be independent of the number of columns of the matrix. Further, both the bandwidth and latency costs of CAQR-HR are factors of O(log p) lower than those of CAQR (when m ≈ n). As previously discussed, CAQR includes an extra leading order bandwidth cost term (β n^2 log p/√p), as well as a computational cost term (γ (n^2 b/p_c) log p_r) that requires the choice of a smaller block size and leads to an increase in the latency cost. Further, the Householder form allows us to decouple the block sizes used for the trailing matrix update from the width of each TSQR. This is a simple algorithmic optimization to add to our 2D algorithm, which has the same leading order costs. We show this nested blocked approach in Algorithm 8, which is very similar to Algorithm 7 and does not require any additional subroutines. While this algorithm does not have a theoretical benefit, it is highly beneficial in practice, as we will demonstrate in the performance section. We note that it is not possible to aggregate the implicit TSQR updates in this way.
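The trailing-matrix update that both Algorithm 7 and Algorithm 8 rely on applies the transposed block reflector with two multiplies per block column: W = Y^T A followed by A ← A - Y T^T W. A serial NumPy sketch of this update follows; the (Y, T) pair here is synthetic, generated from the relation T^{-1} + T^{-T} = Y^T Y purely to obtain a valid reflector for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
m, b, nc = 10, 3, 6

# Synthetic unit lower trapezoidal Y and the matching upper triangular T.
Y = np.tril(rng.standard_normal((m, b)), -1) + np.eye(m, b)
M = Y.T @ Y
T = np.linalg.inv(np.triu(M, 1) + np.diag(np.diag(M)) / 2)  # T^{-1} + T^{-T} = Y^T Y
Qfull = np.eye(m) - Y @ T @ Y.T
assert np.allclose(Qfull.T @ Qfull, np.eye(m), atol=1e-8)   # block reflector is orthogonal

# Lines 6-8 of Algorithm 7, serially: W = Y^T A, then A <- A - Y T^T W.
A_trail = rng.standard_normal((m, nc))
W = Y.T @ A_trail
A_new = A_trail - Y @ (T.T @ W)
assert np.allclose(A_new, Qfull.T @ A_trail)
```

Only W is ever reduced across a processor column, which is what lets Algorithm 8 aggregate several reconstructed panels into one wider Ȳ before touching the trailing matrix.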
Algorithm 8 [Y, T, R] = CAQR-HR-Agg(A)
Require: A is m x n and distributed block-cyclically on p = p_r · p_c processors with block size b_1, so that each b_1 x b_1 block A_{ij} is owned by processor Π(i, j) = (i mod p_r) + p_r (j mod p_c)
 1: for i = 0 to n/b_2 - 1 do
 2:   [Y_{i:m/b_2-1,i}, T_i, R_{ii}] = CAQR-HR(A_{i:m/b_2-1,i})
 3:   Aggregate updates from line 5 of CAQR-HR into the (m - ib_2)-by-b_2 matrix Ȳ_{ri}
 4:   for r in [i, m/b_2 - 1], c in [i + 1, n/b_2 - 1] do in parallel
 5:     Π(r, c) computes W_{rc} = Ȳ_{ri}^T A_{rc}
 6:     Allreduce W_c = Σ_r W_{rc} so that processor column Π(:, c) owns W_c
 7:     Π(r, c) computes A_{rc} = A_{rc} - Ȳ_{ri} T_i^T W_c
 8:     Set R_{ic} = A_{ic}
 9:   end for
10: end for
Ensure: A = [ ∏_{i=1}^{n/b_2} (I - Y_{:,i} T_i Y_{:,i}^T) ] R

5 Correctness and Stability

In order to reconstruct the lower trapezoidal Householder vectors that constitute an orthonormal matrix (up to column signs), we can use an algorithm for LU decomposition without pivoting on an associated matrix (Algorithm 5). Lemma 5.1 shows by uniqueness that this LU decomposition produces the Householder vectors representing the orthogonal matrix in exact arithmetic. Given the explicit orthogonal factor, the LU method is cheaper in terms of both computation and communication than constructing the vectors via Householder-QR.

Lemma 5.1. Given an orthonormal m x b matrix Q, let the compact QR decomposition of Q given by the Householder-QR algorithm (Algorithm 1) be

Q = ([I_b; 0] - Y T Y_1^T) S,

where Y is unit lower triangular, Y_1 is the top b x b block of Y, and T is the upper triangular b x b matrix satisfying T^{-1} + T^{-T} = Y^T Y. Then S is a diagonal sign matrix, and Q - [S; 0] has a unique LU decomposition given by

Q - [S; 0] = Y (-T Y_1^T S).   (7)
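The first claim of Lemma 5.1, that the triangular factor of an orthonormal matrix is a diagonal sign matrix, is easy to check numerically (a sketch; np.linalg.qr may differ from Algorithm 1 in its sign choices, which does not affect the absolute values):

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # orthonormal 8 x 3 test matrix
_, R = np.linalg.qr(Q)                            # QR of an orthonormal matrix
# The triangular factor of an orthonormal matrix is a diagonal sign matrix.
assert np.allclose(np.abs(R), np.eye(3), atol=1e-12)
```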

Proof. Since Q is orthonormal, it has full rank and therefore has a unique QR decomposition with positive diagonal entries in the upper triangular matrix. This decomposition is Q = Q · I_b, so the Householder-QR algorithm must compute a decomposition that differs only by sign. Thus, S is a diagonal sign matrix. The uniqueness of T is guaranteed by [20, Example 2.4]. We obtain Equation (7) by rearranging the Householder QR decomposition. Since Y is unit lower triangular and T, Y_1^T, and S are all upper triangular, we have an LU decomposition. Since Y has full column rank and T, Y_1^T, and S are all nonsingular, Q - [S; 0] has full column rank and therefore has a unique LU decomposition with a unit lower triangular factor.

Unfortunately, given an orthonormal matrix Q, the sign matrix S produced by Householder-QR is not known, so we cannot run a standard LU algorithm on Q - [S; 0]. Lemma 5.2 shows that by running the Modified-LU algorithm (Algorithm 5), we can cheaply compute the sign matrix S during the course of the algorithm.

Lemma 5.2. In exact arithmetic, Algorithm 5 applied to an orthonormal m x b matrix Q computes the same Householder vectors Y and sign matrix S as the Householder-QR algorithm (Algorithm 1) applied to Q.

Proof. Consider applying one step of Modified-LU (Algorithm 5) to the orthonormal matrix Q, where we first set S_{11} = -sgn(q_{11}) and subtract it from q_{11}. Note that since all other entries of the first column are less than 1 in absolute value and the absolute value of the diagonal entry has been increased by 1, the diagonal entry is the maximum entry. Following the LU algorithm, all entries below the diagonal are scaled by the reciprocal of the diagonal element: for 2 ≤ i ≤ m,

y_{i1} = q_{i1}/(q_{11} + sgn(q_{11})),   (8)

where Y is the computed lower triangular factor. The Schur complement is updated as follows: for 2 ≤ i ≤ m and 2 ≤ j ≤ b,

q̄_{ij} = q_{ij} - q_{i1} q_{1j}/(q_{11} + sgn(q_{11})).   (9)

Now consider applying one step of the Householder-QR algorithm (Algorithm 1).
To match the LAPACK notation, we let α = q_{11}, β = -sgn(α)·||q_1||_2 = -sgn(α), and τ = (β - α)/β = 1 + α·sgn(α), where we use the fact that the columns of Q have unit norm. Note that β = -sgn(α) = S_{11}, which matches the Modified-LU algorithm above. The Householder vector y is computed by setting the diagonal entry to 1 and scaling the other entries of q_1 by 1/(α - β) = 1/(α + sgn(α)). Thus, computing the Householder vector matches computing the column of the lower triangular matrix from Modified-LU in Equation (8) above. Next, we consider the update of the trailing matrix: Q̄ = (I - τ y y^T) Q. Since Q has orthonormal columns (so that Σ_{i=1}^m q_{i1} q_{ij} = 0 for j ≠ 1), the dot product of the Householder vector with the jth column is given by

y^T q_j = Σ_{i=1}^m y_i q_{ij} = q_{1j} + Σ_{i=2}^m (q_{i1}/(α + sgn(α))) q_{ij}
        = q_{1j} - α q_{1j}/(α + sgn(α)) = (sgn(α)/(α + sgn(α))) q_{1j}.

The identity sgn(α)(1 + α sgn(α)) = α + sgn(α) implies τ y^T q_j = q_{1j}. Then the trailing matrix update (2 ≤ i ≤ m and 2 ≤ j ≤ b) is given by

q̄_{ij} = q_{ij} - y_i (τ y^T q_j) = q_{ij} - q_{i1} q_{1j}/(α + sgn(α)),

which matches the Schur complement update from Modified-LU in Equation (9) above. Finally, consider the jth element of the first row of the trailing matrix (j ≥ 2):

q̄_{1j} = q_{1j} - y_1 (τ y^T q_j) = 0.

Note that the computation of the first row is not performed in the Modified-LU algorithm; the corresponding row is not changed since the diagonal element is Y_{11} = 1. However, in the case of Householder-QR, since the first row of the trailing matrix is zero, and because the Householder transformation preserves column norms, the updated (m - 1) x (b - 1) trailing matrix is itself an orthonormal matrix. Thus, by induction, the rest of the two algorithms perform the same computations.

In order to bound the error in the decomposition of a general matrix using Householder reconstruction, we will use Lemma 5.4 below as well as the stability bounds on the TSQR or AllReduce algorithm provided in [4]. We restate those results here.

Lemma 5.3 ([4]). Let R̂ be the computed upper triangular factor of an m x b matrix A obtained via the AllReduce algorithm using a binary tree of height L (with m/2^L ≥ b). Then there exists an orthonormal matrix Q such that

||A - Q R̂||_F ≤ f_1(m, b, L, ε) ||A||_F   and   ||Q - Q̂||_F ≤ f_2(m, b, L, ε),

where f_1(m, b, L, ε) = b γ_{cm/2^L} + L b γ_{c2b} and f_2(m, b, L, ε) = b γ_{cm/2^L} + L b γ_{c2b}, for b γ_{cm/2^L} ≤ 1 and L b γ_{c2b} ≤ 1, where c is a small constant.

The following lemma shows that the computation performed in Algorithm 5 is more stable than general LU decomposition. Because it is performed on an orthonormal matrix, the growth factor in the upper triangular factor is bounded by 2, and we can bound the Frobenius norms of both computed factors by functions of the number of columns. Further, we can stably compute the T factor (needed to apply a blocked Householder update) from the upper triangular LU factor, a computationally cheaper method than using the Householder vectors themselves.

Lemma 5.4. In floating point arithmetic, given an orthonormal m x b matrix Q, Algorithm 5 computes factors S, Ỹ, and T̃ such that

||Q S - ([I_b; 0] - Ỹ T̃ Ỹ_1^T)||_F ≤ f_3(b, ε),
where

f_3(b, ε) = γ_b √(2b^2 · 2b) = 2b^{3/2} γ_b,

and Ỹ_1 is given by the first b rows of Ỹ.

Proof. This result follows from the standard error analysis of LU decomposition and triangular solve with multiple right-hand sides, and from the properties of the computed lower and upper triangular factors. First, we use the fact that for the LU decomposition of Q - [S; 0] into triangular factors Ỹ and Ũ, we have (see [5, Section 9.3])

||Q - [S; 0] - Ỹ Ũ||_F ≤ γ_b ||Ỹ||_F ||Ũ||_F.

Second, we compute T̃ = -Ũ S Ỹ_1^{-T} using the standard algorithm for triangular solve with multiple right-hand sides to obtain (see [5, Section 8.1])

||T̃ Ỹ_1^T + Ũ S||_F ≤ γ_b ||T̃||_F ||Ỹ_1||_F.
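Both properties used in this section, growth bounded by 2 in the upper triangular factor and an accurate implicit representation Q S = [I; 0] - Ỹ T̃ Ỹ_1^T, can be spot-checked with a serial NumPy sketch of Algorithm 5 (hypothetical illustration code, not the paper's implementation):

```python
import numpy as np

def modified_lu(Q):
    """Compact Algorithm 5: unpivoted LU of Q - [S; 0] with S(i,i) = -sgn(Q(i,i))."""
    Q = Q.copy()
    m, b = Q.shape
    s = np.empty(b)
    for i in range(b):
        s[i] = -1.0 if Q[i, i] >= 0 else 1.0
        Q[i, i] -= s[i]
        Q[i + 1:, i] /= Q[i, i]
        Q[i + 1:, i + 1:] -= np.outer(Q[i + 1:, i], Q[i, i + 1:])
    return np.tril(Q, -1) + np.eye(m, b), np.triu(Q[:b, :]), s

rng = np.random.default_rng(5)
m, b = 20, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, b)))
Y, U, s = modified_lu(Q)
assert np.max(np.abs(U)) <= 2 + 1e-12            # growth in U is bounded by 2
T = -U @ np.diag(s) @ np.linalg.inv(Y[:b, :].T)  # T = -U S Y1^{-T}
lhs = Q @ np.diag(s)
rhs = np.vstack([np.eye(b), np.zeros((m - b, b))]) - Y @ T @ Y[:b, :].T
assert np.linalg.norm(lhs - rhs) < 1e-10         # residual far below the f_3(b, eps) scale
```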


Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

c Copyright by Helen J. Elwood December, 2011

c Copyright by Helen J. Elwood December, 2011 c Coyright by Helen J. Elwood December, 2011 CONSTRUCTING COMPLEX EQUIANGULAR PARSEVAL FRAMES A Dissertation Presented to the Faculty of the Deartment of Mathematics University of Houston In Partial Fulfillment

More information

A generalization of Amdahl's law and relative conditions of parallelism

A generalization of Amdahl's law and relative conditions of parallelism A generalization of Amdahl's law and relative conditions of arallelism Author: Gianluca Argentini, New Technologies and Models, Riello Grou, Legnago (VR), Italy. E-mail: gianluca.argentini@riellogrou.com

More information

A randomized sorting algorithm on the BSP model

A randomized sorting algorithm on the BSP model A randomized sorting algorithm on the BSP model Alexandros V. Gerbessiotis a, Constantinos J. Siniolakis b a CS Deartment, New Jersey Institute of Technology, Newark, NJ 07102, USA b The American College

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks Analysis of Multi-Ho Emergency Message Proagation in Vehicular Ad Hoc Networks ABSTRACT Vehicular Ad Hoc Networks (VANETs) are attracting the attention of researchers, industry, and governments for their

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization

More information

POINTS ON CONICS MODULO p

POINTS ON CONICS MODULO p POINTS ON CONICS MODULO TEAM 2: JONGMIN BAEK, ANAND DEOPURKAR, AND KATHERINE REDFIELD Abstract. We comute the number of integer oints on conics modulo, where is an odd rime. We extend our results to conics

More information

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review

New weighing matrices and orthogonal designs constructed using two sequences with zero autocorrelation function - a review University of Wollongong Research Online Faculty of Informatics - Paers (Archive) Faculty of Engineering and Information Sciences 1999 New weighing matrices and orthogonal designs constructed using two

More information

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Convex Optimization methods for Computing Channel Capacity

Convex Optimization methods for Computing Channel Capacity Convex Otimization methods for Comuting Channel Caacity Abhishek Sinha Laboratory for Information and Decision Systems (LIDS), MIT sinhaa@mit.edu May 15, 2014 We consider a classical comutational roblem

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R.

Correspondence Between Fractal-Wavelet. Transforms and Iterated Function Systems. With Grey Level Maps. F. Mendivil and E.R. 1 Corresondence Between Fractal-Wavelet Transforms and Iterated Function Systems With Grey Level Mas F. Mendivil and E.R. Vrscay Deartment of Alied Mathematics Faculty of Mathematics University of Waterloo

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM

ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM ANALYTIC NUMBER THEORY AND DIRICHLET S THEOREM JOHN BINDER Abstract. In this aer, we rove Dirichlet s theorem that, given any air h, k with h, k) =, there are infinitely many rime numbers congruent to

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS

ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES CONTROL PROBLEMS Electronic Transactions on Numerical Analysis. Volume 44,. 53 72, 25. Coyright c 25,. ISSN 68 963. ETNA ON THE DEVELOPMENT OF PARAMETER-ROBUST PRECONDITIONERS AND COMMUTATOR ARGUMENTS FOR SOLVING STOKES

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations

Preconditioning techniques for Newton s method for the incompressible Navier Stokes equations Preconditioning techniques for Newton s method for the incomressible Navier Stokes equations H. C. ELMAN 1, D. LOGHIN 2 and A. J. WATHEN 3 1 Deartment of Comuter Science, University of Maryland, College

More information

A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search

A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search A New Method of DDB Logical Structure Synthesis Using Distributed Tabu Search Eduard Babkin and Margarita Karunina 2, National Research University Higher School of Economics Det of nformation Systems and

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are

216 S. Chandrasearan and I.C.F. Isen Our results dier from those of Sun [14] in two asects: we assume that comuted eigenvalues or singular values are Numer. Math. 68: 215{223 (1994) Numerische Mathemati c Sringer-Verlag 1994 Electronic Edition Bacward errors for eigenvalue and singular value decomositions? S. Chandrasearan??, I.C.F. Isen??? Deartment

More information

PROFIT MAXIMIZATION. π = p y Σ n i=1 w i x i (2)

PROFIT MAXIMIZATION. π = p y Σ n i=1 w i x i (2) PROFIT MAXIMIZATION DEFINITION OF A NEOCLASSICAL FIRM A neoclassical firm is an organization that controls the transformation of inuts (resources it owns or urchases into oututs or roducts (valued roducts

More information

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson

Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Eigenanalysis of Finite Element 3D Flow Models by Parallel Jacobi Davidson Luca Bergamaschi 1, Angeles Martinez 1, Giorgio Pini 1, and Flavio Sartoretto 2 1 Diartimento di Metodi e Modelli Matematici er

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Elementary Analysis in Q p

Elementary Analysis in Q p Elementary Analysis in Q Hannah Hutter, May Szedlák, Phili Wirth November 17, 2011 This reort follows very closely the book of Svetlana Katok 1. 1 Sequences and Series In this section we will see some

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

Elliptic Curves and Cryptography

Elliptic Curves and Cryptography Ellitic Curves and Crytograhy Background in Ellitic Curves We'll now turn to the fascinating theory of ellitic curves. For simlicity, we'll restrict our discussion to ellitic curves over Z, where is a

More information

End-to-End Delay Minimization in Thermally Constrained Distributed Systems

End-to-End Delay Minimization in Thermally Constrained Distributed Systems End-to-End Delay Minimization in Thermally Constrained Distributed Systems Pratyush Kumar, Lothar Thiele Comuter Engineering and Networks Laboratory (TIK) ETH Zürich, Switzerland {ratyush.kumar, lothar.thiele}@tik.ee.ethz.ch

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003 SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES Gord Sinnamon The University of Western Ontario December 27, 23 Abstract. Strong forms of Schur s Lemma and its converse are roved for mas

More information

arxiv: v2 [quant-ph] 2 Aug 2012

arxiv: v2 [quant-ph] 2 Aug 2012 Qcomiler: quantum comilation with CSD method Y. G. Chen a, J. B. Wang a, a School of Physics, The University of Western Australia, Crawley WA 6009 arxiv:208.094v2 [quant-h] 2 Aug 202 Abstract In this aer,

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

16. Binary Search Trees

16. Binary Search Trees Dictionary imlementation 16. Binary Search Trees [Ottman/Widmayer, Ka..1, Cormen et al, Ka. 1.1-1.] Hashing: imlementation of dictionaries with exected very fast access times. Disadvantages of hashing:

More information

Recursive Estimation of the Preisach Density function for a Smart Actuator

Recursive Estimation of the Preisach Density function for a Smart Actuator Recursive Estimation of the Preisach Density function for a Smart Actuator Ram V. Iyer Deartment of Mathematics and Statistics, Texas Tech University, Lubbock, TX 7949-142. ABSTRACT The Preisach oerator

More information

FE FORMULATIONS FOR PLASTICITY

FE FORMULATIONS FOR PLASTICITY G These slides are designed based on the book: Finite Elements in Plasticity Theory and Practice, D.R.J. Owen and E. Hinton, 1970, Pineridge Press Ltd., Swansea, UK. 1 Course Content: A INTRODUCTION AND

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

Real-Time Computing with Lock-Free Shared Objects

Real-Time Computing with Lock-Free Shared Objects Real-Time Comuting with Lock-Free Shared Objects JAMES H. ADERSO, SRIKATH RAMAMURTHY, and KEVI JEFFAY University of orth Carolina This article considers the use of lock-free shared objects within hard

More information

Efficient Approximations for Call Admission Control Performance Evaluations in Multi-Service Networks

Efficient Approximations for Call Admission Control Performance Evaluations in Multi-Service Networks Efficient Aroximations for Call Admission Control Performance Evaluations in Multi-Service Networks Emre A. Yavuz, and Victor C. M. Leung Deartment of Electrical and Comuter Engineering The University

More information

16. Binary Search Trees

16. Binary Search Trees Dictionary imlementation 16. Binary Search Trees [Ottman/Widmayer, Ka..1, Cormen et al, Ka. 12.1-12.] Hashing: imlementation of dictionaries with exected very fast access times. Disadvantages of hashing:

More information

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms New Schedulability Test Conditions for Non-reemtive Scheduling on Multirocessor Platforms Technical Reort May 2008 Nan Guan 1, Wang Yi 2, Zonghua Gu 3 and Ge Yu 1 1 Northeastern University, Shenyang, China

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS

STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS STABILITY ANALYSIS TOOL FOR TUNING UNCONSTRAINED DECENTRALIZED MODEL PREDICTIVE CONTROLLERS Massimo Vaccarini Sauro Longhi M. Reza Katebi D.I.I.G.A., Università Politecnica delle Marche, Ancona, Italy

More information

Relaxed p-adic Hensel lifting for algebraic systems

Relaxed p-adic Hensel lifting for algebraic systems Relaxed -adic Hensel lifting for algebraic systems Jérémy Berthomieu Laboratoire de Mathématiques Université de Versailles Versailles, France jeremy.berthomieu@uvsq.fr Romain Lebreton Laboratoire d Informatique

More information

Online homotopy algorithm for a generalization of the LASSO

Online homotopy algorithm for a generalization of the LASSO This article has been acceted for ublication in a future issue of this journal, but has not been fully edited. Content may change rior to final ublication. Online homotoy algorithm for a generalization

More information

A Parallel Algorithm for Minimization of Finite Automata

A Parallel Algorithm for Minimization of Finite Automata A Parallel Algorithm for Minimization of Finite Automata B. Ravikumar X. Xiong Deartment of Comuter Science University of Rhode Island Kingston, RI 02881 E-mail: fravi,xiongg@cs.uri.edu Abstract In this

More information

Dimensional perturbation theory for Regge poles

Dimensional perturbation theory for Regge poles Dimensional erturbation theory for Regge oles Timothy C. Germann Deartment of Chemistry, University of California, Berkeley, California 94720 Sabre Kais Deartment of Chemistry, Purdue University, West

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition TNN-2007-P-0332.R1 1 Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information

Proximal methods for the latent group lasso penalty

Proximal methods for the latent group lasso penalty Proximal methods for the latent grou lasso enalty The MIT Faculty has made this article oenly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Villa,

More information

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of..

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of.. IOSR Journal of Mathematics (IOSR-JM) e-issn: 78-578, -ISSN: 319-765X. Volume 1, Issue 4 Ver. III (Jul. - Aug.016), PP 53-60 www.iosrournals.org Asymtotic Proerties of the Markov Chain Model method of

More information