© 2007 Society for Industrial and Applied Mathematics


SIAM J. MATRIX ANAL. APPL., Vol. 29. © 2007 Society for Industrial and Applied Mathematics.

A SUPERFAST ALGORITHM FOR TOEPLITZ SYSTEMS OF LINEAR EQUATIONS

S. CHANDRASEKARAN, M. GU, X. SUN, J. XIA, AND J. ZHU

Abstract. In this paper we develop a new superfast solver for Toeplitz systems of linear equations. To solve Toeplitz systems many people use displacement equation methods. With displacement structures, Toeplitz matrices can be transformed into Cauchy-like matrices using the FFT or other trigonometric transformations. These Cauchy-like matrices have a special property: their off-diagonal blocks have small numerical ranks. This low-rank property plays a central role in our superfast Toeplitz solver. It enables us to quickly approximate the Cauchy-like matrices by structured matrices called sequentially semiseparable (SSS) matrices. The major work of the construction of these SSS forms can be done in precomputations independent of the Toeplitz matrix entries. These SSS representations are compact because of the low-rank property. The SSS Cauchy-like systems can be solved in linear time with linear storage. Excluding precomputations, the main operations are the FFT and the SSS system solve, which are both very efficient. Our new Toeplitz solver is stable in practice. Numerical examples are presented to illustrate the efficiency and the practical stability.

Key words. displacement equation, SSS structure, superfast algorithm, Toeplitz matrix

AMS subject classifications. 15A06, 65F05, 65G05

DOI. 10.1137/…

1. Introduction. Toeplitz systems of linear equations arise in many applications, including PDE solving, signal processing, time series analysis, orthogonal polynomials, and many others. A Toeplitz system is a linear system

(1.1)  T x = b

whose coefficient matrix is a Toeplitz matrix

(1.2)  T = [ t_0      t_{-1}   t_{-2}   ...  t_{-(N-1)} ]
           [ t_1      t_0      t_{-1}   ...  t_{-(N-2)} ]
           [ t_2      t_1      t_0      ...  t_{-(N-3)} ]
           [ ...                        ...  ...        ]
           [ t_{N-1}  t_{N-2}  t_{N-3}  ...  t_0        ],

that is, its entries are constant along every diagonal (a matrix whose entries are constant along every antidiagonal is called a Hankel matrix). The vector
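For concreteness, the constant-diagonal structure in (1.2) is determined by the 2N − 1 entries t_{-(N-1)}, …, t_{N-1}. The following small NumPy sketch (illustrative, not from the paper) builds T from those entries and checks the diagonal property.

```python
import numpy as np

def toeplitz_from_vector(t, N):
    """Build an N x N Toeplitz matrix T with T[i, j] = t[(i - j) + (N - 1)],
    where t lists t_{-(N-1)}, ..., t_0, ..., t_{N-1}."""
    assert len(t) == 2 * N - 1
    T = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            T[i, j] = t[(i - j) + (N - 1)]
    return T

N = 4
t = np.arange(-(N - 1), N, dtype=float)  # t_k = k, just for illustration
T = toeplitz_from_vector(t, N)
# Entries are constant along every diagonal: T[i+1, j+1] == T[i, j].
assert np.allclose(T[1:, 1:], T[:-1, :-1])
```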
t = (t_{-(N-1)}, …, t_{-1}, t_0, t_1, …, t_{N-1})^T is called the Toeplitz vector that generates T. There are both direct and iterative methods for solving (1.1). Direct solvers are said to be fast if they cost O(N^2) operations; examples include Schur-type methods,

Received by the editors October 9, 2004; accepted for publication (in revised form) by D. A. Bini May 8, 2007; published electronically December 3, 2007.

Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106 (shiv@ece.ucsb.edu).

Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 (mgu@math.berkeley.edu, zhujiang.cal@gmail.com).

Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720 (sunxt@cal.berkeley.edu).

Department of Mathematics, University of California at Los Angeles, Los Angeles, CA (jxia@math.ucla.edu).

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

Levinson-type methods, and others [32]. An important type of direct solver is the displacement equation type fast solver based on Gaussian elimination. Some known displacement equation type methods are the Heinig [28], GKO [22], and Gu [26] methods; those methods have complexity O(N^2). Methods with complexity less than O(N^2) are called superfast. In this paper we will present a displacement equation type superfast algorithm.

1.1. Fast and superfast methods. Many fast and superfast methods that have been developed are numerically unstable [5, 5, 6, 7, 37]. References [5] and [5] showed that the Schur algorithm and the Levinson algorithm are weakly stable in some cases, but both may be highly unstable in the case of an indefinite and nonsymmetric matrix. Stable generalized Schur algorithms [32] and look-ahead algorithms were developed in [9, 10]. High-performance look-ahead Schur algorithms were presented in [20]. Many other solvers use the FFT or other trigonometric transforms to convert Toeplitz (or even Hankel or Toeplitz-plus-Hankel) matrices into generalized Cauchy or Vandermonde matrices, which can be done stably in O(N log N) operations. This is also the approach that we will use in this paper, with the aid of a displacement structure.

The concept of displacement structure was first introduced in [3]. The Sylvester-type displacement equation for a matrix Ĉ ∈ R^{N×N} [29] is

(1.3)  Ω Ĉ − Ĉ Λ = U V,

where Ω, Λ ∈ R^{N×N}, U ∈ R^{N×α}, V ∈ R^{α×N}, and α (≤ N) is the displacement rank with respect to Ω and Λ if rank(UV) = α. The matrix Ĉ is considered to possess a displacement structure with respect to Ω and Λ if α ≪ N. With displacement structures it was shown in [9, 2, 36, 22, 28] that Toeplitz and Hankel matrices can be transformed into Cauchy-like matrices of the following form:

Ĉ = [ u_i^T v_j / (η_i − λ_j) ]_{i,j=1}^{N},   u_i, v_j ∈ R^α,

where we assume that η_i ≠ λ_j for 1 ≤ i, j ≤ N. Equivalently, a Cauchy-like matrix is the unique solution to the displacement equation (1.3) with Ω = diag(η_1, …, η_N), Λ = diag(λ_1, …, λ_N), U = (u_1, …, u_N)^T, and V =
(v_1, …, v_N). In particular, Ĉ is a Cauchy matrix if u_i^T v_j = 1 for all i and j. The displacement rank of Ĉ is at most α.

To solve Toeplitz systems through Cauchy-like matrices, many people have utilized matrix factorizations. Gohberg and Olshevsky [23] presented a fast variation of the straightforward Gaussian elimination with partial pivoting (GEPP) procedure to solve a Cauchy-like linear system of equations in O(N^2) operations. Among other results, Gohberg, Kailath, and Olshevsky [22] developed algorithm GKO, an improved version of Heinig's algorithm [28], and demonstrated numerically that it is stable. In their algorithm, the Hankel matrix and the Toeplitz-plus-Hankel matrix are also transformed via fast trigonometric transforms into Cauchy-like matrices.
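The equivalence between the Cauchy-like form and the displacement equation (1.3) is easy to verify numerically. The following hedged sketch (illustrative node sets and random generators, not from the paper) builds a Cauchy-like matrix from its generators and checks that it satisfies Ω Ĉ − Ĉ Λ = UV with displacement rank at most α.

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 6, 2

# Distinct node sets eta_i != lambda_j and rank-alpha generators.
eta = np.linspace(1.0, 2.0, N)
lam = np.linspace(3.0, 4.0, N)
U = rng.standard_normal((N, alpha))
V = rng.standard_normal((alpha, N))

# Cauchy-like matrix: C[i, j] = u_i^T v_j / (eta_i - lambda_j).
C = (U @ V) / (eta[:, None] - lam[None, :])

# C solves the displacement equation
# Omega C - C Lambda = U V with Omega = diag(eta), Lambda = diag(lam).
residual = np.diag(eta) @ C - C @ np.diag(lam) - U @ V
assert np.allclose(residual, 0.0)
```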

Gu presented a modified algorithm in [26] to avoid extra error growth. This algorithm is numerically stable, provided that the element growth in the computed factorization is not large. The algorithm takes O(N^2) operations and is a fast stable method.

Superfast algorithms appeared in [3,, 6, 7, 35,, 2, 2] and many others. Superfast algorithms use divide-and-conquer strategies; Morf developed the first such idea in [3]. These methods are unstable for nonsymmetric systems, as they cannot deal with nearly singular leading principal submatrices. Van Barel and Kravanja presented a superfast method for rational interpolation at roots of unity [39]. A similar idea was then applied to Toeplitz systems [38]; it provided an explicit formula for the inverse of a Toeplitz matrix. Additional techniques such as iterative refinement and downdating were still required to stabilize their algorithm.

1.2. Main results. Our new Toeplitz solver is also of the displacement equation type. Given a Toeplitz linear system, we first use the FFT to transform the associated Toeplitz matrix into a Cauchy-like matrix. Then, instead of using matrix factorizations, which often cost O(N^2) or more, we exploit a special low-rank property of Cauchy-like matrices: every off-diagonal block of a Cauchy-like matrix has low numerical rank. Using this low-rank property, we approximate the Cauchy-like matrix by a low-rank matrix structure called the sequentially semiseparable (SSS) matrix, proposed by Chandrasekaran et al. [2, 3]. A system whose coefficient matrix is in compact SSS form can be solved with only O(p^2 N) operations, where N is the matrix dimension and p is the complexity of the semiseparable description. The SSS solver is practically stable in our numerical tests and those in [2, 3].

The SSS structure was developed to capture the low-rank property of the off-diagonal blocks of a matrix and, at the same time, to maintain stability or practical stability. It is a matrix
analog of semiseparable integral kernels in Kailath's paper [30]. Matrix operations with compact SSS representations are very efficient, provided that such compact representations exist or can be easily computed. This turns out to be true in our case, since the Cauchy-like matrices are transformed from Toeplitz matrices. We use a recursive compression scheme with a shifting strategy to construct compact SSS forms for those Cauchy-like matrices. The major work of the compressions can be precomputed on certain Cauchy matrices which are independent of the actual Toeplitz matrix entries. The overall algorithm thus has the following stages:
1. Precompute compressions of off-diagonal blocks of Cauchy matrices.
2. Transform the Toeplitz matrix into a Cauchy-like matrix in O(N log N) operations.
3. Construct a compact SSS representation from the precomputed compressions and solve the Cauchy-like matrix system in O(p^2 N) operations.
4. Recover the solution of the Toeplitz system in O(N log N) operations.
The stages above are either stable or practically stable. Our numerical results indicate that the overall algorithm is stable in practice. The Toeplitz matrix does not have to be symmetric or positive definite, and no extra stabilizing step is necessary. After the precomputations, the total cost of the algorithm is O(N log N) + O(p^2 N). This indicates that the entire algorithm is superfast.

We also point out that similar techniques are used in [33], where the low-rank property is exploited through the block columns without diagonals (called neutered block columns in [33]), in contrast with the off-diagonal blocks here. The compressions of either the neutered block columns or the off-diagonal blocks give data-sparse representations which enable fast factorizations of the Cauchy-like matrices. In fact, corresponding to neutered block rows or columns, there are also matrix representations called hierarchically semiseparable (HSS) matrices [], which are usually more complicated structures than SSS matrices.

1.3. Overview. We will discuss the displacement structure and the transformation from a Toeplitz problem to a Cauchy-like problem in section 2, where the low-rank property of this Cauchy-like problem is also exploited. Section 3 then gives a linear-complexity solver using the SSS structure. In section 4 we present an algorithm for fast construction of SSS structures. We then analyze the complexity in section 5 and use some numerical experiments to demonstrate the efficiency and the practical stability. All algorithms have been implemented in Fortran 90. Section 6 draws some conclusions.

2. Displacement structures and low-rank property.

2.1. Cauchy-like systems. Given a Toeplitz system, we can use a displacement structure to transform it into a Cauchy-like system. Define the δ-circulant shift matrix

Z_δ = [ 0           δ ]
      [ 1   0         ]
      [     …   …     ]
      [         1   0 ]  ∈ R^{N×N}

(ones on the subdiagonal, δ in the top right corner), and let Ω = Z_1 and Λ = Z_{−1} in (1.3). Kailath, Kung, and Morf [3] have shown that every Toeplitz matrix satisfies the displacement equation (1.3) with Ω = Z_1 and Λ = Z_{−1}, where the right-hand side UV, having nonzero entries only in its first row and last column, is a matrix of rank at most 2. Hence the displacement rank of a Toeplitz matrix is at most 2 with respect to Z_1 and Z_{−1}. The following result can be found in [28].

Proposition 2.1. Let Ĉ ∈ R^{N×N} be a matrix satisfying the displacement equation

(2.1)  Z_1 Ĉ − Ĉ Z_{−1} = U V,

where U ∈ R^{N×α} and V ∈ R^{α×N}. Then F Ĉ D_0 F^H is a Cauchy-like matrix satisfying

(2.2)  D_1 (F Ĉ D_0 F^H) − (F Ĉ D_0 F^H) D_{−1} = (F U)(V D_0 F^H),

where F = (1/√N)(ω^{2(k−1)(j−1)})_{k,j=1}^{N} is the normalized inverse discrete Fourier transform matrix, ω = e^{πi/N}, and

D_1 = diag(1, ω^2, …, ω^{2(N−1)}),  D_{−1} = diag(ω, ω^3, …, ω^{2N−1}),  D_0 = diag(1, ω, …, ω^{N−1}).

Here α = 2.

This proposition suggests that for a Toeplitz matrix T, one can convert it into the
Cauchy-like matrix in (2.2). Therefore the Toeplitz system can be readily transformed into a new system

(2.3)  C x̂ = b̂,
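The two displacement facts above can be checked numerically. The sketch below (illustrative; the scaling matrices are chosen with one consistent sign/conjugation convention, which may differ from the paper's by conjugation) verifies that Z_1 T − T Z_{−1} has rank at most 2 and that the FFT-transformed matrix inherits displacement rank at most 2 with respect to the diagonal matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
t = rng.standard_normal(2 * N - 1)
T = np.array([[t[(i - j) + (N - 1)] for j in range(N)] for i in range(N)])

# Z_delta: ones on the subdiagonal, delta in the top-right corner.
def Z(delta, N):
    M = np.eye(N, k=-1, dtype=complex)
    M[0, -1] = delta
    return M

# Toeplitz displacement: Z_1 T - T Z_{-1} is nonzero only in its
# first row and last column, hence has rank at most 2.
D_T = Z(1, N) @ T - T @ Z(-1, N)
assert np.linalg.matrix_rank(D_T, tol=1e-10) <= 2

# FFT transformation to a Cauchy-like matrix.
k = np.arange(N)
w = np.exp(1j * np.pi / N)
F = w ** (2 * np.outer(k, k)) / np.sqrt(N)        # inverse-DFT-style matrix
D1 = np.diag(w ** (2 * k))
Dm1 = np.diag(w ** (2 * k + 1))
D0 = np.diag(w ** (-k))                           # convention chosen for this sketch
C = F @ T @ D0 @ F.conj().T

# C inherits displacement rank <= 2 with respect to D1 and Dm1.
assert np.linalg.matrix_rank(D1 @ C - C @ Dm1, tol=1e-10) <= 2
```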

where C has the form

(2.4)  C = [ u_i^T v_j / (ω^{2i} − ω^{2j+1}) ]_{i,j=1}^{N},   u_i, v_j ∈ R^α.

In section 3 we will present a fast solver for (2.3). After obtaining x̂ we will then recover x with an FFT again. All the stages involving the FFT are stable and cost O(N log N) operations. The solver for (2.3) has linear complexity and turns out to be practically stable. Thus the total cost of our algorithm is bounded by O(N log N) + O(N p^2), where p is a parameter that will be described below. This indicates that our method is a superfast one with practical stability.

2.2. Low-rank property of Cauchy-like matrices. In this section, we will show a low-rank property of C, i.e., every off-diagonal block of C has a low numerical rank. This property is the basis of the superfast SSS solver in section 3.

First, a simple numerical experiment can give us an idea of the low-rank property of C. To find out the numerical ranks we can use one of the following tools:
1. τ-accurate SVD: singular values less than τ are dropped if τ is an absolute tolerance, or singular values less than τ times the largest singular value are dropped if τ is a relative tolerance.
2. τ-accurate QR: A ≈ QR, A: m × n, Q: m × k, R: k × n, k ≤ l = min(m, n), which is obtained in the following way. Compute the exact QR factorization A = Q̂R̂, where Q̂ is m × l and R̂ is l × n with diagonal entries satisfying |R̂_11| ≥ |R̂_22| ≥ ⋯ ≥ |R̂_ll|. Then obtain R by dropping all rows of R̂ with diagonal entries less than τ if τ is an absolute tolerance, or with diagonal entries less than τ|R̂_11| if τ is a relative tolerance. Drop the relevant columns of Q̂ accordingly to obtain Q.
Later, by ranks we mean numerical ranks.

Here we take some random Toeplitz matrices of different sizes. We then transform them into Cauchy-like matrices C and compute the numerical ranks of their off-diagonal blocks. For simplicity, we compute the ranks for the blocks C(1:d, d+1:N), C(1:2d, 2d+1:N), …, C(1:kd, kd+1:N), …, where d is a fixed integer, and these blocks are numbered as blocks 1, 2, …, k as shown in Figure 2.1.
Here, k = 8 off-diagonal blocks for each of three N × N Cauchy-like matrices are considered. See Table 2.1 for the numerical ranks, where we use the τ-accurate SVD with τ an absolute tolerance. We can see that the numerical ranks are relatively small as compared to the block sizes, and when we double the dimension of the matrix, the numerical ranks do not increase much. This is more significant when a larger τ is used.

Fig. 2.1. Off-diagonal blocks numbered as 1, 2, 3, … (upper triangular part only).
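The rank experiment is easy to reproduce. The following sketch (illustrative sizes; τ = 10^{-9} as an absolute tolerance, as in the experiment above) builds a Cauchy-like matrix on the unit-circle nodes with random rank-2 numerators and measures off-diagonal numerical ranks by τ-accurate SVD.

```python
import numpy as np

def tau_rank(A, tau):
    """Numerical rank: number of singular values >= tau (absolute tolerance)."""
    return int(np.sum(np.linalg.svd(A, compute_uv=False) >= tau))

rng = np.random.default_rng(2)
N, d, alpha = 128, 16, 2
idx = np.arange(1, N + 1)
w = np.exp(1j * np.pi / N)
lam = w ** (2 * idx)            # row nodes  omega^{2i}
eta = w ** (2 * idx + 1)        # column nodes omega^{2j+1}
U = rng.standard_normal((N, alpha))
V = rng.standard_normal((alpha, N))
C = (U @ V) / (lam[:, None] - eta[None, :])

# Ranks of the blocks C(1:kd, kd+1:N) for k = 1, 2, ...
ranks = [tau_rank(C[: k * d, k * d :], 1e-9) for k in range(1, N // d)]

# The large middle block is far from full rank.
r_mid = tau_rank(C[: N // 2, N // 2 :], 1e-9)
assert r_mid < N // 2
```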

Table 2.1. Off-diagonal numerical ranks of Cauchy-like matrices transformed from some Toeplitz matrices, with absolute tolerance τ = 10^{-9} (block sizes and ranks reported for N = 640, 1280, 2560; block # 1, 2, …).

The low-rank property can be verified theoretically in the following way. We first consider the special case of (2.4) where all u_i^T v_j = 1. We show that the following Cauchy matrix has low-rank off-diagonal blocks:

(2.5)  C_0 = [ 1/(ω^{2i} − ω^{2j+1}) ]_{i,j=1}^{N}.

The central idea is similar to that in the fast multipole method [25, 8], which implies that with well-separated points (see, e.g., Figure 2.2) an interaction matrix is numerically low-rank. Here we introduce two sets of points {λ_k}_{k=1}^{N} = {ω^{2k}}_{k=1}^{N} and {η_k}_{k=1}^{N} = {ω^{2k+1}}_{k=1}^{N} on the unit circle. When we consider an off-diagonal block of C_0 of the form

G = [ 1/(λ_i − η_j) ]_{1≤i≤p, p+1≤j≤N},

we can show that G is numerically low-rank. In fact, G corresponds to two well-separated sets {λ_k}_{k=1}^{p} and {η_j}_{j=p+1}^{N}; that is, there exists a point c ∈ C such that

(2.6)  |λ_i − c| > d + e,  i = 1, …, p,    |η_j − c| < d,  j = p+1, …, N,

where d, e are positive constants. Consider the expansion

(2.7)  1/(λ_i − η_j) = 1/((λ_i − c) − (η_j − c)) = Σ_{k=0}^{r} (η_j − c)^k / (λ_i − c)^{k+1} + O( |(η_j − c)/(λ_i − c)|^{r+1} )

(2.8)                = Σ_{k=0}^{r} (η_j − c)^k / (λ_i − c)^{k+1} + ε,

where ε = O(|(η_j − c)/(λ_i − c)|^{r+1}) and r is chosen so that ε is bounded by a given tolerance.

Fig. 2.2. Well-separated sets in the plane.
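The truncated expansion above yields an explicit low-rank factorization, which can be checked numerically. The sketch below (illustrative separation constants d = 0.4, e = 1.6 with c = 0) builds the rank-(r + 1) factors and verifies the error bound.

```python
import numpy as np

# Well-separated sets: lambda's outside a disk of radius d + e around c,
# eta's inside radius d (c = 0 here for simplicity; illustrative values).
rng = np.random.default_rng(3)
p, q, r = 40, 40, 12
c = 0.0
eta = 0.4 * np.exp(2j * np.pi * rng.random(q))   # |eta - c| < d = 0.4
lam = 2.0 * np.exp(2j * np.pi * rng.random(p))   # |lam - c| > d + e = 2

G = 1.0 / (lam[:, None] - eta[None, :])

# Truncated expansion 1/(lam - eta) ~ sum_{k=0}^{r} (eta-c)^k / (lam-c)^{k+1}
# gives an explicit rank-(r+1) factorization X @ Y.
X = np.column_stack([1.0 / (lam - c) ** (k + 1) for k in range(r + 1)])
Y = np.vstack([(eta - c) ** k for k in range(r + 1)])

# Error bounded by roughly (d / (d + e))^{r+1} = 0.2^{13}.
assert np.max(np.abs(G - X @ Y)) < 1e-8
```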

We have the estimate

|(η_j − c)/(λ_i − c)|^{r+1} < ( d/(d + e) )^{r+1} < 1,

which enables us to find an appropriate r according to the tolerance. Thus

(2.9)  G = [ Σ_{k=0}^{r} (η_j − c)^k / (λ_i − c)^{k+1} ]_{1≤i≤p, p+1≤j≤N} + ε̂

         = [ 1/(λ_1−c)  1/(λ_1−c)^2  ⋯  1/(λ_1−c)^{r+1} ]   [ 1              ⋯  1            ]
           [ 1/(λ_2−c)  1/(λ_2−c)^2  ⋯  1/(λ_2−c)^{r+1} ] · [ η_{p+1}−c      ⋯  η_N−c        ] + ε̂.
           [ ⋮                                           ]   [ ⋮                              ]
           [ 1/(λ_p−c)  1/(λ_p−c)^2  ⋯  1/(λ_p−c)^{r+1} ]   [ (η_{p+1}−c)^r  ⋯  (η_N−c)^r    ]

Therefore the numerical rank of G is at most r + 1, up to an error ε̂.

Now we can return to the Cauchy-like matrix (2.4). A similar argument shows that any off-diagonal block of C satisfies

Ĝ = [ u_i^T v_j / (λ_i − η_j) ]_{1≤i≤p, p+1≤j≤N} ≈ [ Σ_{k=0}^{r} u_i^T (η_j − c)^k / (λ_i − c)^{k+1} v_j ]_{1≤i≤p, p+1≤j≤N}

  = [ u_1^T/(λ_1−c)  u_1^T/(λ_1−c)^2  ⋯  u_1^T/(λ_1−c)^{r+1} ]   [ v_{p+1}                ⋯  v_N            ]
    [ ⋮                                                       ] · [ (η_{p+1}−c) v_{p+1}    ⋯  (η_N−c) v_N    ].
    [ u_p^T/(λ_p−c)  u_p^T/(λ_p−c)^2  ⋯  u_p^T/(λ_p−c)^{r+1} ]   [ (η_{p+1}−c)^r v_{p+1}  ⋯  (η_N−c)^r v_N  ]

That is, we replace the entries of the two matrix factors in (2.9) with appropriate vectors. Thus the numerical rank of Ĝ will be no larger than α(r + 1), which is relatively small as compared to N.

3. SSS structures and superfast SSS solver.

3.1. SSS representations. To take advantage of the low-rank property of the Cauchy-like matrix C, we can use the SSS structures introduced by Chandrasekaran et al. [2, 3]. The SSS structure nicely captures the ranks of off-diagonal blocks of a matrix such as shown in Figure 2.1. A matrix A ∈ C^{M×M̃} satisfies the SSS structure if there exist 2n positive integers m_1, …, m_n and m̃_1, …, m̃_n with M = m_1 + ⋯ + m_n and M̃ = m̃_1 + ⋯ + m̃_n to block-partition A as A = (A_{ij})_{i,j=1}^{n}, where A_{ij} ∈ C^{m_i×m̃_j} satisfies

(3.1)  A_{ij} = D_i                            if i = j,
               U_i W_{i+1} ⋯ W_{j−1} V_j^H     if i < j,
               P_i R_{i−1} ⋯ R_{j+1} Q_j^H     if i > j.

Here the superscript H denotes the Hermitian transpose, and empty products are defined to be identity matrices.

Table 3.1. Dimensions of matrices in (3.1).
Matrix      U_i         V_i              W_i              P_i         Q_i              R_i
Dimensions  m_i × k_i   m̃_i × k_{i−1}   k_{i−1} × k_i    m_i × l_i   m̃_i × l_{i+1}   l_{i+1} × l_i

The matrices {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} are called generators for the SSS structure, and their dimensions are defined in Table 3.1. As an example, the matrix A with n = 4 has the form

(3.2)  A = [ D_1              U_1V_2^H      U_1W_2V_3^H   U_1W_2W_3V_4^H ]
           [ P_2Q_1^H         D_2           U_2V_3^H      U_2W_3V_4^H    ]
           [ P_3R_2Q_1^H      P_3Q_2^H      D_3           U_3V_4^H       ]
           [ P_4R_3R_2Q_1^H   P_4R_3Q_2^H   P_4Q_3^H      D_4            ].

The SSS representation (3.2) is related to the off-diagonal blocks in Figure 2.1 in the way that the upper off-diagonal blocks numbered 1, 2, and 3 are

U_1 [ V_2^H  W_2V_3^H  W_2W_3V_4^H ],   [ U_1W_2 ] [ V_3^H  W_3V_4^H ],   [ U_1W_2W_3 ] V_4^H.
                                        [ U_2    ]                        [ U_2W_3    ]
                                                                          [ U_3       ]

Appropriate row and column bases of the off-diagonal blocks are clearly reflected. The SSS structure depends on the sequences {m_i} and {m̃_i} and on the SSS generation scheme. If A is a square M × M matrix, then we can have the simpler situation m̃_i = m_i, i = 1, …, n. SSS matrices are closed under addition, multiplication, inversion, etc., although the sizes of the generators may increase. While any matrix can be represented in this form for sufficiently large k_i's and l_i's (the column dimensions of the U_i's and P_i's, respectively), our main focus will be on SSS matrices which have low-rank off-diagonal blocks and have generators with k_i's and l_i's close to those ranks. We say these SSS matrices are compact. In particular, Cauchy-like matrices can have compact SSS forms.

Using SSS structures, we can take advantage of the superfast SSS system solver in [2, 3] to solve the Cauchy-like systems. The solver is efficient when the SSS form is compact, and is practically stable. The solver shares similar ideas with that for banded plus semiseparable systems in []. Here we briefly describe the main ideas of the solver in [2, 3]. We consider solving the linear system Ax = b, where A ∈ C^{N×N} satisfies (3.1) and b is unstructured. The solver computes an implicit ULV decomposition of A, where U and V are
orthogonal matrices. Before we present the formal algorithm, we demonstrate the key ideas on a 4 × 4 block matrix example.

3.2. SSS solver: example. Let the initial system Ax = b be partitioned as follows:

(3.3)  [ D_1              U_1V_2^H      U_1W_2V_3^H   U_1W_2W_3V_4^H ] [ x_1 ]   [ b_1 ]   [ 0         ]
       [ P_2Q_1^H         D_2           U_2V_3^H      U_2W_3V_4^H    ] [ x_2 ] = [ b_2 ] − [ P_2       ] ξ,
       [ P_3R_2Q_1^H      P_3Q_2^H      D_3           U_3V_4^H       ] [ x_3 ]   [ b_3 ]   [ P_3R_2    ]
       [ P_4R_3R_2Q_1^H   P_4R_3Q_2^H   P_4Q_3^H      D_4            ] [ x_4 ]   [ b_4 ]   [ P_4R_3R_2 ]

where the dimensions of the generators follow those in Table 3.1 (with m̃_i = m_i) and the vector ξ = 0. The extra zero vector ξ on the right-hand side of (3.3) has been added for the purpose of a general recursive pattern. The algorithm has two main stages, compression (or elimination) and merging, depending on the relationship between k_i (or l_i) and m_i in the intermediate procedure.

3.2.1. Compression. At the beginning, k_1 < m_1 because of the low-rank property described earlier. We apply a unitary transformation q_1^H to U_1 so that the first m_1 − k_1 rows of U_1 become zeros:

(3.4)  q_1^H U_1 = [ 0   ]  (m_1 − k_1 rows)
                   [ Û_1 ]  (k_1 rows).

Now we multiply diag(q_1^H, I) into the system:

diag(q_1^H, I) A x = diag(q_1^H, I) b − diag(q_1^H, I) [ 0 ; P_2 ; P_3R_2 ; P_4R_3R_2 ] ξ.

We pick another unitary transformation w_1 to lower-triangularize q_1^H D_1, the (1,1) diagonal block, i.e.,

q_1^H D_1 w_1 = [ D̄_11   0     ]  with D̄_11 lower triangular of order m_1 − k_1 and D̄_22 of order k_1.
                [ D̄_21   D̄_22 ]

Then system (3.3) becomes

[ D̄_11             0                0           0             0               ] [ z_1 ]   [ β_1 ]   [ 0         ]
[ D̄_21             D̄_22            Û_1V_2^H    Û_1W_2V_3^H   Û_1W_2W_3V_4^H  ] [ x̂_1 ] = [ γ_1 ] − [ 0        ] ξ,
[ P_2Q̄_1^H         P_2Q̂_1^H        D_2         U_2V_3^H      U_2W_3V_4^H     ] [ x_2 ]   [ b_2 ]   [ P_2       ]
[ P_3R_2Q̄_1^H      P_3R_2Q̂_1^H     P_3Q_2^H    D_3           U_3V_4^H        ] [ x_3 ]   [ b_3 ]   [ P_3R_2    ]
[ P_4R_3R_2Q̄_1^H   P_4R_3R_2Q̂_1^H  P_4R_3Q_2^H P_4Q_3^H      D_4             ] [ x_4 ]   [ b_4 ]   [ P_4R_3R_2 ]

where we have used the partitions

w_1^H x_1 = [ z_1 ; x̂_1 ],   q_1^H b_1 = [ β_1 ; γ_1 ],   w_1^H Q_1 = [ Q̄_1 ; Q̂_1 ],

with z_1, β_1, Q̄_1 having m_1 − k_1 rows and x̂_1, γ_1, Q̂_1 having k_1 rows.

At this point, we can solve for z_1 from the triangular system D̄_11 z_1 = β_1. We also subtract D̄_21 z_1 from the right-hand side to obtain b̂_1 = γ_1 − D̄_21 z_1. Then we can discard the first m_1 − k_1 rows and columns of the coefficient matrix of the system to obtain

[ D̄_22            Û_1V_2^H     U_1̂W_2V_3^H   Û_1W_2W_3V_4^H ] [ x̂_1 ]   [ b̂_1 ]   [ 0         ]
[ P_2Q̂_1^H        D_2          U_2V_3^H      U_2W_3V_4^H    ] [ x_2 ] = [ b_2 ] − [ P_2       ] ξ̂,
[ P_3R_2Q̂_1^H     P_3Q_2^H     D_3           U_3V_4^H       ] [ x_3 ]   [ b_3 ]   [ P_3R_2    ]
[ P_4R_3R_2Q̂_1^H  P_4R_3Q_2^H  P_4Q_3^H      D_4            ] [ x_4 ]   [ b_4 ]   [ P_4R_3R_2 ]

where ξ̂ = ξ + Q̄_1^H z_1. This new system has a similar structure to the original one but with smaller dimension. We can continue to solve it by recursion, if further compressions of the blocks such as (3.4) are possible. Note that the actual solution, say x_1, can be recovered by x_1 = w_1 [ z_1 ; x̂_1 ].

3.2.2. Merging. During the recursive eliminations there are situations when k_i is no longer smaller than m_i and no further compression is possible. We are then unable to introduce more zeros into the system. Now we proceed by merging appropriate block rows and columns of the matrix. As an example, we can merge the first two block rows and columns and rewrite the system of equations as follows. The system becomes

[ D̂_1          Û_1V_3^H    Û_1W_3V_4^H ] [ x̂_1 ]   [ b̂_1 ]   [ 0      ]
[ P_3Q̂_1^H     D_3         U_3V_4^H    ] [ x_3 ] = [ b_3 ] − [ P_3    ] ξ̄,
[ P_4R_3Q̂_1^H  P_4Q_3^H    D_4         ] [ x_4 ]   [ b_4 ]   [ P_4R_3 ]

where

D̂_1 = [ D_1       U_1V_2^H ],   x̂_1 = [ x_1 ],   b̂_1 = [ b_1          ],
      [ P_2Q_1^H  D_2      ]          [ x_2 ]          [ b_2 − P_2 ξ  ]

Û_1 = [ U_1W_2 ],   Q̂_1 = [ Q_1R_2^H ],   ξ̄ = R_2 ξ.
      [ U_2    ]          [ Q_2      ]

The number of block rows/columns is reduced by one. Further compressions become possible and we can proceed to solve the system recursively. In the case n = 1, we have the system D_1 x_1 = b_1 − 0·ξ, which is solved directly.
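The block structure used throughout this example, and the two unitary transformations of the compression stage, can be sanity-checked numerically. The sketch below (illustrative sizes and random generators, not the paper's Fortran implementation) builds a dense matrix from SSS generators, confirms the off-diagonal blocks have rank at most k, and reproduces the zero patterns produced by q_1 and w_1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 4, 5, 2   # n blocks of size m, generator rank k < m (square case)

D = [rng.standard_normal((m, m)) for _ in range(n)]
U = [rng.standard_normal((m, k)) for _ in range(n)]
V = [rng.standard_normal((m, k)) for _ in range(n)]
W = [rng.standard_normal((k, k)) for _ in range(n)]
P = [rng.standard_normal((m, k)) for _ in range(n)]
Q = [rng.standard_normal((m, k)) for _ in range(n)]
R = [rng.standard_normal((k, k)) for _ in range(n)]

def sss_block(i, j):
    """A_{ij} from the generators, 0-indexed; empty products are identities."""
    if i == j:
        return D[i]
    if i < j:
        M = U[i]
        for s in range(i + 1, j):        # W_{i+1} ... W_{j-1}
            M = M @ W[s]
        return M @ V[j].T                # real case: V^H = V^T
    M = P[i]
    for s in range(i - 1, j, -1):        # R_{i-1} ... R_{j+1}
        M = M @ R[s]
    return M @ Q[j].T

A = np.block([[sss_block(i, j) for j in range(n)] for i in range(n)])
assert np.linalg.matrix_rank(A[:m, m:]) <= k   # off-diagonal blocks have rank <= k

# Compression step: q1 zeroes the first m - k rows of U_1 ...
Qf, _ = np.linalg.qr(U[0], mode="complete")    # Qf^T U_1 = [R; 0]
q1 = Qf[:, ::-1]                               # reverse columns: zeros move on top
assert np.allclose((q1.T @ U[0])[: m - k], 0.0)

# ... and w1 lower-triangularizes q1^T D_1 (an LQ factorization via QR of the transpose).
B = q1.T @ D[0]
Q2, _ = np.linalg.qr(B.T, mode="complete")
L = B @ Q2
assert np.allclose(np.triu(L, 1), 0.0)         # L is lower triangular
# The small triangular leading block L[:m-k, :m-k] is what is solved for z_1.
```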

3.3. General solve algorithm. We now present a short description of the general algorithm. The procedure in the example can be directly extended to a general system. We assume that the matrix A is in compact SSS form, represented by the generators {U_i}_{i=1}^{n−1}, {V_i}_{i=2}^{n}, {W_i}_{i=2}^{n−1}, {P_i}_{i=2}^{n}, {Q_i}_{i=1}^{n−1}, {R_i}_{i=2}^{n−1}, and {D_i}_{i=1}^{n} as in (3.1). We also partition x = (x_i) and b = (b_i) such that x_i and b_i have m_i rows.

As in the example, there are two stages at each step of the recursion. In the compression stage, we perform orthogonal eliminations on both sides of A to create an (m_1 − k_1) × (m_1 − k_1) lower triangular submatrix at the top left corner of A. Then we solve a small triangular system and obtain the first few components of the solution vector. At this stage, we are left with a new system with fewer unknowns; hence we can carry out a recursion. In the merging stage, we merge the first two block rows and columns of A while still maintaining the SSS structure; the numbers of block rows and columns are reduced by one. Combining these two stages, we can proceed with recursion to solve the system. When n = 1 we can solve the linear system directly with standard solvers.

The SSS solver has a complexity of O(Np^2) [2, 3], where p is the maximum numerical rank of the off-diagonal blocks of A, as compared to the traditional O(N^3) cost for a general dense N × N matrix. We use only orthogonal transformations and a single substitution in the SSS solver. Although a formal proof of the backward stability is not yet available, the solver is shown to be practically stable. The reader is referred to [2, 3] for more discussions of the stability.

4. Fast construction of an SSS representation for C. According to section 3, a system in compact SSS form can be solved very efficiently. We can thus use that algorithm to solve the Cauchy-like system (2.3), provided that C can be quickly written in SSS form. Therefore we try to find an efficient construction scheme. Here we provide a divide-and-conquer SSS
construction scheme by using the fast merging and splitting strategy in [3]. If we know further that the matrix has low-rank off-diagonal blocks and it is easy to compress the off-diagonal blocks, then the construction can be superfast. Here we will concentrate on this situation, as Cauchy-like matrices have this low-rank property. We first present a general divide-and-conquer construction algorithm and then describe a fast shifting strategy to compress the off-diagonal blocks of C.

4.1. General divide-and-conquer construction algorithm. In this section, we discuss the construction of the SSS structure of a matrix A when the partition sequence {m_i}_{i=1}^{n} is given. The general construction methods can be applied to any unstructured matrix, thus proving that any matrix has an SSS structure. Of course, k_i and l_i will usually be large in this case, precluding any speedups. These methods can be viewed as specific ways to make the realization algorithm of [8] more efficient.

Suppose we are given an N × N matrix A and a partition sequence {m_i}_{i=1}^{n} with Σ_{i=1}^{n} m_i = N. Starting with an appropriate sequence {m̄_1, m̄_2}, where m̄_1 = Σ_{i=1}^{k} m_i and m̄_2 = Σ_{i=k+1}^{n} m_i, we can first partition A into a 2 × 2 block matrix and then construct a simple SSS form

(4.1)  A = [ D_1  B   ] = [ D_1       U_1V_2^H ]  (block sizes m̄_1, m̄_2),
           [ E    D_2 ]   [ P_2Q_1^H  D_2      ]

where U_1 V_2^H = B and P_2 Q_1^H = E

are the low-rank SVDs of the off-diagonal blocks B and E. Note that if the compressions are done by τ-accurate SVD approximations or rank-revealing QR decompositions, then the equality signs should be replaced by approximation signs.

We now split either the (1,1) block or the (2,2) block to obtain a 3 × 3 block SSS matrix. For instance, we can split the (1,1) block according to an appropriate new sequence {m̂_1, m̂_2, m̂_3}, where m̂_1 + m̂_2 = m̄_1 and m̂_3 = m̄_2, so that

[ D_1^new             U_1^new (V_2^new)^H ] = D_1,   [ U_1^new W_2^new ] = U_1,   V_3^new = V_2,
[ P_2^new (Q_1^new)^H  D_2^new            ]          [ U_2^new         ]

[ Q_1^new (R_2^new)^H ] = Q_1,   P_3^new = P_2,   D_3^new = D_2,
[ Q_2^new             ]

where the new generators, marked by the superscript "new," can be determined in the following way. First, we partition the matrices of the old first block conformally with the two new blocks:

D_1 = [ D_11  D_12 ],   U_1 = [ U_11 ],   Q_1 = [ Q_11 ].
      [ D_21  D_22 ]          [ U_12 ]          [ Q_12 ]

We can then identify from these and the previous equations that

D_1^new = D_11,  D_2^new = D_22,  U_2^new = U_12,  V_3^new = V_2,  Q_2^new = Q_12,  P_3^new = P_2,  D_3^new = D_2.

The remaining matrices satisfy

(4.3)  [ D_12  U_11 ] = U_1^new [ (V_2^new)^H  W_2^new ],   [ D_21   ] = [ P_2^new ] (Q_1^new)^H.
                                                            [ Q_11^H ]   [ R_2^new ]

By factorizing the left-hand side matrices using numerical tools such as the SVD and rank-revealing QR, these two equations allow us to compute the remaining matrices for the new blocks. A is thus in the new form

A = [ D_1^new                       U_1^new (V_2^new)^H   U_1^new W_2^new (V_3^new)^H ]  (block sizes m̂_1, m̂_2, m̂_3).
    [ P_2^new (Q_1^new)^H           D_2^new               U_2^new (V_3^new)^H         ]
    [ P_3^new R_2^new (Q_1^new)^H   P_3^new (Q_2^new)^H   D_3^new                     ]

We can use similar techniques if we want to split the second block row and column of (4.1). We can continue this by either splitting the first block row and block column or the last ones using the above techniques, or splitting any middle block row and block column similarly. Then we will be able to construct the desired SSS representation according to the given sequence {m_i, i = 1, 2, …, n}.

The general construction can be organized with bisection. The major cost is in compressions of off-diagonal blocks of the form [ D_12^{(i)}  X_i Y_i^H ], where X_i and Y_i are tall and thin and D_12^{(i)} is an
off-diagonal block of

D_i = [ D_11^{(i)}  D_12^{(i)} ].
      [ D_21^{(i)}  D_22^{(i)} ]

The compression can be achieved by a τ-accurate QR factorization. The construction is also practically stable in our implementation.
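The splitting step of section 4.1 can be sketched as follows (illustrative sizes and a hypothetical `low_rank` helper implementing τ-accurate SVD; this is not the paper's Fortran code). It performs the first 2 × 2 cut, splits the (1,1) block, and checks that the new generators reproduce the upper off-diagonal blocks of the 3 × 3 partition.

```python
import numpy as np

def low_rank(B, tau=1e-10):
    """tau-accurate SVD: B ~ X @ Y, keeping singular values >= tau * s_max."""
    u, s, vh = np.linalg.svd(B, full_matrices=False)
    r = int(np.sum(s >= tau * s[0]))
    return u[:, :r] * s[:r], vh[:r, :]         # X (tall), Y (flat)

rng = np.random.default_rng(5)
N, m1, mh1 = 12, 8, 4          # first cut at m1; then split block 1 at mh1
A = rng.standard_normal((N, 3)) @ rng.standard_normal((3, N))  # globally low rank,
np.fill_diagonal(A, 10.0)      # so all off-diagonal blocks compress well

# Level 1: A = [[D1, U1 @ V2h], [P2 @ Q1h, D2]].
U1, V2h = low_rank(A[:m1, m1:])
P2, Q1h = low_rank(A[m1:, :m1])
D1, D2 = A[:m1, :m1], A[m1:, m1:]

# Split block 1: the new generators satisfy [D1_12, U1_top] = U1new @ [V2new_h, W2new].
U1_top, U1_bot = U1[:mh1], U1[mh1:]
U1new, VW = low_rank(np.hstack([D1[:mh1, mh1:], U1_top]))
V2new_h, W2new = VW[:, : m1 - mh1], VW[:, m1 - mh1 :]

# Check: the three upper off-diagonal blocks of the 3x3 partition are reproduced.
assert np.allclose(U1new @ V2new_h, D1[:mh1, mh1:])      # new (1,2) block
assert np.allclose(U1new @ W2new @ V2h, A[:mh1, m1:])    # new (1,3) block
assert np.allclose(U1_bot @ V2h, A[mh1:m1, m1:])         # new (2,3) block
```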

4.2. Compression of off-diagonal blocks. For a general matrix with low-rank off-diagonal blocks, the SSS construction can cost O(N^2 p), as a compression such as those in (4.1)–(4.3) can take O(K^2 p) operations, where K is the dimension of the block being compressed. However, for the Cauchy-like matrix C in (2.3) the compressions can be precomputed.

Theorem 4.1. The compression

(4.5)  G = [ u_i^T v_j / (λ_i − η_j) ]_{1≤i≤p, q≤j≤N} = X Y^H

of the off-diagonal block G of C can be obtained from the compression of the corresponding off-diagonal block

G_0 = [ 1/(λ_i − η_j) ]_{1≤i≤p, q≤j≤N} = X_0 Y_0^H

of C_0, where the column dimension of X is no larger than twice the column dimension of X_0.

Proof. Assume the off-diagonal block G_0 = [ 1/(λ_i − η_j) ]_{1≤i≤p, q≤j≤N} is in compressed form X_0 Y_0^H, so that its (i, j) entry is 1/(λ_i − η_j) = X_{0;i,:} Y_{0;j,:}^H, where X_{0;i,:} (Y_{0;j,:}) denotes the ith (jth) row of X_0 (Y_0). As α ≤ 2, for simplicity we fix α = 2 in (2.4). Then the corresponding off-diagonal block of C is

G = [ u_i^T v_j / (λ_i − η_j) ]_{1≤i≤p, q≤j≤N} = [ (u_{i1} v_{j1} + u_{i2} v_{j2}) X_{0;i,:} Y_{0;j,:}^H ]_{1≤i≤p, q≤j≤N}

  = [ ( u_{i1} X_{0;i,:}   u_{i2} X_{0;i,:} ) ( v_{j1} Y_{0;j,:}   v_{j2} Y_{0;j,:} )^H ]_{1≤i≤p, q≤j≤N}

  = [ u_{11} X_{0;1,:}  u_{12} X_{0;1,:} ]   [ v_{q1} Y_{0;q,:}  v_{q2} Y_{0;q,:} ]^H
    [ ⋮                                  ] · [ ⋮                                  ]   =: X Y^H.
    [ u_{p1} X_{0;p,:}  u_{p2} X_{0;p,:} ]   [ v_{N1} Y_{0;N,:}  v_{N2} Y_{0;N,:} ]

That is, we get a compression of G. ∎

Theorem 4.1 indicates that we can convert the compressions of the off-diagonal blocks of C into compressions of the off-diagonal blocks of C_0, which is independent of the actual entries of the Toeplitz matrix T. This means the compressions of off-diagonal blocks of C_0 can be precomputed. The precomputation can be done in O(N^2 p) flops by a rank-revealing QR factorization such as in [27]. It is possible to reduce the cost to O(N log N), due to the fact that the compression of a large off-diagonal block can be obtained from those of small ones; this can be seen implicitly in the following subsection.

4.3. Compressions of off-diagonal blocks in precomputation. We further present a shifting strategy to reduce the cost of the compressions in the precomputation. The significance of this shifting strategy is to relate the
compressions of large off-diagonal blocks of C_0 to those of small ones. That is, in different splitting stages of the SSS construction of C, the compressions of off-diagonal blocks of different sizes can be related.
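The theorem above is constructive, and its proof translates directly into a few lines of NumPy (illustrative sizes and random generators): given a compression X_0 Y_0^H of the Cauchy block G_0, the factors for the Cauchy-like block G are obtained by row-scaling and concatenation, and the rank bound 2 × (column dimension of X_0) is visible in the shapes.

```python
import numpy as np

rng = np.random.default_rng(6)
p, q, alpha = 10, 12, 2
lam = 2.0 * np.exp(2j * np.pi * rng.random(p))
eta = 0.5 * np.exp(2j * np.pi * rng.random(q))
U = rng.standard_normal((p, alpha))
V = rng.standard_normal((q, alpha))

# Compression X0 @ Y0h of the Cauchy block G0 (here simply a truncated SVD).
G0 = 1.0 / (lam[:, None] - eta[None, :])
u, s, vh = np.linalg.svd(G0, full_matrices=False)
r0 = int(np.sum(s >= 1e-12 * s[0]))
X0, Y0h = u[:, :r0] * s[:r0], vh[:r0, :]

# Scale and concatenate the rows of X0 (and of Y0h) to compress
# G = [u_i^T v_j / (lam_i - eta_j)] with alpha = 2.
X = np.hstack([U[:, [k]] * X0 for k in range(alpha)])       # p x (alpha * r0)
Yh = np.vstack([V[:, [k]].T * Y0h for k in range(alpha)])   # (alpha * r0) x q

G = (U @ V.T) / (lam[:, None] - eta[None, :])
assert np.allclose(G, X @ Yh)
```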

For simplicity, we look at the example of partitioning C_0 into 4 × 4 blocks. Assume in the first cut that we can partition C_0 into 2 × 2 blocks with equal dimensions:

C_0 = [ 1/(ω^{2i} − ω^{2j+1}) ]_{i,j=1}^{N} = [ C_{0;1,1}  C_{0;1,2} ],
                                              [ C_{0;2,1}  C_{0;2,2} ]

where each block has dimension k = N/2; for instance, C_{0;1,2} = [ 1/(ω^{2i} − ω^{2j+1}) ]_{1≤i≤k, k+1≤j≤N}. Without loss of generality, we consider only the block upper triangular part. Assume that we have obtained a compression X_1 Y_1^H of C_{0;1,2}, so that

C_0 = [ C_{0;1,1}  X_1Y_1^H  ],
      [ C_{0;2,1}  C_{0;2,2} ]

where X_1 and Y_1 are tall and thin and can be computed with a τ-accurate SVD or rank-revealing QR factorization. Next, we split C_{0;1,1}, the (1,1) block of C_0, into 2 × 2 blocks and compress its off-diagonal block. Suppose the resulting off-diagonal block of C_{0;1,1} has size k_1 × (N/2 − l). We will compress it by shifting certain parts of X_1 and Y_1. That is, we partition X_1 and Y_1 conformally as

X_1 = [ X_{1,1} ],   Y_1 = [ Y_{1,1} ],
      [ X_{1,2} ]          [ Y_{1,2} ]

and also pick a k_1 × (N/2 − l) block from the lower left corner of C_{0;1,2} = X_1Y_1^H, that is, X_{1,2}Y_{1,1}^H:

C_0 = [ C_{0;1,1}  X_1Y_1^H  ],  with  C_{0;1,1} = [ ·   X_2Y_2^H ]  and  X_1Y_1^H = [ X_{1,1}Y_{1,1}^H  X_{1,1}Y_{1,2}^H ],
      [ C_{0;2,1}  C_{0;2,2} ]                     [ ·   ·        ]                  [ X_{1,2}Y_{1,1}^H  X_{1,2}Y_{1,2}^H ]

Here X_2 Y_2^T denotes the unknown compression of the upper right off-diagonal block of C_{0;1,1}. At this point we do not need another factorization to get X_2 Y_2^T; instead we can derive the compression X_2 Y_2^T directly from X_{1,2} and Y_{1,1}. Clearly, the two blocks have entries of the form 1/(ω^{2i} − ω^{2j+1}) whose row and column index sets differ by a common shift s, and

    1/(ω^{2(i+s)} − ω^{2(j+s)+1}) = ω^{−2s} · 1/(ω^{2i} − ω^{2j+1}),

so that

    X_2 Y_2^T = ω^{2s} X_{1,2} Y_{1,1}^T.

That means we can get X_2 Y_2^T by shifting (and scaling) a subblock of X_1 Y_1^T. A similar situation holds for the splitting of the (2,2) block of C_0 after the first splitting. For the successive compressions in later splittings, a similar shifting can be used. For splittings with general block sizes the shifting is also similar. The shifting scheme indicates that, at different levels of the divide-and-conquer SSS construction, the compressions of large blocks and small blocks are related. This can be used to further save the compression cost.

5. Performance and numerical experiments. It is well known that the FFT transformation of the Toeplitz matrix T to a Cauchy-like matrix C, and the recovery of the solution of the Toeplitz system from the solution of the Cauchy-like system, are stable. In addition, the fast divide-and-conquer construction and the SSS system solver in section 3 are both practically stable, as we will see in our numerical results. Thus our overall algorithm is stable in practice. Here no extra steps are needed to stabilize the algorithm, as required in some other superfast methods [38].

After the precomputations discussed in subsections 4.2 and 4.3 are finished, the cost of each SSS construction is only O(Np^2), where p is the maximum of the off-diagonal ranks of the Cauchy matrix C_0. The SSS solver also costs O(Np^2). The total cost is thus no more than O(N log N) + O(Np^2) flops. The total storage requirement is O(Np). For the convenience of coding, we use an O(N^2 p)-cost precomputation routine as discussed in subsection 4.2. A preliminary implementation
of our superfast solver in Fortran 90 is available at jxia/work/toep. We did experiments on an Itanium 2 GHz SGI Altix 350 server with a 64-bit Linux operating system and Intel MKL BLAS. Our method (denoted by NEW) is compared to Gaussian elimination with partial pivoting (denoted by GEPP) via the LAPACK linear system solver ZGESV [3].

We consider real N × N Toeplitz matrices T whose entries are random and uniformly distributed in [0, 1]. The right-hand sides b are obtained by b = Tx, where the exact solution x has random entries uniformly distributed in [−1, 1]. Matrices of size N = 2^k · 100, k = 2, 3, ..., are considered. These matrices are moderately ill-conditioned; for example, for N = 2^k · 100 with k = 2, ..., 7, the one-norm condition numbers of the matrices increase from the order of about 10^3 to 10^6.

For the NEW solver two important parameters are involved: the SSS block size d (Figure 2.1) and the tolerance τ in the compressions of off-diagonal blocks. Here we use the τ-accurate QR factorization with τ as a relative tolerance (subsection 2.2). Figure 5.1 shows the execution time in seconds for GEPP and our NEW solver with different d and τ. For the NEW solver only the time for solving the Cauchy-like

Fig. 5.1. Computation time in seconds versus N = 2^k · 100. For the NEW solver, the time for the precomputation is excluded.

Fig. 5.2. Time in seconds for the precomputations in the NEW solver.

system (2.3) is reported, as the compressions of off-diagonal blocks can be done in the precomputation, as described in subsection 4.3, and the SSS construction time is essentially the precomputation time. The precomputation time is shown in Figure 5.2, although the precomputation only needs to be done once for a particular matrix size N.

We also consider the time scaling factor, that is, the factor by which the time is multiplied when the matrix size doubles; see Figure 5.3. We observe that the time scaling factors for the NEW solver are near 2; that is to say, the NEW solver is close to being a linear time solver. Figures 5.4 and 5.5 present the errors

    ε_1 = ‖x̂ − x‖_2 / ‖x‖_2   and   ε_2 = ‖T x̂ − b‖_2 / (‖T x̂‖_2 + ‖b‖_2),

respectively, where x̂ denotes the numerical solution.

A significant point of the new solver is that we can use a relatively large tolerance τ in the precomputation and solving, and then use iterative refinement to improve the accuracy. A relatively large τ leads to relatively small off-diagonal ranks and thus

Fig. 5.3. Time scaling factors.

Fig. 5.4. ε_1 = ‖x̂ − x‖_2 / ‖x‖_2 versus N = 2^k · 100.

Fig. 5.5. ε_2 = ‖T x̂ − b‖_2 / (‖T x̂‖_2 + ‖b‖_2) versus N = 2^k · 100.
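As an aside, the O(N log N) ingredients of the algorithm are FFT-based; for instance, the products T x̂ appearing in ε_2 can themselves be formed in O(N log N) time by embedding T in a circulant matrix, which the FFT diagonalizes. A self-contained Python sketch (not the paper's code; the function names are ours, and a simple radix-2 FFT is used for brevity, so lengths must be powers of two):

```python
import cmath

def fft(a, inv=False):
    # Recursive radix-2 FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2], inv), fft(a[1::2], inv)
    sgn = 1j if inv else -1j
    tw = [cmath.exp(sgn * 2 * cmath.pi * k / n) for k in range(n // 2)]
    return ([even[k] + tw[k] * odd[k] for k in range(n // 2)] +
            [even[k] - tw[k] * odd[k] for k in range(n // 2)])

def toeplitz_matvec(col, row, x):
    # y = T x, where T has first column `col` and first row `row` (col[0] == row[0]).
    # T is embedded into a 2n x 2n circulant, which the FFT diagonalizes.
    n = len(x)
    c = list(col) + [0.0] + list(row[:0:-1])          # first column of the circulant
    fc = fft([complex(t) for t in c])
    fx = fft([complex(t) for t in x] + [0j] * n)
    y = fft([fc[k] * fx[k] for k in range(2 * n)], inv=True)
    return [y[k] / (2 * n) for k in range(n)]         # first n entries give T x

# Check against the direct O(n^2) product.
n = 8
col = [float(k + 1) for k in range(n)]
row = [col[0]] + [1.0 / (k + 2) for k in range(n - 1)]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, 1.5]
direct = [sum((col[i - j] if i >= j else row[j - i]) * x[j] for j in range(n))
          for i in range(n)]
fast = toeplitz_matvec(col, row, x)
assert max(abs(fast[i] - direct[i]) for i in range(n)) < 1e-9
```

This is the standard circulant-embedding trick; it is also what makes the residual computation in iterative refinement cheap for large N.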

small p in the operation counts above. Iterative refinements are very cheap due to the facts that no precomputation is needed and the solver itself is very fast (Figure 5.1). For the larger tolerance τ and d = 50, the accuracy results after 2 to 8 steps of iterative refinement are displayed in Figure 5.6. In fact, the number of iterative refinement steps required to reach small ε_2 is listed in Table 5.1. Thus it is also clear that our NEW solver can also perform well as a preconditioner.

Fig. 5.6. ε_2 = ‖T x̂ − b‖_2 / (‖T x̂‖_2 + ‖b‖_2): initial values before iterative refinement, and after some steps of iterative refinement.

Fig. 5.7. Computation time of the NEW solver with N = 12800, τ = 10^−9, and different block sizes d = 2^j · 25.

Table 5.1. Number of iterative refinement steps required to reach small ε_2 for different matrix dimensions N = 2^k · 100.

The block size d also affects the performance. Figure 5.7 indicates that, for a
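The refinement loop itself is the classical one: solve once with the cheap low-accuracy factorization, then repeatedly solve for a correction from the true residual. A toy Python sketch (an illustration only, not the authors' solver; dense Gaussian elimination applied to a rounded copy of T stands in for the τ-accurate SSS solver built with a large tolerance):

```python
import random

def solve_ge(a, b):
    # Gaussian elimination with partial pivoting (on copies of a and b).
    n = len(b)
    a = [r[:] for r in a]
    x = b[:]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[piv] = a[piv], a[k]
        x[k], x[piv] = x[piv], x[k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
            x[r] -= f * x[k]
    for k in range(n - 1, -1, -1):
        x[k] = (x[k] - sum(a[k][c] * x[c] for c in range(k + 1, n))) / a[k][k]
    return x

random.seed(1)
n = 60
t = [random.random() for _ in range(2 * n - 1)]
T = [[t[i - j + n - 1] for j in range(n)] for i in range(n)]   # random Toeplitz
x_true = [random.uniform(-1.0, 1.0) for _ in range(n)]
b = [sum(T[i][j] * x_true[j] for j in range(n)) for i in range(n)]

# The low-accuracy inner solver: exact elimination on a *rounded* copy of T,
# mimicking a solver built from a compressed (large-tolerance) representation.
T_lo = [[round(v, 6) for v in row] for row in T]

x = solve_ge(T_lo, b)
for _ in range(5):                       # iterative refinement steps
    r = [b[i] - sum(T[i][j] * x[j] for j in range(n)) for i in range(n)]
    d = solve_ge(T_lo, r)
    x = [xi + di for xi, di in zip(x, d)]

res = max(abs(b[i] - sum(T[i][j] * x[j] for j in range(n))) for i in range(n))
assert res < 1e-8
```

Each step reuses the same cheap factorization; only the residual, which can be formed with fast Toeplitz matrix-vector products, involves the exact T. This is why a larger τ plus a few refinement steps can beat a small τ used directly.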

given N, if d = 2^j · 25 is too large or too small, then both the solver time and the precomputation time can be relatively large. In fact, by counting the operations in the algorithm it is possible to determine an optimal d; see [12] for the derivation of the optimal choice of d in the SSS solver. We can do similar derivations for both the SSS construction and the SSS solver. We just point out that when τ gets larger, the off-diagonal ranks decrease and a smaller d should be chosen. In the previous experiments we used two sets of parameters: τ = 10^−9 with d = 100, and a larger τ with d = 50.

Finally, it turns out that our current preliminary implementation of the solver is slower than the implementation of the algorithm in [38], likely due to our inefficient data structures and nonoptimized codes for SSS matrices. The complicated nature of SSS matrices needs more careful coding and memory management. An improved software implementation is under construction.

6. Conclusions and future work. In this paper we have presented a superfast and practically stable solver for Toeplitz systems of linear equations. A Toeplitz matrix is first transformed into a Cauchy-like matrix, which has a nice low-rank property, and then the Cauchy-like system is solved by a superfast solver. This superfast solver utilizes the low-rank property and makes use of an SSS representation of the Cauchy-like matrix. A fast construction procedure for SSS structures is presented. After a one-time precomputation the solver is very efficient, with O(N log N) + O(Np^2) complexity. The algorithm is also efficient in that only linear storage is required. In future work we hope to further reduce the precomputation cost and to finish a better-designed version of the current Fortran 90 codes, in both the computations and the coding.

Acknowledgments. Many thanks to the referees for their valuable comments, which have greatly improved the development and presentation of this paper. The authors are also grateful to the editors and
the editorial assistant for their patience and thoughtful suggestions.

REFERENCES

[1] G. S. Ammar and W. B. Gragg, Superfast solution of real positive definite Toeplitz systems, SIAM J. Matrix Anal. Appl., 9 (1988), pp. 61–76.
[2] G. S. Ammar and W. B. Gragg, Numerical experience with a superfast real Toeplitz solver, Linear Algebra Appl., 121 (1989), pp. 185–206.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, 2nd ed., SIAM, Philadelphia, PA, 1995.
[4] R. R. Bitmead and B. D. O. Anderson, Asymptotically fast solution of Toeplitz and related systems of linear equations, Linear Algebra Appl., 34 (1980), pp. 103–116.
[5] A. W. Bojanczyk, R. P. Brent, F. R. de Hoog, and D. R. Sweet, On the stability of the Bareiss and related Toeplitz factorization algorithms, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 40–57.
[6] R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Padé approximants, J. Algorithms, 1 (1980), pp. 259–295.
[7] J. R. Bunch, Stability of methods for solving Toeplitz systems of equations, SIAM J. Sci. Statist. Comput., 6 (1985), pp. 349–364.
[8] J. Carrier, L. Greengard, and V. Rokhlin, A fast adaptive multipole algorithm for particle simulations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 669–686.
[9] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for general Toeplitz systems, IEEE Trans. Signal Process., 40 (1992).
[10] T. F. Chan and P. C. Hansen, A look-ahead Levinson algorithm for indefinite Toeplitz systems, SIAM J. Matrix Anal. Appl., 13 (1992).
[11] S. Chandrasekaran and M. Gu, Fast and stable algorithms for banded plus semiseparable systems of linear equations, SIAM J. Matrix Anal. Appl., 25 (2003), pp. 373–384.

[12] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and D. White, Fast Stable Solvers for Sequentially Semi-Separable Linear Systems of Equations and Least Squares Problems, Technical report, University of California, Berkeley, CA, 2003.
[13] S. Chandrasekaran, P. Dewilde, M. Gu, T. Pals, X. Sun, A.-J. van der Veen, and D. White, Some fast algorithms for sequentially semiseparable representations, SIAM J. Matrix Anal. Appl., 27 (2005), pp. 341–364.
[14] S. Chandrasekaran, M. Gu, and W. Lyons, A Fast and Stable Adaptive Solver for Hierarchically Semi-Separable Representations, Technical report, University of California, Santa Barbara, CA, 2004.
[15] G. Cybenko, The numerical stability of the Levinson–Durbin algorithm for Toeplitz systems of equations, SIAM J. Sci. Statist. Comput., 1 (1980).
[16] G. Cybenko, Error Analysis of Some Signal Processing Algorithms, Ph.D. thesis, Princeton University, Princeton, NJ, 1978.
[17] F. R. de Hoog, On the solution of Toeplitz systems, Linear Algebra Appl., 88/89 (1987), pp. 123–138.
[18] P. Dewilde and A. van der Veen, Time-Varying Systems and Computations, Kluwer Academic Publishers, Boston, MA, 1998.
[19] M. Fiedler, Hankel and Loewner matrices, Linear Algebra Appl., 58 (1984), pp. 75–95.
[20] K. A. Gallivan, S. Thirumalai, P. Van Dooren, and V. Vermaut, High performance algorithms for Toeplitz and block Toeplitz matrices, Linear Algebra Appl., 241–243 (1996).
[21] L. Gemignani, Schur complements of Bezoutians and the inversion of block Hankel and block Toeplitz matrices, Linear Algebra Appl.
[22] I. Gohberg, T. Kailath, and V. Olshevsky, Fast Gaussian elimination with partial pivoting for matrices with displacement structure, Math. Comp., 64 (1995), pp. 1557–1576.
[23] I. Gohberg and V. Olshevsky, Fast state space algorithms for matrix Nehari and Nehari–Takagi interpolation problems, Integral Equations Operator Theory, 20 (1994), pp. 44–83.
[24] I. Gohberg and V. Olshevsky, Complexity of multiplication with vectors for structured matrices, Linear Algebra Appl., 202 (1994).
[25] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys., 73 (1987), pp. 325–348.
[26] M. Gu, Stable and efficient algorithms for structured systems of linear equations, SIAM J. Matrix Anal. Appl., 19 (1998), pp. 279–306.
[27] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput., 17 (1996), pp. 848–869.
[28] G. Heinig, Inversion of generalized Cauchy matrices and other classes of structured matrices, in Linear Algebra for Signal Processing, IMA Vol. Math. Appl. 69, Springer, New York, 1995, pp. 63–81.
[29] G. Heinig and K. Rost, Algebraic Methods for Toeplitz-like Matrices and Operators, Oper. Theory Adv. Appl. 13, Birkhäuser Verlag, Basel, 1984.
[30] T. Kailath, Fredholm resolvents, Wiener–Hopf equations, and Riccati differential equations, IEEE Trans. Inform. Theory, 15 (1969).
[31] T. Kailath, S. Kung, and M. Morf, Displacement ranks of matrices and linear equations, J. Math. Anal. Appl., 68 (1979), pp. 395–407.
[32] T. Kailath and A. Sayed, eds., Fast Reliable Algorithms for Matrices with Structure, SIAM, Philadelphia, 1999.
[33] P. G. Martinsson, V. Rokhlin, and M. Tygert, A fast algorithm for the inversion of general Toeplitz matrices, Comput. Math. Appl., 50 (2005).
[34] M. Morf, Fast Algorithms for Multivariable Systems, Ph.D. thesis, Department of Electrical Engineering, Stanford University, Stanford, CA, 1974.
[35] B. R. Musicus, Levinson and Fast Choleski Algorithms for Toeplitz and Almost Toeplitz Matrices, Technical report, Res. Lab. of Electronics, MIT, Cambridge, MA, 1988.
[36] V. Pan, On computations with dense structured matrices, Math. Comp., 55 (1990), pp. 179–190.
[37] D. R. Sweet, The use of pivoting to improve the numerical performance of algorithms for Toeplitz matrices, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 468–493.
[38] M. Van Barel, G. Heinig, and P. Kravanja, A stabilized superfast solver for nonsymmetric Toeplitz systems, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 494–510.
[39] M. Van Barel and P. Kravanja, A stabilized superfast solver for indefinite Hankel systems, Linear Algebra Appl., 284 (1998), pp. 335–355.


More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

Intel Math Kernel Library (Intel MKL) LAPACK

Intel Math Kernel Library (Intel MKL) LAPACK Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue

More information

Spectrum-Revealing Matrix Factorizations Theory and Algorithms

Spectrum-Revealing Matrix Factorizations Theory and Algorithms Spectrum-Revealing Matrix Factorizations Theory and Algorithms Ming Gu Department of Mathematics University of California, Berkeley April 5, 2016 Joint work with D. Anderson, J. Deursch, C. Melgaard, J.

More information

The restarted QR-algorithm for eigenvalue computation of structured matrices

The restarted QR-algorithm for eigenvalue computation of structured matrices Journal of Computational and Applied Mathematics 149 (2002) 415 422 www.elsevier.com/locate/cam The restarted QR-algorithm for eigenvalue computation of structured matrices Daniela Calvetti a; 1, Sun-Mi

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

ETNA Kent State University

ETNA Kent State University C 8 Electronic Transactions on Numerical Analysis. Volume 17, pp. 76-2, 2004. Copyright 2004,. ISSN 1068-613. etnamcs.kent.edu STRONG RANK REVEALING CHOLESKY FACTORIZATION M. GU AND L. MIRANIAN Abstract.

More information

Key words. conjugate gradients, normwise backward error, incremental norm estimation.

Key words. conjugate gradients, normwise backward error, incremental norm estimation. Proceedings of ALGORITMY 2016 pp. 323 332 ON ERROR ESTIMATION IN THE CONJUGATE GRADIENT METHOD: NORMWISE BACKWARD ERROR PETR TICHÝ Abstract. Using an idea of Duff and Vömel [BIT, 42 (2002), pp. 300 322

More information

Randomized algorithms for the low-rank approximation of matrices

Randomized algorithms for the low-rank approximation of matrices Randomized algorithms for the low-rank approximation of matrices Yale Dept. of Computer Science Technical Report 1388 Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Direct Methods Philippe B. Laval KSU Fall 2017 Philippe B. Laval (KSU) Linear Systems: Direct Solution Methods Fall 2017 1 / 14 Introduction The solution of linear systems is one

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

On the application of different numerical methods to obtain null-spaces of polynomial matrices. Part 1: block Toeplitz algorithms.

On the application of different numerical methods to obtain null-spaces of polynomial matrices. Part 1: block Toeplitz algorithms. On the application of different numerical methods to obtain null-spaces of polynomial matrices. Part 1: block Toeplitz algorithms. J.C. Zúñiga and D. Henrion Abstract Four different algorithms are designed

More information

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation.

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation. 1 A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation João Carvalho, DMPA, Universidade Federal do RS, Brasil Karabi Datta, Dep MSc, Northern Illinois University, DeKalb, IL

More information

A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem

A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem Bor Plestenjak Department of Mathematics University of Ljubljana Slovenia Ellen Van Camp and Marc Van

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

Generalized interval arithmetic on compact matrix Lie groups

Generalized interval arithmetic on compact matrix Lie groups myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University

More information

Fast Transforms: Banded Matrices with Banded Inverses

Fast Transforms: Banded Matrices with Banded Inverses Fast Transforms: Banded Matrices with Banded Inverses 1. Introduction Gilbert Strang, MIT An invertible transform y = Ax expresses the vector x in a new basis. The inverse transform x = A 1 y reconstructs

More information

0 Introduction Let the matrix A C nn be given by all its n entries The problem is to compute the product Ab of a matrix A by input vector b C n Using

0 Introduction Let the matrix A C nn be given by all its n entries The problem is to compute the product Ab of a matrix A by input vector b C n Using Fast algorithms with preprocessing for matrix{vector multiplication problems IGohberg and VOlshevsky School of Mathematical Sciences Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv University,

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix EE507 - Computational Techniques for EE 7. LU factorization Jitkomut Songsiri factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization

More information

Efficient and Accurate Rectangular Window Subspace Tracking

Efficient and Accurate Rectangular Window Subspace Tracking Efficient and Accurate Rectangular Window Subspace Tracking Timothy M. Toolan and Donald W. Tufts Dept. of Electrical Engineering, University of Rhode Island, Kingston, RI 88 USA toolan@ele.uri.edu, tufts@ele.uri.edu

More information

Effective matrix-free preconditioning for the augmented immersed interface method

Effective matrix-free preconditioning for the augmented immersed interface method Effective matrix-free preconditioning for the augmented immersed interface method Jianlin Xia a, Zhilin Li b, Xin Ye a a Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA. E-mail:

More information

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x

I-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

On the stability of the Bareiss and related Toeplitz factorization algorithms

On the stability of the Bareiss and related Toeplitz factorization algorithms On the stability of the Bareiss and related oeplitz factorization algorithms A. W. Bojanczyk School of Electrical Engineering Cornell University Ithaca, NY 14853-5401, USA adamb@toeplitz.ee.cornell.edu

More information

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE Sparse factorization using low rank submatrices Cleve Ashcraft LSTC cleve@lstc.com 21 MUMPS User Group Meeting April 15-16, 21 Toulouse, FRANCE ftp.lstc.com:outgoing/cleve/mumps1 Ashcraft.pdf 1 LSTC Livermore

More information

ECE133A Applied Numerical Computing Additional Lecture Notes

ECE133A Applied Numerical Computing Additional Lecture Notes Winter Quarter 2018 ECE133A Applied Numerical Computing Additional Lecture Notes L. Vandenberghe ii Contents 1 LU factorization 1 1.1 Definition................................. 1 1.2 Nonsingular sets

More information

A note on eigenvalue computation for a tridiagonal matrix with real eigenvalues Akiko Fukuda

A note on eigenvalue computation for a tridiagonal matrix with real eigenvalues Akiko Fukuda Journal of Math-for-Industry Vol 3 (20A-4) pp 47 52 A note on eigenvalue computation for a tridiagonal matrix with real eigenvalues Aio Fuuda Received on October 6 200 / Revised on February 7 20 Abstract

More information

A sparse multifrontal solver using hierarchically semi-separable frontal matrices

A sparse multifrontal solver using hierarchically semi-separable frontal matrices A sparse multifrontal solver using hierarchically semi-separable frontal matrices Pieter Ghysels Lawrence Berkeley National Laboratory Joint work with: Xiaoye S. Li (LBNL), Artem Napov (ULB), François-Henry

More information

Numerical solution of the eigenvalue problem for Hermitian Toeplitz-like matrices

Numerical solution of the eigenvalue problem for Hermitian Toeplitz-like matrices TR-CS-9-1 Numerical solution of the eigenvalue problem for Hermitian Toeplitz-like matrices Michael K Ng and William F Trench July 199 Joint Computer Science Technical Report Series Department of Computer

More information

Eigen-solving via reduction to DPR1 matrices

Eigen-solving via reduction to DPR1 matrices Computers and Mathematics with Applications 56 (2008) 166 171 www.elsevier.com/locate/camwa Eigen-solving via reduction to DPR1 matrices V.Y. Pan a,, B. Murphy a, R.E. Rosholt a, Y. Tang b, X. Wang c,

More information

Katholieke Universiteit Leuven Department of Computer Science

Katholieke Universiteit Leuven Department of Computer Science Structures preserved by matrix inversion Steven Delvaux Marc Van Barel Report TW 414, December 24 Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan 2A B-31 Heverlee (Belgium)

More information

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.) page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as

More information

Using semiseparable matrices to compute the SVD of a general matrix product/quotient

Using semiseparable matrices to compute the SVD of a general matrix product/quotient Using semiseparable matrices to compute the SVD of a general matrix product/quotient Marc Van Barel Yvette Vanberghen Paul Van Dooren Report TW 508, November 007 n Katholieke Universiteit Leuven Department

More information

Module 6.6: nag nsym gen eig Nonsymmetric Generalized Eigenvalue Problems. Contents

Module 6.6: nag nsym gen eig Nonsymmetric Generalized Eigenvalue Problems. Contents Eigenvalue and Least-squares Problems Module Contents Module 6.6: nag nsym gen eig Nonsymmetric Generalized Eigenvalue Problems nag nsym gen eig provides procedures for solving nonsymmetric generalized

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

1. Introduction. Let µ(t) be a distribution function with infinitely many points of increase in the interval [ π, π] and let

1. Introduction. Let µ(t) be a distribution function with infinitely many points of increase in the interval [ π, π] and let SENSITIVITY ANALYSIS OR SZEGŐ POLYNOMIALS SUN-MI KIM AND LOTHAR REICHEL Abstract Szegő polynomials are orthogonal with respect to an inner product on the unit circle Numerical methods for weighted least-squares

More information

Homework 2 Foundations of Computational Math 2 Spring 2019

Homework 2 Foundations of Computational Math 2 Spring 2019 Homework 2 Foundations of Computational Math 2 Spring 2019 Problem 2.1 (2.1.a) Suppose (v 1,λ 1 )and(v 2,λ 2 ) are eigenpairs for a matrix A C n n. Show that if λ 1 λ 2 then v 1 and v 2 are linearly independent.

More information

Displacement Structure in Computing Approximate GCD of Univariate Polynomials

Displacement Structure in Computing Approximate GCD of Univariate Polynomials MM Research Preprints, 230 238 MMRC, AMSS, Academia, Sinica, Beijing No 21, December 2002 Displacement Structure in Computing Approximate GCD of Univariate Polynomials Lihong Zhi 1) Abstract We propose

More information

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS ON GENERAL MESHES

FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS ON GENERAL MESHES Proceedings of the Project Review, Geo-Mathematical Imaging Group Purdue University, West Lafayette IN, Vol. 1 2012 pp. 123-132. FAST STRUCTURED EIGENSOLVER FOR DISCRETIZED PARTIAL DIFFERENTIAL OPERATORS

More information

Lecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University

Lecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University Lecture 13 Stability of LU Factorization; Cholesky Factorization Songting Luo Department of Mathematics Iowa State University MATH 562 Numerical Analysis II ongting Luo ( Department of Mathematics Iowa

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Algorithms Notes for 2016-10-31 There are several flavors of symmetric eigenvalue solvers for which there is no equivalent (stable) nonsymmetric solver. We discuss four algorithmic ideas: the workhorse

More information