Contributions to Parallel Algorithms for Sylvester-type Matrix Equations and Periodic Eigenvalue Reordering in Cyclic Matrix Products


Contributions to Parallel Algorithms for Sylvester-type Matrix Equations and Periodic Eigenvalue Reordering in Cyclic Matrix Products

Robert Granat

Licentiate Thesis, May 2005
UMINF
Department of Computing Science
Umeå University
SE Umeå, Sweden

Print & Media, Umeå Universitet
UMINF
ISSN
ISBN X

Abstract

This Licentiate Thesis contains contributions in two different subfields of Computing Science: parallel ScaLAPACK-style algorithms for Sylvester-type matrix equations and periodic eigenvalue reordering in a cyclic product of matrices.

Sylvester-type matrix equations, like the continuous-time Sylvester equation AX − XB = C, where A of size m × m, B of size n × n and C of size m × n are general matrices with real entries, have applications in many areas. For example, the continuous-time Sylvester equation shows up in eigenvalue problems, in condition estimation of eigenvalue problems, e.g., sensitivity analysis of invariant subspaces corresponding to a specified spectrum, and in control and systems theory. This thesis contributes to the area of parallel ScaLAPACK-style library software for solving Sylvester-type matrix equations. The algorithms and library software presented are based on the well-known Bartels–Stewart method and extend earlier work on triangular Sylvester-type matrix equations to general Sylvester matrix equations. The developed methods will serve as the foundation for a future parallel software library for solving 42 sign and transpose variants of eight common Sylvester-type matrix equations.

Many real-world phenomena behave periodically, e.g., helicopter rotors and revolving satellites, and can be described in terms of periodic eigenvalue problems. Typically, eigenvalues and invariant subspaces (eigenvectors) of certain periodic matrix products are of interest and have direct physical interpretations. The eigenvalues of a cyclic matrix product can be computed via the periodic Schur decomposition. Our contribution in this area is a direct method for periodic eigenvalue reordering in the periodic real Schur form, which extends earlier work on the standard and the generalized eigenvalue problems.
Periodic eigenvalue reordering is vital in the computation of periodic eigenspaces corresponding to specified spectra and is utilized, e.g., in recently proposed methods for solving periodic differential matrix equations arising in the analysis of the observability/controllability of linear continuous-time periodic systems, and for solving discrete-time periodic Riccati equations arising in linear quadratic (LQ) optimal control problems. The proposed direct reordering method relies on orthogonal transformations only, i.e., it is backward stable, and can be generalized to more general periodic matrix products arising in generalizations of the periodic Schur form.


Preface

This licentiate thesis consists of the following four papers and an introduction including a summary of the papers.

I. Robert Granat, Bo Kågström and Peter Poromaa, Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations. In H. Kosch et al. (Eds), Euro-Par 2003 Parallel Processing, Lecture Notes in Computer Science, Springer Verlag, Vol. 2790, pp. .

II. Robert Granat and Bo Kågström, Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods. To appear in PARA'04 State-of-the-Art in Scientific Computing Conference Proceedings, Lecture Notes in Computer Science, Springer Verlag.

III. Robert Granat, Isak Jonsson and Bo Kågström, Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations on Distributed Memory Platforms. In M. Danelutto, D. Laforenza, M. Vanneschi (Eds), Euro-Par, Lecture Notes in Computer Science, Springer Verlag, Vol. 3149, pp. .

IV. Robert Granat and Bo Kågström, Direct Eigenvalue Reordering in a Product of Matrices in Extended Periodic Real Schur Form. Report UMINF 05.05, submitted to SIAM Journal on Matrix Analysis and Applications, February .

The topic of Papers I-III is parallel ScaLAPACK-style algorithms for computing the solution of general Sylvester-type matrix equations. In Paper IV, a direct method for eigenvalue reordering in a cyclic product of matrices is developed and analyzed.


Acknowledgements

I wish to thank my supervisor Professor Bo Kågström, who is also co-author of all papers in this contribution. It has been great to work next to you and to take part of your deep knowledge and experience, and I look forward to the second part of my doctoral studies. Thank you!

Next, I want to send big thanks to Dr Isak ("the problem solver") Jonsson, my assistant supervisor, who is also co-author of one of the papers in this thesis. You have always been very kind and helpful. Thirdly, I wish to say thank you to Dr Peter Poromaa, who is co-author of the first paper. Thanks also to Dr Daniel Kressner and Dr Andras Varga for fruitful discussions on periodic eigenvalue problems and related topics.

Thanks to all friends and colleagues at the Department of Computing Science, especially the members of the Numerical Linear Algebra and Parallel and High-Performance Computing Groups. Thanks to the staff at HPC2N (High Performance Computing Center North) for providing a great computing environment and superior technical support.

Many thanks to Professor Jörgen Löfström, University of Gothenburg, whose teaching really woke up my interest in linear algebra. Also, thank you Dr Maya Neytcheva, Uppsala University, who introduced me to parallel computing. Success!

Eva, Elias and Anna, my family, you have made my life complete and I am proud to be husband and father in our family. Thank you very much! Also many thanks to my parents, my three brothers and my grandparents for all your support during the last years.

Finally, I want to express my deepest gratitude to my Lord and Savior Jesus Christ. You have been my Shepherd through life for so many years and never failed me. Thank you for all Your grace and mercy. Soli Deo Gloria!
Financial support has been provided jointly by the Faculty of Science and Technology, Umeå University, by the Swedish Research Council under grant VR and by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.

Umeå, May 2005
Robert Granat


Contents

1 Introduction
  1.1 Motivation for this work
  1.2 Matrix computations in CACSD
  1.3 High performance linear algebra software libraries
  1.4 Sylvester-type matrix equations
  1.5 Periodic eigenvalue problems
2 Contributions in this thesis
  2.1 Paper I
  2.2 Paper II
  2.3 Paper III
  2.4 Paper IV
3 Ongoing and future work
  3.1 Sylvester-type matrix equations
  3.2 Periodic eigenvalue problems
Paper I
Paper II
Paper III
Paper IV


Chapter 1

Introduction

This chapter motivates and introduces the work presented in this thesis and gives some background to the topics considered.

1.1 Motivation for this work

The growing demand for high-quality and high-performance library software is driven by the increasing need of industry, research facilities and communities to solve larger and more complex problems faster than ever before. Often the problems considered are so complex that they cannot be solved by ordinary desktop computers in a reasonable amount of time. Scientists and engineers are increasingly forced to utilize high-end computational devices, like specialized high performance computational units from vendor-constructed shared or distributed memory parallel computers, or self-made so-called cluster-based systems consisting of high-end commodity PC processors connected with high-speed networks. High performance clusters are getting more and more common due to their good scalability properties and high cost-effectiveness, i.e., lower cost per performance unit compared to the more old-fashioned supercomputer systems. It is also common among institutions with limited budgets to build cheaper and simpler clusters from ordinary PC workstations connected via simpler (and slower) networks. In fact, any local area network (LAN) connecting a number of workstations can be considered a cluster-based system. However, the latter systems are mainly used for high-throughput computing applications, while clusters with high-speed networks are used for challenging parallel computations.

To solve a problem in a reliable and efficient way, many out-of-application considerations must be made regarding solution method, discretization, data distribution and granularity, expected and achieved accuracy of computed results, and how to utilize the available computer power in the best way: should a parallel computer be used or not, does the algorithm match the memory hierarchy of the target computer system, etc. Typically, an appropriate and efficient usage of high performance computing (HPC) systems, like parallel computers, calls for non-trivial reformulations of the problem settings and algorithms. Therefore, a lot of time and effort can be saved by utilizing extensively tested high-quality software libraries as basic building blocks in the research. By this procedure, much more attention can be focused on the applications and related theory.

Most problems in the real world are non-linear, i.e., the output from a phenomenon or a process does not depend linearly on the input. Moreover, most problems are also continuous, i.e., the output depends continuously on the input. Roughly speaking, the graph of a continuous function can be drawn without ever lifting the pen from the paper. Since very few real-world problems can be solved analytically, that is, by finding an exact mathematical expression that fully describes the relation between the input and the output of the process or phenomenon, scientists are often forced to linearize and discretize their problems to make them solvable in a finite amount of time using a finite amount of computational resources (such as computing devices, data storage, network bandwidth, etc.). This means that the computed solution will always be a more or less valid approximation. The good thing is that, through linearization and discretization, many problems can be solved effectively by standard linear algebra methods.

In numerical linear algebra, systems of linear equations, eigenvalue problems and related solution methods, i.e., matrix computations, are studied. The focus is on reliable and efficient algorithms for large-scale matrix computational problems from the point of view of using finite-precision arithmetic. Developments in the area of numerical linear algebra often result in widely available public domain linear algebra software libraries, like LAPACK and ScaLAPACK (see Section 1.3).

1.2 Matrix computations in CACSD

Matrix computations are fundamental to many areas of science and engineering and occur frequently in a variety of applications, for example in Computer-Aided Control System Design (CACSD). In CACSD various linear control systems are considered, like the following linear continuous-time descriptor system

    E ẋ(t) = A x(t) + B u(t)
    y(t) = C x(t) + D u(t)                  (1.1)

or a similar discrete-time system of the form

    E x_{k+1} = A x_k + B u_k
    y_k = C x_k + D u_k,                    (1.2)

where x(t), x_k ∈ R^n are state vectors, u(t), u_k ∈ R^m are the vectors of inputs (or controls) and y(t), y_k ∈ R^r are the vectors of outputs. The systems are described by the state matrix pair (A, E) ∈ R^{n×n} × R^{n×n}, the input matrix B ∈ R^{n×m}, the output matrix C ∈ R^{r×n} and the feed-forward matrix D ∈ R^{r×m}. The matrix E is possibly singular. With E = I, where I is the identity matrix of order n, standard state-space systems are considered. Other subsystems, described by the tuples (E, A, B) and (E, A, C), are studied when investigating the controllability and observability characteristics of a system (see, e.g., [24, 38]).

Applications with periodic behavior, e.g., rotating helicopter blades and revolving satellites, can be described by discrete-time periodic descriptor systems of the form

    E_k x_{k+1} = A_k x_k + B_k u_k
    y_k = C_k x_k + D_k u_k,                (1.3)

where the matrices A_k, E_k ∈ R^{n×n}, B_k ∈ R^{n×m}, C_k ∈ R^{r×n} and D_k ∈ R^{r×m} are periodic with periodicity K ≥ 1. For example, this means that A_K = A_0, B_K = B_0, etc.

Important problems studied in CACSD include state-space realization, minimal realization, linear-quadratic (LQ) optimal control, pole assignment, distance to controllability and observability considerations, etc. For details see, e.g., [57]. The systems (1.1)-(1.3) can be studied by various matrix computational approaches, e.g., by solving related eigenvalue problems. In this area, improved algorithms and software are developed for computing and investigating different subspaces, e.g., condition estimation of invariant or deflating subspaces [46, 47], for solving various important matrix equations, like (periodic) Sylvester-type and (periodic) Riccati matrix equations, and for computing canonical structure information [38]. One common step in computing such structure information is the need to separate the stable and unstable eigenvalues by an eigenvalue reordering technique (see, e.g., [62, 35, 66, 65]).
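A discrete-time periodic system of the form (1.3) with E_k = I can be simulated in a few lines. The matrices below are made-up random data with period K = 3, purely to illustrate the cyclic indexing A_{k+K} = A_k; they do not come from any application in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, m, r = 3, 4, 2, 2  # period, states, inputs, outputs (illustrative sizes)
A = [0.5 * rng.standard_normal((n, n)) for _ in range(K)]
B = [rng.standard_normal((n, m)) for _ in range(K)]
C = [rng.standard_normal((r, n)) for _ in range(K)]
D = [rng.standard_normal((r, m)) for _ in range(K)]

x = np.zeros(n)
ys = []
for k in range(12):              # periodicity: index the matrices mod K
    u = np.ones(m)               # constant input, for illustration
    ys.append(C[k % K] @ x + D[k % K] @ u)
    x = A[k % K] @ x + B[k % K] @ u
```

Running the loop produces the output sequence y_0, ..., y_11 under the K-periodic system matrices.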
(Footnote: The definition of stable and unstable eigenvalues depends on the system considered. However, the common definitions of a stable eigenvalue λ for discrete-time and continuous-time systems are |λ| < 1 and Re(λ) < 0, respectively.)

1.3 High performance linear algebra software libraries

It is important that scientists and engineers are able to solve their discretized and linearized problems without having to rewrite all the necessary software from scratch. Fortunately, as discussed above, many problems can be formulated in terms of common matrix operations. These ideas have been driving forces behind serial library software packages such as BLAS [13], LAPACK

[51, 3], SLICOT [27, 61] and their parallel counterparts ScaLAPACK [11] and PSLICOT [56].

BLAS (Basic Linear Algebra Subprograms) is structured in three levels. Level 1 BLAS is concerned with vector-vector operations, e.g., scalar products, rotations, etc., and was developed during the seventies. Level 2 BLAS performs matrix-vector operations and was originally motivated by the increasing number of vector machines during the eighties. Level 3 BLAS concerns matrix-matrix operations, such as the well-known GEMM (GEneral Matrix Multiply and add) operation

    C ← αC + βAB,

where α and β are scalars, A is an m × k matrix, B is a k × n matrix and C is an m × n matrix. In general, the level 3 BLAS performs O(n^3) arithmetic operations while moving O(n^2) data elements through the memory hierarchy of the computer. If the level 3 BLAS is properly tuned for the cache memory hierarchy of the target computer system, and the computations in the actual program are organized into level 3 operations, the execution may run at close to peak performance. In fact, the whole level 3 BLAS may be organized in GEMM operations [44, 45], which means that the performance will depend only on how highly tuned the GEMM operation is. Computer vendors often supply their own high performance implementation of the BLAS, optimized for their specific architecture. Automatically tuned libraries also exist, see, e.g., ATLAS [4]. See also the GOTO-BLAS [31], which makes use of data streaming to efficiently utilize the memory hierarchy of the target computer.

LAPACK (Linear Algebra PACKage) is a combination of the libraries LINPACK and EISPACK and performs all kinds of linear algebra computations, from solving linear systems of equations to calculating all eigenvalues of a general matrix. The computations in LAPACK are organized to perform as much as possible in level 3 operations for optimal performance.
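The GEMM operation can be exercised directly through, e.g., SciPy's low-level BLAS wrappers. Note that the reference BLAS names the scalars the other way around, writing the operation as C ← αAB + βC; the sketch below follows that convention.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(0)
m, k, n = 6, 5, 4
# Fortran (column-major) order avoids internal copies in the wrapper.
A = np.asfortranarray(rng.standard_normal((m, k)))
B = np.asfortranarray(rng.standard_normal((k, n)))
C = np.asfortranarray(rng.standard_normal((m, n)))

# One level 3 BLAS call: result = 2.0*A@B + 0.5*C.
result = blas.dgemm(2.0, A, B, beta=0.5, c=C)
```

A tuned GEMM performs this single call with O(mkn) flops on O(mk + kn + mn) data, which is the source of the level 3 performance advantage discussed above.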
The LAPACK project has been extremely successful [22] and now forms the underlying computational layer of the interactive MATLAB [53] environment, which is perhaps the most popular tool for solving computational problems in science and engineering and for educational purposes.

ScaLAPACK (Scalable LAPACK) implements a subset of the algorithms in LAPACK for distributed memory environments. Basic building blocks are two-dimensional (2D) block cyclic data distribution (see, e.g., [32]) over a logical rectangular processor mesh, in combination with a Fortran 77 object-oriented approach for handling the involved global distributed matrices. In connection with ScaLAPACK, a parallel version of BLAS exists, the PBLAS (Parallel BLAS) [58]. Explicit communication in ScaLAPACK is performed using the communication library BLACS (Basic Linear Algebra Communication Subprograms) [12], which provides processor mesh setup routines and basic point-to-point, collective and reduction communication routines. BLACS is usually implemented

using MPI (Message Passing Interface) [55].

SLICOT (Subroutine Library in Systems and Control Theory) provides Fortran 77 implementations of numerical algorithms for computations in systems and control applications. Based on numerical linear algebra routines from the BLAS and LAPACK libraries, SLICOT provides methods for the design and analysis of control systems. Similarly to LAPACK and ScaLAPACK, a parallel version of SLICOT, called PSLICOT, is under development. The goal is to include all functionality of SLICOT in a parallel version. PSLICOT also builds on the existing functionality of ScaLAPACK, PBLAS and BLACS. We remark that both LAPACK and ScaLAPACK are currently under revision for new releases [22].

1.4 Sylvester-type matrix equations

Matrix equations have been in the focus of the numerical community for quite some time. Applications include eigenvalue problems and condition estimation of eigenvalue problems (e.g., see [46, 36, 59]) and various control problems (e.g., see [24]). Already in 1972, R. H. Bartels and G. W. Stewart published the paper Algorithm 432: Solution of the Matrix Equation AX + XB = C [6], which outlines what came to be called the Bartels–Stewart method for solving the continuous-time Sylvester equation (SYCT)

    AX + XB = C,                            (1.4)

where A of size m × m, B of size n × n and C of size m × n are arbitrary matrices with real entries. Equation (1.4) has a unique solution if and only if A and −B have no eigenvalues in common. The solution method in [6] follows a general idea from mathematics of problem solving via reformulations and coordinate transformations: first transform the problem to a form where it is (more easily) solvable, then solve the transformed problem, and finally transform the solution back to the original coordinate system.
Examples include computing derivatives by a spectral method, using the forward and backward Fourier transforms as transformation method [25], and computing explicit inverses of general square matrices, using LU factorization and matrix multiplication as transformation method [37]. The Bartels–Stewart method for Equation (1.4) is as follows:

1. Transform the matrix A and the matrix B to real Schur form.
2. Update the matrix C with respect to the two Schur decompositions.
3. Solve the resulting reduced triangular matrix equation.
4. Transform the obtained solution back to the original coordinate system.
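The four steps of the Bartels–Stewart method can be illustrated with a small serial sketch in Python/SciPy. This is not the parallel ScaLAPACK-style implementation developed in the thesis; in particular, SciPy's `solve_sylvester` stands in for the triangular solve of step 3, and the shifted random test matrices are chosen only so that the spectra of A and −B are guaranteed to be disjoint.

```python
import numpy as np
from scipy.linalg import schur, solve_sylvester

rng = np.random.default_rng(2)
m, n = 5, 4
A = rng.standard_normal((m, m)) + 3 * np.eye(m)   # shifted so eigenvalue sums are nonzero
B = rng.standard_normal((n, n)) + 3 * np.eye(n)
C = rng.standard_normal((m, n))

# Step 1: real Schur forms A = Q_A T_A Q_A^T and B = Q_B T_B Q_B^T.
TA, QA = schur(A)
TB, QB = schur(B)
# Step 2: update the right-hand side, Ct = Q_A^T C Q_B.
Ct = QA.T @ C @ QB
# Step 3: solve the (quasi-)triangular equation T_A Y + Y T_B = Ct
#         (delegated here to LAPACK via solve_sylvester).
Y = solve_sylvester(TA, TB, Ct)
# Step 4: transform back to the original coordinates, X = Q_A Y Q_B^T.
X = QA @ Y @ QB.T
```

The computed X then satisfies AX + XB = C up to rounding errors.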

The first step, which is performed by reducing the matrices to Hessenberg form and applying the QR-algorithm to compute their real Schur forms, is also known to be the dominating part in terms of floating point operations [30] and execution time. By recent developments in obtaining close to level 3 performance in the bulge-chasing [16] and advanced deflation techniques for the QR-algorithm [17], this might change in the future.

The classic paper of Bartels and Stewart [6] has served as a foundation for later developments of direct solution methods for related problems, see, e.g., Hammarling's method [34] and the Hessenberg-Schur approach by Golub, Nash and Van Loan [29]. However, these methods were developed before matrix blocking became vital for handling the increasing performance gap between processors and memory modules. Level 3 BLAS LAPACK-style block algorithms for Sylvester-type matrix equations were developed in [46, 47]. In [8, 9, 10], fully iterative methods for solving matrix equations are considered. Those methods can be very fast and reliable, but they are limited to a certain range of problems and cannot be applied to all instances. Jonsson and Kågström presented fast recursive blocked algorithms for solving standard and generalized Sylvester-type matrix equations in [39, 40] and developed the software library RECSY [41]. Notice that their recursive blocking approach has a potential for automatic matching of the memory hierarchy and can be very effective in combination with a highly tuned level 3 BLAS. Recursive blocking can be applied to many problems in matrix computations [26]. ScaLAPACK-style algorithms for solving triangular standard and generalized coupled Sylvester equations were considered in [46, 59] and further developed for the standard Sylvester equation in [33]. In Chapter 3, we give some more examples of Sylvester-type matrix equations and related solution methods.
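The recursive blocking idea can be conveyed by a small sketch for the triangular SYCT case: split the largest dimension, solve the half-sized subproblems, and perform the updates as GEMM operations. This is an illustrative Python sketch of the general idea, assuming upper triangular A and B with real diagonals; it is not the RECSY algorithm itself, which also handles quasi-triangular matrices with 2 x 2 blocks. The function name `rtrsyct` is made up for this sketch.

```python
import numpy as np

def rtrsyct(A, B, C, blk=2):
    """Recursively solve the triangular Sylvester equation AX + XB = C,
    with A (m x m) and B (n x n) upper triangular with real diagonals.
    Illustrative sketch of recursive blocking, not the RECSY library code."""
    m, n = A.shape[0], B.shape[0]
    if m <= blk and n <= blk:
        # Small base case via the Kronecker-product formulation:
        # vec(AX + XB) = (I_n (x) A + B^T (x) I_m) vec(X).
        M = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
        x = np.linalg.solve(M, C.reshape(-1, order="F"))
        return x.reshape(m, n, order="F")
    if m >= n:                    # split A and the rows of X
        h = m // 2
        X2 = rtrsyct(A[h:, h:], B, C[h:], blk)
        # GEMM update of the remaining right-hand side rows.
        X1 = rtrsyct(A[:h, :h], B, C[:h] - A[:h, h:] @ X2, blk)
        return np.vstack([X1, X2])
    else:                         # split B and the columns of X
        h = n // 2
        X1 = rtrsyct(A, B[:h, :h], C[:, :h], blk)
        # GEMM update of the remaining right-hand side columns.
        X2 = rtrsyct(A, B[h:, h:], C[:, h:] - X1 @ B[:h, h:], blk)
        return np.hstack([X1, X2])

# Usage: triangular test matrices with positive diagonals, so that
# eigenvalue sums are nonzero and the equation is uniquely solvable.
rng = np.random.default_rng(3)
A = np.triu(rng.standard_normal((7, 7)), 1) + np.diag(rng.uniform(1.0, 2.0, 7))
B = np.triu(rng.standard_normal((5, 5)), 1) + np.diag(rng.uniform(1.0, 2.0, 5))
C = rng.standard_normal((7, 5))
X = rtrsyct(A, B, C)
```

At the leaves, the subproblems fit in cache; all remaining work consists of the two GEMM-type updates, which is what makes the approach combine well with a highly tuned level 3 BLAS.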
1.5 Periodic eigenvalue problems

Given a general matrix A ∈ R^{n×n}, the standard eigenvalue problem consists of finding n eigenvalue-eigenvector pairs (λ_i, x_i) ∈ C × C^n such that

    A x_i = λ_i x_i,  i = 1, ..., n         (1.5)

(see, e.g., [52]). Notice that Equation (1.5) only concerns right eigenvectors. Left eigenvectors are defined by y_i^T A = λ_i y_i^T [30], i.e., they are right eigenvectors of the transposed matrix A^T. The standard method for the general standard eigenvalue problem is the unsymmetric QR-algorithm (see, e.g., [28, 30, 50]), which is a backward stable algorithm belonging to a large family of bulge-chasing algorithms [68] that by

iteration reduces the matrix A to real Schur form via an orthogonal similarity transformation Q ∈ R^{n×n} such that

    Q^T A Q = T_A,                          (1.6)

where all eigenvalues of A appear as 1 × 1 and 2 × 2 blocks on the main diagonal of the quasi-triangular matrix T_A. The column vectors q_i, i = 1, 2, ..., n of Q are called the Schur vectors of the decomposition (1.6), where q_1 = x_1 is an eigenvector associated with the eigenvalue λ_1. More importantly, given k ≤ n such that no 2 × 2 block resides in T_A(k : k+1, k : k+1), the first k Schur vectors q_i, i = 1, 2, ..., k, form an orthonormal basis for an invariant subspace of A associated with the first k eigenvalues λ_1, λ_2, ..., λ_k.

In most practical applications, the information retrieved from the Schur decomposition (eigenvalues and invariant subspaces) is sufficient and the eigenvectors need not be computed explicitly. However, in case the matrix A is diagonalizable (see, e.g., [52]), the eigenvector x_i can be computed from the null space of A − λ_i I. The eigenvectors can also be computed by successively reordering each of the eigenvalues in the Schur form to the top-left corner of the matrix (see, e.g., [5, 23, 15]) and reading off the first Schur vector q_1. However, the latter approach is not utilized in practice for the standard eigenvalue problem, but the basic idea can be useful in other contexts (see below).

The periodic eigenvalue problem (see, e.g., [50, 68]) consists, in its simplest form, of computing eigenvalues and invariant subspaces of the matrix product

    A = A_{K−1} A_{K−2} ⋯ A_0,

where A_0, A_1, ..., A_{K−1}, with A_{K+i} = A_i, i = 0, 1, ..., is a K-cyclic matrix sequence. Such problems can arise from forming the monodromy matrix [67] of discrete-time periodic descriptor systems of the form (1.3) with E_k = I. For cost and accuracy reasons, it is necessary to work with the factors and not to form the product A explicitly [14, 68].
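For the standard (non-periodic) problem, the real Schur form (1.6) and an invariant subspace basis, with the eigenvalue reordering delegated to LAPACK's sorting facilities, can be illustrated with SciPy:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))

# Real Schur form Q^T A Q = T, with the eigenvalues in the left half
# plane (the "stable" ones in the continuous-time sense) reordered to
# the top-left corner of T; k is the number of selected eigenvalues.
T, Q, k = schur(A, sort='lhp')

# The first k Schur vectors span the invariant subspace associated
# with the selected eigenvalues: A Q(:, 1:k) = Q(:, 1:k) T(1:k, 1:k).
V = Q[:, :k]
```

This is exactly the kind of reordering that Paper IV generalizes to the periodic case, where no explicit product matrix is available to apply `schur` to.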
Furthermore, with E_k ≠ I, such an approach would require the explicit calculation of the K inverses E_k^{−1}, which may not even exist. In general, the eigenvalues of a K-cyclic matrix product are obtained by computing the periodic real Schur form (PRSF) [14, 35]

    Z_{k+1}^T A_k Z_k = T_k,  k = 0, 1, ..., K − 1,        (1.7)

where Z_0, Z_1, ..., Z_{K−1}, with Z_K = Z_0, are orthogonal, and the sequence T_k consists of K − 1 upper triangular matrices and one upper quasi-triangular matrix. This form is computed by the periodic QR-algorithm (or the periodic QZ-algorithm [14, 48] in case E_k ≠ I). The periodic QR-algorithm is essentially analogous to the standard QR-algorithm applied to a (block) cyclic matrix [49]. The placement of

the quasi-triangular matrix may be specified to fit the actual application. Sometimes, for example in pole assignment, the resulting PRSF should be ordered [64], i.e., the eigenvalues should be ordered in a specified way.

Each formal cyclic matrix product is associated with a matrix tuple Ā = (A_{K−1}, A_{K−2}, ..., A_1, A_0) [7]. The vector tuple ū = (x_{K−1}, x_{K−2}, ..., x_1, x_0), with x_k ≠ 0, is called a right eigenvector of the tuple Ā corresponding to the eigenvalue λ if there exist scalars α_k, possibly complex, such that the relations

    A_k x_k = α_k x_{k+1},  k = 0, 1, ..., K − 1,  λ := ∏_{k=0}^{K−1} α_k        (1.8)

hold with x_K = x_0. A left eigenvector ȳ of the tuple Ā corresponding to λ is defined similarly. In this context, a direct eigenvalue reordering method may be utilized to compute eigenvectors corresponding to each eigenvalue in the periodic Schur form, by reordering each eigenvalue to the top left corner of the periodic Schur form, similarly to the non-periodic case. More generally, the direct method can be used to compute periodic eigenspaces corresponding to a specified set of eigenvalues (see, e.g., [47]) of the matrix product.

The extended periodic real Schur form (EPRSF) [63] generalizes PRSF for handling square products where the involved matrices are rectangular. EPRSF can be computed by a slightly modified periodic QR-algorithm. The generalized periodic real Schur form (GPRSF) generalizes PRSF to periodic matrix pairs (E_k, A_k), with E_k possibly singular, and is typically computed by the periodic QZ-algorithm [48]. Both extensions are important in solving various periodic matrix equations, like Riccati, Lyapunov or Sylvester equations in their standard or generalized forms.
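The relations (1.8) can be checked numerically on a small made-up example. Note that the product is formed explicitly below purely for illustration, which is exactly what one avoids in practice, and that the scalars α_k are one admissible choice (they are not unique; only their product λ is).

```python
import numpy as np

rng = np.random.default_rng(5)
K, n = 3, 4
As = [rng.standard_normal((n, n)) for _ in range(K)]

# Eigenpair of the explicit product A_2 A_1 A_0 (formed here only for
# this illustration; real algorithms work with the factors).
P = As[2] @ As[1] @ As[0]
lam, V = np.linalg.eig(P)
lam0, x0 = lam[0], V[:, 0]

# Build a vector tuple satisfying (1.8): choose alpha_k = 1 for
# k < K-1 and alpha_{K-1} = lambda, so that x_K = x_0.
alphas = [1.0] * (K - 1) + [lam0]
xs = [x0]
for k in range(K - 1):
    xs.append(As[k] @ xs[k])
```

The final relation A_{K−1} x_{K−1} = α_{K−1} x_0 closes the cycle, and the product of the α_k recovers the eigenvalue of the matrix product.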

Chapter 2

Contributions in this thesis

This chapter gives a brief summary of the papers in this thesis.

2.1 Paper I

In this paper, the work by Kågström and Poromaa [46] and Poromaa [59] on block algorithms and parallel algorithms for triangular Sylvester matrix equations is extended. The algorithms presented are complete ScaLAPACK-style implementations of the Bartels–Stewart method (see Section 1.4) for the four transpose variants of the general Sylvester equation

    op(A)X − Xop(B) = C,                    (2.1)

where op(A) denotes the matrix A or its transpose A^T. One of the shortcomings of the previous algorithms was the lack of support for handling quasi-triangular matrices where some 2 × 2 blocks, corresponding to complex conjugate pairs of eigenvalues, were split in the explicit blocking of the algorithms. This problem was resolved by proposing an implicit redistribution of the matrices in the initial stage of step 3 of the Bartels–Stewart method (see Section 1.4). This paper also introduces the on-demand communication scheme for step 3, which complements the original matrix shifting scheme [59] for the transpose variants where matrix shifting does not work. Paper I is mainly a refined and compressed version of my Master's Thesis [33].

2.2 Paper II

In this contribution, a comparison between two ScaLAPACK-style implementations of two different methods for solving Equation (1.4) is presented. The

first method is the implementation of the Bartels–Stewart method presented in Paper I, and the second is a Newton-iteration-style matrix-sign-function-based method [8, 9, 10]. The comparison concerns generality of use, execution time and accuracy of computed solutions. The matrices A and B are constructed to obtain differently conditioned test problems. The conditioning of the problems is also measured by computing lower bound estimates of sep(A, B)^{−1} [46]. Experimental results from two differently balanced parallel platforms are presented, showing that the method from Paper I can be substantially faster on well-balanced parallel platforms, and can deliver far more accuracy for ill-conditioned problems.

2.3 Paper III

Paper III introduces ScaLAPACK-style hybrid algorithms for solving the triangular continuous-time Sylvester equation by combining the explicit matrix blocking approach adopted in LAPACK [3] and ScaLAPACK [11] with the recursive matrix blocking provided by the HPC library RECSY [39, 40, 41]. The hybrid algorithms are obtained by replacing the LAPACK standard routines with the solvers provided by RECSY as node subsystem solvers in step 3 of the Bartels–Stewart method. Even though an overwhelming part of the computational work in step 3 is performed outside the node subsystem solver, the proposed approach gives a substantial improvement of the execution time of the algorithm, mainly for two reasons:

- It allows for using larger blocks in the explicit blocking without ruining the performance because of a slow kernel solver, thus decreasing the number of communication steps in the algorithm.
- It decreases the synchronization time for some idle processors during a certain step in the parallel solver, where a subsolution is broadcast along two different scopes.

Experimental results are presented and evaluated for both matrix shifting and on-demand communication.

2.4 Paper IV

The final contribution concerns periodic eigenvalue problems.
The paper presents the derivation and the analysis of a direct method (see, e.g., [5, 42]) for eigenvalue reordering in a K-cyclic matrix product A_{K−1} A_{K−2} ⋯ A_1 A_0 of the sequence A_0, A_1, ..., A_{K−2}, A_{K−1}, with A_K = A_0, without evaluating the matrix product.

The method relies on orthogonal transformations only, and the proposed algorithm performs the reordering tentatively to guarantee backward stability. One important step in the method is the numerical solution of an associated triangular periodic Sylvester equation (PSE)

    A_k X_k − X_{k+1} B_k = C_k,  k = 0, 1, ..., K − 1,        (2.2)

where X_K = X_0. Several methods for solving Equation (2.2) are discussed, including Gaussian elimination with partial or complete pivoting (GEPP/GECP) and iterative refinement (see, e.g., [37]). An error analysis of the direct reordering method is presented, which reveals that the accuracy of the reordered eigenvalues is connected to the accuracy of the computed solution to the associated PSE. The theoretical results are also connected back to the standard case K = 1. Some experimental results are presented that illustrate the reliability and robustness of the direct reordering method for a selected number of problems, including well- and ill-conditioned artificial problems with short and long periods, and an application with a long period from satellite control.
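For small problems, the PSE (2.2) can be written down via its Kronecker-product formulation and solved by Gaussian elimination with partial pivoting (here via NumPy's LU-based solver). This is an illustrative sketch with a made-up helper name `solve_pse`; its O((Kmn)^3) cost means it is only sensible for the small subsystems, not a substitute for the blocked algorithms discussed in the paper.

```python
import numpy as np

def solve_pse(As, Bs, Cs):
    """Solve A_k X_k - X_{k+1} B_k = C_k, k = 0..K-1, with X_K = X_0,
    by assembling one big linear system over vec(X_0), ..., vec(X_{K-1})
    and applying GEPP. Illustrative sketch for small problems."""
    K = len(As)
    m, n = Cs[0].shape
    N = m * n
    M = np.zeros((K * N, K * N))
    rhs = np.concatenate([C.reshape(-1, order="F") for C in Cs])
    I_m, I_n = np.eye(m), np.eye(n)
    for k in range(K):
        # vec(A_k X_k) = (I_n (x) A_k) vec(X_k)
        M[k*N:(k+1)*N, k*N:(k+1)*N] = np.kron(I_n, As[k])
        # vec(X_{k+1} B_k) = (B_k^T (x) I_m) vec(X_{k+1}), cyclic index
        j = (k + 1) % K
        M[k*N:(k+1)*N, j*N:(j+1)*N] -= np.kron(Bs[k].T, I_m)
    x = np.linalg.solve(M, rhs)   # GEPP via LAPACK
    return [x[k*N:(k+1)*N].reshape(m, n, order="F") for k in range(K)]

# Usage: triangular factors with well-separated diagonals, so the
# products of the A and B diagonals differ and the PSE is solvable.
rng = np.random.default_rng(6)
K, m, n = 3, 3, 2
As = [np.triu(rng.standard_normal((m, m)), 1) + np.diag(rng.uniform(2.0, 3.0, m))
      for _ in range(K)]
Bs = [np.triu(rng.standard_normal((n, n)), 1) + np.diag(rng.uniform(0.1, 0.5, n))
      for _ in range(K)]
Cs = [rng.standard_normal((m, n)) for _ in range(K)]
Xs = solve_pse(As, Bs, Cs)
```

The cyclic coupling X_K = X_0 shows up as the wrap-around column block j = (k + 1) mod K in the assembled matrix.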


Chapter 3

Ongoing and future work

The topics discussed in the thesis will be considered for further developments.

3.1 Sylvester-type matrix equations

The Bartels–Stewart method presented in Section 1.4 can be generalized to other matrix equations as well. Perhaps the simplest example (which was also mentioned by Bartels and Stewart) is the continuous-time Lyapunov equation (LYCT)

    AX + XA^T = C,                          (3.1)

where A and C = C^T of size n × n are general matrices with real entries. One may also consider the discrete-time counterparts of SYCT and LYCT: the discrete-time Sylvester equation (SYDT)

    AXB^T − X = C,                          (3.2)

where A, B and C are as in SYCT, and the discrete-time Lyapunov equation (LYDT)

    AXA^T − X = C,                          (3.3)

where A and C are as in LYCT. Furthermore, generalized variants of the Sylvester/Lyapunov matrix equations can be considered, for example, the generalized coupled Sylvester equation (GCSY)

    (AX − Y B, DX − Y E) = (C, F),          (3.4)

where A and D of size m × m, B and E of size n × n and C and F of size m × n are general matrices with real entries. A Bartels–Stewart-style method for GCSY can be formulated as follows:

1. Reduce the matrix pairs (A, D) and (B, E) to generalized Schur form.
2. Update the right-hand side matrix pair (C, F) with respect to the two generalized Schur decompositions.
3. Solve the resulting triangular generalized matrix equation.
4. Transform the computed solution back to the original coordinate system.

Notice that step 3 above has already been considered in [59]. Other generalized matrix equations to consider are the generalized Sylvester equation (GSYL)

    AXB^T - CXD^T = E,    (3.5)

where A and C of size m x m, B and D of size n x n, and E of size m x n are general matrices with real entries, the continuous-time generalized Lyapunov equation (GLYCT)

    AXE^T + EXA^T = C,    (3.6)

where A, E and C = C^T of size m x m are general matrices with real entries, and the discrete-time generalized Lyapunov equation (GLYDT)

    AXA^T + EXE^T = C,    (3.7)

where A, E and C = C^T of size m x m are general matrices with real entries.

    Name                            Matrix equation
    Standard Sylvester (CT)         op(A)X ± Xop(B) = C
    Standard Lyapunov (CT)          op(A)X + Xop(A^T) = C
    Standard Sylvester (DT)         op(A)Xop(B) ± X = C
    Standard Lyapunov (DT)          op(A)Xop(A^T) - X = C
    Generalized Coupled Sylvester   op1(A)X ± Y op2(B) = C,  op1(D)X ± Y op2(E) = F
    Generalized Sylvester           op1(A)Xop2(B) ± op1(C)Xop2(D) = E
    Generalized Lyapunov (CT)       op1(A)Xop2(E^T) + op2(E)Xop1(A^T) = C
    Generalized Lyapunov (DT)       op1(A)Xop1(A^T) + op2(E)Xop2(E^T) = C

Table 3.1: The sign and transpose variants of the Sylvester-type matrix equations to be considered in SCASY. CT and DT denote the continuous-time and discrete-time variants, respectively.

Parallel ScaLAPACK-style Bartels–Stewart-based algorithms for solving 42 sign and transpose variants of the standard and generalized Sylvester/Lyapunov matrix equations shown in Table 3.1 are under development (see also RECSY [41]). As the experienced reader might have noticed, the parallel Schur decomposition step for the generalized matrix equations requires a reliable and efficient ScaLAPACK-style implementation of the QZ-algorithm. To date, the most important contributions regarding block forms and parallelizations of the QZ-algorithm and related topics have been given by Dackland and Kågström [19, 20], Dackland [18], Adlerborn, Dackland and Kågström [1, 2], and Kågström and Kressner [43]. The objective is to develop algorithms and a software package, SCASY, that will include general and triangular solvers for the Sylvester-type matrix equations presented in this section, parallel condition estimators [46], and matrix generators for matrices and matrix pairs with specified eigenvalues. At the time of writing, most matrix equation algorithms have already been implemented, but not fully tested or documented [60].

3.2 Periodic eigenvalue problems

The direct method for eigenvalue reordering in a cyclic product of matrices has been further developed to support the computation of periodic eigenspaces with specified eigenvalues by bubble-sorting the selected eigenvalues to the top-left corner of the corresponding matrix product. This approach allows for condition estimation of selected clusters of eigenvalues by solving a few (say, five) associated PSEs (2.2). Level 3 BLAS algorithms for solving PSEs, based on Kronecker product representations of small PSEs, Gaussian elimination with partial pivoting (GEPP), and iterative refinement, have been developed. For the direct reordering method to be reliable, the associated periodic Schur decomposition must be stable. A number of implementations of the periodic QR-algorithm exist (see, e.g., [50]), but at the time of writing none is completely satisfactory. Some efforts to contribute to this area will be made.
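Underlying the periodic approach is the fact that the eigenvalues of a K-cyclic product do not depend on where the cycle is cut, which is what working on the whole cycle of factors in the periodic Schur form exploits. A plain NumPy check of this invariance (forming the products explicitly, which the periodic QR-algorithm deliberately avoids for accuracy reasons):

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 4, 5
A = [rng.standard_normal((n, n)) for _ in range(K)]

def cyclic_eigs(mats, shift):
    """Eigenvalues of the cyclic product whose rightmost factor is
    mats[shift], i.e. mats[shift+K-1 mod K] ... mats[shift+1 mod K] mats[shift]."""
    P = np.eye(mats[0].shape[0])
    for k in range(len(mats)):
        P = mats[(shift + k) % len(mats)] @ P
    return np.linalg.eigvals(P)

base = cyclic_eigs(A, 0)
for s in range(1, K):
    shifted = cyclic_eigs(A, s)
    # every eigenvalue of the unshifted product appears in each shifted one
    for lam in base:
        assert np.min(np.abs(shifted - lam)) < 1e-6 * max(1.0, abs(lam))
```

This is the familiar fact that AB and BA share eigenvalues, applied cyclically; the numerical point of the periodic Schur decomposition is to obtain these eigenvalues from the factors directly, without ever forming the possibly ill-conditioned explicit product.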
We remark that eigenvalue reordering techniques, like the one presented in this thesis, can be beneficial for implementing advanced aggressive early deflation techniques for any QR-style bulge-chasing algorithm [17]. Future work also includes considering more general periodic eigenvalue problems, like periodic matrix pairs (E_k, A_k), with E_k possibly singular, arising from periodic descriptor systems (see, e.g., [67] and Section 1.2), where K-cyclic matrix products of the form

    A = E_{K-1}^{-1} A_{K-1} E_{K-2}^{-1} A_{K-2} ... E_0^{-1} A_0    (3.8)

are considered. Such matrix products arise from the study of the monodromy matrix of the corresponding system. Notice that the matrix product is considered to be formal [7, 68], i.e., the inverses E_k^{-1} are not formed explicitly (see Section 1.5). Observe also that for K = 1, this problem is equivalent to the generalized eigenvalue problem

    Ax = λEx,    (3.9)

where the task is to compute generalized eigenvalue-eigenvector pairs of the matrix pair (A, E). One key ingredient here is the generalized Schur form, which can be computed by the QZ-algorithm without computing any explicit matrix inverses. For details we refer to [54, 21, 50] and the references therein. To perform direct eigenvalue reordering in the context of matrix products of the form (3.8), generalized periodic Sylvester-type matrix equations must be solved with high accuracy, in analogy with direct eigenvalue reordering in the generalized Schur form [42, 47]. The prototype software developed in Paper IV is being revised to reach library standard and quality (see, e.g., LAPACK [3] and SLICOT [61]). All developed software is designed to be integrated into state-of-the-art software libraries such as LAPACK and SLICOT.
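As a small illustration of the inverse-free QZ route for the K = 1 case (3.9), computed here with SciPy's dense QZ (an illustration only; the thesis targets ScaLAPACK-style implementations): the generalized Schur form yields eigenvalue pairs (alpha_i, beta_i) on the diagonals, and the generalized eigenvalues are the ratios alpha_i/beta_i, with beta_i = 0 signaling an infinite eigenvalue, so E is never inverted.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
E = rng.standard_normal((n, n))   # nonsingular with probability one

# QZ / generalized Schur:  A = Q AA Z^H,  E = Q BB Z^H,
# with AA and BB upper triangular in the complex case.
AA, BB, Q, Z = linalg.qz(A, E, output="complex")
alpha, beta = np.diag(AA), np.diag(BB)
lam = alpha / beta                # generalized eigenvalues of (A, E)

# cross-check against eig(E^{-1} A); the inverse is formed here only
# because this toy E is far from singular
ref = np.linalg.eigvals(np.linalg.solve(E, A))
for mu in lam:
    assert np.min(np.abs(ref - mu)) < 1e-6 * max(1.0, abs(mu))
```

For a singular or ill-conditioned E the reference computation above breaks down, while the QZ ratios remain meaningful, which is precisely why the formal-product viewpoint in (3.8) avoids explicit inverses.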

References

[1] B. Adlerborn, K. Dackland, and B. Kågström. Parallel Two-Stage Reduction of a Regular Matrix Pair to Hessenberg-Triangular Form. In T. Sørvik et al., editors, Applied Parallel Computing: New Paradigms for HPC Industry and Academia, volume 1947 of Lecture Notes in Computer Science. Springer-Verlag.
[2] B. Adlerborn, K. Dackland, and B. Kågström. Parallel and blocked algorithms for reduction of a regular matrix pair to Hessenberg-triangular and generalized Schur forms. In J. Fagerholm et al., editors, PARA 2002, LNCS 2367. Springer-Verlag, 2002.
[3] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, third edition, 1999.
[4] ATLAS - Automatically Tuned Linear Algebra Software. See http://math-atlas.sourceforge.net/.
[5] Z. Bai and J. W. Demmel. On swapping diagonal blocks in real Schur form. Linear Algebra Appl., 186:73–95, 1993.
[6] R. H. Bartels and G. W. Stewart. Algorithm 432: The Solution of the Matrix Equation AX + XB = C. Communications of the ACM, 15(9):820–826, 1972.
[7] P. Benner, V. Mehrmann, and H. Xu. Perturbation analysis for the eigenvalue problem of a formal product of matrices. BIT, 42(1):1–43, 2002.
[8] P. Benner and E. S. Quintana-Ortí. Solving Stable Generalized Lyapunov Equations with the Matrix Sign Function. Numerical Algorithms, 20(1):75–100, 1999.
[9] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Numerical Solution of Discrete Stable Linear Matrix Equations on Multicomputers. Parallel Algorithms and Applications, 17(1), 2002.

[10] P. Benner, E. S. Quintana-Ortí, and G. Quintana-Ortí. Solving Stable Sylvester Equations via Rational Iterative Schemes. Preprint SFB393/04-08, TU Chemnitz, 2004.
[11] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. W. Demmel, I. Dhillon, J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
[12] BLACS - Basic Linear Algebra Communication Subprograms. See http://.
[13] BLAS - Basic Linear Algebra Subprograms. See blas/index.html.
[14] A. Bojanczyk, G. H. Golub, and P. Van Dooren. The periodic Schur decomposition: algorithm and applications. In Proc. SPIE Conference, volume 1770, pages 31–42, 1992.
[15] A. Bojanczyk and P. Van Dooren. Reordering diagonal blocks in the real Schur form. In NATO ASI on Linear Algebra for Large Scale and Real-Time Applications.
[16] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23(4):929–947, 2002.
[17] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm, II: Aggressive early deflation. SIAM J. Matrix Anal. Appl., 23(4):948–973, 2002.
[18] K. Dackland. Parallel Reduction of a Regular Matrix Pair to Block-Hessenberg-Triangular Form: Algorithm Design and Performance Modelling. Report UMINF-98.09, Department of Computing Science, Umeå University, Sweden, 1998.
[19] K. Dackland and B. Kågström. Reduction of a Regular Matrix Pair (A, B) to Block Hessenberg-Triangular Form. In J. Dongarra, K. Madsen, and J. Wasniewski, editors, Applied Parallel Computing: Computations in Physics, Chemistry and Engineering Science, volume 1041 of Lecture Notes in Computer Science. Springer.
[20] K. Dackland and B. Kågström. Blocked Algorithms and Software for Reduction of a Regular Matrix Pair to Generalized Schur Form. ACM Trans. Math. Software, 25(4):425–454, 1999.

[21] K. Dackland and B. Kågström. Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form. ACM Trans. Math. Software, 25(4):425–454, 1999.
[22] J. W. Demmel and J. J. Dongarra. LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers. LAPACK Working Note 164, University of California, Berkeley, and University of Tennessee, Knoxville, 2005.
[23] J. J. Dongarra, S. Hammarling, and J. H. Wilkinson. Numerical considerations in computing invariant subspaces. SIAM J. Matrix Anal. Appl., 13(1):145–161, 1992.
[24] G. E. Dullerud and F. Paganini. A Course in Robust Control Theory: A Convex Approach. Springer-Verlag, New York.
[25] B. Eliasson. Numerical Vlasov-Maxwell Modelling of Space Plasma. PhD thesis, Uppsala University, Department of Information Technology, Scientific Computing.
[26] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review, 46(1):3–45, 2004.
[27] E. Elmroth, P. Johansson, B. Kågström, and D. Kressner. A web computing environment for the SLICOT library. In The Third NICONET Workshop on Numerical Control Software, pages 53–61.
[28] J. G. F. Francis. The QR Transformation, Parts I and II. Computer Journal, 4:265–271 and 332–345, 1961, 1962.
[29] G. H. Golub, S. Nash, and C. F. Van Loan. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Trans. Automat. Control, 24(6):909–913, 1979.
[30] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.
[31] GOTO-BLAS - High-Performance BLAS by Kazushige Goto. See http://.
[32] A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley, 2003.
[33] R. Granat. A Parallel ScaLAPACK-style Sylvester Solver. Master's thesis, UMNAD 435/03, Department of Computing Science, 2003.

[34] S. J. Hammarling. Numerical Solution of the Stable, Non-negative Definite Lyapunov Equation. IMA Journal of Numerical Analysis, 2:303–323, 1982.
[35] J. J. Hench and A. J. Laub. Numerical solution of the discrete-time periodic Riccati equation. IEEE Trans. Automat. Control, 39(6):1197–1210, 1994.
[36] N. J. Higham. Perturbation theory and backward error for AX - XB = C. BIT, 33(1):124–136, 1993.
[37] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA, second edition, 2002.
[38] S. Johansson. Stratification of Matrix Pencils in Systems and Control: Theory and Algorithms. Licentiate Thesis, Report UMINF-05.17, Department of Computing Science, Umeå University, Sweden, 2005.
[39] I. Jonsson and B. Kågström. Recursive blocked algorithms for solving triangular systems. I. One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software, 28(4):392–415, 2002.
[40] I. Jonsson and B. Kågström. Recursive blocked algorithms for solving triangular systems. II. Two-sided and generalized Sylvester and Lyapunov matrix equations. ACM Trans. Math. Software, 28(4):416–435, 2002.
[41] I. Jonsson and B. Kågström. RECSY - A High Performance Library for Solving Sylvester-Type Matrix Equations. In H. Kosch et al., editors, Euro-Par 2003 Parallel Processing, volume 2790 of Lecture Notes in Computer Science. Springer-Verlag, 2003.
[42] B. Kågström. A direct method for reordering eigenvalues in the generalized real Schur form of a regular matrix pair (A, B). In Linear Algebra for Large Scale and Real-Time Applications (Leuven, 1992), volume 232 of NATO Adv. Sci. Inst. Ser. E Appl. Sci. Kluwer Acad. Publ., Dordrecht.
[43] B. Kågström and D. Kressner. Multishift Variants of the QZ Algorithm with Aggressive Early Deflation. Report UMINF-05.11, Department of Computing Science, Umeå University, Umeå, Sweden, 2005.
[44] B. Kågström, P. Ling, and C. Van Loan. GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Software, 24(3):268–302, 1998.
[45] B. Kågström, P. Ling, and C. Van Loan. Algorithm 784: GEMM-Based Level 3 BLAS: Portability and Optimization Issues. ACM Trans. Math. Software, 24(3):303–316, 1998.

[46] B. Kågström and P. Poromaa. Distributed and shared memory block algorithms for the triangular Sylvester equation with sep^{-1} estimators. SIAM J. Matrix Anal. Appl., 13(1):90–101, 1992.
[47] B. Kågström and P. Poromaa. Computing eigenspaces with specified eigenvalues of a regular matrix pair (A, B) and condition estimation: theory, algorithms and software. Numer. Algorithms, 12(3-4):369–407, 1996.
[48] D. Kressner. An efficient and reliable implementation of the periodic QZ algorithm. In IFAC Workshop on Periodic Control Systems, 2001.
[49] D. Kressner. The periodic QR algorithm is a disguised QR algorithm. To appear in Linear Algebra Appl.
[50] D. Kressner. Numerical Methods and Software for General and Structured Eigenvalue Problems. PhD thesis, TU Berlin, Institut für Mathematik, Berlin, Germany, 2004.
[51] LAPACK - Linear Algebra Package. See lapack/.
[52] David C. Lay. Linear Algebra and its Applications, 2nd edition. Addison-Wesley.
[53] The MathWorks, Inc., Cochituate Place, 24 Prime Park Way, Natick, Mass. Matlab Version 6.5.
[54] C. B. Moler and G. W. Stewart. An algorithm for generalized matrix eigenvalue problems. SIAM J. Numer. Anal., 10:241–256, 1973.
[55] MPI - Message Passing Interface. See mpi/.
[56] Niconet Task II: Model Reduction. See niconet/nic2/nictask2.html.
[57] N. S. Nise. Control Systems Engineering. Wiley, Fourth International Edition.
[58] PBLAS - Parallel Basic Linear Algebra Subprograms. See netlib.org/scalapack/html/pblas_qref.html.
[59] P. Poromaa. Parallel Algorithms for Triangular Sylvester Equations: Design, Scheduling and Scalability Issues. In B. Kågström et al., editors, Applied Parallel Computing. Large Scale Scientific and Industrial Problems, volume 1541 of Lecture Notes in Computer Science. Springer-Verlag, 1998.

[60] SCASY - ScaLAPACK-style solvers for Sylvester-type matrix equations. See granat/scasy.html.
[61] SLICOT Library in the Numerics in Control Network (Niconet).
[62] J. Sreedhar and P. Van Dooren. Pole placement via the periodic Schur decomposition. In Proceedings Amer. Contr. Conf.
[63] A. Varga. Balancing related methods for minimal realization of periodic systems. Systems Control Lett., 36(5), 1999.
[64] A. Varga. Robust and minimum norm pole assignment with periodic state feedback. IEEE Trans. Automat. Control, 45(5), 2000.
[65] A. Varga. On solving discrete-time periodic Riccati equations. Accepted for IFAC.
[66] A. Varga. On solving periodic differential matrix equations with applications to periodic system norms computation. Submitted to CDC.
[67] A. Varga and P. Van Dooren. Computational methods for periodic systems - an overview. In Proc. of IFAC Workshop on Periodic Control Systems, Como, Italy, 2001.
[68] D. S. Watkins. Product Eigenvalue Problems. SIAM Review, 47:3–40, 2005.



Paper I

Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations

Robert Granat, Bo Kågström and Peter Poromaa
Department of Computing Science and HPC2N, Umeå University, SE Umeå, Sweden.
granat@cs.umu.se, bokg@cs.umu.se, peterp@cs.umu.se

Abstract

An implementation of a parallel ScaLAPACK-style solver for the general Sylvester equation, op(A)X - Xop(B) = C, where op(A) denotes A or its transpose A^T, is presented. The parallel algorithm is based on explicit blocking of the Bartels–Stewart method. An initial transformation of the coefficient matrices A and B to Schur form leads to a reduced triangular matrix equation. We use different matrix traversing strategies to handle the transposes in the problem to solve, leading to different new parallel wave-front algorithms. We also present a strategy to handle the problem when 2 x 2 diagonal blocks of the matrices in Schur form, corresponding to complex conjugate pairs of eigenvalues, are split between several blocks in the block-partitioned matrices. Finally, the solution of the reduced matrix equation is transformed back to the original coordinate system. The implementation acts in a ScaLAPACK environment using 2-dimensional block-cyclic mapping of the matrices onto a rectangular grid of processes. Real performance results are presented which verify that our parallel algorithms are reliable and scalable.

Keywords: Sylvester matrix equation, continuous-time, Bartels–Stewart method, blocking, GEMM-based, level 3 BLAS, SLICOT, ScaLAPACK-style algorithms.

With kind permission of Springer Science and Business Media, Springer-Verlag, Lecture Notes in Computer Science, 2790 (2003). © 2003 Springer-Verlag, Berlin. All rights reserved.

This research was conducted using the resources of the High Performance Computer Center North (HPC2N). Financial support has been provided by the Swedish Research Council under grant VR and by the Swedish Foundation for Strategic Research under grant A3 02:
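As a serial illustration of the blocked triangular solve that the wave-front algorithms parallelize, the sketch below traverses row blocks bottom-up and column blocks left to right, updating the right-hand side as blocks of X become known; along each antidiagonal the (i, j) subproblems are independent, which is the wave-front. SciPy's solve_sylvester stands in for the node-level kernel, the block size nb and the helper name are our choices, and the 2 x 2-block splitting issue mentioned in the abstract is sidestepped by using strictly upper triangular test matrices:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def triangular_sylvester_blocked(A, B, C, nb):
    """Blockwise solve of A X - X B = C for upper triangular A (m x m)
    and upper triangular B (n x n): row blocks bottom-up, column blocks
    left to right.  In the parallel algorithm the subproblems on each
    antidiagonal of the block grid form an independent wave-front."""
    m, n = C.shape
    X = np.zeros_like(C)
    rows = [(i, min(i + nb, m)) for i in range(0, m, nb)]
    cols = [(j, min(j + nb, n)) for j in range(0, n, nb)]
    for i0, i1 in reversed(rows):
        for j0, j1 in cols:
            # move already-computed parts of X to the right-hand side
            rhs = (C[i0:i1, j0:j1]
                   - A[i0:i1, i1:] @ X[i1:, j0:j1]
                   + X[i0:i1, :j0] @ B[:j0, j0:j1])
            # small Sylvester block:  A_ii X_ij + X_ij (-B_jj) = rhs
            X[i0:i1, j0:j1] = solve_sylvester(A[i0:i1, i0:i1],
                                              -B[j0:j1, j0:j1], rhs)
    return X

rng = np.random.default_rng(5)
A = np.triu(rng.standard_normal((6, 6))) + 2 * np.eye(6)   # spectra of A and B
B = np.triu(rng.standard_normal((4, 4))) - 2 * np.eye(4)   # kept well separated
C = rng.standard_normal((6, 4))
X = triangular_sylvester_blocked(A, B, C, nb=2)
assert np.linalg.norm(A @ X - X @ B - C) < 1e-8
```

The block updates are GEMM operations, which is where the level 3 BLAS performance of the real implementation comes from; the transpose variants of op(A) and op(B) change only the traversal order of the block grid.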

On aggressive early deflation in parallel variants of the QR algorithm

On aggressive early deflation in parallel variants of the QR algorithm On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden

More information

1. Introduction. Applying the QR algorithm to a real square matrix A yields a decomposition of the form

1. Introduction. Applying the QR algorithm to a real square matrix A yields a decomposition of the form BLOCK ALGORITHMS FOR REORDERING STANDARD AND GENERALIZED SCHUR FORMS LAPACK WORKING NOTE 171 DANIEL KRESSNER Abstract. Block algorithms for reordering a selected set of eigenvalues in a standard or generalized

More information

MATLAB TOOLS FOR SOLVING PERIODIC EIGENVALUE PROBLEMS 1

MATLAB TOOLS FOR SOLVING PERIODIC EIGENVALUE PROBLEMS 1 MATLAB TOOLS FOR SOLVING PERIODIC EIGENVALUE PROBLEMS 1 Robert Granat Bo Kågström Daniel Kressner,2 Department of Computing Science and HPC2N, Umeå University, SE-90187 Umeå, Sweden. {granat,bokg,kressner}@cs.umu.se

More information

Transactions on Mathematical Software

Transactions on Mathematical Software Parallel Solvers for Sylvester-Type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms Journal: Manuscript ID: Manuscript Type: Date Submitted by the Author: Complete

More information

Parallel eigenvalue reordering in real Schur forms

Parallel eigenvalue reordering in real Schur forms CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1 7 [Version: 2002/09/19 v2.02] Parallel eigenvalue reordering in real Schur forms R. Granat 1, B. Kågström

More information

Parallel Variants and Library Software for the QR Algorithm and the Computation of the Matrix Exponential of Essentially Nonnegative Matrices

Parallel Variants and Library Software for the QR Algorithm and the Computation of the Matrix Exponential of Essentially Nonnegative Matrices Parallel Variants and Library Software for the QR Algorithm and the Computation of the Matrix Exponential of Essentially Nonnegative Matrices Meiyue Shao Ph Licentiate Thesis, April 2012 Department of

More information

Parallel Model Reduction of Large Linear Descriptor Systems via Balanced Truncation

Parallel Model Reduction of Large Linear Descriptor Systems via Balanced Truncation Parallel Model Reduction of Large Linear Descriptor Systems via Balanced Truncation Peter Benner 1, Enrique S. Quintana-Ortí 2, Gregorio Quintana-Ortí 2 1 Fakultät für Mathematik Technische Universität

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Exponentials of Symmetric Matrices through Tridiagonal Reductions

Exponentials of Symmetric Matrices through Tridiagonal Reductions Exponentials of Symmetric Matrices through Tridiagonal Reductions Ya Yan Lu Department of Mathematics City University of Hong Kong Kowloon, Hong Kong Abstract A simple and efficient numerical algorithm

More information

Computation of a canonical form for linear differential-algebraic equations

Computation of a canonical form for linear differential-algebraic equations Computation of a canonical form for linear differential-algebraic equations Markus Gerdin Division of Automatic Control Department of Electrical Engineering Linköpings universitet, SE-581 83 Linköping,

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006. LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Direct Methods for Matrix Sylvester and Lyapunov Equations

Direct Methods for Matrix Sylvester and Lyapunov Equations Direct Methods for Matrix Sylvester and Lyapunov Equations D. C. Sorensen and Y. Zhou Dept. of Computational and Applied Mathematics Rice University Houston, Texas, 77005-89. USA. e-mail: {sorensen,ykzhou}@caam.rice.edu

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation

More information

Porting a Sphere Optimization Program from lapack to scalapack

Porting a Sphere Optimization Program from lapack to scalapack Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen

The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen and P. C. Hansen Department of Mathematical Modelling,

More information

S N. hochdimensionaler Lyapunov- und Sylvestergleichungen. Peter Benner. Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz

S N. hochdimensionaler Lyapunov- und Sylvestergleichungen. Peter Benner. Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz Ansätze zur numerischen Lösung hochdimensionaler Lyapunov- und Sylvestergleichungen Peter Benner Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz S N SIMULATION www.tu-chemnitz.de/~benner

More information

ON THE USE OF LARGER BULGES IN THE QR ALGORITHM

ON THE USE OF LARGER BULGES IN THE QR ALGORITHM ON THE USE OF LARGER BULGES IN THE QR ALGORITHM DANIEL KRESSNER Abstract. The role of larger bulges in the QR algorithm is controversial. Large bulges are infamous for having a strong, negative influence

More information

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

A Continuation Approach to a Quadratic Matrix Equation

A Continuation Approach to a Quadratic Matrix Equation A Continuation Approach to a Quadratic Matrix Equation Nils Wagner nwagner@mecha.uni-stuttgart.de Institut A für Mechanik, Universität Stuttgart GAMM Workshop Applied and Numerical Linear Algebra September

More information

Arnoldi Methods in SLEPc

Arnoldi Methods in SLEPc Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating

LAPACK-Style Codes for Pivoted Cholesky and QR Updating LAPACK-Style Codes for Pivoted Cholesky and QR Updating Sven Hammarling 1, Nicholas J. Higham 2, and Craig Lucas 3 1 NAG Ltd.,Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, England, sven@nag.co.uk,

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Parallel eigenvalue reordering in real Schur forms

Parallel eigenvalue reordering in real Schur forms Parallel eigenvalue reordering in real Schur forms LAPACK Working Note #192 R. Granat 1, B. Kågström 1, D. Kressner 1,2 1 Department of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden.

More information

Solving projected generalized Lyapunov equations using SLICOT

Solving projected generalized Lyapunov equations using SLICOT Solving projected generalized Lyapunov equations using SLICOT Tatjana Styel Abstract We discuss the numerical solution of projected generalized Lyapunov equations. Such equations arise in many control

More information

Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *

Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Index. for generalized eigenvalue problem, butterfly form, 211

Index. for generalized eigenvalue problem, butterfly form, 211 Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS

A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS ROBERT GRANAT, BO KÅGSTRÖM, AND DANIEL KRESSNER Abstract A novel variant of the parallel QR algorithm for solving dense nonsymmetric

More information

On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations

On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations Max Planck Institute Magdeburg Preprints Martin Köhler Jens Saak On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER

More information

MULTISHIFT VARIANTS OF THE QZ ALGORITHM WITH AGGRESSIVE EARLY DEFLATION LAPACK WORKING NOTE 173

MULTISHIFT VARIANTS OF THE QZ ALGORITHM WITH AGGRESSIVE EARLY DEFLATION LAPACK WORKING NOTE 173 MULTISHIFT VARIANTS OF THE QZ ALGORITHM WITH AGGRESSIVE EARLY DEFLATION LAPACK WORKING NOTE 173 BO KÅGSTRÖM AND DANIEL KRESSNER Abstract. New variants of the QZ algorithm for solving the generalized eigenvalue

More information

Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues

Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues BIT Numerical Mathematics 0006-3835/03/430-000 $6.00 2003 Vol. 43 No. pp. 00 08 c Kluwer Academic Publishers Computing Periodic Deflating Subspaces Associated with a Specified Set of Eigenvalues R. GRANAT

More information

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation.

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation. 1 A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation João Carvalho, DMPA, Universidade Federal do RS, Brasil Karabi Datta, Dep MSc, Northern Illinois University, DeKalb, IL

More information

- Factorized Solution of Sylvester Equations with Applications in Control. Peter Benner.
- Accelerating Linear Algebra Computations with Hybrid GPU-Multicore Systems. Marc Baboulin (INRIA/Université Paris-Sud), joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory).
- On a Quadratic Matrix Equation Associated with an M-Matrix. Chun-Hua Guo (University of Regina), submitted to IMA Journal of Numerical Analysis.
- Preconditioned Parallel Block Jacobi SVD Algorithm. Gabriel Okša and Marián Vajteršic, Parallel Numerics '05.
- Dense LU Factorization and Its Error Analysis. Laura Grigori (INRIA and LJLL, UPMC), February 2016.
- Module 6.6: nag_nsym_gen_eig, Nonsymmetric Generalized Eigenvalue Problems (NAG documentation).
- Symmetric Pivoting in ScaLAPACK. Craig Lucas (University of Manchester), Cray User Group, 8 May 2006, Lugano.
- Roundoff Error (lecture notes, August 29, 2011).
- Balancing-Related Model Reduction for Data-Sparse Systems. Peter Benner (TU Chemnitz), Computational Methods with Applications, Harrachov, 19-25 August 2007.
- CS 542G: Conditioning, BLAS, LU Factorization. Robert Bridson, September 22, 2008.
- Computing the Logarithm of a Symmetric Positive Definite Matrix. Ya Yan Lu (City University of Hong Kong).
- Lecture 8: Fast Linear Solvers (Part 7): modified Gram-Schmidt with reorthogonalization; Householder Arnoldi.
- Matrix Eigensystem Tutorial for Parallel Computation. High Performance Computing Center (HPC), 2003.
- Generalized Interval Arithmetic on Compact Matrix Lie Groups. Hermann Schichl, Mihály Csaba Markót, and Arnold Neumaier.
- On Computing Complex Square Roots of Real Matrices. Zhongyun Liu, Yulin Zhang, Jorge Santos, and Rui Ralha.
- Parallel Solution of Large-Scale and Sparse Generalized Algebraic Riccati Equations. José M. Badía, Peter Benner, Rafael Mayo, and Enrique S. Quintana-Ortí.
- MPI Implementations for Solving Dot-Product on Heterogeneous Platforms. Panagiotis D. Michailidis and Konstantinos G. Margaritis.
- Out-of-Core SVD and QR Decompositions. Eran Rabani and Sivan Toledo.
- Matrix Computations: Direct Methods II. Lecture 11, May 5, 2014.
- More Gaussian Elimination and Matrix Inversion (course notes, Week 7).
- Math 504 (Fall 2011), Study Guide for Weeks 11-14. Emre Mengi.
- Preconditioning in the Parallel Block-Jacobi SVD Algorithm. Gabriel Okša and Marián Vajteršic, Proceedings of ALGORITMY 2005.
- Contents, preface, and introduction of a text on parallel computing (Chapter 1: Computer Architectures; types of parallelism).
- Balanced Truncation Model Reduction of Large and Sparse Generalized Linear Systems. José M. Badía, Peter Benner, Rafael Mayo, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and A. Remón.
- The Relation Between the QR and LR Algorithms. Hongguo Xu, SIAM J. Matrix Anal. Appl. 19(2), pp. 551-555, April 1998.
- Accelerating Model Reduction of Large Linear Systems with Graphics Processors. P. Benner, P. Ezzatti, D. Kressner, E. S. Quintana-Ortí, and A. Remón.
- Solving Algebraic Riccati Equations with SLICOT. Peter Benner (TU Berlin).
- Stability of Gauss-Huard Elimination for Solving Linear Systems. T. J. Dekker et al., Technical Report CS-93-08, University of Amsterdam.
- Matrix and Linear Algebra Aided with MATLAB, Second Edition. Kanti Bhushan Datta.
- Specialized Parallel Algorithms for Solving Lyapunov and Stein Equations. Journal of Parallel and Distributed Computing 61, 1489-1504 (2001).
- Charles F. Van Loan: curriculum vitae (Department of Computer Science, Cornell University).
- Parallel QR Processing of Generalized... Theoretical Computer Science 412 (2011), 1484-1491.
- Low Rank Solution of Data-Sparse Sylvester Equations. U. Baur (TU Berlin).
- Direct Methods for Matrix Sylvester and Lyapunov Equations. Danny C. Sorensen and Yunkai Zhou (received 12 December 2002, revised 31 January 2003).
- Numerical Methods in Matrix Computations. Åke Björck, Springer.
- Matrix Computations (CS 6210), notes for 2016-10-31. D. Bindel, Fall 2016.
- Efficient Implementation of Large Scale Lyapunov and Riccati Equation Solvers. Jens Saak, joint work with Peter Benner (TU Chemnitz).
- NAG Toolbox for MATLAB: nag_lapack_dggev (f08wa).
- Initializing Newton's Method for Discrete-Time Algebraic Riccati Equations Using the Butterfly SZ Algorithm. Heike Faßbender (Zentrum für Technomathematik).
- Matrix Algorithms, Volume II: Eigensystems. G. W. Stewart, SIAM, Philadelphia.
- On the Application of Different Numerical Methods to Obtain Null-Spaces of Polynomial Matrices, Part 2: Block Displacement Structure Algorithms. J. C. Zúñiga and D. Henrion.
- Accelerating the Convergence of Blocked Jacobi Methods. D. Giménez, M. T. Cámara, and P. Montilla (Universidad de Murcia).
- Skew-Hamiltonian and Hamiltonian Eigenvalue Problems: Theory, Algorithms and Applications. Peter Benner (TU Chemnitz) and Daniel Kressner.
- Strong Rank Revealing Cholesky Factorization. M. Gu and L. Miranian, Electronic Transactions on Numerical Analysis 17 (2004).
- Notes on LU Factorization. Robert A. van de Geijn (The University of Texas at Austin), October 11, 2014.
- A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures. F. Tisseur and J. Dongarra, 1999 (MIMS EPrint 2007.225).
- A Computational Approach for Optimal Periodic Output Feedback Control. A. Varga and S. Pieters (DLR Oberpfaffenhofen).
- Eigenvalue Problems and Singular Value Decomposition. Sanzheng Qiao (McMaster University), August 2012.
- Math 411 Preliminaries (Newton's method, Taylor series expansions, eigenvalues and eigenvectors).
- Lecture 2: Numerical Linear Algebra (QR factorization, eigenvalue and singular value decompositions, conditioning, floating-point arithmetic and stability).
- The Stable Embedding Problem. R. Zavala Yoé, C. Praagman, and H. L. Trentelman (University of Groningen).
- Using Godunov's Two-Sided Sturm Sequences to Accurately Compute Singular Vectors of Bidiagonal Matrices. A. M. Matsekh and E. P. Shurina.
- Solving Stable Sylvester Equations via Rational... Peter Benner, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí (Sonderforschungsbereich 393).
- AMS526: Numerical Analysis I, Lecture 19: Computing the SVD; Sparse Linear Systems. Xiangmin Jiao (Stony Brook University).
- Communication Avoiding Parallel Algorithms for Dense Matrix Factorizations. Edgar Solomonik (UC Berkeley), October 2013.
- Week 6: Gaussian Elimination (course notes, edX).
- CME 302: Numerical Linear Algebra, Fall 2005/06, Lecture 0. Gene H. Golub.
- Lecture 11 (CMSC 878R/AMSC 698R): Iterative Methods, An Introduction.
- A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problems. Hongguo Xu.
- NAG Library Routine Document F08PNF (ZGEES).
- Numerical Methods I: Eigenvalue Problems. Aleksandar Donev (Courant Institute, NYU), October 2, 2014.
- Scientific Computing: Direct Solution Methods. Martin van Gijzen (Delft University of Technology), October 3, 2018.
- On Checking Null Rank Conditions of Rational Matrices. Andreas Varga, arXiv (cs.SY), 29 December 2018.
- Numerical Methods I: Non-Square and Sparse Linear Systems. Aleksandar Donev (Courant Institute, NYU), September 25, 2014.