Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation


Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation

by Robert Granat, Bo Kågström, Daniel Kressner, and Meiyue Shao

UMINF-12/06

UMEÅ UNIVERSITY
DEPARTMENT OF COMPUTING SCIENCE
SE-901 87 UMEÅ
Sweden

Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation

Robert Granat, Bo Kågström, Daniel Kressner, and Meiyue Shao

June 3, 2014

Abstract

Library software implementing a parallel small-bulge multishift QR algorithm with aggressive early deflation (AED) targeting distributed memory high-performance computing systems is presented. Starting from recent developments of the parallel multishift QR algorithm [Granat et al., SIAM J. Sci. Comput. 32(4), 2010], we describe a number of algorithmic and implementation improvements. These include communication avoiding algorithms via data redistribution and a refined strategy for balancing between multishift QR sweeps and AED. Guidelines concerning several important tunable algorithmic parameters are provided. As a result of these improvements, a computational bottleneck within AED has been removed in the parallel multishift QR algorithm. A performance model is established to explain the scalability behavior of the new parallel multishift QR algorithm. Numerous computational experiments confirm that our new implementation significantly outperforms previous parallel implementations of the QR algorithm.

Keywords: Multishift QR algorithm, aggressive early deflation, parallel algorithms, distributed memory architectures

Report UMINF-12/06. The work is supported by the Swedish Research Council under grant A, UMIT Research Lab via an EU Mål 2 project, and eSSENCE, a strategic collaborative e-science programme funded by the Swedish Research Council.
Department of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden. {granat, bokg, myshao}@cs.umu.se.
MATHICSE, EPF Lausanne, CH-1015 Lausanne, Switzerland. daniel.kressner@epfl.ch.

1 Introduction

The QR algorithm is the method of choice for computing all eigenvalues of a nonsymmetric matrix A in R^{n x n}. This paper describes a novel parallel implementation of the multishift QR algorithm for distributed memory architectures. While our implementation is largely based on the algorithms described in [21], a number of additional algorithmic improvements have been made, leading to significantly reduced execution times and higher robustness.

In the following, we give a brief history of serial and parallel implementations of the QR algorithm. The Algol procedure hqr by Martin, Petersen, and Wilkinson [32] was among the first computer implementations of the QR algorithm. A Fortran translation of this procedure was included in EISPACK [35] as routine HQR. The initial version of the LAPACK routine DHSEQR was based on work by Bai and Demmel [5]; the most notable difference to HQR was the use of multishift techniques to improve data locality. This routine had only seen a few minor modifications [2] until LAPACK version 3.1, when it was replaced by an implementation incorporating pipelined bulges and aggressive early deflation techniques from the works by Braman, Byers, and Mathias [12, 13]. This implementation is described in more detail in [14]. While there has been a lot of early work on parallelizing the QR algorithm, for example in [23, 33, 37, 38, 39, 40], the first publicly available parallel implementation was released only in 1997 in ScaLAPACK [11] version 1.5 as routine PDLAHQR, based on work by Henry, Watkins, and Dongarra [24]. A complex version PZLAHQR of this routine was included later on [16].

In this work, we describe a new parallel implementation of the QR algorithm that aims to replace PDLAHQR. It might be interesting to note that all recently released high-performance linear algebra packages, such as MAGMA and PLASMA [1], ELPA [3], and FLAME [10], lack adapted implementations of the QR algorithm or other nonsymmetric eigenvalue solvers.

Given a nonsymmetric matrix A, the parallel implementation of the eigenvalue solver in ScaLAPACK consists of the following steps. In the first optional step, the matrix is balanced, that is, an invertible diagonal matrix D is computed to make the rows and columns of A <- D^{-1} A D as close as possible. In the second step, A is reduced to Hessenberg form: H = Q_0^T A Q_0 with an orthogonal matrix Q_0 and h_{ij} = 0 for i >= j + 2. In the third step, the QR algorithm iteratively reduces H further to real Schur form, eventually resulting in an orthogonal matrix Q such that

    T = Q^T H Q                                                         (1)

is quasi-upper triangular. This means that T is block upper triangular with 1-by-1 blocks (corresponding to real eigenvalues) and 2-by-2 blocks (corresponding to complex conjugate eigenvalue pairs) on the diagonal; a small example is given after Algorithm 1 below. Therefore, the Schur decomposition of A is A = Z T Z^T, where Z = Q_0 Q. The last optional step consists of computing the eigenvectors of T and performing a back transformation to obtain the eigenvectors of the original matrix A. This paper is only concerned with the reduction to real Schur form (1). In particular, we will not discuss the implementation of the Hessenberg reduction; see [31, 29, 36] for recent developments in this direction.

The rest of this paper is organized as follows. Section 2 provides a summary of our implementation and the underlying algorithm, emphasizing improvements over [21]. In Section 3, we present a performance model that provides insights into the cost of computations and communication. The derivation of this model is given in the electronic appendix. The performance model is then used to guide the choice of the parameters in Section 4. Finally, in Section 5, we evaluate the performance of our parallel library software by a large set of numerical experiments.

2 Algorithms and Implementation

Modern variants of the QR algorithm usually consist of two major components: multishift QR sweeps and aggressive early deflation (AED). A typical structure of the modern QR algorithm in a sequential or parallel setting is provided in Algorithm 1. Although our parallel multishift QR algorithm also builds on Algorithm 1, there are several issues to be considered for reaching high performance. Figure 1 shows the software hierarchy of our implementation of the parallel multishift QR algorithm. Details of the algorithm and some implementation issues are discussed in the following subsections.

Algorithm 1 Multishift QR algorithm with AED
Input: H in R^{n x n}, H is upper Hessenberg.
Output: A real Schur form of H.
1: while not converged do
2:   Perform AED on the n_AED x n_AED trailing principal submatrix.
3:   Apply the accumulated orthogonal transformation to the corresponding off-diagonal blocks.
4:   if a large fraction of eigenvalues has been deflated in Step 2 then
5:     goto Step 2.
6:   end if
7:   Perform a small-bulge multishift QR sweep with n_shift undeflatable eigenvalues obtained from Step 2 as shifts.
8:   Check for negligible subdiagonal elements.
9: end while
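As a small illustration of the real Schur form (1) referred to above (this example is ours, not from the paper), a 2-by-2 diagonal block encodes a complex conjugate eigenvalue pair:

```latex
% Illustrative example (not from the paper): a 4-by-4 quasi-upper triangular T.
% The 2-by-2 block [1 3; -2 1] has the complex conjugate eigenvalues 1 +/- sqrt(6) i.
\[
T = \begin{bmatrix}
      2 & * & * & * \\
      0 & 1 & 3 & * \\
      0 & -2 & 1 & * \\
      0 & 0 & 0 & 5
    \end{bmatrix},
\qquad
\lambda(T) = \bigl\{\, 2,\; 1 \pm \sqrt{6}\,\mathrm{i},\; 5 \,\bigr\}.
\]
```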

PDHSEQR   Entry routine for the new parallel multishift QR algorithm.
PDLAQR1   Modified version of ScaLAPACK's original implementation of the parallel pipelined QR algorithm.
PDLAQR3   Aggressive early deflation and shift computation.
PDLAQR0   New parallel multishift QR algorithm.
PDLAQR2   Aggressive early deflation.
PDTRORD   Parallel eigenvalue reordering.
PDLAQR5   Multishift QR iteration based on chains of tightly coupled bulges.

Figure 1: Software hierarchy of the parallel multishift QR algorithm with AED.

Figure 2: The 2D block-cyclic data layout across a 2 x 3 processor grid. For example, processor (0, 0) owns all highlighted blocks.

2.1 Data layout convention in ScaLAPACK

In ScaLAPACK, the p = p_r p_c processors are usually arranged into a p_r x p_c grid. Matrices are distributed over the rectangular processor grid in a 2D block-cyclic layout with block size m_b x n_b (see an example in Figure 2). The information regarding the data layout is stored in an array descriptor so that the mapping between entries of the global matrix and their corresponding locations in the memory hierarchy can be established. We adopt ScaLAPACK's data layout convention and require that the n x n input matrices H and Z have identical data layouts with square data blocks (i.e., m_b = n_b). However, the processor grid need not be square unless explicitly specified.

2.2 Multishift QR sweep

The multishift QR sweep is a bulge chasing process that involves several shifts. The QR sweep applied to a Hessenberg matrix H with k shifts sigma_1, sigma_2, ..., sigma_k yields another Hessenberg matrix Q^T H Q, where Q is determined by the QR decomposition of the shift polynomial:

    (H - sigma_1 I)(H - sigma_2 I) ... (H - sigma_k I) = Q R.
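The sweep is performed implicitly: because H is Hessenberg, the first column of the shift polynomial has only k + 1 nonzero entries, so the bulges are introduced by a small Householder reflection and then chased down the diagonal. As an illustration (this worked example is ours, not from the paper), for a pair of shifts:

```latex
% Illustration (not from the paper): first column of the shift polynomial for
% two shifts sigma_1, sigma_2.  Only three entries are nonzero because H is
% upper Hessenberg, so a 3-by-3 Householder reflection introduces the bulge.
\[
v = (H-\sigma_2 I)(H-\sigma_1 I)\,e_1 =
\begin{bmatrix}
  (h_{11}-\sigma_1)(h_{11}-\sigma_2) + h_{12}h_{21} \\
  h_{21}\,(h_{11}+h_{22}-\sigma_1-\sigma_2) \\
  h_{32}\,h_{21} \\
  0 \\ \vdots
\end{bmatrix}.
\]
```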

(a) PDLAHQR    (b) PDLAQR5

Figure 3: Loosely coupled shifts vs. tightly coupled shifts. The dashed lines represent borders of the processor grid. Only parts of the matrix are displayed. Intrablock bulge chasings can be performed independently and in parallel.

Thanks to the Implicit Q theorem (e.g., see [17, 18]), there is a lot of freedom regarding how to perform the QR sweep. In ScaLAPACK v1.8.0 and earlier versions, PDLAHQR uses a pipelined approach that chases a chain of loosely coupled bulges, see Figure 3(a). While this approach clearly gives potential for parallelism, it comes with the disadvantage that its computational intensity is concentrated at level 1 and level 2 BLAS. In addition, frequent communication between processors is required. To avoid these shortcomings of the pipelined QR algorithm, we use several chains of tightly coupled bulges, see Figure 3(b), to improve node performance and reduce communication.

For the new bulge chasing routine PDLAQR5, the total number of shifts (n_shift) used in a single QR sweep is shown in Table 1 and is usually much larger compared to the pipelined approach. These shifts are divided into several chains of tightly coupled bulges with up to n_b/3 shifts per chain, so that the length of each chain does not exceed n_b/2. The chains are placed on different diagonal blocks, such that Theta(sqrt(p)) chains can be chased locally and simultaneously. The corresponding off-diagonal blocks are updated by explicit multiplication with the orthogonal matrix accumulated in the diagonal chasing step. This delay-and-accumulate technique leads to level 3 computational intensity. When the chains are passing through the processor borders, they are chased in an odd-even manner (see Figure 4) to avoid conflicts between different tightly coupled chains. We refer to [21] for a detailed description of this technique.

Figure 4: Interblock bulge chasing. Odd-numbered chains are chased in the first round (left) and even-numbered chains in the second round (right).

Table 1: Recommended values for n_shift and n_AED. Values are taken from Table 2.1 in [21] for n <= 96K; values for n > 96K are extrapolations. These values can be tuned by the user. However, such a tuning would not only need to take the computer architecture into account but also the balance between multishift QR iterations and AED, which depends on the particular matrix under consideration.

    matrix size (n)     n_shift       n_AED
    < 6K                see [14]      see [14]
    6K-12K
    12K-24K
    24K-48K
    48K-96K
    96K-192K
    192K-384K
    384K-768K
    768K-1000K
    > 1M                n/25          3 n_shift/2

2.3 Aggressive Early Deflation (AED)

The AED technique has been proposed by Braman, Byers, and Mathias [13] and proceeds by partitioning the current upper Hessenberg matrix H in R^{n x n} as

                 n-n_AED-1   1      n_AED
    n-n_AED-1  [ H_11        H_12   H_13 ]
  H =       1  [ H_21        H_22   H_23 ]
        n_AED  [ 0           H_32   H_33 ],

where H_33 in R^{n_AED x n_AED} is the so-called AED window. By computing the (real) Schur decomposition H_33 = V T V^T and applying the corresponding similarity transformation to H, we obtain

               [ H_11   H_12   H_13 V ]
    U^T H U =  [ H_21   H_22   H_23 V ],
               [ 0      s      T      ]

where

         [ I   0 ]
    U =  [ 0   V ]

with I in R^{(n-n_AED) x (n-n_AED)}. The vector s in R^{n_AED} is the so-called spike, created from the first entry of the vector H_32. The last diagonal entry (or 2-by-2 diagonal block) of T can be deflated if the magnitude of the last component (or the last two components) of the spike is negligible. Undeflatable eigenvalues are moved to the top left corner of T by a swapping algorithm [6, 20]. The orthogonal transformations for reordering eigenvalues in the Schur form of the AED window are accumulated in an n_AED x n_AED orthogonal matrix. By repeating the same procedure for all diagonal entries (or 2-by-2 blocks) of T, the eigenvalues of T are checked successively and possibly deflated. Then the entire matrix is reduced back to upper Hessenberg form and the off-diagonal blocks H_13 and H_23 are multiplied by V~, the product of all involved orthogonal transformations. Typically, the size of the AED window is recommended to be somewhat larger, e.g., by 50%, than the number of shifts in the multishift QR sweeps [12, 14], see also Table 1.

In principle, aggressive early deflation can be incorporated into any variant of the QR algorithm. Therefore both the new multishift QR algorithm and the pipelined QR algorithm benefit from performing AED. As the efficient implementation of AED requires some care, we will now discuss these two settings in more detail.

2.3.1 AED within the new multishift QR algorithm

As this setting will be used for targeting large-scale problems, the AED window can be expected to become quite large. It is therefore not reasonable to expect that executing AED locally and sequentially yields good performance. Hence, the corresponding routine PDLAQR3 for performing AED requires a parallel approach.

The first and most costly step of AED is to calculate the Schur decomposition of H_33, the n_AED x n_AED trailing principal submatrix of H. This eigenvalue problem can be solved either by recursively using the new multishift QR algorithm (PDLAQR0) or by using the pipelined QR algorithm (PDLAQR1). The choice of the solver is determined by the size of the AED window as well as the number of processors used. Since n_AED is relatively small compared to n, the number of available processors may be too large to occupy all of them without causing significant communication overhead. In this case, we only use a subset of the processors to reduce the overhead and minimize the execution time. See Section 2.6 for a more detailed discussion.

In the deflation checking phase, the reordering algorithm is arranged in a blocked manner to reduce memory transfers and communication. Unlike the procedure described in [12], undeflatable eigenvalues are not moved immediately and individually towards the top left corner of the AED window. Instead, they are first reordered within an n_b x n_b computational window. Only after all eigenvalues in this n_b x n_b window are checked, the group of undeflatable eigenvalues is moved simultaneously to the top left corner of the AED window. This blocked approach increases the computational intensity and avoids the frequent communication needed when reordering each eigenvalue individually. The procedure is repeated until all eigenvalues in the AED window are checked. The last step is to eliminate the spike and reduce the AED window back to upper Hessenberg form.
This task is performed with the ScaLAPACK routine PDGEHRD. The corresponding off-diagonal blocks are updated by explicitly multiplying with the accumulated orthogonal matrix using the PBLAS routine PDGEMM.

2.3.2 AED within the pipelined QR algorithm

Within the AED stage in the new parallel multishift QR algorithm, the pipelined QR algorithm is often used as the eigensolver since the AED window is relatively small compared to the whole matrix.

However, the original ScaLAPACK v1.8.0 implementation PDLAHQR of the pipelined QR algorithm is not equipped with the AED strategy and is hence not efficient. When adding AED, we have taken into account that we will only use this routine for small- to medium-size (sub)matrices. In particular, we can expect the AED window to be sufficiently small such that AED can be performed on one processor efficiently, by using the LAPACK implementation of AED. Apart from AED, our new routine PDLAQR1 incorporates further modifications to PDLAHQR, making it both faster and more robust. The following list summarizes the most important aspects; additional details can be found in [30, 34].

Aggressive early deflation: AED is implemented in an auxiliary routine PDLAQR2 which copies the AED window to local memory and calls the LAPACK routine DLAQR3 to solve the problem sequentially. To determine the parameters of AED, we use the settings of the LAPACK installation determined by ILAENV. However, a notable difference is that we do not use the undeflatable eigenvalues from the AED step as shifts in the subsequent pipelined QR sweep. Instead, we recompute the eigenvalues of the trailing submatrix to improve the quality of the shifts. We have observed that this shifting strategy accelerates the convergence of the pipelined QR algorithm.

Conventional deflation: In PDLAHQR, pipelined QR sweeps are performed until the very end, that is, until the remaining diagonal blocks are all of size 1-by-1 or 2-by-2. In PDLAQR1, we use a different strategy: once the active block is sufficiently small, we copy this block to local memory and call the LAPACK routines DLAHQR/DLAQR4 to solve the problem sequentially. This strategy significantly reduces communication overhead in the latter stages and is implemented in an auxiliary routine PDLAQR4.

Avoidance of anomalies: The original ScaLAPACK routine PDLAHQR suffered from two anomalies, which have been removed. First, the routine sometimes returned 2-by-2 diagonal blocks containing real eigenvalues, which is not in accordance with the specification of the interface. In PDLAQR1, each 2-by-2 diagonal block contains a pair of complex conjugate eigenvalues. The second issue is concerned with a strategy already proposed by Francis [17] to potentially benefit from two consecutive small subdiagonal entries. In the rare case when it is successful, this strategy allows one to introduce bulges below the top left corner of the active submatrix. However, this turns out to be difficult to implement in a safe manner in pipelined or multishift QR sweeps. Indeed, when using the ScaLAPACK routine PDLACONSB, which implements this strategy, we have observed large relative residuals of norm up to 10^{-5}, indicating numerical instabilities. As the performance improvements gained from this strategy are usually negligible, we have decided to remove it. In return, the numerical stability is improved.

2.4 Switching between Multishift QR and AED

In the LAPACK implementation of the QR algorithm, there are rules for balancing the cost between multishift QR sweeps and AED, see Step 4 in Algorithm 1. The precise meaning of this step is characterized by a threshold called NIBBLE. If we let n_undflt denote the number of undeflatable shifts in an AED step, the multishift QR sweep is skipped if

    (n_AED - n_undflt) / n_AED >= NIBBLE / 100.

Since AED behaves differently for different matrices, this strategy automatically adjusts the choice between AED and QR sweeps based on the properties of the matrix.
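As a minimal illustration of this test (the helper name and its use are hypothetical, not part of the library), the criterion can be evaluated in exact integer arithmetic:

```fortran
*     Hypothetical helper (not part of the library): decide whether the
*     next multishift QR sweep can be skipped.  NAED is the size of the
*     AED window, NUNDFL the number of undeflatable eigenvalues, and
*     NIBBLE the threshold in percent.  The comparison implements
*     ( NAED - NUNDFL ) / NAED >= NIBBLE / 100 without rounding issues.
      LOGICAL FUNCTION SKIPQR( NAED, NUNDFL, NIBBLE )
      INTEGER            NAED, NUNDFL, NIBBLE
      SKIPQR = 100*( NAED - NUNDFL ) .GE. NIBBLE*NAED
      RETURN
      END
```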
The default value of NIBBLE in the sequential LAPACK implementation is 14, which provides a good balance between multishift QR sweeps and AED. The same default value is used in the pipelined QR algorithm. However, in the new parallel multishift QR algorithm, the parallel AED process becomes substantially more expensive than the sequential AED process due to communication. As explained above, the AED process only involves a smaller trailing submatrix, leading to decreased parallel efficiency. To account for this, NIBBLE should be set larger to avoid performing AED too frequently. A good choice of this threshold depends both on the size of the matrix H and the number of processors involved.

We use the model NIBBLE = a n^b p^c for this purpose, where a, b, and c are machine-dependent constants. An appropriate choice of these constants can be gained from repeated runs of the program with different thresholds. It turns out that the right choice of NIBBLE becomes rather sensitive when communication is slow. (In our numerical experiments, see Section 5, the default values on our computing architectures are chosen as (a, b, c) = (335, 0.44, 0.5).)

A complication arises when using NIBBLE > 33. Such a choice may lead to situations where the number of undeflatable eigenvalues is less than the desired number of shifts, that is, n_undflt < n_shift, due to the fact that n_AED = 3 n_shift / 2. The solution in the software is that as long as n_undflt >= n_shift/2, we only use these n_undflt undeflatable eigenvalues as shifts in the next QR sweep. However, the condition n_undflt >= n_shift/2 may also fail when NIBBLE > 67. In this case we calculate the eigenvalues of the n_shift x n_shift trailing principal submatrix of H and use them as shifts. The calculation can be performed by either PDLAQR0 or PDLAQR1, just like computing the Schur decomposition in the AED step.

2.5 Task duplication: efficient but hazardous

Task duplication is a common technique in parallel computing to reduce communication and potentially improve performance, see, e.g., [19]. This technique has already been employed in previous implementations of the parallel QR algorithm, such as PDLAHQR and the software developed in [21, 30]. However, a crucial assumption is made when applying this technique: all involved processors need to produce identical outputs for identical tasks. Due to the effect of roundoff error, this assumption is not always satisfied, especially on heterogeneous architectures. This lack of numerical reproducibility is potentially harmful to the robustness of the parallel QR algorithm.

As discussed in Section 2.3.2, all computations within the AED window are performed sequentially for the pipelined QR algorithm. In [30], we have proposed to duplicate these sequential parts on all involved processors. The parallel update of the off-diagonal blocks can then be performed with the local copy of the orthogonal transformation matrix resulting from AED, without any extra communication. However, even the slightest change in finite-precision arithmetic may lead to very different outputs produced by AED. In particular, the ordering of the eigenvalues in the Schur decomposition computed within AED is very sensitive to such changes. In turn, the off-diagonal blocks are updated using completely different local copies of the orthogonal transformation matrices, leading to meaningless results. We have observed similar problems in cross-border bulge chasing and eigenvalue reordering. To avoid this, we use explicit communication rather than task duplication in the new implementation. For a moderate number of processors (e.g., p <= 100), the change in performance is negligible, while for a large number of processors, the performance can drop. For example, for computing the Schur decomposition of a 100,000 x 100,000 matrix on many processors, up to 25% performance drop has been observed when replacing task duplication with explicit communication.

2.6 Avoiding communication via data redistribution

As observed in [30], the parallel QR algorithm is not efficient at solving relatively small eigenvalue problems on many processors, due to excessive communication, to the point that the execution time actually increases when increasing the number of processors. Such a situation is regularly encountered when calculating the Schur decomposition of the AED window.
Therefore, an efficient parallel solver for this relatively small eigenvalue problem is of great interest. One possible approach to this problem is to use alternative algorithms, e.g., the spectral divide-and-conquer algorithm [7, 8]. In the following we propose a solution within the framework of the QR algorithm: a data redistribution strategy that reduces the communication overhead.

Recent work on communication avoiding algorithms [9, 8, 25, 28] usually focuses on the design of algorithms that can attain the theoretical lower bounds of the communication cost. A basic assumption in these theoretical analyses is that the data are nearly evenly distributed over the processors. Here we propose an alternative approach, which does not rely on this assumption and is especially useful for operations involving smaller submatrices. We first consider a simple and extreme case. Suppose there is one processor which has a large amount of local memory and very high clock speed.

Then by gathering all data to this processor, the problem can be solved without further communication. Once the computation is completed, the data are scattered back to their original owners. The total amount of communication does not exceed the cost of scattering and gathering, regardless of the complexity of the computational work. Although this simple idea does not work for large problems that cannot be stored on a single processor, it is still useful for smaller problems. For example, the AED process in the pipelined QR algorithm is implemented in such a manner, since we know in advance that the AED window is always sufficiently small such that the associated Schur decomposition can be solved efficiently in a sequential manner. By introducing the overhead of data redistribution, the total amount of communication as well as the execution time are reduced.

For larger problems, it is not feasible to solve them sequentially via data redistribution. Specifically, the AED window within the new parallel multishift QR algorithm usually becomes quite large, although much smaller compared to the whole matrix. In this case, we choose a subset of processors instead of a single processor to perform AED. The data redistribution is performed using the routine PDGEMR2D in ScaLAPACK; its overhead has been observed to be negligible relative to the AED process as a whole. The tunable parameter p_min = PILAENVX(ISPEC = 23) determines our heuristic strategy for choosing the number of processors for the redistribution. If min(p_r, p_c) > p_min + 1, we redistribute the AED window to a p_min x p_min processor grid and perform the calculations on this subset of processors. The same strategy is also applied if we need to compute shifts after an AED step. The default value for this parameter is p_min = ceil( n_AED / (n_b ceil(384/n_b)) ), implying that each processor needs to own at least 384 columns of the AED window. The constant 384 has been obtained via extensive numerical experiments on one of our target architectures. It certainly needs adjustment for optimal performance on other architectures.

2.7 Summary

Finally, we present pseudocode to summarize the discussion of our new parallel multishift QR algorithm. Algorithm 2 represents the parallel AED procedure used within the new multishift QR algorithm. Algorithm 3 provides pseudocode for the parallel multishift QR sweeps. For simplicity, the start-up and ending stages, i.e., bulge introduction and bulge annihilation, are not considered in the pseudocode. These algorithms are to be used within Algorithm 1 on distributed memory architectures.

3 Performance model

In this section, we briefly discuss the cost of computation and communication of the new parallel multishift QR algorithm for reducing a Hessenberg matrix to Schur form. The cost of accumulating the orthogonal transformations is taken into account. The derivation of the results presented here will be given in the electronic appendix. For simplicity, we consider a square processor grid, that is, p_r = p_c = sqrt(p). In addition, we assume that each processor contains reasonably many data blocks of the matrices, i.e., sqrt(p) n_b << n, so that the work load is balanced.

The parallel execution time consists of two main components: T_p = T_a + T_c, where T_a and T_c are the times for arithmetic operations and communication, respectively. The possibility of overlapping communication with computations is not taken into account.
By neglecting the communication between memory and cache lines inside one core, the serial runtime can be approximated by

    T_a = #(flops) / f(p) * gamma,

where gamma is the average time for performing one floating point operation and f(p) is the degree of concurrency. For the communication between two processors, we define alpha and beta as the start-up time (or communication latency) and the time for transferring one word without latency (or reciprocal of bandwidth), respectively. The time for a single point-to-point communication is modelled as alpha + L beta, where L is the message size in words. A one-to-all broadcast or an all-to-one reduction within a scope of p processors is assumed to take Theta(log p) steps.
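As a concrete illustration of how these parameters combine (our example, not taken from the paper), consider one off-diagonal update during bulge chasing: the accumulated n_b x n_b orthogonal factor V is broadcast within a process row of sqrt(p) processes and then multiplied into the local part of a block row of H, giving roughly

```latex
% Illustration (not from the paper): cost of broadcasting an n_b x n_b factor V
% within a process row and applying it to a block row of H that is shared by
% the sqrt(p) processes of that row.
\[
  T_c \approx \log_2\!\sqrt{p}\,\bigl(\alpha + n_b^{2}\,\beta\bigr),
  \qquad
  T_a \approx \frac{2\,n_b^{2}\,n}{\sqrt{p}}\,\gamma .
\]
```

The message count thus grows only logarithmically with p, while the arithmetic is level 3 work shared by the processes in the row; the models below are built from contributions of this kind.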

Algorithm 2 Parallel aggressive early deflation within the new multishift QR algorithm
Input: H in R^{n x n} is upper Hessenberg, with the AED window contained in H(i:j, i:j); Q in R^{n x n} is orthogonal.
Output: Updated Hessenberg matrix H <- U^T H U obtained after performing AED, and Q <- Q U, with U orthogonal.
1: Estimate the optimal process grid size, p_AED = p_min^2, based on n_AED = j - i + 1 and n_b.
2: if min(p_r, p_c) > p_min + 1 then
3:   p^_r <- p_min, p^_c <- p_min.
4: else
5:   p^_r <- p_r, p^_c <- p_c.
6: end if
7: Redistribute H_0 <- H(i:j, i:j) to a p^_r x p^_c process subgrid if necessary.
8: Compute the Schur decomposition H_0 = V_0 T_0 V_0^T on the p^_r x p^_c process subgrid.
9: Redistribute T <- T_0, V <- V_0 back to the original p_r x p_c process grid.
10: s <- H(i, i-1) V(1, :)^T.
11: repeat
12:   Check deflation for the bottommost n_b unchecked eigenvalues; undeflatable eigenvalues are moved to the top-left corner of this n_b x n_b block.
13:   Move all undeflatable eigenvalues within this group of n_b eigenvalues to the top-left corner of the AED window.
14:   Update s <- V_1^T s, T <- V_1^T T V_1 in parallel (on the original p_r x p_c process grid), where V_1 is the accumulated orthogonal matrix from Steps 12 and 13.
15: until all eigenvalues of T are tested
16: Eliminate the spike: V_2^T s = eta e_1; update T <- V_2^T T V_2 in parallel.
17: Reduce T to an upper Hessenberg matrix H_1 <- V_3^T T V_3 in parallel.
18: Set H(i, i-1) <- eta, H(i:j, i:j) <- H_1; update V <- V V_3 in parallel.
19: Update H(i:j, j+1:n) <- V^T H(i:j, j+1:n) in parallel.
20: Update H(1:i-1, i:j) <- H(1:i-1, i:j) V in parallel.
21: Update Q(1:n, i:j) <- Q(1:n, i:j) V in parallel.

Algorithm 3 Parallel multishift QR sweep (bulge chasing process)
Input: H in R^{n x n} is upper Hessenberg except for several tightly coupled bulge chains in the top left corner; Q in R^{n x n} is orthogonal.
Output: Updated matrix H <- U^T H U with the tightly coupled bulge chains moved to the bottom right corner, and Q <- Q U, where U is orthogonal and the new H is upper Hessenberg.
1: for each window w = (i : i + n_b - 1) in parallel do
2:   if (myrow, mycol) owns parts of (:, w) or (w, :) then
3:     if (myrow, mycol) owns (w, w) then
4:       Chase the bulge chain inside w down n_b/2 rows.
5:       Broadcast the local orthogonal matrix V in process row myrow.
6:       Broadcast the local orthogonal matrix V in process column mycol.
7:     else
8:       Receive V.
9:     end if
10:    Update H(w, i + n_b : n) <- V^T H(w, i + n_b : n) in parallel.
11:    Update H(1 : i - 1, w) <- H(1 : i - 1, w) V in parallel.
12:    Update Q(1 : n, w) <- Q(1 : n, w) V in parallel.
13:   end if
14: end for
15: for each odd-numbered window w = (j : j + n_b - 1) in parallel do
16:   if (myrow, mycol) owns parts of (:, w) or (w, :) then
17:     if (myrow, mycol) owns parts of (w, w) then
18:       Form a process subgrid G_w = {(0,0)_w, (0,1)_w, (1,0)_w, (1,1)_w}.
19:       Exchange data in G_w to build H(w, w) at (0,0)_w.
20:       if (myrow, mycol) = (0,0)_w then
21:         Chase the bulge chain inside w down n_b/2 rows.
22:         Send H(w, w) and the local orthogonal matrix V to the other processes in G_w.
23:       else
24:         Receive H(w, w) and V from (0,0)_w.
25:       end if
26:       if (myrow, mycol) = (0,0)_w or (myrow, mycol) = (1,1)_w then
27:         Broadcast V in process row myrow.
28:         Broadcast V in process column mycol.
29:       end if
30:     else
31:       Receive V.
32:     end if
33:     Exchange local parts of H(w, j + n_b : n), H(1 : j - 1, w), and Q(1 : n, w) with neighboring processes in parallel.
34:     Update H(w, j + n_b : n) <- V^T H(w, j + n_b : n) in parallel.
35:     Update H(1 : j - 1, w) <- H(1 : j - 1, w) V in parallel.
36:     Update Q(1 : n, w) <- Q(1 : n, w) V in parallel.
37:   end if
38: end for
39: for each even-numbered window w = (j : j + n_b - 1) in parallel do
40:   Analogous procedure as described for the odd case above.
41: end for

Let k_AED and k_sweep denote the number of AED steps and QR sweeps, respectively, performed by the new parallel multishift QR algorithm. We have k_sweep <= k_AED, since some QR sweeps are skipped when the percentage of deflated eigenvalues in the AED step is larger than the threshold (NIBBLE). When the number of undeflatable eigenvalues from AED is not sufficient for performing the next QR sweep, we need to calculate shifts from the trailing submatrix. The number of extra calls to the parallel Schur decomposition solver in this case is denoted by k_shift, which of course satisfies k_shift <= k_sweep. Given the constants k_AED, k_sweep, and k_shift, the execution time of the new parallel multishift QR algorithm is modelled as the sum of the corresponding phases:

    T_new(n, p) = k_AED T_AED(n, n_AED, p) + k_sweep T_sweep(n, n_shift, p) + k_shift T_shift(n, n_shift, p),

where T_AED, T_sweep, and T_shift are the runtimes for performing each phase once. For simplicity, it is assumed that n_AED and n_shift remain constant throughout the entire QR algorithm, and that all QR sweeps act on the entire matrix. We further assume that AED is always performed on a sqrt(p_AED) x sqrt(p_AED) processor grid, so that the property sqrt(p_AED) n_b << n_AED is also valid inside the AED window. The same assumption is made for the shift calculation phase. Typically, we have

    n_shift ~ (2/3) n_AED ~ n/C_1    and    n_AED / sqrt(p_AED) ~ n_shift / sqrt(p_shift) ~ C_2,

where C_1 and C_2 are constants (e.g., C_1 = 24, C_2 = 384). In practice k_AED, k_sweep, and k_shift can vary a lot for different matrices. We assume k_AED = Theta(n/n_AED) = Theta(C_1), which appears to be reasonable.

In the following we present performance models based on the assumptions above. Tiny terms, especially lower order terms with reasonably sized constants, are omitted. The derivation of these models can be found in the electronic appendix. By [24], the execution time of the pipelined QR algorithm is

    T_pipe(n, p) = Theta(n^3/p) gamma + Theta(n^2 log p / (sqrt(p) n_b)) alpha + Theta(n^3 / (sqrt(p) n_b)) beta,      (2)

provided that the average number of shifts required for deflating each eigenvalue is Theta(1). One QR sweep in our new parallel multishift QR algorithm roughly requires

    T_sweep(n, n_shift, p) ~ (36 n^2 n_shift / p) gamma + (9 n n_shift / (sqrt(p) n_b^2)) (log_2 sqrt(p) + 4) alpha + (9 n^2 n_shift / (sqrt(p) n_b)) beta

execution time. Therefore, under the same assumption on the convergence rate (i.e., k_sweep n_shift = Theta(n)), it can be shown that the execution time of the new parallel multishift QR algorithm without AED is

    T_new(n, p) = k_sweep T_sweep(n, n_shift, p) = Theta(n^3/p) gamma + Theta(n^2 log p / (sqrt(p) n_b^2)) alpha + Theta(n^3 / (sqrt(p) n_b)) beta.      (3)

It is interesting to make a comparison between (2) and (3). Both solvers have an ideal degree of concurrency. However, the tightly coupled shift strategy is superior to loosely coupled shifts, because it yields less frequent communication. The number of messages is reduced by a factor of Theta(n_b); in return, the average message length increases correspondingly. Another important observation is that the serial term in T_pipe assumes level 3 performance, which is actually not the case. This already explains why the pipelined QR algorithm is usually much slower than the new parallel multishift QR algorithm for larger matrices, even when neglecting the effects of AED.

Taking AED into account makes the model significantly more complicated. The execution times for AED and for computing shifts, T_AED(n, n_AED, p) and T_shift(n, n_shift, p), are estimated by analogous but considerably lengthier expressions in terms of C_1, C_2, n_b, p, and the machine parameters gamma, alpha, and beta; they are derived in the electronic appendix. To provide some intuition, we assign concrete values to most parameters. For example, let us set C_1 = 24, C_2 = 384, and assume k_AED = 2 k_sweep = 16 k_shift = 64. Then

    T_sweep(n, n_shift, p) ~ (3n^3/(2p)) gamma + (3n^2/(8 sqrt(p) n_b^2)) (log_2 sqrt(p) + 4) alpha + (3n^3/(8 sqrt(p) n_b)) beta,
    T_AED(n, n_AED, p)     ~ (184320 n + n^2 n_b/(64 sqrt(p)) + 5n^3/(256 p)) gamma + (144 n/n_b^2) (log_2 n - 14) alpha
                             + (144 n log_2 n + (51 log_2 p + 42) n^2/(3072 sqrt(p))) beta,                                         (4)
    T_shift(n, n_shift, p) ~ 61440 n gamma + (96 n/n_b^2) (log_2 n - 13) alpha + 96 n (log_2 n - 13) beta.

This yields the following overall estimate for the new parallel QR algorithm with AED:

    T_new(n, p) ~ (48 n^3/p) gamma + (12 n^2 log_2 p / (sqrt(p) n_b^2)) alpha + (12 n^3/(sqrt(p) n_b) + n^2 log_2 p/(16 sqrt(p))) beta      (5)
               = Theta(n^3/p) gamma + Theta(n^2 log p / (sqrt(p) n_b^2)) alpha + Theta(n^3 / (sqrt(p) n_b)) beta,                          (6)

where most small-order terms are neglected.

It turns out that both QR sweeps and AED have a significant serial runtime when n is not very large. However, QR sweeps usually dominate the communication cost. As a consequence, the models (2), (3), and (6) nearly have the same asymptotic behavior. AED is asymptotically not more expensive than the QR sweeps, and hence it does not represent a computational bottleneck for larger matrices. Combined with the convergence acceleration often observed when using AED (and not fully captured by the model above), this contributes to the superior performance of the new parallel multishift QR algorithm.

4 Other Implementation Issues

4.1 Calling sequence

The calling sequence of the newly developed routine PDHSEQR is nearly identical with that of the LAPACK routine DHSEQR, see Figure 5. Apart from the need of a descriptor for each globally distributed matrix instead of the leading dimension for each local matrix, the only difference is that PDHSEQR requires an extra integer workspace. The calling sequence of the ScaLAPACK routine PDLAHQR is also similar, hopefully allowing an easy switch from PDLAHQR and DHSEQR in existing software making use of ScaLAPACK. In practice, it is advisable to call PDHSEQR twice: one call for the workspace query (by setting LWORK=-1) and another call for actually doing the computation. This follows the convention of many LAPACK/ScaLAPACK routines that make use of workspace.

      SUBROUTINE PDHSEQR( JOB, COMPZ, N, ILO, IHI, H, DESCH, WR, WI, Z,
     $                    DESCZ, WORK, LWORK, IWORK, LIWORK, INFO )
*     .. Scalar Arguments ..
      INTEGER            IHI, ILO, INFO, LWORK, LIWORK, N
      CHARACTER          COMPZ, JOB
*     ..
*     .. Array Arguments ..
      INTEGER            DESCH( * ), DESCZ( * ), IWORK( * )
      DOUBLE PRECISION   H( * ), WI( N ), WORK( * ), WR( N ), Z( * )

      SUBROUTINE DHSEQR( JOB, COMPZ, N, ILO, IHI, H, LDH, WR, WI, Z,
     $                   LDZ, WORK, LWORK, INFO )

      SUBROUTINE PDLAHQR( WANTT, WANTZ, N, ILO, IHI, A, DESCA, WR, WI,
     $                    ILOZ, IHIZ, Z, DESCZ, WORK, LWORK, IWORK,
     $                    ILWORK, INFO )

Figure 5: Calling sequences of the newly developed routine PDHSEQR, the corresponding LAPACK routine DHSEQR, and the ScaLAPACK routine PDLAHQR.

4.2 Tuning parameters

In the new software for the parallel multishift QR algorithm, tunable parameters are defined in the routine PILAENVX. They are available via the function call PILAENVX(ICTXT, ISPEC, ...) with 12 <= ISPEC <= 23. A complete list of these parameters is provided in Table 2. Some of them require fine tuning to attain nearly optimal performance across different architectures.

Although a reasonable choice of n_b, the data layout block size, is important, we have observed the performance to be not overly sensitive to this choice. On the one hand, n_b should be large enough so that the local computations can achieve level 3 performance. On the other hand, it is advisable to avoid n_b being too large. A large value of n_b harms the load balance and increases the overhead in the start-up and ending stages of the bulge chasing process, especially when computing the Schur decomposition of the AED window.

In our performance model, we always assume n_b <= n/sqrt(p) to avoid such overhead. For many architectures, n_b in [32, 128] will offer a good choice.

Table 2: List of tunable parameters.

  ISPEC  Name     Description                                                        Recommended value
  12     n_min    Crossover point between PDLAQR0 and PDLAQR1                        220 min(p_r, p_c)
  13     n_AED    Size of the AED window                                             See Table 1
  14     NIBBLE   Threshold for skipping a multishift QR sweep                       See Section 2.4
  15     n_shift  Number of simultaneous shifts                                      See Table 1
  16     KACC22   Specification of how to update off-diagonal blocks in the          Use GEMM/TRMM
                  multishift QR sweep
  17     NUMWIN   Maximum number of concurrent computational windows (for both       min(p_r, p_c, n/n_b)
                  QR sweep and eigenvalue reordering)
  18     WINEIG   Number of eigenvalues in each window (for eigenvalue reordering)   min(n_b/2, 40)
  19     WINSIZE  Computational window size (for both bulge-chasing and              min(n_b, 80)
                  eigenvalue reordering)
  20     MMULT    Minimal percentage of flops for performing GEMM instead of         50
                  pipelined Householder reflections when updating the off-diagonal
                  blocks in the eigenvalue reordering routine
  21     NCB      Width of block column slabs for rowwise update of Householder      min(n_b, 32)
                  reflections in factorized form
  22     WNEICR   Maximum number of eigenvalues to move over a block border in       Identical to WINEIG
                  the eigenvalue reordering routine
  23     p_min    Size of the processor grid involved in AED                         See Section 2.6

We expect that most of the recommended values in Table 2 yield reasonable performance on existing architectures. However, the parameters n_min, p_min, and NIBBLE require some extra care, as the performance of AED crucially relies on them. The values of these parameters need to be determined by performing a set of test runs: To determine n_min and p_min, it is advisable to use the typical sizes of the AED windows (see Table 1) and run tests on different processor grids. Then the optimal values for both n_min and p_min can be chosen by examining the number of columns of H owned by each processor. NIBBLE should be tuned last, once all other parameters are fixed. Auto-tuning tools are provided in the software for tuning these three parameters. We refer to the User's Guide [22] for details of the tuning procedure. Tuning NIBBLE is time-consuming but highly recommended, especially on older architectures with relatively slow communication. As discussed in Section 2.4, NIBBLE = a n^b p^c is a reasonably good model that takes into account both n and p. We have observed reasonable performance behavior of this model for large problems up to size n = 100,000 and for large numbers of processors.

5 Computational Experiments

We have performed a large set of computational experiments on Akka [27] and Abisko [26], hosted by the High Performance Computing Center North (HPC2N), and on Bellatrix [15], hosted by Ecole Polytechnique Federale de Lausanne. In this section, we present a subset of these computational experiments to confirm and demonstrate the improvements we have made in the parallel QR algorithm. The computational environments are summarized in Table 3. We compare the following implementations:

S-v180   Pipelined QR algorithm in ScaLAPACK v1.8.0.
SISC     Previous implementation of the parallel multishift QR algorithm with AED, as described in [21].
NEW      New implementation of the parallel multishift QR algorithm with AED, as described in this paper.

Table 3: Computational environments.

  Akka       64-bit Intel Xeon (Harpertown) Linux cluster;
             672 dual socket nodes with L5420 quad-core 2.5 GHz processors and 16 GB RAM per node;
             Cisco InfiniBand and Gigabit Ethernet, 10 Gbps bandwidth;
             PathScale compiler, OpenMPI 1.4.4, LAPACK 3.4.0, GotoBLAS.
  Abisko     64-bit AMD Opteron (Interlagos) Linux cluster;
             322 nodes with four Opteron 2.6 GHz processors and 128 GB RAM per node;
             Mellanox 4X QSFP InfiniBand connectivity, 40 Gbps bandwidth;
             PathScale compiler, OpenMPI 1.6.4, LAPACK 3.4.0, OpenBLAS 0.1 alpha.
  Bellatrix  64-bit Intel Xeon (Sandy Bridge) Linux cluster;
             424 dual socket nodes with octa-core 2.2 GHz processors and 32 GB RAM per node;
             QLogic InfiniBand QDR 2:1 connectivity, 40 Gbps bandwidth;
             Intel compiler, Intel MPI version 4.1.0, Intel Math Kernel Library version 11.0.

The implementation NEW improves upon the implementation SISC in terms of robustness and performance, in particular because of the proposed modifications to the AED step. The data layout block size n_b = 50 is used for all experiments. No multithreaded features (such as OpenMP or threaded BLAS) are used. Therefore the number of processors (p = p_r p_c) equals the number of cores involved in the computation.

5.1 Random matrices

First, we consider two types of random matrices: fullrand and hessrand [21]. Matrices of the type fullrand are dense square matrices with all entries randomly generated from a uniform distribution in [0, 1]. We call the ScaLAPACK routine PDGEHRD to reduce them to upper Hessenberg form before applying the QR algorithm. Only the time for the QR algorithm (i.e., reducing the upper Hessenberg matrix to Schur form) is measured. These matrices usually have well-conditioned eigenvalues and exhibit regular convergence behavior. Matrices of the type hessrand are upper Hessenberg matrices whose nonzero entries are randomly generated from a uniform distribution in [0, 1]. The eigenvalues of these matrices are extremely ill-conditioned for larger n, affecting the convergence behavior of the QR sweeps [21]. On the other hand, AED often deflates a high fraction of eigenvalues in the AED window for such matrices. These properties sometimes cause erratic convergence rates.

Tables 4 and 5 show the parallel execution times of the three solvers on Akka. Both the real Schur form T and the orthogonal transformation matrix Z are calculated. We limit the total execution time (including the Hessenberg reduction) to 10 hours for each individual problem, to avoid excessive use of the computational resources. An omitted entry corresponds to an execution time larger than 10 hours. These tables reveal that our new version of PDHSEQR (i.e., NEW) always improves the performance compared to the SISC version. On average, the improvement is 31% for matrices of type fullrand and 14 times for matrices of type hessrand. Not surprisingly, the improvements compared to PDLAHQR in ScaLAPACK v1.8.0 are even more significant.

Table 4: Execution time (in seconds) on Akka for fullrand matrices.

Table 5: Execution time (in seconds) on Akka for hessrand matrices.

The convergence rates for fullrand are sufficiently regular, so that we can analyze the scalability of the parallel multishift QR algorithm. If we fix the memory load per core to n/sqrt(p) = 4000, the execution times in Table 4 satisfy 2T(n, p) < T(2n, 4p) < 4T(n, p), indicating that the parallel multishift QR algorithm scales reasonably well but not perfectly.

To verify the performance models derived in Section 3, we use (4) together with the measured values of k_AED, k_sweep, and k_shift to predict the execution time. Since k_shift = 0 is observed for all these examples, there are only two components in the total execution time (i.e., T = k_sweep T_sweep + k_AED T_AED). Figure 6(a) illustrates that the predicted execution times underestimate the measured ones (especially for AED), mainly due to too optimistic choices of the parameters (alpha, beta, gamma). If we assume that the program executes at 40% and 7.5% of the peak core performance for level 3 and level 1-2 BLAS operations, respectively, the calibrated model fits the actual execution time quite well for large matrices (see Figure 6(b)). Since the model(s) are asymptotic, the results are very satisfactory.

For hessrand, it is observed that most eigenvalues are deflated with very few (or even no) QR sweeps. Considering that the main difference in PDHSEQR between versions NEW and SISC is in the AED process, it is not surprising to see the great convergence acceleration for hessrand, where AED dominates the calculation. In Table 5, sometimes the execution time of the new parallel multishift QR algorithm does not change much when increasing the number of processors. This is mainly because the Schur decomposition of the AED window, which is the most expensive part of the algorithm, is performed by a constant number of processors (p_min x p_min <= p) after data redistribution.

In Tables 6-9 we list the execution times obtained on Abisko and Bellatrix. The observations are similar to those obtained on Akka. Therefore, in the rest of this section we only present experiments on Akka for economical consideration.

5.2 100,000 x 100,000 matrices

The modifications proposed and presented result in dramatic improvements for settings with very large matrices and many processors. To demonstrate this, we present the obtained execution times for 100,000 x 100,000 matrices in Table 10. Although the QR algorithm does not scale as well as the Hessenberg reduction when fixing the problem size and increasing the number of processors, the execution times of these two reduction steps are still of the same order of magnitude. With the help of the dynamic NIBBLE strategy, the fraction of the execution time spent on AED for fullrand matrices is under control. In contrast to our earlier implementation, AED is not a

Figure 6: Comparison between the measured execution times and the predicted times using (4) (fullrand, n/sqrt(p) = 4000). Both panels show the measured and predicted times for the total execution, the QR sweeps, and AED as functions of the matrix size N. The original model (a, left) uses theoretical values of (alpha, beta, gamma) according to the hardware information, while the calibrated one (b, right) adjusts gamma according to different computational intensities (level 1, 2, and 3).

Table 6: Execution time (in seconds) on Abisko for fullrand matrices.

Table 7: Execution time (in seconds) on Abisko for hessrand matrices.

Table 8: Execution time (in seconds) on Bellatrix for fullrand matrices.
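As a usage illustration related to the calling sequence of Section 4.1 (this sketch is ours, not part of the library or the paper; the wrapper name is hypothetical, and we assume that LIWORK = -1 triggers the integer workspace query analogously to LWORK = -1), PDHSEQR is typically invoked in two calls:

```fortran
      SUBROUTINE CALLNEW( N, H, DESCH, Z, DESCZ, WR, WI, INFO )
*     Hypothetical driver (not part of the library): reduce a distributed
*     upper Hessenberg matrix H to real Schur form T (overwriting H) and
*     accumulate the orthogonal transformations in Z, using the two-call
*     workspace-query convention described in Section 4.1.  The job flags
*     'S' and 'V' follow the DHSEQR convention for JOB and COMPZ.
      INTEGER            N, DESCH( * ), DESCZ( * ), INFO
      DOUBLE PRECISION   H( * ), Z( * ), WR( N ), WI( N )
      INTEGER            LWORK, LIWORK
      INTEGER            ITMP( 1 )
      DOUBLE PRECISION   WTMP( 1 )
      INTEGER, ALLOCATABLE :: IWORK( : )
      DOUBLE PRECISION, ALLOCATABLE :: WORK( : )
      EXTERNAL           PDHSEQR
*     Workspace query: only the required workspace sizes are returned.
      CALL PDHSEQR( 'S', 'V', N, 1, N, H, DESCH, WR, WI, Z, DESCZ,
     $              WTMP, -1, ITMP, -1, INFO )
      LWORK  = INT( WTMP( 1 ) )
      LIWORK = ITMP( 1 )
      ALLOCATE( WORK( LWORK ), IWORK( LIWORK ) )
*     Actual computation of the real Schur form and of Z.
      CALL PDHSEQR( 'S', 'V', N, 1, N, H, DESCH, WR, WI, Z, DESCZ,
     $              WORK, LWORK, IWORK, LIWORK, INFO )
      DEALLOCATE( WORK, IWORK )
      RETURN
      END
```

On exit, H holds the quasi-triangular factor and WR/WI the computed eigenvalues, following the DHSEQR convention assumed above.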


More information

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for

More information

11.5 Reduction of a General Matrix to Hessenberg Form

11.5 Reduction of a General Matrix to Hessenberg Form 476 Chapter 11. Eigensystems 11.5 Reduction of a General Matrix to Hessenberg Form The algorithms for symmetric matrices, given in the preceding sections, are highly satisfactory in practice. By contrast,

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Arnoldi Methods in SLEPc

Arnoldi Methods in SLEPc Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,

More information

NAG Library Routine Document F08PNF (ZGEES)

NAG Library Routine Document F08PNF (ZGEES) F08 Least-squares and Eigenvalue Problems (LAPACK) F08PNF NAG Library Routine Document F08PNF (ZGEES) Note: before using this routine, please read the Users Note for your implementation to check the interpretation

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES #A50 INTEGERS 14 (014) ON RATS SEQUENCES IN GENERAL BASES Johann Thiel Dept. of Mathematics, New York City College of Technology, Brooklyn, New York jthiel@citytech.cuny.edu Received: 6/11/13, Revised:

More information

Exponentials of Symmetric Matrices through Tridiagonal Reductions

Exponentials of Symmetric Matrices through Tridiagonal Reductions Exponentials of Symmetric Matrices through Tridiagonal Reductions Ya Yan Lu Department of Mathematics City University of Hong Kong Kowloon, Hong Kong Abstract A simple and efficient numerical algorithm

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

Solving large scale eigenvalue problems

Solving large scale eigenvalue problems arge scale eigenvalue problems, Lecture 4, March 14, 2018 1/41 Lecture 4, March 14, 2018: The QR algorithm http://people.inf.ethz.ch/arbenz/ewp/ Peter Arbenz Computer Science Department, ETH Zürich E-mail:

More information

Parallelization of the Molecular Orbital Program MOS-F

Parallelization of the Molecular Orbital Program MOS-F Parallelization of the Molecular Orbital Program MOS-F Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi November 2003 Fujitsu Laboratories of

More information

Two-Dimensional Linear Discriminant Analysis

Two-Dimensional Linear Discriminant Analysis Two-Dimensional Linear Discriminant Analysis Jieping Ye Department of CSE University of Minnesota jieping@cs.umn.edu Ravi Janardan Department of CSE University of Minnesota janardan@cs.umn.edu Qi Li Department

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Algorithms Notes for 2016-10-31 There are several flavors of symmetric eigenvalue solvers for which there is no equivalent (stable) nonsymmetric solver. We discuss four algorithmic ideas: the workhorse

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006. LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

CS 542G: Conditioning, BLAS, LU Factorization

CS 542G: Conditioning, BLAS, LU Factorization CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles

More information

Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis

Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis DAVID KATOSHEVSKI Department of Biotechnology and Environmental Engineering Ben-Gurion niversity of the Negev

More information

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai Divide-and-Conquer Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, 33.4. CSE 6331 Algorithms Steve Lai Divide and Conquer Given an instance x of a prolem, the method works as follows: divide-and-conquer

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Double-shift QR steps

Double-shift QR steps Double-shift QR steps Week 11: Monday, Oct 29 Last time, we discussed the Wilkinson strategy of choosing as a shift one of the roots of the trailing 2-by-2 submatrix of A (k) (the one closest to the final

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

FPGA Implementation of a Predictive Controller

FPGA Implementation of a Predictive Controller FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan

More information

Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI *

Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * Solving the Inverse Toeplitz Eigenproblem Using ScaLAPACK and MPI * J.M. Badía and A.M. Vidal Dpto. Informática., Univ Jaume I. 07, Castellón, Spain. badia@inf.uji.es Dpto. Sistemas Informáticos y Computación.

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Eigenvalue problems. Eigenvalue problems

Eigenvalue problems. Eigenvalue problems Determination of eigenvalues and eigenvectors Ax x, where A is an N N matrix, eigenvector x 0, and eigenvalues are in general complex numbers In physics: - Energy eigenvalues in a quantum mechanical system

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized

More information

Simple Examples. Let s look at a few simple examples of OI analysis.

Simple Examples. Let s look at a few simple examples of OI analysis. Simple Examples Let s look at a few simple examples of OI analysis. Example 1: Consider a scalar prolem. We have one oservation y which is located at the analysis point. We also have a ackground estimate

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Structuring Unreliable Radio Networks

Structuring Unreliable Radio Networks Structuring Unreliale Radio Networks Keren Censor-Hillel Seth Gilert Faian Kuhn Nancy Lynch Calvin Newport March 29, 2011 Astract In this paper we study the prolem of uilding a connected dominating set

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

Parallel Eigensolver Performance on High Performance Computers 1

Parallel Eigensolver Performance on High Performance Computers 1 Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Matrices, Moments and Quadrature, cont d

Matrices, Moments and Quadrature, cont d Jim Lambers CME 335 Spring Quarter 2010-11 Lecture 4 Notes Matrices, Moments and Quadrature, cont d Estimation of the Regularization Parameter Consider the least squares problem of finding x such that

More information

Math 504 (Fall 2011) 1. (*) Consider the matrices

Math 504 (Fall 2011) 1. (*) Consider the matrices Math 504 (Fall 2011) Instructor: Emre Mengi Study Guide for Weeks 11-14 This homework concerns the following topics. Basic definitions and facts about eigenvalues and eigenvectors (Trefethen&Bau, Lecture

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

SOLUTION of linear systems of equations of the form:

SOLUTION of linear systems of equations of the form: Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael

More information

NAG Library Routine Document F08QUF (ZTRSEN)

NAG Library Routine Document F08QUF (ZTRSEN) F08 Least-squares and Eigenvalue Problems (LAPACK) F08QUF NAG Library Routine Document F08QUF (ZTRSEN) Note: before using this routine, please read the Users Note for your implementation to check the interpretation

More information

Chaos and Dynamical Systems

Chaos and Dynamical Systems Chaos and Dynamical Systems y Megan Richards Astract: In this paper, we will discuss the notion of chaos. We will start y introducing certain mathematical concepts needed in the understanding of chaos,

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

Generalized interval arithmetic on compact matrix Lie groups

Generalized interval arithmetic on compact matrix Lie groups myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University

More information

Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots

Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots Kepler s Law Level Upper secondary Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots Description and Rationale Many traditional mathematics prolems

More information

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations! Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:

More information

Numerical Solution Techniques in Mechanical and Aerospace Engineering

Numerical Solution Techniques in Mechanical and Aerospace Engineering Numerical Solution Techniques in Mechanical and Aerospace Engineering Chunlei Liang LECTURE 3 Solvers of linear algebraic equations 3.1. Outline of Lecture Finite-difference method for a 2D elliptic PDE

More information

Parallel Programming. Parallel algorithms Linear systems solvers

Parallel Programming. Parallel algorithms Linear systems solvers Parallel Programming Parallel algorithms Linear systems solvers Terminology System of linear equations Solve Ax = b for x Special matrices Upper triangular Lower triangular Diagonally dominant Symmetric

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen

The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen and P. C. Hansen Department of Mathematical Modelling,

More information

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints

An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints An Active Set Strategy for Solving Optimization Problems with up to 200,000,000 Nonlinear Constraints Klaus Schittkowski Department of Computer Science, University of Bayreuth 95440 Bayreuth, Germany e-mail:

More information

Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD

Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD Yuji Nakatsukasa PhD dissertation University of California, Davis Supervisor: Roland Freund Householder 2014 2/28 Acknowledgment

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

Roundoff Error. Monday, August 29, 11

Roundoff Error. Monday, August 29, 11 Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate

More information

High Relative Precision of Eigenvalues Calculated with Jacobi Methods

High Relative Precision of Eigenvalues Calculated with Jacobi Methods High Relative Precision of Eigenvalues Calculated with Jacobi Methods ANA JULIA VIAMONE*, RUI RALHA ** *Departamento de Inovação, Ciência e ecnologia Universidade Portucalense Rua Dr. Ant. Bernardino de

More information

Aasen s Symmetric Indefinite Linear Solvers in LAPACK

Aasen s Symmetric Indefinite Linear Solvers in LAPACK Aasen s Symmetric Indefinite Linear Solvers in LAPACK Ichitaro Yamazaki and Jack Dongarra December 2, 217 Abstract Recently, we released two LAPACK subroutines that implement Aasen s algorithms for solving

More information

Out-of-Core SVD and QR Decompositions

Out-of-Core SVD and QR Decompositions Out-of-Core SVD and QR Decompositions Eran Rabani and Sivan Toledo 1 Introduction out-of-core singular-value-decomposition algorithm. The algorithm is designed for tall narrow matrices that are too large

More information

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model Luis Manuel Santana Gallego 100 Appendix 3 Clock Skew Model Xiaohong Jiang and Susumu Horiguchi [JIA-01] 1. Introduction The evolution of VLSI chips toward larger die sizes and faster clock speeds makes

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information

1 Systems of Differential Equations

1 Systems of Differential Equations March, 20 7- Systems of Differential Equations Let U e an open suset of R n, I e an open interval in R and : I R n R n e a function from I R n to R n The equation ẋ = ft, x is called a first order ordinary

More information

Intel Math Kernel Library (Intel MKL) LAPACK

Intel Math Kernel Library (Intel MKL) LAPACK Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue

More information

Maximum-weighted matching strategies and the application to symmetric indefinite systems

Maximum-weighted matching strategies and the application to symmetric indefinite systems Maximum-weighted matching strategies and the application to symmetric indefinite systems by Stefan Röllin, and Olaf Schenk 2 Technical Report CS-24-7 Department of Computer Science, University of Basel

More information

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline

More information