
J. KSIAM, 2009

PERFORMANCE ENHANCEMENT OF PARALLEL MULTIFRONTAL SOLVER ON BLOCK LANCZOS METHOD

Wanil BYUN and Seung Jo KIM

School of Mechanical and Aerospace Engineering, Seoul National University, South Korea
E-mail address: wibyun@aeroguy.snu.ac.kr

Flight Vehicle Research Center, Seoul National University, South Korea
E-mail address: sjkim@snu.ac.kr

ABSTRACT. IPSAP, a finite element analysis program, has been developed for highly parallel performance computing. The program consists of various analysis modules (stress, vibration, thermal analysis, etc.). The M-orthogonal block Lanczos algorithm with shift-invert transformation is used for solving eigenvalue problems in the vibration module, and the multifrontal algorithm, one of the most efficient direct linear equation solvers, is applied to the factorization and triangular system solving phases inside this block Lanczos iteration routine. In this study, the performance enhancement of IPSAP is composed of the following stages: 1) minimization of the communication volume of the factorization phase by modifying the parallel matrix subroutines; 2) minimization of the idling time of the triangular system solving phase by partial inversion of the frontal matrix and the LCM (least common multiple) concept.

Received by the editors December 2008; accepted February 2009.
2000 Mathematics Subject Classification: 68W10, 68W40.
Key words and phrases: Block Lanczos Method, Parallel Performance Enhancement, Eigenvalue Problem, Multifrontal Algorithm, IPSAP.

1. INTRODUCTION

The structural vibration analysis considered in IPSAP obtains the eigenvalues and eigenvectors of an eigenproblem whose symmetric positive semi-definite stiffness and mass matrices come from the finite element method. The eigenvalue problem of structural vibration is of the form

    K x = λ M x                                                      (1)

Since the stiffness and the mass matrix are symmetric and semi-definite, the well-known Lanczos iteration is applicable to Equation (1). Strictly speaking, the shift-invert transformation of Equation (2) should be used in the presence of the mass matrix M:

    L^T (K - σ M)^{-1} L y = α y,   where M = L L^T and y = L^T x    (2)

The shifted eigenvalue α is related to the original one by α = 1/(λ - σ).
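As a minimal dense sketch of the transformation (2) and of the relation α = 1/(λ - σ), the following NumPy/SciPy snippet uses small random SPD matrices standing in for K and M; it is illustrative only and not the parallel IPSAP implementation.

    import numpy as np
    from scipy.linalg import cholesky, eigh, solve

    rng = np.random.default_rng(0)
    n = 50
    A = rng.standard_normal((n, n))
    K = A @ A.T + n * np.eye(n)      # stands in for the stiffness matrix (SPD here)
    G = rng.standard_normal((n, n))
    M = G @ G.T + n * np.eye(n)      # stands in for the mass matrix (SPD here)

    sigma = -1.0                     # shift chosen below the spectrum so K - sigma*M stays SPD
    L = cholesky(M, lower=True)      # M = L L^T

    # transformed operator of equation (2):  T = L^T (K - sigma*M)^{-1} L
    T = L.T @ solve(K - sigma * M, L)
    alpha = eigh(T, eigvals_only=True)

    # recover the original eigenvalues via lambda = sigma + 1/alpha and compare
    lam_recovered = np.sort(sigma + 1.0 / alpha)
    lam_reference = eigh(K, M, eigvals_only=True)      # direct generalized eigenvalues
    print(np.allclose(lam_recovered, lam_reference))   # expected: True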

The matrix L is the lower triangular Cholesky factor of the mass matrix, M = L L^T. The presence of (K - σ M)^{-1} requires a linear equation solver during the Lanczos iterations, and the presence of L^T implies that the Lanczos basis should be M-orthogonal. It is notable that the restarted Lanczos iteration [1] has been proposed as an alternative to the shift-invert transformation.

2. ALGORITHM

In the shift-invert Lanczos method, a linear equation solver is needed to solve accurately a linear system with the coefficient matrix K - σ M. One of the reliable means is a direct method [2]. Nowadays, owing to the scalability and memory bottlenecks of parallel direct solvers, iterative linear equation solvers and reduction methods are gaining interest [3, 4]. However, since the linear equation solver must be more accurate than the desired accuracy of the eigenvalues, iterative solvers may be less effective despite their excellent parallel scalability and memory efficiency. In the IPSAP solver, we implemented a parallel block Lanczos eigensolver using the M-orthogonal iteration equipped with a direct linear equation solver (Algorithm 2.1).

Algorithm 2.1. M-orthogonal block Lanczos algorithm with shift-invert transformation

Let V_0 be a set of initial vectors with V_0^T M V_0 = I, and set j = 0.
while (number of required eigenvalues > number of converged eigenvalues)
    step 1: U_j = M V_j
    step 2: solve (K - σ M) W_j = U_j for W_j
    step 3: W_j = W_j - V_{j-1} B_{j-1}^T
    step 4: C_j = V_j^T M W_j
    step 5: W_j = W_j - V_j C_j
    step 6: W_j = V_{j+1} B_j, QR factorize for V_{j+1}
    step 7: compute the eigenvalues of T_j, j = j + 1
    step 8: re-orthogonalize V_{j+1} against V_i, i = 0, ..., j
end

Step 2 of Algorithm 2.1 is the linear equation solving procedure, which takes most of the time consumed in structural vibration analysis using the Lanczos iteration. The parallel multifrontal solver was chosen as the linear equation solver. The algorithm of this solver can be regarded as a FEM-oriented version of the conventional multifrontal solver: it does not require a globally assembled stiffness matrix, and the element connectivity of the finite element mesh is utilized as the graph information. Therefore, the assembly of the element matrices takes place automatically during the factorization phase. The parallel factorization implemented in the multifrontal solver is the well-known Cholesky factorization (Algorithm 2.2).
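The following is a minimal serial sketch of Algorithm 2.1 in NumPy/SciPy. It is illustrative only: the algebra is dense, a Cholesky factorization of K - σM stands in for the parallel multifrontal solver, the block size p, the fixed step count and the test matrices are made-up parameters, and the convergence test of the while loop is omitted.

    import numpy as np
    from scipy.linalg import cholesky, cho_factor, cho_solve, eigh, solve_triangular


    def m_orthonormalize(W, M):
        """Return V, B with W = V @ B, V^T M V = I and B upper triangular."""
        S = W.T @ M @ W
        B = cholesky(S)                                  # S = B^T B, B upper triangular
        V = solve_triangular(B, W.T, trans='T').T        # V = W B^{-1}
        return V, B


    def block_lanczos_shift_invert(K, M, sigma, p=4, steps=10, nev=5):
        """Serial, dense sketch of the M-orthogonal block Lanczos iteration
        with shift-invert transformation (cf. Algorithm 2.1)."""
        n = K.shape[0]
        KsM = cho_factor(K - sigma * M)      # stands in for the multifrontal factorization
        rng = np.random.default_rng(1)
        V, _ = m_orthonormalize(rng.standard_normal((n, p)), M)   # V_0 with V_0^T M V_0 = I
        basis, C_blocks, B_blocks = [V], [], []
        V_prev = B_prev = None
        for j in range(steps):
            U = M @ V                                    # step 1: U_j = M V_j
            W = cho_solve(KsM, U)                        # step 2: (K - sigma M) W_j = U_j
            if V_prev is not None:
                W = W - V_prev @ B_prev.T                # step 3: W_j -= V_{j-1} B_{j-1}^T
            C = V.T @ (M @ W)                            # step 4: C_j = V_j^T M W_j
            W = W - V @ C                                # step 5: W_j -= V_j C_j
            for Vi in basis:                             # full re-orthogonalization (cf. step 8)
                W = W - Vi @ (Vi.T @ (M @ W))
            V_next, B = m_orthonormalize(W, M)           # step 6: W_j = V_{j+1} B_j
            C_blocks.append(C)
            B_blocks.append(B)
            V_prev, B_prev, V = V, B, V_next
            basis.append(V_next)
        # step 7: eigenvalues of the block tridiagonal T_j, mapped back by lambda = sigma + 1/theta
        k = len(C_blocks)
        T = np.zeros((k * p, k * p))
        for j, C in enumerate(C_blocks):
            T[j * p:(j + 1) * p, j * p:(j + 1) * p] = C
        for j, B in enumerate(B_blocks[:-1]):
            T[(j + 1) * p:(j + 2) * p, j * p:(j + 1) * p] = B
            T[j * p:(j + 1) * p, (j + 1) * p:(j + 2) * p] = B.T
        theta = eigh(T, eigvals_only=True)
        return np.sort(sigma + 1.0 / theta)[:nev]


    # made-up dense test problem (not a finite element model)
    rng = np.random.default_rng(0)
    n = 300
    A = rng.standard_normal((n, n)); K = A @ A.T + n * np.eye(n)
    G = rng.standard_normal((n, n)); M = G @ G.T + n * np.eye(n)
    print(block_lanczos_shift_invert(K, M, sigma=0.0))   # approximate smallest eigenvalues
    print(eigh(K, M, eigvals_only=True)[:5])             # dense reference values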

It is noteworthy that the factorization needs to be conducted only once, before the Lanczos loop, if the shift is not varied.

Algorithm 2.2. Parallel sparse Cholesky factorization

for i = 0, 1, ..., N
    K = [ K11  sym. ; K21  K22 ]
    factorize(K11, K21), update(K22)
    j = i
    while rem(j, 2) = 1
        extend_add(K22) with another domain branch in the tree
        factorize(K11, K21), update(K22)
        j = j / 2
    end
end
(parallel procedure)
n = 1
while n < N
    extend_add(K22) with another processor branch in the tree
    factorize(K11, K21), update(K22)
    n = 2 n
end

In matrix form, the Cholesky factorization factorize(K11, K21) and the update of the dense matrix update(K22) can be written in three steps:

    K11 = L11 L11^T                                                  (3)
    K21 ← K21 L11^{-T}                                               (4)
    K22 ← K22 - K21 K21^T                                            (5)

Equations (3)-(5) are based on the assumption that the lower part of the symmetric frontal matrix is active and that the sub-matrix K22 is allocated according to the memory structure in Figure 1. The feature of the frontal matrix structure in Figure 1 is that the sub-matrix K22 is assigned first, at the initial position of the allocated memory. Such a structure of the frontal matrix makes the extend_add(K22) operation easy to implement, because K22 always remains at the head of the allocated memory. If the extend_add operation is implemented appropriately, the major communication overhead of the proposed algorithm is caused by the parallel dense matrix operations factorize(K11, K21) and update(K22).
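A dense serial sketch of the three steps (3)-(5) on a single frontal matrix is given below (NumPy/SciPy; block sizes and data are made up, and the parallel panel distribution and the extend_add assembly are omitted).

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(0)
    n1, n2 = 6, 4                                  # fully summed block vs. remaining block (made-up sizes)
    A = rng.standard_normal((n1 + n2, n1 + n2))
    F = A @ A.T + (n1 + n2) * np.eye(n1 + n2)      # SPD frontal matrix [[K11, K21^T], [K21, K22]]
    K11, K21, K22 = F[:n1, :n1].copy(), F[n1:, :n1].copy(), F[n1:, n1:].copy()

    L11 = cholesky(K11, lower=True)                    # (3) K11 = L11 L11^T          (POTRF)
    K21 = solve_triangular(L11, K21.T, lower=True).T   # (4) K21 <- K21 L11^{-T}      (TRSM)
    K22 -= K21 @ K21.T                                 # (5) K22 <- K22 - K21 K21^T   (rank-k update)
    # the updated K22 is the contribution block that extend_add merges into the parent front

    # check against a Cholesky factorization of the whole frontal matrix
    L = cholesky(F, lower=True)
    print(np.allclose(L[:n1, :n1], L11),
          np.allclose(L[n1:, :n1], K21),
          np.allclose(K22, L[n1:, n1:] @ L[n1:, n1:].T))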

FIGURE 1. Memory structure of the frontal matrix K

During the phase of solving the triangular systems, a forward elimination followed by a backward substitution is the general procedure. Both use the same processor mapping and frontal matrix distribution as the factorization phase. If the RHS (right-hand side) block is denoted by W = [W1; W2], partitioned conformally with the frontal matrix, the forward elimination is composed of two steps:

    solve L11 W1 = W1 for W1                                         (6)
    W2 ← W2 - K21 W1                                                 (7)

The backward substitution is completed by the following two steps:

    W1 ← W1 - K21^T W2                                               (8)
    solve L11^T W1 = W1 for W1                                       (9)

Here, W1 holds the solution components we are looking for. Many communication overheads arise from the panel or block broadcasts and summations in the parallel operation of Equations (3)-(9). The main objectives of performance enhancement in these routines are to reduce the total communication volume and to minimize the idling time for transferring panel or block matrices.
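A dense serial sketch of the local forward elimination (6)-(7) and backward substitution (8)-(9) on one frontal matrix follows (NumPy/SciPy; sizes and data are made up, a Cholesky factor of the single front stands in for the rest of the elimination tree, and the parallel panel communication is omitted).

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(1)
    n1, n2, nrhs = 6, 4, 3                         # made-up sizes; nrhs = number of block Lanczos vectors
    A = rng.standard_normal((n1 + n2, n1 + n2))
    F = A @ A.T + (n1 + n2) * np.eye(n1 + n2)      # SPD frontal matrix [[K11, K21^T], [K21, K22]]
    L = cholesky(F, lower=True)
    L11, L21, L22 = L[:n1, :n1], L[n1:, :n1], L[n1:, n1:]   # kept from the factorization phase

    W = rng.standard_normal((n1 + n2, nrhs))       # right-hand side block (e.g. U_j in Algorithm 2.1)
    W1, W2 = W[:n1].copy(), W[n1:].copy()

    # forward elimination
    W1 = solve_triangular(L11, W1, lower=True)             # (6) solve L11 W1 = W1      (TRSM)
    W2 = W2 - L21 @ W1                                     # (7) W2 <- W2 - K21 W1      (GEMM)

    # in the multifrontal solver the updated W2 travels up the elimination tree and comes
    # back solved; here the single remaining block L22 stands in for that whole subtree
    W2 = solve_triangular(L22, W2, lower=True)
    W2 = solve_triangular(L22, W2, lower=True, trans='T')

    # backward substitution
    W1 = W1 - L21.T @ W2                                   # (8) W1 <- W1 - K21^T W2    (GEMM)
    W1 = solve_triangular(L11, W1, lower=True, trans='T')  # (9) solve L11^T W1 = W1    (TRSM)

    print(np.allclose(np.vstack([W1, W2]), np.linalg.solve(F, W)))   # expected: True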

3. PARALLEL PERFORMANCE ENHANCEMENT AND NUMERICAL TEST

Performance tuning of the multifrontal solver is conducted from three points of view. The first is to reduce the communication occurring in the Cholesky factorization. It is notable that the naming convention of the subroutines used in the eigensolver follows that of BLAS (Basic Linear Algebra Subprograms) [5] and LAPACK (Linear Algebra Package) [6]. As shown in Figure 2, the symmetric frontal matrix is composed of three matrix entities, so three routines are required to factorize the frontal matrix. In the panel-communication-based parallel matrix operations adopted in the present research, each routine performs panel or block communications of the column (or row) broadcasting or reducing type. Considering the detailed communication pattern, it is apparent that some parts of the broadcasting or reducing are duplicated; for example, the column broadcasting of panels of K11 in POTRF is also present in TRSM (Figure 3). If the three routines can be combined into one, the duplicated communications can be avoided. Figure 3 also shows the communication pattern of the single condensation subroutine obtained by combining the duplicated communications of the three subroutines.

FIGURE 2. Frontal matrix and factorization routines
FIGURE 3. Communication pattern of each subroutine

Table 1 lists a numerical comparison without condensation and with condensation. The condensation subroutine, which combines the duplicated communications between nodes, takes less time than the non-condensed version.

TABLE 1. Numerical test comparison without condensation and with condensation

process map = 4x8, block size = 00, with 32 nodes (32 CPUs)
    matrix dimension      without condensation    with condensation    difference
    6000x6000             09.064 sec              0.58 sec             -7.806 sec

process map = 8x8, block size = 00, with 64 nodes (64 CPUs)
    matrix dimension      without condensation    with condensation    difference
    6000x6000             9.6 sec                 .99 sec              -6. sec
    000x000               08.96 sec               86.9 sec             -0.08 sec

The second performance tuning is topology control in the panel or block communication. We implemented two kinds of topology, increasing ring and split ring. The condensation subroutine is composed of five topology options, so there are 32 topology sets. The best performance in our parallel environment is achieved when the topology set is issii (Table 2). It is notable that the best topology set depends on the particular network environment.

TABLE 2. Topology comparison of the condensation subroutine

Topology sets (cases): sssss ssssi sssis sssii ssiss ssisi ssiis ssiii sisss sissi sisis sisii siiss siisi siiis siiii issss isssi issis issii isiss isisi isiis isiii iisss iissi iisis iisii iiiss iiisi iiiis iiiii

Matrix dimension = 6000x6000, process map = 8x8, block size = 00, 64 nodes (64 CPUs):
5.4.50 6.77.4 0.86.605 0.89.8 0.96.4 9.6.5 4.059 4.57.407 4.57 5.85 0.64 7.704 09. 0.59.4 0.0.096 0.85.08.9.5.04.96 4.08.459

Matrix dimension = 000x000, process map = 8x8, block size = 00, 64 nodes (64 CPUs):
607.8 589.898 600.976 586.587 67.98 6.97 64.0 609.89 6.79 59.75 608.47 594.00 68.87 66.56 65.957 64.8 600.697 588.547 598.778 585.88 64.406 608.557 6.545 606.8 6.58 59.975 60. 590.74 67.907 65.7 6.89 6.86

The last tuning concerns the triangular system solving routines. The operations involved in solving the triangular system are composed of TRSM and GEMM on the blocks of the frontal matrix. Since the TRSM routine has a data-dependent algorithmic flow, processors may sit idle between panel broadcasts. One idea to resolve this bottleneck is to convert TRSM into TRMM by computing the inverse of L11 with the TRTRI routine. Although TRMM requires the same number of floating point operations and the same communication volume as TRSM, the LCM (least common multiple) concept can clearly reduce the idling time. The LCM concept, originally proposed in the research on GEMM [7], can also be applied to any case with a data-independent communication pattern. In order to apply such a concept to (6) and (9), the factorization phase is modified so that these operations can be conducted with matrix multiplication only, as follows:

    W1 ← L11^{-1} W1                                                 (10)
    W1 ← L11^{-T} W1                                                 (11)

In this approach, the trade-off between the computing time for the inverse of L11 and the gain from replacing TRSM with TRMM should be taken into account.
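A minimal serial illustration (NumPy/SciPy, made-up sizes) of the idea behind (10)-(11): once L11^{-1} has been formed (the TRTRI step), the data-dependent triangular solves of (6) and (9) become plain matrix multiplications, which is what allows the data-independent, LCM-style communication pattern in the parallel version.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(2)
    n1, nrhs = 500, 8                               # made-up sizes
    A = rng.standard_normal((n1, n1))
    K11 = A @ A.T + n1 * np.eye(n1)
    L11 = cholesky(K11, lower=True)
    W1 = rng.standard_normal((n1, nrhs))

    # TRSM variant: data-dependent triangular solves, equations (6) and (9)
    Y = solve_triangular(L11, W1, lower=True)
    X_trsm = solve_triangular(L11, Y, lower=True, trans='T')

    # TRMM variant: invert L11 once (the TRTRI step), then only matrix multiplications
    L11_inv = solve_triangular(L11, np.eye(n1), lower=True)   # stands in for TRTRI
    X_trmm = L11_inv.T @ (L11_inv @ W1)                       # (10) followed by (11)

    print(np.allclose(X_trsm, X_trmm))                        # expected: True
    # The inverse is computed once per factorization, while (10)-(11) run in every
    # Lanczos iteration, so the extra inversion cost amortizes when the loop count is large.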

TABLE 3. Performance results of the eigenvalue problem (32 CPUs)

Hex8 60 x 60 x (0M DOF), 0 eigenvalues (9 loops)
    Case 1: 9.9 sec, 66.858 sec, 57.7 sec
    Case 2: 7.54 sec, 67.7 sec, 505.865 sec
    Case 3: 9.676 sec, 5.588 sec, 476.549 sec
    Difference (Case 3 - Case 1): -40.8 sec (7.89%)

Hex8 60 x 60 x (0M DOF), 50 eigenvalues ( loops)
    Case 1: 9.066 sec, 648.99 sec, 899.0 sec
    Case 2: 5.60 sec, 64.754 sec, 879.649 sec
    Case 3: 7.000 sec, 5.79 sec, 770.086 sec
    Difference (Case 3 - Case 1): -9.7 sec (4.4%)

TABLE 4. Performance results of the eigenvalue problem (64 CPUs)

Hex8 000 x 000 x (4M DOF), 0 eigenvalues (9 loops)
    Case 1: 484.78 sec, 797.078 sec, 96.588 sec
    Case 2: 44.65 sec, 79.7 sec, 0.586 sec
    Case 3: 48.794 sec, 694.068 sec, 9.68 sec
    Difference (Case 3 - Case 1): -0.40 sec (7.98%)

Hex8 60 x 60 x (0M DOF), 00 eigenvalues (6 loops)
    Case 1: 4.696 sec, 70.70 sec, 0.05 sec
    Case 2: 5.56 sec, 66.60 sec, 07.9 sec
    Case 3: 5.044 sec, 909.7 sec, 066.75 sec
    Difference (Case 3 - Case 1): -5.6 sec (9.%)

Hex8 60 x 60 x (0M DOF), 500 eigenvalues (4 loops)
    Case 1: 47.7 sec, 478.680 sec, 45.87 sec
    Case 2: 9.744 sec, 474.50 sec, 459.864 sec
    Case 3: 44.48 sec, 0.570 sec, 8.6 sec
    Difference (Case 3 - Case 1): -48.776 sec (9.8%)

Hex8 60 x 60 x (0M DOF), 000 eigenvalues (5 loops)
    Case 1: 4.0 sec, 844.70 sec, 856.5 sec
    Case 2: 6.644 sec, 8408.90 sec, 8550.448 sec
    Case 3: 4.47 sec, 65.00 sec, 664.09 sec
    Difference (Case 3 - Case 1): -99.50 sec (5.7%)

The eigenvalue analysis of simple finite element models was performed considering the three performance enhancement issues described above. Tables 3 and 4 list the performance results for the 32-CPU and 64-CPU parallel computing environments. Case 1 is the normal elimination-substitution algorithm with optimized network topology, Case 2 adds the condensation to Case 1, and Case 3 is the partial inverting algorithm with the LCM concept based on Case 2. The three values listed for each case are the Cholesky factorization time, the triangular solution time, and the MFS elapsed time.

The results for the Cholesky factorization time show a performance enhancement when comparing Case 1 with Case 2, and the Cholesky factorization time is similar between Case 2 and Case 3 because of the condensation, in spite of the added inversion routine TRTRI. From the viewpoint of the triangular system solving time, Case 3 gives a good performance enhancement as the number of iteration loops grows.

4. CONCLUSION

Parallel performance tuning of the multifrontal algorithm within the block Lanczos method was conducted from three points of view in this research. Using condensation to reduce the communication volume results in better performance during the Cholesky factorization phase. Network topology optimization is also needed to obtain better performance, and the best topology set depends on the network structure and device type. In this research, an enhancement of about 8% or more was obtained by topology control and condensation in the 64-CPU parallel computing environment. Finally, the factorization time increases when the LCM concept is applied, because the inversion subroutine (TRTRI) is added. However, if the number of Lanczos iterations is large, the partial inverting algorithm is more efficient and shows good performance; an enhancement of about 8% or more was obtained by using the LCM concept, owing to the data-independent communication. Applying the LCM concept inside an iteration loop significantly reduces the computing time of the parallel matrix operations.

ACKNOWLEDGMENTS

This research has been partially supported by the Rapid Design of Satellite Structures by Multi-disciplinary Design Optimization project of the Korea Aerospace Research Institute (KARI) and by the NRL program administered via the Institute of Advanced Aerospace Technology at Seoul National University.

REFERENCES

[1] D. Calvetti, L. Reichel and D. Sorensen, An Implicitly Restarted Lanczos Method for Large Symmetric Eigenvalue Problems, Electronic Transactions on Numerical Analysis, Vol. 2, pp. 1-21, 1994.
[2] K. Wu and H. Simon, An Evaluation of the Parallel Shift-and-Invert Lanczos Method, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, USA, June 1999.
[3] R. Morgan and D. Scott, Preconditioning the Lanczos Algorithm for Sparse Symmetric Eigenvalue Problems, SIAM Journal on Scientific Computing, Vol. 14, pp. 585-593, 1993.
[4] Y. T. Feng and D. R. J. Owen, Conjugate Gradient Methods for Solving the Smallest Eigenpair of Large Symmetric Eigenvalue Problems, International Journal for Numerical Methods in Engineering, Vol. 39, pp. 2209-2229, 1996.
[5] http://www.netlib.org/blas
[6] http://www.netlib.org/lapack
[7] J. Choi, A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers, Proceedings of the International Parallel Processing Symposium (IPPS), 1997.