The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen
|
|
- Alexis Ellis
- 5 years ago
- Views:
Transcription
1 The LINPACK Benchmark in Co-Array Fortran J. K. Reid Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK J. M. Rasmussen and P. C. Hansen Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark Abstract Co-array Fortran, abbreviated to CAF, is an extension of Fortran 90/95 for parallel programming that has been designed to be easy both for the compiler writer to implement and for the programmer to write and understand. It offers the prospect of clear and efficient parallel programming on homogeneous parallel systems. A subset of Co-Array Fortran is available on the Cray T3E and the aim of this talk is to explain how the LINPACK benchmark can be written in this language and compare its performance with that of ScaLAPACK. We pay particular attention to the solution of a single set of equations since we have not found a clear description in the literature of an algorithm that is asymptotically fully parallel. 1 Introduction Co-array Fortran (Numrich and Reid 1998), abbreviated to CAF, is an extension of Fortran 90/95 for parallel programming that has been designed to be easy both for the compiler writer to implement and for the programmer to write and understand. It offers the prospect of clear and efficient parallel programming on homogeneous parallel systems. Each processor has an identical copy of the program and has its own data objects. Each co-array is evenly spread over all the processors with each processor having a part of exactly the same shape. The language is carefully designed so that implementations will usually use the same address on each processor for the processor's part of the co-array. Subscripts in round brackets are used in the usual way to address the local part and subscripts in square brackets are used to address parts on other processors. References without square brackets are to local data, so code that can run independently is uncluttered. Only where there is are square brackets, or a 1
2 procedure call to code that involves square brackets, is there communication between processors. There are intrinsic procedures to synchronize processors, return the number of processors, and return the index of the current processor. A subset of Co-Array Fortran is available on the T3E and the aim of this talk is to explain how the LINPACK benchmark involving the solution of a dense set of 1000 linear equations can be written in this language and compare its performance with that of ScaLAPACK (Blackford et al. 1997). 2 LU Factorization of a Full Matrix We have written a CAF code for LU factorization with partial pivoting by rows of a large dense unsymmetric matrix of order N. To give good load balance, we have followed ScaLAPACK and used ablock cyclic distribution by rows and columns. The matrix is treated as a block matrix with all blocks of size r r except those in the last block column, which may have fewer columns, and the last block row, which may have fewer rows. The processors are arranged in a P Q rectangular array and the blocks are stored cyclically in both directions, that is, block i; j is stored on processor [1 + mod(i 1;P); 1 + mod(j 1;Q)]. We use a co-array of local shape (dn=p e; dn=q)e and co-shape [P; Q]. This permits the local part of each block Schur-complement update to be performed by a single call of the level-3 BLAS routine GEMM, with very good cache usage. We obtained the following comparisons with ScaLAPACK on the Linkoping T3E-600 and the Manchester T3E-1200, using block sizes of 32 and 48. LU factorization Co-shape times in ms T3E-600, block size32 CAF ScaLAPACK T3E-600, block size48 CAF ScaLAPACK T3E-1200, block size 32 CAF ScaLAPACK T3E-1200, block size 48 CAF ScaLAPACK
3 It may be seen that the Co-ArrayFortran code is slower for small numbers of processors, but scales better and is faster for larger numbers of processors. We do not see it as sensible to run a problem of this size on more than 16 processors. The differing block sizes can alter the speed by some 10%, but do not alter the relative performance of CAF and ScaLAPACK much. 3 Using the Factorization to Solve a Set of Equations Given the LU factorization and a record of the permutation used, we can solve a system of linear equations by applying the permutation to the righthand side vector, performing the forward substitution Liiyi = bi followed by the back-substitution Uiixi = yi mx Xi 1 j=1 j=i+1 Lijyj; i =1; 2;:::;m Uijxj; i = m; m 1;:::;1 Here, we assume that each bi, xi or yi is a block of order r (except perhaps for the last ones) and m = dn=re. We assume that the matrix blocks are as left after the LU factorization, that is, block i; j is stored on processor [1 + mod(i 1;P), 1 + mod(j 1;Q)]. We also assume that the vectors are similarly blocked, with eachblock on the same processor as the corresponding block on the diagonal of the matrix. There is a danger that applying the permutations may be as time consuming as a forward or backward substitution. Our original code applied each interchange in turn. The speed was greatly improved by holding the permutation explicitly and making each processor involved responsible for calculating its part of the permuted vector. However, for the numbers of processors that we have used here, our best speed was obtained by collecting the data onto one processor, performing the permutation there, and distributing the permuted vector. How to best solve a single triangular system of linear equations in parallel is not obvious and wehave not found a clear description in the literature of an algorithm that is asymptotically fully parallel. We developed our algorithm independently, but believe that it is a generalization of that of Bisseling and van de Vorst (1991). The main differences are that we work with blocks so that Level-2 BLAS (Dongarra, Du Croz, Hammarling, and Hanson, 1988) 3
4 can be employed locally and we use a rectangular array of processors instead of a square array. We will describe the solution of a lower triangular system. It is straightforward to adapt the algorithm to an upper triangular system. For an off-diagonal block, the main task of its processor is the multiplication Lijyj For a diagonal block, the main task of its processor is to perform a forward substitution. These operations are performed by Level-2 BLAS. Once a processor holding a diagonal block has computed its part of the solution, this must be passed to the other processors involved in the block column. We do this by passing it from neighbour to neighbour down the block column until all processors have received it. The products computed by the off-diagonal blocks in a block row must be accumulated and passed to the processor holding the diagonal block. This accumulation is initially performed locally; for blocks whose distance to the diagonal is less than Q, the partially accumulated sums are passed from neighbour to neighbour along the block row, each further accumulating the sum. Each processor performs actions associated with its first diagonal block, then the rest of that block column, then its second diagonal block, then the rest of that block column, etc. For each block whose distance to the diagonal is less than Q, the processor must wait for data from its neighbour to the left; and for each off-diagonal block whose distance to the diagonal is less than P, the processor must wait for data from its neighbour above it. No other synchronizations are needed. In particular, no other synchronizations are needed for blocks whose distance to the diagonal is at least max(p; Q). To see that the algorithm is fully parallel asymptotically, consider the processing of the first Q block columns. All processors are involved, but each works on a single block column. Processor [1,1] performs the first forward substitution, places the solution on processor [2,1], and synchronizes with it. Processor [2,1] performs its multiplication, places the result on processor [2,2], and synchronizes with it. It also places the vector it received from [1,1] on [3,1] and synchronizes with it. This continues until the processors are working on all Q block columns and no further synchronizations are needed. The critical path for this `start-up phase' involves the blocks [1,1], [2,1], [2,2], [3,2], [3,3],... involving a total of about r 2 (3Q=2 1) sequential operations and a small amountofcommunication. The rest of the calculation is fully pararallel and involves about r 2 N=(Pr)=rN=P operations on each processor. Applying similar arguments to the other block columns, we get a 4
5 critical path total of about 3Nr=2 operations and about N 2 =(PQ) operations on each processor. The ratio of these counts is N=(3PQr=2), which tends to infinity with N for fixed P; Q and r. Note that each processor that has critical path tasks in a block column attends to these before others for the block column. An outline of the CAF program is as follows: me = this_image() m = ceiling(n/r)! Number of blocks in a row or column do j = 1, m do i = j, m prow = mod(i-1,p) + 1! Row co-index pcol = mod(j-1,q) + 1! Column co-index! Perform work only if this block is owned by this processor if (me /= procgrid(prow,pcol) ) exit right = procgrid(prow,mod(pcol,q)+1) left = procgrid(prow,mod(pcol-2,q)+1) above = procgrid(mod(prow-2,p)+1,pcol) below = procgrid(mod(prow,p)+1,pcol) s = min(r,n-r*(i-1))! Size of this block i1 = (i-1)/p*r+1! First local index i2 = i1+s-1! Last local index if (i-j >= max(p,q) then call sgemv(..)! Multiply the block by xj and add into asum EXIT! No synchronization needed if (i > j) then if (i-j < P) then call sync_images( (/ me, above /) ) xj = temp_xj! receive part of x from image above call sgemv(..)! Multiply the block by xj and add into asum if (j > 1.AND. i-j < Q-1) then! accumulate sum with data from image to the left in rvec call sync_images( (/ me, left /) ) asum(i1:i2) = asum(i1:i2) + rvec(1:s) if (i == j) then! perform local forward-substitution b(i1:i2) = b(i1:i2) - asum(i1:i2) 5
6 call strsv(..)! Solution in b(i1:i2) xj(1:s) = b(i1:i2) else if (i-j < Q) then! send accumulated sum to image to the right rvec(1:s)[prow,mod(pcol,q)+1] = asum(i1:i2) call sync_images( (/ me, right /) )! send part of x to image below if (i < m.and. i-j < P-1) then temp_xj(:,:)[mod(prow,p)+1,pcol] = xj(:,:) call sync_images( (/ me, below /) ) end do end do Using our CAF code, we obtained the following comparisons with ScaLAPACK on the LINPACK test involving a single right-hand side for a problem of order N = For the CAF code, we include the time for applying the permutations as well as performing the forward and back substitution. For the ScaLAPACK code, we found that the permutation time was significant, so we also show the time excluding the permutation time, that is, the time for the forward and backward substitution. Solution Co-shape times in ms T3E-600, block size 32 CAF ScaLAPACK ScaLAPACK, no perm T3E-600, block size 48 CAF ScaLAPACK ScaLAPACK, no perm T3E-1200, block size 32 CAF ScaLAPACK ScaLAPACK, no perm T3E-1200, block size 48 CAF ScaLAPACK ScaLAPACK, no perm We note that, again, the Co-Array Fortran code is slower for small numbers of processors, but scales better and is faster for larger numbers of processors. 6
7 4 Conclusions We have demonstrated that Co-Array Fortran can be used to write codes for the LINPACK benchmark that are clear and perform well. In particular, they scale better with increasing numbers of processors than the ScaLAPACK codes. We have also provided a straightforward description of a fully parallel algorithm for solving a triangular set of equations. Acknowledgements We would like to express our thanks to the National Supercomputer Centre at Linkoping University for making the Linkoping T3E available to us for the project `Investigation of the effectiveness of Co-array Fortran' and to Bo Einarsson for the substantial amount of help that he has given us. We would also like thank EPSRC for making the University of Manchester Computer Services for Academic Research (CSAR) T3E available to John Reid under the project GR/M7850Z. References [1] Bisseling, R. H. and van de Vorst, J. G. G. (1991). Parallel triangular system solving on a mesh network of transputers, SIAM J. Sci. Stat. Comput., 12, [2] Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997). ScaLAPACK users' guide. SIAM, Philadelphia. [3] Dongarra, J. J., Du Croz, J., Hammarling, S., and Hanson, R. J. (1988). An extended set of Fortran Basic Linear Algebra Subprograms. ACM Trans Math. Software, 14, 1-17 and [4] Numrich, R. W. and Reid, J. K. (1998), Co-Array Fortran for parallel programming, ACM Fortran Forum 17, 2,
LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.
LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics
More informationSymmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano
Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization
More informationLAPACK-Style Codes for Pivoted Cholesky and QR Updating
LAPACK-Style Codes for Pivoted Cholesky and QR Updating Sven Hammarling 1, Nicholas J. Higham 2, and Craig Lucas 3 1 NAG Ltd.,Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, England, sven@nag.co.uk,
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationOn aggressive early deflation in parallel variants of the QR algorithm
On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden
More informationMore Gaussian Elimination and Matrix Inversion
Week7 More Gaussian Elimination and Matrix Inversion 7 Opening Remarks 7 Introduction 235 Week 7 More Gaussian Elimination and Matrix Inversion 236 72 Outline 7 Opening Remarks 235 7 Introduction 235 72
More informationAlgorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems
Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for
More informationI-v k e k. (I-e k h kt ) = Stability of Gauss-Huard Elimination for Solving Linear Systems. 1 x 1 x x x x
Technical Report CS-93-08 Department of Computer Systems Faculty of Mathematics and Computer Science University of Amsterdam Stability of Gauss-Huard Elimination for Solving Linear Systems T. J. Dekker
More informationLinear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4
Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix
More informationUsing Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods
Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationImplementation for LAPACK of a Block Algorithm for Matrix 1-Norm Estimation
Implementation for LAPACK of a Block Algorithm for Matrix 1-Norm Estimation Sheung Hun Cheng Nicholas J. Higham August 13, 2001 Abstract We describe double precision and complex16 Fortran 77 implementations,
More informationRoundoff Error. Monday, August 29, 11
Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate
More information2 MULTIPLYING COMPLEX MATRICES It is rare in matrix computations to be able to produce such a clear-cut computational saving over a standard technique
STABILITY OF A METHOD FOR MULTIPLYING COMPLEX MATRICES WITH THREE REAL MATRIX MULTIPLICATIONS NICHOLAS J. HIGHAM y Abstract. By use of a simple identity, the product of two complex matrices can be formed
More informationANONSINGULAR tridiagonal linear system of the form
Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal
More informationDownloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see
SIAMJ.MATRIX ANAL. APPL. c 1997 Society for Industrial and Applied Mathematics Vol. 18, No. 1, pp. 140 158, January 1997 012 AN UNSYMMETRIC-PATTERN MULTIFRONTAL METHOD FOR SPARSE LU FACTORIZATION TIMOTHY
More informationPivoting. Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3
Pivoting Reading: GV96 Section 3.4, Stew98 Chapter 3: 1.3 In the previous discussions we have assumed that the LU factorization of A existed and the various versions could compute it in a stable manner.
More informationStability of a method for multiplying complex matrices with three real matrix multiplications. Higham, Nicholas J. MIMS EPrint: 2006.
Stability of a method for multiplying complex matrices with three real matrix multiplications Higham, Nicholas J. 1992 MIMS EPrint: 2006.169 Manchester Institute for Mathematical Sciences School of Mathematics
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationLecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations
Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Chapter 2 Systems of Linear Equations Copyright c 2001. Reproduction permitted only for noncommercial,
More informationWeek6. Gaussian Elimination. 6.1 Opening Remarks Solving Linear Systems. View at edx
Week6 Gaussian Elimination 61 Opening Remarks 611 Solving Linear Systems View at edx 193 Week 6 Gaussian Elimination 194 61 Outline 61 Opening Remarks 193 611 Solving Linear Systems 193 61 Outline 194
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationAn evaluation of sparse direct symmetric solvers: an introduction and preliminary finding
Numerical Analysis Group Internal Report 24- An evaluation of sparse direct symmetric solvers: an introduction and preliminary finding J A Scott Y Hu N I M Gould November 8, 24 c Council for the Central
More informationKey words. modified Cholesky factorization, optimization, Newton s method, symmetric indefinite
SIAM J. MATRIX ANAL. APPL. c 1998 Society for Industrial and Applied Mathematics Vol. 19, No. 4, pp. 197 111, October 1998 16 A MODIFIED CHOLESKY ALGORITHM BASED ON A SYMMETRIC INDEFINITE FACTORIZATION
More informationNAG Fortran Library Routine Document F04CFF.1
F04 Simultaneous Linear Equations NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised
More informationMatrix Eigensystem Tutorial For Parallel Computation
Matrix Eigensystem Tutorial For Parallel Computation High Performance Computing Center (HPC) http://www.hpc.unm.edu 5/21/2003 1 Topic Outline Slide Main purpose of this tutorial 5 The assumptions made
More informationScaling and Pivoting in an Out-of-Core Sparse Direct Solver
Scaling and Pivoting in an Out-of-Core Sparse Direct Solver JENNIFER A. SCOTT Rutherford Appleton Laboratory Out-of-core sparse direct solvers reduce the amount of main memory needed to factorize and solve
More informationLinear Algebra Linear Algebra : Matrix decompositions Monday, February 11th Math 365 Week #4
Linear Algebra Linear Algebra : Matrix decompositions Monday, February 11th Math Week # 1 Saturday, February 1, 1 Linear algebra Typical linear system of equations : x 1 x +x = x 1 +x +9x = 0 x 1 +x x
More informationScientific Computing: An Introductory Survey
Scientific Computing: An Introductory Survey Chapter 2 Systems of Linear Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction
More informationLecture 8: Fast Linear Solvers (Part 7)
Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi
More informationD. Gimenez, M. T. Camara, P. Montilla. Aptdo Murcia. Spain. ABSTRACT
Accelerating the Convergence of Blocked Jacobi Methods 1 D. Gimenez, M. T. Camara, P. Montilla Departamento de Informatica y Sistemas. Univ de Murcia. Aptdo 401. 0001 Murcia. Spain. e-mail: fdomingo,cpmcm,cppmmg@dif.um.es
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationTile QR Factorization with Parallel Panel Processing for Multicore Architectures
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University
More informationA Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures. F Tisseur and J Dongarra
A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures F Tisseur and J Dongarra 999 MIMS EPrint: 2007.225 Manchester Institute for Mathematical
More informationIntel Math Kernel Library (Intel MKL) LAPACK
Intel Math Kernel Library (Intel MKL) LAPACK Linear equations Victor Kostin Intel MKL Dense Solvers team manager LAPACK http://www.netlib.org/lapack Systems of Linear Equations Linear Least Squares Eigenvalue
More informationLAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky Factorizations. C. Lucas. Numerical Analysis Report No February 2004
ISSN 1360-1725 UMIST LAPACK-Style Codes for Level 2 and 3 Pivoted Cholesky Factorizations C. Lucas Numerical Analysis Report No. 442 February 2004 Manchester Centre for Computational Mathematics Numerical
More informationV C V L T I 0 C V B 1 V T 0 I. l nk
Multifrontal Method Kailai Xu September 16, 2017 Main observation. Consider the LDL T decomposition of a SPD matrix [ ] [ ] [ ] [ ] B V T L 0 I 0 L T L A = = 1 V T V C V L T I 0 C V B 1 V T, 0 I where
More informationNAG Library Routine Document F07HAF (DPBSV)
NAG Library Routine Document (DPBSV) Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationMatrix decompositions
Matrix decompositions How can we solve Ax = b? 1 Linear algebra Typical linear system of equations : x 1 x +x = x 1 +x +9x = 0 x 1 +x x = The variables x 1, x, and x only appear as linear terms (no powers
More informationTesting Linear Algebra Software
Testing Linear Algebra Software Nicholas J. Higham, Department of Mathematics, University of Manchester, Manchester, M13 9PL, England higham@ma.man.ac.uk, http://www.ma.man.ac.uk/~higham/ Abstract How
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationMatrix decompositions
Matrix decompositions How can we solve Ax = b? 1 Linear algebra Typical linear system of equations : x 1 x +x = x 1 +x +9x = 0 x 1 +x x = The variables x 1, x, and x only appear as linear terms (no powers
More informationSDPARA : SemiDefinite Programming Algorithm PARAllel version
Research Reports on Mathematical and Computing Sciences Series B : Operations Research Department of Mathematical and Computing Sciences Tokyo Institute of Technology 2-12-1 Oh-Okayama, Meguro-ku, Tokyo
More informationLinear Systems of Equations by Gaussian Elimination
Chapter 6, p. 1/32 Linear of Equations by School of Engineering Sciences Parallel Computations for Large-Scale Problems I Chapter 6, p. 2/32 Outline 1 2 3 4 The Problem Consider the system a 1,1 x 1 +
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More informationNumerical Methods I: Numerical linear algebra
1/3 Numerical Methods I: Numerical linear algebra Georg Stadler Courant Institute, NYU stadler@cimsnyuedu September 1, 017 /3 We study the solution of linear systems of the form Ax = b with A R n n, x,
More informationExploiting zeros on the diagonal in the direct solution of indefinite sparse symmetric linear systems
RAL-R-95-040 Exploiting zeros on the diagonal in the direct solution of indefinite sparse symmetric linear systems by I. S. Duff and J. K. Reid Abstract We describe the design of a new code for the solution
More informationSolution of Large, Dense Symmetric Generalized Eigenvalue Problems Using Secondary Storage
Solution of Large, Dense Symmetric Generalized Eigenvalue Problems Using Secondary Storage ROGER G. GRIMES and HORST D. SIMON Boeing Computer Services This paper describes a new implementation of algorithms
More informationPrinciples of Scientific Computing Linear Algebra II, Algorithms
Principles of Scientific Computing Linear Algebra II, Algorithms David Bindel and Jonathan Goodman last revised March 2, 2006, printed February 26, 2009 1 1 Introduction This chapter discusses some of
More information1 Multiply Eq. E i by λ 0: (λe i ) (E i ) 2 Multiply Eq. E j by λ and add to Eq. E i : (E i + λe j ) (E i )
Direct Methods for Linear Systems Chapter Direct Methods for Solving Linear Systems Per-Olof Persson persson@berkeleyedu Department of Mathematics University of California, Berkeley Math 18A Numerical
More informationSolution of Linear Systems
Solution of Linear Systems Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May 12, 2016 CPD (DEI / IST) Parallel and Distributed Computing
More informationCS 542G: Conditioning, BLAS, LU Factorization
CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles
More informationMATH 3511 Lecture 1. Solving Linear Systems 1
MATH 3511 Lecture 1 Solving Linear Systems 1 Dmitriy Leykekhman Spring 2012 Goals Review of basic linear algebra Solution of simple linear systems Gaussian elimination D Leykekhman - MATH 3511 Introduction
More informationEfficient Implementation of the Ensemble Kalman Filter
University of Colorado at Denver and Health Sciences Center Efficient Implementation of the Ensemble Kalman Filter Jan Mandel May 2006 UCDHSC/CCM Report No. 231 CENTER FOR COMPUTATIONAL MATHEMATICS REPORTS
More informationA Comparison of Parallel Solvers for Diagonally. Dominant and General Narrow-Banded Linear. Systems II.
A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow-Banded Linear Systems II Peter Arbenz 1, Andrew Cleary 2, Jack Dongarra 3, and Markus Hegland 4 1 Institute of Scientic Computing,
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationExponentials of Symmetric Matrices through Tridiagonal Reductions
Exponentials of Symmetric Matrices through Tridiagonal Reductions Ya Yan Lu Department of Mathematics City University of Hong Kong Kowloon, Hong Kong Abstract A simple and efficient numerical algorithm
More informationParallel eigenvalue reordering in real Schur forms
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1 7 [Version: 2002/09/19 v2.02] Parallel eigenvalue reordering in real Schur forms R. Granat 1, B. Kågström
More informationNAG Library Routine Document F08PNF (ZGEES)
F08 Least-squares and Eigenvalue Problems (LAPACK) F08PNF NAG Library Routine Document F08PNF (ZGEES) Note: before using this routine, please read the Users Note for your implementation to check the interpretation
More informationSection 5.6. LU and LDU Factorizations
5.6. LU and LDU Factorizations Section 5.6. LU and LDU Factorizations Note. We largely follow Fraleigh and Beauregard s approach to this topic from Linear Algebra, 3rd Edition, Addison-Wesley (995). See
More informationAx = b. Systems of Linear Equations. Lecture Notes to Accompany. Given m n matrix A and m-vector b, find unknown n-vector x satisfying
Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T Heath Chapter Systems of Linear Equations Systems of Linear Equations Given m n matrix A and m-vector
More informationBLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product
Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8
More informationNAG Library Routine Document F08UBF (DSBGVX)
NAG Library Routine Document (DSBGVX) Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationFACTORIZING COMPLEX SYMMETRIC MATRICES WITH POSITIVE DEFINITE REAL AND IMAGINARY PARTS
MATHEMATICS OF COMPUTATION Volume 67, Number 4, October 1998, Pages 1591 1599 S 005-5718(98)00978-8 FACTORIZING COMPLEX SYMMETRIC MATRICES WITH POSITIVE DEFINITE REAL AND IMAGINARY PARTS NICHOLAS J. HIGHAM
More informationAasen s Symmetric Indefinite Linear Solvers in LAPACK
Aasen s Symmetric Indefinite Linear Solvers in LAPACK Ichitaro Yamazaki and Jack Dongarra December 2, 217 Abstract Recently, we released two LAPACK subroutines that implement Aasen s algorithms for solving
More informationThe design and use of a sparse direct solver for skew symmetric matrices
The design and use of a sparse direct solver for skew symmetric matrices Iain S. Duff May 6, 2008 RAL-TR-2006-027 c Science and Technology Facilities Council Enquires about copyright, reproduction and
More informationCHAPTER 6. Direct Methods for Solving Linear Systems
CHAPTER 6 Direct Methods for Solving Linear Systems. Introduction A direct method for approximating the solution of a system of n linear equations in n unknowns is one that gives the exact solution to
More informationSpecialized Parallel Algorithms for Solving Lyapunov and Stein Equations
Journal of Parallel and Distributed Computing 61, 14891504 (2001) doi:10.1006jpdc.2001.1732, available online at http:www.idealibrary.com on Specialized Parallel Algorithms for Solving Lyapunov and Stein
More informationDirect solution methods for sparse matrices. p. 1/49
Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve
More informationChapter 2. Solving Systems of Equations. 2.1 Gaussian elimination
Chapter 2 Solving Systems of Equations A large number of real life applications which are resolved through mathematical modeling will end up taking the form of the following very simple looking matrix
More informationMTH 464: Computational Linear Algebra
MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University February 6, 2018 Linear Algebra (MTH
More informationA Column Pre-ordering Strategy for the Unsymmetric-Pattern Multifrontal Method
A Column Pre-ordering Strategy for the Unsymmetric-Pattern Multifrontal Method TIMOTHY A. DAVIS University of Florida A new method for sparse LU factorization is presented that combines a column pre-ordering
More information1. Introduction. Applying the QR algorithm to a real square matrix A yields a decomposition of the form
BLOCK ALGORITHMS FOR REORDERING STANDARD AND GENERALIZED SCHUR FORMS LAPACK WORKING NOTE 171 DANIEL KRESSNER Abstract. Block algorithms for reordering a selected set of eigenvalues in a standard or generalized
More informationGeneralized interval arithmetic on compact matrix Lie groups
myjournal manuscript No. (will be inserted by the editor) Generalized interval arithmetic on compact matrix Lie groups Hermann Schichl, Mihály Csaba Markót, Arnold Neumaier Faculty of Mathematics, University
More informationCommunication Avoiding LU Factorization using Complete Pivoting
Communication Avoiding LU Factorization using Complete Pivoting Implementation and Analysis Avinash Bhardwaj Department of Industrial Engineering and Operations Research University of California, Berkeley
More informationA Review of Matrix Analysis
Matrix Notation Part Matrix Operations Matrices are simply rectangular arrays of quantities Each quantity in the array is called an element of the matrix and an element can be either a numerical value
More informationDirect Methods for Solving Linear Systems. Matrix Factorization
Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationParallel Scientific Computing
IV-1 Parallel Scientific Computing Matrix-vector multiplication. Matrix-matrix multiplication. Direct method for solving a linear equation. Gaussian Elimination. Iterative method for solving a linear equation.
More informationThe Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers
The Design, Implementation, and Evaluation of a Symmetric Banded Linear Solver for Distributed-Memory Parallel Computers ANSHUL GUPTA and FRED G. GUSTAVSON IBM T. J. Watson Research Center MAHESH JOSHI
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More information. =. a i1 x 1 + a i2 x 2 + a in x n = b i. a 11 a 12 a 1n a 21 a 22 a 1n. i1 a i2 a in
Vectors and Matrices Continued Remember that our goal is to write a system of algebraic equations as a matrix equation. Suppose we have the n linear algebraic equations a x + a 2 x 2 + a n x n = b a 2
More informationNUMERICAL ANALYSIS GROUP PROGRESS REPORT
RAL 94-062 NUMERICAL ANALYSIS GROUP PROGRESS REPORT January 1991 December 1993 Edited by Iain S. Duff Central Computing Department Atlas Centre Rutherford Appleton Laboratory Oxon OX11 0QX June 1994. CONTENTS
More informationEvaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries
Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationPracticality of Large Scale Fast Matrix Multiplication
Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by
More informationFinite Math - J-term Section Systems of Linear Equations in Two Variables Example 1. Solve the system
Finite Math - J-term 07 Lecture Notes - //07 Homework Section 4. - 9, 0, 5, 6, 9, 0,, 4, 6, 0, 50, 5, 54, 55, 56, 6, 65 Section 4. - Systems of Linear Equations in Two Variables Example. Solve the system
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationMAC1105-College Algebra. Chapter 5-Systems of Equations & Matrices
MAC05-College Algebra Chapter 5-Systems of Equations & Matrices 5. Systems of Equations in Two Variables Solving Systems of Two Linear Equations/ Two-Variable Linear Equations A system of equations is
More informationTowards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems
Technical Report RAL-TR-2005-007 Towards a stable static pivoting strategy for the sequential and parallel solution of sparse symmetric indefinite systems Iain S. Duff and Stéphane Pralet April 22, 2005
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationBlock-tridiagonal matrices
Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute
More informationA Parallel Bisection and Inverse Iteration Solver for a Subset of Eigenpairs of Symmetric Band Matrices
A Parallel Bisection and Inverse Iteration Solver for a Subset of Eigenpairs of Symmetric Band Matrices Hiroyui Ishigami, Hidehio Hasegawa, Kinji Kimura, and Yoshimasa Naamura Abstract The tridiagonalization
More information5.7 Cramer's Rule 1. Using Determinants to Solve Systems Assumes the system of two equations in two unknowns
5.7 Cramer's Rule 1. Using Determinants to Solve Systems Assumes the system of two equations in two unknowns (1) possesses the solution and provided that.. The numerators and denominators are recognized
More informationA New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation.
1 A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation João Carvalho, DMPA, Universidade Federal do RS, Brasil Karabi Datta, Dep MSc, Northern Illinois University, DeKalb, IL
More informationLinear Equations in Linear Algebra
1 Linear Equations in Linear Algebra 1.1 SYSTEMS OF LINEAR EQUATIONS LINEAR EQUATION x 1,, x n A linear equation in the variables equation that can be written in the form a 1 x 1 + a 2 x 2 + + a n x n
More information