Computing least squares condition numbers on hybrid multicore/GPU systems


M. Baboulin (Inria and University of Paris-Sud, Orsay, France, e-mail: marc.baboulin@inria.fr), J. Dongarra (University of Tennessee, Knoxville, USA, e-mail: dongarra@eecs.utk.edu), and R. Lacroix (Inria and University Pierre et Marie Curie, Paris, France, e-mail: remi.lacroix@inria.fr)

Abstract. This paper presents an efficient computation of least squares conditioning, or of estimates of it. We propose performance results using new routines built on top of the multicore-GPU library MAGMA. This set of routines is based on an efficient computation of the variance-covariance matrix, for which, to our knowledge, there is no implementation in the current public domain libraries LAPACK and ScaLAPACK.

1 Introduction

Linear least squares (LLS) is a classical linear algebra problem in scientific computing, arising for instance in many parameter estimation problems [5]. We consider the overdetermined full rank linear least squares problem min_{x ∈ R^n} ||Ax − b||_2, with A ∈ R^{m×n}, m ≥ n, and b ∈ R^m. In addition to computing LLS solutions efficiently, an important issue is to assess the numerical quality of the computed solution. The notion of conditioning provides a theoretical framework that can be used to measure the numerical sensitivity of a problem solution to perturbations. Similarly to [2, 3], we suppose that the perturbations on the data are measured using the Frobenius norm for matrices and the Euclidean norm for vectors. Then we can derive simple formulas for the condition number of the LLS solution x, or of its components, using the R factor (from the QR decomposition of A), the residual, and x. We can also use the variance-covariance matrix. In this paper we propose algorithms to compute LLS condition numbers in a computational time that is affordable for large scale simulations, in particular using the variance-covariance matrix.

We also compute statistical condition estimates that can be obtained cheaply (in O(n^2) operations) and with satisfying accuracy, using an approach similar to [6, 8]. For these algorithms we describe an implementation for LLS conditioning using the MAGMA library [4, 10], a dense linear algebra library for heterogeneous multicore-GPU architectures with an interface similar to LAPACK. Our implementation takes advantage of current hybrid multicore-GPU systems by splitting the computational work between the GPU and the multicore host. We present performance results, and these results are compared with the computational cost of computing the LLS solution itself.

2 Closed formulas and statistical estimates

In this section we recall some existing formulas to compute or estimate the condition number of an LLS solution x or of its components. We suppose that the LLS problem has already been solved using a QR factorization (the normal equations method is also possible, but the condition number is then proportional to cond(A)^2). Then the solution x, the residual r = b − Ax, and the factor R ∈ R^{n×n} of the QR factorization of A are readily available. From [3] we obtain a closed formula for the absolute condition number of the LLS solution as

    κ_LS = ||R^{-1}||_2 ( ||R^{-1}||_2^2 ||r||_2^2 + ||x||_2^2 + 1 )^{1/2},    (1)

where x, r and R are exact quantities. We can also compute κ̄_LS, a statistical estimate of κ_LS, obtained from the condition numbers of z_i^T x, where z_1, z_2, ..., z_q are q random orthogonal vectors of R^n, obtained for instance via a QR factorization of a random matrix Z ∈ R^{n×q}. The condition number of z_i^T x can be computed using the expression given in [3] as

    κ_i = ( ||R^{-1} R^{-T} z_i||_2^2 ||r||_2^2 + ||R^{-T} z_i||_2^2 ( ||x||_2^2 + 1 ) )^{1/2}.    (2)

Then κ̄_LS is computed using the expression

    κ̄_LS = (ω_q / ω_n) ( Σ_{j=1}^{q} κ_j^2 )^{1/2},  with  ω_q = sqrt( 2 / (π (q − 1/2)) ).

As explained in [6], choosing q = 2 random vectors already gives satisfying accuracy. By considering in Equation (2) the special case z_i = e_i, where e_i is the ith canonical vector of R^n, we can express in Equation (3) the condition number of the component x_i = e_i^T x. We can then calculate a vector κ_CW ∈ R^n whose components κ_i are the exact condition numbers of the x_i, expressed by

    κ_i = ( ||R^{-1} R^{-T} e_i||_2^2 ||r||_2^2 + ||R^{-T} e_i||_2^2 ( ||x||_2^2 + 1 ) )^{1/2}.    (3)

A statistical estimate for each κ_i can also be found in [6, 8].
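To make these formulas concrete, here is a minimal NumPy/SciPy sketch of Equations (1)-(3) and of the statistical estimate with q = 2, applied to a small randomly generated problem. This is not the MAGMA code described in this paper; the data, the variable names, and the use of scipy.linalg.solve_triangular are illustrative assumptions only.

```python
# Illustrative sketch only: exact and statistical LLS condition numbers,
# Equations (1)-(3), for a small made-up random problem.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 1000, 50                                   # overdetermined, full column rank assumed
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A)                            # economy QR: R is n x n upper triangular
x = solve_triangular(R, Q.T @ b)                  # LLS solution
r = b - A @ x                                     # residual
rr, xx = r @ r, x @ x

# Equation (1): absolute condition number of the LLS solution.
Rinv_norm = 1.0 / np.linalg.svd(R, compute_uv=False)[-1]   # ||R^{-1}||_2 = 1 / sigma_min(R)
kappa_LS = Rinv_norm * np.sqrt(Rinv_norm**2 * rr + xx + 1.0)

# Equation (3): exact condition numbers of all solution components x_i.
RinvT = solve_triangular(R, np.eye(n), trans='T')           # columns are R^{-T} e_i
RinvRinvT = solve_triangular(R, RinvT)                      # columns are R^{-1} R^{-T} e_i
kappa_CW = np.sqrt((RinvRinvT**2).sum(axis=0) * rr + (RinvT**2).sum(axis=0) * (xx + 1.0))

# Statistical estimate of kappa_LS from q = 2 random orthogonal vectors, Equation (2).
q = 2
Z, _ = np.linalg.qr(rng.standard_normal((n, q)))            # q orthonormal vectors z_i
RinvTZ = solve_triangular(R, Z, trans='T')
RinvRinvTZ = solve_triangular(R, RinvTZ)
kappa_z = np.sqrt((RinvRinvTZ**2).sum(axis=0) * rr + (RinvTZ**2).sum(axis=0) * (xx + 1.0))
omega = lambda p: np.sqrt(2.0 / (np.pi * (p - 0.5)))
kappa_LS_sce = omega(q) / omega(n) * np.linalg.norm(kappa_z)
```

The triangular solves make each estimate cheap (O(n^2) per vector) compared with the O(mn^2) cost of solving the LLS problem itself.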

3 Variance-covariance matrix

In many physical applications, LLS problems are expressed using a statistical model often referred to as the linear statistical model, where we have to solve

    b = Ax + ε,  A ∈ R^{m×n},  b ∈ R^m,

with ε being a vector of random errors having expected value 0 and variance-covariance σ_b^2 I. The matrix A is called the regression matrix and the unknown vector x is called the vector of regression coefficients. Following the Gauss-Markov theorem [11], the least squares estimate x̂ is the linear unbiased estimator of x satisfying

    x̂ = argmin_{x ∈ R^n} ||Ax − b||_2,

with minimum variance-covariance equal to C = σ_b^2 (A^T A)^{-1}. The diagonal elements c_ii of C give the variance of each component x̂_i, and the off-diagonal elements c_ij, i ≠ j, give the covariance between x̂_i and x̂_j. Then, instead of computing condition numbers (which are notions more commonly handled by numerical linear algebra practitioners), physicists often compute the variance-covariance matrix, whose entries are intimately correlated with the condition numbers κ_i and κ_LS mentioned previously. Once the variance-covariance matrix has been computed, the condition numbers can be obtained easily. Indeed, we can use the facts that ||R^{-1}||_2^2 = ||C||_2 / σ_b^2, ||R^{-T} e_i||_2^2 = c_ii / σ_b^2, and ||R^{-1} R^{-T} e_i||_2 = ||C_i||_2 / σ_b^2, where C_i and c_ii are respectively the ith column and the ith diagonal element of the matrix C. Then, by substituting into Equations (1) and (3) respectively, we get the formulas

    κ_LS = ( ||C||_2^{1/2} / σ_b ) ( (m − n) ||C||_2 + ||x||_2^2 + 1 )^{1/2},    (4)

and

    κ_i = (1/σ_b) ( (m − n) ||C_i||_2^2 / σ_b^2 + c_ii ( ||x||_2^2 + 1 ) )^{1/2}.    (5)

Note that, when m > n, ||r||_2^2 / (m − n) is an unbiased estimate of σ_b^2 [7, p. 4].

4 Implementation details

We developed a set of routines that compute the following quantities using the MAGMA library (release 1.2.1):

- the variance-covariance matrix C,
- κ_LS, the condition number of x,
- κ_CW, the vector of the κ_i, the condition numbers of the solution components,
- κ̄_LS, the statistical estimate of κ_LS,
- κ̄_CW, the vector of the statistical estimates κ̄_i.
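Before detailing the implementation, here is a minimal NumPy sketch of how κ_LS and the κ_i follow from the variance-covariance matrix via Equations (4) and (5). It is illustrative only, not the MAGMA routines listed above; the function name and the use of the unbiased estimate ||r||_2^2 / (m − n) for σ_b^2 are assumptions made for the example.

```python
# Illustrative sketch of Equations (4) and (5): condition numbers computed
# from the variance-covariance matrix C (not the MAGMA implementation).
import numpy as np

def conditioning_from_covariance(C, x, m, n, sigma2):
    """sigma2 stands for sigma_b^2, e.g. the unbiased estimate (r @ r) / (m - n)."""
    sigma = np.sqrt(sigma2)
    xx = x @ x
    C_norm = np.linalg.eigvalsh(C)[-1]        # ||C||_2: largest eigenvalue of the SPD matrix C
    # Equation (4): condition number of the LLS solution x.
    kappa_LS = np.sqrt(C_norm) / sigma * np.sqrt((m - n) * C_norm + xx + 1.0)
    # Equation (5): condition numbers of the components x_i.
    col_norms2 = (C**2).sum(axis=0)           # ||C_i||_2^2 for each column i
    kappa_CW = np.sqrt((m - n) * col_norms2 / sigma2 + np.diag(C) * (xx + 1.0)) / sigma
    return kappa_LS, kappa_CW
```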

The variance-covariance computation requires inverting a triangular matrix and multiplying this triangular matrix by its transpose (similarly to the LAPACK routine DPOTRI [1, p. 6], which computes the inverse of a matrix from its Cholesky factorization). These operations use a block algorithm which, for the diagonal blocks, is performed recursively. The recursive part is performed by the CPU for the sake of performance, while the rest of the algorithm is executed on the GPU. The computation of the exact condition number κ_LS from the variance-covariance matrix using Equation (4) involves the spectral norm of C, which is generally computed via an SVD. However, since A is a full rank matrix, C is symmetric positive definite and its singular values coincide with its eigenvalues. We therefore use an eigenvalue decomposition of C, which is faster than an SVD because it takes the symmetry of C into account. The tridiagonalization phase is performed on the GPU, while the subsequent eigenvalue computation is performed on the CPU host. The statistical estimates require the generation and orthonormalization of random vectors followed by triangular solves. The random generation and the triangular solves are performed on the GPU. The orthonormalization is performed on the CPU because it is applied to small matrices (small number of samples).

5 Performance results

Our experiments were carried out on a multicore Intel Xeon E5645 processor (2 sockets, 6 cores each) running at 2.4 GHz (the cache size per core is 12 MB and the size of the main memory is 48 GB). This system hosts two NVIDIA Tesla C2075 GPUs running at 1.15 GHz with 6 GB of memory each. MAGMA was linked with MKL 10.3.8 and CUDA 4.1 for the multicore host and the GPU, respectively. We consider random LLS problems obtained using the method given in [9] for generating LLS test problems with known solution x and residual norm. We plot in Fig. 1 the CPU time to compute the LLS solution and the condition numbers using 12 threads and 1 GPU. We observe that the computation of the variance-covariance matrix and of the component condition numbers κ_i is significantly faster than solving the problem, with time factors larger than 3 and 2 respectively, these factors increasing with the problem size. The κ_i are computed from the variance-covariance matrix via Equation (5). The time overhead between the computation of the κ_i and the variance-covariance computation comes from the computation of the column norms (routine cublasDnrm2), which has a non-optimal implementation. As expected, the routines SCE_LLS and SCE_LLS_CW, which compute statistical condition estimates for the solution and for all solution components respectively, outperform the other routines. Note that we did not include in this graph the performance for computing κ_LS using Equation (4). Indeed, this involves an eigenvalue decomposition of the variance-covariance matrix (MAGMA routine magma_dsyevd_gpu), which turns out to be much slower than the LLS solution (MAGMA routine magma_dgels3_gpu) in spite of a smaller number of flops (O(n^3) vs. O(mn^2)), which shows that having an efficient implementation on the targeted architecture is essential to take advantage of the gain in flops.
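The following small NumPy/SciPy sketch mirrors, on the CPU and at a much smaller scale, the two steps described above: forming C = σ_b^2 R^{-1} R^{-T} by triangular inversion and a transpose multiply, and obtaining ||C||_2 from a symmetric eigendecomposition rather than an SVD. It is not the MAGMA kernels; the function name and the use of LAPACK's dtrtri through SciPy are illustrative assumptions.

```python
# Illustrative CPU-only sketch (not the MAGMA kernels): variance-covariance
# matrix from the R factor, and its spectral norm via a symmetric eigensolver.
import numpy as np
from scipy.linalg import lapack

def covariance_from_R(R, sigma2):
    Rinv, info = lapack.dtrtri(R)        # invert the (upper) triangular factor
    if info != 0:
        raise np.linalg.LinAlgError("R is singular")
    return sigma2 * (Rinv @ Rinv.T)      # transpose multiply: C = sigma_b^2 * R^{-1} R^{-T}

# Spectral norm without an SVD: since C is symmetric positive definite,
# ||C||_2 is its largest eigenvalue.
# C = covariance_from_R(R, sigma2); C_norm = np.linalg.eigvalsh(C)[-1]
```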

[Fig. 1: Performance for computing LLS condition numbers with MAGMA. The plot shows the time in seconds against the problem size (m, n) for the LLS solution, the conditioning of all components, the variance-covariance matrix, SCE_LLS_CW, and SCE_LLS.]

We can further illustrate this point by comparing in Fig. 2 the time for computing an LLS solution and its conditioning using LAPACK and MAGMA. We observe that MAGMA provides a faster solution and a faster condition number but that, contrary to LAPACK, the computation of the condition number is slower than the solution, in spite of a smaller flop count. This shows the need for improving the Gflop/s performance of eigensolvers and SVD solvers on GPUs, but it also confirms the interest of considering statistical estimates on multicore-GPU architectures to obtain fast computations.

6 Conclusion

We proposed new implementations for computing LLS condition numbers using the software libraries LAPACK and MAGMA. The performance results that we obtained on a current multicore-GPU system confirmed the interest of using statistical condition estimates. New routines will be integrated in the next releases of LAPACK and MAGMA to compute the variance-covariance matrix after a linear regression.

[Fig. 2: Time for LLS solution and condition number. The plot compares the time in seconds against the problem size (m, n) for the LLS solution and the exact condition number, each computed with LAPACK and with MAGMA.]

References

1. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, third edn. SIAM, Philadelphia (1999)
2. Arioli, M., Baboulin, M., Gratton, S.: A partial condition number for linear least-squares problems. SIAM J. Matrix Analysis and Applications 29(2), 413-433 (2007)
3. Baboulin, M., Dongarra, J., Gratton, S., Langou, J.: Computing the conditioning of the components of a linear least squares solution. Numerical Linear Algebra with Applications 16(7), 517-533 (2009)
4. Baboulin, M., Dongarra, J., Tomov, S.: Some issues in dense linear algebra for multicore and special purpose architectures. In: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA'08), Lecture Notes in Computer Science, vol. 6126-6127. Springer-Verlag (2008)
5. Baboulin, M., Giraud, L., Gratton, S., Langou, J.: Parallel tools for solving incremental dense least squares problems: application to space geodesy. Journal of Algorithms and Computational Technology 3(1), 117-133 (2009)
6. Baboulin, M., Gratton, S., Lacroix, R., Laub, A.J.: Statistical estimates for the conditioning of linear least squares problems. In: Proceedings of the 10th International Conference on Parallel Processing and Applied Mathematics (PPAM 2013) (2013)
7. Björck, A.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
8. Kenney, C.S., Laub, A.J., Reese, M.S.: Statistical condition estimation for linear least squares. SIAM J. Matrix Analysis and Applications 19(4), 906-923 (1998)
9. Paige, C.C., Saunders, M.A.: LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw. 8(1), 43-71 (1982)
10. Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing 36(5&6), 232-240 (2010)
11. Zelen, M.: Linear estimation and related topics. In: J. Todd (ed.) Survey of Numerical Analysis, pp. 558-584. McGraw-Hill, New York (1962)