GPU acceleration of Newton's method for large systems of polynomial equations in double double and quad double arithmetic


GPU acceleration of Newton's method for large systems of polynomial equations in double double and quad double arithmetic
Jan Verschelde, joint work with Xiangcheng Yu
University of Illinois at Chicago, Department of Mathematics, Statistics, and Computer Science
http://www.math.uic.edu/~jan jan@math.uic.edu
The 16th IEEE International Conference on High Performance Computing and Communications, 20-22 August 2014, Paris, France

Outline
1 Solving Polynomial Systems Numerically
  compensating for the cost of double doubles and quad doubles
  how large? sizing up the problem dimensions
2 Polynomial System Evaluation and Differentiation
  the reverse mode of algorithmic differentiation
  arithmetic circuits
  computational results
3 A Massively Parallel Modified Gram-Schmidt Method
  implementing a right-looking algorithm
4 Computational Results
  the Chandrasekhar H-Equation
  the cyclic n-roots problem


solving polynomial systems numerically
Problem statement: when solving large polynomial systems, hardware double precision may not be sufficient for accurate solutions.
Our goal: accelerate the computations with general purpose Graphics Processing Units (GPUs) to compensate for the overhead caused by double double and quad double arithmetic.
Our first results (jointly with Genady Yoffe) on pursuing this goal with multicore computers are in the PASCO 2010 proceedings.
Our focus is on Newton's method:
1 To evaluate and differentiate the polynomial systems, we apply the reverse mode of algorithmic differentiation.
2 To solve linear systems in the least squares sense, we apply the modified Gram-Schmidt method.

quad double precision
A quad double is an unevaluated sum of 4 doubles; it improves the working precision from 2.2 × 10^-16 to 2.4 × 10^-63.
Y. Hida, X.S. Li, and D.H. Bailey: Algorithms for quad-double precision floating point arithmetic. In the 15th IEEE Symposium on Computer Arithmetic, pages 155-162. IEEE, 2001. Software at http://crd.lbl.gov/~dhbailey/mpdist/qd-2.3.9.tar.gz.
Predictable overhead: working with double doubles costs about the same as working with complex numbers. Simple memory management.
The QD library has been ported to the GPU by M. Lu, B. He, and Q. Luo: Supporting extended precision on graphics processors. In the Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), pages 19-26, 2010. Software at http://code.google.com/p/gpuprec/.
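To make the unevaluated-sum representation concrete, here is a minimal sketch of double double addition built on Knuth's error-free two-sum. The struct and function names are illustrative stand-ins, not the QD library's API:

```cpp
// A minimal sketch of a double double as an unevaluated sum hi + lo.
#include <cstdio>

struct dd { double hi, lo; };   // value = hi + lo, with |lo| << |hi|

// Knuth's error-free two-sum: s + e equals a + b exactly.
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double v = s - a;
    e = (a - (s - v)) + (b - v);
}

// Addition of two double doubles with one renormalization step.
static dd dd_add(dd a, dd b) {
    double s, e, hi, lo;
    two_sum(a.hi, b.hi, s, e);  // exact sum of the high parts
    e += a.lo + b.lo;           // fold in the low parts
    two_sum(s, e, hi, lo);      // renormalize to hi + lo
    dd r = { hi, lo };
    return r;
}

int main() {
    dd x = { 1.0, 1e-20 }, y = { 2.0, -3e-21 };
    dd z = dd_add(x, y);
    printf("hi = %.17g, lo = %.17g\n", z.hi, z.lo);
    return 0;
}
```

Every operation above is an ordinary double operation, which is why the overhead of double double arithmetic is predictable.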


about large polynomial systems: how large?
The number of isolated solutions and/or the degrees of positive dimensional solution sets may grow exponentially in the dimensions of the input: the degrees of the equations and the number of variables.
Because of this exponential growth of the output (e.g. a system of 20 quadrics can have 2^20 solutions), global solving is limited to rather small dimensions.
In local solving, we focus on one solution and may increase the dimension to several hundreds and even thousands.
One limitation is the storage needed for all exponent vectors and coefficients of the monomials. Most systems arising in applications are sparse: the number of monomials grows modestly in the number of variables.

some related work in algorithmic differentiation
M. Grabner, T. Pock, T. Gross, and B. Kainz. Automatic differentiation for GPU-accelerated 2D/3D registration. In Advances in Automatic Differentiation, pages 259-269. Springer, 2008.
G. Kozikowski and B.J. Kubica. Interval arithmetic and automatic differentiation on GPU using OpenCL. In PARA 2012, LNCS 7782, pages 489-503. Springer, 2013.
some related work in polynomial system solving
R.A. Klopotek and J. Porter-Sobieraj. Solving systems of polynomial equations on a GPU. In Preprints of the Federated Conference on Computer Science and Information Systems, September 9-12, 2012, Wroclaw, Poland, pages 567-572, 2012.
M.M. Maza and W. Pan. Solving bivariate polynomial systems on a GPU. ACM Communications in Computer Algebra, 45(2):127-128, 2011.

some related work in numerical linear algebra
M. Anderson, G. Ballard, J. Demmel, and K. Keutzer. Communication-avoiding QR decomposition for GPUs. In Proceedings of IPDPS 2011, pages 48-58. IEEE Computer Society, 2011.
T. Bartkewitz and T. Güneysu. Full lattice basis reduction on graphics cards. In WEWoRC'11 Proceedings, LNCS 7242, pages 30-44. Springer, 2012.
J. Demmel, Y. Hida, X.S. Li, and E.J. Riedy. Extra-precise iterative refinement for overdetermined least squares problems. ACM Trans. Math. Softw., 35(4):28:1-28:32, 2009.
D. Mukunoki and D. Takahashi. Implementation and evaluation of triple precision BLAS subroutines on GPUs. In Proceedings of PDSEC 2012, pages 1372-1380. IEEE Computer Society, 2012.
V. Volkov and J. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008.

hardware and software
Our main target: the NVIDIA Tesla K20C, with 2496 cores at 706 MHz, hosted by a RHEL workstation of Microway with an Intel Xeon E5-2670 at 2.6 GHz. We used gcc 4.4.7 and nvcc 5.5.
Our other computer is an HP Z800 RHEL workstation with a 3.47 GHz Intel Xeon X5690, hosting an NVIDIA Tesla C2050 with 448 cores at 1147 MHz. Again we used gcc 4.4.7 and nvcc 5.5.
The C++ code for the Gram-Schmidt method that runs on the host is directly based on the pseudo code and compiled with the -O2 flag. Our serial C++ code served mainly to verify correctness.


polynomial evaluation and differentiation
We distinguish three stages:
1 Common factors and tables of power products:
$x_1^{d_1} x_2^{d_2} \cdots x_n^{d_n} = x_{i_1} x_{i_2} \cdots x_{i_k} \cdot x_{j_1}^{e_{j_1}} x_{j_2}^{e_{j_2}} \cdots x_{j_\ell}^{e_{j_\ell}}$
The factor $x_{j_1}^{e_{j_1}} x_{j_2}^{e_{j_2}} \cdots x_{j_\ell}^{e_{j_\ell}}$ is common to all partial derivatives. The factors are evaluated as products of pure powers of the variables, computed in shared memory by each block of threads.
2 Evaluation and differentiation of products of variables: computing the gradient of $x_1 x_2 \cdots x_n$ with the reverse mode of algorithmic differentiation requires $3n - 5$ multiplications; see the sketch after this list.
3 Coefficient multiplication and term summation. Summation jobs are ordered by the number of terms so that each warp has the same amount of terms to sum.
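As a plain C++ illustration of stage 2 (a serial sketch, not the GPU kernel), the following function evaluates the product of all variables and its gradient in exactly $3n - 5$ multiplications, assuming $n \geq 2$:

```cpp
// Reverse mode for the gradient of x[0]*x[1]*...*x[n-1]:
// a forward pass of prefix products, then a backward pass that
// accumulates suffix products. Total: 3n - 5 multiplications.
#include <vector>

double product_and_gradient(const std::vector<double> &x,
                            std::vector<double> &grad) {
    const int n = (int)x.size();                // assume n >= 2
    std::vector<double> fwd(n);                 // fwd[i] = x[0]*...*x[i]
    fwd[0] = x[0];
    for (int i = 1; i < n; ++i)
        fwd[i] = fwd[i-1] * x[i];               // n-1 multiplications
    grad.assign(n, 0.0);
    grad[n-1] = fwd[n-2];                       // free: already computed
    double suffix = x[n-1];                     // x[i+1]*...*x[n-1]
    for (int i = n - 2; i > 0; --i) {
        grad[i] = fwd[i-1] * suffix;            // n-2 multiplications
        suffix *= x[i];                         // n-2 multiplications
    }
    grad[0] = suffix;                           // suffix = x[1]*...*x[n-1]
    return fwd[n-1];                            // the product itself
}
```

For n = 3 this costs 4 multiplications: two for the prefix products and one each in the backward pass, matching 3n - 5.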


arithmetic circuits
First, a circuit to evaluate the product $x_1 x_2 x_3 x_4$; then a circuit to compute its gradient:
[Figure: two arithmetic circuits on $x_1, x_2, x_3, x_4$; the second delivers the partial derivatives $x_2 x_3 x_4$, $x_1 x_3 x_4$, $x_1 x_2 x_4$, and $x_1 x_2 x_3$.]

computing the gradient of $x_1 x_2 \cdots x_8$
Denote by $x_{i:j}$ the product $x_i x_{i+1} \cdots x_j$.
[Figure: a binary tree circuit; the upward pass computes $x_{1:2}, x_{3:4}, x_{5:6}, x_{7:8}$, then $x_{1:4}, x_{5:8}$, and finally $x_{1:8}$; the downward pass delivers the partial derivatives at the leaves, e.g. $\partial/\partial x_1 = x_2\, x_{3:8}$ and $\partial/\partial x_8 = x_{1:6}\, x_7$.]
The computation of the gradient of $x_1 x_2 \cdots x_n$ requires $2n - 4$ multiplications and $n - 1$ extra memory locations. A serial sketch follows below.
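Here is a serial sketch of this tree-based scheme, assuming n is a power of two as in the example above. The upward pass costs n - 1 multiplications (which includes the product itself), and the downward pass costs the stated 2n - 4:

```cpp
// Tree-based gradient of x[0]*...*x[n-1], n a power of two, n >= 2.
// Upward pass: pairwise subtree products. Downward pass: each node
// receives the product of all leaves outside its subtree; what arrives
// at leaf i is the partial derivative with respect to x[i].
#include <vector>

void tree_gradient(const std::vector<double> &x, std::vector<double> &grad) {
    std::vector<std::vector<double> > level;    // level[0] = the leaves
    level.push_back(x);
    while (level.back().size() > 1) {           // upward pass
        const std::vector<double> &prev = level.back();
        std::vector<double> next(prev.size() / 2);
        for (size_t i = 0; i < next.size(); ++i)
            next[i] = prev[2*i] * prev[2*i + 1];
        level.push_back(next);
    }
    const int L = (int)level.size();
    // start below the root: the outside-product of each child of the
    // root is just its sibling's subtree product (no multiplication)
    std::vector<double> outer(2);
    outer[0] = level[L-2][1];
    outer[1] = level[L-2][0];
    for (int l = L - 3; l >= 0; --l) {          // downward pass
        std::vector<double> down(level[l].size());
        for (size_t i = 0; i < down.size(); ++i)
            down[i] = outer[i/2] * level[l][i ^ 1];  // parent * sibling
        outer.swap(down);
    }
    grad = outer;                               // grad[i] = product / x[i]
}
```

For n = 4 the downward pass costs 4 = 2n - 4 multiplications, and the stored tree levels account for the n - 1 extra memory locations.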


optimized parallel reduction
Evaluation and differentiation of 65,024 monomials in 1,024 doubles. Times on the K20C, obtained with nvprof (the NVIDIA profiler), are in milliseconds (ms); dividing the number of bytes read and written by the time gives the bandwidth. Times on the CPU are on one 2.6 GHz Intel Xeon E5-2670, with code optimized with the -O2 flag.

method                            time      bandwidth   speedup
CPU                           330.24ms
GPU, one thread per monomial   86.43ms                     3.82
GPU, one block per monomial    15.54ms     79.81GB/s      21.25
  + sequential addressing      14.08ms     88.08GB/s      23.45
  + unroll last warp           10.19ms    121.71GB/s      32.40
  + complete unroll             9.10ms    136.28GB/s      36.29
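For reference, the "sequential addressing" row corresponds to the classic shared-memory reduction pattern sketched below; this is a generic CUDA illustration, not the exact kernel used for the timings:

```cpp
// Sum reduction with sequential addressing: the stride halves each
// step, so active threads read contiguous shared-memory slots, which
// avoids bank conflicts and divergent warps.
// Launch (host side, illustrative):
//   reduce_sequential<<<blocks, threads, threads*sizeof(double)>>>(in, out, n);
// threads must be a power of two; out receives one partial sum per block.
__global__ void reduce_sequential(const double *in, double *out, int n) {
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0;     // one load per thread
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

The "unroll last warp" and "complete unroll" rows of the table correspond to further optimizations of this reduction loop.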

different sizes of monomials
Evaluation and differentiation of m monomials of size n each, by 65,024 blocks with 512 threads per block, for 1,024 doubles in shared memory, accelerated by the K20C; timings in milliseconds obtained by the NVIDIA profiler. Times on the CPU are on one 2.6 GHz Intel Xeon E5-2670, with code optimized with the -O2 flag.

   n    m        CPU      GPU   speedup
1024    1   330.24ms   9.12ms     36.20
 512    2   328.92ms   8.73ms     37.66
 256    4   320.78ms   8.84ms     36.29
 128    8   309.02ms   8.15ms     37.89
  64   16   289.30ms   7.27ms     39.77
  32   32   256.07ms   9.51ms     26.94
  16   64   230.34ms   8.86ms     25.99
   8  128   218.74ms   7.79ms     28.07
   4  256   202.20ms   7.05ms     28.69


modified Gram-Schmidt on the GPU
With Gram-Schmidt orthonormalization we solve linear systems in the least squares sense, which makes the solver capable of handling overdetermined problems.
In [Volkov-Demmel, 2008], the left-looking scheme is dismissed because of its limited inherent parallelism. For thread-level parallelism, we implemented the right-looking version of the modified Gram-Schmidt method; a serial sketch is below. Although left-looking is preferable for memory transfers, our computations are compute bound for complex double double, quad double, and complex quad double arithmetic.
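As a reference for the right-looking organization, here is a serial sketch on a real m-by-n matrix stored by columns; the actual code works on complex multiprecision data, and the GPU version applies the trailing-column updates in parallel, one block per column:

```cpp
// Right-looking modified Gram-Schmidt QR: as soon as a column is
// normalized into q_k, all later columns are orthogonalized against it.
#include <cmath>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;   // A[j] is column j

void mgs_right_looking(Matrix &A, Matrix &R) {      // A is m-by-n, m >= n
    const int n = (int)A.size(), m = (int)A[0].size();
    R.assign(n, std::vector<double>(n, 0.0));
    for (int k = 0; k < n; ++k) {
        double nrm = 0.0;                            // ||a_k||_2
        for (int i = 0; i < m; ++i) nrm += A[k][i] * A[k][i];
        R[k][k] = std::sqrt(nrm);
        for (int i = 0; i < m; ++i) A[k][i] /= R[k][k];   // q_k in place
        // right-looking update: immediately project q_k out of the
        // trailing columns (these inner loops are the parallel work)
        for (int j = k + 1; j < n; ++j) {
            double dot = 0.0;
            for (int i = 0; i < m; ++i) dot += A[k][i] * A[j][i];
            R[k][j] = dot;
            for (int i = 0; i < m; ++i) A[j][i] -= dot * A[k][i];
        }
    }
}
```

The trailing-column updates at step k are independent of each other, which is the inherent parallelism the left-looking variant lacks.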


the discretization of an integral equation
Formulating a polynomial system for any dimension:

$f_i(H_1, H_2, \ldots, H_n) = 2n H_i - c H_i \sum_{j=0}^{n-1} \frac{i}{i+j}\, H_j - 2n = 0, \quad i = 1, 2, \ldots, n,$

where c is some real nonzero constant, $0 < c \leq 1$.
The cost to evaluate and differentiate grows linearly in n, so the cost of Newton's method is dominated by the cost of solving a linear system, which grows as $n^3$.
The value for the parameter c we used in our experiments is 33/64. Starting at $H_i = 1$ for all i leads to a quadratically convergent process.
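A small sketch evaluating the residuals $f_1, \ldots, f_n$ at a point, in plain double arithmetic (the experiments use complex double doubles and quad doubles); the indexing convention that H[j] holds the variable paired with the denominator i + j is an assumption of this sketch:

```cpp
// Evaluate f_i(H) = 2n*H_i - c*H_i * sum_{j=0}^{n-1} (i/(i+j))*H_j - 2n.
#include <vector>

std::vector<double> h_residual(const std::vector<double> &H, double c) {
    const int n = (int)H.size();
    std::vector<double> f(n);
    for (int i = 1; i <= n; ++i) {
        double s = 0.0;                     // the sum over j = 0..n-1
        for (int j = 0; j < n; ++j)
            s += (double)i / (i + j) * H[j];
        f[i-1] = 2.0*n*H[i-1] - c*H[i-1]*s - 2.0*n;
    }
    return f;
}
```

The double loop makes the linear growth per equation visible: each of the n residuals costs O(n) operations, so evaluation and differentiation together stay O(n^2), well below the O(n^3) linear solve.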

a benchmark problem
The problem was treated with Newton's method in [1] and added to a collection of benchmark problems in [2]. In [3], the system was studied with methods in computer algebra.
1. C.T. Kelley. Solution of the Chandrasekhar H-equation by Newton's method. J. Math. Phys., 21(7):1625-1628, 1980.
2. J.J. Moré. A collection of nonlinear model problems. In Computational Solution of Nonlinear Systems of Equations, volume 26 of Lectures in Applied Mathematics, pages 723-762. AMS, 1990.
3. L. Gonzalez-Vega. Some examples of problem solving by using the symbolic viewpoint when dealing with polynomial systems of equations. In J. Fleischer, J. Grabmeier, F.W. Hehl, and W. Küchlin, editors, Computer Algebra in Science and Engineering, pages 102-116. World Scientific, 1995.

experimental setup
For a number of iterations:
1. The host evaluates and differentiates the system at the current approximation, stored in an n-by-(n+1) matrix [A b], with $b = -f(H_1, H_2, \ldots, H_n)$; print $\|b\|_1$.
2. $A \Delta x = b$ is solved in the least squares sense, either entirely by the host, or, if accelerated:
2.1 the matrix [A b] is transferred from the host to the device;
2.2 the device does a QR decomposition on [A b] and back substitution on the system $R \Delta x = y$;
2.3 the matrices Q, R, and the solution $\Delta x$ are transferred from the device to the host.
3. The host performs the update $x = x + \Delta x$ to compute the new approximation. The first components of $\Delta x$ and $x$ are printed.
A self-contained sketch of this loop follows.
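The program below is a self-contained sketch of this setup for the H-equation in plain double arithmetic. Gaussian elimination with partial pivoting stands in for the device-side Gram-Schmidt least-squares solve, the analytic Jacobian replaces the algorithmic differentiation stage, and the dimension and iteration count are illustrative choices:

```cpp
// Newton's method on the discretized H-equation, starting at H_i = 1.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 32;                        // illustrative dimension
    const double c = 33.0 / 64.0;            // parameter value from the talk
    std::vector<double> H(n, 1.0);           // start at H_i = 1 for all i
    for (int it = 1; it <= 6; ++it) {
        // build [A | b]: A = Jacobian of f at H, b = -f(H)
        std::vector<std::vector<double> > A(n, std::vector<double>(n + 1));
        for (int i = 1; i <= n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += (double)i / (i + j) * H[j];
            for (int m = 0; m < n; ++m)      // d f_i / d H_m
                A[i-1][m] = -c * H[i-1] * i / (i + m);
            A[i-1][i-1] += 2.0 * n - c * s;  // diagonal terms
            A[i-1][n] = -(2.0*n*H[i-1] - c*H[i-1]*s - 2.0*n);
        }
        // Gaussian elimination with partial pivoting on [A | b]
        for (int k = 0; k < n; ++k) {
            int p = k;
            for (int r = k + 1; r < n; ++r)
                if (std::fabs(A[r][k]) > std::fabs(A[p][k])) p = r;
            std::swap(A[k], A[p]);
            for (int r = k + 1; r < n; ++r) {
                double fac = A[r][k] / A[k][k];
                for (int col = k; col <= n; ++col) A[r][col] -= fac * A[k][col];
            }
        }
        std::vector<double> dx(n);           // back substitution
        for (int k = n - 1; k >= 0; --k) {
            double s = A[k][n];
            for (int col = k + 1; col < n; ++col) s -= A[k][col] * dx[col];
            dx[k] = s / A[k][k];
        }
        for (int i = 0; i < n; ++i) H[i] += dx[i];   // x = x + dx
        printf("it %d : dx[0] = %.3e, H[0] = %.15f\n", it, dx[0], H[0]);
    }
    return 0;
}
```

Running this shows the quadratic convergence mentioned above: the size of dx[0] roughly squares from one iteration to the next.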

accelerating on the K20C
Running six iterations of Newton's method in complex double double arithmetic on one CPU core and accelerated by the K20C with block size equal to 128, once with the evaluation and differentiation done by the CPU (GPU1) and once with all computations on the GPU (GPU2).

   n  mode         real         user          sys  speedup
1024  CPU     5m22.360s    5m21.680s       0.139s
      GPU1      24.074s      18.667s       5.203s    13.39
      GPU2      20.083s      11.564s       8.268s    16.05
2048  CPU    42m41.597s   42m37.236s       0.302s
      GPU1    2m45.084s    1m48.502s      56.175s    15.52
      GPU2    2m29.770s    1m26.373s    1m03.014s    17.10
3072  CPU   144m13.978s  144m00.880s       0.216s
      GPU1    8m50.933s    5m34.427s    3m15.608s    16.30
      GPU2    8m15.565s    4m43.333s    3m31.362s    17.46
4096  CPU   340m00.724s  339m27.019s       0.929s
      GPU1   20m26.989s   13m39.416s    6m45.799s    16.63
      GPU2   19m24.243s   11m01.558s    8m20.698s    17.52


the cyclic n-roots problem
A system relevant to operator algebras is:

$x_0 + x_1 + \cdots + x_{n-1} = 0$
$x_0 x_1 + x_1 x_2 + \cdots + x_{n-2} x_{n-1} + x_{n-1} x_0 = 0$
$i = 3, 4, \ldots, n-1 : \ \sum_{j=0}^{n-1} \ \prod_{k=j}^{j+i-1} x_{k \bmod n} = 0$
$x_0 x_1 x_2 \cdots x_{n-1} - 1 = 0$

The system is a benchmark problem in computer algebra. The cost to evaluate and differentiate the system is $O(n^3)$.
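A direct sketch of evaluating the n residuals at a point, in real arithmetic for brevity (the experiments use complex multiprecision); the triple loop makes the cubic cost of evaluation visible:

```cpp
// Residuals of the cyclic n-roots system at x[0..n-1]:
// equation i (1 <= i <= n-1) sums, over all starting indices j, the
// product of i cyclically consecutive variables; the last equation
// is the product of all variables minus one.
#include <vector>

std::vector<double> cyclic_residual(const std::vector<double> &x) {
    const int n = (int)x.size();
    std::vector<double> r(n, 0.0);
    for (int i = 1; i < n; ++i)              // equations 1 .. n-1
        for (int j = 0; j < n; ++j) {        // starting index of the run
            double t = 1.0;                  // product of i consecutive x's
            for (int k = j; k < j + i; ++k) t *= x[k % n];
            r[i-1] += t;
        }
    double p = 1.0;                          // x_0 x_1 ... x_{n-1} - 1 = 0
    for (int j = 0; j < n; ++j) p *= x[j];
    r[n-1] = p - 1.0;
    return r;
}
```

Equation i has n monomials of degree i, so summing over all equations gives on the order of n^3/2 multiplications, which is why this benchmark stresses evaluation and differentiation rather than the linear solve.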

accelerating on the K20C
Evaluation and differentiation of the cyclic n-roots problem in complex double (D), complex double double (DD), and complex quad double (QD) arithmetic. Times in milliseconds are obtained with the NVIDIA profiler.

        n          CPU        GPU  speedup
D     128      16.39ms     1.13ms    14.87
      256     136.26ms     6.97ms    19.55
      384     475.94ms    21.59ms    22.05
      448     747.44ms    32.68ms    22.87
      512    1097.20ms    46.43ms    23.63
DD    128     144.27ms     6.45ms    22.36
      256    1169.07ms    37.15ms    31.47
      384    3981.07ms   120.48ms    33.04
      448    6323.17ms   182.19ms    34.52
      512    9411.55ms   267.39ms    35.20
QD    128    1349.55ms    29.45ms    45.82
      256   10987.87ms   152.82ms    71.90
      384   37323.08ms   513.78ms    72.64
      448   59247.04ms   809.15ms    73.22

conclusions
On a massively parallel implementation of Newton's method for large polynomial systems:
When the cost to evaluate and differentiate the system is linear in the dimension, the cost of the linear algebra dominates.
For systems where the cost to evaluate and differentiate is cubic in the dimension, the computations are memory bound for double and double double arithmetic, and become compute bound for complex double double, quad double, and complex quad double arithmetic.