Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP
Sardar Anisul Haque (1), Xin Li (2), Farnam Mansouri (1), Marc Moreno Maza (1), Wei Pan (3), Ning Xie (1)
(1) University of Western Ontario, Canada  (2) Universidad Carlos III, Spain  (3) Intel Corporation, Canada
ICMS, August 8, 2014  1 / 30

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 2 / 30

Background: reducing everything to multiplication
Polynomial multiplication and matrix multiplication are at the core of many algorithms in symbolic computation. Algebraic complexity is often estimated in terms of multiplication time. At the software level, this reduction is also common (Magma, NTL). Can we do the same for SIMD-multithreaded algorithms?
Building blocks in scientific software
The Basic Linear Algebra Subprograms (BLAS) is an inspiring and successful project providing low-level kernels in linear algebra, used by LINPACK, LAPACK, MATLAB, Mathematica, Julia (among others). Other successful building-block projects include FFTW and SPIRAL. The GNU Multiple Precision Arithmetic Library plays a similar role for rational numbers and floating-point numbers. No symbolic computation software dedicated to sequential polynomial arithmetic has managed to play the unifying role of the BLAS. Could this work in the case of GPUs? 3 / 30

CUMODP Mandate (1/2)
Functionalities
Level 1: basic arithmetic operations that are specific to a polynomial representation or a coefficient ring: multi-dimensional FFTs/TFTs, converting integers from CRA to mixed-radix representations.
Level 2: basic arithmetic operations for dense or sparse polynomials with coefficients in Z/pZ, Z or in floating-point numbers: polynomial multiplication, polynomial division.
Level 3: advanced arithmetic operations on families of polynomials: operations based on subproduct trees, computation of subresultant chains. 4 / 30

CUMODP Mandate (2/2)
Targeted architectures
Graphics Processing Units (GPUs), with code written in CUDA. CUMODP aims at supporting BPAS (Basic Polynomial Algebra Subprograms), which targets multi-core processors and is written in CilkPlus. Thanks to our MetaFork framework, automatic translation between CilkPlus and OpenMP is possible, as well as conversions to C/C++. Unifying code for CUMODP and BPAS is conceivable (see the SPIRAL project) but highly complex (multi-core processors enforce memory consistency while GPUs do not, etc.). 5 / 30

CUMODP Design
Algorithm choice
Level 1 functions (n-d FFTs/TFTs) are highly optimized in terms of arithmetic count, locality and parallelism.
Level 2 functions provide several algorithms or implementations for the same operation: coarse-grained & fine-grained, plain & FFT-based.
Level 3 functions combine several Level 2 algorithms for achieving a given task.
Implementation techniques
At Level 2, the user can choose between algorithms minimizing work and algorithms maximizing parallelism.
At Level 3, this leads to adaptive algorithms that select appropriate Level 2 functions depending on available resources (number of cores, input data size). 6 / 30

Code organization
Main features
Developers and users have access to the same code.
Mainly 32-bit arithmetic so far.
Regression tests and benchmark scripts are also distributed.
Documentation is generated by doxygen; a manually written documentation is work in progress.
Three student theses and about 10 papers present and document CUMODP. 7 / 30

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 8 / 30

A many-core machine model
As in a CUDA program, an MCM program specifies, for each kernel, the number of thread-blocks and the number of threads per thread-block. 9 / 30

A many-core machine model: programs
An MCM program P is a directed acyclic graph (DAG), called the kernel DAG of P, whose vertices are kernels and whose edges indicate serial dependencies. Since each kernel of the program P decomposes into a finite number of thread-blocks, we map P to a second graph, called the thread-block DAG of P, whose vertex set consists of all thread-blocks of P. 10 / 30
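To make the two-level structure concrete, here is a minimal, illustrative-only data layout for such a program in C. The names (Kernel, KernelDAG) and the layout are hypothetical choices for this sketch, not part of the MCM formalism or of CUMODP's code.

```c
/* Illustrative-only layout of an MCM program: a DAG whose vertices are
 * kernels, each launching a fixed number of thread-blocks.  Edges record
 * serial dependencies between kernels. */
#include <stddef.h>

typedef struct {
    size_t num_thread_blocks;   /* thread-blocks launched by this kernel */
    size_t threads_per_block;   /* threads per thread-block              */
} Kernel;

typedef struct {
    Kernel *kernels;            /* vertices of the kernel DAG                 */
    size_t  num_kernels;
    size_t (*edges)[2];         /* edges[e] = {src, dst}: dst waits for src   */
    size_t  num_edges;
} KernelDAG;
```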

A many-core machine model: machine parameters
Machine parameters
Z: size (expressed in machine words) of the local memory of any SM.
U: time (expressed in clock cycles) to transfer one machine word between the global memory and the local memory of any SM (assume U > 1).
Thus, if α and β are the maximum number of words respectively read from and written to the global memory by a thread of a thread-block, then the data transfer time T_D between the global and local memory of a thread-block satisfies T_D ≤ (α + β) U. 11 / 30

A many-core machine model: complexity measures (1/2)
For any kernel K of an MCM program:
the work W(K) is the total number of local operations of all its threads;
the span S(K) is the maximum number of local operations of one thread.
For the entire program P:
the work W(P) is the total work of all its kernels;
the span S(P) is the longest path in the kernel DAG, counting the weight (span) of each vertex (kernel). 12 / 30

A many-core machine model: complexity measures (2/2)
For any kernel K of an MCM program, the parallelism overhead O(K) is the total data transfer time among all its thread-blocks.
For the entire program P, the parallelism overhead O(P) is the total parallelism overhead of all its kernels. 13 / 30
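Combining this definition with the per-thread-block transfer bound of the previous slide gives a simple upper bound; the numbers below are hypothetical and serve only to illustrate the bookkeeping.

```latex
% Hypothetical kernel K: b thread-blocks, each thread reading at most
% \alpha and writing at most \beta words of global memory.
\[
  O(K) \;\le\; b\,(\alpha+\beta)\,U,
  \qquad\text{e.g.}\quad
  b = 2^{10},\ \alpha = 2,\ \beta = 1
  \;\Longrightarrow\;
  O(K) \le 3\cdot 2^{10}\,U .
\]
```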

A many-core machine model: running time estimates
A Graham-Brent theorem with parallelism overhead
Theorem. Let K be the maximum number of thread-blocks along an anti-chain of the thread-block DAG of P. Then the running time T_P of the program P satisfies
T_P ≤ (N(P)/K + L(P)) · C(P),
where N(P) is the number of vertices in the thread-block DAG and L(P) is the critical path length (the length of a path being its number of edges) in the thread-block DAG.
Note: the original Graham-Brent theorem states that a greedy scheduler operating with P processors executes a multithreaded computation with work T_1 and span T_∞ in time T_P ≤ T_1/P + T_∞. 14 / 30
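A hypothetical instance of the theorem (numbers chosen only to illustrate how the bound is evaluated, not taken from CUMODP):

```latex
% Thread-block DAG with N(P) = 2^{12} thread-blocks, critical path
% length L(P) = 24, and at most K = 2^8 blocks along an anti-chain.
\[
  T_P \;\le\; \Bigl(\tfrac{N(P)}{K} + L(P)\Bigr)\, C(P)
      \;=\; \bigl(2^{12}/2^{8} + 24\bigr)\, C(P)
      \;=\; 40\, C(P).
\]
```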

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 15 / 30

Applying the MCM model
Applying MCM to minimize parallelism overheads by determining an appropriate value range for a given program parameter:
In each case, a program P(s) depends on a parameter s which varies in a range S around an initial value s_0; the work ratio W_{s_0}/W_s remains essentially constant, while the parallelism overhead O_s varies more substantially, say O_{s_0}/O_s ∈ Θ(s/s_0).
Then, we determine a value s_min ∈ S maximizing the ratio O_{s_0}/O_s.
Next, we use our version of the Graham-Brent theorem to check whether the upper bound for the running time of P(s_min) is less than that of P(s_0). If this holds, we view P(s_min) as a solution of our algorithm optimization problem (in terms of parallelism overheads). 16 / 30

MCM applied to plain univariate polynomial division
Applying MCM to the polynomial division operation
Naive division algorithm (s = 1): each kernel performs one division step.
Optimized division algorithm (s > 1): each kernel performs s division steps.
The analysis yields
W_1/W_s = 8(Z + 1) / (9Z + 7),   O_1/O_s = (20/441) Z,   T_1/T_s = (3 + 5U) Z / (3(Z + 21U)).
T_1/T_s > 1 holds if and only if Z > 12.6, which clearly holds on actual GPU architectures. Thus, the optimized algorithm (s > 1) is overall better than the naive one (s = 1). 17 / 30
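The threshold quoted above follows directly from the ratio T_1/T_s; the one-line check (using U > 0) is spelled out here for completeness:

```latex
\[
  \frac{T_1}{T_s} = \frac{(3+5U)\,Z}{3\,(Z+21U)} > 1
  \;\Longleftrightarrow\; 3Z + 5UZ > 3Z + 63U
  \;\Longleftrightarrow\; 5UZ > 63U
  \;\Longleftrightarrow\; Z > 12.6 .
\]
```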

MCM applied to plain univariate polynomial multiplication and the Euclidean algorithm (1/2)
Applying MCM to plain multiplication and the Euclidean algorithm
For plain polynomial multiplication, the analysis suggested minimizing s.
For the Euclidean algorithm, the analysis suggested maximizing the program parameter s.
Both are verified experimentally.
(Plots: plain polynomial multiplication; the Euclidean algorithm.) 18 / 30

MCM applied to plain univariate polynomial multiplication and the Euclidean algorithm (2/2)
CUMODP vs NTL
CUMODP: plain but parallel algorithms. NTL: asymptotically fast FFT-based but serial algorithms. Our experiments were run on an NVIDIA Tesla M2050 GPU and an Intel Xeon X5650 CPU.
(Plots: CUMODP plain polynomial division vs NTL FFT-based polynomial division; CUMODP plain Euclidean algorithm vs NTL FFT-based polynomial GCD.) 19 / 30

FFT-based multiplication
Our GPU implementation relies on the Stockham FFT algorithm. Let d be the degree; the FFT size is n = 2^⌈log₂(2d+1)⌉. Based on the MCM model, the work is 15 n log₂(n) + 2n, the span is 15 log₂(n) + 2, and the parallelism overhead is (36n + 21) U. (A host-side sketch of the pipeline follows the table.)

Size   CUMODP   FLINT    Speedup
2^12   0.0032   0.003
2^13   0.0023   0.008     3.441
2^14   0.0039   0.013     3.346
2^15   0.0032   0.023     7.216
2^16   0.0065   0.045     6.942
2^17   0.0084   0.088    10.475
2^18   0.0122   0.227    18.468
2^19   0.0198   0.471    23.738
2^20   0.0266   1.011    27.581
2^21   0.0718   2.086    29.037
2^22   0.1451   4.419    30.454
2^23   0.3043   9.043    29.717

Table: Running time (in sec.) for FFT-based polynomial multiplication: CUMODP vs FLINT. 20 / 30
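For readers unfamiliar with the overall structure, here is a minimal host-side sketch of FFT-based multiplication over Z/pZ. It is not CUMODP's code (CUMODP's kernels are Stockham FFTs written in CUDA); fft_mod_p and inv_fft_mod_p are assumed, hypothetical transform routines, and p is assumed to fit in 32 bits so that products fit in 64-bit words (consistent with the "mainly 32-bit arithmetic" note above).

```c
/* Minimal sketch of FFT-based multiplication over Z/pZ (p < 2^32).
 * fft_mod_p() and inv_fft_mod_p() are assumed in-place length-n
 * transforms modulo p; they are placeholders, not CUMODP's API. */
#include <stdint.h>
#include <stddef.h>

void fft_mod_p(uint64_t *a, size_t n, uint64_t p);      /* assumed */
void inv_fft_mod_p(uint64_t *a, size_t n, uint64_t p);  /* assumed */

/* c[0..n-1] := a*b mod p, where a and b are zero-padded to length n
 * and n is a power of 2 at least deg(a*b) + 1. */
void fft_based_mul(uint64_t *a, uint64_t *b, uint64_t *c,
                   size_t n, uint64_t p)
{
    fft_mod_p(a, n, p);                   /* forward transform of a    */
    fft_mod_p(b, n, p);                   /* forward transform of b    */
    for (size_t i = 0; i < n; i++)        /* pointwise modular product */
        c[i] = (a[i] * b[i]) % p;
    inv_fft_mod_p(c, n, p);               /* inverse transform         */
}
```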

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 21 / 30

Parallelizing subproduct-tree techniques
Challenges for a parallel implementation:
The divide-and-conquer formulation of the operations is not sufficient to provide enough parallelism; one must also parallelize the underlying polynomial arithmetic operations.
The degrees of the involved polynomials vary greatly during the course of the execution of the operations (subproduct tree construction, evaluation or interpolation), and so does the workload of the tasks, which makes those algorithms complex to implement on many-core GPUs. 22 / 30

Subproduct trees
Subproduct tree technique
Split the point set U = {u_0, ..., u_{n-1}} with n = 2^k into two halves of equal cardinality and proceed recursively with each of the two halves. This leads to a complete binary tree M_n of depth k having the points u_0, ..., u_{n-1} as leaves.
Let m_i = x - u_i and define each non-leaf node of the tree as the polynomial product
M_{i,j} = m_{j·2^i} · m_{j·2^i+1} ··· m_{j·2^i+2^i-1} = ∏_{0 ≤ l < 2^i} m_{j·2^i+l}.
For each level of the binary tree below a threshold H, we compute the subproducts using plain multiplication; for each level above the threshold H, we compute the subproducts using FFT-based multiplication. (A sketch of the construction follows.) 23 / 30
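The following is a minimal sequential C sketch of the construction, using plain (schoolbook) multiplication at every level, i.e. ignoring the plain/FFT threshold H and the GPU parallelism discussed above. The flat per-level layout and the names plain_mul and build_subproduct_tree are choices made for this illustration, not CUMODP's data structures; p is assumed to fit in 32 bits.

```c
/* Subproduct-tree construction over Z/pZ, plain multiplication only.
 * Level i stores 2^(k-i) polynomials of degree 2^i (2^i + 1 coefficients). */
#include <stdint.h>
#include <stdlib.h>

/* c[0..da+db] := a*b mod p, with deg a = da, deg b = db (schoolbook). */
static void plain_mul(const uint64_t *a, int da, const uint64_t *b, int db,
                      uint64_t *c, uint64_t p)
{
    for (int i = 0; i <= da + db; i++) c[i] = 0;
    for (int i = 0; i <= da; i++)
        for (int j = 0; j <= db; j++)
            c[i + j] = (c[i + j] + a[i] * b[j]) % p;
}

/* Build the tree for n = 2^k points u[0..n-1]; level[i] holds the
 * concatenated coefficient vectors of M_{i,0}, ..., M_{i,2^(k-i)-1}. */
uint64_t **build_subproduct_tree(const uint64_t *u, int k, uint64_t p)
{
    int n = 1 << k;
    uint64_t **level = malloc((size_t)(k + 1) * sizeof *level);

    /* Level 0: the leaves m_j = x - u_j, stored as [p - u_j, 1]. */
    level[0] = malloc((size_t)n * 2 * sizeof **level);
    for (int j = 0; j < n; j++) {
        level[0][2 * j]     = (p - u[j] % p) % p;
        level[0][2 * j + 1] = 1;
    }
    /* Level i: M_{i,j} = M_{i-1,2j} * M_{i-1,2j+1}. */
    for (int i = 1; i <= k; i++) {
        int deg = 1 << i, half = deg / 2, nodes = n >> i;
        level[i] = malloc((size_t)nodes * (deg + 1) * sizeof **level);
        for (int j = 0; j < nodes; j++)
            plain_mul(level[i - 1] + (size_t)(2 * j) * (half + 1), half,
                      level[i - 1] + (size_t)(2 * j + 1) * (half + 1), half,
                      level[i] + (size_t)j * (deg + 1), p);
    }
    return level;
}
```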

Subinverse trees
Associated with the subproduct tree M_n (n = 2^k), the subinverse tree InvM_n is a complete binary tree with the same height as M_n. For the j-th node of level i in InvM_n, with 0 ≤ i ≤ k and 0 ≤ j < 2^{k-i}, we have
InvM_{i,j} ≡ rev_{2^i}(M_{i,j})^{-1} mod x^{2^i},  where rev_{2^i}(M_{i,j}) = x^{2^i} M_{i,j}(1/x). 24 / 30
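A small worked instance of this definition (the node is generic; the computation is straightforward power-series inversion modulo x^2):

```latex
% Level-1 node M_{1,0} = (x - u_0)(x - u_1):
\[
  M_{1,0} = x^2 - (u_0+u_1)\,x + u_0 u_1,
  \qquad
  \mathrm{rev}_2(M_{1,0}) = x^2\,M_{1,0}(1/x) = 1 - (u_0+u_1)\,x + u_0 u_1\,x^2 .
\]
% Inverting modulo x^2 (the x^2 term vanishes):
\[
  \mathrm{InvM}_{1,0} \;\equiv\; \mathrm{rev}_2(M_{1,0})^{-1}
  \;\equiv\; 1 + (u_0+u_1)\,x \pmod{x^2},
\]
% since (1 - cx)(1 + cx) = 1 - c^2 x^2 \equiv 1 (mod x^2), with c = u_0 + u_1.
```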

Multi-point evaluation & interpolation: estimates
Evaluation: given a univariate polynomial f ∈ K[x] of degree less than n = 2^k and evaluation points u_0, ..., u_{n-1} ∈ K, compute (f(u_0), ..., f(u_{n-1})). (Recall the threshold H in the subproduct and subinverse trees.)
the work is O(n log₂²(n) + n log₂(n) + n 2^H),
the span is O(log₂²(n) + log₂(n) + 2^H), and
the parallelism overhead is O((log₂²(n) + log₂(n) + H) U).
Interpolation: given distinct points u_0, ..., u_{n-1} ∈ K and arbitrary values v_0, ..., v_{n-1} ∈ K, compute the unique polynomial f ∈ K[x] of degree less than n = 2^k that takes the value v_i at the point u_i for all i.
the work is O(n log₂²(n) + n log₂(n) + n 2^H),
the span is O(log₂²(n) + log₂(n) + 2^H), and
the parallelism overhead is O(log₂²(n) + log₂(n) + H). 25 / 30

Multi-point evaluation & interpolation: benchmarks

Deg.   CUMODP (eval)  FLINT (eval)  Speedup   CUMODP (interp)  FLINT (interp)  Speedup
2^14   0.2034         0.17                    0.2548           0.22
2^15   0.2415         0.41          1.6971    0.3073           0.53            1.7242
2^16   0.3126         0.99          3.1666    0.4026           1.26            3.1294
2^17   0.4285         2.33          5.4375    0.5677           2.94            5.1780
2^18   0.7106         5.43          7.6404    0.9034           6.81            7.5379
2^19   1.0936         12.63         11.5484   1.3931           15.85           11.3768
2^20   1.9412         29.2          15.0420   2.4363           36.61           15.0268
2^21   3.6927         67.18         18.1923   4.5965           83.98           18.2702
2^22   7.4855         153.07        20.4486   9.2940           191.32          20.5851
2^23   15.796         346.44        21.9321   19.6923          432.13          21.9441

Table: Running time (in sec.) on NVIDIA Tesla C2050 for multi-point evaluation and interpolation: CUMODP vs FLINT. 26 / 30

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 27 / 30

GPU support for bivariate system solving
Bivariate polynomial system solver (based on the theory of regular chains):
Polynomial subresultant chains are computed in CUDA.
Univariate polynomial GCDs are computed in C, either by means of the plain Euclidean algorithm or by an asymptotically fast algorithm.

System      Pure C   Mostly CUDA code   Speedup
dense-70     5.22    0.50               10.26
dense-80     6.63    0.77                8.59
dense-90     8.39    1.16                7.19
dense-100   19.53    1.80               10.79
dense-110   21.41    2.57                8.33
dense-120   25.71    3.48                7.39
sparse-70    0.89    0.31                2.81
sparse-80    3.64    1.18                3.09
sparse-90    3.13    0.92                3.40
sparse-100   8.86    1.20                7.38

Table: Running time (in sec.) for bivariate system solving over a small prime field. 28 / 30

Outline 1 Overview 2 A many-core machine model for minimizing parallelism overheads 3 Putting the MCM model into practice 4 Adaptive algorithms 5 Bivariate system solving 6 Conclusion 29 / 30

Summary www.cumodp.org 30 / 30