A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang

Size: px
Start display at page:

Download "A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang"

Transcription

1 A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang ARITH 23 July 2016 AMPAR CudA Multiple Precision ARithmetic library

2 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions.

3 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions. Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. x x1

4 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs.

5 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY CudaA Multiple Precision ARithmetic library

6 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions

7 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition.

8 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition. Cons: morethanonerepresentation; existingmultiplication algorithms do not generalize well for an arbitrary number of terms; di cultrigorouserroranalysis! lack of thorough error bounds.

9 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9.

10 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17.

11 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17. Restriction: n apple 12 for single-precision and n apple 39 for double-precision.

12 5/1 Error-Free Transforms: Fast2Sum & 2MultFMA Algorithm 1 (Fast2Sum (a, b)) s RN (a + b) z RN (s a) e RN (b z) return (s, e) Algorithm 2 (2MultFMA (a, b)) p RN (a b) e fma(a, b, p) return (p, e) Requirement: e a e b ;! Uses 3 FP operations. Requirement: e a + e b e min + p 1;! Uses 2 FP operations.

13 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof.

14 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof. 2 quad-double multiplication in QD library [Hida et.al. 07]: does not straightforwardly generalize; canleadtoo(n 3 ) complexity; worstcaseerrorboundispessimistic; nocorrectnessproofispublished.

15 7/1 New multiplication algorithms requires: ulp -nonoverlapping FP expansion x =(x 0,x 1,...,x n 1 ) and y =(y 0,y 1,...,y m 1 ); ensures: ulp -nonoverlapping FP expansion =( 0, 1,..., r 1 ). Let me explain it with an example...

16 Example: n =4,m=3and r =4 8/1

17 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ;

18 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ; term-times-expansion products, x i y; error correction term, r.

19 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ;

20 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ; the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09].

21 Example: n =4,m=3and r =4 10 / 1

22 10 / 1 Example: n =4,m=3and r =4 subtract initial value;

23 10 / 1 Example: n =4,m=3and r =4 subtract initial value; apply renormalization step to B: Fast2Sum and branches; rendertheresult ulp -nonoverlapping.

24 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources:

25 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) ;

26 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; ;

27 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. ;

28 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. Tight error bound: x 0 y 0 2 (p 1)r [1 + (r +1)2 p (p 1) 2 (p 1) (1 2 (p 1) ) 2 + m + n r 2 A]. 1 2 (p 1) ;

29 12 / 1 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91]

30 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91] Table : Performance in Mops/s for multiplying two FP expansions on a Tesla K40c GPU, using CUDA 7.5 software architecture, running on a single thread of execution. precision not supported. d x,d y,d r New algorithm QD 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16, / 1

31 13 / 1 Conclusions AMPAR CudA Multiple Precision ARithmetic library Available online at: algorithm with strong regularity; based on partial products accumulation; uses a fixed-point structure that is floating-point friendly; thorough error analysis and tight error bound; natural fit for GPUs; proved to be too complex for small precisions; performance gains with increased precision.

32 Comparison Table : Performance in Mops/s for multiplying two FP expansions on the CPU; d x and d y represent the number of terms in the input expansions and d r is the size of the computed result. precision not supported d x,d y,d r New algorithm QD MPFR 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16,

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration

On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration Mioara Joldes, Valentina Popescu, Jean-Michel Muller ASAP 2014 1 / 10 Motivation Numerical problems

More information

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu

Tight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic

More information

MANY numerical problems in dynamical systems

MANY numerical problems in dynamical systems IEEE TRANSACTIONS ON COMPUTERS, VOL., 201X 1 Arithmetic algorithms for extended precision using floating-point expansions Mioara Joldeş, Olivier Marty, Jean-Michel Muller and Valentina Popescu Abstract

More information

Accurate polynomial evaluation in floating point arithmetic

Accurate polynomial evaluation in floating point arithmetic in floating point arithmetic Université de Perpignan Via Domitia Laboratoire LP2A Équipe de recherche en Informatique DALI MIMS Seminar, February, 10 th 2006 General motivation Provide numerical algorithms

More information

On various ways to split a floating-point number

On various ways to split a floating-point number On various ways to split a floating-point number Claude-Pierre Jeannerod Jean-Michel Muller Paul Zimmermann Inria, CNRS, ENS Lyon, Université de Lyon, Université de Lorraine France ARITH-25 June 2018 -2-

More information

Implementation of float float operators on graphic hardware

Implementation of float float operators on graphic hardware Implementation of float float operators on graphic hardware Guillaume Da Graça David Defour DALI, Université de Perpignan Outline Why GPU? Floating point arithmetic on GPU Float float format Problem &

More information

Optimizing the Representation of Intervals

Optimizing the Representation of Intervals Optimizing the Representation of Intervals Javier D. Bruguera University of Santiago de Compostela, Spain Numerical Sofware: Design, Analysis and Verification Santander, Spain, July 4-6 2012 Contents 1

More information

Short Division of Long Integers. (joint work with David Harvey)

Short Division of Long Integers. (joint work with David Harvey) Short Division of Long Integers (joint work with David Harvey) Paul Zimmermann October 6, 2011 The problem to be solved Divide efficiently a p-bit floating-point number by another p-bit f-p number in the

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Complement Arithmetic

Complement Arithmetic Complement Arithmetic Objectives In this lesson, you will learn: How additions and subtractions are performed using the complement representation, What is the Overflow condition, and How to perform arithmetic

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal

The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal -1- The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal Claude-Pierre Jeannerod Jean-Michel Muller Antoine Plet Inria,

More information

Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function

Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function Jean-Michel Muller collaborators: S. Graillat, C.-P. Jeannerod, N. Louvet V. Lefèvre

More information

4ccurate 4lgorithms in Floating Point 4rithmetic

4ccurate 4lgorithms in Floating Point 4rithmetic 4ccurate 4lgorithms in Floating Point 4rithmetic Philippe LANGLOIS 1 1 DALI Research Team, University of Perpignan, France SCAN 2006 at Duisburg, Germany, 26-29 September 2006 Philippe LANGLOIS (UPVD)

More information

Computation of dot products in finite fields with floating-point arithmetic

Computation of dot products in finite fields with floating-point arithmetic Computation of dot products in finite fields with floating-point arithmetic Stef Graillat Joint work with Jérémy Jean LIP6/PEQUAN - Université Pierre et Marie Curie (Paris 6) - CNRS Computer-assisted proofs

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Solving Triangular Systems More Accurately and Efficiently

Solving Triangular Systems More Accurately and Efficiently 17th IMACS World Congress, July 11-15, 2005, Paris, France Scientific Computation, Applied Mathematics and Simulation Solving Triangular Systems More Accurately and Efficiently Philippe LANGLOIS and Nicolas

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017

9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin

More information

MANY numerical problems in dynamical systems or

MANY numerical problems in dynamical systems or IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 4, APRIL 2016 1197 Arithmetic Algorithms for Extended Precision Using Floating-Point Expansions Mioara Joldeş, Olivier Marty, Jean-Michel Muller, Senior Member,

More information

Lecture 11. Advanced Dividers

Lecture 11. Advanced Dividers Lecture 11 Advanced Dividers Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 15 Variation in Dividers 15.3, Combinational and Array Dividers Chapter 16, Division

More information

How fast can we add (or subtract) two numbers n and m?

How fast can we add (or subtract) two numbers n and m? Addition and Subtraction How fast do we add (or subtract) two numbers n and m? How fast can we add (or subtract) two numbers n and m? Definition. Let A(d) denote the maximal number of steps required to

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Complex Logarithmic Number System Arithmetic Using High-Radix Redundant CORDIC Algorithms

Complex Logarithmic Number System Arithmetic Using High-Radix Redundant CORDIC Algorithms Complex Logarithmic Number System Arithmetic Using High-Radix Redundant CORDIC Algorithms David Lewis Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Remainders. We learned how to multiply and divide in elementary

Remainders. We learned how to multiply and divide in elementary Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and

More information

Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic.

Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic. Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic. Mourad Gouicem PEQUAN Team, LIP6/UPMC Nancy, France May 28 th 2013

More information

Two case studies of Monte Carlo simulation on GPU

Two case studies of Monte Carlo simulation on GPU Two case studies of Monte Carlo simulation on GPU National Institute for Computational Sciences University of Tennessee Seminar series on HPC, Feb. 27, 2014 Outline 1 Introduction 2 Discrete energy lattice

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2019 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

CRYPTOGRAPHIC COMPUTING

CRYPTOGRAPHIC COMPUTING CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,

More information

UNIT V FINITE WORD LENGTH EFFECTS IN DIGITAL FILTERS PART A 1. Define 1 s complement form? In 1,s complement form the positive number is represented as in the sign magnitude form. To obtain the negative

More information

Accurate Solution of Triangular Linear System

Accurate Solution of Triangular Linear System SCAN 2008 Sep. 29 - Oct. 3, 2008, at El Paso, TX, USA Accurate Solution of Triangular Linear System Philippe Langlois DALI at University of Perpignan France Nicolas Louvet INRIA and LIP at ENS Lyon France

More information

Towards Fast, Accurate and Reproducible LU Factorization

Towards Fast, Accurate and Reproducible LU Factorization Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3

More information

Numbering Systems. Computational Platforms. Scaling and Round-off Noise. Special Purpose. here that is dedicated architecture

Numbering Systems. Computational Platforms. Scaling and Round-off Noise. Special Purpose. here that is dedicated architecture Computational Platforms Numbering Systems Basic Building Blocks Scaling and Round-off Noise Computational Platforms Viktor Öwall viktor.owall@eit.lth.seowall@eit lth Standard Processors or Special Purpose

More information

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation

More information

Faster arbitrary-precision dot product and matrix multiplication

Faster arbitrary-precision dot product and matrix multiplication Faster arbitrary-precision dot product and matrix multiplication Fredrik Johansson LFANT Inria Bordeaux Talence, France fredrik.johansson@gmail.com arxiv:1901.04289v1 [cs.ms] 14 Jan 2019 Abstract We present

More information

Efficient Reproducible Floating Point Summation and BLAS

Efficient Reproducible Floating Point Summation and BLAS Efficient Reproducible Floating Point Summation and BLAS Peter Ahrens Hong Diep Nguyen James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No.

More information

Parallel Polynomial Evaluation

Parallel Polynomial Evaluation Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu

More information

High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA

High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University of Singapore (NUS) Sept 10 th 2018 CHES 2018 FHE The holy grail

More information

Topic 17. Analysis of Algorithms

Topic 17. Analysis of Algorithms Topic 17 Analysis of Algorithms Analysis of Algorithms- Review Efficiency of an algorithm can be measured in terms of : Time complexity: a measure of the amount of time required to execute an algorithm

More information

Chapter 4 Number Representations

Chapter 4 Number Representations Chapter 4 Number Representations SKEE2263 Digital Systems Mun im/ismahani/izam {munim@utm.my,e-izam@utm.my,ismahani@fke.utm.my} February 9, 2016 Table of Contents 1 Fundamentals 2 Signed Numbers 3 Fixed-Point

More information

Parallel Reproducible Summation

Parallel Reproducible Summation Parallel Reproducible Summation James Demmel Mathematics Department and CS Division University of California at Berkeley Berkeley, CA 94720 demmel@eecs.berkeley.edu Hong Diep Nguyen EECS Department University

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Espen Stenersen Master of Science in Electronics Submission date: June 2008 Supervisor: Per Gunnar Kjeldsberg, IET Co-supervisor: Torstein

More information

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign

More information

Software implementation of ECC

Software implementation of ECC Software implementation of ECC Radboud University, Nijmegen, The Netherlands June 4, 2015 Summer school on real-world crypto and privacy Šibenik, Croatia Software implementation of (H)ECC Radboud University,

More information

Accurate Multiple-Precision Gauss-Legendre Quadrature

Accurate Multiple-Precision Gauss-Legendre Quadrature Accurate Multiple-Precision Gauss-Legendre Quadrature Laurent Fousse Université Henri-Poincaré Nancy 1 laurent@komite.net Abstract Numerical integration is an operation that is frequently available in

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

ALU (3) - Division Algorithms

ALU (3) - Division Algorithms HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Lecture 12 ALU (3) - Division Algorithms Sommersemester 2002 Leitung: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/rok/ca CA - XII - ALU(3)

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

Tunable Floating-Point for Energy Efficient Accelerators

Tunable Floating-Point for Energy Efficient Accelerators Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

Newton-Raphson Algorithms for Floating-Point Division Using an FMA

Newton-Raphson Algorithms for Floating-Point Division Using an FMA Newton-Raphson Algorithms for Floating-Point Division Using an FMA Nicolas Louvet, Jean-Michel Muller, Adrien Panhaleux Abstract Since the introduction of the Fused Multiply and Add (FMA) in the IEEE-754-2008

More information

14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System?

14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System? :33:3 DIGITAL LOGIC DESIGN Ivan Marsic, Rutgers University Electrical & Computer Engineering Fall 3 Lecture #: Binary Number System Complement Number Representation X Y Why Binary Number System? Because

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor Proposal to Improve Data Format Conversions for a Hybrid Number System Processor LUCIAN JURCA, DANIEL-IOAN CURIAC, AUREL GONTEAN, FLORIN ALEXA Department of Applied Electronics, Department of Automation

More information

Automated design of floating-point logarithm functions on integer processors

Automated design of floating-point logarithm functions on integer processors 23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)

More information

The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor

The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor Vincent Lefèvre 11th July 2005 Abstract This paper is a complement of the research report The Euclidean division implemented

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Faithful Horner Algorithm

Faithful Horner Algorithm ICIAM 07, Zurich, Switzerland, 16 20 July 2007 Faithful Horner Algorithm Philippe Langlois, Nicolas Louvet Université de Perpignan, France http://webdali.univ-perp.fr Ph. Langlois (Université de Perpignan)

More information

Chapter 5: Solutions to Exercises

Chapter 5: Solutions to Exercises 1 DIGITAL ARITHMETIC Miloš D. Ercegovac and Tomás Lang Morgan Kaufmann Publishers, an imprint of Elsevier Science, c 2004 Updated: September 9, 2003 Chapter 5: Solutions to Selected Exercises With contributions

More information

Optimizing Scientific Libraries for the Itanium

Optimizing Scientific Libraries for the Itanium 0 Optimizing Scientific Libraries for the Itanium John Harrison Intel Corporation Gelato Federation Meeting, HP Cupertino May 25, 2005 1 Quick summary Intel supplies drop-in replacement versions of common

More information

Hydra. A library for data analysis in massively parallel platforms. A. Augusto Alves Jr and Michael D. Sokoloff

Hydra. A library for data analysis in massively parallel platforms. A. Augusto Alves Jr and Michael D. Sokoloff Hydra A library for data analysis in massively parallel platforms A. Augusto Alves Jr and Michael D. Sokoloff University of Cincinnati aalvesju@cern.ch Presented at NVIDIA s GPU Technology Conference,

More information

NUMBERS AND CODES CHAPTER Numbers

NUMBERS AND CODES CHAPTER Numbers CHAPTER 2 NUMBERS AND CODES 2.1 Numbers When a number such as 101 is given, it is impossible to determine its numerical value. Some may say it is five. Others may say it is one hundred and one. Could it

More information

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)

Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander

More information

arxiv: v1 [hep-lat] 7 Oct 2010

arxiv: v1 [hep-lat] 7 Oct 2010 arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA

More information

Chapter 1 Error Analysis

Chapter 1 Error Analysis Chapter 1 Error Analysis Several sources of errors are important for numerical data processing: Experimental uncertainty: Input data from an experiment have a limited precision. Instead of the vector of

More information

CS 310 Advanced Data Structures and Algorithms

CS 310 Advanced Data Structures and Algorithms CS 310 Advanced Data Structures and Algorithms Runtime Analysis May 31, 2017 Tong Wang UMass Boston CS 310 May 31, 2017 1 / 37 Topics Weiss chapter 5 What is algorithm analysis Big O, big, big notations

More information

Elements of Floating-point Arithmetic

Elements of Floating-point Arithmetic Elements of Floating-point Arithmetic Sanzheng Qiao Department of Computing and Software McMaster University July, 2012 Outline 1 Floating-point Numbers Representations IEEE Floating-point Standards Underflow

More information

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits

Logic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits Logic and Computer Design Fundamentals Chapter 5 Arithmetic Functions and Circuits Arithmetic functions Operate on binary vectors Use the same subfunction in each bit position Can design functional block

More information

Solving PDEs with PGI CUDA Fortran Part 4: Initial value problems for ordinary differential equations

Solving PDEs with PGI CUDA Fortran Part 4: Initial value problems for ordinary differential equations Solving PDEs with PGI CUDA Fortran Part 4: Initial value problems for ordinary differential equations Outline ODEs and initial conditions. Explicit and implicit Euler methods. Runge-Kutta methods. Multistep

More information

What s the Deal? MULTIPLICATION. Time to multiply

What s the Deal? MULTIPLICATION. Time to multiply What s the Deal? MULTIPLICATION Time to multiply Multiplying two numbers requires a multiply Luckily, in binary that s just an AND gate! 0*0=0, 0*1=0, 1*0=0, 1*1=1 Generate a bunch of partial products

More information

Are numerical studies of long term dynamics conclusive: the case of the Hénon map

Are numerical studies of long term dynamics conclusive: the case of the Hénon map Journal of Physics: Conference Series PAPER OPEN ACCESS Are numerical studies of long term dynamics conclusive: the case of the Hénon map To cite this article: Zbigniew Galias 2016 J. Phys.: Conf. Ser.

More information

Isolating critical cases for reciprocals using integer factorization. John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003

Isolating critical cases for reciprocals using integer factorization. John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003 0 Isolating critical cases for reciprocals using integer factorization John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003 1 result before the final rounding Background Suppose

More information

1 ERROR ANALYSIS IN COMPUTATION

1 ERROR ANALYSIS IN COMPUTATION 1 ERROR ANALYSIS IN COMPUTATION 1.2 Round-Off Errors & Computer Arithmetic (a) Computer Representation of Numbers Two types: integer mode (not used in MATLAB) floating-point mode x R ˆx F(β, t, l, u),

More information

Linear System of Equations

Linear System of Equations Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.

More information

DIVISION BY DIGIT RECURRENCE

DIVISION BY DIGIT RECURRENCE DIVISION BY DIGIT RECURRENCE 1 SEVERAL DIVISION METHODS: DIGIT-RECURRENCE METHOD studied in this chapter MULTIPLICATIVE METHOD (Chapter 7) VARIOUS APPROXIMATION METHODS (power series expansion), SPECIAL

More information

Computer Arithmetic. MATH 375 Numerical Analysis. J. Robert Buchanan. Fall Department of Mathematics. J. Robert Buchanan Computer Arithmetic

Computer Arithmetic. MATH 375 Numerical Analysis. J. Robert Buchanan. Fall Department of Mathematics. J. Robert Buchanan Computer Arithmetic Computer Arithmetic MATH 375 Numerical Analysis J. Robert Buchanan Department of Mathematics Fall 2013 Machine Numbers When performing arithmetic on a computer (laptop, desktop, mainframe, cell phone,

More information

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),

More information

Extended precision with a rounding mode toward zero environment. Application on the CELL processor

Extended precision with a rounding mode toward zero environment. Application on the CELL processor Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

Approximation of inverse Poisson CDF on GPUs

Approximation of inverse Poisson CDF on GPUs Approximation of inverse Poisson CDF on GPUs Mike Giles Mathematical Institute, University of Oxford Oxford-Man Institute of Quantitative Finance 38th Conference on Stochastic Processes and their Applications

More information

ECE260: Fundamentals of Computer Engineering

ECE260: Fundamentals of Computer Engineering Data Representation & 2 s Complement James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania Based on Computer Organization and Design, 5th Edition by Patterson & Hennessy Data

More information

Computation of the error functions erf and erfc in arbitrary precision with correct rounding

Computation of the error functions erf and erfc in arbitrary precision with correct rounding Computation of the error functions erf and erfc in arbitrary precision with correct rounding Sylvain Chevillard Arenaire, LIP, ENS-Lyon, France Sylvain.Chevillard@ens-lyon.fr Nathalie Revol INRIA, Arenaire,

More information

Error Bounds for Iterative Refinement in Three Precisions

Error Bounds for Iterative Refinement in Three Precisions Error Bounds for Iterative Refinement in Three Precisions Erin C. Carson, New York University Nicholas J. Higham, University of Manchester SIAM Annual Meeting Portland, Oregon July 13, 018 Hardware Support

More information

Chapter 1. Numerical Errors. Module No. 1. Errors in Numerical Computations

Chapter 1. Numerical Errors. Module No. 1. Errors in Numerical Computations Numerical Analysis by Dr. Anita Pal Assistant Professor Department of Mathematics National Institute of Technology Durgapur Durgapur-73209 email: anita.buie@gmail.com . Chapter Numerical Errors Module

More information

CMP 334: Seventh Class

CMP 334: Seventh Class CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative

More information

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System

More information

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science Computer Representation of Numbers Counting numbers (unsigned integers) are the numbers 0,

More information

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College Why analysis? We want to predict how the algorithm will behave (e.g. running time) on arbitrary inputs, and how it will

More information

Algorithms for Experimental Mathematics I David H Bailey Lawrence Berkeley National Lab

Algorithms for Experimental Mathematics I David H Bailey Lawrence Berkeley National Lab Algorithms for Experimental Mathematics I David H Bailey Lawrence Berkeley National Lab All truths are easy to understand once they are discovered; the point is to discover them. Galileo Galilei Algorithms

More information

Leone B.Bosi, Loriano Storchi et al. CCR Workshop Stato e Prospettive del Calcolo Scientifico

Leone B.Bosi, Loriano Storchi et al. CCR Workshop Stato e Prospettive del Calcolo Scientifico Leone B.Bosi, Loriano Storchi et al MaCGO Project - Einstein Telescope Project INFN Perugia CCR Workshop 2011 - Stato e Prospettive del Calcolo Scientifico Laboratori Nazionali di Legnaro 16-18 Febbraio

More information

Scalable and Power-Efficient Data Mining Kernels

Scalable and Power-Efficient Data Mining Kernels Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the

More information

I. Numerical Computing

I. Numerical Computing I. Numerical Computing A. Lectures 1-3: Foundations of Numerical Computing Lecture 1 Intro to numerical computing Understand difference and pros/cons of analytical versus numerical solutions Lecture 2

More information

Cost/Performance Tradeoff of n-select Square Root Implementations

Cost/Performance Tradeoff of n-select Square Root Implementations Australian Computer Science Communications, Vol.22, No.4, 2, pp.9 6, IEEE Comp. Society Press Cost/Performance Tradeoff of n-select Square Root Implementations Wanming Chu and Yamin Li Computer Architecture

More information

Floating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le

Floating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le Floating Point Number Systems Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le 1 Overview Real number system Examples Absolute and relative errors Floating point numbers Roundoff

More information

NUMERICAL METHODS C. Carl Gustav Jacob Jacobi 10.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING

NUMERICAL METHODS C. Carl Gustav Jacob Jacobi 10.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING 0. Gaussian Elimination with Partial Pivoting 0.2 Iterative Methods for Solving Linear Systems 0.3 Power Method for Approximating Eigenvalues 0.4 Applications of Numerical Methods Carl Gustav Jacob Jacobi

More information

Elements of Floating-point Arithmetic

Elements of Floating-point Arithmetic Elements of Floating-point Arithmetic Sanzheng Qiao Department of Computing and Software McMaster University July, 2012 Outline 1 Floating-point Numbers Representations IEEE Floating-point Standards Underflow

More information

Midterm Review. Igor Yanovsky (Math 151A TA)

Midterm Review. Igor Yanovsky (Math 151A TA) Midterm Review Igor Yanovsky (Math 5A TA) Root-Finding Methods Rootfinding methods are designed to find a zero of a function f, that is, to find a value of x such that f(x) =0 Bisection Method To apply

More information