A new multiplication algorithm for extended precision using floating-point expansions. Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang
|
|
- Rudolph Berry
- 5 years ago
- Views:
Transcription
1 A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu, Jean-Michel Muller,Ping Tak Peter Tang ARITH 23 July 2016 AMPAR CudA Multiple Precision ARithmetic library
2 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions.
3 1/1 Target applications 1 Need massive parallel computations! high performance computing using graphics processors GPUs; 2 Need more precision than standard available (up to few hundred bits)! extend precision using floating-point expansions. Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. x x1
4 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs.
5 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY CudaA Multiple Precision ARithmetic library
6 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions
7 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition.
8 3/1 CAMPARY (CudaA Multiple Precision ARithmetic library) Our approach: multiple-term representation floating-pointexpansions Pros: usedirectlyavailableandhighlyoptimizednative FP infrastructure; straightforwardly portable to highly parallel architectures, such as GPUs; su cientlysimple and regular algorithms for addition. Cons: morethanonerepresentation; existingmultiplication algorithms do not generalize well for an arbitrary number of terms; di cultrigorouserroranalysis! lack of thorough error bounds.
9 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9.
10 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17.
11 4/1 Non-overlapping expansions R = e 1 can be represented, using a p =5(in radix 2) system,as: Less compact R 8 = x 0 + x 1 + x 2 : < x 0 =1.1000e 1; R 8 = y 0 + y 1 + y 2 + y 3 + y 4 + y 5 : x 1 =1.0010e 3; : x 2 =1.0110e 6. Most compact R = z 0 + z 1 : z0 =1.1101e 1; z 1 =1.1000e 8. >< >: y 0 =1.0000e 1; y 1 =1.0000e 2; y 2 =1.0000e 3; y 3 =1.0000e 5; y 4 =1.0000e 8; y 5 =1.0000e 9. Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u 0,u 1,...,u n 1 if for all 0 <i<n,wehave u i apple ulp (u i 1 ). Example: p =5(in radix 2) 8 x >< 0 =1.1010e 2; x 1 =1.1101e 7; >: x 2 =1.0000e 11; x 3 =1.1000e 17. Restriction: n apple 12 for single-precision and n apple 39 for double-precision.
12 5/1 Error-Free Transforms: Fast2Sum & 2MultFMA Algorithm 1 (Fast2Sum (a, b)) s RN (a + b) z RN (s a) e RN (b z) return (s, e) Algorithm 2 (2MultFMA (a, b)) p RN (a b) e fma(a, b, p) return (p, e) Requirement: e a e b ;! Uses 3 FP operations. Requirement: e a + e b e min + p 1;! Uses 2 FP operations.
13 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof.
14 6/1 Existing multiplication algorithms 1 Priest s multiplication [Pri91]: verycomplexandcostly; basedonscalarproducts; uses re-normalization after each step; computestheentireresultand truncates a-posteriori; comes with an error bound and correctness proof. 2 quad-double multiplication in QD library [Hida et.al. 07]: does not straightforwardly generalize; canleadtoo(n 3 ) complexity; worstcaseerrorboundispessimistic; nocorrectnessproofispublished.
15 7/1 New multiplication algorithms requires: ulp -nonoverlapping FP expansion x =(x 0,x 1,...,x n 1 ) and y =(y 0,y 1,...,y m 1 ); ensures: ulp -nonoverlapping FP expansion =( 0, 1,..., r 1 ). Let me explain it with an example...
16 Example: n =4,m=3and r =4 8/1
17 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ;
18 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly truncation ; term-times-expansion products, x i y; error correction term, r.
19 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ;
20 9/1 Example: n =4,m=3and r =4 [ r p b ]+2containers of size b (s.t. 3b >2p); b + c = p 1, s.t.wecanadd2 c numbers without error (binary64! b =45, binary32! b =18); starting exponent e = e x0 + e y0 ; each bin s LSB has a fixed weight; bins initialized with e (i+1)b+p 1 ; the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09].
21 Example: n =4,m=3and r =4 10 / 1
22 10 / 1 Example: n =4,m=3and r =4 subtract initial value;
23 10 / 1 Example: n =4,m=3and r =4 subtract initial value; apply renormalization step to B: Fast2Sum and branches; rendertheresult ulp -nonoverlapping.
24 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources:
25 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) ;
26 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; ;
27 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. ;
28 11 / 1 Error bound Exact result computed as: n 1;m 1 X i=0;j=0 x i y j = m+n 2 X k=0 X x i y j. i+j=k On-the-fly truncation! three error sources: discarded partial products: weaddonlythefirst r+1 P m+n P 2 k=r+1 P i+j=k k=1 k; x i y j apple x 0 y 0 2 (p h i 1)(r+1) (p 1) 2 + m+n r 2 (1 2 (p 1) ) (p 1) discarded rounding errors: wecomputethelastr +1partial products using simple FP multiplication; weneglectr +1error terms bounded by x 0 y 0 2 (p 1)r 2 p ; error caused by the renormalization step: errorlessorequalto x 0 y 0 2 (p 1)r. Tight error bound: x 0 y 0 2 (p 1)r [1 + (r +1)2 p (p 1) 2 (p 1) (1 2 (p 1) ) 2 + m + n r 2 A]. 1 2 (p 1) ;
29 12 / 1 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91]
30 Comparison Table : Worst case FP operation count when the input and output expansions are of size r. r New algorithm Priest s mul.[pri91] Table : Performance in Mops/s for multiplying two FP expansions on a Tesla K40c GPU, using CUDA 7.5 software architecture, running on a single thread of execution. precision not supported. d x,d y,d r New algorithm QD 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16, / 1
31 13 / 1 Conclusions AMPAR CudA Multiple Precision ARithmetic library Available online at: algorithm with strong regularity; based on partial products accumulation; uses a fixed-point structure that is floating-point friendly; thorough error analysis and tight error bound; natural fit for GPUs; proved to be too complex for small precisions; performance gains with increased precision.
32 Comparison Table : Performance in Mops/s for multiplying two FP expansions on the CPU; d x and d y represent the number of terms in the input expansions and d r is the size of the computed result. precision not supported d x,d y,d r New algorithm QD MPFR 2, 2, , 2, , 3, , 3, , 4, , 4, , 4, , 8, , 8, , 16,
On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration
On the computation of the reciprocal of floating point expansions using an adapted Newton-Raphson iteration Mioara Joldes, Valentina Popescu, Jean-Michel Muller ASAP 2014 1 / 10 Motivation Numerical problems
More informationTight and rigourous error bounds for basic building blocks of double-word arithmetic. Mioara Joldes, Jean-Michel Muller, Valentina Popescu
Tight and rigourous error bounds for basic building blocks of double-word arithmetic Mioara Joldes, Jean-Michel Muller, Valentina Popescu SCAN - September 2016 AMPAR CudA Multiple Precision ARithmetic
More informationMANY numerical problems in dynamical systems
IEEE TRANSACTIONS ON COMPUTERS, VOL., 201X 1 Arithmetic algorithms for extended precision using floating-point expansions Mioara Joldeş, Olivier Marty, Jean-Michel Muller and Valentina Popescu Abstract
More informationAccurate polynomial evaluation in floating point arithmetic
in floating point arithmetic Université de Perpignan Via Domitia Laboratoire LP2A Équipe de recherche en Informatique DALI MIMS Seminar, February, 10 th 2006 General motivation Provide numerical algorithms
More informationOn various ways to split a floating-point number
On various ways to split a floating-point number Claude-Pierre Jeannerod Jean-Michel Muller Paul Zimmermann Inria, CNRS, ENS Lyon, Université de Lyon, Université de Lorraine France ARITH-25 June 2018 -2-
More informationImplementation of float float operators on graphic hardware
Implementation of float float operators on graphic hardware Guillaume Da Graça David Defour DALI, Université de Perpignan Outline Why GPU? Floating point arithmetic on GPU Float float format Problem &
More informationOptimizing the Representation of Intervals
Optimizing the Representation of Intervals Javier D. Bruguera University of Santiago de Compostela, Spain Numerical Sofware: Design, Analysis and Verification Santander, Spain, July 4-6 2012 Contents 1
More informationShort Division of Long Integers. (joint work with David Harvey)
Short Division of Long Integers (joint work with David Harvey) Paul Zimmermann October 6, 2011 The problem to be solved Divide efficiently a p-bit floating-point number by another p-bit f-p number in the
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationComplement Arithmetic
Complement Arithmetic Objectives In this lesson, you will learn: How additions and subtractions are performed using the complement representation, What is the Overflow condition, and How to perform arithmetic
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationThe Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal
-1- The Classical Relative Error Bounds for Computing a2 + b 2 and c/ a 2 + b 2 in Binary Floating-Point Arithmetic are Asymptotically Optimal Claude-Pierre Jeannerod Jean-Michel Muller Antoine Plet Inria,
More informationGetting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function
Getting tight error bounds in floating-point arithmetic: illustration with complex functions, and the real x n function Jean-Michel Muller collaborators: S. Graillat, C.-P. Jeannerod, N. Louvet V. Lefèvre
More information4ccurate 4lgorithms in Floating Point 4rithmetic
4ccurate 4lgorithms in Floating Point 4rithmetic Philippe LANGLOIS 1 1 DALI Research Team, University of Perpignan, France SCAN 2006 at Duisburg, Germany, 26-29 September 2006 Philippe LANGLOIS (UPVD)
More informationComputation of dot products in finite fields with floating-point arithmetic
Computation of dot products in finite fields with floating-point arithmetic Stef Graillat Joint work with Jérémy Jean LIP6/PEQUAN - Université Pierre et Marie Curie (Paris 6) - CNRS Computer-assisted proofs
More informationIntroduction to numerical computations on the GPU
Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming
More informationSolving Triangular Systems More Accurately and Efficiently
17th IMACS World Congress, July 11-15, 2005, Paris, France Scientific Computation, Applied Mathematics and Simulation Solving Triangular Systems More Accurately and Efficiently Philippe LANGLOIS and Nicolas
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More information9. Datapath Design. Jacob Abraham. Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017
9. Datapath Design Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 2, 2017 ECE Department, University of Texas at Austin
More informationMANY numerical problems in dynamical systems or
IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 4, APRIL 2016 1197 Arithmetic Algorithms for Extended Precision Using Floating-Point Expansions Mioara Joldeş, Olivier Marty, Jean-Michel Muller, Senior Member,
More informationLecture 11. Advanced Dividers
Lecture 11 Advanced Dividers Required Reading Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Design Chapter 15 Variation in Dividers 15.3, Combinational and Array Dividers Chapter 16, Division
More informationHow fast can we add (or subtract) two numbers n and m?
Addition and Subtraction How fast do we add (or subtract) two numbers n and m? How fast can we add (or subtract) two numbers n and m? Definition. Let A(d) denote the maximal number of steps required to
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationComplex Logarithmic Number System Arithmetic Using High-Radix Redundant CORDIC Algorithms
Complex Logarithmic Number System Arithmetic Using High-Radix Redundant CORDIC Algorithms David Lewis Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S
More informationWelcome to MCS 572. content and organization expectations of the course. definition and classification
Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson
More informationRemainders. We learned how to multiply and divide in elementary
Remainders We learned how to multiply and divide in elementary school. As adults we perform division mostly by pressing the key on a calculator. This key supplies the quotient. In numerical analysis and
More informationContinued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic.
Continued fractions and number systems: applications to correctly-rounded implementations of elementary functions and modular arithmetic. Mourad Gouicem PEQUAN Team, LIP6/UPMC Nancy, France May 28 th 2013
More informationTwo case studies of Monte Carlo simulation on GPU
Two case studies of Monte Carlo simulation on GPU National Institute for Computational Sciences University of Tennessee Seminar series on HPC, Feb. 27, 2014 Outline 1 Introduction 2 Discrete energy lattice
More informationCSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019
CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2019 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis
More informationCRYPTOGRAPHIC COMPUTING
CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,
More informationUNIT V FINITE WORD LENGTH EFFECTS IN DIGITAL FILTERS PART A 1. Define 1 s complement form? In 1,s complement form the positive number is represented as in the sign magnitude form. To obtain the negative
More informationAccurate Solution of Triangular Linear System
SCAN 2008 Sep. 29 - Oct. 3, 2008, at El Paso, TX, USA Accurate Solution of Triangular Linear System Philippe Langlois DALI at University of Perpignan France Nicolas Louvet INRIA and LIP at ENS Lyon France
More informationTowards Fast, Accurate and Reproducible LU Factorization
Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3
More informationNumbering Systems. Computational Platforms. Scaling and Round-off Noise. Special Purpose. here that is dedicated architecture
Computational Platforms Numbering Systems Basic Building Blocks Scaling and Round-off Noise Computational Platforms Viktor Öwall viktor.owall@eit.lth.seowall@eit lth Standard Processors or Special Purpose
More informationBranch Prediction based attacks using Hardware performance Counters IIT Kharagpur
Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation
More informationFaster arbitrary-precision dot product and matrix multiplication
Faster arbitrary-precision dot product and matrix multiplication Fredrik Johansson LFANT Inria Bordeaux Talence, France fredrik.johansson@gmail.com arxiv:1901.04289v1 [cs.ms] 14 Jan 2019 Abstract We present
More informationEfficient Reproducible Floating Point Summation and BLAS
Efficient Reproducible Floating Point Summation and BLAS Peter Ahrens Hong Diep Nguyen James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No.
More informationParallel Polynomial Evaluation
Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu
More informationHigh-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA
High-Performance FV Somewhat Homomorphic Encryption on GPUs: An Implementation using CUDA Ahmad Al Badawi ahmad@u.nus.edu National University of Singapore (NUS) Sept 10 th 2018 CHES 2018 FHE The holy grail
More informationTopic 17. Analysis of Algorithms
Topic 17 Analysis of Algorithms Analysis of Algorithms- Review Efficiency of an algorithm can be measured in terms of : Time complexity: a measure of the amount of time required to execute an algorithm
More informationChapter 4 Number Representations
Chapter 4 Number Representations SKEE2263 Digital Systems Mun im/ismahani/izam {munim@utm.my,e-izam@utm.my,ismahani@fke.utm.my} February 9, 2016 Table of Contents 1 Fundamentals 2 Signed Numbers 3 Fixed-Point
More informationParallel Reproducible Summation
Parallel Reproducible Summation James Demmel Mathematics Department and CS Division University of California at Berkeley Berkeley, CA 94720 demmel@eecs.berkeley.edu Hong Diep Nguyen EECS Department University
More informationS XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library
S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationVectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier
Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier Espen Stenersen Master of Science in Electronics Submission date: June 2008 Supervisor: Per Gunnar Kjeldsberg, IET Co-supervisor: Torstein
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationSoftware implementation of ECC
Software implementation of ECC Radboud University, Nijmegen, The Netherlands June 4, 2015 Summer school on real-world crypto and privacy Šibenik, Croatia Software implementation of (H)ECC Radboud University,
More informationAccurate Multiple-Precision Gauss-Legendre Quadrature
Accurate Multiple-Precision Gauss-Legendre Quadrature Laurent Fousse Université Henri-Poincaré Nancy 1 laurent@komite.net Abstract Numerical integration is an operation that is frequently available in
More informationCSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018
CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis
More informationALU (3) - Division Algorithms
HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK Lecture 12 ALU (3) - Division Algorithms Sommersemester 2002 Leitung: Prof. Dr. Miroslaw Malek www.informatik.hu-berlin.de/rok/ca CA - XII - ALU(3)
More informationHybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS
Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures
More informationTunable Floating-Point for Energy Efficient Accelerators
Tunable Floating-Point for Energy Efficient Accelerators Alberto Nannarelli DTU Compute, Technical University of Denmark 25 th IEEE Symposium on Computer Arithmetic A. Nannarelli (DTU Compute) Tunable
More informationCSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018
CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis
More informationNewton-Raphson Algorithms for Floating-Point Division Using an FMA
Newton-Raphson Algorithms for Floating-Point Division Using an FMA Nicolas Louvet, Jean-Michel Muller, Adrien Panhaleux Abstract Since the introduction of the Fused Multiply and Add (FMA) in the IEEE-754-2008
More information14:332:231 DIGITAL LOGIC DESIGN. Why Binary Number System?
:33:3 DIGITAL LOGIC DESIGN Ivan Marsic, Rutgers University Electrical & Computer Engineering Fall 3 Lecture #: Binary Number System Complement Number Representation X Y Why Binary Number System? Because
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationProposal to Improve Data Format Conversions for a Hybrid Number System Processor
Proposal to Improve Data Format Conversions for a Hybrid Number System Processor LUCIAN JURCA, DANIEL-IOAN CURIAC, AUREL GONTEAN, FLORIN ALEXA Department of Applied Electronics, Department of Automation
More informationAutomated design of floating-point logarithm functions on integer processors
23rd IEEE Symposium on Computer Arithmetic Santa Clara, CA, USA, 10-13 July 2016 Automated design of floating-point logarithm functions on integer processors Guillaume Revy (presented by Florent de Dinechin)
More informationThe Euclidean Division Implemented with a Floating-Point Multiplication and a Floor
The Euclidean Division Implemented with a Floating-Point Multiplication and a Floor Vincent Lefèvre 11th July 2005 Abstract This paper is a complement of the research report The Euclidean division implemented
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationFaithful Horner Algorithm
ICIAM 07, Zurich, Switzerland, 16 20 July 2007 Faithful Horner Algorithm Philippe Langlois, Nicolas Louvet Université de Perpignan, France http://webdali.univ-perp.fr Ph. Langlois (Université de Perpignan)
More informationChapter 5: Solutions to Exercises
1 DIGITAL ARITHMETIC Miloš D. Ercegovac and Tomás Lang Morgan Kaufmann Publishers, an imprint of Elsevier Science, c 2004 Updated: September 9, 2003 Chapter 5: Solutions to Selected Exercises With contributions
More informationOptimizing Scientific Libraries for the Itanium
0 Optimizing Scientific Libraries for the Itanium John Harrison Intel Corporation Gelato Federation Meeting, HP Cupertino May 25, 2005 1 Quick summary Intel supplies drop-in replacement versions of common
More informationHydra. A library for data analysis in massively parallel platforms. A. Augusto Alves Jr and Michael D. Sokoloff
Hydra A library for data analysis in massively parallel platforms A. Augusto Alves Jr and Michael D. Sokoloff University of Cincinnati aalvesju@cern.ch Presented at NVIDIA s GPU Technology Conference,
More informationNUMBERS AND CODES CHAPTER Numbers
CHAPTER 2 NUMBERS AND CODES 2.1 Numbers When a number such as 101 is given, it is impossible to determine its numerical value. Some may say it is five. Others may say it is one hundred and one. Could it
More informationQuantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)
Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander
More informationarxiv: v1 [hep-lat] 7 Oct 2010
arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA
More informationChapter 1 Error Analysis
Chapter 1 Error Analysis Several sources of errors are important for numerical data processing: Experimental uncertainty: Input data from an experiment have a limited precision. Instead of the vector of
More informationCS 310 Advanced Data Structures and Algorithms
CS 310 Advanced Data Structures and Algorithms Runtime Analysis May 31, 2017 Tong Wang UMass Boston CS 310 May 31, 2017 1 / 37 Topics Weiss chapter 5 What is algorithm analysis Big O, big, big notations
More informationElements of Floating-point Arithmetic
Elements of Floating-point Arithmetic Sanzheng Qiao Department of Computing and Software McMaster University July, 2012 Outline 1 Floating-point Numbers Representations IEEE Floating-point Standards Underflow
More informationLogic and Computer Design Fundamentals. Chapter 5 Arithmetic Functions and Circuits
Logic and Computer Design Fundamentals Chapter 5 Arithmetic Functions and Circuits Arithmetic functions Operate on binary vectors Use the same subfunction in each bit position Can design functional block
More informationSolving PDEs with PGI CUDA Fortran Part 4: Initial value problems for ordinary differential equations
Solving PDEs with PGI CUDA Fortran Part 4: Initial value problems for ordinary differential equations Outline ODEs and initial conditions. Explicit and implicit Euler methods. Runge-Kutta methods. Multistep
More informationWhat s the Deal? MULTIPLICATION. Time to multiply
What s the Deal? MULTIPLICATION Time to multiply Multiplying two numbers requires a multiply Luckily, in binary that s just an AND gate! 0*0=0, 0*1=0, 1*0=0, 1*1=1 Generate a bunch of partial products
More informationAre numerical studies of long term dynamics conclusive: the case of the Hénon map
Journal of Physics: Conference Series PAPER OPEN ACCESS Are numerical studies of long term dynamics conclusive: the case of the Hénon map To cite this article: Zbigniew Galias 2016 J. Phys.: Conf. Ser.
More informationIsolating critical cases for reciprocals using integer factorization. John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003
0 Isolating critical cases for reciprocals using integer factorization John Harrison Intel Corporation ARITH-16 Santiago de Compostela 17th June 2003 1 result before the final rounding Background Suppose
More information1 ERROR ANALYSIS IN COMPUTATION
1 ERROR ANALYSIS IN COMPUTATION 1.2 Round-Off Errors & Computer Arithmetic (a) Computer Representation of Numbers Two types: integer mode (not used in MATLAB) floating-point mode x R ˆx F(β, t, l, u),
More informationLinear System of Equations
Linear System of Equations Linear systems are perhaps the most widely applied numerical procedures when real-world situation are to be simulated. Example: computing the forces in a TRUSS. F F 5. 77F F.
More informationDIVISION BY DIGIT RECURRENCE
DIVISION BY DIGIT RECURRENCE 1 SEVERAL DIVISION METHODS: DIGIT-RECURRENCE METHOD studied in this chapter MULTIPLICATIVE METHOD (Chapter 7) VARIOUS APPROXIMATION METHODS (power series expansion), SPECIAL
More informationComputer Arithmetic. MATH 375 Numerical Analysis. J. Robert Buchanan. Fall Department of Mathematics. J. Robert Buchanan Computer Arithmetic
Computer Arithmetic MATH 375 Numerical Analysis J. Robert Buchanan Department of Mathematics Fall 2013 Machine Numbers When performing arithmetic on a computer (laptop, desktop, mainframe, cell phone,
More informationResearch on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method
NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),
More informationExtended precision with a rounding mode toward zero environment. Application on the CELL processor
Extended precision with a rounding mode toward zero environment. Application on the CELL processor Hong Diep Nguyen (hong.diep.nguyen@ens-lyon.fr), Stef Graillat (stef.graillat@lip6.fr), Jean-Luc Lamotte
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationApproximation of inverse Poisson CDF on GPUs
Approximation of inverse Poisson CDF on GPUs Mike Giles Mathematical Institute, University of Oxford Oxford-Man Institute of Quantitative Finance 38th Conference on Stochastic Processes and their Applications
More informationECE260: Fundamentals of Computer Engineering
Data Representation & 2 s Complement James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania Based on Computer Organization and Design, 5th Edition by Patterson & Hennessy Data
More informationComputation of the error functions erf and erfc in arbitrary precision with correct rounding
Computation of the error functions erf and erfc in arbitrary precision with correct rounding Sylvain Chevillard Arenaire, LIP, ENS-Lyon, France Sylvain.Chevillard@ens-lyon.fr Nathalie Revol INRIA, Arenaire,
More informationError Bounds for Iterative Refinement in Three Precisions
Error Bounds for Iterative Refinement in Three Precisions Erin C. Carson, New York University Nicholas J. Higham, University of Manchester SIAM Annual Meeting Portland, Oregon July 13, 018 Hardware Support
More informationChapter 1. Numerical Errors. Module No. 1. Errors in Numerical Computations
Numerical Analysis by Dr. Anita Pal Assistant Professor Department of Mathematics National Institute of Technology Durgapur Durgapur-73209 email: anita.buie@gmail.com . Chapter Numerical Errors Module
More informationCMP 334: Seventh Class
CMP 334: Seventh Class Performance HW 5 solution Averages and weighted averages (review) Amdahl's law Ripple-carry adder circuits Binary addition Half-adder circuits Full-adder circuits Subtraction, negative
More informationImplementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System
Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System G.Suresh, G.Indira Devi, P.Pavankumar Abstract The use of the improved table look up Residue Number System
More informationEAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science
EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science Computer Representation of Numbers Counting numbers (unsigned integers) are the numbers 0,
More informationAnalysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College
Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College Why analysis? We want to predict how the algorithm will behave (e.g. running time) on arbitrary inputs, and how it will
More informationAlgorithms for Experimental Mathematics I David H Bailey Lawrence Berkeley National Lab
Algorithms for Experimental Mathematics I David H Bailey Lawrence Berkeley National Lab All truths are easy to understand once they are discovered; the point is to discover them. Galileo Galilei Algorithms
More informationLeone B.Bosi, Loriano Storchi et al. CCR Workshop Stato e Prospettive del Calcolo Scientifico
Leone B.Bosi, Loriano Storchi et al MaCGO Project - Einstein Telescope Project INFN Perugia CCR Workshop 2011 - Stato e Prospettive del Calcolo Scientifico Laboratori Nazionali di Legnaro 16-18 Febbraio
More informationScalable and Power-Efficient Data Mining Kernels
Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the
More informationI. Numerical Computing
I. Numerical Computing A. Lectures 1-3: Foundations of Numerical Computing Lecture 1 Intro to numerical computing Understand difference and pros/cons of analytical versus numerical solutions Lecture 2
More informationCost/Performance Tradeoff of n-select Square Root Implementations
Australian Computer Science Communications, Vol.22, No.4, 2, pp.9 6, IEEE Comp. Society Press Cost/Performance Tradeoff of n-select Square Root Implementations Wanming Chu and Yamin Li Computer Architecture
More informationFloating Point Number Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le
Floating Point Number Systems Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le 1 Overview Real number system Examples Absolute and relative errors Floating point numbers Roundoff
More informationNUMERICAL METHODS C. Carl Gustav Jacob Jacobi 10.1 GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING
0. Gaussian Elimination with Partial Pivoting 0.2 Iterative Methods for Solving Linear Systems 0.3 Power Method for Approximating Eigenvalues 0.4 Applications of Numerical Methods Carl Gustav Jacob Jacobi
More informationElements of Floating-point Arithmetic
Elements of Floating-point Arithmetic Sanzheng Qiao Department of Computing and Software McMaster University July, 2012 Outline 1 Floating-point Numbers Representations IEEE Floating-point Standards Underflow
More informationMidterm Review. Igor Yanovsky (Math 151A TA)
Midterm Review Igor Yanovsky (Math 5A TA) Root-Finding Methods Rootfinding methods are designed to find a zero of a function f, that is, to find a value of x such that f(x) =0 Bisection Method To apply
More information