Optimizing Scientific Libraries for the Itanium

0 Optimizing Scientific Libraries for the Itanium
John Harrison, Intel Corporation
Gelato Federation Meeting, HP Cupertino, May 25, 2005

1 Quick summary
Intel supplies drop-in replacement versions of common libraries optimized for particular (micro-)architectures. We'll focus on our version of the standard math library libm for the Intel Itanium architecture.
- Provides superior speed and/or accuracy to typical generic versions (FDLIBM etc.)
- Special features of Itanium make for some particularly high-quality implementations.

2 What makes Itanium special?
- Parallelism, many registers, predication and explicit scheduling: we can consider sophisticated algorithms with parallel threads.
- Fused multiply-add and extended precision: useful for polynomials, and eases many accuracy-preserving techniques.
- Non-atomic division and square root: allows optimized integration into more complicated algorithms.
- Multiple status fields: intermediate computations can use higher precision with no performance hit.
- Predication and register renaming: can eliminate most control flow and pipeline multiple instances.

3 Naive table-driven algorithm
A naive algorithm for e^x illustrates the key steps of the typical table-driven algorithm:
- Reduction: split x = N + r where N is an integer and |r| <= 1/2.
- Approximation: e^r ~ 1 + r + r^2/2! + ... + r^n/n!
- Reconstruction: e^x = e^N * e^r, with e^N taken from a table.
The same pattern is used in most real algorithms, but with careful attention to numeric accuracy and the tradeoff between speed and table size.
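
A minimal C sketch of the three steps, assuming a small table range, a degree-8 Taylor polynomial, and helper names (init_table, naive_exp) chosen purely for illustration; the table is filled with exp() calls as a stand-in for precomputed constants, and the accuracy is nowhere near that of the production libm.

```c
#include <math.h>
#include <stdio.h>

/* Naive table-driven exp: an illustrative sketch, not the production algorithm.
   Reduction:      x = N + r,  N integer, |r| <= 1/2
   Approximation:  e^r ~ 1 + r + r^2/2! + ... + r^n/n!
   Reconstruction: e^x = e^N * e^r, with e^N taken from a table               */

#define TAB_MIN (-32)
#define TAB_MAX 32
static double exp_table[TAB_MAX - TAB_MIN + 1];    /* e^N for N in [TAB_MIN, TAB_MAX] */

static void init_table(void) {
    for (int n = TAB_MIN; n <= TAB_MAX; n++)
        exp_table[n - TAB_MIN] = exp((double)n);   /* stand-in for precomputed constants */
}

static double naive_exp(double x) {                /* assumes TAB_MIN <= x <= TAB_MAX */
    /* Reduction */
    double N = nearbyint(x);                       /* nearest integer, so |r| <= 1/2 */
    double r = x - N;
    /* Approximation: degree-8 Taylor polynomial by Horner's rule
       (degree chosen for brevity, not for full double-precision accuracy) */
    static const double c[] = { 1.0/40320, 1.0/5040, 1.0/720, 1.0/120,
                                1.0/24, 1.0/6, 0.5, 1.0, 1.0 };
    double p = c[0];
    for (int i = 1; i < 9; i++)
        p = p * r + c[i];
    /* Reconstruction */
    return exp_table[(int)N - TAB_MIN] * p;
}

int main(void) {
    init_table();
    printf("%.17g vs %.17g\n", naive_exp(2.345), exp(2.345));
    return 0;
}
```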

4 Real table-driven algorithm
The input number X is first reduced to r with |r| <= π/4 (approximately), such that X = r + N·π/2 for some integer N. We now need to calculate ±tan(r) or ±cot(r) depending on N modulo 4.
If the reduced argument r is still not small enough, it is separated into its leading few bits B and the trailing part x = r − B, and the overall result is computed from tan(x) and pre-stored functions of B, e.g.
tan(B + x) = tan(B) + [1/(sin(B)cos(B))] · tan(x) / (cot(B) − tan(x))
Now a polynomial (plus reciprocal) approximation is used for tan(r), cot(r) or tan(x) as appropriate.
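
A hedged C sketch of this second reduction for an already-reduced argument. The splitting granularity (multiples of 1/16), the short tan polynomial, and the use of libm calls in place of the pre-stored table values tan(B), cot(B) and 1/(sin(B)cos(B)) are all assumptions made for illustration.

```c
#include <math.h>
#include <stdio.h>

static double tan_poly(double x) {
    /* short odd polynomial for tan(x) on a small interval (illustrative degree):
       tan(x) ~ x + x^3/3 + 2x^5/15 + 17x^7/315 */
    double x2 = x * x;
    return x * (1.0 + x2 * (1.0/3.0 + x2 * (2.0/15.0 + x2 * (17.0/315.0))));
}

/* Second reduction and reconstruction for |r| <= pi/4 */
static double tan_reduced(double r) {
    if (fabs(r) < 1.0/16.0)                  /* already small: polynomial alone */
        return tan_poly(r);

    /* B = leading bits of r (nearest multiple of 1/16), x = trailing part */
    double B = nearbyint(r * 16.0) / 16.0;
    double x = r - B;                        /* |x| <= 1/32 */

    double tanB  = tan(B);                   /* table entries in the real code */
    double cotB  = 1.0 / tan(B);
    double scale = 1.0 / (sin(B) * cos(B));

    double t = tan_poly(x);
    /* tan(B + x) = tan(B) + [1/(sin(B)cos(B))] * tan(x) / (cot(B) - tan(x)) */
    return tanB + scale * t / (cotB - t);
}

int main(void) {
    printf("%.17g vs %.17g\n", tan_reduced(0.7), tan(0.7));
    return 0;
}
```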

5 Polynomials using fma
A traditional approach to computing polynomials is Horner's rule:
a_0 + x(a_1 + x(a_2 + x(a_3 + ... + x(a_{n-1} + x·a_n) ... )))
It minimizes the total number of operations and usually has good numerical properties. It is even better on Itanium: the core FP operation is the fused multiply-add, which computes a·b + c in one operation. Horner's rule is used in most of the throughput-optimized algorithms.
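
A minimal C version of Horner's rule built on the C99 fma() function, which requests the fused multiply-add where the hardware provides one (as Itanium does); the function name and interface are illustrative.

```c
#include <math.h>

/* Evaluate a[n]*x^n + ... + a[1]*x + a[0] by Horner's rule:
   one fused multiply-add per coefficient, one rounding per step. */
static double horner_fma(const double *a, int n, double x) {
    double p = a[n];
    for (int i = n - 1; i >= 0; i--)
        p = fma(p, x, a[i]);     /* p = p*x + a[i] in a single operation */
    return p;
}
```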

6 Binary splitting of polynomials
If our concern is latency, we can do better using binary splitting. This arrangement yields 3 fma latencies instead of 6:
1 + d + d^2 + ... + d^6 = 1 + (1 + d + d^2)(d^4 + d)
More generally, we can do most degree-n polynomials in around log2(n) latencies, thanks to pipelining and the two floating-point units. Using extended precision, there are few accuracy worries with algebraic rearrangement. (Need a little care over monotonicity!) So our algorithms often use longer polynomials than the norm, given that they can be computed quickly and accurately.
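
The degree-6 example can be written directly in terms of fma operations. The grouping below is a sketch of where the 3-latency critical path comes from; on Itanium the independent steps issue on the two floating-point units, while in portable C the compiler decides the actual scheduling.

```c
#include <math.h>

/* 1 + d + d^2 + ... + d^6 evaluated as 1 + (1 + d + d^2)*(d^4 + d).
   Critical path: 3 fma-class latencies instead of 6 with plain Horner. */
static double poly6_split(double d) {
    /* stage 1 (independent): d2 = d*d,  s = 1 + d */
    double d2 = d * d;
    double s  = 1.0 + d;
    /* stage 2 (independent): t = 1 + d + d^2,  u = d^4 + d */
    double t = s + d2;
    double u = fma(d2, d2, d);
    /* stage 3: 1 + t*u */
    return fma(t, u, 1.0);
}
```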

7 Exploiting non-atomic division
On Itanium there's no atomic division operation, but frcpa returns a reciprocal approximation good to about 8 bits. Techniques largely due to Markstein allow this to be refined to an IEEE-correct quotient using just standard fma operations.
- Automatically inherits pipelining from the core operations, giving high throughput and the ability to schedule in parallel with other work
- Simpler hardware
- Can more flexibly use specially optimized algorithms, e.g. not forcing IEEE rounding of intermediate results
- Can exploit the reciprocal approximation itself for various tricks
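
A hedged sketch of the fma-based refinement pattern: Newton-Raphson reciprocal steps followed by a remainder correction, in the general style of Markstein's techniques. Since frcpa is not reachable from portable C, an approximately 8-bit reciprocal is emulated; this illustrates the shape of the method, not the exact Intel sequence that guarantees an IEEE-correct quotient.

```c
#include <math.h>
#include <stdio.h>

static double approx_recip(double b) {       /* crude stand-in for frcpa */
    int e;
    double m = frexp(1.0 / b, &e);           /* placeholder for the hardware estimate */
    m = floor(m * 256.0) / 256.0;            /* keep roughly 8 significant bits */
    return ldexp(m, e);
}

static double fma_divide(double a, double b) {
    double y = approx_recip(b);              /* y ~ 1/b to ~8 bits */
    /* Newton-Raphson: each step roughly doubles the number of correct bits */
    for (int i = 0; i < 3; i++) {
        double e = fma(-b, y, 1.0);          /* e = 1 - b*y        */
        y = fma(y, e, y);                    /* y = y + y*e        */
    }
    double q = a * y;                        /* initial quotient   */
    double r = fma(-b, q, a);                /* remainder a - b*q  */
    return fma(r, y, q);                     /* corrected quotient */
}

int main(void) {
    printf("%.17g vs %.17g\n", fma_divide(355.0, 113.0), 355.0 / 113.0);
    return 0;
}
```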

8 Using parallelism and non-atomic division
In computing atan(x) for x > 1 we actually compute π/2 − atan(1/x). It seems we have a division in the critical path. But eventually we are computing a polynomial in 1/x, and we can factor out the highest term:
p(1/x) = (1/x)^45 · q(x)
Now let c = frcpa(x) and b = c·x − 1. We can actually compute the following, where all three terms in the product can be evaluated in parallel:
p(1/x) = c^45 · r(b) · q(x)
where r(b) is a power series expansion of 1/(1+b)^45.
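
The identity behind this can be checked numerically. The sketch below verifies that (1/x)^45 equals c^45 · r(b) with a truncated binomial series for r(b); the reciprocal emulation, series length and test value are assumptions for illustration, and the production code folds these factors into the polynomial evaluation rather than computing explicit powers.

```c
#include <math.h>
#include <stdio.h>

static double approx_recip(double x) {              /* crude stand-in for frcpa */
    int e;
    double m = frexp(1.0 / x, &e);
    return ldexp(floor(m * 256.0) / 256.0, e);      /* roughly 8 significant bits */
}

/* r(b) = (1+b)^(-n) via its binomial series:
   (1+b)^(-n) = sum_k (-1)^k C(n+k-1,k) b^k, built term by term */
static double r_series(double b, int n, int terms) {
    double sum = 1.0, term = 1.0;
    for (int k = 1; k <= terms; k++) {
        term *= -b * (double)(n + k - 1) / (double)k;
        sum += term;
    }
    return sum;
}

int main(void) {
    double x = 3.7;
    double c = approx_recip(x);
    double b = fma(c, x, -1.0);                     /* b = c*x - 1, small */
    printf("direct   %.17g\n", pow(1.0 / x, 45));
    printf("factored %.17g\n", pow(c, 45) * r_series(b, 45, 16));
    return 0;
}
```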

9 The frcpa trick
Though designed for starting a division, frcpa can speed up range reduction for some multiplicative functions. Let c = frcpa(x) and r = c·x − 1. Then:
ln(x) = ln((1 + r)/c) = ln(1 + r) − ln(c)
We have a precomputed table of values for ln(c), and ln(1 + r) is approximated by a short power series. Thanks to this, the logarithm is about the fastest of our transcendentals!
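
A sketch of the same reduction in portable C. The approximately 8-bit reciprocal stands in for frcpa, a libm log() call stands in for the precomputed table entry ln(c), and the length of the ln(1+r) series is illustrative.

```c
#include <math.h>
#include <stdio.h>

static double approx_recip(double x) {              /* crude stand-in for frcpa */
    int e;
    double m = frexp(1.0 / x, &e);
    return ldexp(floor(m * 256.0) / 256.0, e);      /* roughly 8 significant bits */
}

static double log1p_short(double r) {
    /* ln(1+r) ~ r - r^2/2 + r^3/3 - ... ; r is small, so few terms suffice */
    double sum = 0.0, term = r;
    for (int k = 1; k <= 8; k++) {
        sum += ((k & 1) ? term : -term) / (double)k;
        term *= r;
    }
    return sum;
}

static double log_sketch(double x) {                /* x > 0, finite */
    double c = approx_recip(x);
    double r = fma(c, x, -1.0);                     /* r = c*x - 1 in one fma */
    /* ln(x) = ln((1+r)/c) = ln(1+r) - ln(c) */
    return log1p_short(r) - log(c);                 /* log(c): table entry in real code */
}

int main(void) {
    printf("%.17g vs %.17g\n", log_sketch(3.7), log(3.7));
    return 0;
}
```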

10 Some results
Figures for libm latency and VML throughput, double precision:

  Function   Latency   Throughput   Accuracy (ulps)
  atan          56        13.3          0.501
  cbrt          34        10.2          0.501
  exp           43         6.2          0.502
  log           31        11.2          0.501
  sin           49         8.2          0.503
  tan           63        11.3          0.507

All figures are for the double-precision common path. Some cases may be quicker (exp(0)) or slower (sin(2^1000)). Accuracy figures are for libm; those for VML may differ slightly.

11 Conclusions
Using Itanium's architectural features to the full, we can provide a libm with industry-leading speed and accuracy. Almost every special architectural feature of Itanium is fully exploited here, occasionally in unexpected ways.
Some current work:
- Investigate perfectly rounded transcendental functions (with reasonable performance)
- Fill out functionality with high-quality versions of more obscure functions (e.g. Bessel functions J_n(x) and Y_n(x)).

12 Further reading
For quick surveys:
- "New Algorithms for Improved Transcendental Functions on IA-64", Shane Story and Peter Tang, 14th IEEE Computer Arithmetic Conference, 1999.
- "The Computation of Transcendental Functions on the IA-64 Architecture", John Harrison, Ted Kubaska, Shane Story and Peter Tang, Intel Technology Journal, Q4 1999.
Much more detailed information:
- "IA-64 and Elementary Functions: Speed and Precision", Peter Markstein, Hewlett-Packard Professional Books, 2000.
- "Scientific Computing on Itanium-based Systems", Marius Cornea, John Harrison and Peter Tang, Intel Press, 2003.