Pre-Scaling to Avoid Overflow and Maintain Maximum Precision in Fixed-Point Multiplication


D.R. Brown III
November 5, 2012

1 Problem Setup and Notation

In almost all real-time DSP algorithms, we need to perform multiplication and addition. Those operations are straightforward when done in floating point, but are trickier when done in fixed point. This note discusses the special considerations that must be taken into account when performing fixed-point multiplication.

Consider the real numbers x and y. These numbers are quantized to a fixed-point representation as x_q and y_q, where

    x_q = x + ε_x                                                    (1)
    y_q = y + ε_y                                                    (2)

and ε_x and ε_y are the quantization errors in x_q and y_q, respectively. The quantized variables x_q and y_q are assumed to have N_x and N_y total bits as well as M_x and M_y fractional bits, respectively. We wish to compute the product

    z_q = x_q y_q                                                    (3)

as accurately as possible given the constraint that z_q is a fixed-point variable with N_z total bits. In the next section, we determine the maximum number of fractional bits M_z that we can have for z_q while avoiding overflow/saturation, and we also determine the optimal amount of pre-scaling on the multiplicands x_q and y_q to maintain maximum precision.

2 Analysis

As discussed in lecture, to avoid overflow we require the largest positive value of z_q to be at least as large as the largest possible positive product of x_q and y_q, i.e.,

    (2^{N_z-1} - 1) 2^{-M_z} ≥ (2^{N_x-1} - 1)(2^{N_y-1} - 1) 2^{-M_x-M_y}.    (4)
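To make the two sides of (4) concrete, here is a short C sketch (an editorial illustration, not part of the original note; the helper name qmax is made up) that evaluates the largest representable values for the 16-bit formats used in the examples later in this note:

    #include <stdio.h>
    #include <math.h>

    /* Largest positive value representable in an N-bit two's-complement
     * word with M fractional bits: (2^(N-1) - 1) * 2^(-M). */
    static double qmax(int N, int M)
    {
        return (pow(2.0, N - 1) - 1.0) * pow(2.0, -M);
    }

    int main(void)
    {
        double xmax = qmax(16, 11);               /* max x in Q11: 15.999512 */
        double ymax = qmax(16, 13);               /* max y in Q13:  3.999878 */
        printf("max product = %f\n", xmax * ymax);   /* 63.996094 */
        printf("max z in Q9 = %f\n", qmax(16, 9));   /* 63.998047 */
        /* Q9 satisfies (4) here, while Q10 (max 31.999023) would not. */
        return 0;
    }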

In order to maintain full precision, we also require

    M_z ≥ M_x + M_y.                                                 (5)

If both conditions can be satisfied, then you should set M_z = M_x + M_y and you are done; there is no need to determine optimum pre-scaling factors. In many cases, however, both conditions can't be satisfied, since setting M_z = M_x + M_y causes the first condition to simplify to N_z ≥ N_x + N_y - 1. For example, if x_q, y_q, and z_q are all short variables (N_x = N_y = N_z = 16), then both conditions can't be satisfied.

When both conditions can't be satisfied, it is much better to throw away precision, i.e., violate the second condition (5), than to have overflow. This means we need to determine the best value of M_z (which will not satisfy the second condition) and then determine the best pre-scaling factors for x_q and y_q to maintain as much precision as possible in the product. The following sections describe how to do this.

2.1 Determining the Largest M_z That Avoids Overflow

At this point, we know we need to throw away precision, but we don't want to throw away too much precision. Our goal is to find the largest value of M_z such that the product x_q y_q will not overflow the container z_q. To do this, we compare the largest representable value of z_q with the largest possible positive product of x_q and y_q, i.e.,

    (2^{N_z-1} - 1) 2^{-M_z} ≥ (2^{N_x-1} - 1)(2^{N_y-1} - 1) 2^{-M_x-M_y}.    (6)

We are given M_x, M_y, N_x, N_y, and N_z, and we need to solve for M_z. We can rearrange this last equation to write

    2^{M_z} ≤ (2^{N_z-1} - 1) / [(2^{N_x-1} - 1)(2^{N_y-1} - 1) 2^{-M_x-M_y}]    (7)

and then, assuming 2^{N_z-1} ≫ 1 (and likewise for N_x and N_y), we can simplify this last expression to

    2^{M_z} ≤ 2^{N_z - N_x - N_y + M_x + M_y + 1},                   (8)

which implies an upper bound on M_z as

    M_z ≤ N_z - N_x - N_y + M_x + M_y + 1.                           (9)

As an example, suppose M_x = 11, M_y = 13, and N_x = N_y = N_z = 16. Then we should select M_z = 9 to maintain maximum precision while avoiding overflow. This means that the LSB of the product has a value of 2^{-M_z} = 2^{-9}, which is significantly less precise than the full-precision product, whose LSB would have a value of 2^{-(M_x+M_y)} = 2^{-24}. Nevertheless, this is the best we can do while avoiding overflow.
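The bound in (9) is simple enough to compute in code. Here is a small C helper (an editorial addition; the function name max_fractional_bits is made up) with the examples from this section as checks:

    #include <stdio.h>

    /* Largest number of fractional bits Mz that avoids overflow, per (9):
     *   Mz <= Nz - Nx - Ny + Mx + My + 1.
     * A negative result is fine; it just moves the implicit binary point
     * to the right (see the note below). */
    static int max_fractional_bits(int Nx, int Mx, int Ny, int My, int Nz)
    {
        return Nz - Nx - Ny + Mx + My + 1;
    }

    int main(void)
    {
        printf("%d\n", max_fractional_bits(16, 11, 16, 13, 16)); /* prints  9 */
        printf("%d\n", max_fractional_bits(16,  3, 16,  5, 16)); /* prints -7 */
        return 0;
    }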

Note that M_z might be a negative number. This is OK; a negative number of fractional bits just means the implicit binary point is moved to the right. For example, suppose M_x = 3, M_y = 5, and N_x = N_y = N_z = 16. Then (9) gives M_z ≤ 16 - 16 - 16 + 3 + 5 + 1 = -7, so we should select M_z = -7. This just means that the LSB of the product has a value of 2^{-M_z} = 2^7 = 128.

2.2 Determining Optimum Pre-Scaling Factors P_x and P_y

Once we've determined a value for M_z according to the procedure in the previous section, if it violates the condition in (5), we know that we will need to throw away some precision before computing the product. A bad way to do this is

    zq = (xq * yq) >> P;

because overflow occurs before we throw away precision. Instead, we throw away precision before computing the product:

    zq = (xq >> Px) * (yq >> Py);

where P_x and P_y are the pre-scaling factors that we need to determine. (A complete numerical example in C is given at the end of this note.)

First off, how many total bits of precision do we need to throw away? The full-precision product has M_x + M_y fractional bits, but we've determined that we can only have M_z fractional bits to avoid overflow. So we must throw away a total of P = M_x + M_y - M_z fractional bits. Hence, we require P_x + P_y = P. Now, we will determine how to best allocate those bits between P_x and P_y.

We can write the fixed-point product as

    z_q = x_q y_q                                                    (10)
        = (x + ε_x)(y + ε_y)                                         (11)
        = xy + (x ε_y + y ε_x + ε_x ε_y),                            (12)

where xy is the true answer and the parenthesized terms are the error. Recall that x and y are the unquantized multiplicands. Since we quantized these values with N_x and N_y total bits and M_x and M_y fractional bits, respectively, we know these numbers are on the order of

    x ~ 2^{N_x-1-M_x}                                                (13)
    y ~ 2^{N_y-1-M_y}.                                               (14)

We also know the quantization errors are on the order of an LSB value after pre-scaling. Since the number of fractional bits in x_q and y_q after pre-scaling is M_x - P_x and M_y - P_y, respectively, we can say

    ε_x ~ 2^{-(M_x-P_x)}                                             (15)
    ε_y ~ 2^{-(M_y-P_y)}.                                            (16)

Hence, the total error ε = x ε_y + y ε_x + ε_x ε_y in the product is on the order of

    ε ~ 2^{N_x-1-M_x-(M_y-P_y)} + 2^{N_y-1-M_y-(M_x-P_x)} + 2^{-(M_x-P_x)-(M_y-P_y)}.    (17)

Note that the third term in the sum on the right-hand side of (17) is much smaller than the first two, since the product of two quantization errors should be much smaller than the product of a quantization error with an unquantized value. We will ignore that third term in the subsequent analysis since it is insignificant. Simplifying the remaining two terms, we can say the total quantization error of the product is on the order of

    ε ~ 2^{-(M_x+M_y+1)} (2^{N_x+P_y} + 2^{N_y+P_x}).                (18)

We want to find P_x and P_y to minimize this quantity subject to the constraint that P_x + P_y = P. Note that the factor 2^{-(M_x+M_y+1)} in (18) is a positive constant, i.e., not a function of P_x or P_y. This means it doesn't affect the minimization. Using this fact, and substituting P_y = P - P_x into (18), we can write the minimization problem as

    P_x = argmin_{k=0,...,P} (2^{N_x+P-k} + 2^{N_y+k}),              (19)

where argmin means "give me the value of k that achieves the minimum." Note that P_y is then straightforwardly computed as P_y = P - P_x. Also note that the optimum values for P_x and P_y don't depend on the numbers of fractional bits, except indirectly through P, since P = M_x + M_y - M_z.

The easiest way to find this minimum is to just plot it in Matlab for the given values of N_x, N_y, and P using the stem command; a brute-force alternative in C is sketched after the conclusion. As an example, suppose M_x = 11, M_y = 13, and N_x = N_y = N_z = 16. We know from the previous section that we should set M_z = 9 to avoid overflow. This means that we must throw away P = M_x + M_y - M_z = 15 fractional bits to avoid overflow. If we plot (19) for k = 0, ..., 15 with the stem command in Matlab, we get the result shown in Figure 1.

    [Figure 1: Total quantization error in the product as a function of k for the example with M_x = 11, M_y = 13, N_x = N_y = N_z = 16, and M_z = 9.]

These results imply that P_x = 7 or P_x = 8 minimizes the error in the fixed-point product. Hence, either

    zq = (xq >> 7) * (yq >> 8);

or

    zq = (xq >> 8) * (yq >> 7);

would give the best results for this example.

3 Conclusion

Fixed-point processing requires special care to maintain as much precision as possible while avoiding overflow. Throwing away precision is undesirable, but it isn't as catastrophic as overflow. Our goal is to maintain as much precision as possible for as long as possible while avoiding overflow. This note explains how to do this when performing fixed-point multiplication.
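For those who prefer code to a plot, the minimization in (19) can be brute-forced directly. The following C sketch (an editorial addition; the function name best_prescale is made up) reproduces the P_x = 7 (or 8) result shown in Figure 1:

    #include <stdint.h>
    #include <stdio.h>

    /* Brute-force the minimization in (19):
     *   Px = argmin over k = 0..P of ( 2^(Nx+P-k) + 2^(Ny+k) ),  Py = P - Px.
     * 64-bit arithmetic keeps 2^(Nx+P) representable for these wordlengths. */
    static int best_prescale(int Nx, int Ny, int P)
    {
        int best_k = 0;
        uint64_t best_err = UINT64_MAX;
        for (int k = 0; k <= P; k++) {
            uint64_t err = (1ULL << (Nx + P - k)) + (1ULL << (Ny + k));
            if (err < best_err) {
                best_err = err;
                best_k = k;
            }
        }
        return best_k;
    }

    int main(void)
    {
        int P = 15;                                /* Mx + My - Mz = 11+13-9 */
        int Px = best_prescale(16, 16, P);
        printf("Px = %d, Py = %d\n", Px, P - Px);  /* Px = 7, Py = 8 (k = 8 ties) */
        return 0;
    }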

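Finally, here is a self-contained end-to-end C example (an editorial illustration, not from the original note; the sample values and the helper float_to_q are made up) for the M_x = 11, M_y = 13, M_z = 9 case: it quantizes two reals, applies the pre-scaling P_x = 7, P_y = 8 found above, and contrasts the result with the overflowing order of operations. The int16_t cast on the intermediate product emulates a 16-bit target, since desktop C promotes the multiplication to a wider int.

    #include <stdint.h>
    #include <stdio.h>
    #include <math.h>

    /* Quantize a real value to a 16-bit word with M fractional bits
     * (round to nearest; saturation omitted for brevity). */
    static int16_t float_to_q(double v, int M)
    {
        return (int16_t)lround(v * pow(2.0, M));
    }

    int main(void)
    {
        const int Px = 7, Py = 8, P = 15;      /* P = Mx + My - Mz = 11+13-9 */

        int16_t xq = float_to_q(1.234, 11);    /* x in Q11 -> 2527 */
        int16_t yq = float_to_q(0.777, 13);    /* y in Q13 -> 6365 */

        /* Bad: multiply first, then shift. The cast emulates a 16-bit
         * product register wrapping on overflow. */
        int16_t bad = (int16_t)((int16_t)(xq * yq) >> P);

        /* Good: pre-scale first, then multiply. Result is in Q9. */
        int16_t good = (int16_t)((xq >> Px) * (yq >> Py));

        /* Wide-accumulator reference (Q9), for comparison only. */
        int32_t ref = ((int32_t)xq * (int32_t)yq) >> P;

        printf("bad  = %d (garbage)\n", bad);               /* 0   */
        printf("good = %d -> %f\n", good, good / 512.0);    /* 456 -> 0.890625 */
        printf("ref  = %d -> %f\n", (int)ref, ref / 512.0); /* 490 -> 0.957031 */
        /* The true product is 1.234 * 0.777 = 0.958818: pre-scaling costs
         * some LSBs, as predicted by (18), but avoids overflow entirely. */
        return 0;
    }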