Reduced-Error Constant Correction Truncated Multiplier

This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented.
IEICE Electronics Express, Vol.*, No.*, 1-8

Dong Wang 1a), Peng Cao 2), Yang Xiao 1)
1 Institute of Information Science, Beijing Jiaotong University
2 National ASIC System Engineering Technology Research Center, Southeast University
a) wangdong@bjtu.edu.cn

Abstract: A constant correction truncated multiplier minimizes hardware cost and power dissipation by computing only the most significant bits of the partial products. However, traditional schemes introduce large truncation errors that degrade the accuracy of the target application. In this brief, we propose an improved scheme that significantly reduces such truncation errors. The proposed scheme estimates an accurate compensating constant by correctly computing the probability of occurrence of the input operand bits and the product bits with a group of probability propagation formulas. The proposed truncated multiplier is coded in Verilog RTL, implemented in 65 nm standard cell technology, and applied to image compression and color space conversion. Experimental results show that our scheme achieves an average peak signal-to-noise ratio (PSNR) improvement of 4.09 dB and 2.31 dB over the state-of-the-art truncation scheme in the two applications, respectively. When the same truncation error range is desired, the proposed scheme saves around 5.8% of circuit area.
Keywords: constant correction truncated multiplier, digital signal processing
Classification: Electron devices, circuits, and systems

IEICE 2014
DOI: 10.1587/elex.11.20140481
Received May 18, 2014
Accepted June 9, 2014
Publicized June 26, 2014

References
[1] H. J. Ko and S. F. Hsiao: IEEE Trans. Circuits Syst. II, 58 [5] (2011) 304.
[2] M. J. Schulte and E. E. Swartzlander: Proc. IEEE Workshop VLSI Signal Process., (1993) 388.
[3] C. H. Chang and R. K. Satzoda: IEEE Trans. VLSI Systems, 18 [12] (2010) 1767.
[4] E. J. King and E. E. Swartzlander: Proc. Asilomar Conference on Signals, Systems & Computers, (1997) 1178.
[5] M. G. Solaz, W. Han and R. Conway: IEEE Trans. Circuits Syst. I, 59 [11] (2012) 2555.
[6] V. Mahalingam and N. Ranganathan: Proc. VLSI Design, (2006) 6.
[7] M. G. Solaz, W. Han and R. Conway: IEEE Trans. Circuits Syst. I, 59 [11] (2012) 2555.
[8] E. G. Walters III, M. G. Arnold and M. J. Schulte: SPIE Proc. Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, (2003) 573.

[9] V. P. Hoang and C. K. Pham: IEICE Trans. Fund. Electron. Commun. Comput. Sci., Vol.E95-A [6] (2012) 999.
[10] S. Yu and E. E. Swartzlander: IEEE Trans. Computers, 50 [9] (2001) 985.

1 Introduction

Multiplication is a fundamental arithmetic operation that is pervasively used in digital signal processing (DSP) applications such as filtering, convolution and compression. From a hardware implementation perspective, an $n \times n$-bit multiplication yields a $2n$-bit product. To maintain the full accuracy of the system, a DSP architecture would require an ever-growing bit width, which is impossible or impractical to implement. One commonly used method is to truncate and round the result to its $n$ most significant bits so that it fits the limits of the architectural bit width [1]. The circuit area and dynamic power dissipation can also be reduced proportionally.

Several truncation schemes have been proposed for parallel digital multiplier designs [2, 3, 4, 5]. Schulte et al. [2] introduced the constant correction truncated (CCT) scheme, in which $(n-k)$ columns of partial products are truncated, where $k$ is a nonnegative integer between 0 and $n-1$, and a non-zero constant is added to the final result to compensate for the expected value of the truncation error. In [3], Chang et al. extended the CCT scheme to a multiplexer-based array multiplier and introduced an adaptive pseudo-carry to compensate for the truncation error (PCT), which resulted in a carry-save multiplexer array structure. In [4], King et al. proposed a variable correction scheme (VCT), which uses information from the partial product bits of the column adjacent to the truncated part to better estimate the truncation error. Solaz et al. [5] presented a full-precision multiplier in which the effective partial product matrix is selected dynamically at run-time. This scheme allows power reduction to be traded against signal degradation, and the tradeoff can be modified at run-time.
Software compensation of truncation errors was used in this scheme. Other hybrid strategies [6, 7] have also been proposed in the literature.

In the standard CCT scheme [2], the value of the compensating constant is estimated from the expected values of the reduction and rounding errors. We assume that the multiplication of two $n$-bit numbers $A$ and $B$ results in a $2n$-bit product $P$ of the form

$$A = \sum_{i=0}^{n-1} a_i 2^{-n+i}, \quad B = \sum_{i=0}^{n-1} b_i 2^{-n+i}, \quad P = \sum_{i=0}^{2n-1} p_i 2^{-2n+i} \tag{1}$$

where $a_i$ and $b_i$ denote the bits of $A$ and $B$, and $p_i$ denotes the bits of $P$. The relationship between the partial product bit $\pi_{i,j} = a_i b_j$ and the final product bit $p_i$ in the multiplication matrix is shown in Fig. 1-a. The least significant $(n-k)$ columns of partial products are truncated and only the $(n+k)$ most significant columns are summed. Typically, the final product is also rounded to $n$ bits to avoid continuous growth of the word length.

Fig. 1. (a) The partial product matrix of an $n \times n$-bit multiplier. (b) The measured probability distribution of the final product bits $p_i$ ($n = 8$).

The expected value (i.e., the statistical average value) of the reduction error, which is the sum of all partial products in columns 0 to $(n-k-1)$, is calculated as

$$E_{reduct} = \frac{1}{4} \sum_{j=0}^{n-k-1} (j+1)\, 2^{-2n+j} \tag{2}$$

whereas the expected value of the rounding error is estimated by

$$E_{round} = \frac{1}{2} \sum_{j=n-k}^{n-1} 2^{-2n+j} = 2^{-n-1}\left(1 - 2^{-k}\right) \tag{3}$$

The expected value of the total truncation error $E_{total}$ is the sum of $E_{reduct}$ and $E_{round}$. Consequently, the correction constant is selected to offset the total error with $(n+k)$ bits of precision, which is given as

$$C = \mathrm{rnd}\left(2^{n+k} E_{total}\right) 2^{-(n+k)} \tag{4}$$

where the function $\mathrm{rnd}(x)$ rounds $x$ to its nearest integer.

From a statistical perspective, the standard CCT scheme is expected to minimize the average error of the final products. However, two incorrect assumptions are made in deriving (2) and (3). First, it is assumed that both $a_i$ and $b_i$ have an equal probability of being 1 or 0, which leads to the conclusion in (2) that the probability of the partial product bit $\pi_{i,j}$ being one is always 1/4. Unfortunately, in real-world applications the input bits $a_i$ and $b_i$ are typically not evenly distributed. One good example is the discrete cosine transform (DCT) algorithm for digital image compression [8]. In the DCT formula, each 8-bit pixel value (as input $A$) is multiplied by a transform coefficient (as input $B$), which is usually an 8-bit integer constant.
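
As a rough illustration of Eqs. (2) to (4), the sketch below computes the standard CCT constant under the equal-probability assumption and applies it in a small behavioural model of the truncated multiplier. The function names (`cct_constant`, `truncated_multiply`, `mean_error`) are hypothetical, operands are passed as $n$-bit integers standing for $A \cdot 2^n$ and $B \cdot 2^n$, and the model assumes that the final $n$-bit result is obtained by adding the constant and then dropping the low $n$ bits. It is meant as a reading aid, not as the reference design.

```python
import math


def cct_constant(n, k):
    """Standard CCT constant from Eqs. (2)-(4), assuming every input bit is 1 with probability 1/2."""
    # Eq. (2): column j holds (j + 1) partial products, each with expected value 1/4.
    e_reduct = 0.25 * sum((j + 1) * 2.0 ** (-2 * n + j) for j in range(n - k))
    # Eq. (3): each of the k rounded-off product bits is assumed to be 1 with probability 1/2.
    e_round = 2.0 ** (-n - 1) * (1.0 - 2.0 ** (-k))
    scale = 2 ** (n + k)
    # Eq. (4): quantize the total expected error to (n + k) bits; rnd() taken as round-half-up.
    c = math.floor(scale * (e_reduct + e_round) + 0.5) / scale
    return e_reduct, e_round, c


def truncated_multiply(a, b, n, k, c):
    """Behavioural model (an assumption of this sketch): sum the kept columns, add c, keep n bits."""
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n - k:  # keep only the (n + k) most significant columns
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    total += int(c * 2 ** (2 * n))  # constant expressed in units of 2^(-2n)
    return total >> n  # n most significant bits of the 2n-bit result


def mean_error(n, k, c):
    """Average of (truncated result - exact product) over all operand pairs, in units of 2^(-2n)."""
    size = 1 << n
    acc = 0
    for a in range(size):
        for b in range(size):
            acc += (truncated_multiply(a, b, n, k, c) << n) - a * b
    return acc / size ** 2
```

Calling `cct_constant(8, 2)` gives the constant for an 8 × 8-bit multiplier with $k = 2$, and comparing `mean_error(8, 2, c)` with `mean_error(8, 2, 0.0)` shows how much of the average error the constant removes when both operands are uniformly distributed.
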
For such a constant coefficient, the probability of $b_i$ being one is clearly not always 1/2. For instance, given $B = (113)_{10} = (0111\,0001)_2$, the bits $b_i$ with $i = 6, 5, 4, 0$ always equal one. Therefore, using formula (2) leads to an incorrect estimate of the reduction error. Second, it is assumed in (3) that the probability of each product bit from $p_{n-1}$ to $p_{n-k}$ being one is 1/2. To check this, a simulation was conducted to measure the actual probability of each product bit $p_i$, with the input bits $a_i$ and $b_i$ each set to one with probability 1/2. The result, shown in Fig. 1-b, indicates that this assumption is also incorrect.

In this paper, we propose an improved scheme that correctly computes the probability of the partial and final product bits based on the actual probability of occurrence of the input operand bits. With this scheme, we can accurately estimate the compensating constant and therefore reduce the computation error introduced by truncation and rounding.

2 Operand Probabilistic Estimation-based CCT Multiplier

For an $n \times n$-bit multiplication, the partial product matrix of Fig. 1-a can be implemented as an array multiplier consisting of half and full adders, as shown in Fig. 2. Let $p(b_i)$ and $p(a_j)$ denote the probabilities of the input bits $b_i$ and $a_j$ being one, and let $p(i, j)$ denote the probability of the partial product bit $\pi_{i,j}$ at row $i$ and column $j$ of the array being one. It follows directly that $p(i, j) = p(b_i)\, p(a_j)$. However, the probability $p(i)$ of the product bit $p_i$ being one cannot be obtained directly. We propose using the following probability propagation formulas to compute $p(i)$.

Fig. 2. Block diagram of an $n \times n$-bit truncated array multiplier ($n = 8$, $k = 2$). Legend: A, AND gate; AFA, AND and full adder; FA, full adder. The diagram marks the truncated partial products, the rounding bits and the direction of probability propagation.

First, we denote by $p_s(i, j)$ and $p_c(i, j)$ the probabilities of the sum and carry bits of the adder at the $i$-th row and $j$-th column being one, respectively. As the partial products are accumulated by the adders from the upper-left corner of the array to the bottom-right corner, $p_s(i, j)$ and $p_c(i, j)$ can be calculated in a row-by-row manner. For $i = 1$ and $0 \le j \le n-2$ (the second row of the product array), the following equations are used

$$p_s(1, j) = p(1, j)\, q(0, j+1) + q(1, j)\, p(0, j+1) \tag{5}$$

$$p_c(1, j) = p(1, j)\, p(0, j+1) \tag{6}$$

where $q(i, j) = 1 - p(i, j)$ represents the probability of the partial product $\pi_{i,j}$ being zero. For $2 \le i \le n-1$ and $0 \le j \le n-2$, the following equations are used

$$\begin{aligned} p_s(i, j) ={} & p(i, j)\, p_s(i-1, j+1)\, p_c(i-1, j) \\ & + p(i, j)\, q_s(i-1, j+1)\, q_c(i-1, j) \\ & + q(i, j)\, q_s(i-1, j+1)\, p_c(i-1, j) \\ & + q(i, j)\, p_s(i-1, j+1)\, q_c(i-1, j) \end{aligned} \tag{7}$$

$$\begin{aligned} p_c(i, j) ={} & p(i, j)\, p_s(i-1, j+1)\, p_c(i-1, j) \\ & + p(i, j)\, p_s(i-1, j+1)\, q_c(i-1, j) \\ & + p(i, j)\, q_s(i-1, j+1)\, p_c(i-1, j) \\ & + q(i, j)\, p_s(i-1, j+1)\, p_c(i-1, j) \end{aligned} \tag{8}$$

where $q_s(i, j) = 1 - p_s(i, j)$ and $q_c(i, j) = 1 - p_c(i, j)$. For $j = n-1$, $p_s(i, j) = p(i, j)$. We note that $p(i) = p_s(i, 0)$. Consequently, given the probability distribution parameters $p(b_i)$ and $p(a_j)$ of the input operands, $p(i)$ ($0 \le i \le 7$ for $n = 8$) can be calculated with the probability propagation formulas (5) to (8). We thus obtain the new expected value of the reduction error as

$$E_{reduct} = \sum_{i=0}^{n-k-1} \sum_{j=0}^{i} p(i-j, j)\, 2^{-2n+i} \tag{9}$$

and the expected value of the rounding error, which is the sum of the product bits $p_{n-1}$ to $p_{n-k}$, as

$$E_{round} = \sum_{i=n-k}^{n-1} p_s(i, 0)\, 2^{-2n+i} \tag{10}$$

Then, as in the standard CCT scheme, the total truncation error is $E_{total} = E_{reduct} + E_{round}$, and the new correction constant $C$ is calculated by using (4). In the proposed scheme, the value of the compensating constant $C$ is determined by the actual probability of the input operands. Therefore, the expected value of the computation error is closer to zero than it is in traditional schemes.

To verify the effectiveness of the proposed operand probabilistic estimation-based (OPEB) error compensation scheme, we performed software-based simulations of $8 \times 8$-bit and $16 \times 16$-bit truncated multiplications with different configurations of $k$.
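
The propagation formulas (5) to (8) and the resulting constant from (9), (10) and (4) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the names `propagate` and `opeb_constant` are hypothetical, index `i` is taken as the row (bit of $B$) and `j` as the column (bit of $A$) as in Eq. (5), row 0 is treated as plain AND gates so that $p_s(0, j) = p(0, j)$ and $p_c(0, j) = 0$, and $\mathrm{rnd}(\cdot)$ is implemented as round-half-up. Note that the formulas multiply the probabilities of the three adder inputs, i.e. they treat those inputs as independent.

```python
import math


def propagate(pa, pb):
    """Propagate bit probabilities through the adder array.

    pa[j] and pb[i] are the probabilities that a_j and b_i are one.
    Returns (p, ps, pc) indexed as [row i][column j].
    """
    n = len(pa)
    p = [[pb[i] * pa[j] for j in range(n)] for i in range(n)]  # p(i, j) = p(b_i) p(a_j)
    ps = [row[:] for row in p]  # covers row 0 and the boundary p_s(i, n-1) = p(i, n-1)
    pc = [[0.0] * n for _ in range(n)]  # no carries enter row 1
    for i in range(1, n):
        for j in range(n - 1):
            x, y, z = p[i][j], ps[i - 1][j + 1], pc[i - 1][j]
            # Eq. (7): the sum bit is one when an odd number of inputs are one.
            ps[i][j] = (x * y * z + x * (1 - y) * (1 - z)
                        + (1 - x) * (1 - y) * z + (1 - x) * y * (1 - z))
            # Eq. (8): the carry bit is one when at least two inputs are one.
            pc[i][j] = (x * y * z + x * y * (1 - z)
                        + x * (1 - y) * z + (1 - x) * y * z)
    return p, ps, pc


def opeb_constant(pa, pb, k):
    """Correction constant from Eqs. (9), (10) and (4) for given operand bit probabilities."""
    n = len(pa)
    p, ps, _ = propagate(pa, pb)
    # Eq. (9): expected value of the truncated columns 0 .. n-k-1.
    e_reduct = sum(p[i - j][j] * 2.0 ** (-2 * n + i)
                   for i in range(n - k) for j in range(i + 1))
    # Eq. (10): expected value of the rounded-off product bits p_{n-k} .. p_{n-1}.
    e_round = sum(ps[i][0] * 2.0 ** (-2 * n + i) for i in range(n - k, n))
    scale = 2 ** (n + k)
    c = math.floor(scale * (e_reduct + e_round) + 0.5) / scale  # Eq. (4)
    return e_reduct, e_round, c
```

With `pa = pb = [0.5] * n` the reduction term reduces to the closed form of Eq. (2), while a constant operand is described by probabilities that are exactly 0 or 1, for example `pb = [float((113 >> i) & 1) for i in range(8)]`.
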
The recorded average truncation errors $E_{total}$ (expected values) are reported in Figs. 3 and 4. For the $8 \times 8$-bit multiplication, we chose $p(b_i) = p(a_j) = 1/2$ for $i = 0, \ldots, 7$ (i.e., both are random inputs), whereas for the $16 \times 16$-bit multiplication, we changed the probability of operand $B$ to $p(b_i) = 1$ for $i = 3, 7, 8, 10, 12, 15$ and $p(b_i) = 0$ for other $i$ (i.e., operand $B$ is a constant). It can be observed that the proposed scheme achieves considerable error reductions over the three reference schemes when $k$ is smaller than 3. The highest error reduction rates over the three reference schemes are 50.8%, 38.6% and 38.1%, respectively, obtained for $n = 16$ and $k = 1$. Note that for very large $k$, all the schemes have similar errors because only a small part of the partial products is truncated. Hardware implementation results are presented in the following section.

Fig. 3. Comparison of the averaged truncation errors between CCT, VCT, PCT and the proposed scheme for $n = 8$ (average truncation error versus $k$).

Fig. 4. Comparison of the averaged truncation errors between CCT, VCT, PCT and the proposed scheme for $n = 16$ (average truncation error versus $k$).

3 Implementation results and comparisons

The proposed OPEB-CCT scheme is applied to design truncated array multipliers (Fig. 2). Four $16 \times 16$-bit multipliers configured with $k = 1, 2, 3, 4$ were coded in Verilog HDL, functionally verified with Cadence NC-Sim 5.1 and implemented in TSMC 65 nm CMOS standard cell technology. The target working frequency is 100 MHz at 1.2 V. Table I reports the measured area, which is normalized to the equivalent gate count (i.e., total area divided by the area of a 2-input NAND gate). The three most frequently cited schemes, CCT, VCT and PCT, have also been implemented and are compared in the table. For each reference design, the parameter $k$ is set to the smallest value that brings the truncation error $E_{total}$ closest to that of the OPEB-CCT multiplier. $E_{total}$ is normalized in terms of the unit in the last place (ulp) of the output result. The result shows that the proposed scheme achieves approximately 10%, 7.9% and 5.8% area reductions compared to the CCT, VCT and PCT approaches, respectively. Our scheme estimates the truncation error more accurately and can therefore use a smaller $k$, which results in a hardware structure that requires fewer adders.

Table I. Area (gate count) comparison between CCT, VCT, PCT and the proposed scheme ($n = 16$).

| Scheme | $E_{total}$ < 0.9823 ulp: $k$ | Area [gates] | $E_{total}$ < 0.7321 ulp: $k$ | Area [gates] | $E_{total}$ < 0.6070 ulp: $k$ | Area [gates] |
|---|---|---|---|---|---|---|
| CCT | 3 | 1712 (100%) | 4 | 1840 (100%) | 5 | 1950 (100%) |
| VCT | 3 | 1803 (+4.6%) | 3 | 1803 (-2.0%) | 4 | 1921 (-1.5%) |
| PCT | 2 | 1644 (-3.9%) | 3 | 1754 (-4.6%) | 4 | 1883 (-3.4%) |
| Proposed | 1 | 1531 (-11.0%) | 2 | 1660 (-9.8%) | 3 | 1774 (-9.0%) |

Table II. PSNR (dB) comparison for DCT/IDCT image processing.

| Scheme | $k = 1$ | $k = 2$ | $k = 3$ |
|---|---|---|---|
| CCT (hardware) | 24.43 | 28.32 | 30.92 |
| VCT (hardware) | 29.72 | 34.05 | 37.66 |
| PCT (hardware) | 29.34 | 33.89 | 37.83 |
| Proposed (hardware) | 33.25 | 38.35 | 41.73 |

Single-value entries of Table II: Direct Truncated (hardware) 20.26; Floating-point (software) 46.47.

Table III. PSNR (dB) comparison for image CSC processing.

| Scheme | $k = 1$ | $k = 2$ | $k = 3$ |
|---|---|---|---|
| CCT (hardware) | 27.80 | 30.53 | 31.82 |
| VCT (hardware) | 30.02 | 32.35 | 34.77 |
| PCT (hardware) | 29.73 | 32.08 | 34.58 |
| Proposed (hardware) | 32.25 | 34.35 | 36.73 |

Single-value entries of Table III: Direct Truncated (hardware) 22.17; Floating-point (software) 38.35.

To demonstrate the accuracy of the proposed scheme in real applications, we applied the OPEB-CCT multipliers to JPEG image compression [8] and image color space conversion (CSC) [9]. For image compression, two $8 \times 8$ 2-D DCT and IDCT processing cores are built by using $8 \times 8$-bit truncated multipliers. The test image (the $512 \times 512$-pixel "Lena") is first compressed by the DCT core and then decompressed by the IDCT core. For CSC, a test image in 4:4:4 RGB format is first transformed into the YCbCr color space (4:2:0 format) and then transformed back to the original format. Multiplications by constant coefficients are implemented with truncated multipliers instead of a distributed arithmetic scheme based on look-up tables [10]. In both applications, direct truncated, CCT, VCT, PCT and OPEB-CCT multipliers are tested with the parameter set to $k = 1, 2, 3$. The accuracy is evaluated by the peak signal-to-noise ratio (PSNR). Software-based results (i.e., using floating-point arithmetic) are also listed for reference.

The measured results are reported in Tables II and III. Compared with the PCT scheme, which has the best performance among the reference schemes, the proposed approach achieves an average PSNR improvement of 4.09 dB for the DCT/IDCT application and 2.31 dB for the CSC application. Compared with directly truncated multiplication, the proposed design with the lowest hardware cost ($k = 1$) improves the PSNR by 12.99 dB and 10.08 dB, respectively. The results also show that, to achieve the same PSNR, the reference schemes generally have to use a larger $k$, which is consistent with the experimental results presented in the previous section.

4 Conclusion

In this brief, a reduced-error CCT multiplier design scheme is proposed. By correctly computing the probability of occurrence of the input operand bits and of the partial and final product bits, the proposed scheme estimates the reduction and rounding errors more accurately than traditional schemes. Therefore, the proposed scheme can reduce the computation error by applying a more appropriate correction constant.
Truncated array multipliers of $8 \times 8$ and $16 \times 16$ bits were implemented in 65 nm standard cell technology and tested in real image compression and color space conversion applications. Experimental results show that the proposed scheme achieves a 4.09 dB PSNR improvement in the DCT/IDCT-based image compression application and a 2.31 dB PSNR improvement in the CSC application over the PCT multipliers. The proposed scheme also achieves a 5.8% hardware cost reduction over the PCT scheme when the same truncation error performance is desired. Therefore, the proposed reduced-error CCT multiplier is well suited to cost-aware digital signal processing and multimedia processing designs of integrated circuits and microchips.

Acknowledgments

This work was supported by the National Natural Science Foundation of China, Grant No. 61106022, and the Beijing Natural Science Foundation, No. 4143066.
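
For readers who want to reproduce figures of the kind reported in Tables II and III, the PSNR between a reference image and a processed image can be computed as below. The paper does not spell out its PSNR formula, so this sketch assumes the conventional definition for 8-bit samples, $10 \log_{10}(255^2 / \mathrm{MSE})$; the function name `psnr` and the flat-list image representation are illustrative choices only.

```python
import math


def psnr(reference, processed, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values (conventional definition)."""
    if len(reference) != len(processed) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all samples.
    mse = sum((r - t) ** 2 for r, t in zip(reference, processed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

A 2-D image can be flattened row by row before the call, for example `psnr([v for row in original for v in row], [v for row in decoded for v in row])`, where `original` and `decoded` are hypothetical lists of pixel rows.
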