Proc. IEEE Workshop on Signal Processing Systems (SIPS), Tampere, Finland, Oct. 7-9, 2009

SOFTWARE DESIGNS OF IMAGE PROCESSING TASKS WITH INCREMENTAL REFINEMENT OF COMPUTATION

Davide Anastasia and Yiannis Andreopoulos
Department of Electronic and Electrical Engineering, University College London, UK

ABSTRACT

We propose software designs that perform incremental computation with monotonic distortion reduction for two-dimensional convolution and frame-by-frame block-matching tasks. In order to reduce the run time of the proposed designs, we combine bitplane-based computation with a recently proposed packing technique. In the case of block matching, we also utilize previously-computed motion vectors to perform localized search when incrementing the precision of the input video frames. The applicability of the proposed approach is demonstrated by execution time measurements on the xo-laptop (the "100-dollar laptop") and on a mainstream laptop; our software is also made available online. In comparison to the conventional (non-incremental) software realization, the proposed approach leads to scalable computation per input frame while producing identical (or comparable) precision for the output results of each operating point. In addition, the execution of the proposed designs can be arbitrarily terminated for each frame, with the output being available at the already-computed precision.

Index Terms: complexity-scalable image processing, incremental refinement of computation, programmable processors

1. INTRODUCTION

Several popular applications, such as media players, image and video post-processing, and motion estimation and compensation, are being implemented today via software solutions in general-purpose processors. New generations of processors are increasingly powerful and enable more dedicated resource allocation to such real-time multimedia tasks due to multicore designs [1]. At the same time, new generations of software compilers now automatically generate platform-specific optimized assembly code from C++ code [2], thereby enabling platform-independent C++ software solutions to achieve high processor utilization factors.

Existing algorithmic-oriented research focuses on complexity reduction [3]-[5] or complexity scalability for image processing tasks [6]-[8], where computational complexity is decreased and approximate results are produced. Implementation-oriented research focuses on multimedia-driven energy scaling of processors via dynamic voltage scaling [9] [10] in an attempt to provide computational scalability with approximate results. However, previous approaches can only obtain one operational point in the complexity-distortion curve [3] [7], without being able to increment the quality of the output with increased computation. In addition, in practical image and video coding systems, complexity does not scale down significantly with decreased source precision (decreased bitrate) [7]. Finally, existing image processing realizations produce an "all or nothing" output: one cannot interrupt the computation when system resources suddenly become unavailable (or when delay constraints are about to be violated) and retrieve a meaningful approximation of the final result.¹

¹ One could potentially obtain a partial computation of the final result, e.g. parts of the convolution or some block matching results, but not the entirety of the result with graceful degradation.

An exception is found in proposals for incremental computation of transforms and salient point detection algorithms [11] [12], where the main principle is: under a refinement of the source description, the computation of the image processing task refines the previously-computed result. However, existing work [11] [12] uses arithmetic complexity estimates and no practical realizations are given.

In this paper we address this aspect by proposing a practical software framework for image processing tasks exhibiting incremental refinement of computation. Our software designs of two-dimensional (2D) convolution and block-matching operations combine incremental computation with a recently-proposed packing approach that enables the calculation of multiple limited-dynamic-range integer operations via one 32-bit or 64-bit arithmetic operation. The proposed software designs are validated in two different systems and are also provided online [13].

Section 2 presents the overall framework of incremental refinement of image processing operations in software. Sections 3 and 4 present the proposed algorithms, while Section 5 presents the experimental comparisons and Section 6 concludes the paper.

2. IMAGE PROCESSING TASKS USING INCREMENTAL PACKING AND UNPACKING

A general depiction of the proposed framework for incremental computation based on source refinements is presented in Figure 1. In the following subsection we discuss this framework in more detail, while Subsection 2.2 presents the basic tradeoffs of the proposed packing approach. Two useful definitions of quantities used in the remainder of the paper are given below.

Definition 1: For any quantity $a$ used in the computation of an algorithm, $a^{b}$, $0 \le b < N$, is the computed value of $a$ when the input consists of bitplanes $N-1$ down to (and including) bitplane $b$.

Definition 2: For any quantity $a$ used in the computation of an algorithm, $a^{\bar{b}}$, $0 \le b < N$, is the computed value of $a$ when only bitplane $b$ of the input is used.

The notational conventions of Definition 1 and Definition 2 are extended to matrices², e.g. $\mathbf{A}^{\bar{b}}$ is the matrix containing the computed coefficients of $\mathbf{A}$ when only bitplane $b$ of the input image is used.

² Boldface capital letters indicate matrices; the corresponding italicized letters indicate individual matrix elements, e.g. $\mathbf{A}$ and $A[i,j]$; all indexes are integers; superscripts in matrices or scalars indicate the bitplane number and the frame index, the distinction between the two being identifiable from the context.

2.1 Overall Framework

As shown in Figure 1, an input image is initially partitioned into $M$ non-overlapping blocks, whose binary (bitplane-by-bitplane) representation is shown in the middle of the figure, from the most-significant bits (MSBs) to the least-significant bits (LSBs). A total of $N$ increment layers are formed by grouping together the bits of all blocks belonging to the same bitplane $b$, where $b = N-1$ corresponds to the MSBs and $b = 0$ corresponds to the LSBs. Hence, one can have maximally $N = 8$ increment layers for an 8-bit input image.

Each increment layer is also a layer of computation, and we calculate the results of the $M$ blocks of each layer together using an incremental packing approach. In particular, the $M$ blocks $\mathbf{B}_1^{\bar{b}}, \ldots, \mathbf{B}_M^{\bar{b}}$ of one layer are placed together in one block $\mathbf{D}^{\bar{b}}$ by:

$$D^{\bar{b}}[i,j] = \sum_{m=1}^{M} B_m^{\bar{b}}[i,j]\, 2^{\lambda (m-1) \rho}, \qquad (1)$$

where $B_m^{\bar{b}}[i,j]$ is the $(i,j)$-th value of block $\mathbf{B}_m^{\bar{b}}$ ($1 \le m \le M$), which contains the part of increment layer $b$ belonging to the $m$-th spatial block, and $\lambda = -1$ if 64-bit floating-point arithmetic is used, or $\lambda = 1$ if 32-bit unsigned integer arithmetic is used. Equation (1) shows that the $b$-th bitplane of the $m$-th block is scaled by $2^{\lambda (m-1) \rho}$, $\rho > 0$, and is then added to the sum of the previous blocks $1, \ldots, m-1$ of the same increment layer. This leads to a packed increment layer having all $M$ blocks placed in one block $\mathbf{D}^{\bar{b}}$ and using integer or floating-point arithmetic. The best choice for the utilized arithmetic is system dependent, as will be shown by our experiments.

After the packing, the desired image processing task op is applied to $\mathbf{D}^{\bar{b}}$ for each layer $b$, $0 \le b < N$, e.g. the operation with kernel $\mathbf{K}$ is performed by:

$$\mathbf{R}^{\bar{b}} = \left( \mathbf{D}^{\bar{b}} \;\mathrm{op}\; \mathbf{K} \right). \qquad (2)$$

Depending on the algorithm of interest, one could localize the calculation of (2) around areas of interest based on the previously-computed increment layers (as indicated in Figure 1). This will be used in the block matching task.

If an appropriate coefficient $\rho$ is chosen for (1), it can be shown [14] that the results of all the blocks within increment layer $b$ can be extracted from $\mathbf{R}^{\bar{b}}$ if the processing kernel $\mathbf{K}$ contains integers and op is a linear operation. This is based on the so-called "invaders" approach [14], where any integer linear operation can be performed by packing multiple operations together [see (1)] and then unpacking them by the reverse operation, performed recursively for all values of all blocks. For floating-point arithmetic ($\lambda = -1$):

$$m = 1: \quad U_1^{\bar{b}}[i,j] = \left\lfloor R^{\bar{b}}[i,j] + 0.5 \right\rfloor \qquad (3)$$

$$\forall m \in \{2, \ldots, M\}: \quad R^{\bar{b}}[i,j] \leftarrow 2^{\rho} \left( R^{\bar{b}}[i,j] - U_{m-1}^{\bar{b}}[i,j] \right), \quad U_m^{\bar{b}}[i,j] = \left\lfloor R^{\bar{b}}[i,j] + 0.5 \right\rfloor \qquad (4)$$

where $\mathbf{U}_m^{\bar{b}}$ is the output increment of the result for block $m$, $\lfloor a + 0.5 \rfloor$ performs rounding to the nearest integer, and $a \leftarrow f(a)$ assigns $f(a)$ to $a$. For integer arithmetic ($\lambda = 1$), the unpacking is performed by:

$$m = 1: \quad U_1^{\bar{b}}[i,j] = \mathrm{mod}\left( R^{\bar{b}}[i,j],\, 2^{\rho} \right) \qquad (5)$$

$$\forall m \in \{2, \ldots, M\}: \quad R^{\bar{b}}[i,j] \leftarrow \left\lfloor 2^{-\rho} R^{\bar{b}}[i,j] \right\rfloor, \quad U_m^{\bar{b}}[i,j] = \mathrm{mod}\left( R^{\bar{b}}[i,j],\, 2^{\rho} \right) \qquad (6)$$

where $\mathrm{mod}(a, 2^{\rho}) = a - 2^{\rho} \lfloor a\, 2^{-\rho} \rfloor$ is the modulo operation.

The selection of the appropriate packing coefficient $\rho$ depends on the specific algorithm being considered. In addition, even though Figure 1 shows all blocks of the input image being packed together, in practice the value of $M$ depends on the dynamic range of the result of each increment layer. These aspects are elaborated further in the following sections.

As shown in Figure 1, after unpacking, the final stage of the proposed computation increments the previously-computed results of increment layers $N-1, \ldots, b+1$ by adding to them the results of the current layer, $\mathbf{U}_1^{\bar{b}}, \ldots, \mathbf{U}_M^{\bar{b}}$:

$$\forall m \in \{1, \ldots, M\}: \quad \mathbf{U}_m^{b} = \mathbf{U}_m^{b+1} + \mathbf{U}_m^{\bar{b}}, \qquad (7)$$

with $\mathbf{U}_m^{N} \equiv \mathbf{0}$. This leads to computation of the processing task with increasing precision for increasing increment layers, as shown in the visual examples of Figure 1. Due to the utilization of the packing technique, the results of all $M$ blocks are computed simultaneously by (2). Depending on the overhead of the packing and unpacking, we expect to save operations in comparison to the direct computation of each layer.

2.2 Discussion

The parameters controlling the proposed approach of Figure 1 are: the total number of increment layers ($N$), the total number of blocks ($M$), and the parameter $\rho$ that affects the packing of multiple increment layers in one operand $D^{\bar{b}}[i,j]$. Ideally, we would like to maximize the packing capability in order to perform as many operations simultaneously as possible [14].
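The packing of (1), the integer unpacking of (5)-(6) and the accumulation of (7) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the linear operation is a plain "valid" 2D correlation with a non-negative integer kernel, each bitplane is assumed to be stored as a 0/1 matrix whose result is weighted by 2^b when accumulated, and the 32-bit operand limit is only checked by an assertion because Python integers are unbounded.

```python
def bitplane(block, b):
    """Bitplane b of a block of non-negative integers, as a 0/1 matrix."""
    return [[(v >> b) & 1 for v in row] for row in block]


def pack(planes, rho):
    """Eq. (1) with lambda = 1: block m (0-based here) is shifted by m * rho bits."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [[sum(p[i][j] << (m * rho) for m, p in enumerate(planes))
             for j in range(cols)] for i in range(rows)]


def correlate_valid(img, kernel):
    """'Valid' 2D correlation with an integer kernel (a linear operation)."""
    kr, kc = len(kernel), len(kernel[0])
    return [[sum(kernel[u][v] * img[i + u][j + v]
                 for u in range(kr) for v in range(kc))
             for j in range(len(img[0]) - kc + 1)]
            for i in range(len(img) - kr + 1)]


def unpack(result, num_blocks, rho):
    """Eqs. (5)-(6): take rho bits per block via shift (floor) and mask (mod)."""
    mask = (1 << rho) - 1
    return [[[(r >> (m * rho)) & mask for r in row] for row in result]
            for m in range(num_blocks)]


def incremental(blocks, kernel, num_planes, rho):
    """Process bitplanes MSB first; return a snapshot of all results per plane."""
    assert len(blocks) * rho <= 32, "packed word would exceed 32 bits"
    acc, snapshots = None, []
    for b in range(num_planes - 1, -1, -1):
        packed = pack([bitplane(blk, b) for blk in blocks], rho)
        incr = unpack(correlate_valid(packed, kernel), len(blocks), rho)
        if acc is None:
            acc = [[[0] * len(row) for row in out] for out in incr]
        for m, out in enumerate(incr):
            for i, row in enumerate(out):
                for j, u in enumerate(row):
                    acc[m][i][j] += u << b  # Eq. (7), with the assumed 2^b weight
        snapshots.append((b, [[row[:] for row in out] for out in acc]))
    return snapshots
```

Each snapshot is a usable approximation of every block's result, so stopping the loop early yields the precision reached so far; after bitplane 0 the accumulated results equal the direct computation because the operation is linear.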

[Figure 1: block diagram. Input image; spatial image partitioning into image blocks 1 to M; image block bitplanes; increment layer b; packing; algorithm computation using layer b; unpacking; results of layer b added to the results from previous layers N-1, N-2, ..., b+1; processed input image up to layer b, with visual examples for b = 6, 4, 2.]

Figure 1. Incremental refinement of computation using packing and unpacking of increment layers extracted progressively from the input image data. The output result is progressively refined via the computation of more increment layers. The computation of each layer can also utilize results from previous layers to reduce complexity.

As analyzed in the original invaders algorithm, the packing capability depends on the dynamic range of the operations. Furthermore, if packing with integer arithmetic is desired, $\rho$ has to be an integer. The maximum packing coefficient cannot be smaller than $2^{-53}$ for 64-bit floating-point arithmetic [14], and it cannot be larger than $2^{31}$ for 32-bit unsigned integer arithmetic, which leads to $[M + 0.5(\lambda - 1)]\,\rho = \omega$, with $\omega = 53$ or $\omega = 31$, respectively. If the range of all outputs $(\mathbf{B}_m^{\bar{b}} \;\mathrm{op}\; \mathbf{K})$ (for every bitplane $b$ and block $m$) is contained in the interval $[-A_{\max}, A_{\max}]$, then, for loose packing, $\rho \ge \log_2 A_{\max} + 1$. Selecting the minimum value of $\rho$ satisfying the inequality, we reach

$$M \le \frac{\omega}{\log_2 A_{\max} + 1} - 0.5(\lambda - 1). \qquad (8)$$

As expected, the number of packed blocks decreases as the output's dynamic range increases. The output dynamic range of each layer depends on the algorithm of interest and will be discussed in the following section. Finally, in order to ensure there is no numerical error in the calculation when packing with floating-point arithmetic, the magnitude of the maximum possible error [14] must allow for correct rounding of the final results, i.e.:

$$A_{\max}\, 2^{-\omega}\, 2^{(M-1)\rho} < 0.5. \qquad (9)$$

In our designs, $M$ is initially derived by (8) and then decreased (if needed) so that (9) holds.

3. INCREMENTAL 2D CONVOLUTION

For an image consisting of $R_{\mathrm{in}} \times C_{\mathrm{in}}$ pixels, the block partitioning of this case separates the image into $M$ partially overlapping horizontal stripes, each of which is then considered to be the input block of samples, $\mathbf{B}_m^{0}$ ($1 \le m \le M$), having $C_{\mathrm{in}}$ columns. The number of rows in each block is controlled by the input image rows and the packing capability (i.e. the value of $M$). The convolution filter is given by a $V_{\mathrm{kernel}} \times C_{\mathrm{kernel}}$-coefficient kernel $\mathbf{T}_{\mathrm{conv}}$, and convolution of the $m$-th block is performed by:

$$\forall m \in \{1, \ldots, M\}: \quad \mathbf{U}_m^{0} = \mathbf{B}_m^{0} * \mathbf{T}_{\mathrm{conv}}. \qquad (10)$$

In order to produce the correct result with the block-based calculation of (10), consecutive blocks share a common subset of rows $V_{\mathrm{overlap}} = V_{\mathrm{kernel}}/2$, i.e. the first block ("stripe") overlaps with the second block vertically by $V_{\mathrm{overlap}}$ rows, all subsequent blocks overlap with their previous and next blocks by $V_{\mathrm{overlap}}$ rows (above and below the block), and the last block overlaps with its previous block by $V_{\mathrm{overlap}}$ rows.

When bitplanes of the input are used, the process can be performed for each bitplane $b$ of the $m$-th block by:

$$\forall m \in \{1, \ldots, M\}: \quad \mathbf{U}_m^{\bar{b}} = \mathbf{B}_m^{\bar{b}} * \mathbf{T}_{\mathrm{conv}} \qquad (11)$$

and the results are added to the previous ones by (7). If we consider packing the results in order to accelerate the incremental computation, then $\mathbf{D}^{\bar{b}}$ is formed by (1) and it is used to compute the packed result of all $M$ blocks by:

$$\mathbf{R}^{\bar{b}} = \mathbf{D}^{\bar{b}} * \mathbf{T}_{\mathrm{conv}}. \qquad (12)$$

The results are unpacked from $\mathbf{R}^{\bar{b}}$ using (3) and (4), and the final results per bitplane are derived by (7). Visual examples of Gaussian filtering when $b \in \{6, 4, 2\}$ are given in Figure 1.

The packing capability depends on the worst-case dynamic range $A_{\max}$ of the calculation of one increment layer. This can be pre-calculated when the convolution kernel is known to the system by inserting as input the worst-case sub-block:

$$\forall\; 0 \le i < V_{\mathrm{kernel}},\; 0 \le j < C_{\mathrm{kernel}}: \quad B_{\max,q}[i,j] = \begin{cases} 1, & \text{if } (-1)^{q}\, T_{\mathrm{conv}}[i,j] > 0 \\ 0, & \text{if } (-1)^{q}\, T_{\mathrm{conv}}[i,j] \le 0 \end{cases} \qquad (13)$$

for $q \in \{0, 1\}$, and then:

$$A_{\max} = \max_{q} \left\{ \left| \sum_{i=0}^{V_{\mathrm{kernel}}-1} \sum_{j=0}^{C_{\mathrm{kernel}}-1} B_{\max,q}[i,j]\; T_{\mathrm{conv}}[i,j] \right| \right\}.$$
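As a rough illustration of how the worst-case range of (13) and the rounding condition (9) could be evaluated for a given integer kernel, consider the following sketch. The helper names are hypothetical, bitplanes are assumed to be processed one at a time (0/1 inputs), and rho is rounded up to an integer, which the text only requires for integer packing. The starting value of M is assumed to be supplied by the caller.

```python
def worst_case_range(kernel):
    """A_max as in (13): drive all positive (q = 0) or all negative (q = 1)
    kernel coefficients with ones and keep the larger output magnitude."""
    positive = sum(t for row in kernel for t in row if t > 0)
    negative = -sum(t for row in kernel for t in row if t < 0)
    return max(positive, negative)


def integer_rho(a_max):
    """Smallest integer rho with rho >= log2(a_max) + 1, for a_max >= 1."""
    return (a_max - 1).bit_length() + 1


def rounding_is_safe(a_max, num_blocks, rho, omega=53):
    """Condition (9): A_max * 2^(-omega) * 2^((M - 1) * rho) < 0.5."""
    return a_max * 2.0 ** ((num_blocks - 1) * rho - omega) < 0.5


def reduce_blocks(a_max, rho, m_start, omega=53):
    """Decrease M from a starting guess until condition (9) holds."""
    m = m_start
    while m > 1 and not rounding_is_safe(a_max, m, rho, omega):
        m -= 1
    return m
```

For example, a 3x3 integer kernel whose coefficients sum to 16 gives A_max = 16 and rho = 5 under these assumptions; a kernel with mixed signs is handled by taking the larger of the positive and negative partial sums.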

Kernels with non-integer coefficients can be approximated by a fixed-point (FXP) representation with the appropriate number of fractional bits. Hence, they can be computed by convolution with an integer kernel followed by inverse scaling after the termination of the calculation, and can be accommodated by our framework. Finally, when using packing with integer arithmetic, the incremental approach presented in this section only covers the use of convolution kernels with non-negative coefficients. Some additional calculations are required in order to preserve the sign information correctly when packing with integer arithmetic (this is not an issue with floating-point arithmetic [14]). This will be detailed in future work.

4. INCREMENTAL BLOCK MATCHING

The problem of block matching between two successive images $\mathbf{I}^{0,t-1}$ and $\mathbf{I}^{0,t}$ (of $R_{\mathrm{in}} \times C_{\mathrm{in}}$ pixels) can be abstracted as follows. Given the $m$-th non-overlapping block of $R \times C$ pixels in $\mathbf{I}^{0,t}$ ($1 \le m \le M$) and a search area of $2W \times 2W$ overlapping blocks in $\mathbf{I}^{0,t-1}$, find the $R \times C$ block in the search area that is closest to the $m$-th block. Conventional search algorithms use non-linear distance criteria, such as the mean-square error or the sum of absolute differences (SAD) [5]. Hence, they cannot be performed based on the proposed packing framework. However, previous research [4] [15] has shown that approximations of the highly-complex SAD-based full-search motion estimation using simpler bitwise criteria can derive block matching schemes with comparable matching accuracy but with much lower complexity. Since a bitwise distance criterion is a natural fit for the proposed bitplane-based computation, in this paper we focus on incremental approximations of full-search block matching under such frameworks.

In particular, we consider here the popular one-bit motion estimation of Natarajan et al. [4], where a binarization of the input image is performed prior to block matching and the exclusive-OR (XOR) function is used as a matching criterion. This block matching method starts with the application of an integer high-pass 2D mask $\mathbf{T}_{\mathrm{high}}$ to the input images:

$$\forall \tau \in \{t-1, t\}: \quad \mathbf{I}_{\mathrm{high}}^{0,\tau} = \mathbf{I}^{0,\tau} * \mathbf{T}_{\mathrm{high}}, \qquad (14)$$

where we use the 19x19 mask proposed by Erturk [15]:

$$T_{\mathrm{high}}[i,j] = \begin{cases} 1/16, & \text{if } (i,j) \in \{(0,9),(3,6),(3,12),(6,3),(6,9),(6,15),(9,0),(9,6),(9,12),(9,18),(12,3),(12,9),(12,15),(15,6),(15,12),(18,9)\} \\ 0, & \text{otherwise.} \end{cases}$$

After filtering, the one-bit representation of the filtered image is formed by thresholding the high-frequency images [4]:

$$\forall\; 0 \le i < R_{\mathrm{in}},\; 0 \le j < C_{\mathrm{in}}: \quad I_{\mathrm{bin}}^{0,\tau}[i,j] = \begin{cases} 1, & \text{if } I_{\mathrm{high}}^{0,\tau}[i,j] \le I^{0,\tau}[i,j] \\ 0, & \text{otherwise.} \end{cases} \qquad (15)$$

Then, block matching is performed between all $M$ non-overlapping blocks of $R \times C$ (binarized) pixels in $\mathbf{I}_{\mathrm{bin}}^{0,t}$ ($1 \le m \le M$) and their corresponding (binarized) search areas in $\mathbf{I}_{\mathrm{bin}}^{0,t-1}$. For each block $m$ at position $(i_m, j_m)$ within image $\mathbf{I}_{\mathrm{bin}}^{0,t}$, this derives the optimal location $\{x_m^{0,*}, y_m^{0,*}\}$ of the matching block within its search area ($-W \le x, y < W$) in frame $\mathbf{I}_{\mathrm{bin}}^{0,t-1}$ by:

$$\{x_m^{0,*}, y_m^{0,*}\} = \arg\min_{x,y} \sum_{i=0}^{R-1} \sum_{j=0}^{C-1} \left\{ I_{\mathrm{bin}}^{0,t}[i_m + i,\, j_m + j] \oplus I_{\mathrm{bin}}^{0,t-1}[i_m + i + x,\, j_m + j + y] \right\}. \qquad (16)$$

The distance function of (16) is simply the summation of the result of the XOR operation between the current block and each block of the search area, i.e. the Hamming weight of the XOR result. This can be parallelized by packing 32 values of $\mathbf{I}_{\mathrm{bin}}^{0,t}$ and $\mathbf{I}_{\mathrm{bin}}^{0,t-1}$ in two unsigned 32-bit operands and using a specific processor instruction or a few low-cost operations for the calculation of their Hamming weight [4].

When all bitplanes $b \in \{N-1, N-2, \ldots, 0\}$ of the input images $\mathbf{I}^{\bar{b},\tau}$ are processed independently, the first part can be performed by the incremental convolution approach of the previous section. Once the results $\mathbf{I}_{\mathrm{high}}^{b,\tau}$ of the convolution have been produced, the binarization of (15) is applied in order to produce $\mathbf{I}_{\mathrm{bin}}^{b,\tau}$. Then, for every $b$, the best match per block is found by (16) using $\mathbf{I}_{\mathrm{bin}}^{b,t}$ and $\mathbf{I}_{\mathrm{bin}}^{b,t-1}$ and 32-bit integer packing.

However, the above technique is expected to increase the execution time for the incremental block matching process, as each increment layer applies the full search algorithm. In order to accelerate this case, we utilize the knowledge of the best match found for each block during the previous increment layers $N-1, \ldots, b+1$. This is performed as follows. For the first increment layer $b = N-1$, we perform a fast search using logarithmic-step search [5]. Subsequently, for each block $m$ we only search in the neighborhood of the previously-found best match for this block. The localized search pattern is a spiral search with a fixed distance limit of $W_{\mathrm{spiral}}$ pixels horizontally and vertically (see [13] for more details). The use of log-search and the localization of the search around the previously-found best match will produce approximate results per increment layer. Comparisons against the non-incremental full search algorithm in terms of prediction quality (via block matching) vs. complexity are shown in the next section.

5. EXPERIMENTAL RESULTS

For our experiments we used the xo-laptop of the OLPC foundation running its native Linux operating system (denoted as the low-end profile) and a Dell Latitude D630 mainstream laptop with an Intel Core 2 Duo processor (at 2.2 GHz and with 2 GB RAM) running Microsoft Windows XP (denoted as the mainstream profile). All programs were written in C++ and compiled with the gcc 4.1.2 compiler in Linux and with the Microsoft Visual Studio 2008 compiler in Windows, with all default optimizations of the -O2 (maximize speed) mode in both cases. To achieve stable execution-time measurements with high precision on both platforms, we used the Windows QueryPerformanceCounter() function and the Linux gettimeofday() function and ran all programs at the highest priority. Only the execution time required for the computation was measured (and converted to milliseconds based on system-specific timing measurement). We excluded all I/O time from/to the disk, since it produced fixed overhead.

We utilized three CIF video sequences of 300 frames each. For the low-end profile, we downsampled the sequences to QCIF format at 10 Hz in order to achieve real-time (or near real-time) processing with the xo-laptop. For each task, the results present either the signal-to-noise ratio (SNR) or the peak-signal-to-noise ratio (PSNR) for the CIF sequences (similar SNR/PSNR results were obtained for the QCIF sequences). SNR was measured for all convolution experiments using as reference (noise-free) the result when processing up to the LSBs of each frame ($b = 0$). PSNR was measured for the block matching experiment by using the prediction error of frame-by-frame motion compensation (using the original frames) with the produced motion vectors of each algorithm.

5.1 Incremental 2D Convolution Experiments

We present results with 12x12 and 6x6 Gaussian kernels with their coefficients approximated by FXP representation with the fractional part set to 8 bits and 6 bits, respectively. The small kernel is applied in the low-end profile and the large one in the mainstream profile. We also performed an experiment of block cross-correlation using random image blocks of 8x8 and 4x4 pixels as kernel $\mathbf{T}_{\mathrm{conv}}$ for the two profiles. Indicative results are shown in Figure 2 and Figure 3, where we also report the number of packed blocks ($M$) achieved by the incremental approach. Results for the conventional approach are shown when using floating-point and integer arithmetic. In this way we demonstrate that the low-end profile has better performance with 32-bit unsigned integers, while the mainstream profile is faster with 64-bit floating-point arithmetic. For the results of the incremental approach, instead of inserting each bitplane separately in the incremental computation, we inserted pairs of bitplanes together. Per video frame, this provides four termination points for the algorithm's execution, which are indicated by the terminating bitplanes of the figures. The conventional (non-incremental) approach was executed four times, each time using the source precision indicated by the terminating bitplanes in the figures.

The experiments of Figure 2 and Figure 3 indicate that, unlike the conventional approach, incremental computation can achieve extremely scalable complexity with varying source precision. Identical SNR results were obtained for both conventional and incremental algorithms. Importantly, the incremental approach can produce all execution-time vs. distortion measurements via one single execution. In other words, if, for any frame, the computation is terminated arbitrarily at a given point by a task scheduler, the results based on the already computed bitplanes of that frame are readily available in the program's allocated memory. When computation is performed for all bitplanes ($b = 0$), the incremental approach is less efficient in comparison to the best case for the conventional approach. Our experiments demonstrate that the overhead is minimized when the number of terminating bitplanes is smaller than or equal to the number of packed blocks $M$; this is achieved in the mainstream profile. If more terminating bitplanes than $M$ are requested, then the proposed approach becomes less efficient at full precision.

[Figure 2: three plots against the terminating bitplane (6, 4, 2, 0). Average time per frame for 6x6 convolution (low-end, M=3), execution time in ms; average time per frame for 12x12 convolution (mainstream, M=4), execution time in ms; average SNR per frame for 12x12 convolution, SNR in dB against the reference at b=0. Curves: Conventional (int), Incremental, Conventional (float); the SNR plot shows one curve for all algorithms.]

Figure 2. Execution times for FXP convolution and corresponding SNR per CIF frame as a function of the terminating bitplanes.

[Figure 3: three plots against the terminating bitplane (6, 4, 2, 0). Average time per frame for 4x4 cross-correlation (low-end, M=3), execution time in ms; average time per frame for 8x8 cross-correlation (mainstream, M=4), execution time in ms; average SNR per frame for 8x8 cross-correlation, SNR in dB against the reference at b=0. Curves: Conventional (int), Incremental, Conventional (float); the SNR plot shows one curve for all algorithms.]

Figure 3. Execution times for cross-correlation and corresponding SNR per CIF frame as a function of the terminating bitplanes.

5.2 Incremental Block Matching Experiments

Indicative results of the block matching algorithms are shown in Figure 4. We present the case of $R = C = W = 16$. Since the convolution kernel $\mathbf{T}_{\mathrm{high}}$ of Erturk [15] only requires 16 additions, it was found experimentally that the incremental algorithm can perform the convolution directly per input set of bitplanes (rather than use the packing approach) without significant overhead. Similar to the previous case, instead of always inserting individual bitplanes, we inserted the last bitplanes together (as indicated by the terminating bitplanes of Figure 4). This was done because the results demonstrate that the prediction quality increases only marginally when $b < 5$.
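The one-bit matching criterion of (16) that underlies these experiments can be sketched as follows. This is an illustrative full search, not the log-search plus spiral refinement of the proposed design: the function names are hypothetical, each row of a binarized image is packed into one Python integer (standing in for the 32-bit operands mentioned in Section 4), and the Hamming weight is obtained by counting the ones in the binary string of the XOR result rather than by a processor instruction.

```python
def pack_rows(bin_img):
    """Pack each row of a 0/1 image into one integer (bit j holds pixel j)."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in bin_img]


def block_distance(cur_rows, ref_rows, i0, j0, x, y, size):
    """Eq. (16) summand: Hamming weight of the XOR between the current block
    at (i0, j0) and the reference block displaced by (x, y)."""
    mask = (1 << size) - 1
    dist = 0
    for i in range(size):
        a = (cur_rows[i0 + i] >> j0) & mask
        b = (ref_rows[i0 + i + x] >> (j0 + y)) & mask
        dist += bin(a ^ b).count("1")
    return dist


def full_search(cur_rows, ref_rows, i0, j0, size, w, height, width):
    """Scan -w <= x, y < w and return (distance, x, y) of the first minimum.
    Here x displaces rows and y displaces columns, following (16)."""
    best = None
    for x in range(-w, w):
        for y in range(-w, w):
            inside = (0 <= i0 + x and i0 + x + size <= height and
                      0 <= j0 + y and j0 + y + size <= width)
            if not inside:
                continue  # skip candidates that fall outside the frame
            d = block_distance(cur_rows, ref_rows, i0, j0, x, y, size)
            if best is None or d < best[0]:
                best = (d, x, y)
    return best
```

Because the criterion works on packed binary words, one XOR and one bit count compare a whole row of the block at once, which is what makes the bitwise criterion a natural fit for bitplane-based computation.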

The experiments demonstrate that the log-search performed for the first terminating bitplane ($b = 7$) provides a significantly inferior prediction result for the incremental method as compared to the conventional approach that performs full search. However, the prediction quality of the incremental algorithm becomes comparable to the conventional approach once more bitplanes are processed and the spiral search refines the motion vector per block (0.6 dB inferior at $b = 0$). Since the performance seems to saturate when $b < 5$, this can be exploited by the proposed approach to terminate the computation earlier and achieve near real-time performance for both profiles, something that the conventional approach cannot take advantage of, since its execution time does not scale down significantly with decreased source precision. Our approach presents successively-refined precision of block matching with additional computation as more bitplanes are processed, without the need to re-compute the result for each new bitplane. This enables the arbitrary termination of block matching per frame when delay constraints are met (or when resources suddenly become unavailable) while retaining the vectors of all blocks with the already-computed precision.

[Figure 4: three plots against the terminating bitplane (7, 6, 5, 3, 0). Average time per frame for 16x16 block matching (low-end), execution time in ms; average time per frame for 16x16 block matching (mainstream), execution time in ms; average PSNR per frame for 16x16 block matching, PSNR in dB. Curves: Conventional, Incremental.]

Figure 4. Execution times for block matching and average PSNR per predicted CIF frame as a function of the terminating bitplanes.

6. CONCLUSION

A novel, software-based, incremental computation approach for 2D convolution and block matching algorithms was proposed and made available online [13]. The measured runtime of the proposed designs decreases five-fold (or more) with decreased input/output SNR or PSNR, by utilizing a bitplane-based packing technique and reusing previously-computed results. The conventional computation does not achieve such quality vs. complexity scalability, as shown by experimental results on two different platforms. Our method also allows for early termination with graceful degradation and has these features without platform-specific customization. Well-known techniques, e.g. multithreading, can be applied to our approach to improve its execution time without affecting its scalability.

7. ACKNOWLEDGEMENT

This work was supported by EPSRC, grant EP/F020015/1.

REFERENCES

[1] D. Geer, "Chip makers turn to multicore processors," IEEE Computer, vol. 38, no. 5, pp. 11-13, May 2005.
[2] A. J. C. Bik, D. L. Kreitzer, and X. Tian, "A case study on compiler optimizations for the Intel Core 2 Duo processor," Int. J. Parallel Prog., vol. 36, Apr. 2008.
[3] V. K. Goyal and M. Vetterli, "Computation-distortion characteristics of block transform coding," Proc. IEEE Int. Conf. on Acoust., Speech, and Signal Proc., vol. 4, pp. 2729-2732, April 1997.
[4] B. Natarajan, et al., "Low-complexity block-based motion estimation via one-bit transforms," IEEE Trans. Circ. and Syst. Video Technol., vol. 7, no. 4, Aug. 1997.
[5] B. Zeng, et al., "Optimization of fast block motion estimation algorithms," IEEE Trans. Circ. and Syst. Video Technol., vol. 7, no. 6, Dec. 1997.
[6] K. Lengwehasatit and A. Ortega, "Scalable variable complexity approximate forward DCT," IEEE Trans. Circ. and Syst. Video Technol., vol. 14, no. 11, pp. 1236-1248, Nov. 2004.
[7] D. Turaga, M. van der Schaar, and B. Pesquet-Popescu, "Complexity scalable motion compensated wavelet video encoding," IEEE Trans. Circ. Syst. Video Technol., vol. 15, no. 8, Aug. 2005.
[8] S. H. Nawab, A. V. Oppenheim, A. Chandrakasan, J. Winograd, and J. T. Ludwig, "Approximate signal processing," J. of VLSI Signal Process., vol. 15, pp. 177-200, 1997.
[9] W. Yuan and K. Nahrstedt, "Practical voltage scaling for mobile multimedia devices," ACM Internat. Conf. Multimedia, pp. 924-931, 2004.
[10] E. Akyol and M. van der Schaar, "Complexity model based proactive dynamic voltage scaling for video decoding systems," IEEE Trans. on Multimedia, vol. 9, no. 7, pp. 1475-1492, Nov. 2007.
[11] J. Winograd and S. H. Nawab, "Incremental refinement of DFT and STFT approximations," IEEE Signal Process. Letters, vol. 2, no. 2, pp. 25-27, Feb. 1995.
[12] Y. Andreopoulos and I. Patras, "Incremental refinement of image salient-point detection," IEEE Trans. on Image Process., vol. 17, no. 9, pp. 1685-1699, Sept. 2008.
[13] http://www.ee.ucl.ac.uk/~iandreop/ORIP.html
[14] A. Kadyrov and M. Petrou, "The 'Invaders' algorithm: range of values modulation for accelerated correlation," IEEE Trans. PAMI, vol. 28, no. 11, pp. 1882-1886, Nov. 2006.
[15] S. Erturk, "Multiplication-free one-bit transform for low-complexity block-based motion estimation," IEEE Signal Process. Letters, vol. 14, no. 2, Feb. 2007.