Optimizing Motion Vector Accuracy in Block-Based Video Coding

Revised and resubmitted to IEEE Trans. Circuits and Systems for Video Technology, 2/00

Jordi Ribas-Corbera
Digital Media Division, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA

David L. Neuhoff
EECS Dept., University of Michigan, 1201 Beal Ave., Ann Arbor, MI 48109, USA

Abstract

In classical block-based video coding, one motion vector per image block is used to improve the prediction of the frame to be coded. These motion vectors and the resulting motion-compensated difference frame must both be encoded into bits. All motion vectors are encoded with the same fixed accuracy, typically 1 or 1/2 pixel, but the best motion vector accuracies are not known. In this paper, we present a theoretical framework to find the motion vector accuracies that minimize the total encoding rate with this type of coder, for the classical case where all motion vectors are encoded with the same accuracy and for new cases where the accuracy is adapted on a frame-by-frame or block-by-block basis. To do this, we analytically model the effect of motion vector accuracy and show that the energy in a block of the difference frame is approximately quadratic in the accuracy of the block's motion vector. This energy-accuracy model is then used to obtain expressions for the total bit rate (motion rate plus difference frame rate) in terms of the blocks' motion accuracies and other key parameters. Minimizing these expressions leads to simple formulas that indicate how to choose the best motion vector accuracies for this type of coder. These formulas also show that the motion accuracy must increase where more texture is present and decrease when there is much scene noise or when the level of compression is high. We implement several entropy and MPEG-like video coders based on our analysis and present experimental results on synthetic and real video sequences. These results suggest that our formulas are accurate and that significant bit rate savings can be achieved when our optimization procedures are used.

Keywords: video coding, motion estimation, motion compensation, motion vector accuracy, bit allocation, difference frame energy, rate modeling.

1 The first author was formerly with the EECS Dept. of the University of Michigan. This work was supported in part by NSF Grant NCR

I. Introduction

Block-based, motion-compensated video coders are widely used because of their good performance and reasonable complexity. For example, they are the basis of the H.263 and MPEG standards [1-3]. As illustrated in Figure 1, in a typical coder of this class, the current frame to be encoded is divided into small blocks of the same size (typically, 8×8 or 16×16 pixels per block), and for each block a motion vector is found that points to a position in the previous frame (actually, in the decoded reproduction thereof) where a good prediction of the block can be found. Aggregating the predictions of all blocks yields a prediction of the current frame. Subtracting this from the current frame yields a prediction error or difference frame that is encoded into bits, typically with a DCT-based technique. In addition, the motion vectors must themselves be encoded. The payoff for this investment in motion compensation is a savings in the number of bits required to encode the difference frame that substantially exceeds the number of bits required to encode the motion vectors, thus significantly reducing the video encoding rate.

To elaborate further, the encoding rate R (in bits per pixel) of such a video coder is the sum of the encoding rate R_D for the difference frame plus the encoding rate R_M for the motion vectors. It is intuitively clear that, with the quality of the video frame reproduction held roughly constant, increasing R_M will usually improve the motion compensation, which in turn will decrease the energy of the difference frame and hence decrease R_D toward some nonzero limit. Clearly, there must be an optimal value of R_M, i.e., a value that minimizes R = R_D + R_M. The motion rate R_M is principally determined by the number of motion vectors, which is inversely proportional to the size of the blocks, and by the accuracy with which motion vectors are represented.
To explain the latter, we note that though in some coders motion vectors are constrained to point only to pixels in the previous frame, i.e. to be integer valued, it is also possible to have noninteger valued motion vectors [1-10]. These point to blocks in an interpolation of the previous frame. For example, motion compensation with Δ pixel accuracy means that the previous frame has been interpolated by the factor 1/Δ (both horizontally and vertically) and that the components of the motion vectors are multiples of Δ. In H.263, MPEG-1 and MPEG-2 [1,2], the previous frame is interpolated by a factor of 2 (for motion compensation purposes), and so Δ = 1/2 pixel, i.e., Δ is half the distance between two adjacent pixels, and the motion vectors are said to have 1/2 pixel (or subpixel) accuracy. Clearly, the number of bits required to describe the motion vectors (i.e., R_M) increases with higher motion vector accuracy (i.e., smaller Δ).

In this paper we analyze the effect of motion vector accuracy on the overall rate of block-based, motion-compensated video coding. Our goal is to find the best possible accuracy and to explore the benefits of adapting the accuracy on a per frame or per block basis. A companion paper [11] uses a key result developed here and similar optimization methods to analyze the effect of block size. To be a bit more concrete, we note that both motion and difference frame rates may be viewed as functions of the motion accuracy: R_M(Δ) and R_D(Δ). As illustrated in Figure 2, the former increases and the latter decreases with smaller Δ. We seek to find the accuracy that minimizes their sum. To accomplish this we need approximations to R_M(Δ) and R_D(Δ) that, in addition to being reasonably accurate, are sufficiently tractable that one may minimize their sum by differentiation. Indeed, we wish to be able to perform this minimization for each frame, so that one might adapt the motion accuracy on a per frame basis. Moreover, to analyze the potential benefits of adapting the accuracy on a per block basis, we also seek expressions for the motion and difference frame rates as functions of the individual motion accuracy for each block. Such expressions must also be sufficiently accurate and tractable.

To approximate the dependence of the motion rate R_M on the accuracy, we take the viewpoint that an "ideal" motion vector is quantized with a uniform scalar quantizer with level spacing Δ, as in [6,10]. Since the number of bits per motion vector component is not ordinarily small, it is straightforward to find simple approximate expressions for the motion rate function R_M(Δ). We consider cases where the quantized motion vectors are coded with and without entropy coding and prediction from previous motion vectors. Moreover, for the situation where the motion vectors and their x (horizontal) and y (vertical) components can have different accuracies, we find an expression R_M(Δ) for the motion rate in terms of Δ = ((Δ_x,1, Δ_y,1), ..., (Δ_x,N, Δ_y,N)), the individual motion vector accuracies of the N blocks of the frame.
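Under this viewpoint, choosing an accuracy Δ amounts to rounding each component of an ideal motion vector to the nearest multiple of Δ. A minimal sketch of this quantization (the function name is ours):

```python
import numpy as np

def quantize_motion_vector(v_ideal, delta):
    """Round each component of an 'ideal' motion vector to the nearest
    multiple of the accuracy delta (in pixels)."""
    return delta * np.round(np.asarray(v_ideal, dtype=float) / delta)

# an ideal vector quantized at full-, half-, and quarter-pixel accuracy
v = (3.37, -1.62)
for delta in (1.0, 0.5, 0.25):
    print(delta, quantize_motion_vector(v, delta))
```

Smaller Δ yields a finer grid of representable vectors and hence more bits per vector, which is exactly the rate-accuracy tradeoff analyzed below.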
To approximate the difference frame rate R_D, we assume as in previous studies [4,5,7] that it is a function ~R_D(S_D) of the difference frame energy S_D, that the difference frame coder is simply a uniform scalar quantizer with entropy coding, and that the difference frame has a Laplacian (two-sided exponential) distribution. In this case, an expression for ~R_D(S_D) can be straightforwardly derived. (It will be shown later that the results also apply fairly well to DCT-type difference frame coders.) A key step in our work is the development in Section 2 of a simple quadratic expression for the difference frame energy as a function of the motion vector accuracies. Indeed, we obtain an expression for the difference frame energy for each block, based on the accuracies Δ_x and Δ_y of the x and y components of its motion vector. Summing these yields a formula for S_D(Δ), the total difference frame energy, which clearly shows the effects of characteristics such as the texture of the frame and the energy of the interframe noise. The latter is introduced in the difference frame by the block-based motion compensation as a result of illumination changes, camera noise, coding distortion in the previous frame, occlusions, nontranslational motion, and other related phenomena. We then obtain in Section 3 a fairly simple expression of the form

R(Δ) = R_M(Δ) + ~R_D(S_D(Δ)).  (1)

In Section 4, we use the above to find closed form expressions for the optimal motion vector accuracies in four cases: 1) the components of all motion vectors of all frames have the same accuracy, as in most previous methods [1-5,7-10], i.e., in all frames Δ = Δ_x,1 = Δ_y,1 = ... = Δ_x,N = Δ_y,N; 2) the motion accuracies are adapted to each frame, but are constant throughout a frame, i.e., in the j-th frame Δ_j = Δ_x,1 = Δ_y,1 = ... = Δ_x,N = Δ_y,N; 3) the motion accuracies are adapted to each block, with Δ_x,i = Δ_y,i for each block; and 4) the accuracy of each component of each motion vector is individually adapted to the block. Using these expressions, we separately explore the cases of lossless and lossy difference frame coding. In the lossless case, where the difference frame quantizer has level spacing 1, the motion vector coder uses fixed-rate nonpredictive coding of the motion vector components. Although this lossless codec has limited practical significance, it allows us to verify our analysis in a very simple setting and to explore the performance of optimal motion accuracies when there is no coding distortion. In the lossy case, where the difference frame rate is much lower, the motion vector coder uses predictive and entropy coding to increase efficiency. Though the expressions for the optimal motion vector accuracies are in closed form, they involve certain parameters that must usually be estimated from the frames of the video, namely the coefficients of the quadratic expression for the difference frame energy S_D(Δ). Several methods for doing the estimation are described in Section 2, some more suited to the lossless case and some to the lossy case. Section 5 of this paper presents the results of experiments that use the expressions mentioned above to predict the overall rate and to adapt the motion vector accuracies when coding real video sequences.
The results indicate that the expressions, though tractable, are nevertheless fairly accurate, and that the adaptation of the motion vector accuracies to individual frames results in significant rate savings; for example, up to 0.4 bits/pixel in the lossless case and up to 35% in the lossy case over the typical Δ = 1/2 pixel accuracy choice of most video coding schemes. For the tested video sequences, adapting motion accuracy on a per block basis was not found to yield significant rate savings, though it is possible that it would for certain sequences. Nevertheless, in related work [12] we found that using block-adaptive motion accuracy can provide a significant reduction of the computational complexity of motion estimation. Section 6 presents concluding remarks.

It is our belief that an important contribution of the present work is that the analytical expressions quantify phenomena that have been qualitatively observed and intuitively understood in other studies. For example, using idealized models, Girod [4] observed qualitatively that the motion vector accuracy should increase with lower interframe noise. Recently, Benzler [9] showed with extensive empirical experiments that using motion vectors with Δ = 1/4 can often result in significant coding improvements over the typical Δ = 1/2, particularly in highly textured video sequences and at higher bit rates, when coding noise (which is a kind of interframe noise) is low. Our formulas for the optimal motion accuracies show quantitatively how the accuracies increase (i.e., the Δ's become smaller) when there is less interframe noise and also when more texture is present.

There has been previous work that has formulated equations like (1) in order to optimize motion vector accuracies. For instance, Buschmann [10] derived a formula for R_M in terms of motion accuracy (and other parameters), but assumed ideal coding of motion vectors (i.e., used optimal rate-distortion functions) and did not consider the effect of motion accuracy on R_D. In [5], the difference frame energy S_D was measured empirically at each step of a top-down, quadtree-based technique that attempted to find the block sizes and accuracies for the motion vectors that minimized an expression similar to (1). In regard to the motion accuracy aspect, since the work in [5] lacked an analytical expression for S_D, no formulas for the optimal motion accuracies were derived. In fact, the motion accuracies were simply heuristically increased while growing the quadtree and hence were not globally optimized. In a related work [6], the difference frame energy S_D was also measured empirically for several motion vector accuracies, but for fixed block-size motion compensation. At a given block, this method selected the motion accuracy that produced the largest decrease of difference energy per motion bit. As in [5], the motion accuracies were not globally optimized and no analytical expressions for S_D or for the optimal motion accuracies were derived. Other previous work found and made use of an analytical expression for the difference frame energy S_D. Specifically, the seminal work of Girod [4] developed an analytical expression for S_D as a function of the probability distribution of the errors in the motion vectors (which is essentially determined by the motion accuracy), the Fourier transform of the frame, and the power spectral density of the interframe noise. Girod's work was extended to interlaced video frames in [7] and to multihypothesis prediction in [13].
However, Girod's expression for S_D is not sufficiently tractable to permit analytical optimization of motion vector accuracy, let alone adaptation on a block-by-block basis. In fact, his work focused only on studying the effect of motion accuracy (and other parameters) on the difference frame energy S_D, and did not explore the effect on the difference frame rate R_D or on the motion rate R_M. Hence, no expressions were found for the (adaptive or nonadaptive) optimal motion accuracies. Nevertheless, Girod gained interesting insights by modeling the spectrum of image data with an isotropic Gaussian distribution and plotting S_D for different values of Δ. Using these plots and plots of the empirical S_D(Δ) on a few video frames, he reached the important conclusion that for many video sequences the best (nonadaptive) motion vector accuracy is between Δ = 1/2 and Δ = 1/4. However, it is evident that the optimal motion accuracy needs to depend on the nature of the video sequence and the distortions that are present. In fact, in this paper we show cases where the best motion accuracies are outside Girod's interval. Even when the best accuracy is in that interval, if the optimal Δ is close to 1/2, using Δ = 1/4 not only would increase the bit rate, but would also significantly increase computational complexity if the typical block matching technique were used for motion estimation. The latter occurred during the MPEG-4 experiments [9], where it was observed that using Δ = 1/2 often worked better than Δ = 1/4 when coding low-textured scenes at low bit rate (i.e., high distortion). It is largely to predict the best nonadaptive value of Δ (given distortion, image texture, and other
characteristics of the scene) and to adapt motion accuracy on a frame-by-frame or block-by-block basis that our more tractable analysis is developed. Other work related to the analysis of the difference frame energy may be found in [8,14,15].

2. Estimating the Energy in Blocks of the Difference Frame

This section develops approximate expressions for the energy of the difference between a block of the current frame and its prediction from the previous frame, as functions of its motion vector and the motion vector accuracy. Summing over all blocks in the current frame yields an approximate expression for S_D(Δ), the difference frame energy as a function of the motion vector accuracies.

Let F[n] and F⁻[n] denote the present frame and the decoded reproduction of the previous frame, respectively, where n = (n_x, n_y) ∈ Z² denotes a pixel location with integer-valued horizontal and vertical positions n_x and n_y, and F[0,0] is below and to the left of F[1,1]. That is, we use sampled rather than matrix indexing of the pixel locations. Let F(x) and F⁻(x) denote continuous-space interpolated versions of F[n] and F⁻[n], respectively, where x = (x,y) ∈ R². We wish to consider the prediction of a block of F[n] by a block of F⁻(x) as pointed to by a motion vector. For convenience we assume that the block of F[n] to be considered covers a B×B square² and place the origin of the (x,y) coordinate axes at its lower left corner. That is, the current block occupies the square of pixels B = {0,1,...,B-1} × {0,1,...,B-1} and, for instance, blocks to the left and below will be at negative (x,y) locations. Given a motion vector v = (v_x, v_y) ∈ R², the prediction for block B in F[n] is

^F[n] = F⁻(n + v), n ∈ B,  (2)

and the energy of the resulting prediction error is

S(v) = Σ_{n∈B} (F[n] - ^F[n])² = Σ_{n∈B} (F[n] - F⁻(n+v))².  (3)

In the analysis to follow, we approximate the above by an integral:

S(v) ≈ ∫_{B_c} (F(x) - ^F(x))² dx = ∫_{B_c} (F(x) - F⁻(x+v))² dx,  (4)

where B_c = [0,B] × [0,B] denotes the block corresponding to B in continuous space and ^F(x) denotes the continuous-space version of ^F[n].

2 This analysis could be easily generalized to rectangular blocks of B_x × B_y pixels.

2.1 Ideal Motion Vector, Ideal Prediction, and Interframe Noise

Let v* ∈ R² denote the motion vector that minimizes the prediction error energy S(v) in (3). We consider v* and its associated block ^F*(x) = F⁻(x+v*), x ∈ B_c, to be the ideal

motion vector and the ideal prediction, respectively, for the current block. An example (for the one-dimensional case) of the current block and its ideal motion vector and prediction is illustrated in Figure 3. In practice, the current block and its ideal prediction are similar (in both discrete and continuous space), since they typically correspond to the same physical image element, moved v* units from the previous frame. Accordingly, we model the ideal prediction as

^F*(x) = F⁻(x+v*) = F(x) + N(x), x ∈ B_c,  (5)

where N(x) is interpreted as interframe noise produced by light changes, camera noise, coding distortion (in the previous frame), nontranslational motion, occlusions, etc. Without such interframe noise, the current block and its ideal prediction would be identical.

2.2 Effect of Motion Vector Errors

In practice, motion estimation can only produce motion vectors with limited accuracy. Hence, we assume there is some motion error vector u = (u_x, u_y) = v - v*, and we model the prediction of the current block as a shifted version of the noiseless ideal prediction plus the interframe noise, i.e., as

^F(x) = F(x+u) + N(x), x ∈ B_c.  (6)

The motion error and resulting prediction are illustrated in Figure 3. We seek an approximate expression for the prediction error energy as a function of the motion vector error u. That is, we seek the form of

~S(u) = S(v*+u) = ∫_{B_c} (F(x) - F(x+u) - N(x))² dx.  (7)

We decompose the above into

~S(u) = ~S_o(u) + ~S_n(u),  (8)

where

~S_o(u) = ∫_{B_c} (F(x) - F(x+u))² dx,  (9)

~S_n(u) = -2 ∫_{B_c} (F(x) - F(x+u)) N(x) dx + ∫_{B_c} N(x)² dx.  (10)

We approximate the term ~S_n(u) as a constant³ C_n; i.e., it has no significant dependence on u. Since the noise N(x) and the difference F(x) - F(x+u) are at most very weakly correlated, it is anticipated that the second integral term in (10) will dominate. For this reason we consider C_n to be the energy or amount of interframe noise for the current block.

3 The first term in ~S_n(u) is approximately zero; the second is approximately the variance of the noise.

We now focus on the noise-free difference energy ~S_o(u). Our principal result is that ~S_o(u) is, approximately, quadratic in u. One way to demonstrate this is simply by expanding ~S_o(u) in a Taylor series about the point u = (0,0). In doing so, one finds that for small u

~S_o(u) ≈ a u_x² + b u_y² + c u_x u_y,  (11)

where

a = ∫_0^B ∫_0^B (∂F(x,y)/∂x)² dx dy,  b = ∫_0^B ∫_0^B (∂F(x,y)/∂y)² dx dy,  (12)

c = 2 ∫_0^B ∫_0^B (∂F(x,y)/∂x)(∂F(x,y)/∂y) dx dy.  (13)

These coefficients may in turn be approximated as

a ≈ Σ_{n_x=1}^{B-1} Σ_{n_y=0}^{B-1} (F[n_x,n_y] - F[n_x-1,n_y])²,  b ≈ Σ_{n_x=0}^{B-1} Σ_{n_y=1}^{B-1} (F[n_x,n_y] - F[n_x,n_y-1])²,  (14)

c ≈ 2 Σ_{n_x=1}^{B-1} Σ_{n_y=1}^{B-1} (F[n_x,n_y] - F[n_x-1,n_y])(F[n_x,n_y] - F[n_x,n_y-1]).  (15)

From the above, we see that the coefficients a and b are essentially measures of the texture of the current frame along x and along y, respectively, while c is related to the correlation of the texture along x and y. Though (11) shows the basic form of ~S_o(u), it does not indicate how small u needs to be in order for the approximation to be accurate. This could be studied by analyzing the error term in the Taylor series expansion, but such an approach appears to be somewhat complex. Instead, we undertake a direct derivation of ~S_o(u) based on a Fourier series representation of F(x). As additional benefits, we will see how a, b and c depend on the frequency components of F(x), and we will see that c is usually small enough that it can be ignored. In the vicinity of the block B_c = [0,B] × [0,B], we approximate F(x) as periodic with period B, both horizontally and vertically.
Accordingly, it has the Fourier series representation

F(x) = K̄ + 2 Σ_{n∈L} K_n cos(ω_o (x,n) + θ_n),  (16)

where K̄ is the mean of F(x) over B_c, n = (n_x, n_y), L = {n : n_x = 0 and 0 < n_y < ∞, or n_x > 0 and -∞ < n_y < ∞}, ω_o = 2π/B, (x,n) = x n_x + y n_y,

K_n e^{j θ_n} = (1/B²) ∫_{B_c} F(x) e^{-j ω_o (x,n)} dx,  (17)

and K_n ≥ 0. Substituting (16) into (9), which is the definition of ~S_o(u), simplifying, and neglecting small terms yields

~S_o(u) ≈ 4B² Σ_{n∈L} K_n² (1 - cos ω_o (u,n)).  (18)

Using the approximation cos β ≈ 1 - β²/2 for |β| ≤ π/2 in the above and simplifying, we find that when, as often happens⁴, K_n ≈ 0 for all n such that |n_x|/B + |n_y|/B > 1/2,

~S_o(u) ≈ 8π² Σ_{n∈L} K_n² (n_x² u_x² + 2 n_x n_y u_x u_y + n_y² u_y²), |u_x| ≤ 1/2, |u_y| ≤ 1/2.  (19)

From the above we see how the coefficients of the quadratic approximation depend on the components of F(x):

a ≈ 8π² Σ_{n∈L} K_n² n_x²,  b ≈ 8π² Σ_{n∈L} K_n² n_y²,  c ≈ 16π² Σ_{n∈L} K_n² n_x n_y.  (20)

Finally, we note that the coefficient c will ordinarily be small enough to ignore, because for typical image blocks [16, p. 39], the larger values of K_n are along the x and y frequency axes (where the product n_x n_y is zero) or close to the origin (where n_x n_y is small). In summary, we obtain the approximations

~S_o(u) ≈ a u_x² + b u_y²,  (21)

~S(u) ≈ a u_x² + b u_y² + C_n,  (22)

when |u_x| ≤ 1/2 and |u_y| ≤ 1/2. Though one can also derive (20) by substituting the Fourier series representation (16) directly into (12) and (13), in this case one will not learn that the approximation is accurate when the magnitudes of u_x and u_y are at most 1/2, which happens when the motion vector accuracy is less than 1 (pixel or subpixel accuracy), the usual case. Additional derivations of (21) are given in [17,18]. The quadratic dependence of the difference frame energy on motion vector errors was observed in the empirical experiments reported in [19].

4 This means that the amplitudes of the components of F(x) at block frequencies above 1/2 cycles/pixel are small.

2.3 Effect of Motion Vector Accuracies

As mentioned in the introduction, we model the effect of motion accuracy as a uniform quantization of ideal motion vectors. Viewing the latter as random, we compute the average energy of the difference between a block and its prediction as

E[~S(U)] ≈ a E[u_x²] + b E[u_y²] + E[C_n] ≈ α Δ_x² + β Δ_y² + γ,  (23)

where E[·] is the expectation operator,

α = a/12,  β = b/12,  γ = E[C_n],  (24)

and where α and β were obtained assuming that the components of the motion error u = (u_x, u_y) are, approximately, uniformly distributed over quantization cells of width Δ_x and Δ_y, respectively. It is interesting to note that the relationship in (23) will hold even when the value of c in (11) is not negligible, because the realizations of the motion errors u_x and u_y will normally be uncorrelated and have zero mean, i.e., E[c u_x u_y] = c E[u_x] E[u_y] = 0. Summing (23) over all blocks of the current frame yields the following expression for the difference frame energy in terms of the motion vector accuracies specified for its blocks:

S_D(Δ) ≈ (1/(B² N)) Σ_{i=1}^{N} (α_i Δ_x,i² + β_i Δ_y,i² + γ_i),  (25)

where α_i, β_i, γ_i are the quadratic coefficients for the i-th block. This is a key result of the paper. Basically, it presumes a kind of linear relationship between the average size of the motion vector error u and the motion vector accuracy Δ.

2.4 Estimating the Quadratic Coefficients

In order to use (25) to optimize the choice of the Δ's, we need estimates of the quadratic coefficients α, β and γ for each block. Here we mention several methods that can be useful in different situations. We assume that the current and previous frames, F[n] and F⁻[n], and a block with coordinates B are given.

A: Estimate α = a/12 and β = b/12 from the formulas for a and b in (12) or (14). (Notice that (12) requires finding the interpolated image F(x), at least approximately.)

B: Find the Fourier series representation (16) of the interpolation F(x) and estimate α and β from the formulas for a and b in (20).

Methods A and B estimate the α and β coefficients using only the present frame. As a result, they are not influenced by the actual motion or interframe noise in the present frame relative to the previous, as shown for example in (5) or (6).
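The finite-difference sums (14)-(15) used by Method A are simple to compute directly from the pixels of a block; a sketch (the function name and the array layout, rows indexed by n_y and columns by n_x, are our assumptions):

```python
import numpy as np

def texture_coeffs(block):
    """Estimate a, b, c of (11) via the finite-difference sums (14)-(15).
    block: B x B array with rows indexed by n_y and columns by n_x.
    Method A then takes alpha = a/12 and beta = b/12, per (24)."""
    f = np.asarray(block, dtype=float)
    dx = f[:, 1:] - f[:, :-1]   # F[nx,ny] - F[nx-1,ny], horizontal texture
    dy = f[1:, :] - f[:-1, :]   # F[nx,ny] - F[nx,ny-1], vertical texture
    a = float(np.sum(dx ** 2))
    b = float(np.sum(dy ** 2))
    c = 2.0 * float(np.sum(dx[1:, :] * dy[:, 1:]))  # products aligned at nx,ny >= 1
    return a, b, c

# a ramp along x has horizontal texture only: a = B*(B-1), b = c = 0
B = 8
ramp = np.tile(np.arange(B, dtype=float), (B, 1))
print(texture_coeffs(ramp))  # (56.0, 0.0, 0.0)
```

As the ramp example suggests, a and b isolate the texture along each axis, consistent with the interpretation following (15).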
However, since we are actually trying to model the energy in the difference between the present frame and a motion-compensated prediction based on the previous frame, we can do better, in some situations, by using curve fitting procedures based on both frames. While it is possible to use such a method to estimate a, b, c, and then compute α, β, γ from them, the following two methods directly estimate α, β, γ in (23). They exploit the fact that the resulting quadratic expression will be used to determine the best motion accuracies from a finite set of candidates. Let Γ = {δ_1, δ_2, ..., δ_J} denote the set of candidate values for the component accuracies Δ_x and Δ_y. Without loss of generality, assume δ_1 < δ_2 < ... < δ_J. We also assume δ_1 << 1.

C: Measure S(v), as defined by (3), for all v in some square grid of points {v_1, v_2, ..., v_M} in the neighborhood of the block, with the horizontal and vertical resolution of the grid equal to δ_1. Let v* be the grid point v_i for which S(v_i) is smallest. For every candidate pair of accuracies Δ = (Δ_x, Δ_y) with components in Γ, quantize v*_x and v*_y with uniform scalar quantizers with level spacings Δ_x and Δ_y, respectively, obtaining the motion vector ^v_Δ and motion vector error u_Δ = ^v_Δ - v*. Measure ~S(u_Δ) = S(^v_Δ). Find the least squares fit of a polynomial of the form α Δ_x² + β Δ_y² + γ to the set of points (Δ, ~S(u_Δ)). Alternatively, one can let γ = S(v*), without much difference.

D: Let the candidate set of accuracies be Γ = {δ_1, 2δ_1, 4δ_1, ..., 2^{J-1} δ_1}, where as before δ_1 << 1. Find v* = (v*_x, v*_y) as in Method C. Then for every candidate pair of accuracies Δ = (Δ_x, Δ_y) with components in Γ, measure S(v* + (i δ_1, j δ_1)) for all integers i and j such that -Δ_x/2 ≤ i δ_1 ≤ Δ_x/2 and -Δ_y/2 ≤ j δ_1 ≤ Δ_y/2. Let ^S(Δ) be the average of the S(v* + (i δ_1, j δ_1))'s. Find the least squares fit of a polynomial of the form α Δ_x² + β Δ_y² + γ to the set of points (Δ, ^S(Δ)). Alternatively, one can let γ = (^S(δ_1,δ_1) + ^S(2δ_1,δ_1) + ^S(δ_1,2δ_1) + ^S(2δ_1,2δ_1))/4, without much difference.

We found Methods C and D to be very accurate, and for this reason we have used them in this study, even though they are somewhat computationally complex. In practice, they are more appropriate for off-line or long-delay coding (e.g., on-demand video streaming, VCD). However, if the motion vector accuracies are restricted to be the same for the x and y components (e.g., see Modes 1, 2 and 3 later in Section 4), the computation is still reasonable for real-time coding, since only a few difference frame energy measurements per block are required. (For real-time coding, a low complexity approach should be used to find the ideal motion vectors, as in [6,9].)
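The least squares fit in the final step of Methods C and D is an ordinary linear regression on the features (Δ_x², Δ_y², 1); a sketch (names and the synthetic test data are ours):

```python
import numpy as np

def fit_quadratic_energy(deltas, energies):
    """Least-squares fit of alpha*dx^2 + beta*dy^2 + gamma to measured
    difference energies, as in the fitting step of Methods C and D.
    deltas: sequence of (dx, dy) accuracy pairs; energies: measured ~S values."""
    d = np.asarray(deltas, dtype=float)
    A = np.column_stack([d[:, 0] ** 2, d[:, 1] ** 2, np.ones(len(d))])
    coef, *_ = np.linalg.lstsq(A, np.asarray(energies, dtype=float), rcond=None)
    return coef  # (alpha, beta, gamma)

# recover known coefficients from synthetic, noise-free measurements
grid = [(dx, dy) for dx in (1.0, 0.5, 0.25) for dy in (1.0, 0.5, 0.25)]
S = [30.0 * dx ** 2 + 12.0 * dy ** 2 + 5.0 for dx, dy in grid]
print(fit_quadratic_energy(grid, S))  # close to [30. 12. 5.]
```

With real measurements the fit absorbs measurement noise; the recovered γ plays the role of the interframe noise energy E[C_n] in (24).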
We do not present results with Methods A and B in this paper, but since they can be found in related work [11,12,18], we discuss them here briefly. Method A is well suited for very low complexity codecs, and it is still quite effective, as was shown in [11,12]. Method B is appropriate if the Fourier coefficients are available for each block. Some comparisons of the performance of these estimation methods were presented in [18, p. 51].

3. Modeling Rate

This section finds expressions for the motion and difference frame rates in terms of the motion vector accuracies and other key parameters. Ordinarily, the motion vectors are found by a simple motion estimation procedure, such as block matching, that computes the motion vectors on a grid with the chosen accuracy. However, as mentioned before, in this study it is assumed that the ideal motion vectors are computed first and then quantized to the desired accuracy.
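Because the ideal vectors are quantized to the desired accuracy, the error statistics assumed in (24), namely E[u_x²] = Δ_x²/12 for a quantization cell of width Δ_x, can be checked with a quick simulation (a sketch of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.25
v = rng.uniform(-8.0, 8.0, size=1_000_000)   # "ideal" vector components
u = delta * np.round(v / delta) - v          # quantization errors
print(np.mean(u ** 2), delta ** 2 / 12)      # both approx 0.00521
```

The empirical mean squared error matches Δ²/12 closely, supporting the uniform-error assumption behind α = a/12 and β = b/12.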

3.1 Motion Vector Coding and Rate

We consider two kinds of motion vector coding: uniform scalar quantization with fixed-rate coding, and DPCM with entropy coding. In the first case, if the motion vector accuracies for the i-th block of the current frame are Δ_x,i and Δ_y,i, then the x and y components of the ideal motion vector for that block are quantized with, respectively, N_x = 2V_m/Δ_x,i and N_y = 2V_m/Δ_y,i uniformly spaced levels, where V_m is the maximum anticipated displacement between two adjacent frames. The outputs of these quantizers are assumed to be encoded with log_2(2V_m/Δ_x,i) and log_2(2V_m/Δ_y,i) bits, respectively. As a result, the overall rate (in bits per pixel) invested in motion vectors is

R_M(Δ) = (1/(B² N)) Σ_{i=1}^{N} [log_2(2V_m/Δ_x,i) + log_2(2V_m/Δ_y,i)].  (26)

If, as we wish to consider, the motion vector accuracies change with every frame or block, then they must also be encoded and sent to the decoder. However, since in this work Δ_x,i and Δ_y,i will take values in a relatively small set, the rate for this is usually negligible in comparison to the motion and difference frame rates.

DPCM is a more popular technique for encoding motion vectors, due to its lower rate [20]. In fact, it has been adopted by a number of current video coding standards [1,2,3]. With DPCM, the quantized motion vector for the i-th block is v_i = v_{i-1} + q_i(v*_i - v_{i-1}), where v*_i is the ideal motion vector, v_{i-1} is the quantized motion vector for the previous block (in scan order), which serves as a prediction of v*_i, and q_i(·) denotes the operation of uniform scalar quantization of the x and y components of the prediction error with level spacings Δ_x,i and Δ_y,i, respectively. Because we consider entropy coding, the numbers of quantization levels are assumed to be large and the quantized prediction errors are assumed to be encoded with variable-length binary codes.
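As a concrete check of (26): with one Δ = 1/2 vector per 16×16 block and V_m = 16 pixels, each vector costs log_2 64 + log_2 64 = 12 bits, i.e. 12/256 bits per pixel. A sketch (the function name is ours):

```python
import numpy as np

def motion_rate_fixed(deltas, V_m, B, N):
    """Motion rate in bits/pixel for fixed-rate coding, per (26).
    deltas: (dx_i, dy_i) for each of the N blocks of B x B pixels."""
    d = np.asarray(deltas, dtype=float)
    bits = np.log2(2 * V_m / d[:, 0]) + np.log2(2 * V_m / d[:, 1])
    return float(np.sum(bits)) / (B * B * N)

print(motion_rate_fixed([(0.5, 0.5)] * 99, V_m=16, B=16, N=99))  # 12/256 = 0.046875
```

Halving every Δ adds exactly 2 bits per vector, which is the log_2 dependence that the optimization in Section 4 balances against the energy reduction in (25).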
Since the probability distribution of the quantized prediction errors depends on the level spacings, which in turn depend on the motion vector accuracies, and since in this study the motion vector accuracies may vary (with frame or block), we assume there is a different variable-length code for every possible accuracy, and that when the accuracy is Δ, the corresponding variable-length code produces on the average approximately H_Δ bits, where H_Δ is the entropy of the quantized prediction error when the level spacing is Δ. When, as usually happens, the level spacing Δ is considerably smaller than the standard deviation of the prediction errors, one may use the well known approximation [21, p. 228] H_Δ ≈ h - log_2 Δ, where h = -∫ p(w) log_2 p(w) dw is the differential entropy of the (unquantized) prediction errors, which are modelled as having probability density p(w). As a result, the motion rate for DPCM with entropy coding is

R_M(Δ) ≈ (1/(B² N)) Σ_{i=1}^{N} (2h - log_2 Δ_x,i Δ_y,i).  (27)

We summarize the two expressions (26) and (27) for motion rate as

    R_M(Δ) ≈ 2H/B² − (1/(B²N)) Σ_{i=1..N} log2(Δx,i Δy,i)    (28)

where H = log2(2V_m) in the case of uniform scalar quantization, and H = h in the case of DPCM with entropy coding.

3.2 Difference Frame Coding and Rate

The difference frame pixels are encoded by a uniform scalar quantizer with level spacing Q, followed by an entropy coder, where the latter is adapted to the individual frame being encoded. Since image pixels take values in the set {0, 1, ..., 255}, the difference frame pixels take values in {−255, ..., 255}. The encoding is lossless when Q = 1, and lossy when Q > 1. Such a coder is adequate for this study because of the low correlation between the pixel values in the difference frame [4,5]. The mean squared error (MSE) distortion in the Q-quantized difference frame can be approximated by the well known expression Q²/12 [22, p. 152], assuming Q is neither too small nor too large. Equivalently, the peak signal-to-noise ratio (PSNR) is approximately 10 log10(12·255²/Q²) ≈ 58.9 − 20 log10 Q dB. For example, Q = 25 and 20 correspond to approximately 31 and 33 dB, respectively. When Q = 1, the lossless coding case, the MSE is 0 rather than 1/12. Note that the PSNR is affected little by the choice of motion vector accuracies.

Letting p̂_Q(d) denote the frequency of the value d in the Q-quantized difference frame, the rate (in bits/pixel) produced when encoding a particular difference frame is given, approximately, by the entropy H(p̂_Q). Furthermore, as in [5,11], we assume that p̂_Q is the distribution resulting from quantizing a Laplacian density p_D(d) with a uniform scalar quantizer with level spacing Q. When σ²/Q² is large (i.e., Q and the distortion are small relative to the variance σ² of p_D),

    H(p̂_Q) ≈ h(p_D) − log2 Q,    (29)

where h(p_D) = (1/2) log2(2e²σ²) is the differential entropy of p_D [21, p. 228]. On the other hand, when σ²/Q² is small (i.e., Q and the distortion are large), H(p̂_Q) is approximately linear in σ².
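The distortion side of this coder (D ≈ Q²/12 and the corresponding PSNR values quoted above) is easy to check numerically. The following sketch uses synthetic, uniformly distributed difference samples purely to exercise the approximation; the variable names are ours:

```python
import math
import random

def quantize(d, Q):
    """Uniform (mid-tread) scalar quantization with level spacing Q."""
    return Q * round(d / Q)

random.seed(0)
Q = 25
# Synthetic difference-frame samples, only to exercise the Q*Q/12 model.
samples = [random.uniform(-250, 250) for _ in range(200000)]
mse = sum((d - quantize(d, Q)) ** 2 for d in samples) / len(samples)
psnr = 10 * math.log10(255 ** 2 / mse)
# mse comes out near Q*Q/12 = 52.1, and psnr near 31 dB, as in the text.
```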
We combine these two approximations to obtain

    H(p̂_Q) ≈ (1/2) log2(2e²σ²/Q²),   σ²/Q² > 1/(2e)
             (e/ln 2)(σ²/Q²),         σ²/Q² ≤ 1/(2e)    (30)

where the linear function of σ² is the unique tangent of (1/2) log2(2e²σ²/Q²) that passes through the origin, and 1/(2e) is the point of tangency. (Note that the right side of (30) is a continuous, differentiable function of σ²/Q².) Figure 4 compares H(p̂_Q) to the above approximation for different values of σ and Q. Though distributions other than Laplacian could also be considered for the difference frame, for example generalized Gaussian as in [23], we
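The piecewise approximation (30) can be compared against the exact first-order entropy of a quantized Laplacian, computed by direct summation over the quantizer cells. This is an illustrative sketch with our own function names:

```python
import math

def entropy_model(sigma2, Q):
    """Piecewise approximation (30), in bits/pixel, to the entropy of the
    Q-quantized Laplacian difference-frame distribution."""
    t = sigma2 / Q ** 2
    if t > 1 / (2 * math.e):
        return 0.5 * math.log2(2 * math.e ** 2 * t)  # high-resolution branch (29)
    return (math.e / math.log(2)) * t                # tangent through the origin

def entropy_exact(sigma, Q, terms=2000):
    """First-order entropy of a Laplacian (std sigma) quantized with level
    spacing Q, summed directly over the quantizer cells."""
    b = sigma / math.sqrt(2)  # Laplacian scale parameter
    cdf = lambda x: 0.5 * math.exp(x / b) if x < 0 else 1 - 0.5 * math.exp(-x / b)
    H = 0.0
    for k in range(-terms, terms + 1):
        p = cdf((k + 0.5) * Q) - cdf((k - 0.5) * Q)
        if p > 0:
            H -= p * math.log2(p)
    return H
```

The two branches of entropy_model meet smoothly at σ²/Q² = 1/(2e), and in the high-resolution regime the model tracks the exact entropy closely, as illustrated in Figure 4.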

find that our results are not very sensitive to the assumption about p_D(d). For example, when σ²/Q² is large (the low distortion case), the differential entropies of the Gaussian and Laplacian differ only by a constant, which does not affect the minimizations performed later.

As our estimate of σ² we use the average of the per-pixel energies of the blocks in the difference frame, and use (25) to give the approximate dependence of the latter on the motion vector accuracy. That is, we choose

    σ² = S_D(Δ) ≈ (1/(NB²)) Σ_{i=1..N} (α_i Δ²x,i + β_i Δ²y,i + γ_i)    (31)

where α_i, β_i, γ_i are the quadratic coefficients for the ith block of the current frame. Substituting (31) into (30) gives an expression for R_D(Δ), the difference frame rate, as a function of the specified motion vector accuracies.

4. Optimizing Motion Vector Accuracies

In this section we optimize the choice of motion vector accuracies. We assume that the level spacing Q of the difference frame quantizer remains fixed at a value giving a satisfactory PSNR. We are then free to choose the motion vector accuracies to minimize the overall encoding rate, which, by summing the expressions for motion and difference frame rates, may be expressed as

    R(Δ) ≈ (1/2) log2(2e²σ²/Q²) + 2H/B² − (1/(B²N)) Σ_{i=1..N} log2(Δx,i Δy,i),   σ²/Q² > 1/(2e)
           (e/ln 2)(σ²/Q²) + 2H/B² − (1/(B²N)) Σ_{i=1..N} log2(Δx,i Δy,i),        σ²/Q² ≤ 1/(2e)    (32)

where σ² is given by (31). In optimizing the motion vector accuracies, we consider four modes of operation, corresponding to four increasing levels of adaptation.

Mode 1: the nonadaptive, classical approach. The components of all motion vectors of all frames have the same accuracy, as in most previous methods [1-4,7-10]; i.e., Δx,1 = Δy,1 = ... = Δx,N = Δy,N = Δ.

Mode 2: frame-by-frame adaptation. The same as the above, except that Δ is individually chosen for each frame.

Mode 3: block-by-block adaptation. The motion vector accuracies Δx,1, Δy,1, ..., Δx,N, Δy,N are individually chosen for each block of each frame; however, it is required that Δx,i = Δy,i, i = 1,...,N.
Mode 4: component-by-component adaptation.

There are no constraints on the motion vector accuracies, so that each component of each motion vector of each block of each frame can be individually tailored.

For each of these modes, we fix Q, which essentially fixes the distortion(5) at the value D ≈ Q²/12, and minimize the total rate (32) subject to the constraints of the mode. The results given below are obtained by equating to zero the partial derivatives of (32) with respect to the motion accuracies.

Optimized Mode 2: The optimal accuracy for the motion vectors in a given frame is

    Δ* = (2/(B²−2))^(1/2) (µ/ν)^(1/2),   Q²/µ < 2eB²/(B²−2)
         (1/(eB²))^(1/2) (Q²/ν)^(1/2),   otherwise    (33)

where µ = (1/(B²N)) Σ_{i=1..N} γ_i is the average interframe noise energy (per pixel) for the given frame (recall that γ_i is the estimated interframe noise for the ith block of the frame), and ν = (1/(B²N)) Σ_{i=1..N} (α_i + β_i) is a measure of the texture (per pixel) of the given frame. By substituting (33) into (32), and using the fact that the condition Q²/µ < 2eB²/(B²−2) ensures that σ²/Q² > 1/(2e) for the optimal Δ*, and vice versa, one may obtain an expression for R(Δ*), which we omit since it is long and not particularly enlightening. Note that the coding distortion D, which is present in the previous frame, is one of the components of the interframe noise µ. Among other things, this implies that µ ≥ D ≈ Q²/12 and that µ increases with Q.

Optimized Mode 1: The optimized motion accuracy for Mode 1, the nonadaptive case, is the same as (33), except that µ and ν are the average interframe noise and texture over all blocks in all frames of the sequence, and N is the total number of such blocks.

Optimized Mode 3: The optimal motion accuracy for the ith block in a given frame is

    Δ*_i = (2/(B²−2))^(1/2) (µ/ν_i)^(1/2),   Q²/µ < 2eB²/(B²−2)
           (1/(eB²))^(1/2) (Q²/ν_i)^(1/2),   otherwise    (34)

where ν_i = (1/B²)(α_i + β_i) is a measure of the texture in the ith block.

Optimized Mode 4:

(5) Recall, D ≈ Q²/12 when Q is neither too large nor too small.
For example, in the lossless case, Q = 1 and D = 0.
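The closed form (33) can be sanity-checked by implementing the Mode 2 rate model, i.e., (32) with one common accuracy Δ and σ² = νΔ² + µ from (31), and comparing its grid minimizer with the formula. A sketch under our notational assumptions (function names and the example values of µ, ν are ours):

```python
import math

E = math.e

def mode2_rate(delta, mu, nu, Q, B=8):
    """Delta-dependent part of the total rate (32) when every block uses
    the same accuracy delta, with sigma^2 = nu*delta^2 + mu from (31)."""
    t = (nu * delta ** 2 + mu) / Q ** 2
    if t > 1 / (2 * E):
        rd = 0.5 * math.log2(2 * E ** 2 * t)  # high-resolution branch
    else:
        rd = (E / math.log(2)) * t            # tangent branch
    return rd - (2 / B ** 2) * math.log2(delta)  # motion rate term in delta

def optimal_accuracy(mu, nu, Q, B=8):
    """Closed-form minimizer of the Mode 2 rate, eq. (33)."""
    if Q ** 2 / mu < 2 * E * B ** 2 / (B ** 2 - 2):
        return math.sqrt(2 * mu / ((B ** 2 - 2) * nu))
    return math.sqrt(Q ** 2 / (E * B ** 2 * nu))
```

For a textured, mildly noisy frame (say µ = 2, ν = 50), a fine grid search over Δ lands on the formula's value in both regimes: a small Δ* for Q = 1 and a much larger Δ* for Q = 25, reproducing the trend that coarser compression calls for coarser motion vectors.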

The optimal motion accuracies for the ith block in a given frame are

    Δ*x,i = (2/(B²−2))^(1/2) (µ/(2α_i/B²))^(1/2),   Q²/µ < 2eB²/(B²−2)
            (1/(eB²))^(1/2) (Q²/(2α_i/B²))^(1/2),   otherwise    (35a)

    Δ*y,i = (2/(B²−2))^(1/2) (µ/(2β_i/B²))^(1/2),   Q²/µ < 2eB²/(B²−2)
            (1/(eB²))^(1/2) (Q²/(2β_i/B²))^(1/2),   otherwise    (35b)

Formulas (33)-(35) show how the optimal motion vector accuracies depend on the parameters α_i, β_i, γ_i and Q. Since they are based on the quadratic model (31) for the difference frame energy (derived in Section 2), they apply when the desired motion accuracies Δx,i, Δy,i are less than or equal to 1, i.e., for pixel or subpixel accurate motion vectors. The block texture measures α_i, β_i are large for high-texture blocks and small for low-texture blocks, as explained in Section 2. Since they appear in the denominators of (33)-(35), the formulas show that the motion vectors of blocks (or frames, in Modes 1 and 2) with more texture must be encoded more accurately (i.e., with smaller Δ's) than those with less.

Recall that the interframe noise µ corresponds to the per-pixel energy of the difference frame in the ideal case where the motion vectors are encoded with infinite accuracy (i.e., the Δ's are zero). Hence, this noise energy cannot be reduced by more accurate block motion compensation. The interframe noise µ would be zero only if the motion-compensated prediction could be made identical to the current frame, which, as we mentioned earlier, is not possible in real scenes because of camera noise, light changes, occlusions, nontranslational motion, encoding distortion D, etc. Because µ appears in the numerators of (33)-(35), these formulas show that for larger values of µ, lower motion vector accuracies are needed, a fact previously observed by Girod [4]. For large values of Q, µ is replaced by Q² in the numerators; the formulas thus indicate that less accurate motion vectors are needed at higher levels of compression.

5.
Experimental Results

We implemented several video coders and present results of their performance on synthetic and real (gray-level) video sequences. There were two synthetic sequences, "low texture" and "high texture", which are moving low-frequency and high-frequency sinusoids, respectively. The real video sequences are the well-known "caltrain" (a scene with a moving train and a calendar(6)) and "miss america" (a video conferencing scene). The frame resolutions were pixels for both synthetic sequences, for "caltrain", and for

(6) The caltrain sequence is a version of Mobile & Calendar, although not the typical MPEG4 version. It is available by anonymous ftp to ipl.rpi.edu.

"miss america". The frame rates were 30 frames/second. Figure 5 shows a frame of each of these sequences. The results described here are a representative subset of those in [18] and are presented for different levels of distortion and different modes of operation.

We used classical full-search block matching [24, p. 335] on each 8x8 block with the minimum absolute error criterion to compute the (approximately) ideal motion vectors with Δo = 1/64 subpixel accuracy(7). The subpixel values were computed using the commonly used bilinear interpolation, although other more advanced interpolation filters, such as those in [9,10,15,27], could have been used instead. In each case, the first frame of the sequence was intracoded by a simple uniform scalar quantizer with level spacing Q, and each of the following frames was predicted from its respective (encoded) previous frame.

5.1 Video Coder 1: Lossless Entropy Coding of Difference Frames, Scalar Quantization of Motion Vectors

In this first video coder, the difference frame pixels are simply losslessly encoded with a first-order entropy coder, and the ideal motion vectors are uniform scalar quantized with the desired motion accuracies Δx,i, Δy,i as the level spacings. Hence, the total encoding rate is modeled by the top term of (32), with Q = 1 (lossless coding) and H = log2(2V_m) (recall (26) and (28)). The maximum value or velocity in the scalar quantizer is set to V_m = 32 pixels.

We ran this coder first in Mode 1 (fixed, rather than adaptive, motion vector accuracy Δ) on 9 frames of "caltrain" with Δ equal to each value in the set Γ = {1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64}. The solid line in Figure 6 shows the resulting (empirically measured) rates R (in bits per pixel) versus Δ. The dashed line shows the (empirically measured) difference frame rate R_D. The distance between the solid and dashed lines is the rate R_M devoted to motion, which, as expected, increases as Δ decreases (recall Figure 2).
Next we ran the coder in Mode 2 (frame-by-frame adaptation of Δ) on the same 9 frames of "caltrain", using (33) to estimate the best motion vector accuracy Δ* for each frame and Method C of Section 2.4 to estimate the quadratic coefficients α_i, β_i, γ_i in (31) for each frame. This is a good choice because it measures the prediction error energies using the same uniform scalar quantizer that is used for encoding, but other estimation methods could have been used instead. The estimated Δ*'s for the nine frames ranged up to 0.116; i.e., they differed little. The "o" on the Δ-axis of Fig. 6 plots their average. Since each was closer to 1/8 than to any other member of Γ, the performance of the coder with Δ = 1/8, also marked in Fig. 6, provides an indication of how this coder performs with Mode 2 adaptation.

Though Δ need not change significantly over these frames, one cannot conclude that Mode 1 is as good as Mode 2. This is because when Mode 1 is used, as in typical block-based video coders, Δ is usually fixed at 1 or 1/2, as a compromise for many different types

(7) Any motion estimation technique that computes fractional-pel motion vectors could be used as well. For example, the low-complexity methods in [6,19] could be applied. The computational complexity of the full-search approach with different motion accuracies is addressed in [12].

of scenes. As one can see from Fig. 6, this would result in significantly higher rate on these 9 frames (approximately 0.35 bits/pixel larger for Δ = 1/2). As a further example, in [18, pp. 73-74] we ran the coder in Mode 2 on 9 frames of "SRI trees" (a camera panning across a wooded scene) and found that for each frame Δ* ≈ 1/4. This is indicative of the need to adapt Δ, and of how (33) can be used to choose Δ. In summary, the results suggest there are benefits to adapting Δ* to scenes. Frame-by-frame adaptation accomplishes this, but since it appears that the adaptation may need only operate on a scene-by-scene basis, there might be simpler methods for adapting Δ than the frame-by-frame method that we used. One may also see from Fig. 6 that the loss due to using too large a value of Δ is much greater than that due to using too small a value of Δ, an observation that should be useful when using Mode 1.

We also ran the coder in Modes 3 and 4 (block-by-block adaptation of Δx and Δy, with and without the constraint that Δx = Δy). The "+" and "*" in Fig. 6 plot (Δ*avg, R) for Modes 3 and 4, respectively, where Δ*avg is the average of the Δ*'s found using (34) and (35), respectively, and R is the empirical rate of the coder. For practical reasons, each Δ was rounded to the closest value in the set Γ. One can see that Mode 4 gained little over Mode 3, and neither gained substantially over Mode 2 (frame-by-frame adaptation). On the other hand, their use of larger Δ's might reduce the average complexity of block matching.

To gain a sense of the goodness of the predicted Δ*'s, consider Figure 7, which is like Fig. 6, but for only three blocks of one frame of "caltrain", specially chosen so that the parameters α_i, β_i, γ_i were distinctly different for each block. In addition to showing the same lines and marks as Fig. 6, a further mark plots the average accuracy and minimum rate obtained by a full search over all motion accuracy combinations for all three blocks, i.e.,
Mode 4 with (Δx,i, Δy,i) ∈ Γ x Γ, for i = 1, 2, 3. No other blocks were included because of the huge search time that would have been required. One can see that using our optimized accuracies produced encoding rates close to the minimum found by full search. Finally, we mention that a comparison in [18, p. 76] indicates that the rate formulas we use provide accurate predictions.

5.2 Video Coder 2: Lossy Entropy Coding of Difference Frames, DPCM with Entropy Coding of Motion Vectors

In the second video coder, the difference frame is lossy encoded with a uniform scalar quantizer with level spacing Q, followed by a first-order entropy coder. The ideal motion vectors are encoded using conventional DPCM plus entropy coding, as described in Sec. 3.1. The desired block motion accuracies Δx,i, Δy,i determine the level spacings used in the DPCM quantizer. Here, the encoding rate is modeled by (32) with H = h (recall (27), (28)).

Figure 8 is like Fig. 6, but for this lossy video coder and for several levels of distortion, as determined by the values of Q. The solid lines show the total empirical rate R for encoding 4 frames of "caltrain" with different values of Δ. (Since in the lossy case the optimal motion accuracies tend to be larger, we show the rate also for Δ = 2, and omit Δ = 1/64.) Q = 1 operates this coder losslessly, and Q = 25 corresponds to a PSNR of about 31 dB. As in Fig. 6, each "o" shows

the average of our predictions for the value Δ* yielding minimum rate for each frame (from right to left, the "o"'s correspond to Q = 1, 5, 15, 25). The quadratic coefficients α_i, β_i, γ_i were computed using Method D described in Sec. 2.3, although, as in the previous coder, other estimation methods could have been used as well. The remaining marks have the same meanings as in Fig. 6. As with the previous coder, there was little variation in the best Δ from frame to frame; Modes 3 and 4 performed similarly; neither gained significantly over Mode 2; and the performance was less sensitive to an overly small Δ than to an overly large Δ. Notice also that, as expected, the optimal Δ's increased as Q (i.e., distortion) increased. Due to the DPCM motion vector encoder, it was not feasible to produce a plot like Figure 7.

Figure 9 plots "rate-distortion" curves for the different modes of this coder. The solid line is for Mode 1 with motion vector accuracy Δ = 1 (i.e., pixel accurate motion vectors), and the dashed line is for Mode 1 with Δ = 1/2 (i.e., half pixel accuracy). Additional marks indicate the rate-distortion points for optimized Modes 2, 3, and 4.

Figures 10 (a), (b), and (c) are like Fig. 6, but for two frames of "low texture" (Q=25), "high texture" (Q=5), and "miss america" (Q=25), respectively. Fig. 10 (d) is the same as Fig. 8 ("caltrain"), but just for Q=5. These curves illustrate how the rate-accuracy curves change substantially depending on scene texture and compression level, suggesting that adaptation will have benefits. In particular, the optimized Δ*'s decrease when the image texture increases and the compression level decreases. The effect of compression level on the optimal motion accuracies is also reflected in Fig. 11, which shows histograms of the Δx,i values for optimized Mode 4 in "caltrain" when Q=1 (top) and Q=25 (bottom).

5.3 Video Coder 3: DCT and Variable-Length Coding of Difference Frames, DPCM with Entropy Coding of Motion Vectors
Although our model for the difference frame rate in Sec. 3.2 was derived for a coder based on uniform scalar quantization plus entropy coding, rate-distortion theory and high-resolution quantization theory suggest that the rates of more advanced coders are lower than ours by approximately a constant. Therefore, equations (33)-(35) can still be used to find the optimal motion vector accuracies, because our minimization procedure is not affected by a constant difference. To confirm this, we implemented another video coder that uses a block-by-block DCT for the difference frame pixels. Specifically, this third video coder is essentially the same as the second, except that we replaced the difference frame coder with a JPEG encoder [25]. This is quite similar to MPEG [2], because the latter does JPEG-like coding of difference frames and DPCM plus entropy coding of motion vectors.

Figure 12 is also like Fig. 6, but for encoding 30 frames of "caltrain" with video coder 3 at a PSNR of approximately 33 dB. To achieve this PSNR at each frame, we searched for the JPEG quality factor that produced the PSNR closest to 33 dB. Accordingly, we used Q=20 in our formulas. Every fifth frame was intracoded by a JPEG encoder at the same PSNR,

but the rate of the intracoded frames is not included in the results. The plot suggests that our formulas also work well with this more advanced, MPEG-like video coder. Note that we have mainly discussed results on optimizing motion vector accuracies according to (33)-(35), which is the primary focus of this paper. However, results on other aspects of this work (e.g., the precision of our energy and rate models) can be found in [18].

5.4 Discussion of Experimental Results

From our analysis and empirical experiments we conclude the following.

If the motion accuracy is high enough (i.e., Δ small enough), lossless and lossy video coding are relatively insensitive to increases in motion accuracy. This is because the curves of empirical rate versus Δ in Figures 6, 7, 8, 10, and 12 are fairly flat in the vicinity of their minima. Moreover, the performance is generally less sensitive to an overly small Δ than to an overly large Δ.

The rate for optimized Mode 2 (which fixes the same accuracy for all motion vectors in a frame) is as low as that of optimized Modes 3 and 4. Hence, considering different accuracies for different motion vectors did not usually help much. Only for "miss america" in Figure 10 (c) did the adaptive modes achieve significant rate savings over the best nonadaptive case. Nevertheless, block-based accuracy adaptation may have other benefits. For example, in a related work [12], we found that using block-adaptive motion accuracy produces significant computational savings when finding pixel and subpixel accurate motion vectors with typical block matching.

Optimized Modes 2, 3, and 4 are consistently superior to Mode 1 with Δ=1 and Δ=1/2 (the typical motion vector accuracy choices of most video coding schemes). The optimized modes give significant savings in real scenes when Q takes the values 1, 5, and 15. For example, when Q=1, the optimized modes saved up to 0.8 bits per pixel with respect to Δ=1, as shown in Figures 6 and 8.
Optimized Modes 3 and 4 also saved significant rate for Q=25 in Fig. 9 (up to about 17 percent savings with respect to Δ=1/2). The rate savings of the optimized modes are expected to improve further in scenes with higher texture content. For instance, notice that for "high texture" in Figure 10 (b) these modes save up to about 0.8 bits per pixel, or 35 percent, with respect to Mode 1 with Δ=1/2.

The results also indicated that within a scene there was little need to adapt Δ, suggesting that the adaptation can proceed slowly, e.g., over a number of frames, which could perhaps be performed in a simpler fashion than that which we have used.

Optimized Mode 3 is slightly superior to optimized Mode 4 for large Q, probably because the two-dimensional least-squares fit to the quadratic coefficients α_i, β_i, γ_i for Mode 4 is more sensitive to distortion than the one-dimensional least-squares fit for Mode 3.

In general, the experimental results obtained with the video coders suggest that our predictions and assumptions are quite accurate. For example, (33) predicts fairly well the value Δ* where the minimum of the empirical rate for Mode 2 occurs, as seen in Figures 6, 7, 8, 10, and 12. The exact value of Δ that minimizes each of the underlying continuous curves is not known, but when we round each of our Δ*'s to the nearest value in the set of interest Γ, the rate that we obtain is either the minimum or very close to the minimum. Additionally, in the special cases where we made a test, our optimized motion vector accuracies achieved an empirical rate that was very close to that achieved by a full search over all possible accuracy combinations (recall Figure 7). On the other hand, for video coder 2 with Q=25, Fig. 8 shows that the predicted Δ*'s were close to 1/2, whereas the coder would have had smaller rate if they had been closer to 1/4. This is probably because our assumptions and models are less accurate at very large distortions.

Our formulas (33)-(35) show that high motion vector accuracies are required in scenes with high texture, and that lower accuracies are needed in scenes where the prediction is corrupted by camera noise, occlusions, or other phenomena, or when the level of compression is high. This is confirmed by our empirical experiments, particularly those in Fig. 10.

6. Summary and Concluding Remarks

The central theme of this paper is the development of a theoretical framework for finding the best motion accuracies in block-based video coding. Previous research provided interesting insights on this topic, but did not provide concrete formulas for the optimal motion vector accuracies that minimize rate in this type of coding. In this work, we presented effective difference frame and motion vector rate models that are tractable enough that analytical formulas for the optimal motion vector accuracies can be derived.
To do this, we studied the effect of motion vector accuracy on the energy of a block in the difference frame and concluded that this energy is approximately quadratic in the block motion vector errors and accuracies. Adding the block energies, we obtained an expression for the difference frame energy in terms of the blocks' motion accuracies. We then derived formulas for the difference frame rate (which used the energy-accuracy expression) and the motion vector rate, in terms of the motion accuracies. Minimizing these formulas, we obtained concrete, analytical expressions for the optimal motion accuracies for different modes of operation (sequence, frame, or block adaptive).

We implemented three block-based video coders and obtained experimental results on a variety of video sequences and distortion levels. The results suggest that our equations and assumptions are fairly accurate; that, in general, there is not much benefit to adapting motion accuracy on a block-by-block basis; that the optimal motion vector accuracies depend on the texture and interframe noise of the particular scene (and can lie outside the 1/2-1/4 pixel interval [4]); and that optimizing motion vector accuracies with our procedures can provide significant bit rate savings over the typical pixel and half-pixel accuracy

choices of current video coders. Specifically, in our tests, most savings occurred for highly textured scenes at low compression levels, and reached about 35 percent. For some time, it has been known that using several block sizes in video coding provides some coding benefits (e.g., H.263 [1] can use 16x16 or 8x8 blocks), but most video codecs still use the same, fixed motion accuracy. The results of this paper motivated the study and experiments in [27], in which multiple video sequences were coded (with an H.26L codec) at a variety of bit rates with several motion accuracies. This study also confirmed that some accuracies are significantly better than others, according to the specific sequence content and distortion level. As a result, H.26L has recently incorporated several motion accuracies in the test model TML-2 [28].

Our framework for optimizing motion accuracy can potentially be applied to quadtree or object-based video coders, and can be extended to optimize other coding parameters such as block size [11]. Also, our rate equations can potentially be used to predict the bit rate and the potential coding gains when using different accuracies (cf. [18, p. 76]). Finally, our models can be applied to other topics in video coding such as rate control [26] or motion estimation [19].

References

[1] ITU-T, "Video coding for low bitrate communication," ITU-T Recommendation H.263; version 1, Nov. 1995; version 2, Jan. 1998.
[2] D. Le Gall, "MPEG: A video compression standard for multimedia applications," Commun. ACM, Vol. 34, Apr. 1991.
[3] Video Group, "Text of ISO/IEC MPEG4 video VM," ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Assoc. Audio, MPEG 97/W1796, Stockholm, July 1997.
[4] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE J. Sel. Areas Commun., Vol. 5, Aug. 1987.
[5] F. Moscheni, F. Dufaux and H.
Nicolas, "Entropy criterion for optimal bit allocation between motion and prediction error information," Proc. SPIE VCIP, Cambridge, Nov. 1993.
[6] S. Gupta and A. Gersho, "On fractional pixel motion estimation," Proc. SPIE VCIP, Cambridge, Nov. 1993.
[7] L. Vandendorpe, L. Cuvelier and B. Maison, "Statistical properties of prediction error images in motion compensated interlaced image coding," Proc. IEEE ICIP, Vol. 3, Washington, D.C., Oct. 1995.
[8] H. Ito and N. Farvardin, "On motion compensation of wavelet coefficients," Proc. IEEE ICASSP, Vol. 4, Detroit, May 1995.
[9] U. Benzler, "Results of core experiment P8: motion and aliasing compensating prediction," ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Associated Audio, MPEG 96/1512, Maceio, Nov. 1996.
[10] R. Buschmann, "Efficiency of displacement estimation techniques," Signal Processing: Image Communication, Vol. 10.

[11] J. Ribas-Corbera and D. L. Neuhoff, "Optimizing block size in motion-compensated video coding," Journal of Electronic Imaging, Vol. 7, Jan. 1998.
[12] J. Ribas-Corbera and D. L. Neuhoff, "Reducing rate/complexity in video coding by motion estimation with block adaptive accuracy," Proc. SPIE VCIP, Orlando, Mar. 1996.
[13] B. Girod, "Why B-pictures work: a theory of multi-hypothesis motion-compensated prediction," Proc. IEEE ICIP, Vol. II, Chicago, Oct. 1998.
[14] M. Hötter, "Optimization and efficiency of an object-oriented analysis-synthesis coder," IEEE Trans. Circuits and Systems for Video Technology, Vol. 4, Apr. 1994.
[15] K. Illgner and F. Müller, "Analytical analysis of subpel motion compensation," Proc. Picture Coding Symposium, Berlin.
[16] J. S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall Signal Processing Series.
[17] J. Ribas-Corbera and D. L. Neuhoff, "On the optimal motion vector accuracy for block-based motion-compensated video coders," Proc. IS&T/SPIE Dig. Video Compr.: Alg. and Tech., San Jose, Feb. 1996.
[18] J. Ribas-Corbera, "Optimizing the motion vector accuracies in block-based video coding," Ph.D. Thesis, University of Michigan, Ann Arbor.
[19] X. Li and C. Gonzales, "A locally quadratic model for the motion estimation error criterion function and its application to subpixel interpolations," IEEE Trans. Circuits and Systems for Video Technology, Vol. 6, Feb. 1996.
[20] P. Guillotel and C. Chevance, "Comparison of motion vector coding techniques," Proc. SPIE Image and Video Compression, Feb. 1994.
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
[22] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992.
[23] K. Sharifi and A. Leon-Garcia, "Estimation of shape parameter for generalized Gaussian distributions in subband decompositions of video," IEEE Trans.
Circuits and Systems for Video Technology, Vol. 5, Feb. 1995.
[24] A. Netravali and B. Haskell, Digital Pictures: Representation and Compression, Plenum.
[25] G. K. Wallace, "The JPEG still picture compression standard," Commun. ACM, Vol. 34, Apr. 1991.
[26] J. Ribas-Corbera and S. Lei, "Rate control in DCT video coding for low-delay communications," IEEE Trans. Circuits and Systems for Video Technology, Vol. 9, Feb. 1999.
[27] J. Shen and J. Ribas-Corbera, "More experiments and low-complexity AMA for H.26L," ITU-T SG16/Q15, doc. Q15-I-38, Red Bank, Oct. 1999.
[28] Test model TML-2 (G. Bjontegaard, ed.), ITU-T SG16/Q15, doc. Q15-I-36, Red Bank, Oct. 1999.

[Block diagram: the current frame is predicted by motion compensation of the previous decoded frame; the difference frame is sent to the difference frame coder (rate R_D, bits/pixel) and the motion vectors to the motion vector coder (rate R_M, bits/pixel); total rate R = R_D + R_M.]

Figure 1: A typical block-based video coder.

[Plot: typical and optimal behavior of R, R_M, and R_D versus the motion vector accuracy Δ.]

Figure 2: The typical effect of motion vector accuracy Δ (e.g., Δ = 1/2 corresponds to half pixel accuracy) on encoding bit rate R, when distortion is fixed. More accurate motion vectors (smaller Δ) result in higher motion rate R_M and lower difference frame rate R_D. The motion accuracy that minimizes R is denoted Δ*.

[Diagram: current block F(x); previous frame with ideal prediction and computed prediction; v* is the ideal motion vector, v the computed motion vector, Δ the motion vector accuracy, and u = v − v* the motion vector error.]

Figure 3: Illustration of a block's motion vector error u and accuracy Δ, for the one-dimensional case. The location of the prediction for the block is shifted by u units from the ideal prediction, because of the limited accuracy of the computed motion vector.

[Plot: entropy versus standard deviation σ for Q = 1, 3, and 15.]

Figure 4: From top to bottom, the solid lines are the entropy of a Q-quantized Laplacian distribution with respect to the standard deviation, when Q = 1, 3, and 15. The dashed lines are the respective approximations obtained using (30).

Figure 5: A frame from each of the video sequences used in the experiments: (a) "Low Texture", (b) "High Texture", (c) "Miss America", (d) "Caltrain".


ECE472/572 - Lecture 11. Roadmap. Roadmap. Image Compression Fundamentals and Lossless Compression Techniques 11/03/11. ECE47/57 - Lecture Image Compression Fundamentals and Lossless Compression Techniques /03/ Roadmap Preprocessing low level Image Enhancement Image Restoration Image Segmentation Image Acquisition Image

More information

1462 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 10, OCTOBER 2009

1462 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 10, OCTOBER 2009 1462 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 10, OCTOBER 2009 2-D Order-16 Integer Transforms for HD Video Coding Jie Dong, Student Member, IEEE, King Ngi Ngan, Fellow,

More information

SPEECH ANALYSIS AND SYNTHESIS

SPEECH ANALYSIS AND SYNTHESIS 16 Chapter 2 SPEECH ANALYSIS AND SYNTHESIS 2.1 INTRODUCTION: Speech signal analysis is used to characterize the spectral information of an input speech signal. Speech signal analysis [52-53] techniques

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

ECE 634: Digital Video Systems Wavelets: 2/21/17

ECE 634: Digital Video Systems Wavelets: 2/21/17 ECE 634: Digital Video Systems Wavelets: 2/21/17 Professor Amy Reibman MSEE 356 reibman@purdue.edu hjp://engineering.purdue.edu/~reibman/ece634/index.html A short break to discuss wavelets Wavelet compression

More information

Homework Set 3 Solutions REVISED EECS 455 Oct. 25, Revisions to solutions to problems 2, 6 and marked with ***

Homework Set 3 Solutions REVISED EECS 455 Oct. 25, Revisions to solutions to problems 2, 6 and marked with *** Homework Set 3 Solutions REVISED EECS 455 Oct. 25, 2006 Revisions to solutions to problems 2, 6 and marked with ***. Let U be a continuous random variable with pdf p U (u). Consider an N-point quantizer

More information

Objectives of Image Coding

Objectives of Image Coding Objectives of Image Coding Representation of an image with acceptable quality, using as small a number of bits as possible Applications: Reduction of channel bandwidth for image transmission Reduction

More information

Principles of Communications

Principles of Communications Principles of Communications Weiyao Lin, PhD Shanghai Jiao Tong University Chapter 4: Analog-to-Digital Conversion Textbook: 7.1 7.4 2010/2011 Meixia Tao @ SJTU 1 Outline Analog signal Sampling Quantization

More information

AN IMPROVED CONTEXT ADAPTIVE BINARY ARITHMETIC CODER FOR THE H.264/AVC STANDARD

AN IMPROVED CONTEXT ADAPTIVE BINARY ARITHMETIC CODER FOR THE H.264/AVC STANDARD 4th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP AN IMPROVED CONTEXT ADAPTIVE BINARY ARITHMETIC CODER FOR THE H.264/AVC STANDARD Simone

More information