Motion Vector Prediction With Reference Frame Consideration


Alexis M. Tourapis *a, Feng Wu b, Shipeng Li b
a Thomson Corporate Research, 2 Independence Way, Princeton, NJ, USA 855
b Microsoft Research Asia, 3F Sigma Center, 49 Zhichun Road, Beijing, 18, China

ABSTRACT

In this paper, we introduce a new motion vector prediction method that can be used within multiple reference picture codecs, such as the H.264 (MPEG-4 AVC) video coding standard. For each candidate motion vector, our method considers the temporal distance of its corresponding reference picture from the current picture when generating the predictor motion vector. This allows for more accurate motion vector prediction and better exploitation of the temporal correlation that may exist within a video sequence. Furthermore, we introduce a modification to the SKIP macroblock mode, according to which not only the motion vectors but also the reference indices are adaptively generated. Simulation results suggest that our proposed methods, combined with an improved Rate Distortion Optimization strategy and implemented within the existing H.264 codec, can achieve a bitrate reduction of up to 8.6% compared to the current H.264 standard.

Keywords: H.264, Motion Vector Prediction, reference pictures/frames, skip mode, spatial prediction, temporal prediction

1. INTRODUCTION

The upcoming H.264 standard (also referred to as MPEG-4 AVC, JVT, and H.26L) achieves considerably higher coding efficiency than older standards such as MPEG-2 [2] and MPEG-4 [3], partly due to the adoption of more refined and complicated motion models and modes.
These include increased sub-pixel accuracy down to quarter (¼) pixel level, multiple referencing, the tree-structured macroblock [4], which is based on a quad-tree concept (Figure 1) according to which different sub-areas of a macroblock (MB) can be assigned different motion information, and multiple frame/picture indexing of the motion vectors (MVs). A macroblock can have up to 16 MVs, since the tree macroblock structure allows it to be coded in 4 different modes with partitions of size 16×16, 16×8, 8×16, and 8×8, while in the 8×8 partition mode each 8×8 partition can be further split into 8×8, 8×4, 4×8, and 4×4 blocks (Figure 1).

Considering the high cost of transmitting motion parameters, motion vectors are coded differentially against a motion vector predictor (MVP or $MV_{pred}$). This predictor exploits the fact that adjacent blocks/macroblocks and their motion tend to have very high spatial correlation. It is the median of the MVs of the three blocks adjacent to the current block on the left, top, and top-right (or top-left if the top-right is not available):

\[ MV_{pred} = \mathrm{Median}(MV_A, MV_B, MV_C) \tag{1} \]

where $MV_A$, $MV_B$, and $MV_C$ are the three adjacent predictors as shown in Figure 2. In this computation, $MV_C$ is replaced by $MV_D$ if it is not available (edge of a slice or picture).

* alexismt@ieee.org; phone +1-(69) ; fax +1-(69)
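As a minimal sketch of Equation (1), the median can be taken independently for the horizontal and vertical components, and the encoder then only needs to transmit the difference between the actual vector and the predictor. The function names and the sample vectors below are illustrative assumptions and are not taken from the reference software.

```python
from typing import Tuple

MV = Tuple[int, int]  # (horizontal, vertical) components, e.g. in quarter-pel units


def median3(a: int, b: int, c: int) -> int:
    """Median of three integers without sorting."""
    return a + b + c - min(a, b, c) - max(a, b, c)


def median_mvp(mv_a: MV, mv_b: MV, mv_c: MV) -> MV:
    """Component-wise median of the three spatial predictors (Equation 1)."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def mv_difference(mv: MV, mvp: MV) -> MV:
    """Differential motion vector that would actually be coded."""
    return (mv[0] - mvp[0], mv[1] - mvp[1])


# Hypothetical neighbors A (left), B (top), C (top-right).
mvp = median_mvp((4, -2), (6, 0), (-8, 1))   # (4, 0)
mvd = mv_difference((5, 0), mvp)             # (1, 0)
```

The closer the predictor is to the true vector, the smaller the difference and the fewer bits are spent on motion information, which is why the accuracy of the MVP matters for the rest of the paper.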

Macroblock partitions: 1 macroblock partition of 16*16 luma samples; 2 macroblock partitions of 16*8 luma samples; 2 macroblock partitions of 8*16 luma samples; 4 sub-macroblocks of 8*8 luma samples.
Sub-macroblock partitions: 1 sub-macroblock partition of 8*8 luma samples; 2 sub-macroblock partitions of 8*4 luma samples; 2 sub-macroblock partitions of 4*8 luma samples; 4 sub-macroblock partitions of 4*4 luma samples.
Figure 1: Macroblock and sub-macroblock partitions as defined in H.264.

However, H.264 allows the use of multiple reference frames (or pictures), which are usually pictures at different time instants and are very likely spatially unrelated. To improve the performance of this prediction, an additional consideration is therefore made based on the reference pictures of each spatial predictor and that of the current block. Instead of always using the median prediction, if only a single predictor uses the same reference index as the current block, then only this predictor is used for prediction, while all others are immediately discarded. This method, which we will call the single equal reference condition, somewhat strengthens the correlation between adjacent motion vectors and their references, but does not account for the cases where two, or none, of the reference indices are the same as that of the current block. In these cases the median method is still used, which could reduce prediction efficiency.

[Figure 2 layout: neighbors A (left), B (top), C (top-right), and D (top-left) around the current macroblock.]
Figure 2: Spatial predictors used for Motion Vector Prediction and the generation of the SKIP mode MV parameters

Furthermore, the H.264 standard benefits considerably from the adoption of the MotionCopy SKIP macroblock mode, which is used within Predictive (P) pictures. The efficacy of this mode is nevertheless strongly tied to that of the motion vector prediction.
More precisely, this mode signals that the motion vectors and reference index of an entire MB can be completely derived from the location of the MB within a slice or picture, and from the motion information of its spatial neighbors. To be more exact, if the MB is not on the edge columns or rows of a picture or slice, and each of its top and left spatial predictors either has a non-zero motion vector or does not use the zero reference index, then the zero reference and the MVP are used as the actual motion information of this MB. Otherwise the motion vectors for this MB are set equal to zero. This process is called zero partitioning of the SKIP mode. Evidently, if the generation of the MVP is not accurate enough, the efficacy of SKIP mode will also be affected. Note also that the zero reference is always used for SKIP mode.

In this paper we present an alternative method for generating the MVP that considers the temporal distances of the reference pictures and scales the motion vectors accordingly. An additional, very simple process for selecting the reference index within SKIP mode, instead of always using zero, is also introduced. Our methods can lead to further improvement in the motion vector prediction process within the H.264 standard, and thus better coding efficiency. In Section 2 we first introduce the details of our proposed modifications. Experimental results are then given in Section 3, followed by our conclusions in Section 4.

2. REFERENCE CONSIDERATION WITHIN THE MOTION VECTOR PREDICTION AND SKIP

As previously discussed, motion vector prediction within the H.264 standard considers the spatial correlation that may exist between adjacent blocks or macroblocks, in an effort to reduce the cost of the motion vectors even further. On the other hand, temporal correlation could also be of use and could allow further benefits. In particular, the temporal direct mode, currently used within Bi-predictive (B) pictures, assumes that there exists a temporal relationship between co-located blocks and allows the prediction of the motion vectors for this mode using simple scaling operations. More specifically, following the assumption that an object is moving with constant speed, the motion vector of a co-located block is scaled according to the temporal distances (Figure 3) of the reference pictures involved, to generate two new motion vectors $MV_{L0}$ and $MV_{L1}$ that will be used for the prediction. Given the co-located block's motion vector $MV$, these motion vectors are calculated as follows:

\[ DistScaleFactor = (TD_B \times 256) / TD_D \tag{2} \]
\[ MV_{L0} = (DistScaleFactor \times MV + 128) \gg 8 \tag{3} \]
\[ MV_{L1} = MV_{L0} - MV \tag{4} \]

which are approximations of:

\[ MV_{L0} = \frac{TD_B}{TD_D} \times MV \tag{5} \]
\[ MV_{L1} = \frac{TD_B - TD_D}{TD_D} \times MV \tag{6} \]

but reduce the number of divisions, since the variable DistScaleFactor can be precomputed at the slice/picture level. In the above, $TD_B$ and $TD_D$ are the temporal distances of the reference pictures used for the prediction compared to the current picture.
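The fixed-point scaling of Equations (2) to (4) can be sketched as follows. This is only an illustration under simplifying assumptions: temporal distances are positive integers, no clipping is applied, and the function names and the sample vector are hypothetical.

```python
from typing import Tuple

MV = Tuple[int, int]


def dist_scale_factor(td_b: int, td_d: int) -> int:
    """Equation (2): 8-bit fixed-point ratio TD_B / TD_D (one division per reference)."""
    return (td_b * 256) // td_d


def temporal_direct(mv: MV, td_b: int, td_d: int) -> Tuple[MV, MV]:
    """Equations (3) and (4): derive MV_L0 and MV_L1 from the co-located MV."""
    dsf = dist_scale_factor(td_b, td_d)
    mv_l0 = ((dsf * mv[0] + 128) >> 8, (dsf * mv[1] + 128) >> 8)
    mv_l1 = (mv_l0[0] - mv[0], mv_l0[1] - mv[1])
    return mv_l0, mv_l1


# Co-located MV spanning 3 pictures, current B picture 1 picture after the list 0 reference.
mv_l0, mv_l1 = temporal_direct((9, -6), td_b=1, td_d=3)
# mv_l0 == (3, -2) and mv_l1 == (-6, 4), matching the exact ratios of Equations (5) and (6).
```

Only the multiplication, addition, and shift remain per motion vector, since the scale factor depends solely on the picture distances.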

[Figure 3 layout: List 0 reference, current B picture, and List 1 reference along the time axis, showing the co-located block's MV, the derived $MV_{L0}$ and $MV_{L1}$ of the current block, and the distances $TD_D$ and $TD_B$.]
Figure 3: Temporal Direct Prediction in B picture coding

We observe that a similar scaling approach could be beneficial within the motion vector prediction process as well. As previously discussed, the reference indices of the adjacent neighbors are considered only under certain conditions, and are used indirectly as a decision mechanism on whether the median prediction will be used or not. It is, nevertheless, possible to use these reference indices, and more specifically the temporal distances of these references from the current picture, with a more direct impact on the motion vector prediction. Similar to temporal direct, we propose scaling the MVs of each predictor according to these temporal distances. More specifically, the MVP is now calculated as

\[ MV_{pred} = TD_{ref} \times \mathrm{Median}\left( \frac{MV_A}{TD_A}, \frac{MV_B}{TD_B}, \frac{MV_C}{TD_C} \right) \tag{7} \]

where $MV_A$, $MV_B$, and $MV_C$ are the three predictor MVs, $TD_A$, $TD_B$, and $TD_C$ are their corresponding temporal distances, and $TD_{ref}$ is the temporal distance of the current reference. The divisions can easily be replaced with binary shifts without any loss in efficiency, using the following equations:

\[ Z_A = (TD_{ref} \times 256) / TD_A \tag{8} \]
\[ Z_B = (TD_{ref} \times 256) / TD_B \tag{9} \]
\[ Z_C = (TD_{ref} \times 256) / TD_C \tag{10} \]
\[ MV_{pred} = \mathrm{Median}\big( (Z_A \times MV_A + 128) \gg 8,\; (Z_B \times MV_B + 128) \gg 8,\; (Z_C \times MV_C + 128) \gg 8 \big) \tag{11} \]

$Z_A$, $Z_B$, and $Z_C$ can be pre-calculated at the picture/slice level, so the increase in complexity compared to the original method without the divisions/scaling is very minor. The basis of this concept is very similar to that of temporal direct, since we are again assuming that motion across adjacent pictures follows the constant speed rule (Figure 4).
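A minimal sketch of Equations (8) to (11) is given below, together with the single equal reference condition recalled above. How the two are combined here (the condition is tried first, the scaled median is the fallback) is an assumption made for illustration, as are the data layout, the function names, and the sample values. The sketch also assumes that all three neighbors are available and inter coded.

```python
from typing import Dict, List, Optional, Tuple

MV = Tuple[int, int]
Predictor = Tuple[MV, int]  # (motion vector, reference index) of a neighbor


def median3(a: int, b: int, c: int) -> int:
    return a + b + c - min(a, b, c) - max(a, b, c)


def scale_factor(td_ref: int, td_pred: int) -> int:
    """Equations (8)-(10): Z = (TD_ref * 256) / TD_pred, precomputable per slice."""
    return (td_ref * 256) // td_pred


def scaled_median_mvp(preds: List[Predictor], cur_ref: int, td: Dict[int, int]) -> MV:
    """Equation (11): scale each neighbor MV to the current reference, then take the median."""
    scaled = []
    for (mvx, mvy), ref in preds:
        z = scale_factor(td[cur_ref], td[ref])
        scaled.append(((z * mvx + 128) >> 8, (z * mvy + 128) >> 8))
    xs, ys = zip(*scaled)
    return (median3(*xs), median3(*ys))


def single_equal_reference(preds: List[Predictor], cur_ref: int) -> Optional[MV]:
    """Return the only neighbor MV that uses cur_ref, or None if zero or several do."""
    matching = [mv for mv, ref in preds if ref == cur_ref]
    return matching[0] if len(matching) == 1 else None


def predict_mv(preds: List[Predictor], cur_ref: int, td: Dict[int, int]) -> MV:
    only = single_equal_reference(preds, cur_ref)
    return only if only is not None else scaled_median_mvp(preds, cur_ref, td)


# Hypothetical temporal distance of each reference index from the current picture.
td = {0: 1, 1: 2, 2: 3}
preds = [((8, 4), 1), ((6, 2), 1), ((12, 0), 2)]  # neighbors A, B, C

mvp_ref0 = predict_mv(preds, 0, td)  # no neighbor uses reference 0: scaled median, (4, 1)
mvp_ref2 = predict_mv(preds, 2, td)  # only C uses reference 2: its MV is used, (12, 0)
```

In the first call an unscaled median would have returned (8, 2), roughly twice the displacement that constant-speed motion implies for the nearest reference.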

[Figure 4 layout: references Ref 1 and Ref 0 and the current picture along the time axis, showing the predictor block B with $MV_B$, the scaled predictors $MV_{P,0}$ and $MV_{P,1}$ of the current block, and the temporal distances involved.]
Figure 4: Predictors are generated according to their corresponding reference picture distances compared to the current reference picture

Although it could be argued that introducing the above scaling method would allow us to remove the single equal reference condition and simplify the prediction process, we have found experimentally that this rule is still advantageous, especially considering that the reference pictures involved may not always be temporally correlated. In this sense the condition helps because only the most related picture (which in this case is the same as the current reference) is considered, while all other pictures would instead introduce motion noise and could hurt the prediction.

Reference indices can also be used within SKIP mode. It is well known that the SKIP macroblock mode is probably the most efficient mode within Predictive (P) pictures in H.264. As previously discussed, this mode does not require the transmission of any residual data, and tries to further exploit the spatial correlation between the motion of adjacent MBs by signaling, under certain conditions, that the current MB has an MV equal to the MVP of the 16×16 macroblock type or to zero. A major drawback of this mode, though, is that SKIP always uses the zero reference and does not consider that other references might be more beneficial. As an example, this mode does not consider the case where none of the neighboring predictors uses the zero reference. In this case the median prediction would still be used without making any other considerations.
Although the scaling process discussed previously can indirectly improve the performance of the SKIP mode (in the previous example, the predictors would all be scaled towards the zero reference), a different method can also be used to improve the prediction. In particular, similar to the spatial direct mode also used in B pictures, we observe that we may also predict the reference index from the reference indices of the adjacent macroblocks that are already used within the MVP process. Instead of always using the zero reference, the smallest non-negative reference (which usually implies the closest in time) among the three adjacent predictors is selected and used for the prediction process. If no such reference is available (e.g. at the beginning of a slice, or when all adjacent blocks are intra coded), then zero is used by default. This method strengthens the relationship between adjacent pictures and the current one. It can also be particularly useful because the H.264 standard allows reordering of the references, which implies that the zero reference may not itself have the highest correlation with the current picture.
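The reference selection described above can be sketched in a few lines. The sketch assumes, as the pseudocode in this section does, that an unavailable reference is stored as -1, so that masking with 255 maps it to 255 and a plain minimum picks the smallest non-negative index. The function name is hypothetical.

```python
def skip_reference(ref_left: int, ref_up: int, ref_upright: int) -> int:
    """Smallest non-negative reference index among the three neighbors, or 0 if none."""
    # -1 & 255 == 255, so unavailable neighbors never win the minimum
    # (assuming fewer than 255 reference indices are in use).
    smallest = min(ref_left & 255, ref_up & 255, ref_upright & 255)
    return 0 if smallest == 255 else smallest


nearest = skip_reference(2, -1, 1)      # 1: the top neighbor is unavailable and ignored
fallback = skip_reference(-1, -1, -1)   # 0: no neighbor available, default to zero
```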

To summarize, the pseudocode for this scheme is as follows:

    SKIP_MV_Calculation()
    {
        // Note that UpRight is replaced by UpLeft at picture boundaries.
        // If a reference is not available, its value is equal to -1,
        // which becomes 255 after masking and is never selected by min().
        Skip_reference = min(reference_fw_Left & 255,
                             reference_fw_Up & 255,
                             reference_fw_UpRight & 255);
        if (Skip_reference != 255)
        {
            if ((Skip_reference == 0) && (reference_colocated == 0) &&
                ((abs(mvpx) >> 1) == 0) && ((abs(mvpy) >> 1) == 0))
            {
                Skip_MV = 0;
                Reference_Skip = 0;
            }
            else
            {
                Skip_MV = SpatialPredictor(16x16, FW, Skip_reference);
                Reference_Skip = Skip_reference;
            }
        }
        else
        {
            // No neighboring reference is available.
            Skip_MV = 0;
            Reference_Skip = 0;
        }
    }

Apart from these two semantic modifications to the H.264 codec, we introduce an additional modification within the mode decision of H.264 to further enhance performance. H.264 is based on a Rate Distortion Optimization (RDO) model using Lagrangian (λ) parameters, since such methods lead to considerably higher performance than simpler rate-only or distortion-only methods. Mode decision in H.264 is performed by minimizing:

\[ J_{mode}(s, c, MODE \mid \lambda_{mode}) = SSD(s, c, MODE) + \lambda_{mode} \cdot R(s, c, MODE) \tag{12} \]

where SSD denotes the Sum of Square Differences between the original and reconstructed signals, MODE indicates a mode out of the set of potential macroblock modes, more specifically SKIP, 16×16, 16×8, 8×16, Tree8×8, Intra4×4, and Intra16×16, $\lambda_{mode}$ is the Lagrangian multiplier, which is quantizer dependent, and R(s, c, MODE) is the number of bits associated with choosing MODE, including the bits for the macroblock header, the motion, and all DCT coefficients. As previously pointed out, SKIP mode itself can be considered a special case of the 16×16 mode for which no motion and no DCT coefficients need to be transmitted, which is basically a coefficient thresholding concept. Thresholding may be used for all other macroblock modes as well.
In our case, we consider within the mode decision, in addition to the original modes, all INTER modes without coefficients (forced Coded Block Pattern equal to 0). This means that we now have to examine 4 additional modes, that is, 11 modes instead of 7. More specifically, we now have to examine modes SKIP, 16×16, 16×16nocoeff, 16×8, 16×8nocoeff, 8×16, 8×16nocoeff, Tree8×8, Tree8×8nocoeff, Intra4×4, and Intra16×16.
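The extended mode decision can be illustrated with a small sketch of Equation (12) applied to the enlarged mode set. The distortion and rate figures below are invented purely for illustration, and the function names are hypothetical. In a real encoder each candidate would be encoded and reconstructed to obtain its SSD and bit count.

```python
from typing import Dict, Tuple

# The 11 candidate modes listed above; "nocoeff" variants force Coded Block Pattern 0.
MODES = ["SKIP", "16x16", "16x16nocoeff", "16x8", "16x8nocoeff", "8x16",
         "8x16nocoeff", "Tree8x8", "Tree8x8nocoeff", "Intra4x4", "Intra16x16"]


def rd_cost(ssd: float, rate_bits: float, lam: float) -> float:
    """Equation (12): J = SSD + lambda * R."""
    return ssd + lam * rate_bits


def choose_mode(measurements: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Pick the mode with the lowest Lagrangian cost among the measured candidates."""
    return min(measurements,
               key=lambda m: rd_cost(measurements[m][0], measurements[m][1], lam))


# Hypothetical (SSD, bits) measurements for three of the candidate modes.
measurements = {
    "SKIP": (2500.0, 1),
    "16x16": (1000.0, 60),
    "16x16nocoeff": (1400.0, 12),
}

best_small_lambda = choose_mode(measurements, 1.0)    # "16x16"
best_large_lambda = choose_mode(measurements, 10.0)   # "16x16nocoeff"
```

With a larger multiplier, which corresponds to coarser quantization, the coefficient-free variant wins because its modest increase in distortion is outweighed by the bits it saves.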

3. SIMULATION RESULTS

All of the above concepts were implemented within version 4.3a of the H.264 reference software [9]. For our simulations we selected 5 sequences, namely the QCIF resolution sequences Container and News coded at 10 fps, and the CIF sequences Mobile, Bus, and Stefan at 30 fps. The CAVLC entropy coder was used for all our tests, with quantizer values of 28, 32, 36, and 40, a search range of ±32, and 5 references. Rate Distortion Optimization was enabled in our simulations. To simplify our comparisons we use average PSNR gain (δpsnr) and bitrate reduction (δbitrate) results based on the above quantizers, as recommended by [10]. This comparison method was also required of all proponents to the H.264 standard, since it allows for a quantitative RD performance estimate of a proposed algorithm.

We observe that our proposed methods lead to a bitrate reduction of -1.43%, -1.41%, -8.65%, -3.5%, and -4.11% for sequences Container, News, Mobile, Bus, and Stefan respectively (Table 1). For each respective sequence this equivalently corresponds to a gain of .75 dB, .85 dB, .413 dB, .168 dB, and .214 dB. The Rate Distortion curves for sequences Container and Mobile are shown in Figures 5 and 6 respectively. We particularly observe the considerable improvement on the 3 CIF sequences, and more specifically on sequence Mobile. The result on this sequence is somewhat expected, considering that it is well known to benefit considerably from the use of multiple references and is characterized by relatively smooth and constant motion. Considering also the RD curves, we further observe that the gains are more prominent at higher bitrates. This is to be expected, since the RDO mode decision tends to be more biased towards lower bitrate as the quantization parameter increases, which also results in fewer non-zero reference indices.
Naturally, our modifications to the motion vector prediction process and the reference index prediction used within SKIP have no impact if a single reference is used.

4. CONCLUSION

In this paper, two semantic changes were proposed for use within the H.264 standard, or other multiple reference codecs, that can improve performance when multiple references are used. More specifically, we have introduced an alternative motion vector prediction method that considers the reference indices and the associated temporal distances of the spatial neighbors within the motion vector prediction process, while a reference picture selection process is proposed for the generation of the SKIP macroblock mode parameters. These methods allow for more accurate motion vector prediction and better exploitation of temporal correlation within a multiple reference motion compensated framework. Combined with a minor modification in the Rate Distortion Optimized Mode Decision of the H.264 codec, our simulation results show that we can achieve considerable improvement compared to the existing H.264 standard.

REFERENCES

1. Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Joint Video Specification (ITU-T Rec. H.264 ISO/IEC AVC) - Joint Committee Draft", document JVT-E22d3.doc, Sep'2.
2. ISO/IEC Standard :2, "Information technology - Generic coding of moving pictures and associated audio information: Video".
3. ISO/IEC Standard :21, "Information technology - Coding of audio-visual objects - Part 2: Visual".
4. Heiko Schwarz and Thomas Wiegand, "Tree-structured macroblock partition", document VCEG-O17, 15th VCEG meeting, Pattaya, Dec.
5. Hideaki Kimata, "GMVC and GMC Switched by MV", document JVT-B46, 2nd JVT meeting, Geneva, Jan.
6. Shijun Sun and Shawmin Lei, "Global Motion Vector Coding (GMVC)", document JVT-B19, 2nd JVT meeting, Geneva, Jan.
7. Jani Lainema and Marta Karczewicz, "Skip mode motion compensation", document JVT-C27, 3rd JVT meeting, Fairfax, May.
8. A. M. Tourapis, H. Y. Cheong, M. L. Liou, and O. C. Au, "Temporal Interpolation of Video Sequences Using Zonal Based Algorithms", in Proceedings of the 2001 IEEE International Conference on Image Processing (ICIP'01), WP8-5252, Thessaloniki, Greece, October 2001.
9. JVT Reference Software, unofficial version 4.3a.
10. G. Bjontegaard, "Calculation of average PSNR differences between RD-Curves", document VCEG-M33, 13th VCEG meeting, Austin TX, Mar 1.

Table 1: Performance Evaluation of the Proposed Scheme (values as quoted in Section 3)

Sequence        Container   News    Mobile   Bus     Stefan
δbitrate (%)    -1.43       -1.41   -8.65    -3.5    -4.11
δpsnr (dB)      .75         .85     .413     .168    .214

[Plot: PSNR (dB) versus bitrate (kbps) for the Original and Proposed schemes, sequence Container, QCIF.]
Figure 5: RD performance for sequence Container at 10 fps

[Plot: PSNR (dB) versus bitrate (kbps) for the Original and Proposed schemes, sequence Mobile, CIF.]
Figure 6: RD performance for sequence Mobile at 30 fps
