VIDEO CODING USING A SELF-ADAPTIVE REDUNDANT DICTIONARY CONSISTING OF SPATIAL AND TEMPORAL PREDICTION CANDIDATES

Author 1 and Author 2
Address - Line 1
Address - Line 2
Address - Line 3

ABSTRACT

All standard video coders are based on the prediction-plus-transform representation of an image block, which predicts the current block using various intra- and inter-prediction modes and then represents the prediction error using a fixed orthonormal transform. We propose to directly represent a mean-removed block using a redundant dictionary consisting of all possible inter-prediction candidates with integer motion vectors (mean-removed) and the basis vectors of an orthogonal basis (e.g. the DCT). We determine the coefficients by minimizing the L1 norm of the coefficients subject to a constraint on the approximation error. We show that using such a self-adaptive dictionary can lead to a very sparse representation, with significantly fewer non-zero coefficients than using the DCT transform on the prediction error. We further propose to orthonormalize the chosen atoms using a modified Gram-Schmidt process, and to quantize the coefficients associated with the resulting orthonormalized basis vectors. Each image block is represented by its mean, which is predictively coded, the indices of the chosen atoms, and the quantized coefficients. Each variable is coded based on its unconditional distribution. Simulation results show that the proposed coder can achieve significant gain over the H.264 coder (x264).

1. INTRODUCTION

Recent progress in sparse representation has shown that signal representation using a redundant dictionary can be more efficient than using an orthonormal transform, because the redundant dictionary can be designed so that a typical signal can be approximated well by a sparse set of dictionary atoms [1]. Instead of using a fixed dictionary learned from training image blocks, we propose to represent each image block in a video frame using a self-adaptive dictionary consisting of all possible spatial and temporal prediction candidate blocks following a preset prediction rule. For example, it may include all inter-prediction candidates, which are shifted blocks of the same size in the previous frame within a defined search range, and all possible intra-prediction candidates, which are obtained with the various intra-prediction modes of the H.264/HEVC encoder. The rationale for using such prediction candidates as the dictionary atoms is that the current block is likely to be very similar to a few of these candidates, and hence only a few candidates may be needed to represent the current block accurately. To handle blocks that cannot be represented efficiently by the prediction candidates, we also incorporate some pre-designed fixed atoms in the redundant dictionary. Essentially, these fixed atoms describe the residual error left by the chosen prediction candidates; they also serve to mitigate the accumulation of reconstruction errors in previously decoded frames. Currently, we simply use the DCT basis vectors for the fixed part, considering that current video coders all use a DCT (or DCT-like) basis to specify the prediction error. Optimal design of this fixed part is subject to further study. We determine the sparse set of dictionary atoms and their associated coefficients by minimizing the L1 norm of the coefficients subject to a constraint on the approximation error.
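To make the dictionary construction concrete, here is a minimal sketch of the adaptive-plus-fixed dictionary for one 16x16 block, assuming NumPy; the function names, the omission of frame-boundary checks, and the exact DCT normalization are our illustrative choices, not details given in the paper:

```python
import numpy as np

def dct2_atoms(B):
    """Orthonormal 2-D DCT basis vectors for a BxB block, excluding the DC atom."""
    k, n = np.meshgrid(np.arange(B), np.arange(B), indexing="ij")
    D = np.sqrt(2.0 / B) * np.cos(np.pi * (2 * n + 1) * k / (2 * B))
    D[0, :] = np.sqrt(1.0 / B)                        # first row: constant vector
    atoms = [np.outer(D[i], D[j]).ravel() for i in range(B) for j in range(B)]
    return np.stack(atoms[1:], axis=1)                # drop DC: shape (B*B, B*B - 1)

def build_dictionary(prev_frame, y0, x0, B=16, search=24):
    """All integer-shift inter-prediction candidates (mean-removed, normalized)
    plus the fixed DCT atoms. Frame-boundary checks are omitted for brevity."""
    cands = []
    for dy in range(-search, search):                 # 48 x 48 = 2304 candidates
        for dx in range(-search, search):
            blk = prev_frame[y0 + dy:y0 + dy + B, x0 + dx:x0 + dx + B].astype(float)
            v = blk.ravel() - blk.mean()              # mean removal
            nrm = np.linalg.norm(v)
            cands.append(v / nrm if nrm > 0 else v)   # normalization
    return np.concatenate([np.stack(cands, axis=1), dct2_atoms(B)], axis=1)
```

With a block size of 16x16 and a search range of 24 with integer shifts, this reproduces the atom counts used in Sec. 5: 2304 candidates plus 255 DCT atoms, i.e. N = 2559.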
In all currently prevalent block-based video coding standards [2], a single best prediction candidate is chosen among all prediction candidates to predict the current block, and the prediction error block is then represented with a fixed orthogonal transform (e.g. the Discrete Cosine Transform, or DCT). This method essentially represents the current block by a slightly redundant dictionary consisting of a fixed set of atoms, the basis elements of the orthonormal transform, plus the best matching candidate; furthermore, the coefficient corresponding to the best matching candidate is constrained to be 1. When a fractional-pel motion vector or multiple reference frames are used, the coder uses, instead of a single best prediction candidate, a linear combination of a few candidates, with preset constraints on the possible combinations of candidates and their weights. It is natural to wonder: if we did not impose such constraints, would we be able to represent the prediction error with fewer DCT basis vectors? The proposed representation allows any weighted combination of the prediction candidates, and hence covers the above prediction-plus-transform approach as a special case. We have found that using the proposed self-adaptive dictionary can lead to a very sparse representation, with significantly fewer non-zero coefficients than using the DCT on the error between the original block and the best prediction candidate.
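To make the contrast explicit, the two representations can be written side by side (an editorial paraphrase of the paragraph above: F denotes the block, A_{m*} the best matching candidate, phi_k the transform basis vectors, and A_n the atoms of the proposed dictionary, following the notation of Sec. 2):

```latex
% prediction + transform: weight on the best candidate A_{m^*} fixed to 1
F \;\approx\; A_{m^{*}} + \sum_{k} c_k\,\varphi_k
\qquad \text{vs.} \qquad
% proposed: arbitrary weights on all candidates and fixed atoms
F \;\approx\; \sum_{n=1}^{N} w_n A_n
```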

Several research groups have attempted using redundant dictionaries for block-based image and video coding, including [3-7]. In all reported dictionary-based video coders, the dictionary atoms are used to represent the motion-compensation error block for interframe video coding; they are therefore very different from what is proposed here. Instead of a single dictionary, [6] uses multiple dictionaries, pre-designed for different residual energy levels. The work in [7] codes each frame in intra mode, with a dictionary that is updated in real time based on the previously coded frames. Although such online adaptation can yield a dictionary that matches the video content very well, it is computationally very demanding. The proposed framework uses a self-adaptive dictionary that depends only on the block location, without requiring real-time design or redesign of the dictionary.

A major challenge in applying sparse representation to compression is that the dictionary atoms are generally not orthogonal, so quantizing the associated coefficients directly and independently is not efficient. First, the quantization errors of the coefficients are related to the errors in the reconstructed samples in a complicated way. Second, these coefficients are likely to be highly correlated. To the best of our knowledge, none of the dictionary-based video coders have produced compression performance better than the H.264 and HEVC standards, and we believe one reason is that they quantize the sparse coefficients associated with the chosen atoms directly. We propose to represent the subspace spanned by the chosen atoms by a set of orthonormal vectors. The coefficients corresponding to these orthonormal vectors are much less correlated and can be quantized and coded independently without losing coding efficiency. We find the orthonormal vectors and their quantized coefficients jointly through a modified Gram-Schmidt orthogonalization process with embedded quantization. The encoder only specifies which atoms are chosen (a subset of the originally chosen atoms) and the quantized coefficients corresponding to the orthonormal vectors; the decoder can perform the same orthonormalization process on the chosen atoms to derive the orthonormal vectors used at the encoder. We note that this method of orthonormalizing the original dictionary atoms and performing quantization and coding in the orthonormalized subspace representation is applicable to any dictionary-based coding method.

In the remainder of this paper, we describe the specific algorithms used for the different parts of the proposed coder in Sec. 2-4, show simulation results in Sec. 5, and conclude in Sec. 6.

2. SPARSE REPRESENTATION USING SPATIAL-TEMPORAL PREDICTION CANDIDATES

Instead of representing the original block using the prediction candidates directly, we perform mean subtraction on the original block, and mean subtraction and normalization on the candidates. We use N to denote the total number of atoms, which includes all prediction candidates and a predesigned set of atoms (in our current implementation, all 2-D DCT basis vectors except the all-constant one). We denote the mean-removed block by F and the dictionary atoms by A_n, n = 1, 2, ..., N. Note that F and A_n are vector representations of 2-D blocks, each of dimension M, where M is the number of pixels in a block. Generally, M < N, so that the atoms form a redundant dictionary. To derive the sparse representation of F using A_n with coefficients w_n, we solve the following constrained optimization problem:

\min_{w} \sum_{n} |w_n| \quad \text{subject to} \quad \frac{1}{M} \Big\| \sum_{n} w_n A_n - F \Big\|_2^2 \le \epsilon_1^2 \qquad (1)

This is a classical sparse coding (LASSO) problem, and various methods exist to solve it. We use the least angle regression (LARS) method [8], with the MATLAB code provided at [9]. This algorithm is chosen because it converges quickly and handles the constrained formulation directly, so that we can control the target representation error ε1. Note that the final reconstruction error is the sum of the sparse representation error and the error due to quantization of the coefficients, assuming the two types of error are independent; therefore, ε1 should be proportionally smaller than the target final reconstruction error. Given a target reconstruction error, how to optimally allocate between the sparse representation error and the quantization error remains an open research problem. In our current implementation, we set the approximation error ε1 to half of the target reconstruction error; better rate-distortion performance is expected if this allocation is optimized.
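In code, the error-constrained LASSO step might look as follows; this is a sketch only, substituting scikit-learn's lars_path for the MATLAB LARS implementation of [9] that the paper actually uses, and the stopping rule is our reading of the constraint in (1):

```python
import numpy as np
from sklearn.linear_model import lars_path

def sparse_code(A, f, eps1):
    """Return the sparsest LASSO path solution with (1/M)*||A w - f||^2 <= eps1^2."""
    M = f.shape[0]
    # lars_path traces the whole LASSO regularization path; coefs has one
    # column per breakpoint, from the sparsest to the densest solution.
    _, _, coefs = lars_path(A, f, method="lasso")
    for j in range(coefs.shape[1]):
        w = coefs[:, j]
        if np.sum((A @ w - f) ** 2) / M <= eps1 ** 2:
            return w                      # first (sparsest) solution within budget
    return coefs[:, -1]                   # fall back to the densest solution

# Chosen atoms: indices whose |w_n| exceeds the numerical threshold eps2, e.g.
# w = sparse_code(A, F, eps1); chosen = np.flatnonzero(np.abs(w) > 1e-6)
```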
[Fig. 1. Spiral order of 2-D candidate displacements, starting from the center (0, 0); the labelled neighbors are (0, -1), (-1, 0), (1, 0), and (0, 1).]

After solving the sparse representation problem, we have a set of L chosen candidates, namely those whose coefficients have magnitude larger than a small threshold ε2 (to avoid numerical error). We use m(l) to denote the index of the l-th chosen atom, and B_l = A_m(l) the actual atom, with l = 1, 2, ..., L.

3. ORTHONORMALIZATION AND QUANTIZATION

A straightforward way to find a set of orthonormalized vectors C_l from the chosen atoms B_l is to apply the well-known Gram-Schmidt orthogonalization algorithm to the chosen atoms sequentially, using

\tilde{C} = B_l - \sum_{i=1}^{l-1} (B_l, C_i)\, C_i, \qquad C_l = \tilde{C} / \|\tilde{C}\|_2 \qquad (2)

where (B, C) denotes the inner product of B and C, and ||C||_2 denotes the 2-norm of C. The coefficients corresponding to the orthonormal vectors are then found easily by inner products, i.e., t_k = (F, C_k). In our current implementation, we apply uniform quantization to each coefficient t_k with the same stepsize q, and denote the quantized value by t̂_k.

A problem with this approach is that the coefficients corresponding to some of the resulting orthonormal vectors may be zero after quantization. Ideally, we want to keep only those vectors (and their corresponding atoms) that have non-zero quantized coefficients. In addition, we would like the resulting orthonormal vectors to have coefficients that are decreasing in magnitude with high likelihood. Towards these goals, we first order the chosen candidates B_l so that their coefficients are decreasing in magnitude, and then perform orthonormalization and quantization jointly, using a Gram-Schmidt-like orthogonalization procedure in which vectors (and their corresponding atoms) with zero quantized coefficients are thrown away. Basically, if a newly obtained orthonormalized vector has a coefficient that is quantized to zero, we remove this vector and the original atom used to derive it, move to the next atom, and orthonormalize that atom with respect to all previously derived orthonormal vectors. At the end of this process, we have K (K ≤ L) orthonormalized vectors C(k), which correspond to original atoms with indices n(k), and quantized coefficients t̂(k) with quantization indices t(k). Note that the C(k) are exactly the orthonormal vectors that the original Gram-Schmidt algorithm would produce from the candidates A(n(k)), k = 1, 2, ..., K. Therefore, upon receiving the indices n(k), the decoder can deduce C(k) by applying the original Gram-Schmidt algorithm to the atoms with those indices.

The above algorithm can be iterated several times to further reduce the number of remaining atoms. At the end of each iteration, if the number of chosen atoms is smaller than in the last iteration, all remaining atoms are reordered based on the magnitudes of the coefficients associated with their corresponding orthonormal vectors, and the same algorithm is applied to the reordered set. The iteration continues until no more zero coefficients are identified in the last pass. We have found that two passes are sufficient for most image blocks.
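A single pass of this joint orthonormalization and quantization can be sketched as follows (function and variable names are ours; the round-to-nearest uniform quantizer matches the stepsize-q description above, but its exact form is an assumption):

```python
import numpy as np

def orthonormalize_and_quantize(F, atoms, q):
    """One pass of Gram-Schmidt with embedded quantization.

    atoms: chosen atoms B_l, pre-sorted by decreasing coefficient magnitude.
    Returns kept atom positions, orthonormal vectors C_k, quantized coeffs."""
    kept, C, t_hat = [], [], []
    for l, b in enumerate(atoms):
        c = b - sum(np.dot(b, ci) * ci for ci in C)  # orthogonalize vs. kept C_i
        nrm = np.linalg.norm(c)
        if nrm < 1e-12:
            continue                                 # atom already in the kept span
        c = c / nrm
        t = np.dot(F, c)                             # coefficient t_k = (F, C_k)
        idx = int(np.round(t / q))                   # uniform quantization, stepsize q
        if idx == 0:
            continue                                 # coefficient quantizes to 0: drop atom
        kept.append(l)
        C.append(c)
        t_hat.append(idx * q)
    return kept, C, t_hat
```

A second pass would re-sort the kept atoms by quantized-coefficient magnitude and repeat; the decoder, given only the kept atom indices, reproduces the C_k by running plain Gram-Schmidt on those atoms.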

4. ENTROPY CODING FOR CHOSEN ATOM INDICES AND QUANTIZED COEFFICIENTS

[Fig. 2. Representation of a sample block (from the frame shown in Fig. 6) using different methods; panels also show the original block, the reconstructed block (MSE = 2.287), and the best inter-prediction candidate. Top: the atoms chosen by the proposed representation, with coefficient magnitudes [24, 8, 6, 2, 2, 2]; Middle: the orthonormalized vectors obtained from the chosen atoms, with coefficient magnitudes [244.7, 71.8, 62.4, 14.5, 11.7, 25.7]; Bottom: the best matching block and the DCT basis images used to represent the prediction error, with coefficient magnitudes [9, 36, 36, 18, 9, 18, 18, 36, 36, 54, 18, 18, 18, 18, 18, 18].]

For each block, we first code the quantized mean value of the block, then the indices of the chosen atoms in the same order used for producing the final orthonormal vectors, and finally the quantized coefficients corresponding to the orthonormalized vectors. For the block mean, we perform predictive coding: we predict the mean value of the current block from the co-located block in the previous frame and quantize the prediction error. We collect the probability distribution of the quantized mean prediction error from training images, and use the entropy of this distribution to estimate the bits needed for coding the mean value. We include a special EOB symbol among the possible symbols to indicate that the quantized prediction error is zero and no other non-zero coefficients are needed; this is the case when a block can be represented, up to the target coding distortion, by a constant block with the predicted mean value.

For specifying which atoms are chosen, we arrange all atoms in a pre-defined order and code the indices of the chosen atoms successively. Specifically, we put all possible inter-prediction candidates in a 2-D array based on their displacement vectors with respect to the current block position, and then convert them to a 1-D array using a clockwise spiral path starting from the center. For example, the first (n = 1), second (n = 2), and third (n = 3) candidates in the spiral path are those with displacements (0, 0), (-1, 0), and (1, 0), respectively, as illustrated in Fig. 1. We attach all DCT basis vectors at the end, following the well-known zigzag order. In our current implementation, we do not use intra-prediction candidates, because our experiments showed that these candidates are seldom chosen.

Regarding the coding of the chosen atom indices, our experiments show very little correlation between the positions of chosen candidates within the same block or across adjacent blocks. However, the probability distribution of the index of the first chosen atom is quite different from that of the second chosen atom, which in turn differs from that of the third, and so on. The first few chosen atoms are more likely to be prediction candidates associated with small motion vectors, whereas the remaining atoms are more randomly distributed. Based on this observation, we code the index of the k-th chosen atom using its own probability distribution, and use the entropy of each distribution to estimate the bit rate needed to code each index.
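The spiral ordering can be made concrete with a small sketch; the exact tie-breaking within each ring of the clockwise spiral is our assumption, chosen to be consistent with Fig. 1:

```python
import numpy as np

def spiral_order(search=24):
    """Map 2-D displacements to a 1-D atom index: center first, then ring by
    ring, clockwise within each ring (tie-breaking convention assumed)."""
    disps = [(dy, dx) for dy in range(-search, search)
                      for dx in range(-search, search)]
    def key(d):
        dy, dx = d
        ring = max(abs(dy), abs(dx))                # Chebyshev ring around the center
        ang = np.arctan2(dx, -dy) % (2 * np.pi)     # clockwise angle from "up"
        return (ring, ang)
    return sorted(disps, key=key)

index_of = {d: n for n, d in enumerate(spiral_order())}   # displacement -> index
# The DCT atoms are appended after all candidates, in zigzag order.
```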
Our experiments have also shown that the index distributions for k > 1 are very similar, so these indices can be coded using a shared distribution without introducing noticeable loss in coding efficiency. We include a special EOB symbol among the possible symbols of each distribution to indicate that no more atoms are chosen. The quantized coefficient values are likewise coded sequentially, with the k-th coefficient coded using the probability distribution of the k-th coefficient's quantization index. This strategy is motivated by the observation that the distributions of the first few coefficients are somewhat different; the distributions for k > 1, however, are very similar, and therefore the coefficients for k > 1 can be coded using the same distribution without loss in coding efficiency.
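Since all reported rates are entropy estimates from empirical symbol distributions (see Sec. 5), the bookkeeping reduces to the following sketch (names are ours; EOB is treated as just another symbol in each distribution):

```python
import numpy as np
from collections import Counter

def entropy_bits(symbols):
    """Average bits per symbol under the empirical distribution (Shannon entropy)."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# One distribution per coding position: the mean prediction error, the index of
# the k-th chosen atom, and the k-th quantized coefficient (shared for k > 1);
# the total rate is sum(len(stream) * entropy_bits(stream)) over the streams.
```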

For the results reported in this paper, we estimate the average number of bits for all symbols to be coded using the entropies derived from their corresponding probability distributions. We note that the current scheme does not exploit the redundancy in the possible patterns of successive coefficient values. Using an arithmetic coding scheme to take advantage of such redundancy, similar to the CABAC method used in H.264 and HEVC [10], is likely to improve the coding efficiency.

5. SIMULATION RESULTS

Although the proposed coding framework can accommodate both inter- and intra-prediction candidates, our preliminary simulation results have shown that intra-prediction candidates are very rarely chosen. Therefore, in all results presented here, only inter-prediction candidates are used, together with the DCT basis vectors. We implemented the proposed method with the following parameters: a block size of 16x16 (M = 256) and an inter-candidate search range of 24 with integer shifts only, which leads to 2304 candidates. Together with the 255 DCT basis vectors (excluding DC), we have a total of N = 2559 atoms. We evaluated the performance with quantization stepsizes of q = 20, 36, 50. For the results reported, we choose thresholds ε1 = 3 and ε2 = 10^-6. Generally, these thresholds should be chosen to be smaller than the expected mean-square quantization error for a given stepsize.

To get some insight into how the algorithm works, we first show some intermediate results for a sample block from a sample frame (shown in Fig. 6) of the sequence "trail pink kid" (tk) from [11]. In Fig. 2, we show the atoms chosen by the LARS algorithm to represent this block with their corresponding coefficients, and the orthonormalized vectors with their corresponding quantized coefficients. For comparison, we also show the best matching candidate and the DCT basis vectors with non-zero quantized coefficients. It is clear that for this sample block the proposed representation is more efficient: the chosen candidates resemble the block very closely. We note that there is generally a high likelihood that the first chosen candidate (the one with the largest coefficient) is the same as the best prediction candidate, as demonstrated in this case. Fig. 3 compares the distributions of the number of non-zero coefficients needed using the DCT vs. the proposed method, calculated over all blocks in the same sample frame. The average number of non-zero coefficients in this example is reduced by 32.48%.

It is well known that, for a transform coder, the coding efficiency depends on how fast the coefficient variance drops: a steeper slope leads to higher coding efficiency. Fig. 5(a) shows the variances of the DCT coefficients of the prediction error using the best prediction candidate, where for each block only the non-zero DCT coefficients are considered, ordered by decreasing coefficient magnitude. Fig. 5(b) shows the variances of the coefficients w_l corresponding to the chosen atoms, ordered by decreasing coefficient magnitude. Fig. 5(c) shows the variances of the coefficients t_k associated with the orthonormal vectors derived from the chosen atoms. We see that using the adaptive dictionary directly already makes the coefficient variance drop with a steeper slope than using the DCT.
Orthonormalization of the chosen atoms further improves the steepness significantly, which helps to improve the coding efficiency.

[Fig. 3. Distributions of the number of non-zero coefficients. Top: the proposed method, with a mean number of non-zeros of 5.4358 and reconstruction PSNR of 42.37 dB (obtained with ε1 = 3 and q = 20); Bottom: DCT on the prediction error, with a mean number of non-zeros of 8.512 and reconstruction PSNR of 41.79 dB (obtained with q = 18).]

[Fig. 4. PSNR vs. rate curves obtained using three different coders (x264 with CAVLC, x264 with CABAC and all partitions, and the proposed codec) for the test sequence.]

Finally, we show the coding performance of the proposed coder and two comparison coders for one test sequence consisting of 50 frames, with a frame size of 1280x720 and a frame rate of 30 Hz. We coded the first frame as an I-frame using the H.264 coder with a QP of 15 (reconstruction PSNR = 49.19 dB) and coded all remaining frames as P-frames using the different methods. The rate and PSNR reported are averaged over only the P-frames for all compared coders. For the proposed coder, we fixed the target sparse representation error at ε1 = 3 and varied the quantization stepsize q to obtain different rate points. We determine the probability distribution of each type of variable to be coded (e.g. the quantized block-mean prediction error, the index of the k-th chosen candidate in spiral order, and the quantized value of the k-th coefficient) from the occurrence frequencies of the symbols in all coded blocks in all frames, and estimate the average number of bits for all symbols from the entropies of these distributions. We compare the proposed coder with the following two variants of the H.264 coder: H264CAVLC refers to H.264 using CAVLC for entropy coding and only 16x16 blocks for both inter and intra prediction and for the transform, but with quarter-pel accuracy motion; H264CABAC refers to H.264 using CABAC for entropy coding and all advanced options (including variable block sizes from 4x4 to 16x16). The H.264 results are obtained using the x264 software [12]. Because the current implementation of the proposed coder uses a fixed block size of 16x16 and does not perform arithmetic coding, the relatively fair comparison is with H264CAVLC. Fig. 4 shows the PSNR vs. bit rate curves obtained by the three methods, and Fig. 6 shows decoded versions of the same sample frame produced by the three methods at similar bit rates. It is very encouraging that the proposed coder achieves significant gains over H264CAVLC, which likewise uses a fixed block size and unconditional entropy coding. Even more encouraging, the proposed coder shows significant gains even over H264CABAC with all options enabled. With variable block sizes and a more efficient entropy coding method, the proposed coder could achieve even more significant gains over H.264.

6. CONCLUSION AND OPEN RESEARCH

The superior performance of the proposed coder compared to H.264, even with many components not yet optimized, is very encouraging and testifies to the great promise of using a self-adaptive dictionary for video block representation. When using a redundant dictionary, it is critical to design an efficient quantization and coding method for the resulting sparse representation; in addition to the self-adaptive dictionary, the proposed joint orthonormalization and quantization process contributes greatly to the efficiency of the proposed coder. Note that this method of orthonormalizing the original dictionary atoms and performing quantization and coding in the orthonormalized subspace representation is applicable to any dictionary-based coding method. Although this work did not consider intra-frame coding, a similar idea applies there: the adaptive part of the dictionary can consist of shifted blocks in the previously coded areas of the same frame as well as the intra-prediction candidates produced by the various intra-prediction modes of H.264 and HEVC.

Under the proposed general framework, many components can be further optimized. The current entropy coding scheme does not exploit the redundancy in the possible patterns of successive coefficient values and atom indices.
Using an arithmetic coding scheme to take advantage of such redundancy, similar to the CABAC method in H.264 and HEVC, is likely to further improve the coding efficiency. The current coder uses a fixed block size; however, it is relatively straightforward to apply it with variable block sizes and to choose the block size using a rate-distortion optimization approach, which is expected to provide additional significant gain.

[Fig. 5. Coefficient variances in decreasing order. Top: using DCT on the prediction error; Middle: the variances of the coefficients associated with the original chosen atoms; Bottom: the variances of the coefficients associated with the orthonormal vectors.]

Another open question is, given the target reconstruction error or target bit rate, how to choose the sparse representation error threshold and the quantization stepsize; this may be formulated as a rate-distortion optimized parameter selection problem. Finally, how to design the fixed part of the dictionary is another interesting and challenging research problem.

[Fig. 6. Sample coded frames using the three comparison methods, from top to bottom: (a) the proposed coder (PSNR = 42.88), (b) x264 with 16x16 partitions and CAVLC (PSNR = 41.9), (c) x264 with all advanced options (PSNR = 42.58); all three are coded at similar bit rates.]

7. REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
[3] K. Skretting and K. Engan, "Image compression using learned dictionaries by RLS-DLA and compared with K-SVD," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
[4] J. Zepeda, C. Guillemot, and E. Kijak, "Image compression using sparse representations and the iteration-tuned and aligned dictionary," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1061-1073, 2011.
[5] P. Schmid-Saugeon and A. Zakhor, "Dictionary design for matching pursuit and application to motion-compensated video coding," IEEE Trans. on Circuits and Systems for Video Technology, vol. 14, no. 6, pp. 880-886, 2004.
[6] J.-W. Kang, C.-C. J. Kuo, R. Cohen, and A. Vetro, "Efficient dictionary based video coding with reduced side information," in IEEE International Symposium on Circuits and Systems (ISCAS), 2011, pp. 109-112.
[7] Y. Sun, M. Xu, X. Tao, and J. Lu, "Online dictionary learning based intra-frame video coding via sparse representation," in 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), 2012.
[8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004.
[9] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," Journal of Machine Learning Research, vol. 11, pp. 19-60, 2010.
[10] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620-636, July 2003.
[11] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. de Veciana, "Video quality assessment on mobile devices: Subjective, behavioral and objective studies," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652-671, October 2012.
[12] L. Aimar et al., "x264 open-source video encoder," http://www.videolan.org/developers/x264.html.